WIP: WAL prefetch (another approach)

Started by Thomas Munro about 6 years ago · 180 messages
#1Thomas Munro
thomas.munro@gmail.com
1 attachment(s)

Hello hackers,

Based on ideas from earlier discussions[1][2], here is an experimental
WIP patch to improve recovery speed by prefetching blocks. If you set
wal_prefetch_distance to a positive distance, measured in bytes, then
the recovery loop will look ahead in the WAL and call PrefetchBuffer()
for referenced blocks. This can speed things up with cold caches
(example: after a server reboot) and working sets that don't fit in
memory (example: large scale pgbench).

Results vary, but in contrived larger-than-memory pgbench crash
recovery experiments on a Linux development system, I've seen recovery
running as much as 20x faster with full_page_writes=off and
wal_prefetch_distance=8kB. FPWs reduce the potential speed-up as
discussed in the other thread.

Some notes:

* PrefetchBuffer() is only beneficial if your kernel and filesystem
have a working POSIX_FADV_WILLNEED implementation. That includes
Linux ext4 and xfs, but excludes macOS and Windows. In future we
might use asynchronous I/O to bring data all the way into our own
buffer pool; hopefully the PrefetchBuffer() interface wouldn't change
much and this code would automatically benefit.

* For now, for proof-of-concept purposes, the patch uses a second
XLogReader to read ahead in the WAL. I am thinking about how to write
a two-cursor XLogReader that reads and decodes each record just once.

* It can handle simple crash recovery and streaming replication
scenarios, but doesn't yet deal with complications like timeline
changes (the way to do that might depend on how the previous point
works out). The integration with the WAL receiver probably needs some
work; I've been testing pretty narrow cases so far, and the way I
hijacked read_local_xlog_page() probably isn't right.

* On filesystems with block size <= BLCKSZ, it's a waste of a syscall
to try to prefetch a block that we have a FPW for, but otherwise it
can avoid a later stall due to a read-before-write at pwrite() time,
so I added a second GUC wal_prefetch_fpw to make that optional.
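
To make the first note above more concrete, the hint it relies on boils
down to something like this at the system call level (a stand-alone
sketch with an invented function name, not code from the patch):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/*
 * Ask the kernel to start reading 'len' bytes at 'offset' in the
 * background, so that a later pread() of the same range hopefully finds
 * the data in the page cache already.  Note that posix_fadvise() returns
 * an error number directly rather than setting errno.
 */
static int
prefetch_hint(int fd, off_t offset, off_t len)
{
	int		rc = posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);

	if (rc != 0)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));
	return rc;
}

On filesystems that ignore the advice this is a no-op and the later read
still stalls, which is why the kernel/filesystem support mentioned above
matters.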

Earlier work, and how this patch compares:

* Sean Chittenden wrote pg_prefaulter[1], an external process that
uses worker threads to pread() referenced pages some time before
recovery does, and demonstrated very good speed-up, triggering a lot
of discussion of this topic. My WIP patch differs mainly in that it's
integrated with PostgreSQL, and it uses POSIX_FADV_WILLNEED rather
than synchronous I/O from worker threads/processes. Sean wouldn't
have liked my patch much because he was working on ZFS and that
doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS it
works pretty well, and I'll try to get that upstreamed.

* Konstantin Knizhnik proposed a dedicated PostgreSQL process that
would do approximately the same thing[2]. My WIP patch differs mainly
in that it does the prefetching work in the recovery loop itself, and
uses PrefetchBuffer() rather than FilePrefetch() directly. This
avoids a bunch of communication and complications, but admittedly does
introduce new system calls into a hot loop (for now); perhaps I could
pay for that by removing more lseek(SEEK_END) noise. It also deals
with various edge cases relating to created, dropped and truncated
relations a bit differently. It also tries to avoid generating
sequential WILLNEED advice, based on experimental evidence[3] that
that affects Linux's readahead heuristics negatively, though I don't
understand the exact mechanism there.

Here are some cases where I expect this patch to perform badly:

* Your WAL has multiple intermixed sequential access streams (ie
sequential access to N different relations), so that sequential access
is not detected, and then all the WILLNEED advice prevents Linux's
automagic readahead from working well. Perhaps that could be
mitigated by having a system that can detect up to N concurrent
streams, where N is more than the current 1, or by flagging buffers in
the WAL as part of a sequential stream. I haven't looked into this.

* The data is always found in our buffer pool, so PrefetchBuffer() is
doing nothing useful and you might as well not be calling it or doing
the extra work that leads up to that. Perhaps that could be mitigated
with an adaptive approach: too many PrefetchBuffer() hits and we stop
trying to prefetch, too many XLogReadBufferForRedo() misses and we
start trying to prefetch. That might work nicely for systems that
start out with cold caches but eventually warm up. I haven't looked
into this.

* The data is actually always in the kernel's cache, so the advice is
a waste of a syscall. That might imply that you should probably be
running with a larger shared_buffers (?). It's technically possible
to ask the operating system if a region is cached on many systems,
which could in theory be used for some kind of adaptive heuristic that
would disable pointless prefetching, but I'm not proposing that.
Ultimately this problem would be avoided by moving to true async I/O,
where we'd be initiating the read all the way into our buffers (ie it
replaces the later pread() so it's a wash, at worst).

* The prefetch distance is set too low so that pread() waits are not
avoided, or your storage subsystem can't actually perform enough
concurrent I/O to get ahead of the random access pattern you're
generating, so no distance would be far enough ahead. To help with
the former case, perhaps we could invent something smarter than a
user-supplied distance (something like "N cold block references
ahead", possibly using effective_io_concurrency, rather than "N bytes
ahead").

[1]: https://www.pgcon.org/2018/schedule/track/Case%20Studies/1204.en.html
[2]: /messages/by-id/49df9cd2-7086-02d0-3f8d-535a32d44c82@postgrespro.ru
[3]: https://github.com/macdice/some-io-tests

Attachments:

wal-prefetch-another-approach-v1.tgz (application/x-compressed)
#2Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Thomas Munro (#1)
Re: WIP: WAL prefetch (another approach)

On Thu, Jan 02, 2020 at 02:39:04AM +1300, Thomas Munro wrote:

Hello hackers,

Based on ideas from earlier discussions[1][2], here is an experimental
WIP patch to improve recovery speed by prefetching blocks. If you set
wal_prefetch_distance to a positive distance, measured in bytes, then
the recovery loop will look ahead in the WAL and call PrefetchBuffer()
for referenced blocks. This can speed things up with cold caches
(example: after a server reboot) and working sets that don't fit in
memory (example: large scale pgbench).

Thanks, I only did a very quick review so far, but the patch looks fine.

In general, I find it somewhat non-intuitive to configure prefetching by
specifying WAL distance. I mean, how would you know what's a good value?
If you know the storage hardware, you probably know the optimal queue
depth, i.e. you know the number of requests needed to get the best throughput.
But how do you deduce the WAL distance from that? I don't know.

Could we instead specify the number of blocks to prefetch? We'd probably
need to track additional details needed to determine the number of blocks to
prefetch (essentially LSN for all prefetch requests).

Another thing to consider might be skipping recently prefetched blocks.
Consider you have a loop that does DML, where each statement creates a
separate WAL record, but it can easily touch the same block over and
over (say inserting to the same page). That means the prefetches are
not really needed, but I'm not sure how expensive it really is.

Results vary, but in contrived larger-than-memory pgbench crash
recovery experiments on a Linux development system, I've seen recovery
running as much as 20x faster with full_page_writes=off and
wal_prefetch_distance=8kB. FPWs reduce the potential speed-up as
discussed in the other thread.

OK, so how did you test that? I'll do some tests with a traditional
streaming replication setup, multiple sessions on the primary (and maybe
a weaker storage system on the replica). I suppose that's another setup
that should benefit from this.

...

Earlier work, and how this patch compares:

* Sean Chittenden wrote pg_prefaulter[1], an external process that
uses worker threads to pread() referenced pages some time before
recovery does, and demonstrated very good speed-up, triggering a lot
of discussion of this topic. My WIP patch differs mainly in that it's
integrated with PostgreSQL, and it uses POSIX_FADV_WILLNEED rather
than synchronous I/O from worker threads/processes. Sean wouldn't
have liked my patch much because he was working on ZFS and that
doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS it
works pretty well, and I'll try to get that upstreamed.

How long would it take to get POSIX_FADV_WILLNEED support onto ZFS systems, if
everything goes fine? I'm not sure what the usual life-cycle is, but I
assume it may take a couple of years to get it onto most production systems.

What other common filesystems are missing support for this?

Presumably we could do what Sean's extension does, i.e. use a couple of
bgworkers, each doing simple pread() calls. Of course, that's
unnecessarily complicated on systems that have FADV_WILLNEED.

...

Here are some cases where I expect this patch to perform badly:

* Your WAL has multiple intermixed sequential access streams (ie
sequential access to N different relations), so that sequential access
is not detected, and then all the WILLNEED advice prevents Linux's
automagic readahead from working well. Perhaps that could be
mitigated by having a system that can detect up to N concurrent
streams, where N is more than the current 1, or by flagging buffers in
the WAL as part of a sequential stream. I haven't looked into this.

Hmmm, wouldn't it be enough to prefetch blocks in larger batches (not
one by one), and do some sort of sorting? That should allow readahead
to kick in.

* The data is always found in our buffer pool, so PrefetchBuffer() is
doing nothing useful and you might as well not be calling it or doing
the extra work that leads up to that. Perhaps that could be mitigated
with an adaptive approach: too many PrefetchBuffer() hits and we stop
trying to prefetch, too many XLogReadBufferForRedo() misses and we
start trying to prefetch. That might work nicely for systems that
start out with cold caches but eventually warm up. I haven't looked
into this.

I think the question is what's the cost of doing such an unnecessary
prefetch. Presumably it's fairly cheap, especially compared to the
opposite case (not prefetching a block that's not in shared buffers).
I wonder how expensive the adaptive logic would be in cases that never
need a prefetch (i.e. datasets smaller than shared_buffers).

* The data is actually always in the kernel's cache, so the advice is
a waste of a syscall. That might imply that you should probably be
running with a larger shared_buffers (?). It's technically possible
to ask the operating system if a region is cached on many systems,
which could in theory be used for some kind of adaptive heuristic that
would disable pointless prefetching, but I'm not proposing that.
Ultimately this problem would be avoided by moving to true async I/O,
where we'd be initiating the read all the way into our buffers (ie it
replaces the later pread() so it's a wash, at worst).

Makes sense.

* The prefetch distance is set too low so that pread() waits are not
avoided, or your storage subsystem can't actually perform enough
concurrent I/O to get ahead of the random access pattern you're
generating, so no distance would be far enough ahead. To help with
the former case, perhaps we could invent something smarter than a
user-supplied distance (something like "N cold block references
ahead", possibly using effective_io_concurrency, rather than "N bytes
ahead").

In general, I find it quite non-intuitive to configure prefetching by
specifying WAL distance. I mean, how would you know what's a good value?
If you know the storage hardware, you probably know the optimal queue
depth i.e. you know you the number of requests to get best throughput.

But how do you deduce the WAL distance from that? I don't know. Plus
right after a checkpoint the WAL contains FPWs, reducing the number of
blocks in a given amount of WAL (compared to right before a checkpoint).
So I expect users might pick an unnecessarily high WAL distance. OTOH with
FPWs we don't quite need aggressive prefetching, right?

Could we instead specify the number of blocks to prefetch? We'd probably
need to track additional details needed to determine the number of blocks to
prefetch (essentially LSN for all prefetch requests).

Another thing to consider might be skipping recently prefetched blocks.
Consider you have a loop that does DML, where each statement creates a
separate WAL record, but it can easily touch the same block over and
over (say inserting to the same page). That means the prefetches are
not really needed, but I'm not sure how expensive it really is.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#2)
Re: WIP: WAL prefetch (another approach)

On Fri, Jan 3, 2020 at 7:10 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Thu, Jan 02, 2020 at 02:39:04AM +1300, Thomas Munro wrote:

Based on ideas from earlier discussions[1][2], here is an experimental
WIP patch to improve recovery speed by prefetching blocks. If you set
wal_prefetch_distance to a positive distance, measured in bytes, then
the recovery loop will look ahead in the WAL and call PrefetchBuffer()
for referenced blocks. This can speed things up with cold caches
(example: after a server reboot) and working sets that don't fit in
memory (example: large scale pgbench).

Thanks, I only did a very quick review so far, but the patch looks fine.

Thanks for looking!

Results vary, but in contrived larger-than-memory pgbench crash
recovery experiments on a Linux development system, I've seen recovery
running as much as 20x faster with full_page_writes=off and
wal_prefetch_distance=8kB. FPWs reduce the potential speed-up as
discussed in the other thread.

OK, so how did you test that? I'll do some tests with a traditional
streaming replication setup, multiple sessions on the primary (and maybe
a weaker storage system on the replica). I suppose that's another setup
that should benefit from this.

Using a 4GB RAM 16 thread virtual machine running Linux debian10
4.19.0-6-amd64 with an ext4 filesystem on NVMe storage:

postgres -D pgdata \
-c full_page_writes=off \
-c checkpoint_timeout=60min \
-c max_wal_size=10GB \
-c synchronous_commit=off

# in another shell
pgbench -i -s300 postgres
psql postgres -c checkpoint
pgbench -T60 -Mprepared -c4 -j4 postgres
killall -9 postgres

# save the crashed pgdata dir for repeated experiments
mv pgdata pgdata-save

# repeat this with values like wal_prefetch_distance=-1, 1kB, 8kB, 64kB, ...
rm -fr pgdata
cp -r pgdata-save pgdata
postgres -D pgdata -c wal_prefetch_distance=-1

What I see on my desktop machine is around 10x speed-up:

wal_prefetch_distance=-1 -> 62s (same number for unpatched)
wal_prefetch_distance=8kB -> 6s
wal_prefetch_distance=64kB -> 5s

On another dev machine I managed to get a 20x speedup, using a much
longer test. It's probably more interesting to try out some more
realistic workloads rather than this cache-destroying uniform random
stuff, though. It might be interesting to test on systems with high
random read latency, but high concurrency; I can think of a bunch of
network storage environments where that's the case, but I haven't
looked into them, beyond some toy testing with (non-Linux) NFS over a
slow network (results were promising).

Earlier work, and how this patch compares:

* Sean Chittenden wrote pg_prefaulter[1], an external process that
uses worker threads to pread() referenced pages some time before
recovery does, and demonstrated very good speed-up, triggering a lot
of discussion of this topic. My WIP patch differs mainly in that it's
integrated with PostgreSQL, and it uses POSIX_FADV_WILLNEED rather
than synchronous I/O from worker threads/processes. Sean wouldn't
have liked my patch much because he was working on ZFS and that
doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS it
works pretty well, and I'll try to get that upstreamed.

How long would it take to get POSIX_FADV_WILLNEED support onto ZFS systems, if
everything goes fine? I'm not sure what the usual life-cycle is, but I
assume it may take a couple of years to get it onto most production systems.

Assuming they like it enough to commit it (and initial informal
feedback on the general concept has been positive -- it's not messing
with their code at all, it's just boilerplate code to connect the
relevant Linux and FreeBSD VFS callbacks), it could indeed be quite a
while before it appears in conservative package repos, but I don't
know; it depends on where you get your OpenZFS/ZoL module from.

What other common filesystems are missing support for this?

Using our build farm as a way to know which operating systems we care
about as a community, in no particular order:

* I don't know about exotic or network filesystems on Linux
* AIX 7.2's manual says "Valid option, but this value does not perform
any action" for every kind of advice except POSIX_FADV_NOWRITEBEHIND
(huh, nonstandard advice).
* Solaris's posix_fadvise() was a dummy libc function, as of 10 years
ago when they closed the source; who knows after that.
* FreeBSD's UFS and NFS support other advice through a default handler
but unfortunately ignore WILLNEED (I have patches for those too, not
good enough to send anywhere yet).
* OpenBSD has no such syscall
* NetBSD has the syscall, and I can see that it's hooked up to
readahead code, so that's probably the only unqualified yes in this
list
* Windows has no equivalent syscall; the closest thing might be to use
ReadFileEx() to initiate an async read into a dummy buffer; maybe you
can use a zero event so it doesn't even try to tell you when the I/O
completes, if you don't care?
* macOS has no such syscall, but you could in theory do an aio_read()
into a dummy buffer. On the other hand I don't think that interface
is a general solution for POSIX systems, because on at least Linux and
Solaris, aio_read() is emulated by libc with a whole bunch of threads
and we are allergic to those things (and even if we weren't, we
wouldn't want a whole threadpool in every PostgreSQL process, so you'd
need to hand off to a worker process, and then why bother?).
* HPUX, I don't know

We could test any of those with a simple test I wrote[1], but I'm not
likely to test any non-open-source OS myself due to lack of access.
Amazingly, HPUX's posix_fadvise() doesn't appear to conform to POSIX:
it sets errno and returns -1, while POSIX says that it should return
an error number. Checking our source tree, I see that in
pg_flush_data(), we also screwed that up and expect errno to be set,
though we got it right in FilePrefetch().

In any case, Linux must be at the very least 90% of PostgreSQL
installations. Incidentally, sync_file_range() without wait is a sort
of opposite of WILLNEED (it means something like
"POSIX_FADV_WILLSYNC"), and no one seems terribly upset that we really
only have that on Linux (the emulations are pretty poor AFAICS).

Presumably we could do what Sean's extension does, i.e. use a couple of
bgworkers, each doing simple pread() calls. Of course, that's
unnecessarily complicated on systems that have FADV_WILLNEED.

That is a good idea, and I agree. I have a patch set that does
exactly that. It's nearly independent of the WAL prefetch work; it
just changes how PrefetchBuffer() is implemented, affecting bitmap
index scans, vacuum and any future user of PrefetchBuffer. If you
apply these patches too then WAL prefetch will use it (just set
max_background_readers = 4 or whatever):

https://github.com/postgres/postgres/compare/master...macdice:bgreader

That's simplified from an abandoned patch I had lying around because I
was experimenting with prefetching all the way into shared buffers
this way. The simplified version just does pread() into a dummy
buffer, for the side effect of warming the kernel's cache, pretty much
like pg_prefaulter. There are some tricky questions around whether
it's better to wait or not when the request queue is full; the way I
have that is far too naive, and that question is probably related to
your point about being cleverer about how many prefetch blocks you
should try to have in flight. A future version of PrefetchBuffer()
might lock the buffer then tell the worker (or some kernel async I/O
facility) to write the data into the buffer. If I understand
correctly, to make that work we need Robert's IO lock/condition
variable transplant[2], and Andres's scheme for a suitable
interlocking protocol, and no doubt some bulletproof cleanup
machinery. I'm not working on any of that myself right now because I
don't want to step on Andres's toes.

Here are some cases where I expect this patch to perform badly:

* Your WAL has multiple intermixed sequential access streams (ie
sequential access to N different relations), so that sequential access
is not detected, and then all the WILLNEED advice prevents Linux's
automagic readahead from working well. Perhaps that could be
mitigated by having a system that can detect up to N concurrent
streams, where N is more than the current 1, or by flagging buffers in
the WAL as part of a sequential stream. I haven't looked into this.

Hmmm, wouldn't it be enough to prefetch blocks in larger batches (not
one by one), and do some sort of sorting? That should allow readahead
to kick in.

Yeah, but I don't want to do too much work in the startup process, or
get too opinionated about how the underlying I/O stack works. I think
we'd need to do things like that in a direct I/O future, but we'd
probably offload it (?). I figured the best approach for early work
in this space would be to just get out of the way if we detect
sequential access.

* The data is always found in our buffer pool, so PrefetchBuffer() is
doing nothing useful and you might as well not be calling it or doing
the extra work that leads up to that. Perhaps that could be mitigated
with an adaptive approach: too many PrefetchBuffer() hits and we stop
trying to prefetch, too many XLogReadBufferForRedo() misses and we
start trying to prefetch. That might work nicely for systems that
start out with cold caches but eventually warm up. I haven't looked
into this.

I think the question is what's the cost of doing such an unnecessary
prefetch. Presumably it's fairly cheap, especially compared to the
opposite case (not prefetching a block that's not in shared buffers).
I wonder how expensive the adaptive logic would be in cases that never
need a prefetch (i.e. datasets smaller than shared_buffers).

Hmm. It's basically a buffer map probe. I think the adaptive logic
would probably be some kind of periodically resetting counter scheme,
but you're probably right to suspect that it might not even be worth
bothering with, especially if a single XLogReader can be made to do
the readahead with no real extra cost. Perhaps we should work on
making the cost of all prefetching overheads as low as possible first,
before trying to figure out whether it's worth building a system for
avoiding it.
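
To make that hand-waving slightly more concrete, the counter scheme I
have in mind would be no more than something like this (purely a sketch;
the names and thresholds are invented, and only the "stop prefetching"
half is shown -- the other half would count XLogReadBufferForRedo()
misses and switch it back on):

/* Hypothetical periodically-resetting counter scheme. */
typedef struct
{
	int			probes;		/* prefetch attempts in the current window */
	int			hits;		/* ... that found the block in shared buffers */
	bool		enabled;	/* are we currently issuing advice? */
} PrefetchAdaptiveState;

static void
maybe_disable_prefetch(PrefetchAdaptiveState *state)
{
	if (state->probes < 1000)
		return;					/* window not finished yet */

	/* If ~90% of the probes were hits, the advice is mostly wasted work. */
	if (state->hits * 10 >= state->probes * 9)
		state->enabled = false;

	/* Reset the window so a later decision can flip it back. */
	state->probes = 0;
	state->hits = 0;
}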

* The prefetch distance is set too low so that pread() waits are not
avoided, or your storage subsystem can't actually perform enough
concurrent I/O to get ahead of the random access pattern you're
generating, so no distance would be far enough ahead. To help with
the former case, perhaps we could invent something smarter than a
user-supplied distance (something like "N cold block references
ahead", possibly using effective_io_concurrency, rather than "N bytes
ahead").

In general, I find it quite non-intuitive to configure prefetching by
specifying WAL distance. I mean, how would you know what's a good value?
If you know the storage hardware, you probably know the optimal queue
depth, i.e. you know the number of requests needed to get the best throughput.

FWIW, on pgbench tests on flash storage I've found that 1KB only helps
a bit, 8KB is great, and more than that doesn't get any better. Of
course, this is meaningless in general; a zipfian workload might need
to look a lot further ahead than a uniform one to find anything worth
prefetching, and that's exactly what you're complaining about, and I
agree.

But how do you deduce the WAL distance from that? I don't know. Plus
right after a checkpoint the WAL contains FPWs, reducing the number of
blocks in a given amount of WAL (compared to right before a checkpoint).
So I expect users might pick an unnecessarily high WAL distance. OTOH with
FPWs we don't quite need aggressive prefetching, right?

Yeah, so you need to be touching blocks more than once between
checkpoints, if you want to see speed-up on a system with filesystem blocks <=
BLCKSZ and FPW on. If checkpoints are far enough apart you'll
eventually run out of FPWs and start replaying non-FPW stuff. Or you
could be on a filesystem with larger blocks than PostgreSQL.

Could we instead specify the number of blocks to prefetch? We'd probably
need to track additional details needed to determine the number of blocks to
prefetch (essentially LSN for all prefetch requests).

Yeah, I think you're right, we should probably try to make a little
queue to track LSNs and count prefetch requests in and out. I think
you'd also want PrefetchBuffer() to tell you if the block was already
in the buffer pool, so that you don't count blocks that it decided not
to prefetch. I guess PrefetchBuffer() needs to return an enum (I
already had it returning a bool for another purpose relating to an
edge case in crash recovery, when relations have been dropped by a
later WAL record). I will think about that.
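
Roughly the shape I'm imagining, as a sketch only (the queue size
constant is invented; the real thing would be sized from
effective_io_concurrency and live in the prefetcher's state):

/*
 * A small ring of LSNs, one per prefetch believed to be in flight.  A
 * request is counted "in" with the LSN of the record that referenced the
 * block, and counted "out" once replay has passed that LSN, on the
 * theory that by then the read must have completed (or we waited for it
 * anyway).
 */
#define PREFETCH_QUEUE_SIZE 64	/* hypothetical */

typedef struct
{
	XLogRecPtr	lsns[PREFETCH_QUEUE_SIZE];
	int			head;
	int			tail;
	int			inflight;
} PrefetchQueue;

static bool
prefetch_queue_full(PrefetchQueue *q)
{
	return q->inflight >= effective_io_concurrency;
}

static void
prefetch_queue_push(PrefetchQueue *q, XLogRecPtr lsn)
{
	q->lsns[q->head] = lsn;
	q->head = (q->head + 1) % PREFETCH_QUEUE_SIZE;
	q->inflight++;
}

/* Retire entries whose referencing records have already been replayed. */
static void
prefetch_queue_drain(PrefetchQueue *q, XLogRecPtr replayed_upto)
{
	while (q->inflight > 0 && q->lsns[q->tail] <= replayed_upto)
	{
		q->tail = (q->tail + 1) % PREFETCH_QUEUE_SIZE;
		q->inflight--;
	}
}

Blocks that PrefetchBuffer() declined to prefetch (already in the buffer
pool, or in a relation that no longer exists) just wouldn't be pushed,
which is where the enum return value comes in.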

Another thing to consider might be skipping recently prefetched blocks.
Consider you have a loop that does DML, where each statement creates a
separate WAL record, but it can easily touch the same block over and
over (say inserting to the same page). That means the prefetches are
not really needed, but I'm not sure how expensive it really is.

There are two levels of defence against repeatedly prefetching the
same block: PrefetchBuffer() checks for blocks that are already in our
cache, and before that, PrefetchState remembers the last block so that
we can avoid fetching that block (or the following block).
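
In code terms, that last-block check is roughly this (a sketch; field
and function names are approximate):

/*
 * Skip advice for the block we just issued advice for, and for the block
 * after it, so that repeated references to one hot page and simple
 * sequential streams don't generate a stream of redundant WILLNEED calls.
 */
static bool
recently_prefetched(PrefetchState *ps, RelFileNode rnode, BlockNumber blkno)
{
	bool		skip;

	skip = RelFileNodeEquals(rnode, ps->last_rnode) &&
		(blkno == ps->last_blkno ||
		 blkno == ps->last_blkno + 1);

	/*
	 * Remember this reference either way, so that a sequential run only
	 * produces advice for its first block.
	 */
	ps->last_rnode = rnode;
	ps->last_blkno = blkno;

	return skip;
}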

[1]: https://github.com/macdice/some-io-tests
[2]: /messages/by-id/CA+Tgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr=C56Xng@mail.gmail.com

#4Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#3)
5 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Fri, Jan 3, 2020 at 5:57 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Jan 3, 2020 at 7:10 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Could we instead specify the number of blocks to prefetch? We'd probably
need to track additional details needed to determine number of blocks to
prefetch (essentially LSN for all prefetch requests).

Here is a new WIP version of the patch set that does that. Changes:

1. It now uses effective_io_concurrency to control how many
concurrent prefetches to allow. It's possible that we should have a
different GUC to control "maintenance" users of I/O concurrency, as
discussed elsewhere[1], but I'm staying out of that for now; if we
agree to do that for VACUUM etc, we can change it easily here. Note
that the value is percolated through the ComputeIoConcurrency()
function which I think we should discuss, but again that's off topic,
I just want to use the standard infrastructure here.

2. You can now change the relevant GUCs (wal_prefetch_distance,
wal_prefetch_fpw, effective_io_concurrency) at runtime and reload for
them to take immediate effect. For example, you can enable the
feature on a running replica by setting wal_prefetch_distance=8kB
(from the default of -1, which means off), and something like
effective_io_concurrency=10, and telling the postmaster to reload.

3. The new code is moved out to a new file
src/backend/access/transam/xlogprefetcher.c, to minimise new bloat in
the mighty xlog.c file. Functions were renamed to make their purpose
clearer, and a lot of comments were added.

4. The WAL receiver now exposes the current 'write' position via an
atomic value in shared memory, so we don't need to hammer the WAL
receiver's spinlock.

5. There is some rudimentary user documentation of the GUCs.

[1]: /messages/by-id/13619.1557935593@sss.pgh.pa.us

Attachments:

0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRela-v2.patch (application/octet-stream)
From 34a5bcab7eb4a2ac64f0fe9a533cacba0e7481b4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 3 Dec 2019 17:13:40 +1300
Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.

Previously a Relation was required, but it's annoying to have
to create a "fake" one in recovery.
---
 src/backend/storage/buffer/bufmgr.c | 77 ++++++++++++++++-------------
 src/include/storage/bufmgr.h        |  3 ++
 2 files changed, 47 insertions(+), 33 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5880054245..6e0875022c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -519,6 +519,48 @@ ComputeIoConcurrency(int io_concurrency, double *target)
 	return (new_prefetch_pages >= 0.0 && new_prefetch_pages < (double) INT_MAX);
 }
 
+void
+SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum, BlockNumber blockNum)
+{
+#ifdef USE_PREFETCH
+	BufferTag	newTag;		/* identity of requested block */
+	uint32		newHash;	/* hash value for newTag */
+	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
+	int			buf_id;
+
+	Assert(BlockNumberIsValid(blockNum));
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(newTag, smgr_reln->smgr_rnode.node,
+				   forkNum, blockNum);
+
+	/* determine its hash code and partition lock ID */
+	newHash = BufTableHashCode(&newTag);
+	newPartitionLock = BufMappingPartitionLock(newHash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(newPartitionLock, LW_SHARED);
+	buf_id = BufTableLookup(&newTag, newHash);
+	LWLockRelease(newPartitionLock);
+
+	/* If not in buffers, initiate prefetch */
+	if (buf_id < 0)
+		smgrprefetch(smgr_reln, forkNum, blockNum);
+
+	/*
+	 * If the block *is* in buffers, we do nothing.  This is not really ideal:
+	 * the block might be just about to be evicted, which would be stupid
+	 * since we know we are going to need it soon.  But the only easy answer
+	 * is to bump the usage_count, which does not seem like a great solution:
+	 * when the caller does ultimately touch the block, usage_count would get
+	 * bumped again, resulting in too much favoritism for blocks that are
+	 * involved in a prefetch sequence. A real fix would involve some
+	 * additional per-buffer state, and it's not clear that there's enough of
+	 * a problem to justify that.
+	 */
+#endif
+}
+
 /*
  * PrefetchBuffer -- initiate asynchronous read of a block of a relation
  *
@@ -550,39 +592,8 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 	}
 	else
 	{
-		BufferTag	newTag;		/* identity of requested block */
-		uint32		newHash;	/* hash value for newTag */
-		LWLock	   *newPartitionLock;	/* buffer partition lock for it */
-		int			buf_id;
-
-		/* create a tag so we can lookup the buffer */
-		INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
-					   forkNum, blockNum);
-
-		/* determine its hash code and partition lock ID */
-		newHash = BufTableHashCode(&newTag);
-		newPartitionLock = BufMappingPartitionLock(newHash);
-
-		/* see if the block is in the buffer pool already */
-		LWLockAcquire(newPartitionLock, LW_SHARED);
-		buf_id = BufTableLookup(&newTag, newHash);
-		LWLockRelease(newPartitionLock);
-
-		/* If not in buffers, initiate prefetch */
-		if (buf_id < 0)
-			smgrprefetch(reln->rd_smgr, forkNum, blockNum);
-
-		/*
-		 * If the block *is* in buffers, we do nothing.  This is not really
-		 * ideal: the block might be just about to be evicted, which would be
-		 * stupid since we know we are going to need it soon.  But the only
-		 * easy answer is to bump the usage_count, which does not seem like a
-		 * great solution: when the caller does ultimately touch the block,
-		 * usage_count would get bumped again, resulting in too much
-		 * favoritism for blocks that are involved in a prefetch sequence. A
-		 * real fix would involve some additional per-buffer state, and it's
-		 * not clear that there's enough of a problem to justify that.
-		 */
+		/* pass it to the shared buffer version */
+		SharedPrefetchBuffer(reln->rd_smgr, forkNum, blockNum);
 	}
 #endif							/* USE_PREFETCH */
 }
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 73c7e9ba38..89a47afec1 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -18,6 +18,7 @@
 #include "storage/buf.h"
 #include "storage/bufpage.h"
 #include "storage/relfilenode.h"
+#include "storage/smgr.h"
 #include "utils/relcache.h"
 #include "utils/snapmgr.h"
 
@@ -162,6 +163,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
  * prototypes for functions in bufmgr.c
  */
 extern bool ComputeIoConcurrency(int io_concurrency, double *target);
+extern void SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum,
+								 BlockNumber blockNum);
 extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
 						   BlockNumber blockNum);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
-- 
2.23.0

0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecP-v2.patch (application/octet-stream)
From 794f6c7d9f8e0b3b3e97aad1ce13d275be25bb4c Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:10:17 +1300
Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().

The new name better reflects the fact that the value it returns
is updated only when received data has been flushed to disk.

An upcoming patch will make use of the latest data that was
written without waiting for it to be flushed, so use more
precise function names.
---
 src/backend/access/transam/xlog.c          | 4 ++--
 src/backend/access/transam/xlogfuncs.c     | 2 +-
 src/backend/replication/walreceiverfuncs.c | 4 ++--
 src/backend/replication/walsender.c        | 2 +-
 src/include/replication/walreceiver.h      | 2 +-
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3813eadfb4..0c389e9315 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9261,7 +9261,7 @@ CreateRestartPoint(int flags)
 	 * Retreat _logSegNo using the current end of xlog replayed or received,
 	 * whichever is later.
 	 */
-	receivePtr = GetWalRcvWriteRecPtr(NULL, NULL);
+	receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
 	replayPtr = GetXLogReplayRecPtr(&replayTLI);
 	endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
 	KeepLogSeg(endptr, &_logSegNo);
@@ -12082,7 +12082,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					{
 						XLogRecPtr	latestChunkStart;
 
-						receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI);
+						receivedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
 						if (RecPtr < receivedUpto && receiveTLI == curFileTLI)
 						{
 							havedata = true;
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 20316539b6..e075c1c71b 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -398,7 +398,7 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
 {
 	XLogRecPtr	recptr;
 
-	recptr = GetWalRcvWriteRecPtr(NULL, NULL);
+	recptr = GetWalRcvFlushRecPtr(NULL, NULL);
 
 	if (recptr == 0)
 		PG_RETURN_NULL();
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 89c903e45a..9bce63b534 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -286,7 +286,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 }
 
 /*
- * Returns the last+1 byte position that walreceiver has written.
+ * Returns the last+1 byte position that walreceiver has flushed.
  *
  * Optionally, returns the previous chunk start, that is the first byte
  * written in the most recent walreceiver flush cycle.  Callers not
@@ -294,7 +294,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
  * receiveTLI.
  */
 XLogRecPtr
-GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 {
 	WalRcvData *walrcv = WalRcv;
 	XLogRecPtr	recptr;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index abb533b9d0..1079b3f8cb 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2903,7 +2903,7 @@ GetStandbyFlushRecPtr(void)
 	 * has streamed, but hasn't been replayed yet.
 	 */
 
-	receivePtr = GetWalRcvWriteRecPtr(NULL, &receiveTLI);
+	receivePtr = GetWalRcvFlushRecPtr(NULL, &receiveTLI);
 	replayPtr = GetXLogReplayRecPtr(&replayTLI);
 
 	ThisTimeLineID = replayTLI;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6548..147b374a26 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -322,7 +322,7 @@ extern bool WalRcvStreaming(void);
 extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname);
-extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.23.0

0003-Add-WalRcvGetWriteRecPtr-new-definition-v2.patch (application/octet-stream)
From 165c9a9c5ecf300c2be1b79e2e480807416b2fae Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).

A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk.  To do that,
it needs to be able to see the write pointer advancing in shared
memory.

The function formerly bearing this name was recently renamed to
GetWalRcvFlushRecPtr(), which better describes what it does.
---
 src/backend/replication/walreceiver.c      |  5 +++++
 src/backend/replication/walreceiverfuncs.c | 10 ++++++++++
 src/include/replication/walreceiver.h      |  9 +++++++++
 3 files changed, 24 insertions(+)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2ab15c3cbb..88a51ba35f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -244,6 +244,8 @@ WalReceiverMain(void)
 
 	SpinLockRelease(&walrcv->mutex);
 
+	pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
+
 	/* Arrange to clean up at walreceiver exit */
 	on_shmem_exit(WalRcvDie, 0);
 
@@ -985,6 +987,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 		LogstreamResult.Write = recptr;
 	}
+
+	/* Update shared-memory status */
+	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 }
 
 /*
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 9bce63b534..14e9a6245a 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -310,6 +310,16 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 	return recptr;
 }
 
+/*
+ * Returns the last+1 byte position that walreceiver has written.
+ * This returns a recently written value without taking a lock.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtr(void)
+{
+	return pg_atomic_read_u64(&WalRcv->writtenUpto);
+}
+
 /*
  * Returns the replication apply delay in ms or -1
  * if the apply delay info is not available
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 147b374a26..1e8f304dc4 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -16,6 +16,7 @@
 #include "access/xlogdefs.h"
 #include "getaddrinfo.h"		/* for NI_MAXHOST */
 #include "pgtime.h"
+#include "port/atomics.h"
 #include "replication/logicalproto.h"
 #include "replication/walsender.h"
 #include "storage/latch.h"
@@ -83,6 +84,13 @@ typedef struct
 	XLogRecPtr	receivedUpto;
 	TimeLineID	receivedTLI;
 
+	/*
+	 * Same as above, but advanced after writing and before flushing, without
+	 * the need to acquire the spin lock.  Data can be read by another process
+	 * up to this point, but shouldn't be used for data integrity purposes.
+	 */
+	pg_atomic_uint64 writtenUpto;
+
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
 	 * of received WAL.  It's actually the same as the previous value of
@@ -323,6 +331,7 @@ extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname);
 extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(void);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.23.0

0004-Allow-PrefetchBuffer-to-report-missing-file-in-re-v2.patch (application/octet-stream)
From 9d7368f3328fdbe15d2078f44c8e4578bb90b84c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 30 Dec 2019 16:43:50 +1300
Subject: [PATCH 4/5] Allow PrefetchBuffer() to report missing file in
 recovery.

Normally, smgrread() in recovery would create any missing files,
on the assumption that a later WAL record must unlink it.  In
order to support prefetching buffers during recovery, we must
also handle missing files there.  To give the caller the
opportunity to do that, return false to indicate that the
underlying file doesn't exist.

Also report whether a prefetch was actually initiated, so that
callers can limit the number of concurrent IOs they issue without
counting the prefetch calls that did nothing.
---
 src/backend/storage/buffer/bufmgr.c |  9 +++++++--
 src/backend/storage/smgr/md.c       |  9 +++++++--
 src/backend/storage/smgr/smgr.c     |  9 ++++++---
 src/include/storage/bufmgr.h        | 12 ++++++++++--
 src/include/storage/md.h            |  2 +-
 src/include/storage/smgr.h          |  2 +-
 6 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6e0875022c..5dbbcf8111 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -519,7 +519,7 @@ ComputeIoConcurrency(int io_concurrency, double *target)
 	return (new_prefetch_pages >= 0.0 && new_prefetch_pages < (double) INT_MAX);
 }
 
-void
+PrefetchBufferResult
 SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum, BlockNumber blockNum)
 {
 #ifdef USE_PREFETCH
@@ -545,7 +545,11 @@ SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum, BlockNumber blo
 
 	/* If not in buffers, initiate prefetch */
 	if (buf_id < 0)
-		smgrprefetch(smgr_reln, forkNum, blockNum);
+	{
+		if (!smgrprefetch(smgr_reln, forkNum, blockNum))
+			return PREFETCH_BUFFER_NOREL;
+		return PREFETCH_BUFFER_MISS;
+	}
 
 	/*
 	 * If the block *is* in buffers, we do nothing.  This is not really ideal:
@@ -559,6 +563,7 @@ SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum, BlockNumber blo
 	 * a problem to justify that.
 	 */
 #endif
+	return PREFETCH_BUFFER_HIT;
 }
 
 /*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index c5b771c531..ba12fc2077 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -525,14 +525,17 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 /*
  *	mdprefetch() -- Initiate asynchronous read of the specified block of a relation
  */
-void
+bool
 mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 {
 #ifdef USE_PREFETCH
 	off_t		seekpos;
 	MdfdVec    *v;
 
-	v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+	v = _mdfd_getseg(reln, forknum, blocknum, false,
+					 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
+	if (v == NULL)
+		return false;
 
 	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
@@ -540,6 +543,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 
 	(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
 #endif							/* USE_PREFETCH */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..f6c8a37290 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -49,7 +49,7 @@ typedef struct f_smgr
 								bool isRedo);
 	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
 								BlockNumber blocknum, char *buffer, bool skipFsync);
-	void		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
+	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber blocknum);
 	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
 							  BlockNumber blocknum, char *buffer);
@@ -489,11 +489,14 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 /*
  *	smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
+ *
+ *		In recovery only, this can return false to indicate that a file
+ *		doesn't	exist (presumably it has been dropped by a later commit).
  */
-void
+bool
 smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 {
-	smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
+	return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
 }
 
 /*
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 89a47afec1..5d7a796ba0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -159,12 +159,20 @@ extern PGDLLIMPORT int32 *LocalRefCount;
  */
 #define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
 
+typedef enum PrefetchBufferResult
+{
+	PREFETCH_BUFFER_HIT,
+	PREFETCH_BUFFER_MISS,
+	PREFETCH_BUFFER_NOREL
+} PrefetchBufferResult;
+
 /*
  * prototypes for functions in bufmgr.c
  */
 extern bool ComputeIoConcurrency(int io_concurrency, double *target);
-extern void SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum,
-								 BlockNumber blockNum);
+extern PrefetchBufferResult SharedPrefetchBuffer(SMgrRelation smgr_reln,
+												 ForkNumber forkNum,
+												 BlockNumber blockNum);
 extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
 						   BlockNumber blockNum);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index ec7630ce3b..07fd1bb7d0 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -28,7 +28,7 @@ extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
 extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
 extern void mdextend(SMgrRelation reln, ForkNumber forknum,
 					 BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum);
 extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 				   char *buffer);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 243822137c..dc740443e2 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,7 +92,7 @@ extern void smgrdounlink(SMgrRelation reln, bool isRedo);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum);
 extern void smgrread(SMgrRelation reln, ForkNumber forknum,
 					 BlockNumber blocknum, char *buffer);
-- 
2.23.0

0005-Prefetch-referenced-blocks-during-recovery-v2.patch (application/octet-stream)
From 545ddb9055dfff3eff520d5fc854a8f4abfdf029 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 12 Feb 2020 18:17:24 +1300
Subject: [PATCH 5/5] Prefetch referenced blocks during recovery.

Introduce a new GUC wal_prefetch_distance.  If it is set to a positive
number of bytes, then read ahead in the WAL at most that distance and
initiate asynchronous reading of referenced blocks, in the hope of
avoiding I/O stalls.

The number of concurrent asynchronous reads is limited by both
effective_io_concurrency and wal_prefetch_distance.

Author: Thomas Munro
Reviewed-by: Tomas Vondra
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                    |  38 ++
 src/backend/access/transam/Makefile         |   1 +
 src/backend/access/transam/xlog.c           |  65 +++
 src/backend/access/transam/xlogprefetcher.c | 456 ++++++++++++++++++++
 src/backend/access/transam/xlogutils.c      |  23 +-
 src/backend/replication/logical/logical.c   |   2 +-
 src/backend/utils/misc/guc.c                |  25 ++
 src/include/access/xlog.h                   |   4 +
 src/include/access/xlogprefetcher.h         |  25 ++
 src/include/access/xlogutils.h              |  20 +
 src/include/storage/bufmgr.h                |   5 +
 src/include/utils/guc.h                     |   2 +
 12 files changed, 664 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetcher.c
 create mode 100644 src/include/access/xlogprefetcher.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f89ec..415b0793e1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3082,6 +3082,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-prefetch-distance" xreflabel="wal_prefetch_distance">
+      <term><varname>wal_prefetch_distance</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_prefetch_distance</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        The maximum distance to look ahead in the WAL during recovery, to find
+        blocks to prefetch.  Prefetching blocks that will soon be needed can
+        reduce I/O wait times.  The number of concurrent prefetches is limited
+        by this setting as well as <xref linkend="guc-effective-io-concurrency"/>.
+        If this value is specified without units, it is taken as bytes.
+        The default is -1, meaning that WAL prefetching is disabled.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw">
+      <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks with full page images during recovery.
+        Usually this doesn't help, since such blocks will not be read.  However,
+        on file systems with a block size larger than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a costly
+        read-before-write when blocks are later written.
+        This setting has no effect unless
+        <xref linkend="guc-wal-prefetch-distance"/> is set to a positive number.
+        The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..20e044c7c8 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetcher.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0c389e9315..0f27a4da54 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -34,11 +34,13 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
+#include "catalog/storage_xlog.h"
 #include "commands/tablespace.h"
 #include "common/controldata_utils.h"
 #include "miscadmin.h"
@@ -104,6 +106,8 @@ int			wal_level = WAL_LEVEL_MINIMAL;
 int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
+int			wal_prefetch_distance = -1;
+bool		wal_prefetch_fpw = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -801,6 +805,7 @@ static XLogSource readSource = 0;	/* XLOG_FROM_* code */
  */
 static XLogSource currentSource = 0;	/* XLOG_FROM_* code */
 static bool lastSourceFailed = false;
+static bool reset_wal_prefetcher = false;
 
 typedef struct XLogPageReadPrivate
 {
@@ -6191,6 +6196,7 @@ CheckRequiredParameterValues(void)
 	}
 }
 
+
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
@@ -7046,6 +7052,7 @@ StartupXLOG(void)
 		{
 			ErrorContextCallback errcallback;
 			TimestampTz xtime;
+			XLogPrefetcher *prefetcher = NULL;
 
 			InRedo = true;
 
@@ -7053,6 +7060,9 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
 
+			/* the first time through, see if we need to enable prefetching */
+			ResetWalPrefetcher();
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7082,6 +7092,31 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/*
+				 * The first time through, or if any relevant settings or the
+				 * WAL source changes, we'll restart the prefetching machinery
+				 * as appropriate.  This is simpler than trying to handle
+				 * various complicated state changes.
+				 */
+				if (unlikely(reset_wal_prefetcher))
+				{
+					/* If we had one already, destroy it. */
+					if (prefetcher)
+					{
+						XLogPrefetcherFree(prefetcher);
+						prefetcher = NULL;
+					}
+					/* If we want one, create it. */
+					if (wal_prefetch_distance > 0)
+							prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr,
+																currentSource == XLOG_FROM_STREAM);
+					reset_wal_prefetcher = false;
+				}
+
+				/* Peform WAL prefetching, if enabled. */
+				if (prefetcher)
+					XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr);
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -7269,6 +7304,8 @@ StartupXLOG(void)
 			/*
 			 * end of main redo apply loop
 			 */
+			if (prefetcher)
+				XLogPrefetcherFree(prefetcher);
 
 			if (reachedRecoveryTarget)
 			{
@@ -10128,6 +10165,24 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 	}
 }
 
+void
+assign_wal_prefetch_distance(int new_value, void *extra)
+{
+	/* Reset the WAL prefetcher, because a setting it depends on changed. */
+	wal_prefetch_distance = new_value;
+	if (AmStartupProcess())
+		ResetWalPrefetcher();
+}
+
+void
+assign_wal_prefetch_fpw(bool new_value, void *extra)
+{
+	/* Reset the WAL prefetcher, because a setting it depends on changed. */
+	wal_prefetch_fpw = new_value;
+	if (AmStartupProcess())
+		ResetWalPrefetcher();
+}
+
 
 /*
  * Issue appropriate kind of fsync (if any) for an XLOG output file.
@@ -11911,6 +11966,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * and move on to the next state.
 					 */
 					currentSource = XLOG_FROM_STREAM;
+					ResetWalPrefetcher();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -12334,3 +12390,12 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * Schedule a WAL prefetcher reset, on change of relevant settings.
+ */
+void
+ResetWalPrefetcher(void)
+{
+	reset_wal_prefetcher = true;
+}
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
new file mode 100644
index 0000000000..6b565dc313
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -0,0 +1,456 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ *		Prefetching support for PostgreSQL write-ahead log manager
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetcher.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetcher.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/hsearch.h"
+
+/*
+ * Internal state used for book-keeping.
+ */
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	XLogReadLocalOptions options;
+	bool			have_record;
+	bool			shutdown;
+	int				next_block_id;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB		   *filter_table;
+	dlist_head		filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	XLogRecPtr	   *prefetch_queue;
+	int				prefetch_queue_size;
+	int				prefetch_head;
+	int				prefetch_tail;
+
+	/* Details of last prefetched block. */
+	SMgrRelation	last_reln;
+	RelFileNode		last_rnode;
+	BlockNumber		last_blkno;
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode		rnode;
+	XLogRecPtr		filter_until_replayed;
+	BlockNumber		filter_from_block;
+	dlist_node		link;
+} XLogPrefetcherFilter;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
+{
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+	XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher));
+
+	prefetcher->options.nowait = true;
+	if (streaming)
+	{
+		/*
+		 * We're only allowed to read as far as the WAL receiver has written.
+		 * We don't have to wait for it to be flushed, though, as recovery
+		 * does, so that gives us a chance to get a bit further ahead.
+		 */
+		prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+	}
+	else
+	{
+		/* We're allowed to read as far as we can. */
+		prefetcher->options.read_upto_policy = XLRO_LSN;
+		prefetcher->options.lsn = (XLogRecPtr) -1;
+	}
+	prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+											NULL,
+											read_local_xlog_page,
+											&prefetcher->options);
+	prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	/*
+	 * The size of the queue is determined by target_prefetch_pages, which is
+	 * derived from effective_io_concurrency.  In theory we might have a
+	 * separate queue for each tablespace, but it's not clear how that should
+	 * work, so for now we'll just use the system-wide GUC to rate-limit all
+	 * prefetching.
+	 */
+	prefetcher->prefetch_queue_size = target_prefetch_pages;
+	prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);
+	prefetcher->prefetch_head = prefetcher->prefetch_tail = 0;
+
+	/* Prepare to read at the given LSN. */
+	XLogBeginRead(prefetcher->reader, lsn);
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	XLogReaderFree(prefetcher->reader);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher->prefetch_queue);
+	pfree(prefetcher);
+}
+
+/*
+ * Read ahead in the WAL, as far as we can within the limits set by the user.
+ * Begin fetching any referenced blocks that are not already in the buffer
+ * pool.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	/*
+	 * If an error has occurred or we've hit the end of the WAL or a timeline
+	 * change, do nothing.  Eventually we might be restarted by the recovery
+	 * loop deciding to reset us due to a new timeline or a GUC change.
+	 */
+	if (prefetcher->shutdown)
+		return;
+
+	/*
+	 * Have any in-flight prefetches definitely completed, judging by the LSN
+	 * that is currently being replayed?
+	 */
+	XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+	/*
+	 * Do we already have the maximum permitted number of IOs running
+	 * (according to the information we have)?  If so, we have to wait for at
+	 * least one to complete, so give up early.
+	 */
+	if (XLogPrefetcherSaturated(prefetcher))
+		return;
+
+	/* Can we drop any filters yet, due to problem records being replayed? */
+	XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+	/* Main prefetch loop. */
+	for (;;)
+	{
+		XLogReaderState *reader = prefetcher->reader;
+		char *error;
+
+		/* If we don't already have a record, then try to read one. */
+		if (!prefetcher->have_record)
+		{
+			if (!XLogReadRecord(reader, &error))
+			{
+				/* If we got an error, log it and give up. */
+				if (error)
+				{
+					elog(LOG, "WAL prefetch: %s", error);
+					prefetcher->shutdown = true;
+				}
+				/* Otherwise, we'll try again later when more data is here. */
+				return;
+			}
+			prefetcher->have_record = true;
+			prefetcher->next_block_id = 0;
+		}
+
+		/* Are we too far ahead of replay? */
+		if (prefetcher->reader->ReadRecPtr >= replaying_lsn + wal_prefetch_distance)
+			break;
+
+		/*
+		 * If this is a record that creates a new SMGR relation, we'll avoid
+		 * prefetching anything from that rnode until it has been replayed.
+		 */
+		if (replaying_lsn < reader->ReadRecPtr &&
+			XLogRecGetRmid(reader) == RM_SMGR_ID &&
+			(XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+		{
+			xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+			XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+									reader->ReadRecPtr);
+		}
+
+		/*
+		 * Scan the record for block references.  We might already have been
+		 * partway through processing this record when we hit maximum I/O
+		 * concurrency, so start where we left off.
+		 */
+		for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i)
+		{
+			DecodedBkpBlock *block = &reader->blocks[i];
+			SMgrRelation reln;
+
+			/* Ignore everything but the main fork for now. */
+			if (block->forknum != MAIN_FORKNUM)
+				continue;
+
+			/*
+			 * If there is a full page image attached, we won't be reading the
+			 * page, so you might think we should skip it.  However, if the
+			 * underlying filesystem uses larger logical blocks than us, it
+			 * might still need to perform a read-before-write some time later.
+			 * Therefore, only prefetch if configured to do so.
+			 */
+			if (block->has_image && !wal_prefetch_fpw)
+				continue;
+
+			/*
+			 * If this block will initialize a new page then it's probably an
+			 * extension.  Since it might create a new segment, we can't try
+			 * to prefetch this block until the record has been replayed, or we
+			 * might try to open a file that doesn't exist yet.
+			 */
+			if (block->flags & BKPBLOCK_WILL_INIT)
+			{
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+										reader->ReadRecPtr);
+				continue;
+			}
+
+			/* Should we skip this block due to a filter? */
+			if (XLogPrefetcherIsFiltered(prefetcher, block->rnode,
+										 block->blkno))
+				continue;
+
+			/* Fast path for repeated references to the same relation. */
+			if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+			{
+				/*
+				 * If this is a repeat or sequential access, then skip it.  We
+				 * expect the kernel to detect sequential access on its own
+				 * and do a better job than we could.
+				 */
+				if (block->blkno == prefetcher->last_blkno ||
+					block->blkno == prefetcher->last_blkno + 1)
+				{
+					prefetcher->last_blkno = block->blkno;
+					continue;
+				}
+
+				/* We can avoid calling smgropen(). */
+				reln = prefetcher->last_reln;
+			}
+			else
+			{
+				/* Otherwise we have to open it. */
+				reln = smgropen(block->rnode, InvalidBackendId);
+				prefetcher->last_rnode = block->rnode;
+				prefetcher->last_reln = reln;
+			}
+			prefetcher->last_blkno = block->blkno;
+
+			/* Try to prefetch this block! */
+			switch (SharedPrefetchBuffer(reln, block->forknum, block->blkno))
+			{
+			case PREFETCH_BUFFER_HIT:
+				/* It's already cached, so do nothing. */
+				break;
+			case PREFETCH_BUFFER_MISS:
+				/*
+				 * I/O has possibly been initiated (we can't tell whether the
+				 * kernel already had the block cached, so for want of better
+				 * information we assume an I/O was started).  Record
+				 * this as an I/O in progress until eventually we replay this
+				 * LSN.
+				 */
+				XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+				/*
+				 * If the queue is now full, we'll have to wait before
+				 * processing any more blocks from this record.
+				 */
+				if (XLogPrefetcherSaturated(prefetcher))
+				{
+					prefetcher->next_block_id = i + 1;
+					return;
+				}
+				break;
+			case PREFETCH_BUFFER_NOREL:
+				/*
+				 * The underlying segment file doesn't exist.  Presumably it
+				 * will be unlinked by a later WAL record.  When recovery
+				 * reads this block, it will use the EXTENSION_CREATE_RECOVERY
+				 * flag.  We certainly don't want to do that sort of thing
+				 * while merely prefetching, so let's just ignore references
+				 * to this relation until this record is replayed, and let
+				 * recovery create the dummy file or complain if something is
+				 * wrong.
+				 */
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+										reader->ReadRecPtr);
+				break;
+			}
+		}
+
+		/* Advance to the next record. */
+		prefetcher->have_record = false;
+	}
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the (presumably lower) block
+		 * number there because we don't want to have to track individual
+		 * blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of the
+	 * time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+												   HASH_FIND, NULL);
+
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Insert an LSN into the queue.  The queue must not be full already.  This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * IO, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+						  XLogRecPtr prefetching_lsn)
+{
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+	prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+	prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet?  That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches.  For now
+ * we don't have any awareness of when IO really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+		   prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+	{
+		prefetcher->prefetch_tail++;
+		prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+	}
+}
+
+/*
+ * Check if the maximum allowed number of IOs is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+	return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+		prefetcher->prefetch_tail;
+}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b217ffa52f..fad2acb514 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
@@ -827,6 +828,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	TimeLineID	tli;
 	int			count;
 	WALReadError errinfo;
+	XLogReadLocalOptions *options = (XLogReadLocalOptions *) state->private_data;
 
 	loc = targetPagePtr + reqLen;
 
@@ -841,7 +843,23 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		 * notices recovery finishes, so we only have to maintain it for the
 		 * local process until recovery ends.
 		 */
-		if (!RecoveryInProgress())
+		if (options)
+		{
+			switch (options->read_upto_policy)
+			{
+			case XLRO_WALRCV_WRITTEN:
+				read_upto = GetWalRcvWriteRecPtr();
+				break;
+			case XLRO_LSN:
+				read_upto = options->lsn;
+				break;
+			default:
+				read_upto = 0;
+				elog(ERROR, "unknown read_upto_policy value");
+				break;
+			}
+		}
+		else if (!RecoveryInProgress())
 			read_upto = GetFlushRecPtr();
 		else
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
@@ -879,6 +897,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			if (loc <= read_upto)
 				break;
 
+			if (options && options->nowait)
+				break;
+
 			CHECK_FOR_INTERRUPTS();
 			pg_usleep(1000L);
 		}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index e3da7d3625..34f3017871 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -169,7 +169,7 @@ StartupDecodingContext(List *output_plugin_options,
 
 	ctx->slot = slot;
 
-	ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, ctx);
+	ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, NULL);
 	if (!ctx->reader)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8228e1f390..2e07e2394a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1240,6 +1240,18 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"wal_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch blocks that have full page images in the WAL"),
+			gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+						 "entirely overwritten, but if the logical page size of the filesystem is "
+						 "larger than PostgreSQL's, this can be beneficial.  This option has no "
+						 "effect unless wal_prefetch_distance is set to a positive number.")
+		},
+		&wal_prefetch_fpw,
+		false,
+		NULL, assign_wal_prefetch_fpw, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2626,6 +2638,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("How many bytes to read ahead in the WAL to prefetch referenced blocks."),
+			gettext_noop("Set to -1 to disable WAL prefetching."),
+			GUC_UNIT_BYTE
+		},
+		&wal_prefetch_distance,
+		-1, -1, INT_MAX,
+		NULL, assign_wal_prefetch_distance, NULL
+	},
+
 	{
 		{"wal_keep_segments", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the number of WAL files held for standby servers."),
@@ -11484,6 +11507,8 @@ assign_effective_io_concurrency(int newval, void *extra)
 {
 #ifdef USE_PREFETCH
 	target_prefetch_pages = *((int *) extra);
+	if (AmStartupProcess())
+		ResetWalPrefetcher();
 #endif							/* USE_PREFETCH */
 }
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033fc20..0a31edfba4 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -111,6 +111,8 @@ extern int	wal_keep_segments;
 extern int	XLOGbuffers;
 extern int	XLogArchiveTimeout;
 extern int	wal_retrieve_retry_interval;
+extern int	wal_prefetch_distance;
+extern bool wal_prefetch_fpw;
 extern char *XLogArchiveCommand;
 extern bool EnableHotStandby;
 extern bool fullPageWrites;
@@ -319,6 +321,8 @@ extern void SetWalWriterSleeping(bool sleeping);
 
 extern void XLogRequestWalReceiverReply(void);
 
+extern void ResetWalPrefetcher(void);
+
 extern void assign_max_wal_size(int newval, void *extra);
 extern void assign_checkpoint_completion_target(double newval, void *extra);
 
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
new file mode 100644
index 0000000000..070ffc5c85
--- /dev/null
+++ b/src/include/access/xlogprefetcher.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.h
+ *		Declarations for the XLog prefetching facility
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetcher.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCHER_H
+#define XLOGPREFETCHER_H
+
+#include "access/xlogdefs.h"
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch, XLogRecPtr replaying_lsn);
+
+#endif
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5181a077d9..1c8e67d74a 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,6 +47,26 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
 
+/*
+ * A pointer to an XLogReadLocalOptions struct can be supplied as the private
+ * data for an xlog reader, causing read_local_xlog_page to modify its
+ * behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+	/* Don't block waiting for new WAL to arrive. */
+	bool		nowait;
+
+	/* How far to read. */
+	enum {
+		XLRO_WALRCV_WRITTEN,
+		XLRO_LSN
+	} read_upto_policy;
+
+	/* If read_upto_policy is XLRO_LSN, the LSN. */
+	XLogRecPtr lsn;
+} XLogReadLocalOptions;
+
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
 								 XLogRecPtr targetRecPtr, char *cur_page);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 5d7a796ba0..6e91c33f3d 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -159,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
  */
 #define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
 
+/*
+ * When you try to prefetch a buffer, there are three possibilities: it's
+ * already cached in our buffer pool, it's not cached but we can ask the kernel
+ * already cached in our buffer pool, it's not cached but we can tell the kernel
+ */
 typedef enum PrefetchBufferResult
 {
 	PREFETCH_BUFFER_HIT,
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ce93ace76c..903b0ec02b 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -438,5 +438,7 @@ extern void assign_search_path(const char *newval, void *extra);
 /* in access/transam/xlog.c */
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+extern void assign_wal_prefetch_distance(int new_value, void *extra);
+extern void assign_wal_prefetch_fpw(bool new_value, void *extra);
 
 #endif							/* GUC_H */
-- 
2.23.0

#5Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#4)
5 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Wed, Feb 12, 2020 at 7:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:

1. It now uses effective_io_concurrency to control how many
concurrent prefetches to allow. It's possible that we should have a
different GUC to control "maintenance" users of concurrent I/O as
discussed elsewhere[1], but I'm staying out of that for now; if we
agree to do that for VACUUM etc, we can change it easily here. Note
that the value is percolated through the ComputeIoConcurrency()
function which I think we should discuss, but again that's off topic,
I just want to use the standard infrastructure here.

I started a separate thread[1]/messages/by-id/CA+hUKGJUw08dPs_3EUcdO6M90GnjofPYrWp4YSLaBkgYwS-AqA@mail.gmail.com to discuss that GUC, because it's
basically an independent question. Meanwhile, here's a new version of
the WAL prefetch patch, with the following changes:

1. A monitoring view:

postgres=# select * from pg_stat_wal_prefetcher ;
 prefetch | skip_hit | skip_new | skip_fpw | skip_seq | distance | queue_depth
----------+----------+----------+----------+----------+----------+-------------
    95854 |   291458 |      435 |        0 |    26245 |   261800 |          10
(1 row)

That shows a bunch of counters for blocks prefetched and skipped for
various reasons. It also shows the current read-ahead distance (in
bytes of WAL) and queue depth (an approximation of how many I/Os might
be in flight, used for rate limiting; I'm struggling to come up with a
better short name for this). This can be used to see the effects of
experiments with different settings, eg:

alter system set effective_io_concurrency = 20;
alter system set wal_prefetch_distance = '256kB';
select pg_reload_conf();
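
To watch those settings take effect during a long recovery, the view can be
polled from another session, e.g. on a hot standby (an illustrative psql
snippet, not something the patch adds):

select distance, queue_depth from pg_stat_wal_prefetcher;
\watch 1

If queue_depth stays pinned at the limit derived from
effective_io_concurrency, raising that setting is probably more useful than
increasing wal_prefetch_distance; if distance keeps bumping into
wal_prefetch_distance, the opposite.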

2. A log message when WAL prefetching begins and ends, so you can see
what it did during crash recovery:

LOG: WAL prefetch finished at 0/C5E98758; prefetch = 1112628,
     skip_hit = 3607540, skip_new = 45592, skip_fpw = 0, skip_seq = 177049,
     avg_distance = 247907.942532, avg_queue_depth = 22.261352

3. A bit of general user documentation.

[1]: /messages/by-id/CA+hUKGJUw08dPs_3EUcdO6M90GnjofPYrWp4YSLaBkgYwS-AqA@mail.gmail.com

Attachments:

0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRelatio.patch
From a61b4e00c42ace5db1608e02165f89094bf86391 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 3 Dec 2019 17:13:40 +1300
Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.

Previously a Relation was required, but it's annoying to have
to create a "fake" one in recovery.
---
 src/backend/storage/buffer/bufmgr.c | 77 ++++++++++++++++-------------
 src/include/storage/bufmgr.h        |  3 ++
 2 files changed, 47 insertions(+), 33 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5880054245..6e0875022c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -519,6 +519,48 @@ ComputeIoConcurrency(int io_concurrency, double *target)
 	return (new_prefetch_pages >= 0.0 && new_prefetch_pages < (double) INT_MAX);
 }
 
+void
+SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum, BlockNumber blockNum)
+{
+#ifdef USE_PREFETCH
+	BufferTag	newTag;		/* identity of requested block */
+	uint32		newHash;	/* hash value for newTag */
+	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
+	int			buf_id;
+
+	Assert(BlockNumberIsValid(blockNum));
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(newTag, smgr_reln->smgr_rnode.node,
+				   forkNum, blockNum);
+
+	/* determine its hash code and partition lock ID */
+	newHash = BufTableHashCode(&newTag);
+	newPartitionLock = BufMappingPartitionLock(newHash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(newPartitionLock, LW_SHARED);
+	buf_id = BufTableLookup(&newTag, newHash);
+	LWLockRelease(newPartitionLock);
+
+	/* If not in buffers, initiate prefetch */
+	if (buf_id < 0)
+		smgrprefetch(smgr_reln, forkNum, blockNum);
+
+	/*
+	 * If the block *is* in buffers, we do nothing.  This is not really ideal:
+	 * the block might be just about to be evicted, which would be stupid
+	 * since we know we are going to need it soon.  But the only easy answer
+	 * is to bump the usage_count, which does not seem like a great solution:
+	 * when the caller does ultimately touch the block, usage_count would get
+	 * bumped again, resulting in too much favoritism for blocks that are
+	 * involved in a prefetch sequence. A real fix would involve some
+	 * additional per-buffer state, and it's not clear that there's enough of
+	 * a problem to justify that.
+	 */
+#endif
+}
+
 /*
  * PrefetchBuffer -- initiate asynchronous read of a block of a relation
  *
@@ -550,39 +592,8 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 	}
 	else
 	{
-		BufferTag	newTag;		/* identity of requested block */
-		uint32		newHash;	/* hash value for newTag */
-		LWLock	   *newPartitionLock;	/* buffer partition lock for it */
-		int			buf_id;
-
-		/* create a tag so we can lookup the buffer */
-		INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
-					   forkNum, blockNum);
-
-		/* determine its hash code and partition lock ID */
-		newHash = BufTableHashCode(&newTag);
-		newPartitionLock = BufMappingPartitionLock(newHash);
-
-		/* see if the block is in the buffer pool already */
-		LWLockAcquire(newPartitionLock, LW_SHARED);
-		buf_id = BufTableLookup(&newTag, newHash);
-		LWLockRelease(newPartitionLock);
-
-		/* If not in buffers, initiate prefetch */
-		if (buf_id < 0)
-			smgrprefetch(reln->rd_smgr, forkNum, blockNum);
-
-		/*
-		 * If the block *is* in buffers, we do nothing.  This is not really
-		 * ideal: the block might be just about to be evicted, which would be
-		 * stupid since we know we are going to need it soon.  But the only
-		 * easy answer is to bump the usage_count, which does not seem like a
-		 * great solution: when the caller does ultimately touch the block,
-		 * usage_count would get bumped again, resulting in too much
-		 * favoritism for blocks that are involved in a prefetch sequence. A
-		 * real fix would involve some additional per-buffer state, and it's
-		 * not clear that there's enough of a problem to justify that.
-		 */
+		/* pass it to the shared buffer version */
+		SharedPrefetchBuffer(reln->rd_smgr, forkNum, blockNum);
 	}
 #endif							/* USE_PREFETCH */
 }
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 73c7e9ba38..89a47afec1 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -18,6 +18,7 @@
 #include "storage/buf.h"
 #include "storage/bufpage.h"
 #include "storage/relfilenode.h"
+#include "storage/smgr.h"
 #include "utils/relcache.h"
 #include "utils/snapmgr.h"
 
@@ -162,6 +163,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
  * prototypes for functions in bufmgr.c
  */
 extern bool ComputeIoConcurrency(int io_concurrency, double *target);
+extern void SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum,
+								 BlockNumber blockNum);
 extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
 						   BlockNumber blockNum);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
-- 
2.20.1

0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecPtr.patch
From acbff1444d0acce71b0218ce083df03992af1581 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:10:17 +1300
Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().

The new name better reflects the fact that the value it returns
is updated only when received data has been flushed to disk.

An upcoming patch will make use of the latest data that was
written without waiting for it to be flushed, so use more
precise function names.
---
 src/backend/access/transam/xlog.c          | 4 ++--
 src/backend/access/transam/xlogfuncs.c     | 2 +-
 src/backend/replication/walreceiverfuncs.c | 4 ++--
 src/backend/replication/walsender.c        | 2 +-
 src/include/replication/walreceiver.h      | 2 +-
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d19408b3be..cc7072ba13 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9283,7 +9283,7 @@ CreateRestartPoint(int flags)
 	 * Retreat _logSegNo using the current end of xlog replayed or received,
 	 * whichever is later.
 	 */
-	receivePtr = GetWalRcvWriteRecPtr(NULL, NULL);
+	receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
 	replayPtr = GetXLogReplayRecPtr(&replayTLI);
 	endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
 	KeepLogSeg(endptr, &_logSegNo);
@@ -12104,7 +12104,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					{
 						XLogRecPtr	latestChunkStart;
 
-						receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI);
+						receivedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
 						if (RecPtr < receivedUpto && receiveTLI == curFileTLI)
 						{
 							havedata = true;
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 20316539b6..e075c1c71b 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -398,7 +398,7 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
 {
 	XLogRecPtr	recptr;
 
-	recptr = GetWalRcvWriteRecPtr(NULL, NULL);
+	recptr = GetWalRcvFlushRecPtr(NULL, NULL);
 
 	if (recptr == 0)
 		PG_RETURN_NULL();
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 89c903e45a..9bce63b534 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -286,7 +286,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 }
 
 /*
- * Returns the last+1 byte position that walreceiver has written.
+ * Returns the last+1 byte position that walreceiver has flushed.
  *
  * Optionally, returns the previous chunk start, that is the first byte
  * written in the most recent walreceiver flush cycle.  Callers not
@@ -294,7 +294,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
  * receiveTLI.
  */
 XLogRecPtr
-GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 {
 	WalRcvData *walrcv = WalRcv;
 	XLogRecPtr	recptr;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index abb533b9d0..1079b3f8cb 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2903,7 +2903,7 @@ GetStandbyFlushRecPtr(void)
 	 * has streamed, but hasn't been replayed yet.
 	 */
 
-	receivePtr = GetWalRcvWriteRecPtr(NULL, &receiveTLI);
+	receivePtr = GetWalRcvFlushRecPtr(NULL, &receiveTLI);
 	replayPtr = GetXLogReplayRecPtr(&replayTLI);
 
 	ThisTimeLineID = replayTLI;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6548..147b374a26 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -322,7 +322,7 @@ extern bool WalRcvStreaming(void);
 extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname);
-extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.20.1

0003-Add-WalRcvGetWriteRecPtr-new-definition.patch
From d7fa7d82c5f68d0cccf441ce9e8dfa40f64d3e0d Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).

A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk.  To do that,
it needs to be able to see the write pointer advancing in shared
memory.

The function formerly bearing this name was recently renamed to
WalRcvGetFlushRecPtr(), which better described what it does.
---
 src/backend/replication/walreceiver.c      |  5 +++++
 src/backend/replication/walreceiverfuncs.c | 10 ++++++++++
 src/include/replication/walreceiver.h      |  9 +++++++++
 3 files changed, 24 insertions(+)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2ab15c3cbb..88a51ba35f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -244,6 +244,8 @@ WalReceiverMain(void)
 
 	SpinLockRelease(&walrcv->mutex);
 
+	pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
+
 	/* Arrange to clean up at walreceiver exit */
 	on_shmem_exit(WalRcvDie, 0);
 
@@ -985,6 +987,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 		LogstreamResult.Write = recptr;
 	}
+
+	/* Update shared-memory status */
+	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 }
 
 /*
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 9bce63b534..14e9a6245a 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -310,6 +310,16 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 	return recptr;
 }
 
+/*
+ * Returns the last+1 byte position that walreceiver has written.
+ * This returns a recently written value without taking a lock.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtr(void)
+{
+	return pg_atomic_read_u64(&WalRcv->writtenUpto);
+}
+
 /*
  * Returns the replication apply delay in ms or -1
  * if the apply delay info is not available
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 147b374a26..1e8f304dc4 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -16,6 +16,7 @@
 #include "access/xlogdefs.h"
 #include "getaddrinfo.h"		/* for NI_MAXHOST */
 #include "pgtime.h"
+#include "port/atomics.h"
 #include "replication/logicalproto.h"
 #include "replication/walsender.h"
 #include "storage/latch.h"
@@ -83,6 +84,13 @@ typedef struct
 	XLogRecPtr	receivedUpto;
 	TimeLineID	receivedTLI;
 
+	/*
+	 * Same as above, but advanced after writing and before flushing, without
+	 * the need to acquire the spin lock.  Data can be read by another process
+	 * up to this point, but shouldn't be used for data integrity purposes.
+	 */
+	pg_atomic_uint64 writtenUpto;
+
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
 	 * of received WAL.  It's actually the same as the previous value of
@@ -323,6 +331,7 @@ extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname);
 extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(void);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.20.1

0004-Allow-PrefetchBuffer-to-report-the-outcome.patch
From f9a53985e0e30659caa41c95c85001c91b3deb5f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 30 Dec 2019 16:43:50 +1300
Subject: [PATCH 4/5] Allow PrefetchBuffer() to report the outcome.

Report when a relation's backing file is missing, to prepare
for use during recovery.  This will be used to handle cases of
relations that are referenced in the WAL but have been unlinked
already due to actions covered by WAL records that haven't been
replayed yet, after a crash.

Also report whether a prefetch was actually initiated, so that
callers can limit the number of concurrent I/Os they try to
issue, without counting the prefetch calls that did nothing
because the page was already in our buffers.

Author: Thomas Munro
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c |  9 +++++++--
 src/backend/storage/smgr/md.c       |  9 +++++++--
 src/backend/storage/smgr/smgr.c     | 10 +++++++---
 src/include/storage/bufmgr.h        | 12 ++++++++++--
 src/include/storage/md.h            |  2 +-
 src/include/storage/smgr.h          |  2 +-
 6 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6e0875022c..5dbbcf8111 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -519,7 +519,7 @@ ComputeIoConcurrency(int io_concurrency, double *target)
 	return (new_prefetch_pages >= 0.0 && new_prefetch_pages < (double) INT_MAX);
 }
 
-void
+PrefetchBufferResult
 SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum, BlockNumber blockNum)
 {
 #ifdef USE_PREFETCH
@@ -545,7 +545,11 @@ SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum, BlockNumber blo
 
 	/* If not in buffers, initiate prefetch */
 	if (buf_id < 0)
-		smgrprefetch(smgr_reln, forkNum, blockNum);
+	{
+		if (!smgrprefetch(smgr_reln, forkNum, blockNum))
+			return PREFETCH_BUFFER_NOREL;
+		return PREFETCH_BUFFER_MISS;
+	}
 
 	/*
 	 * If the block *is* in buffers, we do nothing.  This is not really ideal:
@@ -559,6 +563,7 @@ SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum, BlockNumber blo
 	 * a problem to justify that.
 	 */
 #endif
+	return PREFETCH_BUFFER_HIT;
 }
 
 /*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index c5b771c531..ba12fc2077 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -525,14 +525,17 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 /*
  *	mdprefetch() -- Initiate asynchronous read of the specified block of a relation
  */
-void
+bool
 mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 {
 #ifdef USE_PREFETCH
 	off_t		seekpos;
 	MdfdVec    *v;
 
-	v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+	v = _mdfd_getseg(reln, forknum, blocknum, false,
+					 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
+	if (v == NULL)
+		return false;
 
 	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
@@ -540,6 +543,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 
 	(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
 #endif							/* USE_PREFETCH */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..c39dd533e6 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -49,7 +49,7 @@ typedef struct f_smgr
 								bool isRedo);
 	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
 								BlockNumber blocknum, char *buffer, bool skipFsync);
-	void		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
+	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber blocknum);
 	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
 							  BlockNumber blocknum, char *buffer);
@@ -489,11 +489,15 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 /*
  *	smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
+ *
+ *		In recovery only, this can return false to indicate that a file
+ *		doesn't	exist (presumably it has been dropped by a later WAL
+ *		record).
  */
-void
+bool
 smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 {
-	smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
+	return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
 }
 
 /*
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 89a47afec1..5d7a796ba0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -159,12 +159,20 @@ extern PGDLLIMPORT int32 *LocalRefCount;
  */
 #define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
 
+typedef enum PrefetchBufferResult
+{
+	PREFETCH_BUFFER_HIT,
+	PREFETCH_BUFFER_MISS,
+	PREFETCH_BUFFER_NOREL
+} PrefetchBufferResult;
+
 /*
  * prototypes for functions in bufmgr.c
  */
 extern bool ComputeIoConcurrency(int io_concurrency, double *target);
-extern void SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum,
-								 BlockNumber blockNum);
+extern PrefetchBufferResult SharedPrefetchBuffer(SMgrRelation smgr_reln,
+												 ForkNumber forkNum,
+												 BlockNumber blockNum);
 extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
 						   BlockNumber blockNum);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index ec7630ce3b..07fd1bb7d0 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -28,7 +28,7 @@ extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
 extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
 extern void mdextend(SMgrRelation reln, ForkNumber forknum,
 					 BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum);
 extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 				   char *buffer);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 243822137c..dc740443e2 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,7 +92,7 @@ extern void smgrdounlink(SMgrRelation reln, bool isRedo);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum);
 extern void smgrread(SMgrRelation reln, ForkNumber forknum,
 					 BlockNumber blocknum, char *buffer);
-- 
2.20.1

0005-Prefetch-referenced-blocks-during-recovery.patch
From 6dc2cfa4b64ac25513c36538272e08b937bd46a4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 2 Mar 2020 15:33:51 +1300
Subject: [PATCH 5/5] Prefetch referenced blocks during recovery.

Introduce a new GUC wal_prefetch_distance.  If it is set to a positive
number of bytes, then read ahead in the WAL at most that distance, and
initiate asynchronous reading of referenced blocks.  The goal is to
avoid I/O stalls and benefit from concurrent I/O.

The number of concurrent asynchronous reads is limited by both
effective_io_concurrency and wal_prefetch_distance.  The feature is
disabled by default.

Author: Thomas Munro
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                    |  38 ++
 doc/src/sgml/monitoring.sgml                |  69 +++
 doc/src/sgml/wal.sgml                       |  12 +
 src/backend/access/transam/Makefile         |   1 +
 src/backend/access/transam/xlog.c           |  64 ++
 src/backend/access/transam/xlogprefetcher.c | 653 ++++++++++++++++++++
 src/backend/access/transam/xlogutils.c      |  23 +-
 src/backend/catalog/system_views.sql        |  11 +
 src/backend/replication/logical/logical.c   |   2 +-
 src/backend/storage/ipc/ipci.c              |   3 +
 src/backend/utils/misc/guc.c                |  25 +
 src/include/access/xlog.h                   |   4 +
 src/include/access/xlogprefetcher.h         |  28 +
 src/include/access/xlogutils.h              |  20 +
 src/include/catalog/pg_proc.dat             |   8 +
 src/include/storage/bufmgr.h                |   5 +
 src/include/utils/guc.h                     |   2 +
 src/test/regress/expected/rules.out         |   8 +
 18 files changed, 974 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetcher.c
 create mode 100644 src/include/access/xlogprefetcher.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f89ec..415b0793e1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3082,6 +3082,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-prefetch-distance" xreflabel="wal_prefetch_distance">
+      <term><varname>wal_prefetch_distance</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_prefetch_distance</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        The maximum distance to look ahead in the WAL during recovery, to find
+        blocks to prefetch.  Prefetching blocks that will soon be needed can
+        reduce I/O wait times.  The number of concurrent prefetches is limited
+        by this setting as well as <xref linkend="guc-effective-io-concurrency"/>.
+        If this value is specified without units, it is taken as bytes.
+        The default is -1, meaning that WAL prefetching is disabled.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw">
+      <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks with full page images during recovery.
+        Usually this doesn't help, since such blocks will not be read.  However,
+        on file systems with a block size larger than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a costly
+        read-before-write when blocks are later written.
+        This setting has no effect unless
+        <xref linkend="guc-wal-prefetch-distance"/> is set to a positive number.
+        The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 87586a7b06..013537d2be 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_wal_prefetcher</structname><indexterm><primary>pg_stat_wal_prefetcher</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-wal-prefetcher-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2184,6 +2191,68 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    connected server.
   </para>
 
+  <table id="pg-stat-wal-prefetcher-view" xreflabel="pg_stat_wal_prefetcher">
+   <title><structname>pg_stat_wal_prefetcher</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-wal-prefetch-fpw"/> was set to <literal>off</literal></entry>
+    </row>
+    <row>
+     <entry><structfield>skip_seq</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+    </row>
+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+   </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_wal_prefetcher</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-wal-prefetch-distance"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-wal-prefetch-distance"/>,
+   <xref linkend="guc-wal-prefetch-fpw"/> or
+   <xref linkend="guc-effective-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="3">
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 4eb8feb903..943462ca05 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -719,6 +719,18 @@
    <acronym>WAL</acronym> call being logged to the server log. This
    option might be replaced by a more general mechanism in the future.
   </para>
+
+  <para>
+   The <xref linkend="guc-wal-prefetch-distance"/> parameter can be
+   used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed, in combination with the
+   <xref linkend="guc-effective-io-concurrency"/> parameter.  The
+   prefetching mechanism is most likely to be effective on systems
+   with <varname>full_page_writes</varname> set to
+   <varname>off</varname> (where that is safe), and where the working
+   <literal>off</literal> (where that is safe), and where the working
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..20e044c7c8 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetcher.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cc7072ba13..d042ebeaf5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -34,6 +34,7 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -104,6 +105,8 @@ int			wal_level = WAL_LEVEL_MINIMAL;
 int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
+int			wal_prefetch_distance = -1;
+bool		wal_prefetch_fpw = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -805,6 +808,7 @@ static XLogSource readSource = 0;	/* XLOG_FROM_* code */
  */
 static XLogSource currentSource = 0;	/* XLOG_FROM_* code */
 static bool lastSourceFailed = false;
+static bool reset_wal_prefetcher = false;
 
 typedef struct XLogPageReadPrivate
 {
@@ -6212,6 +6216,7 @@ CheckRequiredParameterValues(void)
 	}
 }
 
+
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
@@ -7068,6 +7073,7 @@ StartupXLOG(void)
 		{
 			ErrorContextCallback errcallback;
 			TimestampTz xtime;
+			XLogPrefetcher *prefetcher = NULL;
 
 			InRedo = true;
 
@@ -7075,6 +7081,9 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
 
+			/* the first time through, see if we need to enable prefetching */
+			ResetWalPrefetcher();
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7104,6 +7113,31 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/*
+				 * The first time through, or if any relevant settings or the
+				 * WAL source changes, we'll restart the prefetching machinery
+				 * as appropriate.  This is simpler than trying to handle
+				 * various complicated state changes.
+				 */
+				if (unlikely(reset_wal_prefetcher))
+				{
+					/* If we had one already, destroy it. */
+					if (prefetcher)
+					{
+						XLogPrefetcherFree(prefetcher);
+						prefetcher = NULL;
+					}
+					/* If we want one, create it. */
+					if (wal_prefetch_distance > 0)
+							prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr,
+																currentSource == XLOG_FROM_STREAM);
+					reset_wal_prefetcher = false;
+				}
+
+				/* Perform WAL prefetching, if enabled. */
+				if (prefetcher)
+					XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr);
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -7291,6 +7325,8 @@ StartupXLOG(void)
 			/*
 			 * end of main redo apply loop
 			 */
+			if (prefetcher)
+				XLogPrefetcherFree(prefetcher);
 
 			if (reachedRecoveryTarget)
 			{
@@ -10150,6 +10186,24 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 	}
 }
 
+void
+assign_wal_prefetch_distance(int new_value, void *extra)
+{
+	/* Reset the WAL prefetcher, because a setting it depends on changed. */
+	wal_prefetch_distance = new_value;
+	if (AmStartupProcess())
+		ResetWalPrefetcher();
+}
+
+void
+assign_wal_prefetch_fpw(bool new_value, void *extra)
+{
+	/* Reset the WAL prefetcher, because a setting it depends on changed. */
+	wal_prefetch_fpw = new_value;
+	if (AmStartupProcess())
+		ResetWalPrefetcher();
+}
+
 
 /*
  * Issue appropriate kind of fsync (if any) for an XLOG output file.
@@ -11933,6 +11987,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * and move on to the next state.
 					 */
 					currentSource = XLOG_FROM_STREAM;
+					ResetWalPrefetcher();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -12356,3 +12411,12 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * Schedule a WAL prefetcher reset, on change of relevant settings.
+ */
+void
+ResetWalPrefetcher(void)
+{
+	reset_wal_prefetcher = true;
+}
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
new file mode 100644
index 0000000000..5b32522bb5
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -0,0 +1,653 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ *		Prefetching support for PostgreSQL write-ahead log manager
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetcher.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetcher.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/shmem.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log message
+ * that appears at the end of crash recovery.
+ */
+#define XLOGPREFETCHER_MONITORING_SAMPLE_STEP 32768
+
+/*
+ * Internal state used for book-keeping.
+ */
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	XLogReadLocalOptions options;
+	bool			have_record;
+	bool			shutdown;
+	int				next_block_id;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB		   *filter_table;
+	dlist_head		filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	XLogRecPtr	   *prefetch_queue;
+	int				prefetch_queue_size;
+	int				prefetch_head;
+	int				prefetch_tail;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation	last_reln;
+	RelFileNode		last_rnode;
+	BlockNumber		last_blkno;
+
+	/* Counters used to compute avg_queue_depth and avg_distance. */
+	double			samples;
+	double			queue_depth_sum;
+	double			distance_sum;
+	XLogRecPtr		next_sample_lsn;
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode		rnode;
+	XLogRecPtr		filter_until_replayed;
+	BlockNumber		filter_from_block;
+	dlist_node		link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory just for the benefit of monitoring
+ * functions.
+ */
+typedef struct XLogPrefetcherMonitoringStats
+{
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 skip_hit;	/* Blocks already buffered. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+	pg_atomic_uint64 skip_seq;	/* Sequential/repeat blocks skipped. */
+	int				distance;	/* Number of bytes ahead in the WAL. */
+	int				queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetcherMonitoringStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+
+/*
+ * On modern systems this is really just *counter++.  On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)
+{
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
+static XLogPrefetcherMonitoringStats *MonitoringStats;
+
+size_t
+XLogPrefetcherShmemSize(void)
+{
+	return sizeof(XLogPrefetcherMonitoringStats);
+}
+
+static void
+XLogPrefetcherResetMonitoringStats(void)
+{
+	pg_atomic_init_u64(&MonitoringStats->prefetch, 0);
+	pg_atomic_init_u64(&MonitoringStats->skip_hit, 0);
+	pg_atomic_init_u64(&MonitoringStats->skip_new, 0);
+	pg_atomic_init_u64(&MonitoringStats->skip_fpw, 0);
+	pg_atomic_init_u64(&MonitoringStats->skip_seq, 0);
+	MonitoringStats->distance = -1;
+	MonitoringStats->queue_depth = 0;
+}
+
+void
+XLogPrefetcherShmemInit(void)
+{
+	bool		found;
+
+	MonitoringStats = (XLogPrefetcherMonitoringStats *)
+		ShmemInitStruct("XLogPrefetcherMonitoringStats",
+						sizeof(XLogPrefetcherMonitoringStats),
+						&found);
+	if (!found)
+		XLogPrefetcherResetMonitoringStats();
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
+{
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+	XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher));
+
+	prefetcher->options.nowait = true;
+	if (streaming)
+	{
+		/*
+		 * We're only allowed to read as far as the WAL receiver has written.
+		 * We don't have to wait for it to be flushed, though, as recovery
+		 * does, so that gives us a chance to get a bit further ahead.
+		 */
+		prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+	}
+	else
+	{
+		/* We're allowed to read as far as we can. */
+		prefetcher->options.read_upto_policy = XLRO_LSN;
+		prefetcher->options.lsn = (XLogRecPtr) -1;
+	}
+	prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+											NULL,
+											read_local_xlog_page,
+											&prefetcher->options);
+	prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	/*
+	 * The size of the queue is determined by target_prefetch_pages, which is
+	 * derived from effective_io_concurrency.  In theory we might have a
+	 * separate queue for each tablespace, but it's not clear how that should
+	 * work, so for now we'll just use the system-wide GUC to rate-limit all
+	 * prefetching.
+	 */
+	prefetcher->prefetch_queue_size = target_prefetch_pages;
+	prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);
+	prefetcher->prefetch_head = prefetcher->prefetch_tail = 0;
+
+	/* Prepare to read at the given LSN. */
+	elog(LOG, "WAL prefetch started at %X/%X",
+		 (uint32) (lsn >> 32), (uint32) lsn);
+	XLogBeginRead(prefetcher->reader, lsn);
+
+	XLogPrefetcherResetMonitoringStats();
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	double		avg_distance = 0;
+	double		avg_queue_depth = 0;
+
+	/* Log final statistics. */
+	if (prefetcher->samples > 0)
+	{
+		avg_distance = prefetcher->distance_sum / prefetcher->samples;
+		avg_queue_depth = prefetcher->queue_depth_sum / prefetcher->samples;
+	}
+	elog(LOG,
+		 "WAL prefetch finished at %X/%X; "
+		 "prefetch = " UINT64_FORMAT ", "
+		 "skip_hit = " UINT64_FORMAT ", "
+		 "skip_new = " UINT64_FORMAT ", "
+		 "skip_fpw = " UINT64_FORMAT ", "
+		 "skip_seq = " UINT64_FORMAT ", "
+		 "avg_distance = %f, "
+		 "avg_queue_depth = %f",
+		 (uint32) (prefetcher->reader->EndRecPtr >> 32),
+		 (uint32) (prefetcher->reader->EndRecPtr),
+		 pg_atomic_read_u64(&MonitoringStats->prefetch),
+		 pg_atomic_read_u64(&MonitoringStats->skip_hit),
+		 pg_atomic_read_u64(&MonitoringStats->skip_new),
+		 pg_atomic_read_u64(&MonitoringStats->skip_fpw),
+		 pg_atomic_read_u64(&MonitoringStats->skip_seq),
+		 avg_distance,
+		 avg_queue_depth);
+	XLogReaderFree(prefetcher->reader);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher->prefetch_queue);
+	pfree(prefetcher);
+
+	XLogPrefetcherResetMonitoringStats();
+}
+
+/*
+ * Read ahead in the WAL, as far as we can within the limits set by the user.
+ * Begin fetching any referenced blocks that are not already in the buffer
+ * pool.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	/* If an error has occurred or we've hit the end of the WAL, do nothing. */
+	if (prefetcher->shutdown)
+		return;
+
+	/*
+	 * Have any in-flight prefetches definitely completed, judging by the LSN
+	 * that is currently being replayed?
+	 */
+	XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+	/*
+	 * Do we already have the maximum permitted number of I/Os running
+	 * (according to the information we have)?  If so, we have to wait for at
+	 * least one to complete, so give up early.
+	 */
+	if (XLogPrefetcherSaturated(prefetcher))
+		return;
+
+	/* Can we drop any filters yet, due to problem records being replayed? */
+	XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+	/* Main prefetch loop. */
+	for (;;)
+	{
+		XLogReaderState *reader = prefetcher->reader;
+		char *error;
+		int64		distance;
+
+		/* If we don't already have a record, then try to read one. */
+		if (!prefetcher->have_record)
+		{
+			if (!XLogReadRecord(reader, &error))
+			{
+				/* If we got an error, log it and give up. */
+				if (error)
+				{
+					elog(LOG, "WAL prefetch error: %s", error);
+					prefetcher->shutdown = true;
+				}
+				/* Otherwise, we'll try again later when more data is here. */
+				return;
+			}
+			prefetcher->have_record = true;
+			prefetcher->next_block_id = 0;
+		}
+
+		/* How far ahead of replay are we now? */
+		distance = prefetcher->reader->ReadRecPtr - replaying_lsn;
+
+		/* Update distance shown in shm. */
+		MonitoringStats->distance = distance;
+
+		/* Sample the averages so we can log them at end of recovery. */
+		if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+		{
+			prefetcher->distance_sum += MonitoringStats->distance;
+			prefetcher->queue_depth_sum += MonitoringStats->queue_depth;
+			prefetcher->samples += 1.0;
+			prefetcher->next_sample_lsn =
+				replaying_lsn + XLOGPREFETCHER_MONITORING_SAMPLE_STEP;
+		}
+
+		/* Are we too far ahead of replay? */
+		if (distance >= wal_prefetch_distance)
+			break;
+
+		/*
+		 * If this is a record that creates a new SMGR relation, we'll avoid
+		 * prefetching anything from that rnode until it has been replayed.
+		 */
+		if (replaying_lsn < reader->ReadRecPtr &&
+			XLogRecGetRmid(reader) == RM_SMGR_ID &&
+			(XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+		{
+			xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+			XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+									reader->ReadRecPtr);
+		}
+
+		/*
+		 * Scan the record for block references.  We might already have been
+		 * partway through processing this record when we hit maximum I/O
+		 * concurrency, so start where we left off.
+		 */
+		for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i)
+		{
+			DecodedBkpBlock *block = &reader->blocks[i];
+			SMgrRelation reln;
+
+			/* Ignore everything but the main fork for now. */
+			if (block->forknum != MAIN_FORKNUM)
+				continue;
+
+			/*
+			 * If there is a full page image attached, we won't be reading the
+			 * page, so you might think we should skip it.  However, if the
+			 * underlying filesystem uses larger logical blocks than we do, it
+			 * might still need to perform a read-before-write some time later.
+			 * Therefore, only prefetch if configured to do so.
+			 */
+			if (block->has_image && !wal_prefetch_fpw)
+			{
+				inc_counter(&MonitoringStats->skip_fpw);
+				continue;
+			}
+
+			/*
+			 * If this block will initialize a new page then it's probably an
+			 * extension.  Since it might create a new segment, we can't try
+			 * to prefetch this block until the record has been replayed, or we
+			 * might try to open a file that doesn't exist yet.
+			 */
+			if (block->flags & BKPBLOCK_WILL_INIT)
+			{
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+										reader->ReadRecPtr);
+				inc_counter(&MonitoringStats->skip_new);
+				continue;
+			}
+
+			/* Should we skip this block due to a filter? */
+			if (XLogPrefetcherIsFiltered(prefetcher, block->rnode,
+										 block->blkno))
+			{
+				inc_counter(&MonitoringStats->skip_new);
+				continue;
+			}
+
+			/* Fast path for repeated references to the same relation. */
+			if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+			{
+				/*
+				 * If this is a repeat or sequential access, then skip it.  We
+				 * expect the kernel to detect sequential access on its own
+				 * and do a better job than we could.
+				 */
+				if (block->blkno == prefetcher->last_blkno ||
+					block->blkno == prefetcher->last_blkno + 1)
+				{
+					prefetcher->last_blkno = block->blkno;
+					inc_counter(&MonitoringStats->skip_seq);
+					continue;
+				}
+
+				/* We can avoid calling smgropen(). */
+				reln = prefetcher->last_reln;
+			}
+			else
+			{
+				/* Otherwise we have to open it. */
+				reln = smgropen(block->rnode, InvalidBackendId);
+				prefetcher->last_rnode = block->rnode;
+				prefetcher->last_reln = reln;
+			}
+			prefetcher->last_blkno = block->blkno;
+
+			/* Try to prefetch this block! */
+			switch (SharedPrefetchBuffer(reln, block->forknum, block->blkno))
+			{
+			case PREFETCH_BUFFER_HIT:
+				/* It's already cached, so do nothing. */
+				inc_counter(&MonitoringStats->skip_hit);
+				break;
+			case PREFETCH_BUFFER_MISS:
+				/*
+				 * I/O has possibly been initiated (we don't know whether the
+				 * kernel already had the page cached, so for lack of better
+				 * information we assume an I/O was started).  Record this as
+				 * an I/O in progress until we eventually replay this LSN.
+				 */
+				inc_counter(&MonitoringStats->prefetch);
+				XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+				/*
+				 * If the queue is now full, we'll have to wait before
+				 * processing any more blocks from this record.
+				 */
+				if (XLogPrefetcherSaturated(prefetcher))
+				{
+					prefetcher->next_block_id = i + 1;
+					return;
+				}
+				break;
+			case PREFETCH_BUFFER_NOREL:
+				/*
+				 * The underlying segment file doesn't exist.  Presumably it
+				 * will be unlinked by a later WAL record.  When recovery
+				 * reads this block, it will use the EXTENSION_CREATE_RECOVERY
+				 * flag.  We certainly don't want to do that sort of thing
+				 * while merely prefetching, so let's just ignore references
+				 * to this relation until this record is replayed, and let
+				 * recovery create the dummy file or complain if something is
+				 * wrong.
+				 */
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+										reader->ReadRecPtr);
+				inc_counter(&MonitoringStats->skip_new);
+				break;
+			}
+		}
+
+		/* Advance to the next record. */
+		prefetcher->have_record = false;
+	}
+}
+
+/*
+ * Expose statistics about WAL prefetching.
+ */
+Datum
+pg_stat_get_wal_prefetcher(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_WAL_PREFETCHER_COLS 7
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_WAL_PREFETCHER_COLS];
+	bool		nulls[PG_STAT_GET_WAL_PREFETCHER_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (MonitoringStats->distance < 0)
+	{
+		for (int i = 0; i < PG_STAT_GET_WAL_PREFETCHER_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_WAL_PREFETCHER_COLS; ++i)
+			nulls[i] = false;
+		values[0] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->prefetch));
+		values[1] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_hit));
+		values[2] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_new));
+		values[3] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_fpw));
+		values[4] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_seq));
+		values[5] = Int32GetDatum(MonitoringStats->distance);
+		values[6] = Int32GetDatum(MonitoringStats->queue_depth);
+	}
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the (presumably lower) block
+		 * number there because we don't want to have to track individual
+		 * blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of the
+	 * time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+												   HASH_FIND, NULL);
+
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Insert an LSN into the queue.  The queue must not be full already.  This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+						  XLogRecPtr prefetching_lsn)
+{
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+	prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+	prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+	MonitoringStats->queue_depth++;
+	Assert(MonitoringStats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet?  That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches.  For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+		   prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+	{
+		prefetcher->prefetch_tail++;
+		prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+		MonitoringStats->queue_depth--;
+		Assert(MonitoringStats->queue_depth >= 0);
+	}
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+	return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+		prefetcher->prefetch_tail;
+}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b217ffa52f..fad2acb514 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
@@ -827,6 +828,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	TimeLineID	tli;
 	int			count;
 	WALReadError errinfo;
+	XLogReadLocalOptions *options = (XLogReadLocalOptions *) state->private_data;
 
 	loc = targetPagePtr + reqLen;
 
@@ -841,7 +843,23 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		 * notices recovery finishes, so we only have to maintain it for the
 		 * local process until recovery ends.
 		 */
-		if (!RecoveryInProgress())
+		if (options)
+		{
+			switch (options->read_upto_policy)
+			{
+			case XLRO_WALRCV_WRITTEN:
+				read_upto = GetWalRcvWriteRecPtr();
+				break;
+			case XLRO_LSN:
+				read_upto = options->lsn;
+				break;
+			default:
+				read_upto = 0;
+				elog(ERROR, "unknown read_upto_policy value");
+				break;
+			}
+		}
+		else if (!RecoveryInProgress())
 			read_upto = GetFlushRecPtr();
 		else
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
@@ -879,6 +897,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			if (loc <= read_upto)
 				break;
 
+			if (options && options->nowait)
+				break;
+
 			CHECK_FOR_INTERRUPTS();
 			pg_usleep(1000L);
 		}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f681aafcf9..d0882e5f82 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -811,6 +811,17 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_wal_prefetcher AS
+    SELECT
+            s.prefetch,
+            s.skip_hit,
+            s.skip_new,
+            s.skip_fpw,
+            s.skip_seq,
+            s.distance,
+            s.queue_depth
+     FROM pg_stat_get_wal_prefetcher() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index e3da7d3625..34f3017871 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -169,7 +169,7 @@ StartupDecodingContext(List *output_plugin_options,
 
 	ctx->slot = slot;
 
-	ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, ctx);
+	ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, NULL);
 	if (!ctx->reader)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..5ca98b8886 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/xlogprefetcher.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -124,6 +125,7 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, PredicateLockShmemSize());
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
+		size = add_size(size, XLogPrefetcherShmemSize());
 		size = add_size(size, CLOGShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
@@ -212,6 +214,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetcherShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 464f264d9a..893c9478d9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1240,6 +1240,18 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"wal_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch blocks that have full page images in the WAL."),
+			gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+						 "entirely overwritten, but if the logical page size of the filesystem is "
+						 "larger than PostgreSQL's, this can be beneficial.  This option has no "
+						 "effect unless wal_prefetch_distance is set to a positive number.")
+		},
+		&wal_prefetch_fpw,
+		false,
+		NULL, assign_wal_prefetch_fpw, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2626,6 +2638,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("How many bytes to read ahead in the WAL to prefetch referenced blocks."),
+			gettext_noop("Set to -1 to disable WAL prefetching."),
+			GUC_UNIT_BYTE
+		},
+		&wal_prefetch_distance,
+		-1, -1, INT_MAX,
+		NULL, assign_wal_prefetch_distance, NULL
+	},
+
 	{
 		{"wal_keep_segments", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the number of WAL files held for standby servers."),
@@ -11484,6 +11507,8 @@ assign_effective_io_concurrency(int newval, void *extra)
 {
 #ifdef USE_PREFETCH
 	target_prefetch_pages = *((int *) extra);
+	if (AmStartupProcess())
+		ResetWalPrefetcher();
 #endif							/* USE_PREFETCH */
 }
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033fc20..0a31edfba4 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -111,6 +111,8 @@ extern int	wal_keep_segments;
 extern int	XLOGbuffers;
 extern int	XLogArchiveTimeout;
 extern int	wal_retrieve_retry_interval;
+extern int	wal_prefetch_distance;
+extern bool wal_prefetch_fpw;
 extern char *XLogArchiveCommand;
 extern bool EnableHotStandby;
 extern bool fullPageWrites;
@@ -319,6 +321,8 @@ extern void SetWalWriterSleeping(bool sleeping);
 
 extern void XLogRequestWalReceiverReply(void);
 
+extern void ResetWalPrefetcher(void);
+
 extern void assign_max_wal_size(int newval, void *extra);
 extern void assign_checkpoint_completion_target(double newval, void *extra);
 
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
new file mode 100644
index 0000000000..585f5564a3
--- /dev/null
+++ b/src/include/access/xlogprefetcher.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.h
+ *		Declarations for the XLog prefetching facility
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetcher.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCHER_H
+#define XLOGPREFETCHER_H
+
+#include "access/xlogdefs.h"
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch, XLogRecPtr replaying_lsn);
+
+extern size_t XLogPrefetcherShmemSize(void);
+extern void XLogPrefetcherShmemInit(void);
+
+#endif
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5181a077d9..1c8e67d74a 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,6 +47,26 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
 
+/*
+ * A pointer to an XLogReadLocalOptions struct can be supplied as the private
+ * data for an xlog reader, causing read_local_xlog_page to modify its
+ * behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+	/* Don't block waiting for new WAL to arrive. */
+	bool		nowait;
+
+	/* How far to read. */
+	enum {
+		XLRO_WALRCV_WRITTEN,
+		XLRO_LSN
+	} read_upto_policy;
+
+	/* If read_upto_policy is XLRO_LSN, the LSN. */
+	XLogRecPtr lsn;
+} XLogReadLocalOptions;
+
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
 								 XLogRecPtr targetRecPtr, char *cur_page);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 07a86c7b7b..0bd16c1b77 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6082,6 +6082,14 @@
   prorettype => 'bool', proargtypes => '',
   prosrc => 'pg_is_wal_replay_paused' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_wal_prefetcher', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{int8,int8,int8,int8,int8,int4,int4}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth}',
+  prosrc => 'pg_stat_get_wal_prefetcher' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 5d7a796ba0..6e91c33f3d 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -159,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
  */
 #define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
 
+/*
+ * When you try to prefetch a buffer, there are three possibilities: it's
+ * already cached in our buffer pool, it's not cached but we can tell the
+ * kernel that we'll be loading it soon, or the relation file doesn't exist.
+ */
 typedef enum PrefetchBufferResult
 {
 	PREFETCH_BUFFER_HIT,
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ce93ace76c..903b0ec02b 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -438,5 +438,7 @@ extern void assign_search_path(const char *newval, void *extra);
 /* in access/transam/xlog.c */
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+extern void assign_wal_prefetch_distance(int new_value, void *extra);
+extern void assign_wal_prefetch_fpw(bool new_value, void *extra);
 
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 634f8256f7..62b1e0e113 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2087,6 +2087,14 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((pg_stat_all_tables.schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (pg_stat_all_tables.schemaname !~ '^pg_toast'::text));
+pg_stat_wal_prefetcher| SELECT s.prefetch,
+    s.skip_hit,
+    s.skip_new,
+    s.skip_fpw,
+    s.skip_seq,
+    s.distance,
+    s.queue_depth
+   FROM pg_stat_get_wal_prefetcher() s(prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,
-- 
2.20.1

#6Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Thomas Munro (#5)
Re: WIP: WAL prefetch (another approach)

I tried my luck at a quick read of this patchset.
I didn't manage to go over 0005 though, but I agree with Tomas that
having this be configurable in terms of bytes of WAL is not very
user-friendly.

First of all, let me join the crowd chanting that this is badly needed;
I don't need to repeat what Chittenden's talk showed. "WAL recovery is
now 10x-20x faster" would be a good item for the pg13 press release,
I think.

From a61b4e00c42ace5db1608e02165f89094bf86391 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 3 Dec 2019 17:13:40 +1300
Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.

Previously a Relation was required, but it's annoying to have
to create a "fake" one in recovery.

LGTM.

It's a pity to have to include smgr.h in bufmgr.h. Maybe it'd be sane
to use a forward struct declaration and "struct SMgrRelation *" instead.

From acbff1444d0acce71b0218ce083df03992af1581 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:10:17 +1300
Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().

The new name better reflects the fact that the value it returns
is updated only when received data has been flushed to disk.

An upcoming patch will make use of the latest data that was
written without waiting for it to be flushed, so use more
precise function names.

Ugh. (Not for your patch -- I mean for the existing naming convention).
It would make sense to rename WalRcvData->receivedUpto in this commit,
maybe to flushedUpto.

From d7fa7d82c5f68d0cccf441ce9e8dfa40f64d3e0d Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).

A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk. To do that,
it needs to be able to see the write pointer advancing in shared
memory.

The function formerly bearing this name was recently renamed to
WalRcvGetFlushRecPtr(), which better described what it does.

+ pg_atomic_init_u64(&WalRcv->writtenUpto, 0);

Umm, how come you're using WalRcv here instead of walrcv? I would flag
this patch for sneaky nastiness if this weren't mostly harmless. (I
think we should do away with local walrcv pointers altogether. But that
should be a separate patch, I think.)

+ pg_atomic_uint64 writtenUpto;

Are we already using uint64s for XLogRecPtrs anywhere? This seems
novel. Given this, I wonder if the comment near "mutex" needs an
update ("except where atomics are used"), or perhaps just move the
member to after the line with mutex.

I didn't understand the purpose of inc_counter() as written. Why not
just pg_atomic_fetch_add_u64(..., 1)?

/*
*	smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
+ *
+ *		In recovery only, this can return false to indicate that a file
+ *		doesn't	exist (presumably it has been dropped by a later WAL
+ *		record).
*/
-void
+bool
smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)

I think this API, where the behavior of a low-level module changes
depending on InRecovery, is confusingly crazy. I'd rather have the
callers specifying whether they're OK with a file that doesn't exist.

+extern PrefetchBufferResult SharedPrefetchBuffer(SMgrRelation smgr_reln,
+												 ForkNumber forkNum,
+												 BlockNumber blockNum);
extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
BlockNumber blockNum);

Umm, I would keep the return values of both these functions in sync.
It's really strange that PrefetchBuffer does not return
PrefetchBufferResult, don't you think?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7Thomas Munro
thomas.munro@gmail.com
In reply to: Alvaro Herrera (#6)
5 attachment(s)
Re: WIP: WAL prefetch (another approach)

Hi Alvaro,

On Sat, Mar 14, 2020 at 10:15 AM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

I tried my luck at a quick read of this patchset.

Thanks! Here's a new patch set, and some inline responses to your feedback:

I didn't manage to go over 0005 though, but I agree with Tomas that
having this be configurable in terms of bytes of WAL is not very
user-friendly.

The primary control is now maintenance_io_concurrency, which is
basically what Tomas suggested.

The byte-based control is just a cap to prevent it reading a crazy
distance ahead, that also functions as the on/off switch for the
feature. In this version I've added "max" to the name, to make that
clearer.

First of all, let me join the crowd chanting that this is badly needed;
I don't need to repeat what Chittenden's talk showed. "WAL recovery is
now 10x-20x times faster" would be a good item for pg13 press release,
I think.

We should be careful about over-promising here: Sean basically had a
best case scenario for this type of techology, partly due to his 16kB
filesystem blocks. Common results may be a lot more pedestrian,
though it could get more interesting if we figure out how to get rid
of FPWs...

From a61b4e00c42ace5db1608e02165f89094bf86391 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 3 Dec 2019 17:13:40 +1300
Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.

Previously a Relation was required, but it's annoying to have
to create a "fake" one in recovery.

LGTM.

It's a pity to have to include smgr.h in bufmgr.h. Maybe it'd be sane
to use a forward struct declaration and "struct SMgrRelation *" instead.

OK, done.

While staring at this, I decided that SharedPrefetchBuffer() was a
weird word order, so I changed it to PrefetchSharedBuffer(). Then, by
analogy, I figured I should also change the pre-existing function
LocalPrefetchBuffer() to PrefetchLocalBuffer(). Do you think this is
an improvement?

From acbff1444d0acce71b0218ce083df03992af1581 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:10:17 +1300
Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().

The new name better reflects the fact that the value it returns
is updated only when received data has been flushed to disk.

An upcoming patch will make use of the latest data that was
written without waiting for it to be flushed, so use more
precise function names.

Ugh. (Not for your patch -- I mean for the existing naming convention).
It would make sense to rename WalRcvData->receivedUpto in this commit,
maybe to flushedUpto.

Ok, I renamed that variable and a related one. There are more things
you could rename if you pull on that thread some more, including
pg_stat_wal_receiver's received_lsn column, but I didn't do that in
this patch.

From d7fa7d82c5f68d0cccf441ce9e8dfa40f64d3e0d Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).

A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk. To do that,
it needs to be able to see the write pointer advancing in shared
memory.

The function formerly bearing this name was recently renamed to
WalRcvGetFlushRecPtr(), which better described what it does.

+ pg_atomic_init_u64(&WalRcv->writtenUpto, 0);

Umm, how come you're using WalRcv here instead of walrcv? I would flag
this patch for sneaky nastiness if this weren't mostly harmless. (I
think we should do away with local walrcv pointers altogether. But that
should be a separate patch, I think.)

OK, done.

+ pg_atomic_uint64 writtenUpto;

Are we already using uint64s for XLogRecPtrs anywhere? This seems
novel. Given this, I wonder if the comment near "mutex" needs an
update ("except where atomics are used"), or perhaps just move the
member to after the line with mutex.

Moved.

We use [u]int64 in various places in the replication code. Ideally
I'd have a magic way to say atomic<XLogRecPtr> so I didn't have to
assume that pg_atomic_uint64 is the right atomic integer width and
signedness, but here we are. In dsa.h I made a special typedef for
the atomic version of something else, but that's because the size of
that thing varied depending on the build, whereas our LSNs are of a
fixed width that ought to be en... <trails off>.

I didn't understand the purpose of inc_counter() as written. Why not
just pg_atomic_fetch_add_u64(..., 1)?

I didn't want counters that wrap at ~4 billion, but I did want to be
able to read and write concurrently without tearing. Instructions
like "lock xadd" would provide more guarantees that I don't need,
since only one thread is doing all the writing and there's no ordering
requirement. It's basically just counter++, but some platforms need a
spinlock to perform atomic read and write of 64 bit wide numbers, so
more hoop jumping is required.
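
To illustrate the trade-off, here's a minimal sketch (the first function is
the one in the attached patch; inc_counter_locked is just a made-up name for
the fetch-and-add alternative, shown for contrast only):

#include "port/atomics.h"

/*
 * Single-writer counter: only the startup process increments it, and readers
 * only need a non-torn value, so a plain atomic read followed by a plain
 * atomic write is enough -- no bus-locked read-modify-write, no ordering.
 */
static inline void
inc_counter(pg_atomic_uint64 *counter)
{
	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
}

/*
 * A fetch-and-add would also be correct (and necessary with multiple
 * writers), but typically compiles to "lock xadd" on x86, which is more
 * than this hot loop needs.
 */
static inline void
inc_counter_locked(pg_atomic_uint64 *counter)
{
	pg_atomic_fetch_add_u64(counter, 1);
}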

/*
*   smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
+ *
+ *           In recovery only, this can return false to indicate that a file
+ *           doesn't exist (presumably it has been dropped by a later WAL
+ *           record).
*/
-void
+bool
smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)

I think this API, where the behavior of a low-level module changes
depending on InRecovery, is confusingly crazy. I'd rather have the
callers specifying whether they're OK with a file that doesn't exist.

Hmm. But... md.c has other code like that. It's true that I'm adding
InRecovery awareness to a function that didn't previously have it, but
that's just because we previously had no reason to prefetch stuff in
recovery.
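
For what it's worth, the alternative you're describing would presumably look
something like this hypothetical sketch (missing_ok is an invented parameter,
not something in the attached patches; a real version would let md.c detect
the missing file rather than calling smgrexists() up front):

bool
smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
			 bool missing_ok)
{
	/* Caller says explicitly whether a missing file is acceptable. */
	if (missing_ok && !smgrexists(reln, forknum))
		return false;			/* presumably dropped by a later WAL record */

	smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
	return true;
}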

+extern PrefetchBufferResult SharedPrefetchBuffer(SMgrRelation smgr_reln,
+                                                                                              ForkNumber forkNum,
+                                                                                              BlockNumber blockNum);
extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
BlockNumber blockNum);

Umm, I would keep the return values of both these functions in sync.
It's really strange that PrefetchBuffer does not return
PrefetchBufferResult, don't you think?

Agreed, and changed. I suspect that other users of the main
PrefetchBuffer() call will eventually want that, to do a better job of
keeping the request queue full, for example bitmap heap scan and
(hypothetical) btree scan with prefetch.
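
For example, a future caller might use the return value to budget how many
prefetches it keeps in flight, along these lines (a made-up sketch, not code
from the patch set; everything except PrefetchBuffer() and the result values
is invented):

static void
prefetch_upcoming_blocks(Relation rel, BlockNumber *blocks, int nblocks,
						 int max_in_flight)
{
	int			in_flight = 0;

	for (int i = 0; i < nblocks; ++i)
	{
		if (in_flight >= max_in_flight)
			break;				/* budget used up; try again after some reads complete */

		/* Only a genuine miss consumes part of the I/O budget. */
		if (PrefetchBuffer(rel, MAIN_FORKNUM, blocks[i]) == PREFETCH_BUFFER_MISS)
			in_flight++;
	}
}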

Attachments:

0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRela-v4.patch
From 71641bcfed33c0a89f27b5246734eb4b8196485c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 15:25:55 +1300
Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.

Previously a Relation was required, but it's annoying to have
to create a "fake" one in recovery.  A new function
PrefetchSharedBuffer() is provided that works with SMgrRelation, and
LocalPrefetchBuffer() is renamed to PrefetchLocalBuffer() to fit with
that more natural naming scheme.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 84 ++++++++++++++++-----------
 src/backend/storage/buffer/localbuf.c |  4 +-
 src/include/storage/buf_internals.h   |  2 +-
 src/include/storage/bufmgr.h          |  6 ++
 4 files changed, 59 insertions(+), 37 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e05e2b3456..d30aed6fd9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -466,6 +466,53 @@ static int	ckpt_buforder_comparator(const void *pa, const void *pb);
 static int	ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
 
 
+/*
+ * Implementation of PrefetchBuffer() for shared buffers.
+ */
+void
+PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+					 ForkNumber forkNum,
+					 BlockNumber blockNum)
+{
+#ifdef USE_PREFETCH
+	BufferTag	newTag;		/* identity of requested block */
+	uint32		newHash;	/* hash value for newTag */
+	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
+	int			buf_id;
+
+	Assert(BlockNumberIsValid(blockNum));
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(newTag, smgr_reln->smgr_rnode.node,
+				   forkNum, blockNum);
+
+	/* determine its hash code and partition lock ID */
+	newHash = BufTableHashCode(&newTag);
+	newPartitionLock = BufMappingPartitionLock(newHash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(newPartitionLock, LW_SHARED);
+	buf_id = BufTableLookup(&newTag, newHash);
+	LWLockRelease(newPartitionLock);
+
+	/* If not in buffers, initiate prefetch */
+	if (buf_id < 0)
+		smgrprefetch(smgr_reln, forkNum, blockNum);
+
+	/*
+	 * If the block *is* in buffers, we do nothing.  This is not really ideal:
+	 * the block might be just about to be evicted, which would be stupid
+	 * since we know we are going to need it soon.  But the only easy answer
+	 * is to bump the usage_count, which does not seem like a great solution:
+	 * when the caller does ultimately touch the block, usage_count would get
+	 * bumped again, resulting in too much favoritism for blocks that are
+	 * involved in a prefetch sequence. A real fix would involve some
+	 * additional per-buffer state, and it's not clear that there's enough of
+	 * a problem to justify that.
+	 */
+#endif							/* USE_PREFETCH */
+}
+
 /*
  * PrefetchBuffer -- initiate asynchronous read of a block of a relation
  *
@@ -493,43 +540,12 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 					 errmsg("cannot access temporary tables of other sessions")));
 
 		/* pass it off to localbuf.c */
-		LocalPrefetchBuffer(reln->rd_smgr, forkNum, blockNum);
+		PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
 	}
 	else
 	{
-		BufferTag	newTag;		/* identity of requested block */
-		uint32		newHash;	/* hash value for newTag */
-		LWLock	   *newPartitionLock;	/* buffer partition lock for it */
-		int			buf_id;
-
-		/* create a tag so we can lookup the buffer */
-		INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
-					   forkNum, blockNum);
-
-		/* determine its hash code and partition lock ID */
-		newHash = BufTableHashCode(&newTag);
-		newPartitionLock = BufMappingPartitionLock(newHash);
-
-		/* see if the block is in the buffer pool already */
-		LWLockAcquire(newPartitionLock, LW_SHARED);
-		buf_id = BufTableLookup(&newTag, newHash);
-		LWLockRelease(newPartitionLock);
-
-		/* If not in buffers, initiate prefetch */
-		if (buf_id < 0)
-			smgrprefetch(reln->rd_smgr, forkNum, blockNum);
-
-		/*
-		 * If the block *is* in buffers, we do nothing.  This is not really
-		 * ideal: the block might be just about to be evicted, which would be
-		 * stupid since we know we are going to need it soon.  But the only
-		 * easy answer is to bump the usage_count, which does not seem like a
-		 * great solution: when the caller does ultimately touch the block,
-		 * usage_count would get bumped again, resulting in too much
-		 * favoritism for blocks that are involved in a prefetch sequence. A
-		 * real fix would involve some additional per-buffer state, and it's
-		 * not clear that there's enough of a problem to justify that.
-		 */
+		/* pass it to the shared buffer version */
+		PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
 	}
 #endif							/* USE_PREFETCH */
 }
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index cac08e1b1a..b528bc9553 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -54,14 +54,14 @@ static Block GetLocalBufferStorage(void);
 
 
 /*
- * LocalPrefetchBuffer -
+ * PrefetchLocalBuffer -
  *	  initiate asynchronous read of a block of a relation
  *
  * Do PrefetchBuffer's work for temporary relations.
  * No-op if prefetching isn't compiled in.
  */
 void
-LocalPrefetchBuffer(SMgrRelation smgr, ForkNumber forkNum,
+PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 					BlockNumber blockNum)
 {
 #ifdef USE_PREFETCH
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index bf3b8ad340..166fe334c7 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -327,7 +327,7 @@ extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
 extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
 
 /* localbuf.c */
-extern void LocalPrefetchBuffer(SMgrRelation smgr, ForkNumber forkNum,
+extern void PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 								BlockNumber blockNum);
 extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
 									BlockNumber blockNum, bool *foundPtr);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d2a5b52f6e..e00dd3ffb7 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -49,6 +49,9 @@ typedef enum
 /* forward declared, to avoid having to expose buf_internals.h here */
 struct WritebackContext;
 
+/* forward declared, to avoid including smgr.h */
+struct SMgrRelationData;
+
 /* in globals.c ... this duplicates miscadmin.h */
 extern PGDLLIMPORT int NBuffers;
 
@@ -159,6 +162,9 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 /*
  * prototypes for functions in bufmgr.c
  */
+extern void PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+								 ForkNumber forkNum,
+								 BlockNumber blockNum);
 extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
 						   BlockNumber blockNum);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
-- 
2.20.1

0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecP-v4.patch
From a3a22ea59e9a9ac1d03dd3f22708e32a796785af Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 15:28:08 +1300
Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().

The new name better reflects the fact that the value it returns is
updated only when received data has been flushed to disk.  Also rename
a couple of variables relating to this value.

An upcoming patch will make use of the latest data that was written
without waiting for it to be flushed, so let's use more precise function
names.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/access/transam/xlog.c          | 20 ++++++++++----------
 src/backend/access/transam/xlogfuncs.c     |  2 +-
 src/backend/replication/README             |  2 +-
 src/backend/replication/walreceiver.c      | 10 +++++-----
 src/backend/replication/walreceiverfuncs.c | 12 ++++++------
 src/backend/replication/walsender.c        |  2 +-
 src/include/replication/walreceiver.h      |  8 ++++----
 7 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4fa446ffa4..fd30e27425 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -205,8 +205,8 @@ HotStandbyState standbyState = STANDBY_DISABLED;
 
 static XLogRecPtr LastRec;
 
-/* Local copy of WalRcv->receivedUpto */
-static XLogRecPtr receivedUpto = 0;
+/* Local copy of WalRcv->flushedUpto */
+static XLogRecPtr flushedUpto = 0;
 static TimeLineID receiveTLI = 0;
 
 /*
@@ -9288,7 +9288,7 @@ CreateRestartPoint(int flags)
 	 * Retreat _logSegNo using the current end of xlog replayed or received,
 	 * whichever is later.
 	 */
-	receivePtr = GetWalRcvWriteRecPtr(NULL, NULL);
+	receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
 	replayPtr = GetXLogReplayRecPtr(&replayTLI);
 	endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
 	KeepLogSeg(endptr, &_logSegNo);
@@ -11682,7 +11682,7 @@ retry:
 	/* See if we need to retrieve more data */
 	if (readFile < 0 ||
 		(readSource == XLOG_FROM_STREAM &&
-		 receivedUpto < targetPagePtr + reqLen))
+		 flushedUpto < targetPagePtr + reqLen))
 	{
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
 										 private->randAccess,
@@ -11713,10 +11713,10 @@ retry:
 	 */
 	if (readSource == XLOG_FROM_STREAM)
 	{
-		if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
+		if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
 			readLen = XLOG_BLCKSZ;
 		else
-			readLen = XLogSegmentOffset(receivedUpto, wal_segment_size) -
+			readLen = XLogSegmentOffset(flushedUpto, wal_segment_size) -
 				targetPageOff;
 	}
 	else
@@ -11952,7 +11952,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						curFileTLI = tli;
 						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
 											 PrimarySlotName);
-						receivedUpto = 0;
+						flushedUpto = 0;
 					}
 
 					/*
@@ -12132,14 +12132,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * XLogReceiptTime will not advance, so the grace time
 					 * allotted to conflicting queries will decrease.
 					 */
-					if (RecPtr < receivedUpto)
+					if (RecPtr < flushedUpto)
 						havedata = true;
 					else
 					{
 						XLogRecPtr	latestChunkStart;
 
-						receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI);
-						if (RecPtr < receivedUpto && receiveTLI == curFileTLI)
+						flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
+						if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
 						{
 							havedata = true;
 							if (latestChunkStart <= RecPtr)
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 20316539b6..e075c1c71b 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -398,7 +398,7 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
 {
 	XLogRecPtr	recptr;
 
-	recptr = GetWalRcvWriteRecPtr(NULL, NULL);
+	recptr = GetWalRcvFlushRecPtr(NULL, NULL);
 
 	if (recptr == 0)
 		PG_RETURN_NULL();
diff --git a/src/backend/replication/README b/src/backend/replication/README
index 0cbb990613..8ccdd86e74 100644
--- a/src/backend/replication/README
+++ b/src/backend/replication/README
@@ -54,7 +54,7 @@ and WalRcvData->slotname, and initializes the starting point in
 WalRcvData->receiveStart.
 
 As walreceiver receives WAL from the master server, and writes and flushes
-it to disk (in pg_wal), it updates WalRcvData->receivedUpto and signals
+it to disk (in pg_wal), it updates WalRcvData->flushedUpto and signals
 the startup process to know how far WAL replay can advance.
 
 Walreceiver sends information about replication progress to the master server
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 25e0333c9e..0bdd0c3074 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -12,7 +12,7 @@
  * in the primary server), and then keeps receiving XLOG records and
  * writing them to the disk as long as the connection is alive. As XLOG
  * records are received and flushed to disk, it updates the
- * WalRcv->receivedUpto variable in shared memory, to inform the startup
+ * WalRcv->flushedUpto variable in shared memory, to inform the startup
  * process of how far it can proceed with XLOG replay.
  *
  * If the primary server ends streaming, but doesn't disconnect, walreceiver
@@ -1006,10 +1006,10 @@ XLogWalRcvFlush(bool dying)
 
 		/* Update shared-memory status */
 		SpinLockAcquire(&walrcv->mutex);
-		if (walrcv->receivedUpto < LogstreamResult.Flush)
+		if (walrcv->flushedUpto < LogstreamResult.Flush)
 		{
-			walrcv->latestChunkStart = walrcv->receivedUpto;
-			walrcv->receivedUpto = LogstreamResult.Flush;
+			walrcv->latestChunkStart = walrcv->flushedUpto;
+			walrcv->flushedUpto = LogstreamResult.Flush;
 			walrcv->receivedTLI = ThisTimeLineID;
 		}
 		SpinLockRelease(&walrcv->mutex);
@@ -1362,7 +1362,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	state = WalRcv->walRcvState;
 	receive_start_lsn = WalRcv->receiveStart;
 	receive_start_tli = WalRcv->receiveStartTLI;
-	received_lsn = WalRcv->receivedUpto;
+	received_lsn = WalRcv->flushedUpto;
 	received_tli = WalRcv->receivedTLI;
 	last_send_time = WalRcv->lastMsgSendTime;
 	last_receipt_time = WalRcv->lastMsgReceiptTime;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 89c903e45a..31025f97e3 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -264,11 +264,11 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 
 	/*
 	 * If this is the first startup of walreceiver (on this timeline),
-	 * initialize receivedUpto and latestChunkStart to the starting point.
+	 * initialize flushedUpto and latestChunkStart to the starting point.
 	 */
 	if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
 	{
-		walrcv->receivedUpto = recptr;
+		walrcv->flushedUpto = recptr;
 		walrcv->receivedTLI = tli;
 		walrcv->latestChunkStart = recptr;
 	}
@@ -286,7 +286,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 }
 
 /*
- * Returns the last+1 byte position that walreceiver has written.
+ * Returns the last+1 byte position that walreceiver has flushed.
  *
  * Optionally, returns the previous chunk start, that is the first byte
  * written in the most recent walreceiver flush cycle.  Callers not
@@ -294,13 +294,13 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
  * receiveTLI.
  */
 XLogRecPtr
-GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 {
 	WalRcvData *walrcv = WalRcv;
 	XLogRecPtr	recptr;
 
 	SpinLockAcquire(&walrcv->mutex);
-	recptr = walrcv->receivedUpto;
+	recptr = walrcv->flushedUpto;
 	if (latestChunkStart)
 		*latestChunkStart = walrcv->latestChunkStart;
 	if (receiveTLI)
@@ -327,7 +327,7 @@ GetReplicationApplyDelay(void)
 	TimestampTz chunkReplayStartTime;
 
 	SpinLockAcquire(&walrcv->mutex);
-	receivePtr = walrcv->receivedUpto;
+	receivePtr = walrcv->flushedUpto;
 	SpinLockRelease(&walrcv->mutex);
 
 	replayPtr = GetXLogReplayRecPtr(NULL);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3f74bc8493..658e5280fd 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2914,7 +2914,7 @@ GetStandbyFlushRecPtr(void)
 	 * has streamed, but hasn't been replayed yet.
 	 */
 
-	receivePtr = GetWalRcvWriteRecPtr(NULL, &receiveTLI);
+	receivePtr = GetWalRcvFlushRecPtr(NULL, &receiveTLI);
 	replayPtr = GetXLogReplayRecPtr(&replayTLI);
 
 	ThisTimeLineID = replayTLI;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6548..9ed71139ce 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -74,19 +74,19 @@ typedef struct
 	TimeLineID	receiveStartTLI;
 
 	/*
-	 * receivedUpto-1 is the last byte position that has already been
+	 * flushedUpto-1 is the last byte position that has already been
 	 * received, and receivedTLI is the timeline it came from.  At the first
 	 * startup of walreceiver, these are set to receiveStart and
 	 * receiveStartTLI. After that, walreceiver updates these whenever it
 	 * flushes the received WAL to disk.
 	 */
-	XLogRecPtr	receivedUpto;
+	XLogRecPtr	flushedUpto;
 	TimeLineID	receivedTLI;
 
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
 	 * of received WAL.  It's actually the same as the previous value of
-	 * receivedUpto before the last flush to disk.  Startup process can use
+	 * flushedUpto before the last flush to disk.  Startup process can use
 	 * this to detect whether it's keeping up or not.
 	 */
 	XLogRecPtr	latestChunkStart;
@@ -322,7 +322,7 @@ extern bool WalRcvStreaming(void);
 extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname);
-extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.20.1
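
To make the renamed value's semantics concrete, here is a stand-alone sketch (not part of the patch) of the flush-side update that flushedUpto and latestChunkStart model, using a pthread mutex where the real code uses a spinlock; all the names below are invented for illustration:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative stand-ins for the WalRcvData fields. */
    static pthread_mutex_t walrcv_mutex = PTHREAD_MUTEX_INITIALIZER;
    static uint64_t flushed_upto;        /* last+1 byte known to be on disk */
    static uint64_t latest_chunk_start;  /* start of the most recent flush batch */

    /* Called after fsync of newly received WAL, cf. XLogWalRcvFlush(). */
    static void
    note_wal_flushed(uint64_t flush_lsn)
    {
        pthread_mutex_lock(&walrcv_mutex);
        if (flushed_upto < flush_lsn)
        {
            latest_chunk_start = flushed_upto;
            flushed_upto = flush_lsn;
        }
        pthread_mutex_unlock(&walrcv_mutex);
    }

    int
    main(void)
    {
        note_wal_flushed(0x2000);
        note_wal_flushed(0x3000);
        printf("flushed to %llX, last chunk started at %llX\n",
               (unsigned long long) flushed_upto,
               (unsigned long long) latest_chunk_start);
        return 0;
    }

The point of the rename is just that replay must not trust anything past this flushed position; the next patch adds a separate, more optimistic pointer for read-ahead.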

0003-Add-WalRcvGetWriteRecPtr-new-definition-v4.patchtext/x-patch; charset=US-ASCII; name=0003-Add-WalRcvGetWriteRecPtr-new-definition-v4.patchDownload
From 6ef218f60cab62ecbd5ad120cf535cb4e5045f45 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH 3/5] Add GetWalRcvWriteRecPtr() (new definition).

A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk.  To do that, it
needs to be able to see the write pointer advancing in shared memory.

The function that formerly bore this name was recently renamed to
GetWalRcvFlushRecPtr(), which better describes what it does.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/replication/walreceiver.c      |  5 +++++
 src/backend/replication/walreceiverfuncs.c | 12 ++++++++++++
 src/include/replication/walreceiver.h      | 10 ++++++++++
 3 files changed, 27 insertions(+)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 0bdd0c3074..e250f5583c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -245,6 +245,8 @@ WalReceiverMain(void)
 
 	SpinLockRelease(&walrcv->mutex);
 
+	pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
+
 	/* Arrange to clean up at walreceiver exit */
 	on_shmem_exit(WalRcvDie, 0);
 
@@ -985,6 +987,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 		LogstreamResult.Write = recptr;
 	}
+
+	/* Update shared-memory status */
+	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 }
 
 /*
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 31025f97e3..96b44e2c88 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -310,6 +310,18 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 	return recptr;
 }
 
+/*
+ * Returns the last+1 byte position that walreceiver has written.
+ * This returns a recently written value without taking a lock.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtr(void)
+{
+	WalRcvData *walrcv = WalRcv;
+
+	return pg_atomic_read_u64(&walrcv->writtenUpto);
+}
+
 /*
  * Returns the replication apply delay in ms or -1
  * if the apply delay info is not available
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 9ed71139ce..914e6e3d44 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -16,6 +16,7 @@
 #include "access/xlogdefs.h"
 #include "getaddrinfo.h"		/* for NI_MAXHOST */
 #include "pgtime.h"
+#include "port/atomics.h"
 #include "replication/logicalproto.h"
 #include "replication/walsender.h"
 #include "storage/latch.h"
@@ -142,6 +143,14 @@ typedef struct
 
 	slock_t		mutex;			/* locks shared variables shown above */
 
+	/*
+	 * Like flushedUpto, but advanced after writing and before flushing,
+	 * without the need to acquire the spin lock.  Data can be read by another
+	 * process up to this point, but shouldn't be used for data integrity
+	 * purposes.
+	 */
+	pg_atomic_uint64 writtenUpto;
+
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.  But we do need atomic fetch and
@@ -323,6 +332,7 @@ extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname);
 extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(void);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.20.1
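
As a stand-alone illustration (again, not part of the patch) of the lock-free write pointer this adds, the sketch below uses C11 atomics in place of the pg_atomic_* wrappers; the acquire/release ordering is a simplification of mine and the names are invented:

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative stand-in for WalRcvData->writtenUpto. */
    static _Atomic uint64_t written_upto;

    /* Writer side: advance the position after each write of received WAL. */
    static void
    note_wal_written(uint64_t end_of_write)
    {
        atomic_store_explicit(&written_upto, end_of_write, memory_order_release);
    }

    /* Reader side (the prefetcher): how far is it safe to read ahead? */
    static uint64_t
    get_written_upto(void)
    {
        return atomic_load_explicit(&written_upto, memory_order_acquire);
    }

    int
    main(void)
    {
        note_wal_written(0x1000);
        printf("may read ahead up to %llX\n",
               (unsigned long long) get_written_upto());
        return 0;
    }

Unlike the flushed position, this value is only a hint for read-ahead and, as the comment in walreceiver.h says, shouldn't be used for data integrity purposes.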

0004-Allow-PrefetchBuffer-to-report-what-happened-v4.patchtext/x-patch; charset=US-ASCII; name=0004-Allow-PrefetchBuffer-to-report-what-happened-v4.patchDownload
From d60e6f15180a40b117b3fc9b330967e52a5b6485 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 17:26:41 +1300
Subject: [PATCH 4/5] Allow PrefetchBuffer() to report what happened.

Report whether a prefetch was actually initiated, so that callers can
limit the number of concurrent I/Os they try to issue, without counting
the prefetch calls that did nothing because the page was already in our
buffers.

Also report when a relation's backing file is missing, to prepare for
use during recovery.  This will be used to handle cases of relations
that are referenced in the WAL but have been unlinked already due to
actions covered by WAL records that haven't been replayed yet, after a
crash.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 15 ++++++++++-----
 src/backend/storage/buffer/localbuf.c |  7 +++++--
 src/backend/storage/smgr/md.c         |  9 +++++++--
 src/backend/storage/smgr/smgr.c       | 10 +++++++---
 src/include/storage/buf_internals.h   |  5 +++--
 src/include/storage/bufmgr.h          | 17 ++++++++++++-----
 src/include/storage/md.h              |  2 +-
 src/include/storage/smgr.h            |  2 +-
 8 files changed, 46 insertions(+), 21 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d30aed6fd9..b13e05cce8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -469,7 +469,7 @@ static int	ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
 /*
  * Implementation of PrefetchBuffer() for shared buffers.
  */
-void
+PrefetchBufferResult
 PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
 					 ForkNumber forkNum,
 					 BlockNumber blockNum)
@@ -497,7 +497,11 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
 
 	/* If not in buffers, initiate prefetch */
 	if (buf_id < 0)
-		smgrprefetch(smgr_reln, forkNum, blockNum);
+	{
+		if (!smgrprefetch(smgr_reln, forkNum, blockNum))
+			return PREFETCH_BUFFER_NOREL;
+		return PREFETCH_BUFFER_MISS;
+	}
 
 	/*
 	 * If the block *is* in buffers, we do nothing.  This is not really ideal:
@@ -511,6 +515,7 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
 	 * a problem to justify that.
 	 */
 #endif							/* USE_PREFETCH */
+	return PREFETCH_BUFFER_HIT;
 }
 
 /*
@@ -521,7 +526,7 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
  * block will not be delayed by the I/O.  Prefetching is optional.
  * No-op if prefetching isn't compiled in.
  */
-void
+PrefetchBufferResult
 PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 {
 #ifdef USE_PREFETCH
@@ -540,12 +545,12 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 					 errmsg("cannot access temporary tables of other sessions")));
 
 		/* pass it off to localbuf.c */
-		PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
+		return PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
 	}
 	else
 	{
 		/* pass it to the shared buffer version */
-		PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
+		return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
 	}
 #endif							/* USE_PREFETCH */
 }
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index b528bc9553..c728986e12 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -60,7 +60,7 @@ static Block GetLocalBufferStorage(void);
  * Do PrefetchBuffer's work for temporary relations.
  * No-op if prefetching isn't compiled in.
  */
-void
+PrefetchBufferResult
 PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 					BlockNumber blockNum)
 {
@@ -81,11 +81,14 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 	if (hresult)
 	{
 		/* Yes, so nothing to do */
-		return;
+		return PREFETCH_BUFFER_HIT;
 	}
 
 	/* Not in buffers, so initiate prefetch */
 	smgrprefetch(smgr, forkNum, blockNum);
+	return PREFETCH_BUFFER_MISS;
+#else
+	return PREFETCH_BUFFER_HIT;
 #endif							/* USE_PREFETCH */
 }
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index c5b771c531..ba12fc2077 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -525,14 +525,17 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 /*
  *	mdprefetch() -- Initiate asynchronous read of the specified block of a relation
  */
-void
+bool
 mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 {
 #ifdef USE_PREFETCH
 	off_t		seekpos;
 	MdfdVec    *v;
 
-	v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+	v = _mdfd_getseg(reln, forknum, blocknum, false,
+					 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
+	if (v == NULL)
+		return false;
 
 	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
@@ -540,6 +543,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 
 	(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
 #endif							/* USE_PREFETCH */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..c39dd533e6 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -49,7 +49,7 @@ typedef struct f_smgr
 								bool isRedo);
 	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
 								BlockNumber blocknum, char *buffer, bool skipFsync);
-	void		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
+	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber blocknum);
 	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
 							  BlockNumber blocknum, char *buffer);
@@ -489,11 +489,15 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 /*
  *	smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
+ *
+ *		In recovery only, this can return false to indicate that a file
+ *		doesn't exist (presumably it has been dropped by a later WAL
+ *		record).
  */
-void
+bool
 smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 {
-	smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
+	return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
 }
 
 /*
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 166fe334c7..e57f84ee9c 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -327,8 +327,9 @@ extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
 extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
 
 /* localbuf.c */
-extern void PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
-								BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
+												ForkNumber forkNum,
+												BlockNumber blockNum);
 extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
 									BlockNumber blockNum, bool *foundPtr);
 extern void MarkLocalBufferDirty(Buffer buffer);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index e00dd3ffb7..1210d1e7e8 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -159,14 +159,21 @@ extern PGDLLIMPORT int32 *LocalRefCount;
  */
 #define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
 
+typedef enum PrefetchBufferResult
+{
+	PREFETCH_BUFFER_HIT,
+	PREFETCH_BUFFER_MISS,
+	PREFETCH_BUFFER_NOREL
+} PrefetchBufferResult;
+
 /*
  * prototypes for functions in bufmgr.c
  */
-extern void PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
-								 ForkNumber forkNum,
-								 BlockNumber blockNum);
-extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
-						   BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+												 ForkNumber forkNum,
+												 BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchBuffer(Relation reln, ForkNumber forkNum,
+										   BlockNumber blockNum);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
 extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BlockNumber blockNum, ReadBufferMode mode,
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index ec7630ce3b..07fd1bb7d0 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -28,7 +28,7 @@ extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
 extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
 extern void mdextend(SMgrRelation reln, ForkNumber forknum,
 					 BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum);
 extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 				   char *buffer);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 243822137c..dc740443e2 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,7 +92,7 @@ extern void smgrdounlink(SMgrRelation reln, bool isRedo);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum);
 extern void smgrread(SMgrRelation reln, ForkNumber forknum,
 					 BlockNumber blocknum, char *buffer);
-- 
2.20.1
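
Here's a rough sketch of how a caller can consume the new tri-state result (the enum mirrors the patch; the stub prefetch function and the counters are invented, standing in for PrefetchSharedBuffer() and the recovery-side bookkeeping):

    #include <stdint.h>
    #include <stdio.h>

    /* Mirrors the enum added to bufmgr.h by this patch. */
    typedef enum PrefetchBufferResult
    {
        PREFETCH_BUFFER_HIT,
        PREFETCH_BUFFER_MISS,
        PREFETCH_BUFFER_NOREL
    } PrefetchBufferResult;

    /* Invented stub; the real caller passes an SMgrRelation, fork and block. */
    static PrefetchBufferResult
    prefetch_block(uint32_t blkno)
    {
        return (blkno % 2 == 0) ? PREFETCH_BUFFER_MISS : PREFETCH_BUFFER_HIT;
    }

    int
    main(void)
    {
        uint64_t hit = 0, miss = 0, norel = 0;

        for (uint32_t blkno = 0; blkno < 8; blkno++)
        {
            switch (prefetch_block(blkno))
            {
                case PREFETCH_BUFFER_HIT:
                    hit++;      /* already in shared buffers, nothing to do */
                    break;
                case PREFETCH_BUFFER_MISS:
                    miss++;     /* I/O possibly initiated: charge it against the cap */
                    break;
                case PREFETCH_BUFFER_NOREL:
                    norel++;    /* file missing: stop prefetching this relation */
                    break;
            }
        }
        printf("hit=%llu miss=%llu norel=%llu\n",
               (unsigned long long) hit,
               (unsigned long long) miss,
               (unsigned long long) norel);
        return 0;
    }

The MISS case is the one the last patch uses to limit the number of I/Os it believes are in flight.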

0005-Prefetch-referenced-blocks-during-recovery-v4.patchtext/x-patch; charset=US-ASCII; name=0005-Prefetch-referenced-blocks-during-recovery-v4.patchDownload
From 477ac4a1f280faf189da52e635cea15367a262a8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 2 Mar 2020 15:33:51 +1300
Subject: [PATCH 5/5] Prefetch referenced blocks during recovery.

Introduce a new GUC max_wal_prefetch_distance.  If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks.  The
goal is to avoid I/O stalls and benefit from concurrent I/O.  The number
of concurrent asynchronous reads is capped by the existing
maintenance_io_concurrency GUC.  The feature is disabled by default.

Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                    |  38 ++
 doc/src/sgml/monitoring.sgml                |  69 +++
 doc/src/sgml/wal.sgml                       |  12 +
 src/backend/access/transam/Makefile         |   1 +
 src/backend/access/transam/xlog.c           |  64 ++
 src/backend/access/transam/xlogprefetcher.c | 654 ++++++++++++++++++++
 src/backend/access/transam/xlogutils.c      |  23 +-
 src/backend/catalog/system_views.sql        |  11 +
 src/backend/replication/logical/logical.c   |   2 +-
 src/backend/storage/ipc/ipci.c              |   3 +
 src/backend/utils/misc/guc.c                |  38 +-
 src/include/access/xlog.h                   |   4 +
 src/include/access/xlogprefetcher.h         |  28 +
 src/include/access/xlogutils.h              |  20 +
 src/include/catalog/pg_proc.dat             |   8 +
 src/include/storage/bufmgr.h                |   5 +
 src/include/utils/guc.h                     |   2 +
 src/test/regress/expected/rules.out         |   8 +
 18 files changed, 987 insertions(+), 3 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetcher.c
 create mode 100644 src/include/access/xlogprefetcher.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 672bf6f1ee..8249ec0139 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3102,6 +3102,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-wal-prefetch-distance" xreflabel="max_wal_prefetch_distance">
+      <term><varname>max_wal_prefetch_distance</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_wal_prefetch_distance</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        The maximum distance to look ahead in the WAL during recovery, to find
+        blocks to prefetch.  Prefetching blocks that will soon be needed can
+        reduce I/O wait times.  The number of concurrent prefetches is limited
+        by this setting as well as <xref linkend="guc-maintenance-io-concurrency"/>.
+        If this value is specified without units, it is taken as bytes.
+        The default is -1, meaning that WAL prefetching is disabled.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw">
+      <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks with full page images during recovery.
+        Usually this doesn't help, since such blocks will not be read.  However,
+        on file systems with a block size larger than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a costly
+        read-before-write when the blocks are later written.
+        This setting has no effect unless
+        <xref linkend="guc-max-wal-prefetch-distance"/> is set to a positive number.
+        The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 987580d6df..df4291092b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_wal_prefetcher</structname><indexterm><primary>pg_stat_wal_prefetcher</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-wal-prefetcher-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2192,6 +2199,68 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    connected server.
   </para>
 
+  <table id="pg-stat-wal-prefetcher-view" xreflabel="pg_stat_wal_prefetcher">
+   <title><structname>pg_stat_wal_prefetcher</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-wal-prefetch-fpw"/> was set to <literal>off</literal></entry>
+    </row>
+    <row>
+     <entry><structfield>skip_seq</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+    </row>
+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+   </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_wal_prefetcher</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-max-wal-prefetch-distance"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-max-wal-prefetch-distance"/>,
+   <xref linkend="guc-wal-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="3">
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index bd9fae544c..9e956ad2a1 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -719,6 +719,18 @@
    <acronym>WAL</acronym> call being logged to the server log. This
    option might be replaced by a more general mechanism in the future.
   </para>
+
+  <para>
+   The <xref linkend="guc-max-wal-prefetch-distance"/> parameter can be
+   used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed, in combination with the
+   <xref linkend="guc-maintenance-io-concurrency"/> parameter.  The
+   prefetching mechanism is most likely to be effective on systems
+   with <varname>full_page_writes</varname> set to
+   <literal>off</literal> (where that is safe), and where the working
+   set is larger than RAM.  By default, WAL prefetching is disabled.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..20e044c7c8 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetcher.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fd30e27425..f01a24f577 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -34,6 +34,7 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -105,6 +106,8 @@ int			wal_level = WAL_LEVEL_MINIMAL;
 int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
+int			max_wal_prefetch_distance = -1;
+bool		wal_prefetch_fpw = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -806,6 +809,7 @@ static XLogSource readSource = XLOG_FROM_ANY;
  */
 static XLogSource currentSource = XLOG_FROM_ANY;
 static bool lastSourceFailed = false;
+static bool reset_wal_prefetcher = false;
 
 typedef struct XLogPageReadPrivate
 {
@@ -6213,6 +6217,7 @@ CheckRequiredParameterValues(void)
 	}
 }
 
+
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
@@ -7069,6 +7074,7 @@ StartupXLOG(void)
 		{
 			ErrorContextCallback errcallback;
 			TimestampTz xtime;
+			XLogPrefetcher *prefetcher = NULL;
 
 			InRedo = true;
 
@@ -7076,6 +7082,9 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
 
+			/* the first time through, see if we need to enable prefetching */
+			ResetWalPrefetcher();
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7105,6 +7114,31 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/*
+				 * The first time through, or if any relevant settings or the
+				 * WAL source changes, we'll restart the prefetching machinery
+				 * as appropriate.  This is simpler than trying to handle
+				 * various complicated state changes.
+				 */
+				if (unlikely(reset_wal_prefetcher))
+				{
+					/* If we had one already, destroy it. */
+					if (prefetcher)
+					{
+						XLogPrefetcherFree(prefetcher);
+						prefetcher = NULL;
+					}
+					/* If we want one, create it. */
+					if (max_wal_prefetch_distance > 0)
+							prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr,
+																currentSource == XLOG_FROM_STREAM);
+					reset_wal_prefetcher = false;
+				}
+
+				/* Perform WAL prefetching, if enabled. */
+				if (prefetcher)
+					XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr);
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -7292,6 +7326,8 @@ StartupXLOG(void)
 			/*
 			 * end of main redo apply loop
 			 */
+			if (prefetcher)
+				XLogPrefetcherFree(prefetcher);
 
 			if (reachedRecoveryTarget)
 			{
@@ -10155,6 +10191,24 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 	}
 }
 
+void
+assign_max_wal_prefetch_distance(int new_value, void *extra)
+{
+	/* Reset the WAL prefetcher, because a setting it depends on changed. */
+	max_wal_prefetch_distance = new_value;
+	if (AmStartupProcess())
+		ResetWalPrefetcher();
+}
+
+void
+assign_wal_prefetch_fpw(bool new_value, void *extra)
+{
+	/* Reset the WAL prefetcher, because a setting it depends on changed. */
+	wal_prefetch_fpw = new_value;
+	if (AmStartupProcess())
+		ResetWalPrefetcher();
+}
+
 
 /*
  * Issue appropriate kind of fsync (if any) for an XLOG output file.
@@ -11961,6 +12015,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * and move on to the next state.
 					 */
 					currentSource = XLOG_FROM_STREAM;
+					ResetWalPrefetcher();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -12390,3 +12445,12 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * Schedule a WAL prefetcher reset, on change of relevant settings.
+ */
+void
+ResetWalPrefetcher(void)
+{
+	reset_wal_prefetcher = true;
+}
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
new file mode 100644
index 0000000000..1d0bce692a
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -0,0 +1,654 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ *		Prefetching support for PostgreSQL write-ahead log manager
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetcher.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetcher.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log message
+ * that appears at the end of crash recovery.
+ */
+#define XLOGPREFETCHER_MONITORING_SAMPLE_STEP 32768
+
+/*
+ * Internal state used for book-keeping.
+ */
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	XLogReadLocalOptions options;
+	bool			have_record;
+	bool			shutdown;
+	int				next_block_id;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB		   *filter_table;
+	dlist_head		filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	XLogRecPtr	   *prefetch_queue;
+	int				prefetch_queue_size;
+	int				prefetch_head;
+	int				prefetch_tail;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation	last_reln;
+	RelFileNode		last_rnode;
+	BlockNumber		last_blkno;
+
+	/* Counters used to compute avg_queue_depth and avg_distance. */
+	double			samples;
+	double			queue_depth_sum;
+	double			distance_sum;
+	XLogRecPtr		next_sample_lsn;
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode		rnode;
+	XLogRecPtr		filter_until_replayed;
+	BlockNumber		filter_from_block;
+	dlist_node		link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory just for the benefit of monitoring
+ * functions.
+ */
+typedef struct XLogPrefetcherMonitoringStats
+{
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 skip_hit;	/* Blocks already buffered. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+	pg_atomic_uint64 skip_seq;	/* Sequential/repeat blocks skipped. */
+	int				distance;	/* Number of bytes ahead in the WAL. */
+	int				queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetcherMonitoringStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+
+/*
+ * On modern systems this is really just *counter++.  On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)
+{
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
+static XLogPrefetcherMonitoringStats *MonitoringStats;
+
+size_t
+XLogPrefetcherShmemSize(void)
+{
+	return sizeof(XLogPrefetcherMonitoringStats);
+}
+
+static void
+XLogPrefetcherResetMonitoringStats(void)
+{
+	pg_atomic_init_u64(&MonitoringStats->prefetch, 0);
+	pg_atomic_init_u64(&MonitoringStats->skip_hit, 0);
+	pg_atomic_init_u64(&MonitoringStats->skip_new, 0);
+	pg_atomic_init_u64(&MonitoringStats->skip_fpw, 0);
+	pg_atomic_init_u64(&MonitoringStats->skip_seq, 0);
+	MonitoringStats->distance = -1;
+	MonitoringStats->queue_depth = 0;
+}
+
+void
+XLogPrefetcherShmemInit(void)
+{
+	bool		found;
+
+	MonitoringStats = (XLogPrefetcherMonitoringStats *)
+		ShmemInitStruct("XLogPrefetcherMonitoringStats",
+						sizeof(XLogPrefetcherMonitoringStats),
+						&found);
+	if (!found)
+		XLogPrefetcherResetMonitoringStats();
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
+{
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+	XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher));
+
+	prefetcher->options.nowait = true;
+	if (streaming)
+	{
+		/*
+		 * We're only allowed to read as far as the WAL receiver has written.
+		 * We don't have to wait for it to be flushed, though, as recovery
+		 * does, so that gives us a chance to get a bit further ahead.
+		 */
+		prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+	}
+	else
+	{
+		/* We're allowed to read as far as we can. */
+		prefetcher->options.read_upto_policy = XLRO_LSN;
+		prefetcher->options.lsn = (XLogRecPtr) -1;
+	}
+	prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+											NULL,
+											read_local_xlog_page,
+											&prefetcher->options);
+	prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	/*
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.
+	 */
+	prefetcher->prefetch_queue_size = maintenance_io_concurrency;
+	prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);
+	prefetcher->prefetch_head = prefetcher->prefetch_tail = 0;
+
+	/* Prepare to read at the given LSN. */
+	elog(LOG, "WAL prefetch started at %X/%X",
+		 (uint32) (lsn >> 32), (uint32) lsn);
+	XLogBeginRead(prefetcher->reader, lsn);
+
+	XLogPrefetcherResetMonitoringStats();
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	double		avg_distance = 0;
+	double		avg_queue_depth = 0;
+
+	/* Log final statistics. */
+	if (prefetcher->samples > 0)
+	{
+		avg_distance = prefetcher->distance_sum / prefetcher->samples;
+		avg_queue_depth = prefetcher->queue_depth_sum / prefetcher->samples;
+	}
+	elog(LOG,
+		 "WAL prefetch finished at %X/%X; "
+		 "prefetch = " UINT64_FORMAT ", "
+		 "skip_hit = " UINT64_FORMAT ", "
+		 "skip_new = " UINT64_FORMAT ", "
+		 "skip_fpw = " UINT64_FORMAT ", "
+		 "skip_seq = " UINT64_FORMAT ", "
+		 "avg_distance = %f, "
+		 "avg_queue_depth = %f",
+		 (uint32) (prefetcher->reader->EndRecPtr >> 32),
+		 (uint32) (prefetcher->reader->EndRecPtr),
+		 pg_atomic_read_u64(&MonitoringStats->prefetch),
+		 pg_atomic_read_u64(&MonitoringStats->skip_hit),
+		 pg_atomic_read_u64(&MonitoringStats->skip_new),
+		 pg_atomic_read_u64(&MonitoringStats->skip_fpw),
+		 pg_atomic_read_u64(&MonitoringStats->skip_seq),
+		 avg_distance,
+		 avg_queue_depth);
+	XLogReaderFree(prefetcher->reader);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher->prefetch_queue);
+	pfree(prefetcher);
+
+	XLogPrefetcherResetMonitoringStats();
+}
+
+/*
+ * Read ahead in the WAL, as far as we can within the limits set by the user.
+ * Begin fetching any referenced blocks that are not already in the buffer
+ * pool.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	/* If an error has occurred or we've hit the end of the WAL, do nothing. */
+	if (prefetcher->shutdown)
+		return;
+
+	/*
+	 * Have any in-flight prefetches definitely completed, judging by the LSN
+	 * that is currently being replayed?
+	 */
+	XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+	/*
+	 * Do we already have the maximum permitted number of I/Os running
+	 * (according to the information we have)?  If so, we have to wait for at
+	 * least one to complete, so give up early.
+	 */
+	if (XLogPrefetcherSaturated(prefetcher))
+		return;
+
+	/* Can we drop any filters yet, due to problem records being replayed? */
+	XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+	/* Main prefetch loop. */
+	for (;;)
+	{
+		XLogReaderState *reader = prefetcher->reader;
+		char *error;
+		int64		distance;
+
+		/* If we don't already have a record, then try to read one. */
+		if (!prefetcher->have_record)
+		{
+			if (!XLogReadRecord(reader, &error))
+			{
+				/* If we got an error, log it and give up. */
+				if (error)
+				{
+					elog(LOG, "WAL prefetch error: %s", error);
+					prefetcher->shutdown = true;
+				}
+				/* Otherwise, we'll try again later when more data is here. */
+				return;
+			}
+			prefetcher->have_record = true;
+			prefetcher->next_block_id = 0;
+		}
+
+		/* How far ahead of replay are we now? */
+		distance = prefetcher->reader->ReadRecPtr - replaying_lsn;
+
+		/* Update distance shown in shm. */
+		MonitoringStats->distance = distance;
+
+		/* Sample the averages so we can log them at end of recovery. */
+		if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+		{
+			prefetcher->distance_sum += MonitoringStats->distance;
+			prefetcher->queue_depth_sum += MonitoringStats->queue_depth;
+			prefetcher->samples += 1.0;
+			prefetcher->next_sample_lsn =
+				replaying_lsn + XLOGPREFETCHER_MONITORING_SAMPLE_STEP;
+		}
+
+		/* Are we too far ahead of replay? */
+		if (distance >= max_wal_prefetch_distance)
+			break;
+
+		/*
+		 * If this is a record that creates a new SMGR relation, we'll avoid
+		 * prefetching anything from that rnode until it has been replayed.
+		 */
+		if (replaying_lsn < reader->ReadRecPtr &&
+			XLogRecGetRmid(reader) == RM_SMGR_ID &&
+			(XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+		{
+			xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+			XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+									reader->ReadRecPtr);
+		}
+
+		/*
+		 * Scan the record for block references.  We might already have been
+		 * partway through processing this record when we hit maximum I/O
+		 * concurrency, so start where we left off.
+		 */
+		for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i)
+		{
+			DecodedBkpBlock *block = &reader->blocks[i];
+			SMgrRelation reln;
+
+			/* Ignore everything but the main fork for now. */
+			if (block->forknum != MAIN_FORKNUM)
+				continue;
+
+			/*
+			 * If there is a full page image attached, we won't be reading the
+			 * page, so you might thing we should skip it.  However, if the
+			 * page, so you might think we should skip it.  However, if the
+			 * might still need to perform a read-before-write some time later.
+			 * Therefore, only prefetch if configured to do so.
+			 */
+			if (block->has_image && !wal_prefetch_fpw)
+			{
+				inc_counter(&MonitoringStats->skip_fpw);
+				continue;
+			}
+
+			/*
+			 * If this block will initialize a new page then it's probably an
+			 * extension.  Since it might create a new segment, we can't try
+			 * to prefetch this block until the record has been replayed, or we
+			 * might try to open a file that doesn't exist yet.
+			 */
+			if (block->flags & BKPBLOCK_WILL_INIT)
+			{
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+										reader->ReadRecPtr);
+				inc_counter(&MonitoringStats->skip_new);
+				continue;
+			}
+
+			/* Should we skip this block due to a filter? */
+			if (XLogPrefetcherIsFiltered(prefetcher, block->rnode,
+										 block->blkno))
+			{
+				inc_counter(&MonitoringStats->skip_new);
+				continue;
+			}
+
+			/* Fast path for repeated references to the same relation. */
+			if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+			{
+				/*
+				 * If this is a repeat or sequential access, then skip it.  We
+				 * expect the kernel to detect sequential access on its own
+				 * and do a better job than we could.
+				 */
+				if (block->blkno == prefetcher->last_blkno ||
+					block->blkno == prefetcher->last_blkno + 1)
+				{
+					prefetcher->last_blkno = block->blkno;
+					inc_counter(&MonitoringStats->skip_seq);
+					continue;
+				}
+
+				/* We can avoid calling smgropen(). */
+				reln = prefetcher->last_reln;
+			}
+			else
+			{
+				/* Otherwise we have to open it. */
+				reln = smgropen(block->rnode, InvalidBackendId);
+				prefetcher->last_rnode = block->rnode;
+				prefetcher->last_reln = reln;
+			}
+			prefetcher->last_blkno = block->blkno;
+
+			/* Try to prefetch this block! */
+			switch (PrefetchSharedBuffer(reln, block->forknum, block->blkno))
+			{
+			case PREFETCH_BUFFER_HIT:
+				/* It's already cached, so do nothing. */
+				inc_counter(&MonitoringStats->skip_hit);
+				break;
+			case PREFETCH_BUFFER_MISS:
+				/*
+				 * I/O has possibly been initiated (we don't know whether the
+				 * kernel already had the page cached, so for lack of better
+				 * information we assume that an I/O was started).  Record
+				 * this as an I/O in progress until eventually we replay this
+				 * LSN.
+				 */
+				inc_counter(&MonitoringStats->prefetch);
+				XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+				/*
+				 * If the queue is now full, we'll have to wait before
+				 * processing any more blocks from this record.
+				 */
+				if (XLogPrefetcherSaturated(prefetcher))
+				{
+					prefetcher->next_block_id = i + 1;
+					return;
+				}
+				break;
+			case PREFETCH_BUFFER_NOREL:
+				/*
+				 * The underlying segment file doesn't exist.  Presumably it
+				 * will be unlinked by a later WAL record.  When recovery
+				 * reads this block, it will use the EXTENSION_CREATE_RECOVERY
+				 * flag.  We certainly don't want to do that sort of thing
+				 * while merely prefetching, so let's just ignore references
+				 * to this relation until this record is replayed, and let
+				 * recovery create the dummy file or complain if something is
+				 * wrong.
+				 */
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+										reader->ReadRecPtr);
+				inc_counter(&MonitoringStats->skip_new);
+				break;
+			}
+		}
+
+		/* Advance to the next record. */
+		prefetcher->have_record = false;
+	}
+}
+
+/*
+ * Expose statistics about WAL prefetching.
+ */
+Datum
+pg_stat_get_wal_prefetcher(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_WAL_PREFETCHER_COLS 7
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_WAL_PREFETCHER_COLS];
+	bool		nulls[PG_STAT_GET_WAL_PREFETCHER_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (MonitoringStats->distance < 0)
+	{
+		for (int i = 0; i < PG_STAT_GET_WAL_PREFETCHER_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_WAL_PREFETCHER_COLS; ++i)
+			nulls[i] = false;
+		values[0] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->prefetch));
+		values[1] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_hit));
+		values[2] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_new));
+		values[3] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_fpw));
+		values[4] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_seq));
+		values[5] = Int32GetDatum(MonitoringStats->distance);
+		values[6] = Int32GetDatum(MonitoringStats->queue_depth);
+	}
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the (presumably lower) block
+		 * number there because we don't want to have to track individual
+		 * blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of the
+	 * time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+												   HASH_FIND, NULL);
+
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Insert an LSN into the queue.  The queue must not be full already.  This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+						  XLogRecPtr prefetching_lsn)
+{
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+	prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+	prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+	MonitoringStats->queue_depth++;
+	Assert(MonitoringStats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet?  That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches.  For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+		   prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+	{
+		prefetcher->prefetch_tail++;
+		prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+		MonitoringStats->queue_depth--;
+		Assert(MonitoringStats->queue_depth >= 0);
+	}
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+	return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+		prefetcher->prefetch_tail;
+}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b217ffa52f..fad2acb514 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
@@ -827,6 +828,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	TimeLineID	tli;
 	int			count;
 	WALReadError errinfo;
+	XLogReadLocalOptions *options = (XLogReadLocalOptions *) state->private_data;
 
 	loc = targetPagePtr + reqLen;
 
@@ -841,7 +843,23 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		 * notices recovery finishes, so we only have to maintain it for the
 		 * local process until recovery ends.
 		 */
-		if (!RecoveryInProgress())
+		if (options)
+		{
+			switch (options->read_upto_policy)
+			{
+			case XLRO_WALRCV_WRITTEN:
+				read_upto = GetWalRcvWriteRecPtr();
+				break;
+			case XLRO_LSN:
+				read_upto = options->lsn;
+				break;
+			default:
+				read_upto = 0;
+				elog(ERROR, "unknown read_upto_policy value");
+				break;
+			}
+		}
+		else if (!RecoveryInProgress())
 			read_upto = GetFlushRecPtr();
 		else
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
@@ -879,6 +897,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			if (loc <= read_upto)
 				break;
 
+			if (options && options->nowait)
+				break;
+
 			CHECK_FOR_INTERRUPTS();
 			pg_usleep(1000L);
 		}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b8a3f46912..7b27ac4805 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -811,6 +811,17 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_wal_prefetcher AS
+    SELECT
+            s.prefetch,
+            s.skip_hit,
+            s.skip_new,
+            s.skip_fpw,
+            s.skip_seq,
+            s.distance,
+            s.queue_depth
+     FROM pg_stat_get_wal_prefetcher() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index e3da7d3625..34f3017871 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -169,7 +169,7 @@ StartupDecodingContext(List *output_plugin_options,
 
 	ctx->slot = slot;
 
-	ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, ctx);
+	ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, NULL);
 	if (!ctx->reader)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..5ca98b8886 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/xlogprefetcher.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -124,6 +125,7 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, PredicateLockShmemSize());
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
+		size = add_size(size, XLogPrefetcherShmemSize());
 		size = add_size(size, CLOGShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
@@ -212,6 +214,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetcherShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 68082315ac..a2a9f62160 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -197,6 +197,7 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1241,6 +1242,18 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"wal_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch blocks that have full page images in the WAL"),
+			gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+						 "entirely overwritten, but if the logical page size of the filesystem is "
+						 "larger than PostgreSQL's, this can be beneficial.  This option has no "
+						 "effect unless max_wal_prefetch_distance is set to a positive number.")
+		},
+		&wal_prefetch_fpw,
+		false,
+		NULL, assign_wal_prefetch_fpw, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2627,6 +2640,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"max_wal_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum number of bytes to read ahead in the WAL to prefetch referenced blocks."),
+			gettext_noop("Set to -1 to disable WAL prefetching."),
+			GUC_UNIT_BYTE
+		},
+		&max_wal_prefetch_distance,
+		-1, -1, INT_MAX,
+		NULL, assign_max_wal_prefetch_distance, NULL
+	},
+
 	{
 		{"wal_keep_segments", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the number of WAL files held for standby servers."),
@@ -2900,7 +2924,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -11498,6 +11523,17 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/* Reset the WAL prefetcher, because a setting it depends on changed. */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		ResetWalPrefetcher();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033fc20..82829d7854 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -111,6 +111,8 @@ extern int	wal_keep_segments;
 extern int	XLOGbuffers;
 extern int	XLogArchiveTimeout;
 extern int	wal_retrieve_retry_interval;
+extern int	max_wal_prefetch_distance;
+extern bool wal_prefetch_fpw;
 extern char *XLogArchiveCommand;
 extern bool EnableHotStandby;
 extern bool fullPageWrites;
@@ -319,6 +321,8 @@ extern void SetWalWriterSleeping(bool sleeping);
 
 extern void XLogRequestWalReceiverReply(void);
 
+extern void ResetWalPrefetcher(void);
+
 extern void assign_max_wal_size(int newval, void *extra);
 extern void assign_checkpoint_completion_target(double newval, void *extra);
 
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
new file mode 100644
index 0000000000..585f5564a3
--- /dev/null
+++ b/src/include/access/xlogprefetcher.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.h
+ *		Declarations for the XLog prefetching facility
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetcher.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCHER_H
+#define XLOGPREFETCHER_H
+
+#include "access/xlogdefs.h"
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch, XLogRecPtr replaying_lsn);
+
+extern size_t XLogPrefetcherShmemSize(void);
+extern void XLogPrefetcherShmemInit(void);
+
+#endif
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5181a077d9..1c8e67d74a 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,6 +47,26 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
 
+/*
+ * A pointer to an XLogReadLocalOptions struct can be supplied as the private
+ * data for an xlog reader, causing read_local_xlog_page to modify its
+ * behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+	/* Don't block waiting for new WAL to arrive. */
+	bool		nowait;
+
+	/* How far to read. */
+	enum {
+		XLRO_WALRCV_WRITTEN,
+		XLRO_LSN
+	} read_upto_policy;
+
+	/* If read_upto_policy is XLRO_LSN, the LSN. */
+	XLogRecPtr lsn;
+} XLogReadLocalOptions;
+
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
 								 XLogRecPtr targetRecPtr, char *cur_page);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7fb574f9dc..742741afa1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6082,6 +6082,14 @@
   prorettype => 'bool', proargtypes => '',
   prosrc => 'pg_is_wal_replay_paused' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_wal_prefetcher', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{int8,int8,int8,int8,int8,int4,int4}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth}',
+  prosrc => 'pg_stat_get_wal_prefetcher' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 1210d1e7e8..3ca171adb8 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -159,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
  */
 #define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
 
+/*
+ * When you try to prefetch a buffer, there are three possibilities: it's
+ * already cached in our buffer pool, it's not cached but we can tell the
+ * kernel that we'll be loading it soon, or the relation file doesn't exist.
+ */
 typedef enum PrefetchBufferResult
 {
 	PREFETCH_BUFFER_HIT,
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ce93ace76c..7d076a9743 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -438,5 +438,7 @@ extern void assign_search_path(const char *newval, void *extra);
 /* in access/transam/xlog.c */
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+extern void assign_max_wal_prefetch_distance(int new_value, void *extra);
+extern void assign_wal_prefetch_fpw(bool new_value, void *extra);
 
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c7304611c3..63bbb796fc 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2102,6 +2102,14 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((pg_stat_all_tables.schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (pg_stat_all_tables.schemaname !~ '^pg_toast'::text));
+pg_stat_wal_prefetcher| SELECT s.prefetch,
+    s.skip_hit,
+    s.skip_new,
+    s.skip_fpw,
+    s.skip_seq,
+    s.distance,
+    s.queue_depth
+   FROM pg_stat_get_wal_prefetcher() s(prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,
-- 
2.20.1

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Thomas Munro (#7)
Re: WIP: WAL prefetch (another approach)

On 2020-Mar-17, Thomas Munro wrote:

Hi Thomas

On Sat, Mar 14, 2020 at 10:15 AM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

I didn't manage to go over 0005 though, but I agree with Tomas that
having this be configurable in terms of bytes of WAL is not very
user-friendly.

The primary control is now maintenance_io_concurrency, which is
basically what Tomas suggested.

The byte-based control is just a cap to prevent it reading a crazy
distance ahead, that also functions as the on/off switch for the
feature. In this version I've added "max" to the name, to make that
clearer.
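
For illustration, with this version the relevant settings would be along
these lines (values here are just examples, not recommendations):

    maintenance_io_concurrency = 10      # cap on concurrent prefetch I/Os
    max_wal_prefetch_distance = 256kB    # how far ahead to decode WAL; -1 disables
    wal_prefetch_fpw = off               # skip blocks covered by full-page images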

Mumble. I guess I should wait to comment on this after reading 0005
more in depth.

First of all, let me join the crowd chanting that this is badly needed;
I don't need to repeat what Chittenden's talk showed. "WAL recovery is
now 10x-20x faster" would be a good item for the pg13 press release,
I think.

We should be careful about over-promising here: Sean basically had a
best-case scenario for this type of technology, partly due to his 16kB
filesystem blocks. Common results may be a lot more pedestrian,
though it could get more interesting if we figure out how to get rid
of FPWs...

Well, in my mind it's an established fact that our WAL replay uses far
too little of the available I/O speed. I guess if the system is
generating little WAL, then this change will show no benefit, but that's
not the kind of system that cares about this anyway -- for the others,
the parallelisation gains will be substantial, I'm sure.

From a61b4e00c42ace5db1608e02165f89094bf86391 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 3 Dec 2019 17:13:40 +1300
Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.

Previously a Relation was required, but it's annoying to have
to create a "fake" one in recovery.

While staring at this, I decided that SharedPrefetchBuffer() was a
weird word order, so I changed it to PrefetchSharedBuffer(). Then, by
analogy, I figured I should also change the pre-existing function
LocalPrefetchBuffer() to PrefetchLocalBuffer(). Do you think this is
an improvement?

Looks good. I doubt you'll break anything by renaming that routine.

From acbff1444d0acce71b0218ce083df03992af1581 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:10:17 +1300
Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().

The new name better reflects the fact that the value it returns
is updated only when received data has been flushed to disk.

An upcoming patch will make use of the latest data that was
written without waiting for it to be flushed, so use more
precise function names.

Ugh. (Not for your patch -- I mean for the existing naming convention).
It would make sense to rename WalRcvData->receivedUpto in this commit,
maybe to flushedUpto.

Ok, I renamed that variable and a related one. There are more things
you could rename if you pull on that thread some more, including
pg_stat_wal_receiver's received_lsn column, but I didn't do that in
this patch.

+1 for that approach. Maybe we'll want to rename the SQL-visible name,
but I wouldn't burden this patch with that, lest we lose the entire
series to that :-)

+ pg_atomic_uint64 writtenUpto;

Are we already using uint64s for XLogRecPtrs anywhere? This seems
novel. Given this, I wonder if the comment near "mutex" needs an
update ("except where atomics are used"), or perhaps just move the
member to after the line with mutex.

Moved.

LGTM.

We use [u]int64 in various places in the replication code. Ideally
I'd have a magic way to say atomic<XLogRecPtr> so I didn't have to
assume that pg_atomic_uint64 is the right atomic integer width and
signedness, but here we are. In dsa.h I made a special typedef for
the atomic version of something else, but that's because the size of
that thing varied depending on the build, whereas our LSNs are of a
fixed width that ought to be en... <trails off>.

Let's rewrite Postgres in Rust ...

I didn't understand the purpose of inc_counter() as written. Why not
just pg_atomic_fetch_add_u64(..., 1)?

I didn't want counters that wrap at ~4 billion, but I did want to be
able to read and write concurrently without tearing. Instructions
like "lock xadd" would provide more guarantees that I don't need,
since only one thread is doing all the writing and there's no ordering
requirement. It's basically just counter++, but some platforms need a
spinlock to perform atomic read and write of 64 bit wide numbers, so
more hoop jumping is required.

Ah, I see, you don't want lock xadd ... That's non-obvious. I suppose
the function could use more commentary on *why* you're doing it that way
then.

/*
*   smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
+ *
+ *           In recovery only, this can return false to indicate that a file
+ *           doesn't exist (presumably it has been dropped by a later WAL
+ *           record).
*/
-void
+bool
smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)

I think this API, where the behavior of a low-level module changes
depending on InRecovery, is confusingly crazy. I'd rather have the
callers specifying whether they're OK with a file that doesn't exist.

Hmm. But... md.c has other code like that. It's true that I'm adding
InRecovery awareness to a function that didn't previously have it, but
that's just because we previously had no reason to prefetch stuff in
recovery.

True. I'm uncomfortable about it anyway. I also noticed that
_mdfd_getseg() already has InRecovery-specific behavior flags.
Clearly that ship has sailed. Consider my objection^W comment withdrawn.

Umm, I would keep the return values of both these functions in sync.
It's really strange that PrefetchBuffer does not return
PrefetchBufferResult, don't you think?

Agreed, and changed. I suspect that other users of the main
PrefetchBuffer() call will eventually want that, to do a better job of
keeping the request queue full, for example bitmap heap scan and
(hypothetical) btree scan with prefetch.

LGTM.

As before, I didn't get to reading 0005 in depth.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9Thomas Munro
thomas.munro@gmail.com
In reply to: Alvaro Herrera (#8)
5 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Wed, Mar 18, 2020 at 2:47 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2020-Mar-17, Thomas Munro wrote:

I didn't want counters that wrap at ~4 billion, but I did want to be
able to read and write concurrently without tearing. Instructions
like "lock xadd" would provide more guarantees that I don't need,
since only one thread is doing all the writing and there's no ordering
requirement. It's basically just counter++, but some platforms need a
spinlock to perform atomic read and write of 64 bit wide numbers, so
more hoop jumping is required.

Ah, I see, you don't want lock xadd ... That's non-obvious. I suppose
the function could use more commentary on *why* you're doing it that way
then.

I updated the comment:

+/*
+ * On modern systems this is really just *counter++.  On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.  The counters will only be written to by one process, and there
+ * is no ordering requirement, so there's no point in using higher overhead
+ * pg_atomic_fetch_add_u64().
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)
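
For reference, the body amounts to a plain load-add-store built from the
unlocked atomic read/write primitives, roughly:

    static inline void
    inc_counter(pg_atomic_uint64 *counter)
    {
        /* Single writer, no ordering requirement: skip the atomic RMW. */
        pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
    }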

Umm, I would keep the return values of both these functions in sync.
It's really strange that PrefetchBuffer does not return
PrefetchBufferResult, don't you think?

Agreed, and changed. I suspect that other users of the main
PrefetchBuffer() call will eventually want that, to do a better job of
keeping the request queue full, for example bitmap heap scan and
(hypothetical) btree scan with prefetch.

LGTM.

Here's a new version that changes that part just a bit more, after a
brief chat with Andres about his async I/O plans. It seems clear that
returning an enum isn't very extensible, so I decided to try making
PrefetchBufferResult a struct whose contents can be extended in the
future. In this patch set it's still just used to distinguish 3 cases
(hit, miss, no file), but it's now expressed as a buffer and a flag to
indicate whether I/O was initiated. You could imagine that the second
thing might be replaced by a pointer to an async I/O handle you can
wait on or some other magical thing from the future.
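
To make the intended usage concrete, a caller might consume the result
roughly like this (a sketch only; the variable names are illustrative):

    PrefetchBufferResult result = PrefetchBuffer(reln, MAIN_FORKNUM, blkno);

    if (result.initiated_io)
        ios_in_flight++;                 /* cache miss: async read started */
    else if (BufferIsValid(result.buffer))
        recent_buffer = result.buffer;   /* probable hit: not pinned, so it
                                          * must be rechecked before use */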

The concept here is that eventually we'll have just one XLogReader for
both read ahead and recovery, and we could attach the prefetch results
to the decoded records, and then recovery would try to use already
looked up buffers to avoid a bit of work (and then recheck). In other
words, the WAL would be decoded only once, and the buffers would
hopefully be looked up only once, so you'd claw back all of the
overheads of this patch. For now that's not done, and the buffer in
the result is only compared with InvalidBuffer to check if there was a
hit or not.

Similar things could be done for bitmap heap scan and btree prefetch
with this interface: their prefetch machinery could hold onto these
results in their block arrays and try to avoid a more expensive
ReadBuffer() call if they already have a buffer (though as before,
there's a small chance it turns out to be the wrong one and they need
to fall back to ReadBuffer()).

As before, I didn't get to reading 0005 in depth.

Updated to account for the above-mentioned change, and with a couple
of elog() calls changed to ereport().

Attachments:

0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRela-v5.patch (text/x-patch)
From 94df05846b155dfc68997f17899ddb34637d868a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 15:25:55 +1300
Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.

Previously a Relation was required, but it's annoying to have to create
a "fake" one in recovery.  A new function PrefetchSharedBuffer() is
provided that works with SMgrRelation, and LocalPrefetchBuffer() is
renamed to PrefetchLocalBuffer() to fit with that more natural naming
scheme.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 84 ++++++++++++++++-----------
 src/backend/storage/buffer/localbuf.c |  4 +-
 src/include/storage/buf_internals.h   |  2 +-
 src/include/storage/bufmgr.h          |  6 ++
 4 files changed, 59 insertions(+), 37 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e05e2b3456..d30aed6fd9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -466,6 +466,53 @@ static int	ckpt_buforder_comparator(const void *pa, const void *pb);
 static int	ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
 
 
+/*
+ * Implementation of PrefetchBuffer() for shared buffers.
+ */
+void
+PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+					 ForkNumber forkNum,
+					 BlockNumber blockNum)
+{
+#ifdef USE_PREFETCH
+	BufferTag	newTag;		/* identity of requested block */
+	uint32		newHash;	/* hash value for newTag */
+	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
+	int			buf_id;
+
+	Assert(BlockNumberIsValid(blockNum));
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(newTag, smgr_reln->smgr_rnode.node,
+				   forkNum, blockNum);
+
+	/* determine its hash code and partition lock ID */
+	newHash = BufTableHashCode(&newTag);
+	newPartitionLock = BufMappingPartitionLock(newHash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(newPartitionLock, LW_SHARED);
+	buf_id = BufTableLookup(&newTag, newHash);
+	LWLockRelease(newPartitionLock);
+
+	/* If not in buffers, initiate prefetch */
+	if (buf_id < 0)
+		smgrprefetch(smgr_reln, forkNum, blockNum);
+
+	/*
+	 * If the block *is* in buffers, we do nothing.  This is not really ideal:
+	 * the block might be just about to be evicted, which would be stupid
+	 * since we know we are going to need it soon.  But the only easy answer
+	 * is to bump the usage_count, which does not seem like a great solution:
+	 * when the caller does ultimately touch the block, usage_count would get
+	 * bumped again, resulting in too much favoritism for blocks that are
+	 * involved in a prefetch sequence. A real fix would involve some
+	 * additional per-buffer state, and it's not clear that there's enough of
+	 * a problem to justify that.
+	 */
+#endif							/* USE_PREFETCH */
+}
+
 /*
  * PrefetchBuffer -- initiate asynchronous read of a block of a relation
  *
@@ -493,43 +540,12 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 					 errmsg("cannot access temporary tables of other sessions")));
 
 		/* pass it off to localbuf.c */
-		LocalPrefetchBuffer(reln->rd_smgr, forkNum, blockNum);
+		PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
 	}
 	else
 	{
-		BufferTag	newTag;		/* identity of requested block */
-		uint32		newHash;	/* hash value for newTag */
-		LWLock	   *newPartitionLock;	/* buffer partition lock for it */
-		int			buf_id;
-
-		/* create a tag so we can lookup the buffer */
-		INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
-					   forkNum, blockNum);
-
-		/* determine its hash code and partition lock ID */
-		newHash = BufTableHashCode(&newTag);
-		newPartitionLock = BufMappingPartitionLock(newHash);
-
-		/* see if the block is in the buffer pool already */
-		LWLockAcquire(newPartitionLock, LW_SHARED);
-		buf_id = BufTableLookup(&newTag, newHash);
-		LWLockRelease(newPartitionLock);
-
-		/* If not in buffers, initiate prefetch */
-		if (buf_id < 0)
-			smgrprefetch(reln->rd_smgr, forkNum, blockNum);
-
-		/*
-		 * If the block *is* in buffers, we do nothing.  This is not really
-		 * ideal: the block might be just about to be evicted, which would be
-		 * stupid since we know we are going to need it soon.  But the only
-		 * easy answer is to bump the usage_count, which does not seem like a
-		 * great solution: when the caller does ultimately touch the block,
-		 * usage_count would get bumped again, resulting in too much
-		 * favoritism for blocks that are involved in a prefetch sequence. A
-		 * real fix would involve some additional per-buffer state, and it's
-		 * not clear that there's enough of a problem to justify that.
-		 */
+		/* pass it to the shared buffer version */
+		PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
 	}
 #endif							/* USE_PREFETCH */
 }
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index cac08e1b1a..b528bc9553 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -54,14 +54,14 @@ static Block GetLocalBufferStorage(void);
 
 
 /*
- * LocalPrefetchBuffer -
+ * PrefetchLocalBuffer -
  *	  initiate asynchronous read of a block of a relation
  *
  * Do PrefetchBuffer's work for temporary relations.
  * No-op if prefetching isn't compiled in.
  */
 void
-LocalPrefetchBuffer(SMgrRelation smgr, ForkNumber forkNum,
+PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 					BlockNumber blockNum)
 {
 #ifdef USE_PREFETCH
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index bf3b8ad340..166fe334c7 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -327,7 +327,7 @@ extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
 extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
 
 /* localbuf.c */
-extern void LocalPrefetchBuffer(SMgrRelation smgr, ForkNumber forkNum,
+extern void PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 								BlockNumber blockNum);
 extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
 									BlockNumber blockNum, bool *foundPtr);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d2a5b52f6e..e00dd3ffb7 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -49,6 +49,9 @@ typedef enum
 /* forward declared, to avoid having to expose buf_internals.h here */
 struct WritebackContext;
 
+/* forward declared, to avoid including smgr.h */
+struct SMgrRelationData;
+
 /* in globals.c ... this duplicates miscadmin.h */
 extern PGDLLIMPORT int NBuffers;
 
@@ -159,6 +162,9 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 /*
  * prototypes for functions in bufmgr.c
  */
+extern void PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+								 ForkNumber forkNum,
+								 BlockNumber blockNum);
 extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
 						   BlockNumber blockNum);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
-- 
2.20.1

0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecP-v5.patch (text/x-patch)
From 02a03ee9767fbb2ef6fc62bdf1e64c0fe24eccfa Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 15:28:08 +1300
Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().

The new name better reflects the fact that the value it returns is
updated only when received data has been flushed to disk.  Also rename a
couple of variables relating to this value.

An upcoming patch will make use of the latest data that was written
without waiting for it to be flushed, so let's use more precise function
names.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/access/transam/xlog.c          | 20 ++++++++++----------
 src/backend/access/transam/xlogfuncs.c     |  2 +-
 src/backend/replication/README             |  2 +-
 src/backend/replication/walreceiver.c      | 10 +++++-----
 src/backend/replication/walreceiverfuncs.c | 12 ++++++------
 src/backend/replication/walsender.c        |  2 +-
 src/include/replication/walreceiver.h      |  8 ++++----
 7 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index de2d4ee582..abb227ce66 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -205,8 +205,8 @@ HotStandbyState standbyState = STANDBY_DISABLED;
 
 static XLogRecPtr LastRec;
 
-/* Local copy of WalRcv->receivedUpto */
-static XLogRecPtr receivedUpto = 0;
+/* Local copy of WalRcv->flushedUpto */
+static XLogRecPtr flushedUpto = 0;
 static TimeLineID receiveTLI = 0;
 
 /*
@@ -9288,7 +9288,7 @@ CreateRestartPoint(int flags)
 	 * Retreat _logSegNo using the current end of xlog replayed or received,
 	 * whichever is later.
 	 */
-	receivePtr = GetWalRcvWriteRecPtr(NULL, NULL);
+	receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
 	replayPtr = GetXLogReplayRecPtr(&replayTLI);
 	endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
 	KeepLogSeg(endptr, &_logSegNo);
@@ -11682,7 +11682,7 @@ retry:
 	/* See if we need to retrieve more data */
 	if (readFile < 0 ||
 		(readSource == XLOG_FROM_STREAM &&
-		 receivedUpto < targetPagePtr + reqLen))
+		 flushedUpto < targetPagePtr + reqLen))
 	{
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
 										 private->randAccess,
@@ -11713,10 +11713,10 @@ retry:
 	 */
 	if (readSource == XLOG_FROM_STREAM)
 	{
-		if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
+		if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
 			readLen = XLOG_BLCKSZ;
 		else
-			readLen = XLogSegmentOffset(receivedUpto, wal_segment_size) -
+			readLen = XLogSegmentOffset(flushedUpto, wal_segment_size) -
 				targetPageOff;
 	}
 	else
@@ -11952,7 +11952,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						curFileTLI = tli;
 						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
 											 PrimarySlotName);
-						receivedUpto = 0;
+						flushedUpto = 0;
 					}
 
 					/*
@@ -12132,14 +12132,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * XLogReceiptTime will not advance, so the grace time
 					 * allotted to conflicting queries will decrease.
 					 */
-					if (RecPtr < receivedUpto)
+					if (RecPtr < flushedUpto)
 						havedata = true;
 					else
 					{
 						XLogRecPtr	latestChunkStart;
 
-						receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI);
-						if (RecPtr < receivedUpto && receiveTLI == curFileTLI)
+						flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
+						if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
 						{
 							havedata = true;
 							if (latestChunkStart <= RecPtr)
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 20316539b6..e075c1c71b 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -398,7 +398,7 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
 {
 	XLogRecPtr	recptr;
 
-	recptr = GetWalRcvWriteRecPtr(NULL, NULL);
+	recptr = GetWalRcvFlushRecPtr(NULL, NULL);
 
 	if (recptr == 0)
 		PG_RETURN_NULL();
diff --git a/src/backend/replication/README b/src/backend/replication/README
index 0cbb990613..8ccdd86e74 100644
--- a/src/backend/replication/README
+++ b/src/backend/replication/README
@@ -54,7 +54,7 @@ and WalRcvData->slotname, and initializes the starting point in
 WalRcvData->receiveStart.
 
 As walreceiver receives WAL from the master server, and writes and flushes
-it to disk (in pg_wal), it updates WalRcvData->receivedUpto and signals
+it to disk (in pg_wal), it updates WalRcvData->flushedUpto and signals
 the startup process to know how far WAL replay can advance.
 
 Walreceiver sends information about replication progress to the master server
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 25e0333c9e..0bdd0c3074 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -12,7 +12,7 @@
  * in the primary server), and then keeps receiving XLOG records and
  * writing them to the disk as long as the connection is alive. As XLOG
  * records are received and flushed to disk, it updates the
- * WalRcv->receivedUpto variable in shared memory, to inform the startup
+ * WalRcv->flushedUpto variable in shared memory, to inform the startup
  * process of how far it can proceed with XLOG replay.
  *
  * If the primary server ends streaming, but doesn't disconnect, walreceiver
@@ -1006,10 +1006,10 @@ XLogWalRcvFlush(bool dying)
 
 		/* Update shared-memory status */
 		SpinLockAcquire(&walrcv->mutex);
-		if (walrcv->receivedUpto < LogstreamResult.Flush)
+		if (walrcv->flushedUpto < LogstreamResult.Flush)
 		{
-			walrcv->latestChunkStart = walrcv->receivedUpto;
-			walrcv->receivedUpto = LogstreamResult.Flush;
+			walrcv->latestChunkStart = walrcv->flushedUpto;
+			walrcv->flushedUpto = LogstreamResult.Flush;
 			walrcv->receivedTLI = ThisTimeLineID;
 		}
 		SpinLockRelease(&walrcv->mutex);
@@ -1362,7 +1362,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	state = WalRcv->walRcvState;
 	receive_start_lsn = WalRcv->receiveStart;
 	receive_start_tli = WalRcv->receiveStartTLI;
-	received_lsn = WalRcv->receivedUpto;
+	received_lsn = WalRcv->flushedUpto;
 	received_tli = WalRcv->receivedTLI;
 	last_send_time = WalRcv->lastMsgSendTime;
 	last_receipt_time = WalRcv->lastMsgReceiptTime;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 89c903e45a..31025f97e3 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -264,11 +264,11 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 
 	/*
 	 * If this is the first startup of walreceiver (on this timeline),
-	 * initialize receivedUpto and latestChunkStart to the starting point.
+	 * initialize flushedUpto and latestChunkStart to the starting point.
 	 */
 	if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
 	{
-		walrcv->receivedUpto = recptr;
+		walrcv->flushedUpto = recptr;
 		walrcv->receivedTLI = tli;
 		walrcv->latestChunkStart = recptr;
 	}
@@ -286,7 +286,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 }
 
 /*
- * Returns the last+1 byte position that walreceiver has written.
+ * Returns the last+1 byte position that walreceiver has flushed.
  *
  * Optionally, returns the previous chunk start, that is the first byte
  * written in the most recent walreceiver flush cycle.  Callers not
@@ -294,13 +294,13 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
  * receiveTLI.
  */
 XLogRecPtr
-GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 {
 	WalRcvData *walrcv = WalRcv;
 	XLogRecPtr	recptr;
 
 	SpinLockAcquire(&walrcv->mutex);
-	recptr = walrcv->receivedUpto;
+	recptr = walrcv->flushedUpto;
 	if (latestChunkStart)
 		*latestChunkStart = walrcv->latestChunkStart;
 	if (receiveTLI)
@@ -327,7 +327,7 @@ GetReplicationApplyDelay(void)
 	TimestampTz chunkReplayStartTime;
 
 	SpinLockAcquire(&walrcv->mutex);
-	receivePtr = walrcv->receivedUpto;
+	receivePtr = walrcv->flushedUpto;
 	SpinLockRelease(&walrcv->mutex);
 
 	replayPtr = GetXLogReplayRecPtr(NULL);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 76ec3c7dd0..928a27dbaf 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2913,7 +2913,7 @@ GetStandbyFlushRecPtr(void)
 	 * has streamed, but hasn't been replayed yet.
 	 */
 
-	receivePtr = GetWalRcvWriteRecPtr(NULL, &receiveTLI);
+	receivePtr = GetWalRcvFlushRecPtr(NULL, &receiveTLI);
 	replayPtr = GetXLogReplayRecPtr(&replayTLI);
 
 	ThisTimeLineID = replayTLI;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6548..9ed71139ce 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -74,19 +74,19 @@ typedef struct
 	TimeLineID	receiveStartTLI;
 
 	/*
-	 * receivedUpto-1 is the last byte position that has already been
+	 * flushedUpto-1 is the last byte position that has already been
 	 * received, and receivedTLI is the timeline it came from.  At the first
 	 * startup of walreceiver, these are set to receiveStart and
 	 * receiveStartTLI. After that, walreceiver updates these whenever it
 	 * flushes the received WAL to disk.
 	 */
-	XLogRecPtr	receivedUpto;
+	XLogRecPtr	flushedUpto;
 	TimeLineID	receivedTLI;
 
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
 	 * of received WAL.  It's actually the same as the previous value of
-	 * receivedUpto before the last flush to disk.  Startup process can use
+	 * flushedUpto before the last flush to disk.  Startup process can use
 	 * this to detect whether it's keeping up or not.
 	 */
 	XLogRecPtr	latestChunkStart;
@@ -322,7 +322,7 @@ extern bool WalRcvStreaming(void);
 extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname);
-extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.20.1

0003-Add-WalRcvGetWriteRecPtr-new-definition-v5.patch (text/x-patch)
From 1b03eb5ada24c3b23ab8ca6db50e0c5d90d38259 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH 3/5] Add GetWalRcvWriteRecPtr() (new definition).

A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk.  To do that, it
needs to be able to see the write pointer advancing in shared memory.

The function formerly bearing this name was recently renamed to
GetWalRcvFlushRecPtr(), which better describes what it does.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/replication/walreceiver.c      |  5 +++++
 src/backend/replication/walreceiverfuncs.c | 12 ++++++++++++
 src/include/replication/walreceiver.h      | 10 ++++++++++
 3 files changed, 27 insertions(+)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 0bdd0c3074..e250f5583c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -245,6 +245,8 @@ WalReceiverMain(void)
 
 	SpinLockRelease(&walrcv->mutex);
 
+	pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
+
 	/* Arrange to clean up at walreceiver exit */
 	on_shmem_exit(WalRcvDie, 0);
 
@@ -985,6 +987,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 		LogstreamResult.Write = recptr;
 	}
+
+	/* Update shared-memory status */
+	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 }
 
 /*
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 31025f97e3..96b44e2c88 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -310,6 +310,18 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 	return recptr;
 }
 
+/*
+ * Returns the last+1 byte position that walreceiver has written.
+ * This returns a recently written value without taking a lock.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtr(void)
+{
+	WalRcvData *walrcv = WalRcv;
+
+	return pg_atomic_read_u64(&walrcv->writtenUpto);
+}
+
 /*
  * Returns the replication apply delay in ms or -1
  * if the apply delay info is not available
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 9ed71139ce..914e6e3d44 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -16,6 +16,7 @@
 #include "access/xlogdefs.h"
 #include "getaddrinfo.h"		/* for NI_MAXHOST */
 #include "pgtime.h"
+#include "port/atomics.h"
 #include "replication/logicalproto.h"
 #include "replication/walsender.h"
 #include "storage/latch.h"
@@ -142,6 +143,14 @@ typedef struct
 
 	slock_t		mutex;			/* locks shared variables shown above */
 
+	/*
+	 * Like flushedUpto, but advanced after writing and before flushing,
+	 * without the need to acquire the spin lock.  Data can be read by another
+	 * process up to this point, but shouldn't be used for data integrity
+	 * purposes.
+	 */
+	pg_atomic_uint64 writtenUpto;
+
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.  But we do need atomic fetch and
@@ -323,6 +332,7 @@ extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname);
 extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(void);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.20.1

0004-Allow-PrefetchBuffer-to-report-what-happened-v5.patch (text/x-patch)
From c62fde23f70ff06833d743a1c85716e15f3c813c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 17:26:41 +1300
Subject: [PATCH 4/5] Allow PrefetchBuffer() to report what happened.

Report whether a prefetch was actually initiated due to a cache miss, so
that callers can limit the number of concurrent I/Os they try to issue,
without counting the prefetch calls that did nothing because the page
was already in our buffers.

If the requested block was already cached, return a valid buffer.  This
might enable future code to avoid a buffer mapping lookup, though it
will need to recheck the buffer before using it because it's not pinned
so could be reclaimed at any time.

Report neither hit nor miss when a relation's backing file is missing,
to prepare for use during recovery.  This will be used to handle cases
of relations that are referenced in the WAL but have been unlinked
already due to actions covered by WAL records that haven't been replayed
yet, after a crash.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 38 +++++++++++++++++++++++----
 src/backend/storage/buffer/localbuf.c | 17 ++++++++----
 src/backend/storage/smgr/md.c         |  9 +++++--
 src/backend/storage/smgr/smgr.c       | 10 ++++---
 src/include/storage/buf_internals.h   |  5 ++--
 src/include/storage/bufmgr.h          | 19 ++++++++++----
 src/include/storage/md.h              |  2 +-
 src/include/storage/smgr.h            |  2 +-
 8 files changed, 78 insertions(+), 24 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d30aed6fd9..4ceb40a856 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -469,11 +469,13 @@ static int	ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
 /*
  * Implementation of PrefetchBuffer() for shared buffers.
  */
-void
+PrefetchBufferResult
 PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
 					 ForkNumber forkNum,
 					 BlockNumber blockNum)
 {
+	PrefetchBufferResult result = { InvalidBuffer, false };
+
 #ifdef USE_PREFETCH
 	BufferTag	newTag;		/* identity of requested block */
 	uint32		newHash;	/* hash value for newTag */
@@ -497,7 +499,23 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
 
 	/* If not in buffers, initiate prefetch */
 	if (buf_id < 0)
-		smgrprefetch(smgr_reln, forkNum, blockNum);
+	{
+		/*
+		 * Try to initiate an asynchronous read.  This returns false in
+		 * recovery if the relation file doesn't exist.
+		 */
+		if (smgrprefetch(smgr_reln, forkNum, blockNum))
+			result.initiated_io = true;
+	}
+	else
+	{
+		/*
+		 * Report the buffer it was in at that time.  The caller may be able
+		 * to avoid a buffer table lookup, but it's not pinned and it must be
+		 * rechecked!
+		 */
+		result.buffer = buf_id + 1;
+	}
 
 	/*
 	 * If the block *is* in buffers, we do nothing.  This is not really ideal:
@@ -511,6 +529,8 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
 	 * a problem to justify that.
 	 */
 #endif							/* USE_PREFETCH */
+
+	return result;
 }
 
 /*
@@ -520,8 +540,12 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
  * buffer.  Instead it tries to ensure that a future ReadBuffer for the given
  * block will not be delayed by the I/O.  Prefetching is optional.
  * No-op if prefetching isn't compiled in.
+ *
+ * If the block is already cached, the result includes a valid buffer that can
+ * be used by the caller to avoid the need for a later buffer lookup, but it's
+ * not pinned, so the caller must recheck it.
  */
-void
+PrefetchBufferResult
 PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 {
 #ifdef USE_PREFETCH
@@ -540,13 +564,17 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 					 errmsg("cannot access temporary tables of other sessions")));
 
 		/* pass it off to localbuf.c */
-		PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
+		return PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
 	}
 	else
 	{
 		/* pass it to the shared buffer version */
-		PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
+		return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
 	}
+#else
+	PrefetchBufferResult result = { InvalidBuffer, false };
+
+	return result;
 #endif							/* USE_PREFETCH */
 }
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index b528bc9553..18a8614e9b 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -60,10 +60,12 @@ static Block GetLocalBufferStorage(void);
  * Do PrefetchBuffer's work for temporary relations.
  * No-op if prefetching isn't compiled in.
  */
-void
+PrefetchBufferResult
 PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 					BlockNumber blockNum)
 {
+	PrefetchBufferResult result = { InvalidBuffer, false };
+
 #ifdef USE_PREFETCH
 	BufferTag	newTag;			/* identity of requested block */
 	LocalBufferLookupEnt *hresult;
@@ -81,12 +83,17 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 	if (hresult)
 	{
 		/* Yes, so nothing to do */
-		return;
+		result.buffer = -hresult->id - 1;
+	}
+	else
+	{
+		/* Not in buffers, so initiate prefetch */
+		smgrprefetch(smgr, forkNum, blockNum);
+		result.initiated_io = true;
 	}
-
-	/* Not in buffers, so initiate prefetch */
-	smgrprefetch(smgr, forkNum, blockNum);
 #endif							/* USE_PREFETCH */
+
+	return result;
 }
 
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index c5b771c531..ba12fc2077 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -525,14 +525,17 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 /*
  *	mdprefetch() -- Initiate asynchronous read of the specified block of a relation
  */
-void
+bool
 mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 {
 #ifdef USE_PREFETCH
 	off_t		seekpos;
 	MdfdVec    *v;
 
-	v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+	v = _mdfd_getseg(reln, forknum, blocknum, false,
+					 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
+	if (v == NULL)
+		return false;
 
 	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
@@ -540,6 +543,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 
 	(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
 #endif							/* USE_PREFETCH */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..c39dd533e6 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -49,7 +49,7 @@ typedef struct f_smgr
 								bool isRedo);
 	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
 								BlockNumber blocknum, char *buffer, bool skipFsync);
-	void		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
+	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber blocknum);
 	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
 							  BlockNumber blocknum, char *buffer);
@@ -489,11 +489,15 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 /*
  *	smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
+ *
+ *		In recovery only, this can return false to indicate that a file
+ *		doesn't	exist (presumably it has been dropped by a later WAL
+ *		record).
  */
-void
+bool
 smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 {
-	smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
+	return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
 }
 
 /*
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 166fe334c7..e57f84ee9c 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -327,8 +327,9 @@ extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
 extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
 
 /* localbuf.c */
-extern void PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
-								BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
+												ForkNumber forkNum,
+												BlockNumber blockNum);
 extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
 									BlockNumber blockNum, bool *foundPtr);
 extern void MarkLocalBufferDirty(Buffer buffer);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index e00dd3ffb7..64b643569f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -46,6 +46,15 @@ typedef enum
 								 * replay; otherwise same as RBM_NORMAL */
 } ReadBufferMode;
 
+/*
+ * Type returned by PrefetchBuffer().
+ */
+typedef struct PrefetchBufferResult
+{
+	Buffer		buffer;			/* If valid, a hit (recheck needed!) */
+	bool		initiated_io;	/* If true, a miss resulting in async I/O */
+} PrefetchBufferResult;
+
 /* forward declared, to avoid having to expose buf_internals.h here */
 struct WritebackContext;
 
@@ -162,11 +171,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 /*
  * prototypes for functions in bufmgr.c
  */
-extern void PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
-								 ForkNumber forkNum,
-								 BlockNumber blockNum);
-extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
-						   BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+												 ForkNumber forkNum,
+												 BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchBuffer(Relation reln, ForkNumber forkNum,
+										   BlockNumber blockNum);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
 extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BlockNumber blockNum, ReadBufferMode mode,
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index ec7630ce3b..07fd1bb7d0 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -28,7 +28,7 @@ extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
 extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
 extern void mdextend(SMgrRelation reln, ForkNumber forknum,
 					 BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum);
 extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 				   char *buffer);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 243822137c..dc740443e2 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,7 +92,7 @@ extern void smgrdounlink(SMgrRelation reln, bool isRedo);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum);
 extern void smgrread(SMgrRelation reln, ForkNumber forknum,
 					 BlockNumber blocknum, char *buffer);
-- 
2.20.1

0005-Prefetch-referenced-blocks-during-recovery-v5.patch (text/x-patch)
From 42ba0a89260d46230ac0df791fae18bfdca0092f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 18 Mar 2020 16:35:27 +1300
Subject: [PATCH 5/5] Prefetch referenced blocks during recovery.

Introduce a new GUC max_wal_prefetch_distance.  If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks.  The
goal is to avoid I/O stalls and benefit from concurrent I/O.  The number
of concurrent asynchronous reads is capped by the existing
maintenance_io_concurrency GUC.  The feature is disabled by default.

Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                    |  38 ++
 doc/src/sgml/monitoring.sgml                |  69 ++
 doc/src/sgml/wal.sgml                       |  12 +
 src/backend/access/transam/Makefile         |   1 +
 src/backend/access/transam/xlog.c           |  64 ++
 src/backend/access/transam/xlogprefetcher.c | 663 ++++++++++++++++++++
 src/backend/access/transam/xlogutils.c      |  23 +-
 src/backend/catalog/system_views.sql        |  11 +
 src/backend/replication/logical/logical.c   |   2 +-
 src/backend/storage/buffer/bufmgr.c         |   2 +-
 src/backend/storage/ipc/ipci.c              |   3 +
 src/backend/utils/misc/guc.c                |  38 +-
 src/include/access/xlog.h                   |   4 +
 src/include/access/xlogprefetcher.h         |  28 +
 src/include/access/xlogutils.h              |  20 +
 src/include/catalog/pg_proc.dat             |   8 +
 src/include/utils/guc.h                     |   2 +
 src/test/regress/expected/rules.out         |   8 +
 18 files changed, 992 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetcher.c
 create mode 100644 src/include/access/xlogprefetcher.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 672bf6f1ee..8249ec0139 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3102,6 +3102,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-wal-prefetch-distance" xreflabel="max_wal_prefetch_distance">
+      <term><varname>max_wal_prefetch_distance</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_wal_prefetch_distance</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        The maximum distance to look ahead in the WAL during recovery, to find
+        blocks to prefetch.  Prefetching blocks that will soon be needed can
+        reduce I/O wait times.  The number of concurrent prefetches is limited
+        by this setting as well as <xref linkend="guc-maintenance-io-concurrency"/>.
+        If this value is specified without units, it is taken as bytes.
+        The default is -1, meaning that WAL prefetching is disabled.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw">
+      <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks with full page images during recovery.
+        Usually this doesn't help, since such blocks will not be read.  However,
+        on file systems with a block size larger than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a costly
+        read-before-write when such blocks are later written.
+        This setting has no effect unless
+        <xref linkend="guc-max-wal-prefetch-distance"/> is set to a positive number.
+        The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 987580d6df..df4291092b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_wal_prefetcher</structname><indexterm><primary>pg_stat_wal_prefetcher</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-wal-prefetcher-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2192,6 +2199,68 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    connected server.
   </para>
 
+  <table id="pg-stat-wal-prefetcher-view" xreflabel="pg_stat_wal_prefetcher">
+   <title><structname>pg_stat_wal_prefetcher</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-wal-prefetch-fpw"/> was set to <literal>off</literal></entry>
+    </row>
+    <row>
+     <entry><structfield>skip_seq</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+    </row>
+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+   </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_wal_prefetcher</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-max-wal-prefetch-distance"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-max-wal-prefetch-distance"/>,
+   <xref linkend="guc-wal-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="3">
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index bd9fae544c..9e956ad2a1 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -719,6 +719,18 @@
    <acronym>WAL</acronym> call being logged to the server log. This
    option might be replaced by a more general mechanism in the future.
   </para>
+
+  <para>
+   The <xref linkend="guc-max-wal-prefetch-distance"/> parameter can be
+   used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed, in combination with the
+   <xref linkend="guc-maintenance-io-concurrency"/> parameter.  The
+   prefetching mechanism is most likely to be effective on systems
+   with <varname>full_page_writes</varname> set to
+   <literal>off</literal> (where that is safe), and where the working
+   set is larger than RAM.  By default, WAL prefetching is disabled.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..20e044c7c8 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetcher.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index abb227ce66..85f36ef6f4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -34,6 +34,7 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -105,6 +106,8 @@ int			wal_level = WAL_LEVEL_MINIMAL;
 int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
+int			max_wal_prefetch_distance = -1;
+bool		wal_prefetch_fpw = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -806,6 +809,7 @@ static XLogSource readSource = XLOG_FROM_ANY;
  */
 static XLogSource currentSource = XLOG_FROM_ANY;
 static bool lastSourceFailed = false;
+static bool reset_wal_prefetcher = false;
 
 typedef struct XLogPageReadPrivate
 {
@@ -6213,6 +6217,7 @@ CheckRequiredParameterValues(void)
 	}
 }
 
+
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
@@ -7069,6 +7074,7 @@ StartupXLOG(void)
 		{
 			ErrorContextCallback errcallback;
 			TimestampTz xtime;
+			XLogPrefetcher *prefetcher = NULL;
 
 			InRedo = true;
 
@@ -7076,6 +7082,9 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
 
+			/* the first time through, see if we need to enable prefetching */
+			ResetWalPrefetcher();
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7105,6 +7114,31 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/*
+				 * The first time through, or if any relevant settings or the
+				 * WAL source changes, we'll restart the prefetching machinery
+				 * as appropriate.  This is simpler than trying to handle
+				 * various complicated state changes.
+				 */
+				if (unlikely(reset_wal_prefetcher))
+				{
+					/* If we had one already, destroy it. */
+					if (prefetcher)
+					{
+						XLogPrefetcherFree(prefetcher);
+						prefetcher = NULL;
+					}
+					/* If we want one, create it. */
+					if (max_wal_prefetch_distance > 0)
+							prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr,
+																currentSource == XLOG_FROM_STREAM);
+					reset_wal_prefetcher = false;
+				}
+
+				/* Perform WAL prefetching, if enabled. */
+				if (prefetcher)
+					XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr);
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -7292,6 +7326,8 @@ StartupXLOG(void)
 			/*
 			 * end of main redo apply loop
 			 */
+			if (prefetcher)
+				XLogPrefetcherFree(prefetcher);
 
 			if (reachedRecoveryTarget)
 			{
@@ -10155,6 +10191,24 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 	}
 }
 
+void
+assign_max_wal_prefetch_distance(int new_value, void *extra)
+{
+	/* Reset the WAL prefetcher, because a setting it depends on changed. */
+	max_wal_prefetch_distance = new_value;
+	if (AmStartupProcess())
+		ResetWalPrefetcher();
+}
+
+void
+assign_wal_prefetch_fpw(bool new_value, void *extra)
+{
+	/* Reset the WAL prefetcher, because a setting it depends on changed. */
+	wal_prefetch_fpw = new_value;
+	if (AmStartupProcess())
+		ResetWalPrefetcher();
+}
+
 
 /*
  * Issue appropriate kind of fsync (if any) for an XLOG output file.
@@ -11961,6 +12015,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * and move on to the next state.
 					 */
 					currentSource = XLOG_FROM_STREAM;
+					ResetWalPrefetcher();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -12390,3 +12445,12 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * Schedule a WAL prefetcher reset, on change of relevant settings.
+ */
+void
+ResetWalPrefetcher(void)
+{
+	reset_wal_prefetcher = true;
+}
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
new file mode 100644
index 0000000000..715552b428
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -0,0 +1,663 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ *		Prefetching support for PostgreSQL write-ahead log manager
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetcher.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetcher.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log message
+ * that appears at the end of crash recovery.
+ */
+#define XLOGPREFETCHER_MONITORING_SAMPLE_STEP 32768
+
+/*
+ * Internal state used for book-keeping.
+ */
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	XLogReadLocalOptions options;
+	bool			have_record;
+	bool			shutdown;
+	int				next_block_id;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB		   *filter_table;
+	dlist_head		filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	XLogRecPtr	   *prefetch_queue;
+	int				prefetch_queue_size;
+	int				prefetch_head;
+	int				prefetch_tail;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation	last_reln;
+	RelFileNode		last_rnode;
+	BlockNumber		last_blkno;
+
+	/* Counters used to compute avg_queue_depth and avg_distance. */
+	double			samples;
+	double			queue_depth_sum;
+	double			distance_sum;
+	XLogRecPtr		next_sample_lsn;
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode		rnode;
+	XLogRecPtr		filter_until_replayed;
+	BlockNumber		filter_from_block;
+	dlist_node		link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory just for the benefit of monitoring
+ * functions.
+ */
+typedef struct XLogPrefetcherMonitoringStats
+{
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 skip_hit;	/* Blocks already buffered. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+	pg_atomic_uint64 skip_seq;	/* Sequential/repeat blocks skipped. */
+	int				distance;	/* Number of bytes ahead in the WAL. */
+	int				queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetcherMonitoringStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+
+/*
+ * On modern systems this is really just *counter++.  On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.  The counters will only be written to by one process, and there
+ * is no ordering requirement, so there's no point in using higher overhead
+ * pg_atomic_fetch_add_u64().
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)
+{
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
+static XLogPrefetcherMonitoringStats *MonitoringStats;
+
+size_t
+XLogPrefetcherShmemSize(void)
+{
+	return sizeof(XLogPrefetcherMonitoringStats);
+}
+
+static void
+XLogPrefetcherResetMonitoringStats(void)
+{
+	pg_atomic_init_u64(&MonitoringStats->prefetch, 0);
+	pg_atomic_init_u64(&MonitoringStats->skip_hit, 0);
+	pg_atomic_init_u64(&MonitoringStats->skip_new, 0);
+	pg_atomic_init_u64(&MonitoringStats->skip_fpw, 0);
+	pg_atomic_init_u64(&MonitoringStats->skip_seq, 0);
+	MonitoringStats->distance = -1;
+	MonitoringStats->queue_depth = 0;
+}
+
+void
+XLogPrefetcherShmemInit(void)
+{
+	bool		found;
+
+	MonitoringStats = (XLogPrefetcherMonitoringStats *)
+		ShmemInitStruct("XLogPrefetcherMonitoringStats",
+						sizeof(XLogPrefetcherMonitoringStats),
+						&found);
+	if (!found)
+		XLogPrefetcherResetMonitoringStats();
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
+{
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+	XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher));
+
+	prefetcher->options.nowait = true;
+	if (streaming)
+	{
+		/*
+		 * We're only allowed to read as far as the WAL receiver has written.
+		 * We don't have to wait for it to be flushed, though, as recovery
+		 * does, so that gives us a chance to get a bit further ahead.
+		 */
+		prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+	}
+	else
+	{
+		/* We're allowed to read as far as we can. */
+		prefetcher->options.read_upto_policy = XLRO_LSN;
+		prefetcher->options.lsn = (XLogRecPtr) -1;
+	}
+	prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+											NULL,
+											read_local_xlog_page,
+											&prefetcher->options);
+	prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	/*
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.
+	 */
+	prefetcher->prefetch_queue_size = maintenance_io_concurrency;
+	prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);
+	prefetcher->prefetch_head = prefetcher->prefetch_tail = 0;
+
+	/* Prepare to read at the given LSN. */
+	ereport(LOG,
+			(errmsg("WAL prefetch started at %X/%X",
+					(uint32) (lsn >> 32), (uint32) lsn)));
+	XLogBeginRead(prefetcher->reader, lsn);
+
+	XLogPrefetcherResetMonitoringStats();
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	double		avg_distance = 0;
+	double		avg_queue_depth = 0;
+
+	/* Log final statistics. */
+	if (prefetcher->samples > 0)
+	{
+		avg_distance = prefetcher->distance_sum / prefetcher->samples;
+		avg_queue_depth = prefetcher->queue_depth_sum / prefetcher->samples;
+	}
+	ereport(LOG,
+			(errmsg("WAL prefetch finished at %X/%X; "
+					"prefetch = " UINT64_FORMAT ", "
+					"skip_hit = " UINT64_FORMAT ", "
+					"skip_new = " UINT64_FORMAT ", "
+					"skip_fpw = " UINT64_FORMAT ", "
+					"skip_seq = " UINT64_FORMAT ", "
+					"avg_distance = %f, "
+					"avg_queue_depth = %f",
+			 (uint32) (prefetcher->reader->EndRecPtr >> 32),
+			 (uint32) (prefetcher->reader->EndRecPtr),
+			 pg_atomic_read_u64(&MonitoringStats->prefetch),
+			 pg_atomic_read_u64(&MonitoringStats->skip_hit),
+			 pg_atomic_read_u64(&MonitoringStats->skip_new),
+			 pg_atomic_read_u64(&MonitoringStats->skip_fpw),
+			 pg_atomic_read_u64(&MonitoringStats->skip_seq),
+			 avg_distance,
+			 avg_queue_depth)));
+	XLogReaderFree(prefetcher->reader);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher->prefetch_queue);
+	pfree(prefetcher);
+
+	XLogPrefetcherResetMonitoringStats();
+}
+
+/*
+ * Read ahead in the WAL, as far as we can within the limits set by the user.
+ * Begin fetching any referenced blocks that are not already in the buffer
+ * pool.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	/* If an error has occurred or we've hit the end of the WAL, do nothing. */
+	if (prefetcher->shutdown)
+		return;
+
+	/*
+	 * Have any in-flight prefetches definitely completed, judging by the LSN
+	 * that is currently being replayed?
+	 */
+	XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+	/*
+	 * Do we already have the maximum permitted number of I/Os running
+	 * (according to the information we have)?  If so, we have to wait for at
+	 * least one to complete, so give up early.
+	 */
+	if (XLogPrefetcherSaturated(prefetcher))
+		return;
+
+	/* Can we drop any filters yet, due to problem records being replayed? */
+	XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+	/* Main prefetch loop. */
+	for (;;)
+	{
+		XLogReaderState *reader = prefetcher->reader;
+		char *error;
+		int64		distance;
+
+		/* If we don't already have a record, then try to read one. */
+		if (!prefetcher->have_record)
+		{
+			if (!XLogReadRecord(reader, &error))
+			{
+				/* If we got an error, log it and give up. */
+				if (error)
+				{
+					ereport(LOG, (errmsg("WAL prefetch error: %s", error)));
+					prefetcher->shutdown = true;
+				}
+				/* Otherwise, we'll try again later when more data is here. */
+				return;
+			}
+			prefetcher->have_record = true;
+			prefetcher->next_block_id = 0;
+		}
+
+		/* How far ahead of replay are we now? */
+		distance = prefetcher->reader->ReadRecPtr - replaying_lsn;
+
+		/* Update distance shown in shm. */
+		MonitoringStats->distance = distance;
+
+		/* Sample the averages so we can log them at end of recovery. */
+		if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+		{
+			prefetcher->distance_sum += MonitoringStats->distance;
+			prefetcher->queue_depth_sum += MonitoringStats->queue_depth;
+			prefetcher->samples += 1.0;
+			prefetcher->next_sample_lsn =
+				replaying_lsn + XLOGPREFETCHER_MONITORING_SAMPLE_STEP;
+		}
+
+		/* Are we too far ahead of replay? */
+		if (distance >= max_wal_prefetch_distance)
+			break;
+
+		/*
+		 * If this is a record that creates a new SMGR relation, we'll avoid
+		 * prefetching anything from that rnode until it has been replayed.
+		 */
+		if (replaying_lsn < reader->ReadRecPtr &&
+			XLogRecGetRmid(reader) == RM_SMGR_ID &&
+			(XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+		{
+			xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+			XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+									reader->ReadRecPtr);
+		}
+
+		/*
+		 * Scan the record for block references.  We might already have been
+		 * partway through processing this record when we hit maximum I/O
+		 * concurrency, so start where we left off.
+		 */
+		for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i)
+		{
+			PrefetchBufferResult prefetch;
+			DecodedBkpBlock *block = &reader->blocks[i];
+			SMgrRelation reln;
+
+			/* Ignore everything but the main fork for now. */
+			if (block->forknum != MAIN_FORKNUM)
+				continue;
+
+			/*
+			 * If there is a full page image attached, we won't be reading the
+			 * page, so you might think we should skip it.  However, if the
+			 * underlying filesystem uses larger logical blocks than us, it
+			 * might still need to perform a read-before-write some time later.
+			 * Therefore, only prefetch if configured to do so.
+			 */
+			if (block->has_image && !wal_prefetch_fpw)
+			{
+				inc_counter(&MonitoringStats->skip_fpw);
+				continue;
+			}
+
+			/*
+			 * If this block will initialize a new page then it's probably an
+			 * extension.  Since it might create a new segment, we can't try
+			 * to prefetch this block until the record has been replayed, or we
+			 * might try to open a file that doesn't exist yet.
+			 */
+			if (block->flags & BKPBLOCK_WILL_INIT)
+			{
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+										reader->ReadRecPtr);
+				inc_counter(&MonitoringStats->skip_new);
+				continue;
+			}
+
+			/* Should we skip this block due to a filter? */
+			if (XLogPrefetcherIsFiltered(prefetcher, block->rnode,
+										 block->blkno))
+			{
+				inc_counter(&MonitoringStats->skip_new);
+				continue;
+			}
+
+			/* Fast path for repeated references to the same relation. */
+			if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+			{
+				/*
+				 * If this is a repeat or sequential access, then skip it.  We
+				 * expect the kernel to detect sequential access on its own
+				 * and do a better job than we could.
+				 */
+				if (block->blkno == prefetcher->last_blkno ||
+					block->blkno == prefetcher->last_blkno + 1)
+				{
+					prefetcher->last_blkno = block->blkno;
+					inc_counter(&MonitoringStats->skip_seq);
+					continue;
+				}
+
+				/* We can avoid calling smgropen(). */
+				reln = prefetcher->last_reln;
+			}
+			else
+			{
+				/* Otherwise we have to open it. */
+				reln = smgropen(block->rnode, InvalidBackendId);
+				prefetcher->last_rnode = block->rnode;
+				prefetcher->last_reln = reln;
+			}
+			prefetcher->last_blkno = block->blkno;
+
+			/* Try to prefetch this block! */
+			prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+			if (BufferIsValid(prefetch.buffer))
+			{
+				/*
+				 * It was already cached, so do nothing.  Perhaps in future we
+				 * could remember the buffer so that recovery doesn't have to
+				 * look it up again.
+				 */
+				inc_counter(&MonitoringStats->skip_hit);
+			}
+			else if (prefetch.initiated_io)
+			{
+				/*
+				 * I/O has possibly been initiated (though we don't know if it
+				 * was already cached by the kernel, so we just have to assume
+				 * that it has due to lack of better information).  Record
+				 * this as an I/O in progress until eventually we replay this
+				 * LSN.
+				 */
+				inc_counter(&MonitoringStats->prefetch);
+				XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+				/*
+				 * If the queue is now full, we'll have to wait before
+				 * processing any more blocks from this record.
+				 */
+				if (XLogPrefetcherSaturated(prefetcher))
+				{
+					prefetcher->next_block_id = i + 1;
+					return;
+				}
+			}
+			else
+			{
+				/*
+				 * Neither cached nor initiated.  The underlying segment file
+				 * doesn't exist.  Presumably it will be unlinked by a later
+				 * WAL record.  When recovery reads this block, it will use the
+				 * EXTENSION_CREATE_RECOVERY flag.  We certainly don't want to
+				 * do that sort of thing while merely prefetching, so let's
+				 * just ignore references to this relation until this record is
+				 * replayed, and let recovery create the dummy file or complain
+				 * if something is wrong.
+				 */
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+										reader->ReadRecPtr);
+				inc_counter(&MonitoringStats->skip_new);
+			}
+		}
+
+		/* Advance to the next record. */
+		prefetcher->have_record = false;
+	}
+}
+
+/*
+ * Expose statistics about WAL prefetching.
+ */
+Datum
+pg_stat_get_wal_prefetcher(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_WAL_PREFETCHER_COLS 7
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_WAL_PREFETCHER_COLS];
+	bool		nulls[PG_STAT_GET_WAL_PREFETCHER_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (MonitoringStats->distance < 0)
+	{
+		for (int i = 0; i < PG_STAT_GET_WAL_PREFETCHER_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_WAL_PREFETCHER_COLS; ++i)
+			nulls[i] = false;
+		values[0] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->prefetch));
+		values[1] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_hit));
+		values[2] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_new));
+		values[3] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_fpw));
+		values[4] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_seq));
+		values[5] = Int32GetDatum(MonitoringStats->distance);
+		values[6] = Int32GetDatum(MonitoringStats->queue_depth);
+	}
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the (presumably lower) block
+		 * number there because we don't want to have to track individual
+		 * blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of the
+	 * time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+												   HASH_FIND, NULL);
+
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Insert an LSN into the queue.  The queue must not be full already.  This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+						  XLogRecPtr prefetching_lsn)
+{
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+	prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+	prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+	MonitoringStats->queue_depth++;
+	Assert(MonitoringStats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet?  That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches.  For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+		   prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+	{
+		prefetcher->prefetch_tail++;
+		prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+		MonitoringStats->queue_depth--;
+		Assert(MonitoringStats->queue_depth >= 0);
+	}
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+	return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+		prefetcher->prefetch_tail;
+}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b217ffa52f..fad2acb514 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
@@ -827,6 +828,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	TimeLineID	tli;
 	int			count;
 	WALReadError errinfo;
+	XLogReadLocalOptions *options = (XLogReadLocalOptions *) state->private_data;
 
 	loc = targetPagePtr + reqLen;
 
@@ -841,7 +843,23 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		 * notices recovery finishes, so we only have to maintain it for the
 		 * local process until recovery ends.
 		 */
-		if (!RecoveryInProgress())
+		if (options)
+		{
+			switch (options->read_upto_policy)
+			{
+			case XLRO_WALRCV_WRITTEN:
+				read_upto = GetWalRcvWriteRecPtr();
+				break;
+			case XLRO_LSN:
+				read_upto = options->lsn;
+				break;
+			default:
+				read_upto = 0;
+				elog(ERROR, "unknown read_upto_policy value");
+				break;
+			}
+		}
+		else if (!RecoveryInProgress())
 			read_upto = GetFlushRecPtr();
 		else
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
@@ -879,6 +897,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			if (loc <= read_upto)
 				break;
 
+			if (options && options->nowait)
+				break;
+
 			CHECK_FOR_INTERRUPTS();
 			pg_usleep(1000L);
 		}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b8a3f46912..7b27ac4805 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -811,6 +811,17 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_wal_prefetcher AS
+    SELECT
+            s.prefetch,
+            s.skip_hit,
+            s.skip_new,
+            s.skip_fpw,
+            s.skip_seq,
+            s.distance,
+            s.queue_depth
+     FROM pg_stat_get_wal_prefetcher() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5adf253583..792d90ef4c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -169,7 +169,7 @@ StartupDecodingContext(List *output_plugin_options,
 
 	ctx->slot = slot;
 
-	ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, ctx);
+	ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, NULL);
 	if (!ctx->reader)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4ceb40a856..4fc391a6e4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -572,7 +572,7 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 		return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
 	}
 #else
-	PrefetchBuffer result = { InvalidBuffer, false };
+	PrefetchBufferResult result = { InvalidBuffer, false };
 
 	return result;
 #endif							/* USE_PREFETCH */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..5ca98b8886 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/xlogprefetcher.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -124,6 +125,7 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, PredicateLockShmemSize());
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
+		size = add_size(size, XLogPrefetcherShmemSize());
 		size = add_size(size, CLOGShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
@@ -212,6 +214,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetcherShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 68082315ac..a2a9f62160 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -197,6 +197,7 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1241,6 +1242,18 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"wal_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch blocks that have full page images in the WAL"),
+			gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+						 "entirely overwritten, but if the logical page size of the filesystem is "
+						 "larger than PostgreSQL's, this can be beneficial.  This option has no "
+						 "effect unless max_wal_prefetch_distance is set to a positive number.")
+		},
+		&wal_prefetch_fpw,
+		false,
+		NULL, assign_wal_prefetch_fpw, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2627,6 +2640,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"max_wal_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum number of bytes to read ahead in the WAL to prefetch referenced blocks."),
+			gettext_noop("Set to -1 to disable WAL prefetching."),
+			GUC_UNIT_BYTE
+		},
+		&max_wal_prefetch_distance,
+		-1, -1, INT_MAX,
+		NULL, assign_max_wal_prefetch_distance, NULL
+	},
+
 	{
 		{"wal_keep_segments", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the number of WAL files held for standby servers."),
@@ -2900,7 +2924,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -11498,6 +11523,17 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/* Reset the WAL prefetcher, because a setting it depends on changed. */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		ResetWalPrefetcher();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033fc20..82829d7854 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -111,6 +111,8 @@ extern int	wal_keep_segments;
 extern int	XLOGbuffers;
 extern int	XLogArchiveTimeout;
 extern int	wal_retrieve_retry_interval;
+extern int	max_wal_prefetch_distance;
+extern bool wal_prefetch_fpw;
 extern char *XLogArchiveCommand;
 extern bool EnableHotStandby;
 extern bool fullPageWrites;
@@ -319,6 +321,8 @@ extern void SetWalWriterSleeping(bool sleeping);
 
 extern void XLogRequestWalReceiverReply(void);
 
+extern void ResetWalPrefetcher(void);
+
 extern void assign_max_wal_size(int newval, void *extra);
 extern void assign_checkpoint_completion_target(double newval, void *extra);
 
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
new file mode 100644
index 0000000000..585f5564a3
--- /dev/null
+++ b/src/include/access/xlogprefetcher.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.h
+ *		Declarations for the XLog prefetching facility
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetcher.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCHER_H
+#define XLOGPREFETCHER_H
+
+#include "access/xlogdefs.h"
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch, XLogRecPtr replaying_lsn);
+
+extern size_t XLogPrefetcherShmemSize(void);
+extern void XLogPrefetcherShmemInit(void);
+
+#endif
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5181a077d9..1c8e67d74a 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,6 +47,26 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
 
+/*
+ * A pointer to an XLogReadLocalOptions struct can be supplied as the private
+ * data for an xlog reader, causing read_local_xlog_page to modify its
+ * behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+	/* Don't block waiting for new WAL to arrive. */
+	bool		nowait;
+
+	/* How far to read. */
+	enum {
+		XLRO_WALRCV_WRITTEN,
+		XLRO_LSN
+	} read_upto_policy;
+
+	/* If read_upto_policy is XLRO_LSN, the LSN. */
+	XLogRecPtr lsn;
+} XLogReadLocalOptions;
+
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
 								 XLogRecPtr targetRecPtr, char *cur_page);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7fb574f9dc..742741afa1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6082,6 +6082,14 @@
   prorettype => 'bool', proargtypes => '',
   prosrc => 'pg_is_wal_replay_paused' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_wal_prefetcher', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{int8,int8,int8,int8,int8,int4,int4}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth}',
+  prosrc => 'pg_stat_get_wal_prefetcher' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ce93ace76c..7d076a9743 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -438,5 +438,7 @@ extern void assign_search_path(const char *newval, void *extra);
 /* in access/transam/xlog.c */
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+extern void assign_max_wal_prefetch_distance(int new_value, void *extra);
+extern void assign_wal_prefetch_fpw(bool new_value, void *extra);
 
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c7304611c3..63bbb796fc 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2102,6 +2102,14 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((pg_stat_all_tables.schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (pg_stat_all_tables.schemaname !~ '^pg_toast'::text));
+pg_stat_wal_prefetcher| SELECT s.prefetch,
+    s.skip_hit,
+    s.skip_new,
+    s.skip_fpw,
+    s.skip_seq,
+    s.distance,
+    s.queue_depth
+   FROM pg_stat_get_wal_prefetcher() s(prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth);
 pg_stat_wal_receiver| SELECT s.pid,
     s.status,
     s.receive_start_lsn,
-- 
2.20.1

#10Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#9)
Re: WIP: WAL prefetch (another approach)

Hi,

On 2020-03-18 18:18:44 +1300, Thomas Munro wrote:

From 1b03eb5ada24c3b23ab8ca6db50e0c5d90d38259 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).

A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk. To do that, it
needs to be able to see the write pointer advancing in shared memory.

The function formerly bearing this name was recently renamed to
WalRcvGetFlushRecPtr(), which better described what it does.

Hm. I'm a bit wary of reusing the name with a different meaning. If
there's any external references, this'll hide that they need to
adapt. Perhaps, even if it's a bit clunky, name it GetUnflushedRecPtr?

From c62fde23f70ff06833d743a1c85716e15f3c813c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 17:26:41 +1300
Subject: [PATCH 4/5] Allow PrefetchBuffer() to report what happened.

Report whether a prefetch was actually initiated due to a cache miss, so
that callers can limit the number of concurrent I/Os they try to issue,
without counting the prefetch calls that did nothing because the page
was already in our buffers.

If the requested block was already cached, return a valid buffer. This
might enable future code to avoid a buffer mapping lookup, though it
will need to recheck the buffer before using it because it's not pinned
so could be reclaimed at any time.

Report neither hit nor miss when a relation's backing file is missing,
to prepare for use during recovery. This will be used to handle cases
of relations that are referenced in the WAL but have been unlinked
already due to actions covered by WAL records that haven't been replayed
yet, after a crash.

We probably should take this into account in nodeBitmapHeapscan.c

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d30aed6fd9..4ceb40a856 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -469,11 +469,13 @@ static int	ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
/*
* Implementation of PrefetchBuffer() for shared buffers.
*/
-void
+PrefetchBufferResult
PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
ForkNumber forkNum,
BlockNumber blockNum)
{
+	PrefetchBufferResult result = { InvalidBuffer, false };
+
#ifdef USE_PREFETCH
BufferTag	newTag;		/* identity of requested block */
uint32		newHash;	/* hash value for newTag */
@@ -497,7 +499,23 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
/* If not in buffers, initiate prefetch */
if (buf_id < 0)
-		smgrprefetch(smgr_reln, forkNum, blockNum);
+	{
+		/*
+		 * Try to initiate an asynchronous read.  This returns false in
+		 * recovery if the relation file doesn't exist.
+		 */
+		if (smgrprefetch(smgr_reln, forkNum, blockNum))
+			result.initiated_io = true;
+	}
+	else
+	{
+		/*
+		 * Report the buffer it was in at that time.  The caller may be able
+		 * to avoid a buffer table lookup, but it's not pinned and it must be
+		 * rechecked!
+		 */
+		result.buffer = buf_id + 1;

Perhaps it'd be better to name this "last_buffer" or such, to make it
clearer that it may be outdated?

-void
+PrefetchBufferResult
PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
{
#ifdef USE_PREFETCH
@@ -540,13 +564,17 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
errmsg("cannot access temporary tables of other sessions")));

/* pass it off to localbuf.c */
-		PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
+		return PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
}
else
{
/* pass it to the shared buffer version */
-		PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
+		return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
}
+#else
+	PrefetchBuffer result = { InvalidBuffer, false };
+
+	return result;
#endif							/* USE_PREFETCH */
}

Hm. Now that results are returned indicating whether the buffer is in
s_b - shouldn't the return value be accurate regardless of USE_PREFETCH?
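
FWIW, one way to make that accurate (just a sketch, untested, not part of
the patch) would be to move the #ifdef down so that it only guards the
smgrprefetch() call, and drop the #ifdef from PrefetchBuffer() entirely.
PrefetchSharedBuffer() would then look roughly like:

PrefetchBufferResult
PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
					 ForkNumber forkNum,
					 BlockNumber blockNum)
{
	PrefetchBufferResult result = { InvalidBuffer, false };
	BufferTag	newTag;			/* identity of requested block */
	uint32		newHash;		/* hash value for newTag */
	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
	int			buf_id;

	Assert(BlockNumberIsValid(blockNum));

	/* create a tag so we can look the block up in the buffer pool */
	INIT_BUFFERTAG(newTag, smgr_reln->smgr_rnode.node, forkNum, blockNum);
	newHash = BufTableHashCode(&newTag);
	newPartitionLock = BufMappingPartitionLock(newHash);

	/* do the mapping lookup unconditionally */
	LWLockAcquire(newPartitionLock, LW_SHARED);
	buf_id = BufTableLookup(&newTag, newHash);
	LWLockRelease(newPartitionLock);

	if (buf_id >= 0)
	{
		/* hit: report the (unpinned, must-recheck) buffer */
		result.buffer = buf_id + 1;
	}
#ifdef USE_PREFETCH
	else if (smgrprefetch(smgr_reln, forkNum, blockNum))
	{
		/* miss: asked the kernel to start reading it in */
		result.initiated_io = true;
	}
#endif

	return result;
}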

+/*
+ * Type returned by PrefetchBuffer().
+ */
+typedef struct PrefetchBufferResult
+{
+	Buffer		buffer;			/* If valid, a hit (recheck needed!) */

I assume there's no user of this yet? Even if there's not, I wonder if
it still is worth adding and referencing a helper to do so correctly?
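
Such a helper might not need to be much more than re-checking the tag.  A
hypothetical sketch (name invented, and even a match is only a hint, since
the buffer was never pinned):

/*
 * Check whether a buffer returned by an earlier PrefetchBuffer() call
 * still appears to contain the expected block.  The buffer is not pinned,
 * so the caller must still pin it and recheck before relying on it.
 */
static inline bool
PrefetchedBufferStillValid(Buffer buffer, RelFileNode rnode,
						   ForkNumber forknum, BlockNumber blockno)
{
	BufferDesc *bufHdr;
	BufferTag	tag;

	if (!BufferIsValid(buffer) || BufferIsLocal(buffer))
		return false;

	bufHdr = GetBufferDescriptor(buffer - 1);
	INIT_BUFFERTAG(tag, rnode, forknum, blockno);

	return BUFFERTAGS_EQUAL(bufHdr->tag, tag);
}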

From 42ba0a89260d46230ac0df791fae18bfdca0092f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 18 Mar 2020 16:35:27 +1300
Subject: [PATCH 5/5] Prefetch referenced blocks during recovery.

Introduce a new GUC max_wal_prefetch_distance. If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks. The
goal is to avoid I/O stalls and benefit from concurrent I/O. The number
of concurrent asynchronous reads is capped by the existing
maintenance_io_concurrency GUC. The feature is disabled by default.

Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion:
/messages/by-id/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com

Why is it disabled by default? Just for "risk management"?

+     <varlistentry id="guc-max-wal-prefetch-distance" xreflabel="max_wal_prefetch_distance">
+      <term><varname>max_wal_prefetch_distance</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_wal_prefetch_distance</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        The maximum distance to look ahead in the WAL during recovery, to find
+        blocks to prefetch.  Prefetching blocks that will soon be needed can
+        reduce I/O wait times.  The number of concurrent prefetches is limited
+        by this setting as well as <xref linkend="guc-maintenance-io-concurrency"/>.
+        If this value is specified without units, it is taken as bytes.
+        The default is -1, meaning that WAL prefetching is disabled.
+       </para>
+      </listitem>
+     </varlistentry>

Is it worth noting that a too large distance could hurt, because the
buffers might get evicted again?

+     <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw">
+      <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks with full page images during recovery.
+        Usually this doesn't help, since such blocks will not be read.  However,
+        on file systems with a block size larger than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a costly
+        read-before-write when such blocks are later written.
+        This setting has no effect unless
+        <xref linkend="guc-max-wal-prefetch-distance"/> is set to a positive number.
+        The default is off.
+       </para>
+      </listitem>
+     </varlistentry>

Hm. I think this needs more details - it's not clear enough what this
actually controls. I assume it's about prefetching for WAL records that
contain the FPW, but it also could be read to be about not prefetching
any pages that had FPWs before, or such?

</variablelist>
</sect2>
<sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 987580d6df..df4291092b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
</entry>
</row>
+     <row>
+      <entry><structname>pg_stat_wal_prefetcher</structname><indexterm><primary>pg_stat_wal_prefetcher</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-wal-prefetcher-view"/> for details.
+      </entry>
+     </row>
+

'prefetcher' somehow sounds odd to me. I also suspect that we'll want to
have additional prefetching stat tables going forward. Perhaps
'pg_stat_prefetch_wal'?

+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+   </tbody>
+   </tgroup>
+  </table>

Is there a way we could have a "historical" version of at least some of
these? An average queue depth, or such?

It'd be useful to somewhere track the time spent initiating prefetch
requests. Otherwise it's quite hard to evaluate whether the queue is too
deep (and just blocks in the OS).

I think it'd be good to have a 'reset time' column.

+  <para>
+   The <structname>pg_stat_wal_prefetcher</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-max-wal-prefetch-distance"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-max-wal-prefetch-distance"/>,
+   <xref linkend="guc-wal-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+

So pg_stat_reset_shared() cannot be used? If so, why?

It sounds like the counters aren't persisted via the stats system - if
so, why?

@@ -7105,6 +7114,31 @@ StartupXLOG(void)
/* Handle interrupt signals of startup process */
HandleStartupProcInterrupts();

+				/*
+				 * The first time through, or if any relevant settings or the
+				 * WAL source changes, we'll restart the prefetching machinery
+				 * as appropriate.  This is simpler than trying to handle
+				 * various complicated state changes.
+				 */
+				if (unlikely(reset_wal_prefetcher))
+				{
+					/* If we had one already, destroy it. */
+					if (prefetcher)
+					{
+						XLogPrefetcherFree(prefetcher);
+						prefetcher = NULL;
+					}
+					/* If we want one, create it. */
+					if (max_wal_prefetch_distance > 0)
+							prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr,
+																currentSource == XLOG_FROM_STREAM);
+					reset_wal_prefetcher = false;
+				}

Do we really need all of this code in StartupXLOG() itself? Could it be
in HandleStartupProcInterrupts() or at least a helper routine called
here?

+				/* Perform WAL prefetching, if enabled. */
+				if (prefetcher)
+					XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr);
+
/*
* Pause WAL replay, if requested by a hot-standby session via
* SetRecoveryPause().

Personally, I'd rather have the if () be in
XLogPrefetcherReadAhead(). With an inline wrapper doing the check, if
the call bothers you (but I don't think it needs to).

+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ *		Prefetching support for PostgreSQL write-ahead log manager
+ *

An architectural overview here would be good.

+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	XLogReadLocalOptions options;
+	bool			have_record;
+	bool			shutdown;
+	int				next_block_id;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB		   *filter_table;
+	dlist_head		filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	XLogRecPtr	   *prefetch_queue;
+	int				prefetch_queue_size;
+	int				prefetch_head;
+	int				prefetch_tail;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation	last_reln;
+	RelFileNode		last_rnode;
+	BlockNumber		last_blkno;

Do you have a comment somewhere explaining why you want to avoid
seqscans (I assume it's about avoiding regressions in linux, but only
because I recall chatting with you about it).

+/*
+ * On modern systems this is really just *counter++.  On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.  The counters will only be written to by one process, and there
+ * is no ordering requirement, so there's no point in using higher overhead
+ * pg_atomic_fetch_add_u64().
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)
+{
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}

Could be worthwhile to add to the atomics infrastructure itself - on the
platforms where this needs spinlocks this will lead to two acquisitions,
rather than one.

+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
+{
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+	XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher));
+
+	prefetcher->options.nowait = true;
+	if (streaming)
+	{
+		/*
+		 * We're only allowed to read as far as the WAL receiver has written.
+		 * We don't have to wait for it to be flushed, though, as recovery
+		 * does, so that gives us a chance to get a bit further ahead.
+		 */
+		prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+	}
+	else
+	{
+		/* We're allowed to read as far as we can. */
+		prefetcher->options.read_upto_policy = XLRO_LSN;
+		prefetcher->options.lsn = (XLogRecPtr) -1;
+	}
+	prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+											NULL,
+											read_local_xlog_page,
+											&prefetcher->options);
+	prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	/*
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.
+	 */
+	prefetcher->prefetch_queue_size = maintenance_io_concurrency;
+	prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);
+	prefetcher->prefetch_head = prefetcher->prefetch_tail = 0;
+
+	/* Prepare to read at the given LSN. */
+	ereport(LOG,
+			(errmsg("WAL prefetch started at %X/%X",
+					(uint32) (lsn >> 32), (uint32) lsn)));
+	XLogBeginRead(prefetcher->reader, lsn);
+
+	XLogPrefetcherResetMonitoringStats();
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	double		avg_distance = 0;
+	double		avg_queue_depth = 0;
+
+	/* Log final statistics. */
+	if (prefetcher->samples > 0)
+	{
+		avg_distance = prefetcher->distance_sum / prefetcher->samples;
+		avg_queue_depth = prefetcher->queue_depth_sum / prefetcher->samples;
+	}
+	ereport(LOG,
+			(errmsg("WAL prefetch finished at %X/%X; "
+					"prefetch = " UINT64_FORMAT ", "
+					"skip_hit = " UINT64_FORMAT ", "
+					"skip_new = " UINT64_FORMAT ", "
+					"skip_fpw = " UINT64_FORMAT ", "
+					"skip_seq = " UINT64_FORMAT ", "
+					"avg_distance = %f, "
+					"avg_queue_depth = %f",
+			 (uint32) (prefetcher->reader->EndRecPtr >> 32),
+			 (uint32) (prefetcher->reader->EndRecPtr),
+			 pg_atomic_read_u64(&MonitoringStats->prefetch),
+			 pg_atomic_read_u64(&MonitoringStats->skip_hit),
+			 pg_atomic_read_u64(&MonitoringStats->skip_new),
+			 pg_atomic_read_u64(&MonitoringStats->skip_fpw),
+			 pg_atomic_read_u64(&MonitoringStats->skip_seq),
+			 avg_distance,
+			 avg_queue_depth)));
+	XLogReaderFree(prefetcher->reader);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher->prefetch_queue);
+	pfree(prefetcher);
+
+	XLogPrefetcherResetMonitoringStats();
+}

It's possibly overkill, but I think it'd be a good idea to do all the
allocations within a prefetch specific memory context. That makes
detecting potential leaks or such easier.

+ /* Can we drop any filters yet, due to problem records begin replayed? */

Odd grammar.

+ XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);

Hm, why isn't this part of the loop below?

+	/* Main prefetch loop. */
+	for (;;)
+	{

This kind of looks like a separate process' main loop. The name
indicates similar. And there's no architecture documentation
disinclining one from that view...

The loop body is quite long. I think it should be split into a number of
helper functions. Perhaps one to ensure a block is read, one to maintain
stats, and then one to process block references?

+		/*
+		 * Scan the record for block references.  We might already have been
+		 * partway through processing this record when we hit maximum I/O
+		 * concurrency, so start where we left off.
+		 */
+		for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i)
+		{

Super pointless nitpickery: For a loop-body this big I'd rather name 'i'
'blockid' or such.

Greetings,

Andres Freund

#11Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#10)
8 attachment(s)
Re: WIP: WAL prefetch (another approach)

Hi,

Thanks for all that feedback. It's been a strange couple of weeks,
but I finally have a new version that addresses most of that feedback
(but punts on a couple of suggestions for later development, due to
lack of time).

It also fixes a couple of other problems I found with the previous version:

1. While streaming, whenever it hit the end of available data (ie LSN
written by WAL receiver), it would close and then reopen the WAL
segment. Fixed by the machinery in 0007 which allows for "would
block" as distinct from other errors.

2. During crash recovery, there were some edge cases where it would
try to read the next WAL segment when there isn't one. Also fixed by
0007.

3. It was maxing out at maintenance_io_concurrency - 1 due to a silly
circular buffer fence post bug.
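
(To sketch that class of bug, not the patch's exact code: a ring buffer that
uses only head/tail indexes and treats head == tail as "empty" can never hold
more than size - 1 entries, so the number of in-flight prefetches tops out one
short of maintenance_io_concurrency.  Keeping an explicit element count avoids
wasting the slot:)

typedef struct PrefetchQueue
{
	XLogRecPtr *lsns;			/* slots for in-flight prefetches */
	int			size;			/* maintenance_io_concurrency */
	int			head;			/* next slot to fill */
	int			count;			/* number of occupied slots */
} PrefetchQueue;

static inline bool
prefetch_queue_full(PrefetchQueue *q)
{
	/* With head/tail alone, "(head + 1) % size == tail" wastes one slot. */
	return q->count == q->size;
}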

Note that 0006 is just for illustration, it's not proposed for commit.

On Wed, Mar 25, 2020 at 11:31 AM Andres Freund <andres@anarazel.de> wrote:

On 2020-03-18 18:18:44 +1300, Thomas Munro wrote:

From 1b03eb5ada24c3b23ab8ca6db50e0c5d90d38259 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).

A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk. To do that, it
needs to be able to see the write pointer advancing in shared memory.

The function formerly bearing that name was recently renamed to
WalRcvGetFlushRecPtr(), which better describes what it does.

Hm. I'm a bit wary of reusing the name with a different meaning. If
there's any external references, this'll hide that they need to
adapt. Perhaps, even if it's a bit clunky, name it GetUnflushedRecPtr?

Well, at least external code won't compile due to the change in arguments:

extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart,
TimeLineID *receiveTLI);
extern XLogRecPtr GetWalRcvWriteRecPtr(void);

Anyone who is using that for some kind of data integrity purposes
should hopefully be triggered to investigate, no? I tried to think of
a better naming scheme but...

From c62fde23f70ff06833d743a1c85716e15f3c813c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 17:26:41 +1300
Subject: [PATCH 4/5] Allow PrefetchBuffer() to report what happened.

Report whether a prefetch was actually initiated due to a cache miss, so
that callers can limit the number of concurrent I/Os they try to issue,
without counting the prefetch calls that did nothing because the page
was already in our buffers.

If the requested block was already cached, return a valid buffer. This
might enable future code to avoid a buffer mapping lookup, though it
will need to recheck the buffer before using it because it's not pinned
so could be reclaimed at any time.

Report neither hit nor miss when a relation's backing file is missing,
to prepare for use during recovery. This will be used to handle cases
of relations that are referenced in the WAL but have been unlinked
already due to actions covered by WAL records that haven't been replayed
yet, after a crash.

We probably should take this into account in nodeBitmapHeapscan.c

Indeed. The naive version would be something like:

diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 726d3a2d9a..3cd644d0ac 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -484,13 +484,11 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
 					node->prefetch_iterator = NULL;
 					break;
 				}
-				node->prefetch_pages++;
 
 				/*
 				 * If we expect not to have to actually read this heap page,
 				 * skip this prefetch call, but continue to run the prefetch
-				 * logic normally.  (Would it be better not to increment
-				 * prefetch_pages?)
+				 * logic normally.
 				 *
 				 * This depends on the assumption that the index AM will
 				 * report the same recheck flag for this future heap page as
@@ -504,7 +502,13 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
 											 &node->pvmbuffer));
 
 				if (!skip_fetch)
-					PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+				{
+					PrefetchBufferResult prefetch;
+
+					prefetch = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+					if (prefetch.initiated_io)
+						node->prefetch_pages++;
+				}
 			}
 		}

... but that might get arbitrarily far ahead, so it probably needs
some kind of cap, and the parallel version is a bit more complicated.
Something for later, along with more prefetching opportunities.
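
To make that concrete, a cap could look roughly like the sketch below; it
reuses the surrounding function's prefetch_iterator and tbmpre variables, and
the use of effective_io_concurrency as the limit is only an assumption made
for illustration, not something this patch set does:

	int			initiated_io = 0;

	/* Keep issuing prefetches until we've started a bounded number of I/Os. */
	while (initiated_io < effective_io_concurrency &&
		   (tbmpre = tbm_iterate(prefetch_iterator)) != NULL)
	{
		PrefetchBufferResult prefetch;

		prefetch = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
		if (prefetch.initiated_io)
			initiated_io++;
	}

That way only prefetches that caused real I/O count against the limit, which
is the point of reporting initiated_io from PrefetchBuffer().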

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d30aed6fd9..4ceb40a856 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -469,11 +469,13 @@ static int      ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
/*
* Implementation of PrefetchBuffer() for shared buffers.
*/
-void
+PrefetchBufferResult
PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
ForkNumber forkNum,
BlockNumber blockNum)
{
+     PrefetchBufferResult result = { InvalidBuffer, false };
+
#ifdef USE_PREFETCH
BufferTag       newTag;         /* identity of requested block */
uint32          newHash;        /* hash value for newTag */
@@ -497,7 +499,23 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
/* If not in buffers, initiate prefetch */
if (buf_id < 0)
-             smgrprefetch(smgr_reln, forkNum, blockNum);
+     {
+             /*
+              * Try to initiate an asynchronous read.  This returns false in
+              * recovery if the relation file doesn't exist.
+              */
+             if (smgrprefetch(smgr_reln, forkNum, blockNum))
+                     result.initiated_io = true;
+     }
+     else
+     {
+             /*
+              * Report the buffer it was in at that time.  The caller may be able
+              * to avoid a buffer table lookup, but it's not pinned and it must be
+              * rechecked!
+              */
+             result.buffer = buf_id + 1;

Perhaps it'd be better to name this "last_buffer" or such, to make it
clearer that it may be outdated?

OK. Renamed to "recent_buffer".

-void
+PrefetchBufferResult
PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
{
#ifdef USE_PREFETCH
@@ -540,13 +564,17 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
errmsg("cannot access temporary tables of other sessions")));

/* pass it off to localbuf.c */
-             PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
+             return PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
}
else
{
/* pass it to the shared buffer version */
-             PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
+             return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
}
+#else
+     PrefetchBufferResult result = { InvalidBuffer, false };
+
+     return result;
#endif                                                       /* USE_PREFETCH */
}

Hm. Now that results are returned indicating whether the buffer is in
s_b - shouldn't the return value be accurate regardless of USE_PREFETCH?

Yeah. Done.

+/*
+ * Type returned by PrefetchBuffer().
+ */
+typedef struct PrefetchBufferResult
+{
+     Buffer          buffer;                 /* If valid, a hit (recheck needed!) */

I assume there's no user of this yet? Even if there's not, I wonder if
it still is worth adding and referencing a helper to do so correctly?

It *is* used, but only to see if it's valid. 0006 is a not-for-commit
patch to show how you might use it later to read a buffer. To
actually use this for something like bitmap heap scan, you'd first
need to fix the modularity violations in that code (I mean we have
PrefetchBuffer() in nodeBitmapHeapscan.c, but the corresponding
[ReleaseAnd]ReadBuffer() in heapam.c, and you'd need to get these into
the same module and/or to communicate in some graceful way).

From 42ba0a89260d46230ac0df791fae18bfdca0092f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 18 Mar 2020 16:35:27 +1300
Subject: [PATCH 5/5] Prefetch referenced blocks during recovery.

Introduce a new GUC max_wal_prefetch_distance. If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks. The
goal is to avoid I/O stalls and benefit from concurrent I/O. The number
of concurrent asynchronous reads is capped by the existing
maintenance_io_concurrency GUC. The feature is disabled by default.

Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion:
/messages/by-id/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com

Why is it disabled by default? Just for "risk management"?

Well, it's not free, and might not help you, so not everyone would
want it on. I think the overheads can be mostly removed with more
work in a later release. Perhaps we could commit it enabled by
default, and then discuss it before release after looking at some more
data? On that basis I have now made it default to on, with
max_wal_prefetch_distance = 256kB, if your build has USE_PREFETCH.
Obviously this number can be discussed.

+     <varlistentry id="guc-max-wal-prefetch-distance" xreflabel="max_wal_prefetch_distance">
+      <term><varname>max_wal_prefetch_distance</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_wal_prefetch_distance</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        The maximum distance to look ahead in the WAL during recovery, to find
+        blocks to prefetch.  Prefetching blocks that will soon be needed can
+        reduce I/O wait times.  The number of concurrent prefetches is limited
+        by this setting as well as <xref linkend="guc-maintenance-io-concurrency"/>.
+        If this value is specified without units, it is taken as bytes.
+        The default is -1, meaning that WAL prefetching is disabled.
+       </para>
+      </listitem>
+     </varlistentry>

Is it worth noting that a too large distance could hurt, because the
buffers might get evicted again?

OK, I tried to explain that.

+     <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw">
+      <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks with full page images during recovery.
+        Usually this doesn't help, since such blocks will not be read.  However,
+        on file systems with a block size larger than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a costly
+        read-before-write when blocks are later written.
+        This setting has no effect unless
+        <xref linkend="guc-max-wal-prefetch-distance"/> is set to a positive number.
+        The default is off.
+       </para>
+      </listitem>
+     </varlistentry>

Hm. I think this needs more details - it's not clear enough what this
actually controls. I assume it's about prefetching for WAL records that
contain the FPW, but it also could be read to be about not prefetching
any pages that had FPWs before, or such?

Ok, I have elaborated.

</variablelist>
</sect2>
<sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 987580d6df..df4291092b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
</entry>
</row>
+     <row>
+      <entry><structname>pg_stat_wal_prefetcher</structname><indexterm><primary>pg_stat_wal_prefetcher</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-wal-prefetcher-view"/> for details.
+      </entry>
+     </row>
+

'prefetcher' somehow sounds odd to me. I also suspect that we'll want to
have additional prefetching stat tables going forward. Perhaps
'pg_stat_prefetch_wal'?

Works for me, though while thinking about this I realised that the
"WAL" part was bothering me. It sounds like we're prefetching WAL
itself, which would be a different thing. So I renamed this view to
pg_stat_prefetch_recovery.

Then I renamed the main GUCs that control this thing to:

max_recovery_prefetch_distance
recovery_prefetch_fpw

+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+   </tbody>
+   </tgroup>
+  </table>

Is there a way we could have a "historical" version of at least some of
these? An average queue depth, or such?

Ok, I added simple online averages for distance and queue depth that
take a sample every time recovery advances by 256kB.
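
To illustrate the mechanism (the function name and the next_sample_lsn field
are hypothetical here; distance_sum, queue_depth_sum and samples are the
fields already used when logging the final averages): the prefetcher adds to
running sums and bumps a sample counter, but only once per 256kB of replay
progress:

#define XLOGPREFETCHER_SAMPLE_DISTANCE	(256 * 1024)

static void
XLogPrefetcherMaybeSample(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn,
						  int distance, int queue_depth)
{
	/* Take at most one sample per 256kB of replayed WAL. */
	if (replaying_lsn < prefetcher->next_sample_lsn)
		return;

	prefetcher->distance_sum += distance;
	prefetcher->queue_depth_sum += queue_depth;
	prefetcher->samples++;
	prefetcher->next_sample_lsn = replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
}

The averages reported at the end are then just distance_sum / samples and
queue_depth_sum / samples, as in XLogPrefetcherFree().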

It'd be useful to somewhere track the time spent initiating prefetch
requests. Otherwise it's quite hard to evaluate whether the queue is too
deep (and just blocks in the OS).

I agree that that sounds useful, and I thought about various ways to
do that that involved new views, until I eventually found myself
wondering: why isn't recovery's I/O already tracked via the existing
stats views? For example, why can't I see blks_read, blks_hit,
blk_read_time etc moving in pg_stat_database due to recovery activity?

It seems like if you made that work first, or created a new
pgstatio view for that, then you could add prefetching counters and
timing (if track_io_timing is on) to the existing machinery so that
bufmgr.c would automatically capture it, and then not only recovery
but also stuff like bitmap heap scan could also be measured the same
way.

However, time is short, so I'm not attempting to do anything like that
now. You can measure the posix_fadvise() times with OS facilities in
the meantime.

I think it'd be good to have a 'reset time' column.

Done, as stats_reset following other examples.

+  <para>
+   The <structname>pg_stat_wal_prefetcher</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-max-wal-prefetch-distance"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-max-wal-prefetch-distance"/>,
+   <xref linkend="guc-wal-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+

So pg_stat_reset_shared() cannot be used? If so, why?

Hmm. OK, I made pg_stat_reset_shared('prefetch_recovery') work.

It sounds like the counters aren't persisted via the stats system - if
so, why?

Ok, I made it persist the simple counters by sending them to the stats
collector periodically. The view still shows data straight out of
shmem though, not out of the stats file. Now I'm wondering if I
should have the view show it from the stats file, more like other
things, now that I understand that a bit better... hmm.

@@ -7105,6 +7114,31 @@ StartupXLOG(void)
/* Handle interrupt signals of startup process */
HandleStartupProcInterrupts();

+                             /*
+                              * The first time through, or if any relevant settings or the
+                              * WAL source changes, we'll restart the prefetching machinery
+                              * as appropriate.  This is simpler than trying to handle
+                              * various complicated state changes.
+                              */
+                             if (unlikely(reset_wal_prefetcher))
+                             {
+                                     /* If we had one already, destroy it. */
+                                     if (prefetcher)
+                                     {
+                                             XLogPrefetcherFree(prefetcher);
+                                             prefetcher = NULL;
+                                     }
+                                     /* If we want one, create it. */
+                                     if (max_wal_prefetch_distance > 0)
+                                                     prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr,
+                                                                                                                             currentSource == XLOG_FROM_STREAM);
+                                     reset_wal_prefetcher = false;
+                             }

Do we really need all of this code in StartupXLOG() itself? Could it be
in HandleStartupProcInterrupts() or at least a helper routine called
here?

It's now done differently, so that StartupXLOG() only has three new
lines: XLogPrefetchBegin() before the loop, XLogPrefetch() in the
loop, and XLogPrefetchEnd() after the loop.

+                             /* Perform WAL prefetching, if enabled. */
+                             if (prefetcher)
+                                     XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr);
+
/*
* Pause WAL replay, if requested by a hot-standby session via
* SetRecoveryPause().

Personally, I'd rather have the if () be in
XLogPrefetcherReadAhead(). With an inline wrapper doing the check, if
the call bothers you (but I don't think it needs to).

Done.

+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ *           Prefetching support for PostgreSQL write-ahead log manager
+ *

An architectural overview here would be good.

OK, added.

+struct XLogPrefetcher
+{
+     /* Reader and current reading state. */
+     XLogReaderState *reader;
+     XLogReadLocalOptions options;
+     bool                    have_record;
+     bool                    shutdown;
+     int                             next_block_id;
+
+     /* Book-keeping required to avoid accessing non-existing blocks. */
+     HTAB               *filter_table;
+     dlist_head              filter_queue;
+
+     /* Book-keeping required to limit concurrent prefetches. */
+     XLogRecPtr         *prefetch_queue;
+     int                             prefetch_queue_size;
+     int                             prefetch_head;
+     int                             prefetch_tail;
+
+     /* Details of last prefetch to skip repeats and seq scans. */
+     SMgrRelation    last_reln;
+     RelFileNode             last_rnode;
+     BlockNumber             last_blkno;

Do you have a comment somewhere explaining why you want to avoid
seqscans (I assume it's about avoiding regressions in linux, but only
because I recall chatting with you about it).

I've added a note to the new architectural comments.

+/*
+ * On modern systems this is really just *counter++.  On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.  The counters will only be written to by one process, and there
+ * is no ordering requirement, so there's no point in using higher overhead
+ * pg_atomic_fetch_add_u64().
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)
+{
+     pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}

Could be worthwhile to add to the atomics infrastructure itself - on the
platforms where this needs spinlocks this will lead to two acquisitions,
rather than one.

Ok, I added pg_atomic_unlocked_add_fetch_XXX(). (Could also be
"fetch_add", I don't care, I don't use the result).

+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
+{
+     static HASHCTL hash_table_ctl = {
+             .keysize = sizeof(RelFileNode),
+             .entrysize = sizeof(XLogPrefetcherFilter)
+     };
+     XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher));
+
+     prefetcher->options.nowait = true;
+     if (streaming)
+     {
+             /*
+              * We're only allowed to read as far as the WAL receiver has written.
+              * We don't have to wait for it to be flushed, though, as recovery
+              * does, so that gives us a chance to get a bit further ahead.
+              */
+             prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+     }
+     else
+     {
+             /* We're allowed to read as far as we can. */
+             prefetcher->options.read_upto_policy = XLRO_LSN;
+             prefetcher->options.lsn = (XLogRecPtr) -1;
+     }
+     prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+                                                                                     NULL,
+                                                                                     read_local_xlog_page,
+                                                                                     &prefetcher->options);
+     prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024,
+                                                                                &hash_table_ctl,
+                                                                                HASH_ELEM | HASH_BLOBS);
+     dlist_init(&prefetcher->filter_queue);
+
+     /*
+      * The size of the queue is based on the maintenance_io_concurrency
+      * setting.  In theory we might have a separate queue for each tablespace,
+      * but it's not clear how that should work, so for now we'll just use the
+      * general GUC to rate-limit all prefetching.
+      */
+     prefetcher->prefetch_queue_size = maintenance_io_concurrency;
+     prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);
+     prefetcher->prefetch_head = prefetcher->prefetch_tail = 0;
+
+     /* Prepare to read at the given LSN. */
+     ereport(LOG,
+                     (errmsg("WAL prefetch started at %X/%X",
+                                     (uint32) (lsn >> 32), (uint32) lsn)));
+     XLogBeginRead(prefetcher->reader, lsn);
+
+     XLogPrefetcherResetMonitoringStats();
+
+     return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+     double          avg_distance = 0;
+     double          avg_queue_depth = 0;
+
+     /* Log final statistics. */
+     if (prefetcher->samples > 0)
+     {
+             avg_distance = prefetcher->distance_sum / prefetcher->samples;
+             avg_queue_depth = prefetcher->queue_depth_sum / prefetcher->samples;
+     }
+     ereport(LOG,
+                     (errmsg("WAL prefetch finished at %X/%X; "
+                                     "prefetch = " UINT64_FORMAT ", "
+                                     "skip_hit = " UINT64_FORMAT ", "
+                                     "skip_new = " UINT64_FORMAT ", "
+                                     "skip_fpw = " UINT64_FORMAT ", "
+                                     "skip_seq = " UINT64_FORMAT ", "
+                                     "avg_distance = %f, "
+                                     "avg_queue_depth = %f",
+                      (uint32) (prefetcher->reader->EndRecPtr >> 32),
+                      (uint32) (prefetcher->reader->EndRecPtr),
+                      pg_atomic_read_u64(&MonitoringStats->prefetch),
+                      pg_atomic_read_u64(&MonitoringStats->skip_hit),
+                      pg_atomic_read_u64(&MonitoringStats->skip_new),
+                      pg_atomic_read_u64(&MonitoringStats->skip_fpw),
+                      pg_atomic_read_u64(&MonitoringStats->skip_seq),
+                      avg_distance,
+                      avg_queue_depth)));
+     XLogReaderFree(prefetcher->reader);
+     hash_destroy(prefetcher->filter_table);
+     pfree(prefetcher->prefetch_queue);
+     pfree(prefetcher);
+
+     XLogPrefetcherResetMonitoringStats();
+}

It's possibly overkill, but I think it'd be a good idea to do all the
allocations within a prefetch specific memory context. That makes
detecting potential leaks or such easier.

I looked into that, but in fact it's already pretty clear how much
memory this thing is using, if you call
MemoryContextStats(TopMemoryContext), because it's almost all in a
named hash table:

TopMemoryContext: 155776 total in 6 blocks; 18552 free (8 chunks); 137224 used
XLogPrefetcherFilterTable: 16384 total in 2 blocks; 4520 free (3 chunks); 11864 used
SP-GiST temporary context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
GiST temporary context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
GIN recovery temporary context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
Btree recovery temporary context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
RecoveryLockLists: 8192 total in 1 blocks; 2584 free (0 chunks); 5608 used
PrivateRefCount: 8192 total in 1 blocks; 2584 free (0 chunks); 5608 used
MdSmgr: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
Pending ops context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
LOCALLOCK hash: 8192 total in 1 blocks; 512 free (0 chunks); 7680 used
Timezones: 104128 total in 2 blocks; 2584 free (0 chunks); 101544 used
ErrorContext: 8192 total in 1 blocks; 7928 free (4 chunks); 264 used
Grand total: 358208 bytes in 20 blocks; 86832 free (15 chunks); 271376 used

The XLogPrefetcher struct itself is not measured separately, but I
don't think that's a problem, it's small and there's only ever one at
a time. It's that XLogPrefetcherFilterTable that is of variable size
(though it's often empty). While thinking about this, I made
prefetch_queue into a flexible array rather than a pointer to palloc'd
memory, which seemed a bit tidier.
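
For reference, a flexible-array-member layout along those lines might look
like this (a sketch against the struct shown above, not a verbatim excerpt
from the patch):

struct XLogPrefetcher
{
	/* ... reader, filter table and other members as shown above ... */
	int			prefetch_queue_size;
	int			prefetch_head;
	int			prefetch_tail;
	XLogRecPtr	prefetch_queue[FLEXIBLE_ARRAY_MEMBER];	/* must be last */
};

XLogPrefetcher *prefetcher;

prefetcher = palloc0(offsetof(XLogPrefetcher, prefetch_queue) +
					 sizeof(XLogRecPtr) * maintenance_io_concurrency);
prefetcher->prefetch_queue_size = maintenance_io_concurrency;

This removes one allocation and keeps the queue adjacent to the bookkeeping
fields.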

+ /* Can we drop any filters yet, due to problem records begin replayed? */

Odd grammar.

Rewritten.

+ XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);

Hm, why isn't this part of the loop below?

It only needs to run when replaying_lsn has advanced (ie when records
have been replayed). I hope the new comment makes that clearer.

+     /* Main prefetch loop. */
+     for (;;)
+     {

This kind of looks like a separate process' main loop. The name
indicates similar. And there's no architecture documentation
disinclining one from that view...

OK, I have updated the comment.

The loop body is quite long. I think it should be split into a number of
helper functions. Perhaps one to ensure a block is read, one to maintain
stats, and then one to process block references?

I've broken the function up. It's now:

StartupXLOG()
-> XLogPrefetch()
-> XLogPrefetcherReadAhead()
-> XLogPrefetcherScanRecords()
-> XLogPrefetcherScanBlocks()

+             /*
+              * Scan the record for block references.  We might already have been
+              * partway through processing this record when we hit maximum I/O
+              * concurrency, so start where we left off.
+              */
+             for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i)
+             {

Super pointless nitpickery: For a loop-body this big I'd rather name 'i'
'blockid' or such.

Done.

Attachments:

v6-0007-Allow-XLogReadRecord-to-be-non-blocking.patch (text/x-patch)
From 664ece95655bfba9ed565c77e17a1ca73b5fe11c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 7 Apr 2020 22:56:27 +1200
Subject: [PATCH v6 7/8] Allow XLogReadRecord() to be non-blocking.

Extend read_local_xlog_page() to support non-blocking modes:

1. Reading as far as the WAL receiver has written so far.
2. Reading all the way to the end, when the end LSN is unknown.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/access/transam/xlogreader.c | 37 +++++++++++----
 src/backend/access/transam/xlogutils.c  | 61 +++++++++++++++++++++++--
 src/backend/replication/walsender.c     |  2 +-
 src/include/access/xlogreader.h         |  4 ++
 src/include/access/xlogutils.h          | 16 +++++++
 5 files changed, 107 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f3fea5132f..e2f2998911 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -254,6 +254,9 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  * If the reading fails for some other reason, NULL is also returned, and
  * *errormsg is set to a string with details of the failure.
  *
+ * If the read_page callback is one that returns XLOGPAGEREAD_WOULDBLOCK rather
+ * than waiting for WAL to arrive, NULL is also returned in that case.
+ *
  * The returned pointer (or *errormsg) points to an internal buffer that's
  * valid until the next call to XLogReadRecord.
  */
@@ -543,10 +546,11 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 err:
 
 	/*
-	 * Invalidate the read state. We might read from a different source after
-	 * failure.
+	 * Invalidate the read state, if this was an error. We might read from a
+	 * different source after failure.
 	 */
-	XLogReaderInvalReadState(state);
+	if (readOff != XLOGPAGEREAD_WOULDBLOCK)
+		XLogReaderInvalReadState(state);
 
 	if (state->errormsg_buf[0] != '\0')
 		*errormsg = state->errormsg_buf;
@@ -558,8 +562,9 @@ err:
  * Read a single xlog page including at least [pageptr, reqLen] of valid data
  * via the read_page() callback.
  *
- * Returns -1 if the required page cannot be read for some reason; errormsg_buf
- * is set in that case (unless the error occurs in the read_page callback).
+ * Returns XLOGPAGEREAD_ERROR or XLOGPAGEREAD_WOULDBLOCK if the required page
+ * cannot be read for some reason; errormsg_buf is set in the former case
+ * (unless the error occurs in the read_page callback).
  *
  * We fetch the page from a reader-local cache if we know we have the required
  * data and if there hasn't been any error since caching the data.
@@ -656,8 +661,11 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
+	if (readLen == XLOGPAGEREAD_WOULDBLOCK)
+		return XLOGPAGEREAD_WOULDBLOCK;
+
 	XLogReaderInvalReadState(state);
-	return -1;
+	return XLOGPAGEREAD_ERROR;
 }
 
 /*
@@ -936,6 +944,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	XLogRecPtr	found = InvalidXLogRecPtr;
 	XLogPageHeader header;
 	char	   *errormsg;
+	int			readLen;
 
 	Assert(!XLogRecPtrIsInvalid(RecPtr));
 
@@ -949,7 +958,6 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		XLogRecPtr	targetPagePtr;
 		int			targetRecOff;
 		uint32		pageHeaderSize;
-		int			readLen;
 
 		/*
 		 * Compute targetRecOff. It should typically be equal or greater than
@@ -1030,7 +1038,8 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	}
 
 err:
-	XLogReaderInvalReadState(state);
+	if (readLen != XLOGPAGEREAD_WOULDBLOCK)
+		XLogReaderInvalReadState(state);
 
 	return InvalidXLogRecPtr;
 }
@@ -1081,13 +1090,23 @@ WALRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
 			tli != seg->ws_tli)
 		{
 			XLogSegNo	nextSegNo;
-
 			if (seg->ws_file >= 0)
 				close(seg->ws_file);
 
 			XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize);
 			seg->ws_file = openSegment(nextSegNo, segcxt, &tli);
 
+			/* callback reported that there was no such file */
+			if (seg->ws_file < 0)
+			{
+				errinfo->wre_errno = errno;
+				errinfo->wre_req = segbytes;
+				errinfo->wre_read = readbytes;
+				errinfo->wre_off = startoff;
+				errinfo->wre_seg = *seg;
+				return false;
+			}
+
 			/* Update the current segment info. */
 			seg->ws_tli = tli;
 			seg->ws_segno = nextSegNo;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 6cb143e161..5031877e7c 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
@@ -783,6 +784,30 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	}
 }
 
+/* openSegment callback for WALRead */
+static int
+wal_segment_try_open(XLogSegNo nextSegNo,
+					 WALSegmentContext *segcxt,
+					 TimeLineID *tli_p)
+{
+	TimeLineID	tli = *tli_p;
+	char		path[MAXPGPATH];
+	int			fd;
+
+	XLogFilePath(path, tli, nextSegNo, segcxt->ws_segsize);
+	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+	if (fd >= 0)
+		return fd;
+
+	if (errno != ENOENT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+
+	return -1;					/* keep compiler quiet */
+}
+
 /* openSegment callback for WALRead */
 static int
 wal_segment_open(XLogSegNo nextSegNo, WALSegmentContext * segcxt,
@@ -831,6 +856,8 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	TimeLineID	tli;
 	int			count;
 	WALReadError errinfo;
+	XLogReadLocalOptions *options = (XLogReadLocalOptions *) state->private_data;
+	bool		try_read = false;
 
 	loc = targetPagePtr + reqLen;
 
@@ -845,7 +872,24 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		 * notices recovery finishes, so we only have to maintain it for the
 		 * local process until recovery ends.
 		 */
-		if (!RecoveryInProgress())
+		if (options)
+		{
+			switch (options->read_upto_policy)
+			{
+			case XLRO_WALRCV_WRITTEN:
+				read_upto = GetWalRcvWriteRecPtr();
+				break;
+			case XLRO_END:
+				read_upto = (XLogRecPtr) -1;
+				try_read = true;
+				break;
+			default:
+				read_upto = 0;
+				elog(ERROR, "unknown read_upto_policy value");
+				break;
+			}
+		}
+		else if (!RecoveryInProgress())
 			read_upto = GetFlushRecPtr();
 		else
 			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
@@ -883,6 +927,10 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			if (loc <= read_upto)
 				break;
 
+			/* not enough data there, but we were asked not to wait */
+			if (options && options->nowait)
+				return XLOGPAGEREAD_WOULDBLOCK;
+
 			CHECK_FOR_INTERRUPTS();
 			pg_usleep(1000L);
 		}
@@ -924,7 +972,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	else if (targetPagePtr + reqLen > read_upto)
 	{
 		/* not enough data there */
-		return -1;
+		return XLOGPAGEREAD_ERROR;
 	}
 	else
 	{
@@ -938,8 +986,15 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
 	if (!WALRead(cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &state->seg,
-				 &state->segcxt, wal_segment_open, &errinfo))
+				 &state->segcxt,
+				 try_read ? wal_segment_try_open : wal_segment_open,
+				 &errinfo))
+	{
+		/* Caller asked for XLRO_END, so there may be no file at all. */
+		if (try_read)
+			return XLOGPAGEREAD_ERROR;
 		WALReadRaiseError(&errinfo);
+	}
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 414cf67d3d..37ec3ddc7b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -818,7 +818,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 
 	/* fail if not (implies we are going to shut down) */
 	if (flushptr < targetPagePtr + reqLen)
-		return -1;
+		return XLOGPAGEREAD_ERROR;
 
 	if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
 		count = XLOG_BLCKSZ;	/* more than one block available */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196e18..dc99d02b60 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -50,6 +50,10 @@ typedef struct WALSegmentContext
 
 typedef struct XLogReaderState XLogReaderState;
 
+/* Special negative return values for XLogPageReadCB functions */
+#define XLOGPAGEREAD_ERROR		-1
+#define XLOGPAGEREAD_WOULDBLOCK	-2
+
 /* Function type definition for the read_page callback */
 typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5181a077d9..440dffac1a 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,6 +47,22 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
 
+/*
+ * A pointer to an XLogReadLocalOptions struct can supplied as the private data
+ * for an XLogReader, causing read_local_xlog_page() to modify its behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+	/* Don't block waiting for new WAL to arrive. */
+	bool		nowait;
+
+	/* How far to read. */
+	enum {
+		XLRO_WALRCV_WRITTEN,
+		XLRO_END
+	} read_upto_policy;
+} XLogReadLocalOptions;
+
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
 								 XLogRecPtr targetRecPtr, char *cur_page);
-- 
2.20.1

v6-0008-Prefetch-referenced-blocks-during-recovery.patch (text/x-patch)
From 29fe16d08d3da4bdb6d950f02ba71ae784562663 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 7 Apr 2020 22:56:27 +1200
Subject: [PATCH v6 8/8] Prefetch referenced blocks during recovery.

Introduce a new GUC max_recovery_prefetch_distance.  If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks.  The
goal is to avoid I/O stalls and benefit from concurrent I/O.  The number
of concurrent asynchronous reads is capped by the existing
maintenance_io_concurrency GUC.  The feature is enabled by default for
now, but we might reconsider that before release.

Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  45 +
 doc/src/sgml/monitoring.sgml                  |  71 ++
 doc/src/sgml/wal.sgml                         |  13 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             |  11 +
 src/backend/access/transam/xlogprefetch.c     | 900 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  14 +
 src/backend/postmaster/pgstat.c               |  96 +-
 src/backend/replication/logical/logical.c     |   2 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  45 +-
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/access/xlogprefetch.h             |  81 ++
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/pgstat.h                          |  28 +-
 src/include/utils/guc.h                       |   4 +
 src/test/regress/expected/rules.out           |  11 +
 17 files changed, 1334 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetch.c
 create mode 100644 src/include/access/xlogprefetch.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f68c992213..3e60f306ff 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3121,6 +3121,51 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-recovery-prefetch-distance" xreflabel="max_recovery_prefetch_distance">
+      <term><varname>max_recovery_prefetch_distance</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_recovery_prefetch_distance</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        The maximum distance to look ahead in the WAL during recovery, to find
+        blocks to prefetch.  Prefetching blocks that will soon be needed can
+        reduce I/O wait times.  The number of concurrent prefetches is limited
+        by this setting as well as
+        <xref linkend="guc-maintenance-io-concurrency"/>.  Setting it too high
+        might be counterproductive, if it means that data falls out of the
+        kernel cache before it is needed.  If this value is specified without
+        units, it is taken as bytes.  A setting of -1 disables prefetching
+        during recovery.
+        The default is 256kB on systems that support
+        <function>posix_fadvise</function>, and otherwise -1.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+      <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks that were logged with full page images,
+        during recovery.  Often this doesn't help, since such blocks will not
+        be read the first time they are needed and might remain in the buffer
+        pool after that.  However, on file systems with a block size larger
+        than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a
+        costly read-before-write when blocks are later written.  This
+        setting has no effect unless
+        <xref linkend="guc-max-recovery-prefetch-distance"/> is set to a positive
+        number.  The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c50b72137f..1229a28675 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2223,6 +2230,68 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    connected server.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+    </row>
+    <row>
+     <entry><structfield>skip_seq</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+    </row>
+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+   </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-max-recovery-prefetch-distance"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-max-recovery-prefetch-distance"/>,
+   <xref linkend="guc-recovery-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="3">
@@ -3446,6 +3515,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        counters shown in the <structname>pg_stat_bgwriter</structname> view.
        Calling <literal>pg_stat_reset_shared('archiver')</literal> will zero all the
        counters shown in the <structname>pg_stat_archiver</structname> view.
+       Calling <literal>pg_stat_reset_shared('prefetch_recovery')</literal> will zero all the
+       counters shown in the <structname>pg_stat_prefetch_recovery</structname> view.
       </entry>
      </row>
 
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index bd9fae544c..38fc8149a8 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -719,6 +719,19 @@
    <acronym>WAL</acronym> call being logged to the server log. This
    option might be replaced by a more general mechanism in the future.
   </para>
+
+  <para>
+   The <xref linkend="guc-max-recovery-prefetch-distance"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed, in combination with the
+   <xref linkend="guc-maintenance-io-concurrency"/> parameter.  The
+   prefetching mechanism is most likely to be effective on systems
+   with <varname>full_page_writes</varname> set to
+   <varname>off</varname> (where that is safe), and where the working
+   set is larger than RAM.  By default, prefetching in recovery is enabled,
+   but it can be disabled by setting the distance to -1.
+  </para>
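+
+  <para>
+   For example, on a hot standby one way to try prefetching without a
+   restart is to adjust both settings and reload the configuration; this is
+   a sketch only, and assumes a superuser session on the standby:
+<programlisting>
+ALTER SYSTEM SET max_recovery_prefetch_distance = '256kB';
+ALTER SYSTEM SET recovery_prefetch_fpw = on;
+SELECT pg_reload_conf();
+</programlisting>
+  </para>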
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..39f9d4e77d 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetch.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 658af40816..4b7f902462 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetch.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -7116,6 +7117,7 @@ StartupXLOG(void)
 		{
 			ErrorContextCallback errcallback;
 			TimestampTz xtime;
+			XLogPrefetchState prefetch;
 
 			InRedo = true;
 
@@ -7123,6 +7125,9 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
 
+			/* Prepare to prefetch, if configured. */
+			XLogPrefetchBegin(&prefetch);
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7152,6 +7157,10 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/* Perform WAL prefetching, if enabled. */
+				XLogPrefetch(&prefetch, xlogreader->ReadRecPtr,
+							 currentSource == XLOG_FROM_STREAM);
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -7339,6 +7348,7 @@ StartupXLOG(void)
 			/*
 			 * end of main redo apply loop
 			 */
+			XLogPrefetchEnd(&prefetch);
 
 			if (reachedRecoveryTarget)
 			{
@@ -11970,6 +11980,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					currentSource = XLOG_FROM_STREAM;
 					startWalReceiver = true;
+					XLogPrefetchReconfigure();
 					break;
 
 				case XLOG_FROM_STREAM:
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
new file mode 100644
index 0000000000..c190ffb6bd
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -0,0 +1,900 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetch.c
+ *
+ * The goal of this module is to read future WAL records and issue
+ * PrefetchSharedBuffer() calls for referenced blocks, so that we avoid I/O
+ * stalls in the main recovery loop.  Currently, this is achieved by using a
+ * separate XLogReader to read ahead.  In future, we should find a way to
+ * avoid reading and decoding each record twice.
+ *
+ * When examining a WAL record from the future, we need to consider that a
+ * referenced block or segment file might not exist on disk until this record
+ * or some earlier record has been replayed.  After a crash, a file might also
+ * be missing because it was dropped by a later WAL record; in that case, it
+ * will be recreated when this record is replayed.  These cases are handled by
+ * recognizing them and adding a "filter" that prevents all prefetching of a
+ * certain block range until the present WAL record has been replayed.  Blocks
+ * skipped for these reasons are counted as "skip_new" (that is, cases where we
+ * didn't try to prefetch "new" blocks).
+ *
+ * There is some evidence that it's better to let the operating system detect
+ * sequential access and do its own prefetching.  Explicit prefetching is
+ * therefore skipped for sequential blocks, counted with "skip_seq".
+ *
+ * The only way we currently have to know that an I/O initiated with
+ * PrefetchSharedBuffer() has completed is to call ReadBuffer().  Therefore,
+ * we track the number of potentially in-flight I/Os by using a circular
+ * buffer of LSNs.  When it's full, we have to wait for recovery to replay
+ * records so that the queue depth can be reduced, before we can do any more
+ * prefetching.  Ideally, this keeps us the right distance ahead to respect
+ * maintenance_io_concurrency.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetch.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log
+ * message that appears at the end of crash recovery.  It's also used to send
+ * messages periodically to the stats collector, to save the counters on disk.
+ */
+#define XLOGPREFETCHER_SAMPLE_DISTANCE 0x40000	/* 256kB of WAL */
+
+/* GUCs */
+int			max_recovery_prefetch_distance = -1;
+bool		recovery_prefetch_fpw = false;
+
+int			XLogPrefetchReconfigureCount;
+
+/*
+ * A prefetcher object.  There is at most one of these in existence at a time,
+ * recreated whenever there is a configuration change.
+ */
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	XLogReadLocalOptions options;
+	bool			have_record;
+	bool			shutdown;
+	int				next_block_id;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation	last_reln;
+	RelFileNode		last_rnode;
+	BlockNumber		last_blkno;
+
+	/* Online averages. */
+	uint64			samples;
+	double			avg_queue_depth;
+	double			avg_distance;
+	XLogRecPtr		next_sample_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB		   *filter_table;
+	dlist_head		filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	int				prefetch_head;
+	int				prefetch_tail;
+	int				prefetch_queue_size;
+	XLogRecPtr		prefetch_queue[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode		rnode;
+	XLogRecPtr		filter_until_replayed;
+	BlockNumber		filter_from_block;
+	dlist_node		link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time; /* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 skip_hit;	/* Blocks already buffered. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+	pg_atomic_uint64 skip_seq;	/* Sequential/repeat blocks skipped. */
+	float			avg_distance;
+	float			avg_queue_depth;
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32			reset_handled;
+
+	/* Dynamic values */
+	int				distance;	/* Number of bytes ahead in the WAL. */
+	int				queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+static void XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher,
+									  XLogRecPtr replaying_lsn);
+static bool XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher);
+static void XLogPrefetchSaveStats(void);
+static void XLogPrefetchRestoreStats(void);
+
+static XLogPrefetchStats *Stats;
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&Stats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&Stats->prefetch, 0);
+	pg_atomic_write_u64(&Stats->skip_hit, 0);
+	pg_atomic_write_u64(&Stats->skip_new, 0);
+	pg_atomic_write_u64(&Stats->skip_fpw, 0);
+	pg_atomic_write_u64(&Stats->skip_seq, 0);
+	Stats->avg_distance = 0;
+	Stats->avg_queue_depth = 0;
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	Stats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+	if (!found)
+	{
+		pg_atomic_init_u32(&Stats->reset_request, 0);
+		Stats->reset_handled = 0;
+		pg_atomic_init_u64(&Stats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&Stats->prefetch, 0);
+		pg_atomic_init_u64(&Stats->skip_hit, 0);
+		pg_atomic_init_u64(&Stats->skip_new, 0);
+		pg_atomic_init_u64(&Stats->skip_fpw, 0);
+		pg_atomic_init_u64(&Stats->skip_seq, 0);
+		Stats->avg_distance = 0;
+		Stats->avg_queue_depth = 0;
+		Stats->distance = 0;
+		Stats->queue_depth = 0;
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&Stats->reset_request, 1);
+}
+
+/*
+ * Tell the stats collector to serialize the shared memory counters into the
+ * stats file.
+ */
+static void
+XLogPrefetchSaveStats(void)
+{
+	PgStat_RecoveryPrefetchStats serialized = {
+		.prefetch = pg_atomic_read_u64(&Stats->prefetch),
+		.skip_hit = pg_atomic_read_u64(&Stats->skip_hit),
+		.skip_new = pg_atomic_read_u64(&Stats->skip_new),
+		.skip_fpw = pg_atomic_read_u64(&Stats->skip_fpw),
+		.skip_seq = pg_atomic_read_u64(&Stats->skip_seq),
+		.stat_reset_timestamp = pg_atomic_read_u64(&Stats->reset_time)
+	};
+
+	pgstat_send_recoveryprefetch(&serialized);
+}
+
+/*
+ * Try to restore the shared memory counters from the stats file.
+ */
+static void
+XLogPrefetchRestoreStats(void)
+{
+	PgStat_RecoveryPrefetchStats *serialized = pgstat_fetch_recoveryprefetch();
+
+	if (serialized->stat_reset_timestamp != 0)
+	{
+		pg_atomic_write_u64(&Stats->prefetch, serialized->prefetch);
+		pg_atomic_write_u64(&Stats->skip_hit, serialized->skip_hit);
+		pg_atomic_write_u64(&Stats->skip_new, serialized->skip_new);
+		pg_atomic_write_u64(&Stats->skip_fpw, serialized->skip_fpw);
+		pg_atomic_write_u64(&Stats->skip_seq, serialized->skip_seq);
+		pg_atomic_write_u64(&Stats->reset_time, serialized->stat_reset_timestamp);
+	}
+}
+
+/*
+ * Initialize an XLogPrefetchState object and restore the last saved
+ * statistics from disk.
+ */
+void
+XLogPrefetchBegin(XLogPrefetchState *state)
+{
+	XLogPrefetchRestoreStats();
+
+	/* We'll reconfigure on the first call to XLogPrefetch(). */
+	state->prefetcher = NULL;
+	state->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+}
+
+/*
+ * Shut down the prefetching infrastructure, if configured.
+ */
+void
+XLogPrefetchEnd(XLogPrefetchState *state)
+{
+	XLogPrefetchSaveStats();
+
+	if (state->prefetcher)
+		XLogPrefetcherFree(state->prefetcher);
+	state->prefetcher = NULL;
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	/*
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.  We add one to the size
+	 * because our circular buffer has a gap between head and tail when full.
+	 */
+	prefetcher = palloc0(offsetof(XLogPrefetcher, prefetch_queue) +
+						 sizeof(XLogRecPtr) * (maintenance_io_concurrency + 1));
+	prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
+	prefetcher->options.nowait = true;
+	if (streaming)
+	{
+		/*
+		 * We're only allowed to read as far as the WAL receiver has written.
+		 * We don't have to wait for it to be flushed, though, as recovery
+		 * does, so that gives us a chance to get a bit further ahead.
+		 */
+		prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+	}
+	else
+	{
+		/* Read as far as we can. */
+		prefetcher->options.read_upto_policy = XLRO_END;
+	}
+	prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+											NULL,
+											read_local_xlog_page,
+											&prefetcher->options);
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	/* Prepare to read at the given LSN. */
+	ereport(LOG,
+			(errmsg("recovery started prefetching at %X/%X",
+					(uint32) (lsn >> 32), (uint32) lsn)));
+	XLogBeginRead(prefetcher->reader, lsn);
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	/* Log final statistics. */
+	ereport(LOG,
+			(errmsg("recovery finished prefetching at %X/%X; "
+					"prefetch = " UINT64_FORMAT ", "
+					"skip_hit = " UINT64_FORMAT ", "
+					"skip_new = " UINT64_FORMAT ", "
+					"skip_fpw = " UINT64_FORMAT ", "
+					"skip_seq = " UINT64_FORMAT ", "
+					"avg_distance = %f, "
+					"avg_queue_depth = %f",
+			 (uint32) (prefetcher->reader->EndRecPtr >> 32),
+			 (uint32) (prefetcher->reader->EndRecPtr),
+			 pg_atomic_read_u64(&Stats->prefetch),
+			 pg_atomic_read_u64(&Stats->skip_hit),
+			 pg_atomic_read_u64(&Stats->skip_new),
+			 pg_atomic_read_u64(&Stats->skip_fpw),
+			 pg_atomic_read_u64(&Stats->skip_seq),
+			 Stats->avg_distance,
+			 Stats->avg_queue_depth)));
+	XLogReaderFree(prefetcher->reader);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Called when recovery is replaying a new LSN, to check if we can read ahead.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	uint32		reset_request;
+
+	/* If an error has occurred or we've hit the end of the WAL, do nothing. */
+	if (prefetcher->shutdown)
+		return;
+
+	/*
+	 * Have any in-flight prefetches definitely completed, judging by the LSN
+	 * that is currently being replayed?
+	 */
+	XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+	/*
+	 * Do we already have the maximum permitted number of I/Os running
+	 * (according to the information we have)?  If so, we have to wait for at
+	 * least one to complete, so give up early and let recovery catch up.
+	 */
+	if (XLogPrefetcherSaturated(prefetcher))
+		return;
+
+	/*
+	 * Can we drop any filters yet?  This happens when the LSN that is
+	 * currently being replayed has moved past a record that prevents
+	 * prefetching of a block range, such as relation extension.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&Stats->reset_request);
+	if (reset_request != Stats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		Stats->reset_handled = reset_request;
+		prefetcher->avg_distance = 0;
+		prefetcher->avg_queue_depth = 0;
+		prefetcher->samples = 0;
+	}
+
+	/* OK, we can now try reading ahead. */
+	XLogPrefetcherScanRecords(prefetcher, replaying_lsn);
+}
+
+/*
+ * Read ahead as far as we are allowed to, considering the LSN that recovery
+ * is currently replaying.
+ */
+static void
+XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	XLogReaderState *reader = prefetcher->reader;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	for (;;)
+	{
+		char *error;
+		int64		distance;
+
+		/* If we don't already have a record, then try to read one. */
+		if (!prefetcher->have_record)
+		{
+			if (!XLogReadRecord(reader, &error))
+			{
+				/* If we got an error, log it and give up. */
+				if (error)
+				{
+					ereport(LOG, (errmsg("recovery no longer prefetching: %s", error)));
+					prefetcher->shutdown = true;
+					Stats->queue_depth = 0;
+					Stats->distance = 0;
+				}
+				/* Otherwise, we'll try again later when more data is here. */
+				return;
+			}
+			prefetcher->have_record = true;
+			prefetcher->next_block_id = 0;
+		}
+
+		/* How far ahead of replay are we now? */
+		distance = prefetcher->reader->ReadRecPtr - replaying_lsn;
+
+		/* Update distance shown in shm. */
+		Stats->distance = distance;
+
+		/* Periodically recompute some statistics. */
+		if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+			/*
+			 * Compute online averages using the incremental formula
+			 * avg += (x - avg) / n, so we needn't store the samples.
+			 */
+			/* Compute online averages. */
+			prefetcher->samples++;
+			if (prefetcher->samples == 1)
+			{
+				prefetcher->avg_distance = Stats->distance;
+				prefetcher->avg_queue_depth = Stats->queue_depth;
+			}
+			else
+			{
+				prefetcher->avg_distance +=
+					(Stats->distance - prefetcher->avg_distance) /
+					prefetcher->samples;
+				prefetcher->avg_queue_depth +=
+					(Stats->queue_depth - prefetcher->avg_queue_depth) /
+					prefetcher->samples;
+			}
+
+			/* Expose it in shared memory. */
+			Stats->avg_distance = prefetcher->avg_distance;
+			Stats->avg_queue_depth = prefetcher->avg_queue_depth;
+
+			/* Also periodically save the simple counters. */
+			XLogPrefetchSaveStats();
+
+			prefetcher->next_sample_lsn =
+				replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
+		}
+
+		/* Are we too far ahead of replay? */
+		if (distance >= max_recovery_prefetch_distance)
+			break;
+
+		/* Are we not far enough ahead? */
+		if (distance <= 0)
+		{
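+			/*
+			 * This record is at or behind the current replay position, so
+			 * prefetching for it can't help; just consume it and move on to
+			 * the next record.
+			 */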
+			prefetcher->have_record = false;	/* skip this record */
+			continue;
+		}
+
+		/*
+		 * If this is a record that creates a new SMGR relation, we'll avoid
+		 * prefetching anything from that rnode until it has been replayed.
+		 */
+		if (replaying_lsn < reader->ReadRecPtr &&
+			XLogRecGetRmid(reader) == RM_SMGR_ID &&
+			(XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+		{
+			xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+			XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+									reader->ReadRecPtr);
+		}
+
+		/* Scan the record's block references. */
+		if (!XLogPrefetcherScanBlocks(prefetcher))
+			return;
+
+		/* Advance to the next record. */
+		prefetcher->have_record = false;
+	}
+}
+
+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+	XLogReaderState *reader = prefetcher->reader;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	/*
+	 * We might already have been partway through processing this record when
+	 * our queue became saturated, so we need to start where we left off.
+	 */
+	for (int block_id = prefetcher->next_block_id;
+		 block_id <= reader->max_block_id;
+		 ++block_id)
+	{
+		PrefetchBufferResult prefetch;
+		DecodedBkpBlock *block = &reader->blocks[block_id];
+		SMgrRelation reln;
+
+		/* Ignore everything but the main fork for now. */
+		if (block->forknum != MAIN_FORKNUM)
+			continue;
+
+		/*
+		 * If there is a full page image attached, we won't be reading the
+		 * page, so you might think we should skip it.  However, if the
+		 * underlying filesystem uses larger logical blocks than ours, it
+		 * might still need to perform a read-before-write some time later.
+		 * Therefore, only prefetch if configured to do so.
+		 */
+		if (block->has_image && !recovery_prefetch_fpw)
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
+			continue;
+		}
+
+		/*
+		 * If this block will initialize a new page then it's probably an
+		 * extension.  Since it might create a new segment, we can't try
+		 * to prefetch this block until the record has been replayed, or we
+		 * might try to open a file that doesn't exist yet.
+		 */
+		if (block->flags & BKPBLOCK_WILL_INIT)
+		{
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+									reader->ReadRecPtr);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Should we skip this block due to a filter? */
+		if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Fast path for repeated references to the same relation. */
+		if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+		{
+			/*
+			 * If this is a repeat or sequential access, then skip it.  We
+			 * expect the kernel to detect sequential access on its own and do
+			 * a better job than we could.
+			 */
+			if (block->blkno == prefetcher->last_blkno ||
+				block->blkno == prefetcher->last_blkno + 1)
+			{
+				prefetcher->last_blkno = block->blkno;
+				pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
+				continue;
+			}
+
+			/* We can avoid calling smgropen(). */
+			reln = prefetcher->last_reln;
+		}
+		else
+		{
+			/* Otherwise we have to open it. */
+			reln = smgropen(block->rnode, InvalidBackendId);
+			prefetcher->last_rnode = block->rnode;
+			prefetcher->last_reln = reln;
+		}
+		prefetcher->last_blkno = block->blkno;
+
+		/* Try to prefetch this block! */
+		prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+		if (BufferIsValid(prefetch.recent_buffer))
+		{
+			/*
+			 * It was already cached, so do nothing.  Perhaps in future we
+			 * could remember the buffer so that recovery doesn't have to look
+			 * it up again.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
+		}
+		else if (prefetch.initiated_io)
+		{
+			/*
+			 * I/O has possibly been initiated (we can't tell whether the
+			 * kernel already had the block cached, so for lack of better
+			 * information we assume a read was started).  Record
+			 * this as an I/O in progress until eventually we replay this
+			 * LSN.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->prefetch, 1);
+			XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+			/*
+			 * If the queue is now full, we'll have to wait before processing
+			 * any more blocks from this record, or move to a new record if
+			 * that was the last block.
+			 */
+			if (XLogPrefetcherSaturated(prefetcher))
+			{
+				prefetcher->next_block_id = block_id + 1;
+				return false;
+			}
+		}
+		else
+		{
+			/*
+			 * Neither cached nor initiated.  The underlying segment file
+			 * doesn't exist.  Presumably it will be unlinked by a later WAL
+			 * record.  When recovery reads this block, it will use the
+			 * EXTENSION_CREATE_RECOVERY flag.  We certainly don't want to do
+			 * that sort of thing while merely prefetching, so let's just
+			 * ignore references to this relation until this record is
+			 * replayed, and let recovery create the dummy file or complain if
+			 * something is wrong.
+			 */
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+									reader->ReadRecPtr);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Expose statistics about WAL prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&Stats->reset_request) != Stats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&Stats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&Stats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_new));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_fpw));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_seq));
+	values[6] = Int32GetDatum(Stats->distance);
+	values[7] = Int32GetDatum(Stats->queue_depth);
+	values[8] = Float4GetDatum(Stats->avg_distance);
+	values[9] = Float4GetDatum(Stats->avg_queue_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the (presumably lower) block
+		 * number there because we don't want to have to track individual
+		 * blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of the
+	 * time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+												   HASH_FIND, NULL);
+
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Insert an LSN into the queue.  The queue must not be full already.  This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+						  XLogRecPtr prefetching_lsn)
+{
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+	prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+	prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+	Stats->queue_depth++;
+	Assert(Stats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet?  That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches.  For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+		   prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+	{
+		prefetcher->prefetch_tail++;
+		prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+		Stats->queue_depth--;
+		Assert(Stats->queue_depth >= 0);
+	}
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
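+	/*
+	 * One queue slot is deliberately left unused (see XLogPrefetcherAllocate)
+	 * so that a full queue, where the head is immediately behind the tail,
+	 * can be distinguished from an empty one, where head equals tail.
+	 */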
+	return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+		prefetcher->prefetch_tail;
+}
+
+void
+assign_max_recovery_prefetch_distance(int new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	max_recovery_prefetch_distance = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
+
+void
+assign_recovery_prefetch_fpw(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch_fpw = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 813ea8bfc3..3d5afb633e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -825,6 +825,20 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.skip_hit,
+            s.skip_new,
+            s.skip_fpw,
+            s.skip_seq,
+            s.distance,
+            s.queue_depth,
+            s.avg_distance,
+            s.avg_queue_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 9ebde47dea..c0f7333808 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "access/xlogprefetch.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -276,6 +277,7 @@ static int	localNumBackends = 0;
 static PgStat_ArchiverStats archiverStats;
 static PgStat_GlobalStats globalStats;
 static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
+static PgStat_RecoveryPrefetchStats recoveryPrefetchStats;
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -348,6 +350,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
+static void pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1364,11 +1367,20 @@ pgstat_reset_shared_counters(const char *target)
 		msg.m_resettarget = RESET_ARCHIVER;
 	else if (strcmp(target, "bgwriter") == 0)
 		msg.m_resettarget = RESET_BGWRITER;
+	else if (strcmp(target, "prefetch_recovery") == 0)
+	{
+		/*
+		 * We can't ask the stats collector to do this for us as it is not
+		 * attached to shared memory.
+		 */
+		XLogPrefetchRequestResetStats();
+		return;
+	}
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("unrecognized reset target: \"%s\"", target),
-				 errhint("Target must be \"archiver\" or \"bgwriter\".")));
+				 errhint("Target must be \"archiver\", \"bgwriter\" or \"prefetch_recovery\".")));
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
 	pgstat_send(&msg, sizeof(msg));
@@ -2690,6 +2702,22 @@ pgstat_fetch_slru(void)
 }
 
 
+/*
+ * ---------
+ * pgstat_fetch_recoveryprefetch() -
+ *
+ *	Support function for restoring the counters managed by xlogprefetch.c.
+ * ---------
+ */
+PgStat_RecoveryPrefetchStats *
+pgstat_fetch_recoveryprefetch(void)
+{
+	backend_read_statsfile();
+
+	return &recoveryPrefetchStats;
+}
+
+
 /* ------------------------------------------------------------
  * Functions for management of the shared-memory PgBackendStatus array
  * ------------------------------------------------------------
@@ -4440,6 +4468,23 @@ pgstat_send_slru(void)
 }
 
 
+/* ----------
+ * pgstat_send_recoveryprefetch() -
+ *
+ *		Send recovery prefetch statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats)
+{
+	PgStat_MsgRecoveryPrefetch msg;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYPREFETCH);
+	msg.m_stats = *stats;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
 /* ----------
  * PgstatCollectorMain() -
  *
@@ -4636,6 +4681,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_slru(&msg.msg_slru, len);
 					break;
 
+				case PGSTAT_MTYPE_RECOVERYPREFETCH:
+					pgstat_recv_recoveryprefetch(&msg.msg_recoveryprefetch, len);
+					break;
+
 				case PGSTAT_MTYPE_FUNCSTAT:
 					pgstat_recv_funcstat(&msg.msg_funcstat, len);
 					break;
@@ -4911,6 +4960,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
 	rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
 	(void) rc;					/* we'll check for error with ferror */
 
+	/*
+	 * Write recovery prefetch stats struct
+	 */
+	rc = fwrite(&recoveryPrefetchStats, sizeof(recoveryPrefetchStats), 1,
+				fpout);
+	(void) rc;					/* we'll check for error with ferror */
+
 	/*
 	 * Walk through the database table.
 	 */
@@ -5170,6 +5226,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 	memset(&globalStats, 0, sizeof(globalStats));
 	memset(&archiverStats, 0, sizeof(archiverStats));
 	memset(&slruStats, 0, sizeof(slruStats));
+	memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
 
 	/*
 	 * Set the current timestamp (will be kept only in case we can't load an
@@ -5257,6 +5314,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 		goto done;
 	}
 
+	/*
+	 * Read recoveryPrefetchStats struct
+	 */
+	if (fread(&recoveryPrefetchStats, 1, sizeof(recoveryPrefetchStats),
+			  fpin) != sizeof(recoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
+		goto done;
+	}
+
 	/*
 	 * We found an existing collector stats file. Read it and put all the
 	 * hashtable entries into place.
@@ -5556,6 +5625,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 	PgStat_GlobalStats myGlobalStats;
 	PgStat_ArchiverStats myArchiverStats;
 	PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
+	PgStat_RecoveryPrefetchStats myRecoveryPrefetchStats;
 	FILE	   *fpin;
 	int32		format_id;
 	const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -5621,6 +5691,18 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 		return false;
 	}
 
+	/*
+	 * Read recovery prefetch stats struct
+	 */
+	if (fread(&myRecoveryPrefetchStats, 1, sizeof(myRecoveryPrefetchStats),
+			  fpin) != sizeof(myRecoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		FreeFile(fpin);
+		return false;
+	}
+
 	/* By default, we're going to return the timestamp of the global file. */
 	*ts = myGlobalStats.stats_timestamp;
 
@@ -6422,6 +6504,18 @@ pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
 	slruStats[msg->m_index].truncate += msg->m_truncate;
 }
 
+/* ----------
+ * pgstat_recv_recoveryprefetch() -
+ *
+ *	Process a recovery prefetch message.
+ * ----------
+ */
+static void
+pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len)
+{
+	recoveryPrefetchStats = msg->m_stats;
+}
+
 /* ----------
  * pgstat_recv_recoveryconflict() -
  *
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5adf253583..792d90ef4c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -169,7 +169,7 @@ StartupDecodingContext(List *output_plugin_options,
 
 	ctx->slot = slot;
 
-	ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, ctx);
+	ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, NULL);
 	if (!ctx->reader)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..221081bddc 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/xlogprefetch.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -124,6 +125,7 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, PredicateLockShmemSize());
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
+		size = add_size(size, XLogPrefetchShmemSize());
 		size = add_size(size, CLOGShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
@@ -212,6 +214,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 03a22d71ac..6fc9ceb196 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -34,6 +34,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetch.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/storage.h"
@@ -198,6 +199,7 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1272,6 +1274,18 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"recovery_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch blocks that have full page images in the WAL."),
+			gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+						 "entirely overwritten, but if the logical page size of the filesystem is "
+						 "larger than PostgreSQL's, this can be beneficial.  This option has no "
+						 "effect unless max_recovery_prefetch_distance is set to a positive number.")
+		},
+		&recovery_prefetch_fpw,
+		false,
+		NULL, assign_recovery_prefetch_fpw, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2649,6 +2663,22 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"max_recovery_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum number of bytes to read ahead in the WAL to prefetch referenced blocks."),
+			gettext_noop("Set to -1 to disable prefetching during recovery."),
+			GUC_UNIT_BYTE
+		},
+		&max_recovery_prefetch_distance,
+#ifdef USE_PREFETCH
+		256 * 1024,
+#else
+		-1,
+#endif
+		-1, INT_MAX,
+		NULL, assign_max_recovery_prefetch_distance, NULL
+	},
+
 	{
 		{"wal_keep_segments", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the number of WAL files held for standby servers."),
@@ -2955,7 +2985,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -11573,6 +11604,18 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure WAL prefetching, because a setting it depends on changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 1ae8b77306..fd7406b399 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -230,6 +230,11 @@
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
 
+# - Prefetching during recovery -
+
+#max_recovery_prefetch_distance = 256kB	# -1 disables prefetching
+#recovery_prefetch_fpw = off	# whether to prefetch pages logged with FPW
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlogprefetch.h b/src/include/access/xlogprefetch.h
new file mode 100644
index 0000000000..afd807c408
--- /dev/null
+++ b/src/include/access/xlogprefetch.h
@@ -0,0 +1,81 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCH_H
+#define XLOGPREFETCH_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern int	max_recovery_prefetch_distance;
+extern bool recovery_prefetch_fpw;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern int XLogPrefetchReconfigureCount;
+
+typedef struct XLogPrefetchState
+{
+	XLogPrefetcher *prefetcher;
+	int			reconfigure_count;
+} XLogPrefetchState;
+
+/* Functions exposed only for use by the static inline wrappers below. */
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch,
+									XLogRecPtr replaying_lsn);
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchReconfigure(void);
+extern void XLogPrefetchRequestResetStats(void);
+
+extern void XLogPrefetchBegin(XLogPrefetchState *state);
+extern void XLogPrefetchEnd(XLogPrefetchState *state);
+
+/*
+ * Tell the prefetching module that we are now replaying a given LSN, so that
+ * it can decide how far ahead to read in the WAL, if configured.
+ */
+static inline void
+XLogPrefetch(XLogPrefetchState *state,
+			 XLogRecPtr replaying_lsn,
+			 bool from_stream)
+{
+	/*
+	 * Handle any configuration changes.  Rather than trying to deal with
+	 * various parameter changes, we just tear down and set up a new
+	 * prefetcher if anything we depend on changes.
+	 */
+	if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
+	{
+		/* If we had a prefetcher, tear it down. */
+		if (state->prefetcher)
+		{
+			XLogPrefetcherFree(state->prefetcher);
+			state->prefetcher = NULL;
+		}
+		/* If we want a prefetcher, set it up. */
+		if (max_recovery_prefetch_distance > 0)
+			state->prefetcher = XLogPrefetcherAllocate(replaying_lsn,
+													   from_stream);
+		state->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	if (state->prefetcher)
+		XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
+}
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2d1862a9d8..a0dabe2d18 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6138,6 +6138,14 @@
   prorettype => 'bool', proargtypes => '',
   prosrc => 'pg_is_wal_replay_paused' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,float4,float4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth,avg_distance,avg_queue_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9988..701eeaeb01 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -63,6 +63,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_SLRU,
+	PGSTAT_MTYPE_RECOVERYPREFETCH,
 	PGSTAT_MTYPE_FUNCSTAT,
 	PGSTAT_MTYPE_FUNCPURGE,
 	PGSTAT_MTYPE_RECOVERYCONFLICT,
@@ -183,6 +184,19 @@ typedef struct PgStat_TableXactStatus
 	struct PgStat_TableXactStatus *next;	/* next of same subxact */
 } PgStat_TableXactStatus;
 
+/*
+ * Recovery prefetching statistics persisted on disk by pgstat.c, but kept in
+ * shared memory by xlogprefetch.c.
+ */
+typedef struct PgStat_RecoveryPrefetchStats
+{
+	PgStat_Counter prefetch;
+	PgStat_Counter skip_hit;
+	PgStat_Counter skip_new;
+	PgStat_Counter skip_fpw;
+	PgStat_Counter skip_seq;
+	TimestampTz stat_reset_timestamp;
+} PgStat_RecoveryPrefetchStats;
 
 /* ------------------------------------------------------------
  * Message formats follow
@@ -454,6 +468,16 @@ typedef struct PgStat_MsgSLRU
 	PgStat_Counter m_truncate;
 } PgStat_MsgSLRU;
 
+/* ----------
+ * PgStat_MsgRecoveryPrefetch			Sent by XLogPrefetch to save statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgRecoveryPrefetch
+{
+	PgStat_MsgHdr m_hdr;
+	PgStat_RecoveryPrefetchStats m_stats;
+} PgStat_MsgRecoveryPrefetch;
+
 /* ----------
  * PgStat_MsgRecoveryConflict	Sent by the backend upon recovery conflict
  * ----------
@@ -598,6 +622,7 @@ typedef union PgStat_Msg
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgSLRU msg_slru;
+	PgStat_MsgRecoveryPrefetch msg_recoveryprefetch;
 	PgStat_MsgFuncstat msg_funcstat;
 	PgStat_MsgFuncpurge msg_funcpurge;
 	PgStat_MsgRecoveryConflict msg_recoveryconflict;
@@ -761,7 +786,6 @@ typedef struct PgStat_SLRUStats
 	TimestampTz stat_reset_timestamp;
 } PgStat_SLRUStats;
 
-
 /* ----------
  * Backend states
  * ----------
@@ -1464,6 +1488,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1479,6 +1504,7 @@ extern int	pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_RecoveryPrefetchStats *pgstat_fetch_recoveryprefetch(void);
 
 extern void pgstat_count_slru_page_zeroed(SlruCtl ctl);
 extern void pgstat_count_slru_page_hit(SlruCtl ctl);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 2819282181..976cf8b116 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -440,4 +440,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetch.c */
+extern void assign_max_recovery_prefetch_distance(int new_value, void *extra);
+extern void assign_recovery_prefetch_fpw(bool new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6eec8ec568..9eda632b3c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1855,6 +1855,17 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.skip_hit,
+    s.skip_new,
+    s.skip_fpw,
+    s.skip_seq,
+    s.distance,
+    s.queue_depth,
+    s.avg_distance,
+    s.avg_queue_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth, avg_distance, avg_queue_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.20.1

Attachment: v6-0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRela.patch (text/x-patch)
From 956224dfcd9dff3327751323bbb03fdb098dc0a0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 15:25:55 +1300
Subject: [PATCH v6 1/8] Allow PrefetchBuffer() to be called with a
 SMgrRelation.

Previously a Relation was required, but it's annoying to have to create
a "fake" one in recovery.  A new function PrefetchSharedBuffer() is
provided that works with SMgrRelation, and LocalPrefetchBuffer() is
renamed to PrefetchLocalBuffer() to fit with that more natural naming
scheme.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 84 ++++++++++++++++-----------
 src/backend/storage/buffer/localbuf.c |  4 +-
 src/include/storage/buf_internals.h   |  2 +-
 src/include/storage/bufmgr.h          |  3 +
 4 files changed, 56 insertions(+), 37 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7317ac8a2c..22087a1c3c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -480,6 +480,53 @@ static int	ckpt_buforder_comparator(const void *pa, const void *pb);
 static int	ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
 
 
+/*
+ * Implementation of PrefetchBuffer() for shared buffers.
+ */
+void
+PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+					 ForkNumber forkNum,
+					 BlockNumber blockNum)
+{
+#ifdef USE_PREFETCH
+	BufferTag	newTag;		/* identity of requested block */
+	uint32		newHash;	/* hash value for newTag */
+	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
+	int			buf_id;
+
+	Assert(BlockNumberIsValid(blockNum));
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(newTag, smgr_reln->smgr_rnode.node,
+				   forkNum, blockNum);
+
+	/* determine its hash code and partition lock ID */
+	newHash = BufTableHashCode(&newTag);
+	newPartitionLock = BufMappingPartitionLock(newHash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(newPartitionLock, LW_SHARED);
+	buf_id = BufTableLookup(&newTag, newHash);
+	LWLockRelease(newPartitionLock);
+
+	/* If not in buffers, initiate prefetch */
+	if (buf_id < 0)
+		smgrprefetch(smgr_reln, forkNum, blockNum);
+
+	/*
+	 * If the block *is* in buffers, we do nothing.  This is not really ideal:
+	 * the block might be just about to be evicted, which would be stupid
+	 * since we know we are going to need it soon.  But the only easy answer
+	 * is to bump the usage_count, which does not seem like a great solution:
+	 * when the caller does ultimately touch the block, usage_count would get
+	 * bumped again, resulting in too much favoritism for blocks that are
+	 * involved in a prefetch sequence. A real fix would involve some
+	 * additional per-buffer state, and it's not clear that there's enough of
+	 * a problem to justify that.
+	 */
+#endif							/* USE_PREFETCH */
+}
+
 /*
  * PrefetchBuffer -- initiate asynchronous read of a block of a relation
  *
@@ -507,43 +554,12 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 					 errmsg("cannot access temporary tables of other sessions")));
 
 		/* pass it off to localbuf.c */
-		LocalPrefetchBuffer(reln->rd_smgr, forkNum, blockNum);
+		PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
 	}
 	else
 	{
-		BufferTag	newTag;		/* identity of requested block */
-		uint32		newHash;	/* hash value for newTag */
-		LWLock	   *newPartitionLock;	/* buffer partition lock for it */
-		int			buf_id;
-
-		/* create a tag so we can lookup the buffer */
-		INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
-					   forkNum, blockNum);
-
-		/* determine its hash code and partition lock ID */
-		newHash = BufTableHashCode(&newTag);
-		newPartitionLock = BufMappingPartitionLock(newHash);
-
-		/* see if the block is in the buffer pool already */
-		LWLockAcquire(newPartitionLock, LW_SHARED);
-		buf_id = BufTableLookup(&newTag, newHash);
-		LWLockRelease(newPartitionLock);
-
-		/* If not in buffers, initiate prefetch */
-		if (buf_id < 0)
-			smgrprefetch(reln->rd_smgr, forkNum, blockNum);
-
-		/*
-		 * If the block *is* in buffers, we do nothing.  This is not really
-		 * ideal: the block might be just about to be evicted, which would be
-		 * stupid since we know we are going to need it soon.  But the only
-		 * easy answer is to bump the usage_count, which does not seem like a
-		 * great solution: when the caller does ultimately touch the block,
-		 * usage_count would get bumped again, resulting in too much
-		 * favoritism for blocks that are involved in a prefetch sequence. A
-		 * real fix would involve some additional per-buffer state, and it's
-		 * not clear that there's enough of a problem to justify that.
-		 */
+		/* pass it to the shared buffer version */
+		PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
 	}
 #endif							/* USE_PREFETCH */
 }
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index cac08e1b1a..b528bc9553 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -54,14 +54,14 @@ static Block GetLocalBufferStorage(void);
 
 
 /*
- * LocalPrefetchBuffer -
+ * PrefetchLocalBuffer -
  *	  initiate asynchronous read of a block of a relation
  *
  * Do PrefetchBuffer's work for temporary relations.
  * No-op if prefetching isn't compiled in.
  */
 void
-LocalPrefetchBuffer(SMgrRelation smgr, ForkNumber forkNum,
+PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 					BlockNumber blockNum)
 {
 #ifdef USE_PREFETCH
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index bf3b8ad340..166fe334c7 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -327,7 +327,7 @@ extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
 extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
 
 /* localbuf.c */
-extern void LocalPrefetchBuffer(SMgrRelation smgr, ForkNumber forkNum,
+extern void PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 								BlockNumber blockNum);
 extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
 									BlockNumber blockNum, bool *foundPtr);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index bf3b12a2de..39660aacba 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -162,6 +162,9 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 /*
  * prototypes for functions in bufmgr.c
  */
+extern void PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+								 ForkNumber forkNum,
+								 BlockNumber blockNum);
 extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
 						   BlockNumber blockNum);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
-- 
2.20.1

v6-0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecP.patch
From 260563b1400f32a94ee4cc5e4552d17fecfb4ea3 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 15:28:08 +1300
Subject: [PATCH v6 2/8] Rename GetWalRcvWriteRecPtr() to
 GetWalRcvFlushRecPtr().

The new name better reflects the fact that the value it returns is
updated only when received data has been flushed to disk.  Also rename a
couple of variables relating to this value.

An upcoming patch will make use of the latest data that was written
without waiting for it to be flushed, so let's use more precise function
names.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/access/transam/xlog.c          | 20 ++++++++++----------
 src/backend/access/transam/xlogfuncs.c     |  2 +-
 src/backend/replication/README             |  2 +-
 src/backend/replication/walreceiver.c      | 10 +++++-----
 src/backend/replication/walreceiverfuncs.c | 12 ++++++------
 src/backend/replication/walsender.c        |  2 +-
 src/include/replication/walreceiver.h      |  8 ++++----
 7 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index abf954ba39..658af40816 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -207,8 +207,8 @@ HotStandbyState standbyState = STANDBY_DISABLED;
 
 static XLogRecPtr LastRec;
 
-/* Local copy of WalRcv->receivedUpto */
-static XLogRecPtr receivedUpto = 0;
+/* Local copy of WalRcv->flushedUpto */
+static XLogRecPtr flushedUpto = 0;
 static TimeLineID receiveTLI = 0;
 
 /*
@@ -9335,7 +9335,7 @@ CreateRestartPoint(int flags)
 	 * Retreat _logSegNo using the current end of xlog replayed or received,
 	 * whichever is later.
 	 */
-	receivePtr = GetWalRcvWriteRecPtr(NULL, NULL);
+	receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
 	replayPtr = GetXLogReplayRecPtr(&replayTLI);
 	endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
 	KeepLogSeg(endptr, &_logSegNo);
@@ -11732,7 +11732,7 @@ retry:
 	/* See if we need to retrieve more data */
 	if (readFile < 0 ||
 		(readSource == XLOG_FROM_STREAM &&
-		 receivedUpto < targetPagePtr + reqLen))
+		 flushedUpto < targetPagePtr + reqLen))
 	{
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
 										 private->randAccess,
@@ -11763,10 +11763,10 @@ retry:
 	 */
 	if (readSource == XLOG_FROM_STREAM)
 	{
-		if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
+		if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
 			readLen = XLOG_BLCKSZ;
 		else
-			readLen = XLogSegmentOffset(receivedUpto, wal_segment_size) -
+			readLen = XLogSegmentOffset(flushedUpto, wal_segment_size) -
 				targetPageOff;
 	}
 	else
@@ -12181,7 +12181,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
 											 PrimarySlotName,
 											 wal_receiver_create_temp_slot);
-						receivedUpto = 0;
+						flushedUpto = 0;
 					}
 
 					/*
@@ -12205,14 +12205,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * XLogReceiptTime will not advance, so the grace time
 					 * allotted to conflicting queries will decrease.
 					 */
-					if (RecPtr < receivedUpto)
+					if (RecPtr < flushedUpto)
 						havedata = true;
 					else
 					{
 						XLogRecPtr	latestChunkStart;
 
-						receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI);
-						if (RecPtr < receivedUpto && receiveTLI == curFileTLI)
+						flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
+						if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
 						{
 							havedata = true;
 							if (latestChunkStart <= RecPtr)
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b84ba57259..00e1b33ed5 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -398,7 +398,7 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
 {
 	XLogRecPtr	recptr;
 
-	recptr = GetWalRcvWriteRecPtr(NULL, NULL);
+	recptr = GetWalRcvFlushRecPtr(NULL, NULL);
 
 	if (recptr == 0)
 		PG_RETURN_NULL();
diff --git a/src/backend/replication/README b/src/backend/replication/README
index 0cbb990613..8ccdd86e74 100644
--- a/src/backend/replication/README
+++ b/src/backend/replication/README
@@ -54,7 +54,7 @@ and WalRcvData->slotname, and initializes the starting point in
 WalRcvData->receiveStart.
 
 As walreceiver receives WAL from the master server, and writes and flushes
-it to disk (in pg_wal), it updates WalRcvData->receivedUpto and signals
+it to disk (in pg_wal), it updates WalRcvData->flushedUpto and signals
 the startup process to know how far WAL replay can advance.
 
 Walreceiver sends information about replication progress to the master server
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index aee67c61aa..1363c3facc 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -12,7 +12,7 @@
  * in the primary server), and then keeps receiving XLOG records and
  * writing them to the disk as long as the connection is alive. As XLOG
  * records are received and flushed to disk, it updates the
- * WalRcv->receivedUpto variable in shared memory, to inform the startup
+ * WalRcv->flushedUpto variable in shared memory, to inform the startup
  * process of how far it can proceed with XLOG replay.
  *
  * A WAL receiver cannot directly load GUC parameters used when establishing
@@ -1005,10 +1005,10 @@ XLogWalRcvFlush(bool dying)
 
 		/* Update shared-memory status */
 		SpinLockAcquire(&walrcv->mutex);
-		if (walrcv->receivedUpto < LogstreamResult.Flush)
+		if (walrcv->flushedUpto < LogstreamResult.Flush)
 		{
-			walrcv->latestChunkStart = walrcv->receivedUpto;
-			walrcv->receivedUpto = LogstreamResult.Flush;
+			walrcv->latestChunkStart = walrcv->flushedUpto;
+			walrcv->flushedUpto = LogstreamResult.Flush;
 			walrcv->receivedTLI = ThisTimeLineID;
 		}
 		SpinLockRelease(&walrcv->mutex);
@@ -1361,7 +1361,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	state = WalRcv->walRcvState;
 	receive_start_lsn = WalRcv->receiveStart;
 	receive_start_tli = WalRcv->receiveStartTLI;
-	received_lsn = WalRcv->receivedUpto;
+	received_lsn = WalRcv->flushedUpto;
 	received_tli = WalRcv->receivedTLI;
 	last_send_time = WalRcv->lastMsgSendTime;
 	last_receipt_time = WalRcv->lastMsgReceiptTime;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 21d1823607..32260c2236 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -282,11 +282,11 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 
 	/*
 	 * If this is the first startup of walreceiver (on this timeline),
-	 * initialize receivedUpto and latestChunkStart to the starting point.
+	 * initialize flushedUpto and latestChunkStart to the starting point.
 	 */
 	if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
 	{
-		walrcv->receivedUpto = recptr;
+		walrcv->flushedUpto = recptr;
 		walrcv->receivedTLI = tli;
 		walrcv->latestChunkStart = recptr;
 	}
@@ -304,7 +304,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 }
 
 /*
- * Returns the last+1 byte position that walreceiver has written.
+ * Returns the last+1 byte position that walreceiver has flushed.
  *
  * Optionally, returns the previous chunk start, that is the first byte
  * written in the most recent walreceiver flush cycle.  Callers not
@@ -312,13 +312,13 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
  * receiveTLI.
  */
 XLogRecPtr
-GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 {
 	WalRcvData *walrcv = WalRcv;
 	XLogRecPtr	recptr;
 
 	SpinLockAcquire(&walrcv->mutex);
-	recptr = walrcv->receivedUpto;
+	recptr = walrcv->flushedUpto;
 	if (latestChunkStart)
 		*latestChunkStart = walrcv->latestChunkStart;
 	if (receiveTLI)
@@ -345,7 +345,7 @@ GetReplicationApplyDelay(void)
 	TimestampTz chunkReplayStartTime;
 
 	SpinLockAcquire(&walrcv->mutex);
-	receivePtr = walrcv->receivedUpto;
+	receivePtr = walrcv->flushedUpto;
 	SpinLockRelease(&walrcv->mutex);
 
 	replayPtr = GetXLogReplayRecPtr(NULL);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9e5611574c..414cf67d3d 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2949,7 +2949,7 @@ GetStandbyFlushRecPtr(void)
 	 * has streamed, but hasn't been replayed yet.
 	 */
 
-	receivePtr = GetWalRcvWriteRecPtr(NULL, &receiveTLI);
+	receivePtr = GetWalRcvFlushRecPtr(NULL, &receiveTLI);
 	replayPtr = GetXLogReplayRecPtr(&replayTLI);
 
 	ThisTimeLineID = replayTLI;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index cf3e43128c..6298ca07be 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -73,19 +73,19 @@ typedef struct
 	TimeLineID	receiveStartTLI;
 
 	/*
-	 * receivedUpto-1 is the last byte position that has already been
+	 * flushedUpto-1 is the last byte position that has already been
 	 * received, and receivedTLI is the timeline it came from.  At the first
 	 * startup of walreceiver, these are set to receiveStart and
 	 * receiveStartTLI. After that, walreceiver updates these whenever it
 	 * flushes the received WAL to disk.
 	 */
-	XLogRecPtr	receivedUpto;
+	XLogRecPtr	flushedUpto;
 	TimeLineID	receivedTLI;
 
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
 	 * of received WAL.  It's actually the same as the previous value of
-	 * receivedUpto before the last flush to disk.  Startup process can use
+	 * flushedUpto before the last flush to disk.  Startup process can use
 	 * this to detect whether it's keeping up or not.
 	 */
 	XLogRecPtr	latestChunkStart;
@@ -322,7 +322,7 @@ extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname,
 								 bool create_temp_slot);
-extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.20.1

v6-0003-Add-GetWalRcvWriteRecPtr-new-definition.patch
From 1ae59996301aefd91efd10fcbee50b2c3d7c140e Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH v6 3/8] Add GetWalRcvWriteRecPtr() (new definition).

A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk.  To do that, it
needs to be able to see the write pointer advancing in shared memory.

The function formerly bearing this name was recently renamed to
GetWalRcvFlushRecPtr(), which better describes what it does.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/replication/walreceiver.c      |  5 +++++
 src/backend/replication/walreceiverfuncs.c | 12 ++++++++++++
 src/include/replication/walreceiver.h      | 10 ++++++++++
 3 files changed, 27 insertions(+)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 1363c3facc..d69fb90132 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -261,6 +261,8 @@ WalReceiverMain(void)
 
 	SpinLockRelease(&walrcv->mutex);
 
+	pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
+
 	/* Arrange to clean up at walreceiver exit */
 	on_shmem_exit(WalRcvDie, 0);
 
@@ -984,6 +986,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 		LogstreamResult.Write = recptr;
 	}
+
+	/* Update shared-memory status */
+	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 }
 
 /*
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 32260c2236..4afad83539 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -328,6 +328,18 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 	return recptr;
 }
 
+/*
+ * Returns the last+1 byte position that walreceiver has written.
+ * This returns a recently written value without taking a lock.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtr(void)
+{
+	WalRcvData *walrcv = WalRcv;
+
+	return pg_atomic_read_u64(&walrcv->writtenUpto);
+}
+
 /*
  * Returns the replication apply delay in ms or -1
  * if the apply delay info is not available
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 6298ca07be..f1aa6e9977 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -16,6 +16,7 @@
 #include "access/xlogdefs.h"
 #include "getaddrinfo.h"		/* for NI_MAXHOST */
 #include "pgtime.h"
+#include "port/atomics.h"
 #include "replication/logicalproto.h"
 #include "replication/walsender.h"
 #include "storage/latch.h"
@@ -141,6 +142,14 @@ typedef struct
 
 	slock_t		mutex;			/* locks shared variables shown above */
 
+	/*
+	 * Like flushedUpto, but advanced after writing and before flushing,
+	 * without the need to acquire the spin lock.  Data can be read by another
+	 * process up to this point, but shouldn't be used for data integrity
+	 * purposes.
+	 */
+	pg_atomic_uint64 writtenUpto;
+
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.  But we do need atomic fetch and
@@ -323,6 +332,7 @@ extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname,
 								 bool create_temp_slot);
 extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(void);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.20.1

v6-0004-Add-pg_atomic_unlocked_add_fetch_XXX.patch
From 62556d7ecaed1e225afdf4c8a7b51e66d9affab4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 28 Mar 2020 11:42:59 +1300
Subject: [PATCH v6 4/8] Add pg_atomic_unlocked_add_fetch_XXX().

Add a variant of pg_atomic_add_fetch_XXX with no barrier semantics, for
cases where you only want to avoid the possibility that a concurrent
pg_atomic_read_XXX() sees a torn/partial value.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/include/port/atomics.h         | 24 ++++++++++++++++++++++
 src/include/port/atomics/generic.h | 33 ++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 4956ec55cb..2abb852893 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -389,6 +389,21 @@ pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
 	return pg_atomic_add_fetch_u32_impl(ptr, add_);
 }
 
+/*
+ * pg_atomic_unlocked_add_fetch_u32 - atomically add to variable
+ *
+ * Like pg_atomic_unlocked_write_u32, guarantees only that partial values
+ * cannot be observed.
+ *
+ * No barrier semantics.
+ */
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	AssertPointerAlignment(ptr, 4);
+	return pg_atomic_unlocked_add_fetch_u32_impl(ptr, add_);
+}
+
 /*
  * pg_atomic_sub_fetch_u32 - atomically subtract from variable
  *
@@ -519,6 +534,15 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
 }
 
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+	return pg_atomic_unlocked_add_fetch_u64_impl(ptr, add_);
+}
+
 #undef INSIDE_ATOMICS_H
 
 #endif							/* ATOMICS_H */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index d3ba89a58f..1683653ca6 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -234,6 +234,16 @@ pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 }
 #endif
 
+#if !defined(PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32)
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	ptr->value += add_;
+	return ptr->value;
+}
+#endif
+
 #if !defined(PG_HAVE_ATOMIC_SUB_FETCH_U32) && defined(PG_HAVE_ATOMIC_FETCH_SUB_U32)
 #define PG_HAVE_ATOMIC_SUB_FETCH_U32
 static inline uint32
@@ -399,3 +409,26 @@ pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
 }
 #endif
+
+#if defined(PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY) && \
+	!defined(PG_HAVE_ATOMIC_U64_SIMULATION)
+
+#ifndef PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	ptr->value += val;
+	return ptr->value;
+}
+#endif
+
+#else
+
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	return pg_atomic_add_fetch_u64_impl(ptr, val);
+}
+
+#endif
-- 
2.20.1

v6-0005-Allow-PrefetchBuffer-to-report-what-happened.patch
From 6fffe00e39ec837cb08afb57bce413b8fad456ed Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 17:26:41 +1300
Subject: [PATCH v6 5/8] Allow PrefetchBuffer() to report what happened.

Report whether a prefetch was actually initiated due to a cache miss, so
that callers can limit the number of concurrent I/Os they try to issue,
without counting the prefetch calls that did nothing because the page
was already in our buffers.

If the requested block was already cached, return a valid buffer.  This
might enable future code to avoid a buffer mapping lookup, though it
will need to recheck the buffer before using it because it's not pinned
so could be reclaimed at any time.

Report neither hit nor miss when a relation's backing file is missing,
to prepare for use during recovery.  This will be used to handle cases
of relations that are referenced in the WAL but have been unlinked
already due to actions covered by WAL records that haven't been replayed
yet, after a crash.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 57 +++++++++++++++++++++------
 src/backend/storage/buffer/localbuf.c | 18 ++++++---
 src/backend/storage/smgr/md.c         |  9 ++++-
 src/backend/storage/smgr/smgr.c       | 10 +++--
 src/include/storage/buf_internals.h   |  5 ++-
 src/include/storage/bufmgr.h          | 19 ++++++---
 src/include/storage/md.h              |  2 +-
 src/include/storage/smgr.h            |  2 +-
 8 files changed, 90 insertions(+), 32 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 22087a1c3c..23f269ae74 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -483,14 +483,14 @@ static int	ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
 /*
  * Implementation of PrefetchBuffer() for shared buffers.
  */
-void
+PrefetchBufferResult
 PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
 					 ForkNumber forkNum,
 					 BlockNumber blockNum)
 {
-#ifdef USE_PREFETCH
-	BufferTag	newTag;		/* identity of requested block */
-	uint32		newHash;	/* hash value for newTag */
+	PrefetchBufferResult result = {InvalidBuffer, false};
+	BufferTag	newTag;			/* identity of requested block */
+	uint32		newHash;		/* hash value for newTag */
 	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
 	int			buf_id;
 
@@ -511,7 +511,25 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
 
 	/* If not in buffers, initiate prefetch */
 	if (buf_id < 0)
-		smgrprefetch(smgr_reln, forkNum, blockNum);
+	{
+#ifdef USE_PREFETCH
+		/*
+		 * Try to initiate an asynchronous read.  This returns false in
+		 * recovery if the relation file doesn't exist.
+		 */
+		if (smgrprefetch(smgr_reln, forkNum, blockNum))
+			result.initiated_io = true;
+#endif							/* USE_PREFETCH */
+	}
+	else
+	{
+		/*
+		 * Report the buffer it was in at that time.  The caller may be able
+		 * to avoid a buffer table lookup, but it's not pinned and it must be
+		 * rechecked!
+		 */
+		result.recent_buffer = buf_id + 1;
+	}
 
 	/*
 	 * If the block *is* in buffers, we do nothing.  This is not really ideal:
@@ -524,7 +542,8 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
 	 * additional per-buffer state, and it's not clear that there's enough of
 	 * a problem to justify that.
 	 */
-#endif							/* USE_PREFETCH */
+
+	return result;
 }
 
 /*
@@ -533,12 +552,27 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
  * This is named by analogy to ReadBuffer but doesn't actually allocate a
  * buffer.  Instead it tries to ensure that a future ReadBuffer for the given
  * block will not be delayed by the I/O.  Prefetching is optional.
- * No-op if prefetching isn't compiled in.
+ *
+ * There are three possible outcomes:
+ *
+ * 1.  If the block is already cached, the result includes a valid buffer that
+ * could be used by the caller to avoid the need for a later buffer lookup, but
+ * it's not pinned, so the caller must recheck it.
+ *
+ * 2.  If the kernel has been asked to initiate I/O, the initiated_io member is
+ * true.  Currently there is no way to know if the data was already cached by
+ * the kernel and therefore didn't really initiate I/O, and no way to know when
+ * the I/O completes other than using synchronous ReadBuffer().
+ *
+ * 3.  Otherwise, the buffer wasn't already cached by PostgreSQL, and either
+ * USE_PREFETCH is not defined (this build doesn't support prefetching due to
+ * lack of a kernel facility), or the underlying relation file wasn't found and we
+ * are in recovery.  (If the relation file isn't found and we are not in
+ * recovery, an error is raised).
  */
-void
+PrefetchBufferResult
 PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 {
-#ifdef USE_PREFETCH
 	Assert(RelationIsValid(reln));
 	Assert(BlockNumberIsValid(blockNum));
 
@@ -554,14 +588,13 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 					 errmsg("cannot access temporary tables of other sessions")));
 
 		/* pass it off to localbuf.c */
-		PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
+		return PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
 	}
 	else
 	{
 		/* pass it to the shared buffer version */
-		PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
+		return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
 	}
-#endif							/* USE_PREFETCH */
 }
 
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index b528bc9553..1614ca03ea 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -60,11 +60,11 @@ static Block GetLocalBufferStorage(void);
  * Do PrefetchBuffer's work for temporary relations.
  * No-op if prefetching isn't compiled in.
  */
-void
+PrefetchBufferResult
 PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 					BlockNumber blockNum)
 {
-#ifdef USE_PREFETCH
+	PrefetchBufferResult result = { InvalidBuffer, false };
 	BufferTag	newTag;			/* identity of requested block */
 	LocalBufferLookupEnt *hresult;
 
@@ -81,12 +81,18 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 	if (hresult)
 	{
 		/* Yes, so nothing to do */
-		return;
+		result.recent_buffer = -hresult->id - 1;
 	}
-
-	/* Not in buffers, so initiate prefetch */
-	smgrprefetch(smgr, forkNum, blockNum);
+	else
+	{
+#ifdef USE_PREFETCH
+		/* Not in buffers, so initiate prefetch */
+		smgrprefetch(smgr, forkNum, blockNum);
+		result.initiated_io = true;
 #endif							/* USE_PREFETCH */
+	}
+
+	return result;
 }
 
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index ee9822c6e1..e0b020da11 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -524,14 +524,17 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 /*
  *	mdprefetch() -- Initiate asynchronous read of the specified block of a relation
  */
-void
+bool
 mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 {
 #ifdef USE_PREFETCH
 	off_t		seekpos;
 	MdfdVec    *v;
 
-	v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+	v = _mdfd_getseg(reln, forknum, blocknum, false,
+					 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
+	if (v == NULL)
+		return false;
 
 	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
@@ -539,6 +542,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 
 	(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
 #endif							/* USE_PREFETCH */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 72c9696ad1..b053a4dc76 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -49,7 +49,7 @@ typedef struct f_smgr
 								bool isRedo);
 	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
 								BlockNumber blocknum, char *buffer, bool skipFsync);
-	void		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
+	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber blocknum);
 	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
 							  BlockNumber blocknum, char *buffer);
@@ -524,11 +524,15 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 /*
  *	smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
+ *
+ *		In recovery only, this can return false to indicate that a file
+ *		doesn't	exist (presumably it has been dropped by a later WAL
+ *		record).
  */
-void
+bool
 smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 {
-	smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
+	return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
 }
 
 /*
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 166fe334c7..e57f84ee9c 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -327,8 +327,9 @@ extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
 extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
 
 /* localbuf.c */
-extern void PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
-								BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
+												ForkNumber forkNum,
+												BlockNumber blockNum);
 extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
 									BlockNumber blockNum, bool *foundPtr);
 extern void MarkLocalBufferDirty(Buffer buffer);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 39660aacba..ee91b8fa26 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -46,6 +46,15 @@ typedef enum
 								 * replay; otherwise same as RBM_NORMAL */
 } ReadBufferMode;
 
+/*
+ * Type returned by PrefetchBuffer().
+ */
+typedef struct PrefetchBufferResult
+{
+	Buffer		recent_buffer;	/* If valid, a hit (recheck needed!) */
+	bool		initiated_io;	/* If true, a miss resulting in async I/O */
+} PrefetchBufferResult;
+
 /* forward declared, to avoid having to expose buf_internals.h here */
 struct WritebackContext;
 
@@ -162,11 +171,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 /*
  * prototypes for functions in bufmgr.c
  */
-extern void PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
-								 ForkNumber forkNum,
-								 BlockNumber blockNum);
-extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
-						   BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+												 ForkNumber forkNum,
+												 BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchBuffer(Relation reln, ForkNumber forkNum,
+										   BlockNumber blockNum);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
 extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BlockNumber blockNum, ReadBufferMode mode,
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index ec7630ce3b..07fd1bb7d0 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -28,7 +28,7 @@ extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
 extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
 extern void mdextend(SMgrRelation reln, ForkNumber forknum,
 					 BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum);
 extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 				   char *buffer);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 79dfe0e373..bb8428f27f 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -93,7 +93,7 @@ extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum);
 extern void smgrread(SMgrRelation reln, ForkNumber forknum,
 					 BlockNumber blocknum, char *buffer);
-- 
2.20.1

v6-0006-Add-ReadBufferPrefetched-POC-only.patch
From a00a528be26b06d7de40d59381b7ee864f06f3a9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 26 Mar 2020 22:34:29 +1300
Subject: [PATCH v6 6/8] Add ReadBufferPrefetched() (POC only)

Provide a potentially faster version of ReadBuffer(), for cases where you
have a PrefetchBufferResult.  We might be able to avoid an extra buffer
mapping table lookup.

NOT FOR COMMIT -- PROOF OF CONCEPT ONLY
---
 src/backend/storage/buffer/bufmgr.c | 49 +++++++++++++++++++++++++++++
 src/include/storage/bufmgr.h        |  3 ++
 2 files changed, 52 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 23f269ae74..f00c837f5a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -597,6 +597,55 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 	}
 }
 
+/*
+ * ReadBufferPrefetched -- read a buffer for which a prefetch was issued
+ *
+ * Like ReadBuffer(), but try to use the result of a recent PrefetchBuffer()
+ * call to avoid a buffer mapping table lookup.
+ */
+Buffer
+ReadBufferPrefetched(PrefetchBufferResult *prefetch,
+					 Relation reln,
+					 BlockNumber blockNum)
+{
+	/*
+	 * If PrefetchBuffer() found this block in a buffer recently, try to pin it
+	 * and then double check that it still holds the block we want.
+	 */
+	if (BufferIsValid(prefetch->recent_buffer))
+	{
+		BufferDesc *bufHdr;
+		BufferTag	tag;
+
+		if (BufferIsLocal(prefetch->recent_buffer))
+		{
+			bufHdr = GetBufferDescriptor(-prefetch->recent_buffer - 1);
+		}
+		else
+		{
+			bufHdr = GetBufferDescriptor(prefetch->recent_buffer - 1);
+			ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+			if (!PinBuffer(bufHdr, NULL))
+			{
+				/* not valid, forget about it */
+				UnpinBuffer(bufHdr, true);
+				bufHdr = NULL;
+			}
+		}
+
+		/* If we managed to pin it or it's local, check tag. */
+		if (bufHdr)
+		{
+			RelationOpenSmgr(reln);
+			INIT_BUFFERTAG(tag, reln->rd_smgr->smgr_rnode.node, MAIN_FORKNUM,
+						   blockNum);
+			if (BUFFERTAGS_EQUAL(tag, bufHdr->tag))
+				return BufferDescriptorGetBuffer(bufHdr);
+		}
+	}
+
+	return ReadBuffer(reln, blockNum);
+}
 
 /*
  * ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..8f6b19e6ac 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -183,6 +183,9 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy);
+extern Buffer ReadBufferPrefetched(PrefetchBufferResult *prefetch,
+								   Relation reln,
+								   BlockNumber blockNum);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
-- 
2.20.1

#12Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#11)
Re: WIP: WAL prefetch (another approach)

On Wed, Apr 8, 2020 at 4:24 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Thanks for all that feedback. It's been a strange couple of weeks,
but I finally have a new version that addresses most of that feedback
(but punts on a couple of suggestions for later development, due to
lack of time).

Here's an executive summary of an off-list chat with Andres:

* he withdrew his objection to the new definition of
GetWalRcvWriteRecPtr() based on my argument that any external code
will fail to compile anyway

* he doesn't like the naive code that detects sequential access and
skips prefetching; I agreed to rip it out for now and revisit if/when
we have better evidence that that's worth bothering with; the code
path that does that and the pg_stat_recovery_prefetch.skip_seq counter
will remain, but be used only to skip prefetching of repeated access
to the *same* block for now

* he gave some feedback on the read_local_xlog_page() modifications: I
probably need to reconsider the change to logical.c that passes NULL
instead of cxt to the read_page callback; and the switch statement in
read_local_xlog_page() probably should have a case for the preexisting
mode

* he +1s the plan to commit with the feature enabled, and revisit before release

* he thinks the idea of a variant of ReadBuffer() that takes a
PrefetchBufferResult (as sketched by the v6 0006 patch) broadly makes
sense as a stepping stone towards his asynchronous I/O proposal, but
there's no point in committing something like 0006 without a user
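
To make that a bit more concrete, here is a minimal caller sketch using the
interfaces from the attached 0005/0006 patches; the wrapper function itself
is invented for illustration and is not part of the patch set:

    /* Illustrative only: issue the prefetch early, consume the hint later. */
    static Buffer
    read_block_with_prefetch(Relation rel, BlockNumber blkno)
    {
        PrefetchBufferResult prefetch;

        /* Start the kernel read, or learn that the block is already cached. */
        prefetch = PrefetchBuffer(rel, MAIN_FORKNUM, blkno);

        /* ... do other useful work while the read is (hopefully) in flight ... */

        /*
         * Consume the hint: if recent_buffer is valid and still holds this
         * block, the buffer mapping lookup is skipped; otherwise this falls
         * back to a plain ReadBuffer().
         */
        return ReadBufferPrefetched(&prefetch, rel, blkno);
    }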

I'm going to go and commit the first few patches in this series, and
come back in a bit with a new version of the main patch to fix the
above and a compiler warning reported by cfbot.

#13Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#12)
4 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Wed, Apr 8, 2020 at 12:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:

* he gave some feedback on the read_local_xlog_page() modifications: I
probably need to reconsider the change to logical.c that passes NULL
instead of cxt to the read_page callback; and the switch statement in
read_local_xlog_page() probably should have a case for the preexisting
mode

So... logical.c wants to give its LogicalDecodingContext to any
XLogPageReadCB you give it, via "private_data"; that is, it really
only accepts XLogPageReadCB implementations that understand that (or
ignore it). What I want to do is give every XLogPageReadCB the chance
to have its own state that it is in control of (to receive settings
specific to the implementation, or whatever), that you supply along
with it. We can't do both kinds of things with private_data, so I
have added a second member read_page_data to XLogReaderState. If you
pass in read_local_xlog_page as read_page, then you can optionally
install a pointer to XLogReadLocalOptions as reader->read_page_data,
to activate the new behaviours I added for prefetching purposes.
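
For illustration, a caller would wire that up roughly as follows; the option
field name below is a placeholder, not necessarily what the attached patch
calls it:

    XLogReadLocalOptions opts = {0};

    opts.nowait = true;               /* placeholder: don't block waiting for WAL */

    /* reader was set up with read_local_xlog_page as its read_page callback */
    reader->read_page_data = &opts;   /* per-callback state; private_data stays free */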

While working on that, I realised the readahead XLogReader was
breaking a rule expressed in XLogReadDetermineTimeLine(). Timelines
are really confusing and there were probably several subtle or not to
subtle bugs there. So I added an option to skip all of that logic,
and just say "I command you to read only from TLI X". It reads the
same TLI as recovery is reading, until it hits the end of readable
data and that causes prefetching to shut down. Then the main recovery
loop resets the prefetching module when it sees a TLI switch, so then
it starts up again. This seems to work reliably, but I've obviously
had limited time to test. Does this scheme sound sane?
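
In pseudocode, the interaction with the main redo loop looks roughly like
this; all of the function and variable names here are invented for
illustration and don't necessarily match the attached patches:

    while ((record = ReadNextRecord(reader)) != NULL)
    {
        /* The look-ahead reader is pinned to a single timeline. */
        if (ThisTimeLineID != prefetch_tli)
        {
            /* Replay has switched timelines: restart the prefetching module. */
            XLogPrefetcherReset(prefetcher, ThisTimeLineID);
            prefetch_tli = ThisTimeLineID;
        }

        XLogPrefetcherReadAhead(prefetcher);    /* look ahead up to wal_prefetch_distance */
        ApplyWalRecord(record);
    }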

I think this is basically committable (though of course I wish I had
more time to test and review). Ugh. Feature freeze in half an hour.

Attachments:

v7-0001-Rationalize-GetWalRcv-Write-Flush-RecPtr.patch
From 8654ea7f2ed61de7ab3f0b305e37d190932ad97c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 15:28:08 +1300
Subject: [PATCH v7 1/4] Rationalize GetWalRcv{Write,Flush}RecPtr().

GetWalRcvWriteRecPtr() previously reported the latest *flushed*
location.  Adopt the conventional terminology used elsewhere in the tree
by renaming it to GetWalRcvFlushRecPtr(), and likewise for some related
variables that used the term "received".

Add a new definition of GetWalRcvWriteRecPtr(), which returns the latest
*written* value.  This will allow later patches to use the value for
non-data-integrity purposes, without having to wait for the flush
pointer to advance.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/access/transam/xlog.c          | 20 +++++++++---------
 src/backend/access/transam/xlogfuncs.c     |  2 +-
 src/backend/replication/README             |  2 +-
 src/backend/replication/walreceiver.c      | 15 +++++++++-----
 src/backend/replication/walreceiverfuncs.c | 24 ++++++++++++++++------
 src/backend/replication/walsender.c        |  2 +-
 src/include/replication/walreceiver.h      | 18 ++++++++++++----
 7 files changed, 55 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8a4c1743e5..c60842ea03 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -209,8 +209,8 @@ HotStandbyState standbyState = STANDBY_DISABLED;
 
 static XLogRecPtr LastRec;
 
-/* Local copy of WalRcv->receivedUpto */
-static XLogRecPtr receivedUpto = 0;
+/* Local copy of WalRcv->flushedUpto */
+static XLogRecPtr flushedUpto = 0;
 static TimeLineID receiveTLI = 0;
 
 /*
@@ -9376,7 +9376,7 @@ CreateRestartPoint(int flags)
 	 * Retreat _logSegNo using the current end of xlog replayed or received,
 	 * whichever is later.
 	 */
-	receivePtr = GetWalRcvWriteRecPtr(NULL, NULL);
+	receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
 	replayPtr = GetXLogReplayRecPtr(&replayTLI);
 	endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
 	KeepLogSeg(endptr, &_logSegNo);
@@ -11869,7 +11869,7 @@ retry:
 	/* See if we need to retrieve more data */
 	if (readFile < 0 ||
 		(readSource == XLOG_FROM_STREAM &&
-		 receivedUpto < targetPagePtr + reqLen))
+		 flushedUpto < targetPagePtr + reqLen))
 	{
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
 										 private->randAccess,
@@ -11900,10 +11900,10 @@ retry:
 	 */
 	if (readSource == XLOG_FROM_STREAM)
 	{
-		if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
+		if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
 			readLen = XLOG_BLCKSZ;
 		else
-			readLen = XLogSegmentOffset(receivedUpto, wal_segment_size) -
+			readLen = XLogSegmentOffset(flushedUpto, wal_segment_size) -
 				targetPageOff;
 	}
 	else
@@ -12318,7 +12318,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
 											 PrimarySlotName,
 											 wal_receiver_create_temp_slot);
-						receivedUpto = 0;
+						flushedUpto = 0;
 					}
 
 					/*
@@ -12342,14 +12342,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * XLogReceiptTime will not advance, so the grace time
 					 * allotted to conflicting queries will decrease.
 					 */
-					if (RecPtr < receivedUpto)
+					if (RecPtr < flushedUpto)
 						havedata = true;
 					else
 					{
 						XLogRecPtr	latestChunkStart;
 
-						receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI);
-						if (RecPtr < receivedUpto && receiveTLI == curFileTLI)
+						flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
+						if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
 						{
 							havedata = true;
 							if (latestChunkStart <= RecPtr)
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b84ba57259..00e1b33ed5 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -398,7 +398,7 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
 {
 	XLogRecPtr	recptr;
 
-	recptr = GetWalRcvWriteRecPtr(NULL, NULL);
+	recptr = GetWalRcvFlushRecPtr(NULL, NULL);
 
 	if (recptr == 0)
 		PG_RETURN_NULL();
diff --git a/src/backend/replication/README b/src/backend/replication/README
index 0cbb990613..8ccdd86e74 100644
--- a/src/backend/replication/README
+++ b/src/backend/replication/README
@@ -54,7 +54,7 @@ and WalRcvData->slotname, and initializes the starting point in
 WalRcvData->receiveStart.
 
 As walreceiver receives WAL from the master server, and writes and flushes
-it to disk (in pg_wal), it updates WalRcvData->receivedUpto and signals
+it to disk (in pg_wal), it updates WalRcvData->flushedUpto and signals
 the startup process to know how far WAL replay can advance.
 
 Walreceiver sends information about replication progress to the master server
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index aee67c61aa..d69fb90132 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -12,7 +12,7 @@
  * in the primary server), and then keeps receiving XLOG records and
  * writing them to the disk as long as the connection is alive. As XLOG
  * records are received and flushed to disk, it updates the
- * WalRcv->receivedUpto variable in shared memory, to inform the startup
+ * WalRcv->flushedUpto variable in shared memory, to inform the startup
  * process of how far it can proceed with XLOG replay.
  *
  * A WAL receiver cannot directly load GUC parameters used when establishing
@@ -261,6 +261,8 @@ WalReceiverMain(void)
 
 	SpinLockRelease(&walrcv->mutex);
 
+	pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
+
 	/* Arrange to clean up at walreceiver exit */
 	on_shmem_exit(WalRcvDie, 0);
 
@@ -984,6 +986,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 		LogstreamResult.Write = recptr;
 	}
+
+	/* Update shared-memory status */
+	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 }
 
 /*
@@ -1005,10 +1010,10 @@ XLogWalRcvFlush(bool dying)
 
 		/* Update shared-memory status */
 		SpinLockAcquire(&walrcv->mutex);
-		if (walrcv->receivedUpto < LogstreamResult.Flush)
+		if (walrcv->flushedUpto < LogstreamResult.Flush)
 		{
-			walrcv->latestChunkStart = walrcv->receivedUpto;
-			walrcv->receivedUpto = LogstreamResult.Flush;
+			walrcv->latestChunkStart = walrcv->flushedUpto;
+			walrcv->flushedUpto = LogstreamResult.Flush;
 			walrcv->receivedTLI = ThisTimeLineID;
 		}
 		SpinLockRelease(&walrcv->mutex);
@@ -1361,7 +1366,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	state = WalRcv->walRcvState;
 	receive_start_lsn = WalRcv->receiveStart;
 	receive_start_tli = WalRcv->receiveStartTLI;
-	received_lsn = WalRcv->receivedUpto;
+	received_lsn = WalRcv->flushedUpto;
 	received_tli = WalRcv->receivedTLI;
 	last_send_time = WalRcv->lastMsgSendTime;
 	last_receipt_time = WalRcv->lastMsgReceiptTime;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 21d1823607..4afad83539 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -282,11 +282,11 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 
 	/*
 	 * If this is the first startup of walreceiver (on this timeline),
-	 * initialize receivedUpto and latestChunkStart to the starting point.
+	 * initialize flushedUpto and latestChunkStart to the starting point.
 	 */
 	if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
 	{
-		walrcv->receivedUpto = recptr;
+		walrcv->flushedUpto = recptr;
 		walrcv->receivedTLI = tli;
 		walrcv->latestChunkStart = recptr;
 	}
@@ -304,7 +304,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 }
 
 /*
- * Returns the last+1 byte position that walreceiver has written.
+ * Returns the last+1 byte position that walreceiver has flushed.
  *
  * Optionally, returns the previous chunk start, that is the first byte
  * written in the most recent walreceiver flush cycle.  Callers not
@@ -312,13 +312,13 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
  * receiveTLI.
  */
 XLogRecPtr
-GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 {
 	WalRcvData *walrcv = WalRcv;
 	XLogRecPtr	recptr;
 
 	SpinLockAcquire(&walrcv->mutex);
-	recptr = walrcv->receivedUpto;
+	recptr = walrcv->flushedUpto;
 	if (latestChunkStart)
 		*latestChunkStart = walrcv->latestChunkStart;
 	if (receiveTLI)
@@ -328,6 +328,18 @@ GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 	return recptr;
 }
 
+/*
+ * Returns the last+1 byte position that walreceiver has written.
+ * This returns a recently written value without taking a lock.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtr(void)
+{
+	WalRcvData *walrcv = WalRcv;
+
+	return pg_atomic_read_u64(&walrcv->writtenUpto);
+}
+
 /*
  * Returns the replication apply delay in ms or -1
  * if the apply delay info is not available
@@ -345,7 +357,7 @@ GetReplicationApplyDelay(void)
 	TimestampTz chunkReplayStartTime;
 
 	SpinLockAcquire(&walrcv->mutex);
-	receivePtr = walrcv->receivedUpto;
+	receivePtr = walrcv->flushedUpto;
 	SpinLockRelease(&walrcv->mutex);
 
 	replayPtr = GetXLogReplayRecPtr(NULL);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 06e8b79036..122d884f3e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2949,7 +2949,7 @@ GetStandbyFlushRecPtr(void)
 	 * has streamed, but hasn't been replayed yet.
 	 */
 
-	receivePtr = GetWalRcvWriteRecPtr(NULL, &receiveTLI);
+	receivePtr = GetWalRcvFlushRecPtr(NULL, &receiveTLI);
 	replayPtr = GetXLogReplayRecPtr(&replayTLI);
 
 	ThisTimeLineID = replayTLI;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index cf3e43128c..f1aa6e9977 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -16,6 +16,7 @@
 #include "access/xlogdefs.h"
 #include "getaddrinfo.h"		/* for NI_MAXHOST */
 #include "pgtime.h"
+#include "port/atomics.h"
 #include "replication/logicalproto.h"
 #include "replication/walsender.h"
 #include "storage/latch.h"
@@ -73,19 +74,19 @@ typedef struct
 	TimeLineID	receiveStartTLI;
 
 	/*
-	 * receivedUpto-1 is the last byte position that has already been
+	 * flushedUpto-1 is the last byte position that has already been
 	 * received, and receivedTLI is the timeline it came from.  At the first
 	 * startup of walreceiver, these are set to receiveStart and
 	 * receiveStartTLI. After that, walreceiver updates these whenever it
 	 * flushes the received WAL to disk.
 	 */
-	XLogRecPtr	receivedUpto;
+	XLogRecPtr	flushedUpto;
 	TimeLineID	receivedTLI;
 
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
 	 * of received WAL.  It's actually the same as the previous value of
-	 * receivedUpto before the last flush to disk.  Startup process can use
+	 * flushedUpto before the last flush to disk.  Startup process can use
 	 * this to detect whether it's keeping up or not.
 	 */
 	XLogRecPtr	latestChunkStart;
@@ -141,6 +142,14 @@ typedef struct
 
 	slock_t		mutex;			/* locks shared variables shown above */
 
+	/*
+	 * Like flushedUpto, but advanced after writing and before flushing,
+	 * without the need to acquire the spin lock.  Data can be read by another
+	 * process up to this point, but shouldn't be used for data integrity
+	 * purposes.
+	 */
+	pg_atomic_uint64 writtenUpto;
+
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.  But we do need atomic fetch and
@@ -322,7 +331,8 @@ extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname,
 								 bool create_temp_slot);
-extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(void);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.20.1

v7-0002-Add-pg_atomic_unlocked_add_fetch_XXX.patch
From d0a1b60cbe589a4023b94db35ce3b830f5cbde04 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 28 Mar 2020 11:42:59 +1300
Subject: [PATCH v7 2/4] Add pg_atomic_unlocked_add_fetch_XXX().

Add a variant of pg_atomic_add_fetch_XXX with no barrier semantics, for
cases where you only want to avoid the possibility that a concurrent
pg_atomic_read_XXX() sees a torn/partial value.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/include/port/atomics.h         | 24 ++++++++++++++++++++++
 src/include/port/atomics/generic.h | 33 ++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 4956ec55cb..2abb852893 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -389,6 +389,21 @@ pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
 	return pg_atomic_add_fetch_u32_impl(ptr, add_);
 }
 
+/*
+ * pg_atomic_unlocked_add_fetch_u32 - atomically add to variable
+ *
+ * Like pg_atomic_unlocked_write_u32, guarantees only that partial values
+ * cannot be observed.
+ *
+ * No barrier semantics.
+ */
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	AssertPointerAlignment(ptr, 4);
+	return pg_atomic_unlocked_add_fetch_u32_impl(ptr, add_);
+}
+
 /*
  * pg_atomic_sub_fetch_u32 - atomically subtract from variable
  *
@@ -519,6 +534,15 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
 }
 
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+	return pg_atomic_unlocked_add_fetch_u64_impl(ptr, add_);
+}
+
 #undef INSIDE_ATOMICS_H
 
 #endif							/* ATOMICS_H */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index d3ba89a58f..1683653ca6 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -234,6 +234,16 @@ pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 }
 #endif
 
+#if !defined(PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32)
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	ptr->value += add_;
+	return ptr->value;
+}
+#endif
+
 #if !defined(PG_HAVE_ATOMIC_SUB_FETCH_U32) && defined(PG_HAVE_ATOMIC_FETCH_SUB_U32)
 #define PG_HAVE_ATOMIC_SUB_FETCH_U32
 static inline uint32
@@ -399,3 +409,26 @@ pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
 }
 #endif
+
+#if defined(PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY) && \
+	!defined(PG_HAVE_ATOMIC_U64_SIMULATION)
+
+#ifndef PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	ptr->value += val;
+	return ptr->value;
+}
+#endif
+
+#else
+
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	return pg_atomic_add_fetch_u64_impl(ptr, val);
+}
+
+#endif
-- 
2.20.1

v7-0003-Allow-XLogReadRecord-to-be-non-blocking.patch
From dea9a3c46d35b12bbea8469e44223f73e4004d22 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 7 Apr 2020 22:56:27 +1200
Subject: [PATCH v7 3/4] Allow XLogReadRecord() to be non-blocking.

Extend read_local_xlog_page() to support non-blocking modes:

1. Reading as far as the WAL receiver has written so far.
2. Reading all the way to the end, when the end LSN is unknown.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/access/transam/xlogreader.c |  37 ++++--
 src/backend/access/transam/xlogutils.c  | 151 +++++++++++++++++-------
 src/backend/replication/walsender.c     |   2 +-
 src/include/access/xlogreader.h         |  20 +++-
 src/include/access/xlogutils.h          |  23 ++++
 5 files changed, 178 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 79ff976474..554b2029da 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -257,6 +257,9 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  * If the reading fails for some other reason, NULL is also returned, and
  * *errormsg is set to a string with details of the failure.
  *
+ * If the read_page callback is one that returns XLOGPAGEREAD_WOULDBLOCK rather
+ * than waiting for WAL to arrive, NULL is also returned in that case.
+ *
  * The returned pointer (or *errormsg) points to an internal buffer that's
  * valid until the next call to XLogReadRecord.
  */
@@ -546,10 +549,11 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 err:
 
 	/*
-	 * Invalidate the read state. We might read from a different source after
-	 * failure.
+	 * Invalidate the read state, if this was an error. We might read from a
+	 * different source after failure.
 	 */
-	XLogReaderInvalReadState(state);
+	if (readOff != XLOGPAGEREAD_WOULDBLOCK)
+		XLogReaderInvalReadState(state);
 
 	if (state->errormsg_buf[0] != '\0')
 		*errormsg = state->errormsg_buf;
@@ -561,8 +565,9 @@ err:
  * Read a single xlog page including at least [pageptr, reqLen] of valid data
  * via the read_page() callback.
  *
- * Returns -1 if the required page cannot be read for some reason; errormsg_buf
- * is set in that case (unless the error occurs in the read_page callback).
+ * Returns XLOGPAGEREAD_ERROR or XLOGPAGEREAD_WOULDBLOCK if the required page
+ * cannot be read for some reason; errormsg_buf is set in the former case
+ * (unless the error occurs in the read_page callback).
  *
  * We fetch the page from a reader-local cache if we know we have the required
  * data and if there hasn't been any error since caching the data.
@@ -659,8 +664,11 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
+	if (readLen == XLOGPAGEREAD_WOULDBLOCK)
+		return XLOGPAGEREAD_WOULDBLOCK;
+
 	XLogReaderInvalReadState(state);
-	return -1;
+	return XLOGPAGEREAD_ERROR;
 }
 
 /*
@@ -939,6 +947,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	XLogRecPtr	found = InvalidXLogRecPtr;
 	XLogPageHeader header;
 	char	   *errormsg;
+	int			readLen;
 
 	Assert(!XLogRecPtrIsInvalid(RecPtr));
 
@@ -952,7 +961,6 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		XLogRecPtr	targetPagePtr;
 		int			targetRecOff;
 		uint32		pageHeaderSize;
-		int			readLen;
 
 		/*
 		 * Compute targetRecOff. It should typically be equal or greater than
@@ -1033,7 +1041,8 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	}
 
 err:
-	XLogReaderInvalReadState(state);
+	if (readLen != XLOGPAGEREAD_WOULDBLOCK)
+		XLogReaderInvalReadState(state);
 
 	return InvalidXLogRecPtr;
 }
@@ -1084,13 +1093,23 @@ WALRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
 			tli != seg->ws_tli)
 		{
 			XLogSegNo	nextSegNo;
-
 			if (seg->ws_file >= 0)
 				close(seg->ws_file);
 
 			XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize);
 			seg->ws_file = openSegment(nextSegNo, segcxt, &tli);
 
+			/* callback reported that there was no such file */
+			if (seg->ws_file < 0)
+			{
+				errinfo->wre_errno = errno;
+				errinfo->wre_req = 0;
+				errinfo->wre_read = 0;
+				errinfo->wre_off = startoff;
+				errinfo->wre_seg = *seg;
+				return false;
+			}
+
 			/* Update the current segment info. */
 			seg->ws_tli = tli;
 			seg->ws_segno = nextSegNo;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 6cb143e161..2d702437dd 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
@@ -783,6 +784,30 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	}
 }
 
+/* openSegment callback for WALRead that tolerates missing files */
+static int
+wal_segment_try_open(XLogSegNo nextSegNo,
+					 WALSegmentContext *segcxt,
+					 TimeLineID *tli_p)
+{
+	TimeLineID	tli = *tli_p;
+	char		path[MAXPGPATH];
+	int			fd;
+
+	XLogFilePath(path, tli, nextSegNo, segcxt->ws_segsize);
+	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+	if (fd >= 0)
+		return fd;
+
+	if (errno != ENOENT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+
+	return -1;					/* file not found; reported to caller */
+}
+
 /* openSegment callback for WALRead */
 static int
 wal_segment_open(XLogSegNo nextSegNo, WALSegmentContext * segcxt,
@@ -831,58 +856,92 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	TimeLineID	tli;
 	int			count;
 	WALReadError errinfo;
+	bool		try_read = false;
+	XLogReadLocalOptions *options =
+		(XLogReadLocalOptions *) state->read_page_data;
 
 	loc = targetPagePtr + reqLen;
 
 	/* Loop waiting for xlog to be available if necessary */
 	while (1)
 	{
-		/*
-		 * Determine the limit of xlog we can currently read to, and what the
-		 * most recent timeline is.
-		 *
-		 * RecoveryInProgress() will update ThisTimeLineID when it first
-		 * notices recovery finishes, so we only have to maintain it for the
-		 * local process until recovery ends.
-		 */
-		if (!RecoveryInProgress())
-			read_upto = GetFlushRecPtr();
-		else
-			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
-		tli = ThisTimeLineID;
+		switch (options ? options->read_upto_policy : -1)
+		{
+		case XLRO_WALRCV_WRITTEN:
+			/*
+			 * We'll try to read as far as has been written by the WAL
+			 * receiver, on the requested timeline.  When we run out of valid
+			 * data, we'll return an error.  This is used by xlogprefetch.c
+			 * while streaming.
+			 */
+			read_upto = GetWalRcvWriteRecPtr();
+			try_read = true;
+			state->currTLI = tli = options->tli;
+			break;
 
-		/*
-		 * Check which timeline to get the record from.
-		 *
-		 * We have to do it each time through the loop because if we're in
-		 * recovery as a cascading standby, the current timeline might've
-		 * become historical. We can't rely on RecoveryInProgress() because in
-		 * a standby configuration like
-		 *
-		 * A => B => C
-		 *
-		 * if we're a logical decoding session on C, and B gets promoted, our
-		 * timeline will change while we remain in recovery.
-		 *
-		 * We can't just keep reading from the old timeline as the last WAL
-		 * archive in the timeline will get renamed to .partial by
-		 * StartupXLOG().
-		 *
-		 * If that happens after our caller updated ThisTimeLineID but before
-		 * we actually read the xlog page, we might still try to read from the
-		 * old (now renamed) segment and fail. There's not much we can do
-		 * about this, but it can only happen when we're a leaf of a cascading
-		 * standby whose master gets promoted while we're decoding, so a
-		 * one-off ERROR isn't too bad.
-		 */
-		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+		case XLRO_END:
+			/*
+			 * We'll try to read as far as we can on one timeline.  This is
+			 * used by xlogprefetch.c for crash recovery.
+			 */
+			read_upto = (XLogRecPtr) -1;
+			try_read = true;
+			state->currTLI = tli = options->tli;
+			break;
+
+		default:
+			/*
+			 * Determine the limit of xlog we can currently read to, and what the
+			 * most recent timeline is.
+			 *
+			 * RecoveryInProgress() will update ThisTimeLineID when it first
+			 * notices recovery finishes, so we only have to maintain it for
+			 * the local process until recovery ends.
+			 */
+			if (!RecoveryInProgress())
+				read_upto = GetFlushRecPtr();
+			else
+				read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
+			tli = ThisTimeLineID;
+
+			/*
+			 * Check which timeline to get the record from.
+			 *
+			 * We have to do it each time through the loop because if we're in
+			 * recovery as a cascading standby, the current timeline might've
+			 * become historical. We can't rely on RecoveryInProgress()
+			 * because in a standby configuration like
+			 *
+			 * A => B => C
+			 *
+			 * if we're a logical decoding session on C, and B gets promoted,
+			 * our timeline will change while we remain in recovery.
+			 *
+			 * We can't just keep reading from the old timeline as the last
+			 * WAL archive in the timeline will get renamed to .partial by
+			 * StartupXLOG().
+			 *
+			 * If that happens after our caller updated ThisTimeLineID but
+			 * before we actually read the xlog page, we might still try to
+			 * read from the old (now renamed) segment and fail. There's not
+			 * much we can do about this, but it can only happen when we're a
+			 * leaf of a cascading standby whose master gets promoted while
+			 * we're decoding, so a one-off ERROR isn't too bad.
+			 */
+			XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+			break;
+		}
 
-		if (state->currTLI == ThisTimeLineID)
+		if (state->currTLI == tli)
 		{
 
 			if (loc <= read_upto)
 				break;
 
+			/* not enough data there, but we were asked not to wait */
+			if (options && options->nowait)
+				return XLOGPAGEREAD_WOULDBLOCK;
+
 			CHECK_FOR_INTERRUPTS();
 			pg_usleep(1000L);
 		}
@@ -924,7 +983,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	else if (targetPagePtr + reqLen > read_upto)
 	{
 		/* not enough data there */
-		return -1;
+		return XLOGPAGEREAD_ERROR;
 	}
 	else
 	{
@@ -938,8 +997,18 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
 	if (!WALRead(cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &state->seg,
-				 &state->segcxt, wal_segment_open, &errinfo))
+				 &state->segcxt,
+				 try_read ? wal_segment_try_open : wal_segment_open,
+				 &errinfo))
+	{
+		/*
+		 * When on one single timeline, we may read past the end of available
+		 * segments.  Report lack of file as an error.
+		 */
+		if (try_read)
+			return XLOGPAGEREAD_ERROR;
 		WALReadRaiseError(&errinfo);
+	}
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 122d884f3e..15ff3d35e4 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -818,7 +818,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 
 	/* fail if not (implies we are going to shut down) */
 	if (flushptr < targetPagePtr + reqLen)
-		return -1;
+		return XLOGPAGEREAD_ERROR;
 
 	if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
 		count = XLOG_BLCKSZ;	/* more than one block available */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196e18..a3ac7f414b 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -50,6 +50,10 @@ typedef struct WALSegmentContext
 
 typedef struct XLogReaderState XLogReaderState;
 
+/* Special negative return values for XLogPageReadCB functions */
+#define XLOGPAGEREAD_ERROR		-1
+#define XLOGPAGEREAD_WOULDBLOCK	-2
+
 /* Function type definition for the read_page callback */
 typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
@@ -99,10 +103,13 @@ struct XLogReaderState
 	 * This callback shall read at least reqLen valid bytes of the xlog page
 	 * starting at targetPagePtr, and store them in readBuf.  The callback
 	 * shall return the number of bytes read (never more than XLOG_BLCKSZ), or
-	 * -1 on failure.  The callback shall sleep, if necessary, to wait for the
-	 * requested bytes to become available.  The callback will not be invoked
-	 * again for the same page unless more than the returned number of bytes
-	 * are needed.
+	 * XLOGPAGEREAD_ERROR on failure.  The callback may either sleep or return
+	 * XLOGPAGEREAD_WOULDBLOCK, if necessary, to wait for the requested bytes
+	 * to become available.  If a callback that can return
+	 * XLOGPAGEREAD_WOULDBLOCK is installed, the reader client must expect to
+	 * fail to read when there is not enough data.  The callback will not be
+	 * invoked again for the same page unless more than the returned number of
+	 * bytes are needed.
 	 *
 	 * targetRecPtr is the position of the WAL record we're reading.  Usually
 	 * it is equal to targetPagePtr + reqLen, but sometimes xlogreader needs
@@ -126,6 +133,11 @@ struct XLogReaderState
 	 */
 	void	   *private_data;
 
+	/*
+	 * Opaque data for callbacks to use.  Not used by XLogReader.
+	 */
+	void	   *read_page_data;
+
 	/*
 	 * Start and end point of last record read.  EndRecPtr is also used as the
 	 * position to read next.  Calling XLogBeginRead() sets EndRecPtr to the
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5181a077d9..89c9ce90f8 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,6 +47,29 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
 
+/*
+ * A pointer to an XLogReadLocalOptions struct can be supplied as an XLogReader's
+ * read_page_data, causing read_local_xlog_page() to modify its behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+	/* Don't block waiting for new WAL to arrive. */
+	bool		nowait;
+
+	/*
+	 * For XLRO_WALRCV_WRITTEN and XLRO_END modes, the timeline ID must be
+	 * provided.
+	 */
+	TimeLineID	tli;
+
+	/* How far to read. */
+	enum {
+		XLRO_STANDARD,
+		XLRO_WALRCV_WRITTEN,
+		XLRO_END
+	} read_upto_policy;
+} XLogReadLocalOptions;
+
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
 								 XLogRecPtr targetRecPtr, char *cur_page);
-- 
2.20.1
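
As a rough sketch of how a caller drives the new non-blocking mode (this is
not part of the patch; start_lsn is a placeholder and the error handling is
simplified, but it mirrors what the prefetcher in patch 0004 does), you
install an XLogReadLocalOptions via read_page_data and treat a NULL result
with no error message as "not enough WAL yet, retry later":

XLogReadLocalOptions options = {
	.nowait = true,					/* return instead of sleeping */
	.tli = ThisTimeLineID,
	.read_upto_policy = XLRO_END	/* crash recovery: read as far as we can */
};
XLogReaderState *reader;
char	   *errormsg;

reader = XLogReaderAllocate(wal_segment_size, NULL,
							read_local_xlog_page, NULL);
reader->read_page_data = &options;
XLogBeginRead(reader, start_lsn);

if (XLogReadRecord(reader, &errormsg) == NULL)
{
	if (errormsg)
		elog(LOG, "stopped reading ahead: %s", errormsg);	/* real error */
	/* else: would block; call XLogReadRecord() again later */
}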

Attachment: v7-0004-Prefetch-referenced-blocks-during-recovery.patch (text/x-patch, US-ASCII)
From 85c2ea245c03c6a859e652cc2d9df3b2ca323bb4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Apr 2020 23:04:51 +1200
Subject: [PATCH v7 4/4] Prefetch referenced blocks during recovery.

Introduce a new GUC max_recovery_prefetch_distance.  If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks.  The
goal is to avoid I/O stalls and benefit from concurrent I/O.  The number
of concurrency asynchronous reads is capped by the existing
maintenance_io_concurrency GUC.  The feature is enabled by default for
now, but we might reconsider that before release.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  45 +
 doc/src/sgml/monitoring.sgml                  |  81 ++
 doc/src/sgml/wal.sgml                         |  13 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             |  16 +
 src/backend/access/transam/xlogprefetch.c     | 905 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  14 +
 src/backend/postmaster/pgstat.c               |  96 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  47 +-
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/access/xlogprefetch.h             |  85 ++
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/pgstat.h                          |  27 +
 src/include/utils/guc.h                       |   4 +
 src/test/regress/expected/rules.out           |  11 +
 16 files changed, 1359 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetch.c
 create mode 100644 src/include/access/xlogprefetch.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a0da4aabac..18979d0496 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3121,6 +3121,51 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-recovery-prefetch-distance" xreflabel="max_recovery_prefetch_distance">
+      <term><varname>max_recovery_prefetch_distance</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_recovery_prefetch_distance</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        The maximum distance to look ahead in the WAL during recovery, to find
+        blocks to prefetch.  Prefetching blocks that will soon be needed can
+        reduce I/O wait times.  The number of concurrent prefetches is limited
+        by this setting as well as
+        <xref linkend="guc-maintenance-io-concurrency"/>.  Setting it too high
+        might be counterproductive, if it means that data falls out of the
+        kernel cache before it is needed.  If this value is specified without
+        units, it is taken as bytes.  A setting of -1 disables prefetching
+        during recovery.
+        The default is 256kB on systems that support
+        <function>posix_fadvise</function>, and otherwise -1.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+      <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks that were logged with full page images,
+        during recovery.  Often this doesn't help, since such blocks will not
+        be read the first time they are needed and might remain in the buffer
+        pool after that.  However, on file systems with a block size larger
+        than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a
+        costly read-before-write when a block is later written.  This
+        setting has no effect unless
+        <xref linkend="guc-max-recovery-prefetch-distance"/> is set to a positive
+        number.  The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c50b72137f..ddf2ee1f96 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2223,6 +2230,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    connected server.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+    </row>
+    <row>
+     <entry><structfield>skip_seq</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+    </row>
+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_distance</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>How far ahead of recovery the prefetcher is on average, while recovery is not idle</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_queue_depth</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>Average number of prefetches in flight while recovery is not idle</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-max-recovery-prefetch-distance"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-max-recovery-prefetch-distance"/>,
+   <xref linkend="guc-recovery-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="3">
@@ -3446,6 +3525,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        counters shown in the <structname>pg_stat_bgwriter</structname> view.
        Calling <literal>pg_stat_reset_shared('archiver')</literal> will zero all the
        counters shown in the <structname>pg_stat_archiver</structname> view.
+       Calling <literal>pg_stat_reset_shared('prefetch_recovery')</literal> will zero all the
+       counters shown in the <structname>pg_stat_prefetch_recovery</structname> view.
       </entry>
      </row>
 
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index bd9fae544c..38fc8149a8 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -719,6 +719,19 @@
    <acronym>WAL</acronym> call being logged to the server log. This
    option might be replaced by a more general mechanism in the future.
   </para>
+
+  <para>
+   The <xref linkend="guc-max-recovery-prefetch-distance"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed, in combination with the
+   <xref linkend="guc-maintenance-io-concurrency"/> parameter.  The
+   prefetching mechanism is most likely to be effective on systems
+   with <varname>full_page_writes</varname> set to
+   <varname>off</varname> (where that is safe), and where the working
+   set is larger than RAM.  By default, prefetching in recovery is enabled,
+   but it can be disabled by setting the distance to -1.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..39f9d4e77d 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetch.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c60842ea03..6b2e95c06c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetch.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -7144,6 +7145,7 @@ StartupXLOG(void)
 		{
 			ErrorContextCallback errcallback;
 			TimestampTz xtime;
+			XLogPrefetchState prefetch;
 
 			InRedo = true;
 
@@ -7151,6 +7153,9 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
 
+			/* Prepare to prefetch, if configured. */
+			XLogPrefetchBegin(&prefetch);
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7181,6 +7186,12 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/* Perform WAL prefetching, if enabled. */
+				XLogPrefetch(&prefetch,
+							 ThisTimeLineID,
+							 xlogreader->ReadRecPtr,
+							 currentSource == XLOG_FROM_STREAM);
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -7352,6 +7363,9 @@ StartupXLOG(void)
 					 */
 					if (switchedTLI && AllowCascadeReplication())
 						WalSndWakeup();
+
+					/* Reset the prefetcher. */
+					XLogPrefetchReconfigure();
 				}
 
 				/* Exit loop if we reached inclusive recovery target */
@@ -7379,6 +7393,7 @@ StartupXLOG(void)
 			/*
 			 * end of main redo apply loop
 			 */
+			XLogPrefetchEnd(&prefetch);
 
 			if (reachedRecoveryTarget)
 			{
@@ -12107,6 +12122,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					currentSource = XLOG_FROM_STREAM;
 					startWalReceiver = true;
+					XLogPrefetchReconfigure();
 					break;
 
 				case XLOG_FROM_STREAM:
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
new file mode 100644
index 0000000000..7d3aea53f7
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -0,0 +1,905 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetch.c
+ *
+ * The goal of this module is to read future WAL records and issue
+ * PrefetchSharedBuffer() calls for referenced blocks, so that we avoid I/O
+ * stalls in the main recovery loop.  Currently, this is achieved by using a
+ * separate XLogReader to read ahead.  In future, we should find a way to
+ * avoid reading and decoding each record twice.
+ *
+ * When examining a WAL record from the future, we need to consider that a
+ * referenced block or segment file might not exist on disk until this record
+ * or some earlier record has been replayed.  After a crash, a file might also
+ * be missing because it was dropped by a later WAL record; in that case, it
+ * will be recreated when this record is replayed.  These cases are handled by
+ * recognizing them and adding a "filter" that prevents all prefetching of a
+ * certain block range until the present WAL record has been replayed.  Blocks
+ * skipped for these reasons are counted as "skip_new" (that is, cases where we
+ * didn't try to prefetch "new" blocks).
+ *
+ * Blocks found in the buffer pool already are counted as "skip_hit".
+ * Repeated access to the same buffer is detected and skipped, and this is
+ * counted with "skip_seq".  Blocks that were logged with FPWs are skipped if
+ * recovery_prefetch_fpw is off, since on most systems there will be no I/O
+ * stall; this is counted with "skip_fpw".
+ *
+ * The only way we currently have to know that an I/O initiated with
+ * PrefetchSharedBuffer() has completed is to call ReadBuffer().  Therefore,
+ * we track the number of potentially in-flight I/Os by using a circular
+ * buffer of LSNs.  When it's full, we have to wait for recovery to replay
+ * records so that the queue depth can be reduced, before we can do any more
+ * prefetching.  Ideally, this keeps us the right distance ahead to respect
+ * maintenance_io_concurrency.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetch.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log
+ * message that appears at the end of crash recovery.  It's also used to send
+ * messages periodically to the stats collector, to save the counters on disk.
+ */
+#define XLOGPREFETCHER_SAMPLE_DISTANCE 0x40000
+
+/* GUCs */
+int			max_recovery_prefetch_distance = -1;
+bool		recovery_prefetch_fpw = false;
+
+int			XLogPrefetchReconfigureCount;
+
+/*
+ * A prefetcher object.  There is at most one of these in existence at a time,
+ * recreated whenever there is a configuration change.
+ */
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	XLogReadLocalOptions options;
+	bool			have_record;
+	bool			shutdown;
+	int				next_block_id;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation	last_reln;
+	RelFileNode		last_rnode;
+	BlockNumber		last_blkno;
+
+	/* Online averages. */
+	uint64			samples;
+	double			avg_queue_depth;
+	double			avg_distance;
+	XLogRecPtr		next_sample_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB		   *filter_table;
+	dlist_head		filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	int				prefetch_head;
+	int				prefetch_tail;
+	int				prefetch_queue_size;
+	XLogRecPtr		prefetch_queue[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode		rnode;
+	XLogRecPtr		filter_until_replayed;
+	BlockNumber		filter_from_block;
+	dlist_node		link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time; /* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 skip_hit;	/* Blocks already buffered. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+	pg_atomic_uint64 skip_seq;	/* Repeat blocks skipped. */
+	float			avg_distance;
+	float			avg_queue_depth;
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32			reset_handled;
+
+	/* Dynamic values */
+	int				distance;	/* Number of bytes ahead in the WAL. */
+	int				queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+static void XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher,
+									  XLogRecPtr replaying_lsn);
+static bool XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher);
+static void XLogPrefetchSaveStats(void);
+static void XLogPrefetchRestoreStats(void);
+
+static XLogPrefetchStats *Stats;
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&Stats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&Stats->prefetch, 0);
+	pg_atomic_write_u64(&Stats->skip_hit, 0);
+	pg_atomic_write_u64(&Stats->skip_new, 0);
+	pg_atomic_write_u64(&Stats->skip_fpw, 0);
+	pg_atomic_write_u64(&Stats->skip_seq, 0);
+	Stats->avg_distance = 0;
+	Stats->avg_queue_depth = 0;
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	Stats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+	if (!found)
+	{
+		pg_atomic_init_u32(&Stats->reset_request, 0);
+		Stats->reset_handled = 0;
+		pg_atomic_init_u64(&Stats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&Stats->prefetch, 0);
+		pg_atomic_init_u64(&Stats->skip_hit, 0);
+		pg_atomic_init_u64(&Stats->skip_new, 0);
+		pg_atomic_init_u64(&Stats->skip_fpw, 0);
+		pg_atomic_init_u64(&Stats->skip_seq, 0);
+		Stats->avg_distance = 0;
+		Stats->avg_queue_depth = 0;
+		Stats->distance = 0;
+		Stats->queue_depth = 0;
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&Stats->reset_request, 1);
+}
+
+/*
+ * Tell the stats collector to serialize the shared memory counters into the
+ * stats file.
+ */
+static void
+XLogPrefetchSaveStats(void)
+{
+	PgStat_RecoveryPrefetchStats serialized = {
+		.prefetch = pg_atomic_read_u64(&Stats->prefetch),
+		.skip_hit = pg_atomic_read_u64(&Stats->skip_hit),
+		.skip_new = pg_atomic_read_u64(&Stats->skip_new),
+		.skip_fpw = pg_atomic_read_u64(&Stats->skip_fpw),
+		.skip_seq = pg_atomic_read_u64(&Stats->skip_seq),
+		.stat_reset_timestamp = pg_atomic_read_u64(&Stats->reset_time)
+	};
+
+	pgstat_send_recoveryprefetch(&serialized);
+}
+
+/*
+ * Try to restore the shared memory counters from the stats file.
+ */
+static void
+XLogPrefetchRestoreStats(void)
+{
+	PgStat_RecoveryPrefetchStats *serialized = pgstat_fetch_recoveryprefetch();
+
+	if (serialized->stat_reset_timestamp != 0)
+	{
+		pg_atomic_write_u64(&Stats->prefetch, serialized->prefetch);
+		pg_atomic_write_u64(&Stats->skip_hit, serialized->skip_hit);
+		pg_atomic_write_u64(&Stats->skip_new, serialized->skip_new);
+		pg_atomic_write_u64(&Stats->skip_fpw, serialized->skip_fpw);
+		pg_atomic_write_u64(&Stats->skip_seq, serialized->skip_seq);
+		pg_atomic_write_u64(&Stats->reset_time, serialized->stat_reset_timestamp);
+	}
+}
+
+/*
+ * Initialize an XLogPrefetchState object and restore the last saved
+ * statistics from disk.
+ */
+void
+XLogPrefetchBegin(XLogPrefetchState *state)
+{
+	XLogPrefetchRestoreStats();
+
+	/* We'll reconfigure on the first call to XLogPrefetch(). */
+	state->prefetcher = NULL;
+	state->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+}
+
+/*
+ * Shut down the prefetching infrastructure, if configured.
+ */
+void
+XLogPrefetchEnd(XLogPrefetchState *state)
+{
+	XLogPrefetchSaveStats();
+
+	if (state->prefetcher)
+		XLogPrefetcherFree(state->prefetcher);
+	state->prefetcher = NULL;
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(TimeLineID tli, XLogRecPtr lsn, bool streaming)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	/*
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.  We add one to the size
+	 * because our circular buffer has a gap between head and tail when full.
+	 */
+	prefetcher = palloc0(offsetof(XLogPrefetcher, prefetch_queue) +
+						 sizeof(XLogRecPtr) * (maintenance_io_concurrency + 1));
+	prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
+	prefetcher->options.tli = tli;
+	prefetcher->options.nowait = true;
+	if (streaming)
+	{
+		/*
+		 * We're only allowed to read as far as the WAL receiver has written.
+		 * We don't have to wait for it to be flushed, though, as recovery
+		 * does, so that gives us a chance to get a bit further ahead.
+		 */
+		prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+	}
+	else
+	{
+		/* Read as far as we can. */
+		prefetcher->options.read_upto_policy = XLRO_END;
+	}
+	prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+											NULL,
+											read_local_xlog_page,
+											NULL);
+	prefetcher->reader->read_page_data = &prefetcher->options;
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	/* Prepare to read at the given LSN. */
+	ereport(LOG,
+			(errmsg("recovery started prefetching on timeline %u at %X/%X",
+					tli,
+					(uint32) (lsn >> 32), (uint32) lsn)));
+	XLogBeginRead(prefetcher->reader, lsn);
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	/* Log final statistics. */
+	ereport(LOG,
+			(errmsg("recovery finished prefetching at %X/%X; "
+					"prefetch = " UINT64_FORMAT ", "
+					"skip_hit = " UINT64_FORMAT ", "
+					"skip_new = " UINT64_FORMAT ", "
+					"skip_fpw = " UINT64_FORMAT ", "
+					"skip_seq = " UINT64_FORMAT ", "
+					"avg_distance = %f, "
+					"avg_queue_depth = %f",
+			 (uint32) (prefetcher->reader->EndRecPtr >> 32),
+			 (uint32) (prefetcher->reader->EndRecPtr),
+			 pg_atomic_read_u64(&Stats->prefetch),
+			 pg_atomic_read_u64(&Stats->skip_hit),
+			 pg_atomic_read_u64(&Stats->skip_new),
+			 pg_atomic_read_u64(&Stats->skip_fpw),
+			 pg_atomic_read_u64(&Stats->skip_seq),
+			 Stats->avg_distance,
+			 Stats->avg_queue_depth)));
+	XLogReaderFree(prefetcher->reader);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Called when recovery is replaying a new LSN, to check if we can read ahead.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	uint32		reset_request;
+
+	/* If an error has occurred or we've hit the end of the WAL, do nothing. */
+	if (prefetcher->shutdown)
+		return;
+
+	/*
+	 * Have any in-flight prefetches definitely completed, judging by the LSN
+	 * that is currently being replayed?
+	 */
+	XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+	/*
+	 * Do we already have the maximum permitted number of I/Os running
+	 * (according to the information we have)?  If so, we have to wait for at
+	 * least one to complete, so give up early and let recovery catch up.
+	 */
+	if (XLogPrefetcherSaturated(prefetcher))
+		return;
+
+	/*
+	 * Can we drop any filters yet?  This happens when the LSN that is
+	 * currently being replayed has moved past a record that prevents
+	 * prefetching of a block range, such as relation extension.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&Stats->reset_request);
+	if (reset_request != Stats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		Stats->reset_handled = reset_request;
+		prefetcher->avg_distance = 0;
+		prefetcher->avg_queue_depth = 0;
+		prefetcher->samples = 0;
+	}
+
+	/* OK, we can now try reading ahead. */
+	XLogPrefetcherScanRecords(prefetcher, replaying_lsn);
+}
+
+/*
+ * Read ahead as far as we are allowed to, considering the LSN that recovery
+ * is currently replaying.
+ */
+static void
+XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	XLogReaderState *reader = prefetcher->reader;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	for (;;)
+	{
+		char *error;
+		int64		distance;
+
+		/* If we don't already have a record, then try to read one. */
+		if (!prefetcher->have_record)
+		{
+			if (!XLogReadRecord(reader, &error))
+			{
+				/* If we got an error, log it and give up. */
+				if (error)
+				{
+					ereport(LOG, (errmsg("recovery no longer prefetching: %s", error)));
+					prefetcher->shutdown = true;
+					Stats->queue_depth = 0;
+					Stats->distance = 0;
+				}
+				/* Otherwise, we'll try again later when more data is here. */
+				return;
+			}
+			prefetcher->have_record = true;
+			prefetcher->next_block_id = 0;
+		}
+
+		/* How far ahead of replay are we now? */
+		distance = prefetcher->reader->ReadRecPtr - replaying_lsn;
+
+		/* Update distance shown in shm. */
+		Stats->distance = distance;
+
+		/* Periodically recompute some statistics. */
+		if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+		{
+			/* Compute online averages. */
+			prefetcher->samples++;
+			if (prefetcher->samples == 1)
+			{
+				prefetcher->avg_distance = Stats->distance;
+				prefetcher->avg_queue_depth = Stats->queue_depth;
+			}
+			else
+			{
+				prefetcher->avg_distance +=
+					(Stats->distance - prefetcher->avg_distance) /
+					prefetcher->samples;
+				prefetcher->avg_queue_depth +=
+					(Stats->queue_depth - prefetcher->avg_queue_depth) /
+					prefetcher->samples;
+			}
+
+			/* Expose it in shared memory. */
+			Stats->avg_distance = prefetcher->avg_distance;
+			Stats->avg_queue_depth = prefetcher->avg_queue_depth;
+
+			/* Also periodically save the simple counters. */
+			XLogPrefetchSaveStats();
+
+			prefetcher->next_sample_lsn =
+				replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
+		}
+
+		/* Are we too far ahead of replay? */
+		if (distance >= max_recovery_prefetch_distance)
+			break;
+
+		/* Are we not far enough ahead? */
+		if (distance <= 0)
+		{
+			prefetcher->have_record = false;	/* skip this record */
+			continue;
+		}
+
+		/*
+		 * If this is a record that creates a new SMGR relation, we'll avoid
+		 * prefetching anything from that rnode until it has been replayed.
+		 */
+		if (replaying_lsn < reader->ReadRecPtr &&
+			XLogRecGetRmid(reader) == RM_SMGR_ID &&
+			(XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+		{
+			xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+			XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+									reader->ReadRecPtr);
+		}
+
+		/* Scan the record's block references. */
+		if (!XLogPrefetcherScanBlocks(prefetcher))
+			return;
+
+		/* Advance to the next record. */
+		prefetcher->have_record = false;
+	}
+}
+
+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+	XLogReaderState *reader = prefetcher->reader;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	/*
+	 * We might already have been partway through processing this record when
+	 * our queue became saturated, so we need to start where we left off.
+	 */
+	for (int block_id = prefetcher->next_block_id;
+		 block_id <= reader->max_block_id;
+		 ++block_id)
+	{
+		PrefetchBufferResult prefetch;
+		DecodedBkpBlock *block = &reader->blocks[block_id];
+		SMgrRelation reln;
+
+		/* Ignore everything but the main fork for now. */
+		if (block->forknum != MAIN_FORKNUM)
+			continue;
+
+		/*
+		 * If there is a full page image attached, we won't be reading the
+		 * page, so you might think we should skip it.  However, if the
+		 * underlying filesystem uses larger logical blocks than us, it
+		 * might still need to perform a read-before-write some time later.
+		 * Therefore, only prefetch if configured to do so.
+		 */
+		if (block->has_image && !recovery_prefetch_fpw)
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
+			continue;
+		}
+
+		/*
+		 * If this block will initialize a new page then it's probably an
+		 * extension.  Since it might create a new segment, we can't try
+		 * to prefetch this block until the record has been replayed, or we
+		 * might try to open a file that doesn't exist yet.
+		 */
+		if (block->flags & BKPBLOCK_WILL_INIT)
+		{
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+									reader->ReadRecPtr);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Should we skip this block due to a filter? */
+		if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Fast path for repeated references to the same relation. */
+		if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+		{
+			/*
+			 * If this is a repeat access to the same block, then skip it.
+			 *
+			 * XXX We could also check for last_blkno + 1 too, and also update
+			 * last_blkno; it's not clear if the kernel would do a better job
+			 * of sequential prefetching.
+			 */
+			if (block->blkno == prefetcher->last_blkno)
+			{
+				pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
+				continue;
+			}
+
+			/* We can avoid calling smgropen(). */
+			reln = prefetcher->last_reln;
+		}
+		else
+		{
+			/* Otherwise we have to open it. */
+			reln = smgropen(block->rnode, InvalidBackendId);
+			prefetcher->last_rnode = block->rnode;
+			prefetcher->last_reln = reln;
+		}
+		prefetcher->last_blkno = block->blkno;
+
+		/* Try to prefetch this block! */
+		prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+		if (BufferIsValid(prefetch.recent_buffer))
+		{
+			/*
+			 * It was already cached, so do nothing.  Perhaps in future we
+			 * could remember the buffer so that recovery doesn't have to look
+			 * it up again.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
+		}
+		else if (prefetch.initiated_io)
+		{
+			/*
+			 * I/O has possibly been initiated (though we don't know if it
+			 * I/O has possibly been initiated (the page might already have
+			 * been cached by the kernel, but we have no way to know that, so
+			 * we assume an I/O was started due to lack of better
+			 * information).  Record this as an I/O in progress until
+			 * eventually we replay this LSN.
+			pg_atomic_unlocked_add_fetch_u64(&Stats->prefetch, 1);
+			XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+			/*
+			 * If the queue is now full, we'll have to wait before processing
+			 * any more blocks from this record, or move to a new record if
+			 * that was the last block.
+			 */
+			if (XLogPrefetcherSaturated(prefetcher))
+			{
+				prefetcher->next_block_id = block_id + 1;
+				return false;
+			}
+		}
+		else
+		{
+			/*
+			 * Neither cached nor initiated.  The underlying segment file
+			 * doesn't exist.  Presumably it will be unlinked by a later WAL
+			 * record.  When recovery reads this block, it will use the
+			 * EXTENSION_CREATE_RECOVERY flag.  We certainly don't want to do
+			 * that sort of thing while merely prefetching, so let's just
+			 * ignore references to this relation until this record is
+			 * replayed, and let recovery create the dummy file or complain if
+			 * something is wrong.
+			 */
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+									reader->ReadRecPtr);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&Stats->reset_request) != Stats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&Stats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&Stats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_new));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_fpw));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_seq));
+	values[6] = Int32GetDatum(Stats->distance);
+	values[7] = Int32GetDatum(Stats->queue_depth);
+	values[8] = Float4GetDatum(Stats->avg_distance);
+	values[9] = Float4GetDatum(Stats->avg_queue_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the (presumably lower) block
+		 * number there because we don't want to have to track individual
+		 * blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of the
+	 * time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+												   HASH_FIND, NULL);
+
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Insert an LSN into the queue.  The queue must not be full already.  This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+						  XLogRecPtr prefetching_lsn)
+{
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+	prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+	prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+	Stats->queue_depth++;
+	Assert(Stats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet?  That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches.  For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+		   prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+	{
+		prefetcher->prefetch_tail++;
+		prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+		Stats->queue_depth--;
+		Assert(Stats->queue_depth >= 0);
+	}
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+	return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+		prefetcher->prefetch_tail;
+}
+
+void
+assign_max_recovery_prefetch_distance(int new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	max_recovery_prefetch_distance = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
+
+void
+assign_recovery_prefetch_fpw(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch_fpw = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d406ea8118..3b15f5ef8e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -825,6 +825,20 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.skip_hit,
+            s.skip_new,
+            s.skip_fpw,
+            s.skip_seq,
+            s.distance,
+            s.queue_depth,
+            s.avg_distance,
+            s.avg_queue_depth
+    FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 50eea2e8a8..6c9ac5b29b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "access/xlogprefetch.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -276,6 +277,7 @@ static int	localNumBackends = 0;
 static PgStat_ArchiverStats archiverStats;
 static PgStat_GlobalStats globalStats;
 static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
+static PgStat_RecoveryPrefetchStats recoveryPrefetchStats;
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -348,6 +350,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
+static void pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1364,11 +1367,20 @@ pgstat_reset_shared_counters(const char *target)
 		msg.m_resettarget = RESET_ARCHIVER;
 	else if (strcmp(target, "bgwriter") == 0)
 		msg.m_resettarget = RESET_BGWRITER;
+	else if (strcmp(target, "prefetch_recovery") == 0)
+	{
+		/*
+		 * We can't ask the stats collector to do this for us as it is not
+		 * attached to shared memory.
+		 */
+		XLogPrefetchRequestResetStats();
+		return;
+	}
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("unrecognized reset target: \"%s\"", target),
-				 errhint("Target must be \"archiver\" or \"bgwriter\".")));
+				 errhint("Target must be \"archiver\", \"bgwriter\" or \"prefetch_recovery\".")));
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
 	pgstat_send(&msg, sizeof(msg));
@@ -2690,6 +2702,22 @@ pgstat_fetch_slru(void)
 }
 
 
+/*
+ * ---------
+ * pgstat_fetch_recoveryprefetch() -
+ *
+ *	Support function for restoring the counters managed by xlogprefetch.c.
+ * ---------
+ */
+PgStat_RecoveryPrefetchStats *
+pgstat_fetch_recoveryprefetch(void)
+{
+	backend_read_statsfile();
+
+	return &recoveryPrefetchStats;
+}
+
+
 /* ------------------------------------------------------------
  * Functions for management of the shared-memory PgBackendStatus array
  * ------------------------------------------------------------
@@ -4440,6 +4468,23 @@ pgstat_send_slru(void)
 }
 
 
+/* ----------
+ * pgstat_send_recoveryprefetch() -
+ *
+ *		Send recovery prefetch statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats)
+{
+	PgStat_MsgRecoveryPrefetch msg;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYPREFETCH);
+	msg.m_stats = *stats;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
 /* ----------
  * PgstatCollectorMain() -
  *
@@ -4636,6 +4681,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_slru(&msg.msg_slru, len);
 					break;
 
+				case PGSTAT_MTYPE_RECOVERYPREFETCH:
+					pgstat_recv_recoveryprefetch(&msg.msg_recoveryprefetch, len);
+					break;
+
 				case PGSTAT_MTYPE_FUNCSTAT:
 					pgstat_recv_funcstat(&msg.msg_funcstat, len);
 					break;
@@ -4911,6 +4960,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
 	rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
 	(void) rc;					/* we'll check for error with ferror */
 
+	/*
+	 * Write recovery prefetch stats struct
+	 */
+	rc = fwrite(&recoveryPrefetchStats, sizeof(recoveryPrefetchStats), 1,
+				fpout);
+	(void) rc;					/* we'll check for error with ferror */
+
 	/*
 	 * Walk through the database table.
 	 */
@@ -5170,6 +5226,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 	memset(&globalStats, 0, sizeof(globalStats));
 	memset(&archiverStats, 0, sizeof(archiverStats));
 	memset(&slruStats, 0, sizeof(slruStats));
+	memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
 
 	/*
 	 * Set the current timestamp (will be kept only in case we can't load an
@@ -5257,6 +5314,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 		goto done;
 	}
 
+	/*
+	 * Read recoveryPrefetchStats struct
+	 */
+	if (fread(&recoveryPrefetchStats, 1, sizeof(recoveryPrefetchStats),
+			  fpin) != sizeof(recoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
+		goto done;
+	}
+
 	/*
 	 * We found an existing collector stats file. Read it and put all the
 	 * hashtable entries into place.
@@ -5556,6 +5625,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 	PgStat_GlobalStats myGlobalStats;
 	PgStat_ArchiverStats myArchiverStats;
 	PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
+	PgStat_RecoveryPrefetchStats myRecoveryPrefetchStats;
 	FILE	   *fpin;
 	int32		format_id;
 	const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -5621,6 +5691,18 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 		return false;
 	}
 
+	/*
+	 * Read recovery prefetch stats struct
+	 */
+	if (fread(&myRecoveryPrefetchStats, 1, sizeof(myRecoveryPrefetchStats),
+			  fpin) != sizeof(myRecoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		FreeFile(fpin);
+		return false;
+	}
+
 	/* By default, we're going to return the timestamp of the global file. */
 	*ts = myGlobalStats.stats_timestamp;
 
@@ -6420,6 +6502,18 @@ pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
 	slruStats[msg->m_index].truncate += msg->m_truncate;
 }
 
+/* ----------
+ * pgstat_recv_recoveryprefetch() -
+ *
+ *	Process a recovery prefetch message.
+ * ----------
+ */
+static void
+pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len)
+{
+	recoveryPrefetchStats = msg->m_stats;
+}
+
 /* ----------
  * pgstat_recv_recoveryconflict() -
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 417840a8f1..a965ab9d35 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/xlogprefetch.h"
 #include "commands/async.h"
 #include "commands/wait.h"
 #include "miscadmin.h"
@@ -125,6 +126,7 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, PredicateLockShmemSize());
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
+		size = add_size(size, XLogPrefetchShmemSize());
 		size = add_size(size, CLOGShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
@@ -214,6 +216,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5bdc02fce2..5ed7ed13e8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -34,6 +34,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetch.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/storage.h"
@@ -198,6 +199,7 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1272,6 +1274,18 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"recovery_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch blocks that have full page images in the WAL."),
+			gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+						 "entirely overwritten, but if the logical page size of the filesystem is "
+						 "larger than PostgreSQL's, this can be beneficial.  This option has no "
+						 "effect unless max_recovery_prefetch_distance is set to a positive number.")
+		},
+		&recovery_prefetch_fpw,
+		false,
+		NULL, assign_recovery_prefetch_fpw, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2649,6 +2663,22 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"max_recovery_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum number of bytes to read ahead in the WAL to prefetch referenced blocks."),
+			gettext_noop("Set to -1 to disable prefetching during recovery."),
+			GUC_UNIT_BYTE
+		},
+		&max_recovery_prefetch_distance,
+#ifdef USE_PREFETCH
+		256 * 1024,
+#else
+		-1,
+#endif
+		-1, INT_MAX,
+		NULL, assign_max_recovery_prefetch_distance, NULL
+	},
+
 	{
 		{"wal_keep_segments", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the number of WAL files held for standby servers."),
@@ -2968,7 +2998,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -11586,6 +11617,20 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 995b6ca155..55cce90763 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -230,6 +230,11 @@
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
 
+# - Prefetching during recovery -
+
+#max_recovery_prefetch_distance = 256kB	# -1 disables prefetching
+#recovery_prefetch_fpw = off	# whether to prefetch pages logged with FPW
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlogprefetch.h b/src/include/access/xlogprefetch.h
new file mode 100644
index 0000000000..d8e2e1ca50
--- /dev/null
+++ b/src/include/access/xlogprefetch.h
@@ -0,0 +1,85 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCH_H
+#define XLOGPREFETCH_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern int	max_recovery_prefetch_distance;
+extern bool recovery_prefetch_fpw;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern int XLogPrefetchReconfigureCount;
+
+typedef struct XLogPrefetchState
+{
+	XLogPrefetcher *prefetcher;
+	int			reconfigure_count;
+} XLogPrefetchState;
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchReconfigure(void);
+extern void XLogPrefetchRequestResetStats(void);
+
+extern void XLogPrefetchBegin(XLogPrefetchState *state);
+extern void XLogPrefetchEnd(XLogPrefetchState *state);
+
+/* Functions exposed only for the use of XLogPrefetch(). */
+extern XLogPrefetcher *XLogPrefetcherAllocate(TimeLineID tli,
+											  XLogRecPtr lsn,
+											  bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch,
+									XLogRecPtr replaying_lsn);
+
+/*
+ * Tell the prefetching module that we are now replaying a given LSN, so that
+ * it can decide how far ahead to read in the WAL, if configured.
+ */
+static inline void
+XLogPrefetch(XLogPrefetchState *state,
+			 TimeLineID replaying_tli,
+			 XLogRecPtr replaying_lsn,
+			 bool from_stream)
+{
+	/*
+	 * Handle any configuration changes.  Rather than trying to deal with
+	 * various parameter changes, we just tear down and set up a new
+	 * prefetcher if anything we depend on changes.
+	 */
+	if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
+	{
+		/* If we had a prefetcher, tear it down. */
+		if (state->prefetcher)
+		{
+			XLogPrefetcherFree(state->prefetcher);
+			state->prefetcher = NULL;
+		}
+		/* If we want a prefetcher, set it up. */
+		if (max_recovery_prefetch_distance > 0)
+			state->prefetcher = XLogPrefetcherAllocate(replaying_tli,
+													   replaying_lsn,
+													   from_stream);
+		state->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	if (state->prefetcher)
+		XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
+}
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4bce3ad8de..9f5f0ed4c8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6138,6 +6138,14 @@
   prorettype => 'bool', proargtypes => '',
   prosrc => 'pg_is_wal_replay_paused' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,float4,float4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth,avg_distance,avg_queue_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9988..105c2e77d2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -63,6 +63,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_SLRU,
+	PGSTAT_MTYPE_RECOVERYPREFETCH,
 	PGSTAT_MTYPE_FUNCSTAT,
 	PGSTAT_MTYPE_FUNCPURGE,
 	PGSTAT_MTYPE_RECOVERYCONFLICT,
@@ -183,6 +184,19 @@ typedef struct PgStat_TableXactStatus
 	struct PgStat_TableXactStatus *next;	/* next of same subxact */
 } PgStat_TableXactStatus;
 
+/*
+ * Recovery prefetching statistics persisted on disk by pgstat.c, but kept in
+ * shared memory by xlogprefetch.c.
+ */
+typedef struct PgStat_RecoveryPrefetchStats
+{
+	PgStat_Counter prefetch;
+	PgStat_Counter skip_hit;
+	PgStat_Counter skip_new;
+	PgStat_Counter skip_fpw;
+	PgStat_Counter skip_seq;
+	TimestampTz stat_reset_timestamp;
+} PgStat_RecoveryPrefetchStats;
 
 /* ------------------------------------------------------------
  * Message formats follow
@@ -454,6 +468,16 @@ typedef struct PgStat_MsgSLRU
 	PgStat_Counter m_truncate;
 } PgStat_MsgSLRU;
 
+/* ----------
+ * PgStat_MsgRecoveryPrefetch			Sent by XLogPrefetch to save statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgRecoveryPrefetch
+{
+	PgStat_MsgHdr m_hdr;
+	PgStat_RecoveryPrefetchStats m_stats;
+} PgStat_MsgRecoveryPrefetch;
+
 /* ----------
  * PgStat_MsgRecoveryConflict	Sent by the backend upon recovery conflict
  * ----------
@@ -598,6 +622,7 @@ typedef union PgStat_Msg
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgSLRU msg_slru;
+	PgStat_MsgRecoveryPrefetch msg_recoveryprefetch;
 	PgStat_MsgFuncstat msg_funcstat;
 	PgStat_MsgFuncpurge msg_funcpurge;
 	PgStat_MsgRecoveryConflict msg_recoveryconflict;
@@ -1464,6 +1489,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1479,6 +1505,7 @@ extern int	pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_RecoveryPrefetchStats *pgstat_fetch_recoveryprefetch(void);
 
 extern void pgstat_count_slru_page_zeroed(SlruCtl ctl);
 extern void pgstat_count_slru_page_hit(SlruCtl ctl);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 2819282181..976cf8b116 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -440,4 +440,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetch.c */
+extern void assign_max_recovery_prefetch_distance(int new_value, void *extra);
+extern void assign_recovery_prefetch_fpw(bool new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac31840739..942a07ffee 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1857,6 +1857,17 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.skip_hit,
+    s.skip_new,
+    s.skip_fpw,
+    s.skip_seq,
+    s.distance,
+    s.queue_depth,
+    s.avg_distance,
+    s.avg_queue_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth, avg_distance, avg_queue_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.20.1

#14Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#13)
Re: WIP: WAL prefetch (another approach)

On Wed, Apr 8, 2020 at 11:27 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Apr 8, 2020 at 12:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:

* he gave some feedback on the read_local_xlog_page() modifications: I
probably need to reconsider the change to logical.c that passes NULL
instead of cxt to the read_page callback; and the switch statement in
read_local_xlog_page() probably should have a case for the preexisting
mode

So... logical.c wants to give its LogicalDecodingContext to any
XLogPageReadCB you give it, via "private_data"; that is, it really
only accepts XLogPageReadCB implementations that understand that (or
ignore it). What I want to do is give every XLogPageReadCB the chance
to have its own state that it is in control of (to receive settings
specific to the implementation, or whatever), that you supply along
with it. We can't do both kinds of things with private_data, so I
have added a second member read_page_data to XLogReaderState. If you
pass in read_local_xlog_page as read_page, then you can optionally
install a pointer to XLogReadLocalOptions as reader->read_page_data,
to activate the new behaviours I added for prefetching purposes.

While working on that, I realised the readahead XLogReader was
breaking a rule expressed in XLogReadDetermineTimeLine(). Timelines
are really confusing and there were probably several subtle or not to
subtle bugs there. So I added an option to skip all of that logic,
and just say "I command you to read only from TLI X". It reads the
same TLI as recovery is reading, until it hits the end of readable
data and that causes prefetching to shut down. Then the main recovery
loop resets the prefetching module when it sees a TLI switch, so then
it starts up again. This seems to work reliably, but I've obviously
had limited time to test. Does this scheme sound sane?
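
To illustrate what happens at the end of readable data: because the
nowait callback returns XLOGPAGEREAD_WOULDBLOCK without invalidating
the reader's state, the read-ahead loop can simply stop and pick up
from the same position on a later call, roughly like this (hypothetical
sketch, not the exact code; the real thing also has to decide when to
set the shutdown flag and give up on the current timeline):

    for (;;)
    {
        char       *errormsg;
        XLogRecord *record = XLogReadRecord(reader, &errormsg);

        if (record == NULL)
            break;      /* nothing (more) we can read right now */

        /* ... examine block references, call PrefetchSharedBuffer() ... */
    }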

I think this is basically committable (though of course I wish I had
more time to test and review). Ugh. Feature freeze in half an hour.

Ok, so the following parts of this work have been committed:

b09ff536: Simplify the effective_io_concurrency setting.
fc34b0d9: Introduce a maintenance_io_concurrency setting.
3985b600: Support PrefetchBuffer() in recovery.
d140f2f3: Rationalize GetWalRcv{Write,Flush}RecPtr().

However, I didn't want to push the main patch into the tree at
(literally) the last minute after doing so much work on it in the
last few days, without more review from recovery code experts and some
independent testing. Judging by the comments made in this thread and
elsewhere, I think the feature is in demand so I hope there is a way
we could get it into 13 in the next couple of days, but I totally
accept the release management team's prerogative on that.

#15David Steele
david@pgmasters.net
In reply to: Thomas Munro (#14)
Re: WIP: WAL prefetch (another approach)

On 4/8/20 8:12 AM, Thomas Munro wrote:

Ok, so the following parts of this work have been committed:

b09ff536: Simplify the effective_io_concurrency setting.
fc34b0d9: Introduce a maintenance_io_concurrency setting.
3985b600: Support PrefetchBuffer() in recovery.
d140f2f3: Rationalize GetWalRcv{Write,Flush}RecPtr().

However, I didn't want to push the main patch into the tree at
(literally) the last minute after doing so much work on it in the
last few days, without more review from recovery code experts and some
independent testing.

I definitely think that was the right call.

Judging by the comments made in this thread and
elsewhere, I think the feature is in demand so I hope there is a way
we could get it into 13 in the next couple of days, but I totally
accept the release management team's prerogative on that.

That's up to the RMT, of course, but we did already have an extra week.
Might be best to just get this in at the beginning of the PG14 cycle.
FWIW, I do think the feature is really valuable.

Looks like you'll need to rebase, so I'll move this to the next CF in
WoA state.

Regards,
--
-David
david@pgmasters.net

#16Thomas Munro
thomas.munro@gmail.com
In reply to: David Steele (#15)
3 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Thu, Apr 9, 2020 at 12:27 AM David Steele <david@pgmasters.net> wrote:

On 4/8/20 8:12 AM, Thomas Munro wrote:

Judging by the comments made in this thread and
elsewhere, I think the feature is in demand so I hope there is a way
we could get it into 13 in the next couple of days, but I totally
accept the release management team's prerogative on that.

That's up to the RMT, of course, but we did already have an extra week.
Might be best to just get this in at the beginning of the PG14 cycle.
FWIW, I do think the feature is really valuable.

Looks like you'll need to rebase, so I'll move this to the next CF in
WoA state.

Thanks. Here's a rebase.

Attachments:

v8-0001-Add-pg_atomic_unlocked_add_fetch_XXX.patch (text/x-patch)
From db0d2774ac0faf9284e14ad243fefb940e1bc173 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 28 Mar 2020 11:42:59 +1300
Subject: [PATCH v8 1/3] Add pg_atomic_unlocked_add_fetch_XXX().

Add a variant of pg_atomic_add_fetch_XXX with no barrier semantics, for
cases where you only want to avoid the possibility that a concurrent
pg_atomic_read_XXX() sees a torn/partial value.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/include/port/atomics.h         | 24 ++++++++++++++++++++++
 src/include/port/atomics/generic.h | 33 ++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 4956ec55cb..2abb852893 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -389,6 +389,21 @@ pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
 	return pg_atomic_add_fetch_u32_impl(ptr, add_);
 }
 
+/*
+ * pg_atomic_unlocked_add_fetch_u32 - add to variable, without locking
+ *
+ * Like pg_atomic_unlocked_write_u32, guarantees only that partial values
+ * cannot be observed.
+ *
+ * No barrier semantics.
+ */
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	AssertPointerAlignment(ptr, 4);
+	return pg_atomic_unlocked_add_fetch_u32_impl(ptr, add_);
+}
+
 /*
  * pg_atomic_sub_fetch_u32 - atomically subtract from variable
  *
@@ -519,6 +534,15 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
 }
 
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+	return pg_atomic_unlocked_add_fetch_u64_impl(ptr, add_);
+}
+
 #undef INSIDE_ATOMICS_H
 
 #endif							/* ATOMICS_H */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index d3ba89a58f..1683653ca6 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -234,6 +234,16 @@ pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 }
 #endif
 
+#if !defined(PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32)
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	ptr->value += add_;
+	return ptr->value;
+}
+#endif
+
 #if !defined(PG_HAVE_ATOMIC_SUB_FETCH_U32) && defined(PG_HAVE_ATOMIC_FETCH_SUB_U32)
 #define PG_HAVE_ATOMIC_SUB_FETCH_U32
 static inline uint32
@@ -399,3 +409,26 @@ pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
 }
 #endif
+
+#if defined(PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY) && \
+	!defined(PG_HAVE_ATOMIC_U64_SIMULATION)
+
+#ifndef PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	ptr->value += val;
+	return ptr->value;
+}
+#endif
+
+#else
+
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	return pg_atomic_add_fetch_u64_impl(ptr, val);
+}
+
+#endif
-- 
2.20.1

v8-0002-Allow-XLogReadRecord-to-be-non-blocking.patch (text/x-patch)
From 743e11495e81af1f96ca304baf130b20dba056e5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 7 Apr 2020 22:56:27 +1200
Subject: [PATCH v8 2/3] Allow XLogReadRecord() to be non-blocking.

Extend read_local_xlog_page() to support non-blocking modes:

1. Reading as far as the WAL receiver has written so far.
2. Reading all the way to the end, when the end LSN is unknown.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/access/transam/xlogreader.c |  37 ++++--
 src/backend/access/transam/xlogutils.c  | 151 +++++++++++++++++-------
 src/backend/replication/walsender.c     |   2 +-
 src/include/access/xlogreader.h         |  20 +++-
 src/include/access/xlogutils.h          |  23 ++++
 5 files changed, 178 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 79ff976474..554b2029da 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -257,6 +257,9 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  * If the reading fails for some other reason, NULL is also returned, and
  * *errormsg is set to a string with details of the failure.
  *
+ * If the read_page callback is one that returns XLOGPAGEREAD_WOULDBLOCK rather
+ * than waiting for WAL to arrive, NULL is also returned in that case.
+ *
  * The returned pointer (or *errormsg) points to an internal buffer that's
  * valid until the next call to XLogReadRecord.
  */
@@ -546,10 +549,11 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 err:
 
 	/*
-	 * Invalidate the read state. We might read from a different source after
-	 * failure.
+	 * Invalidate the read state, if this was an error. We might read from a
+	 * different source after failure.
 	 */
-	XLogReaderInvalReadState(state);
+	if (readOff != XLOGPAGEREAD_WOULDBLOCK)
+		XLogReaderInvalReadState(state);
 
 	if (state->errormsg_buf[0] != '\0')
 		*errormsg = state->errormsg_buf;
@@ -561,8 +565,9 @@ err:
  * Read a single xlog page including at least [pageptr, reqLen] of valid data
  * via the read_page() callback.
  *
- * Returns -1 if the required page cannot be read for some reason; errormsg_buf
- * is set in that case (unless the error occurs in the read_page callback).
+ * Returns XLOGPAGEREAD_ERROR or XLOGPAGEREAD_WOULDBLOCK if the required page
+ * cannot be read for some reason; errormsg_buf is set in the former case
+ * (unless the error occurs in the read_page callback).
  *
  * We fetch the page from a reader-local cache if we know we have the required
  * data and if there hasn't been any error since caching the data.
@@ -659,8 +664,11 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
+	if (readLen == XLOGPAGEREAD_WOULDBLOCK)
+		return XLOGPAGEREAD_WOULDBLOCK;
+
 	XLogReaderInvalReadState(state);
-	return -1;
+	return XLOGPAGEREAD_ERROR;
 }
 
 /*
@@ -939,6 +947,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	XLogRecPtr	found = InvalidXLogRecPtr;
 	XLogPageHeader header;
 	char	   *errormsg;
+	int			readLen;
 
 	Assert(!XLogRecPtrIsInvalid(RecPtr));
 
@@ -952,7 +961,6 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		XLogRecPtr	targetPagePtr;
 		int			targetRecOff;
 		uint32		pageHeaderSize;
-		int			readLen;
 
 		/*
 		 * Compute targetRecOff. It should typically be equal or greater than
@@ -1033,7 +1041,8 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	}
 
 err:
-	XLogReaderInvalReadState(state);
+	if (readLen != XLOGPAGEREAD_WOULDBLOCK)
+		XLogReaderInvalReadState(state);
 
 	return InvalidXLogRecPtr;
 }
@@ -1084,13 +1093,23 @@ WALRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
 			tli != seg->ws_tli)
 		{
 			XLogSegNo	nextSegNo;
-
 			if (seg->ws_file >= 0)
 				close(seg->ws_file);
 
 			XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize);
 			seg->ws_file = openSegment(nextSegNo, segcxt, &tli);
 
+			/* callback reported that there was no such file */
+			if (seg->ws_file < 0)
+			{
+				errinfo->wre_errno = errno;
+				errinfo->wre_req = 0;
+				errinfo->wre_read = 0;
+				errinfo->wre_off = startoff;
+				errinfo->wre_seg = *seg;
+				return false;
+			}
+
 			/* Update the current segment info. */
 			seg->ws_tli = tli;
 			seg->ws_segno = nextSegNo;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 6cb143e161..2d702437dd 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
@@ -783,6 +784,30 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	}
 }
 
+/* openSegment callback for WALRead */
+static int
+wal_segment_try_open(XLogSegNo nextSegNo,
+					 WALSegmentContext *segcxt,
+					 TimeLineID *tli_p)
+{
+	TimeLineID	tli = *tli_p;
+	char		path[MAXPGPATH];
+	int			fd;
+
+	XLogFilePath(path, tli, nextSegNo, segcxt->ws_segsize);
+	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+	if (fd >= 0)
+		return fd;
+
+	if (errno != ENOENT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+
+	return -1;					/* file not found; let the caller handle it */
+}
+
 /* openSegment callback for WALRead */
 static int
 wal_segment_open(XLogSegNo nextSegNo, WALSegmentContext * segcxt,
@@ -831,58 +856,92 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	TimeLineID	tli;
 	int			count;
 	WALReadError errinfo;
+	bool		try_read = false;
+	XLogReadLocalOptions *options =
+		(XLogReadLocalOptions *) state->read_page_data;
 
 	loc = targetPagePtr + reqLen;
 
 	/* Loop waiting for xlog to be available if necessary */
 	while (1)
 	{
-		/*
-		 * Determine the limit of xlog we can currently read to, and what the
-		 * most recent timeline is.
-		 *
-		 * RecoveryInProgress() will update ThisTimeLineID when it first
-		 * notices recovery finishes, so we only have to maintain it for the
-		 * local process until recovery ends.
-		 */
-		if (!RecoveryInProgress())
-			read_upto = GetFlushRecPtr();
-		else
-			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
-		tli = ThisTimeLineID;
+		switch (options ? options->read_upto_policy : -1)
+		{
+		case XLRO_WALRCV_WRITTEN:
+			/*
+			 * We'll try to read as far as has been written by the WAL
+			 * receiver, on the requested timeline.  When we run out of valid
+			 * data, we'll return an error.  This is used by xlogprefetch.c
+			 * while streaming.
+			 */
+			read_upto = GetWalRcvWriteRecPtr();
+			try_read = true;
+			state->currTLI = tli = options->tli;
+			break;
 
-		/*
-		 * Check which timeline to get the record from.
-		 *
-		 * We have to do it each time through the loop because if we're in
-		 * recovery as a cascading standby, the current timeline might've
-		 * become historical. We can't rely on RecoveryInProgress() because in
-		 * a standby configuration like
-		 *
-		 * A => B => C
-		 *
-		 * if we're a logical decoding session on C, and B gets promoted, our
-		 * timeline will change while we remain in recovery.
-		 *
-		 * We can't just keep reading from the old timeline as the last WAL
-		 * archive in the timeline will get renamed to .partial by
-		 * StartupXLOG().
-		 *
-		 * If that happens after our caller updated ThisTimeLineID but before
-		 * we actually read the xlog page, we might still try to read from the
-		 * old (now renamed) segment and fail. There's not much we can do
-		 * about this, but it can only happen when we're a leaf of a cascading
-		 * standby whose master gets promoted while we're decoding, so a
-		 * one-off ERROR isn't too bad.
-		 */
-		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+		case XLRO_END:
+			/*
+			 * We'll try to read as far as we can on one timeline.  This is
+			 * used by xlogprefetch.c for crash recovery.
+			 */
+			read_upto = (XLogRecPtr) -1;
+			try_read = true;
+			state->currTLI = tli = options->tli;
+			break;
+
+		default:
+			/*
+			 * Determine the limit of xlog we can currently read to, and what the
+			 * most recent timeline is.
+			 *
+			 * RecoveryInProgress() will update ThisTimeLineID when it first
+			 * notices recovery finishes, so we only have to maintain it for
+			 * the local process until recovery ends.
+			 */
+			if (!RecoveryInProgress())
+				read_upto = GetFlushRecPtr();
+			else
+				read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
+			tli = ThisTimeLineID;
+
+			/*
+			 * Check which timeline to get the record from.
+			 *
+			 * We have to do it each time through the loop because if we're in
+			 * recovery as a cascading standby, the current timeline might've
+			 * become historical. We can't rely on RecoveryInProgress()
+			 * because in a standby configuration like
+			 *
+			 * A => B => C
+			 *
+			 * if we're a logical decoding session on C, and B gets promoted,
+			 * our timeline will change while we remain in recovery.
+			 *
+			 * We can't just keep reading from the old timeline as the last
+			 * WAL archive in the timeline will get renamed to .partial by
+			 * StartupXLOG().
+			 *
+			 * If that happens after our caller updated ThisTimeLineID but
+			 * before we actually read the xlog page, we might still try to
+			 * read from the old (now renamed) segment and fail. There's not
+			 * much we can do about this, but it can only happen when we're a
+			 * leaf of a cascading standby whose master gets promoted while
+			 * we're decoding, so a one-off ERROR isn't too bad.
+			 */
+			XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+			break;
+		}
 
-		if (state->currTLI == ThisTimeLineID)
+		if (state->currTLI == tli)
 		{
 
 			if (loc <= read_upto)
 				break;
 
+			/* not enough data there, but we were asked not to wait */
+			if (options && options->nowait)
+				return XLOGPAGEREAD_WOULDBLOCK;
+
 			CHECK_FOR_INTERRUPTS();
 			pg_usleep(1000L);
 		}
@@ -924,7 +983,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	else if (targetPagePtr + reqLen > read_upto)
 	{
 		/* not enough data there */
-		return -1;
+		return XLOGPAGEREAD_ERROR;
 	}
 	else
 	{
@@ -938,8 +997,18 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
 	if (!WALRead(cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &state->seg,
-				 &state->segcxt, wal_segment_open, &errinfo))
+				 &state->segcxt,
+				 try_read ? wal_segment_try_open : wal_segment_open,
+				 &errinfo))
+	{
+		/*
+		 * When on one single timeline, we may read past the end of available
+		 * segments.  Report lack of file as an error.
+		 */
+		if (try_read)
+			return XLOGPAGEREAD_ERROR;
 		WALReadRaiseError(&errinfo);
+	}
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 122d884f3e..15ff3d35e4 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -818,7 +818,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 
 	/* fail if not (implies we are going to shut down) */
 	if (flushptr < targetPagePtr + reqLen)
-		return -1;
+		return XLOGPAGEREAD_ERROR;
 
 	if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
 		count = XLOG_BLCKSZ;	/* more than one block available */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196e18..a3ac7f414b 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -50,6 +50,10 @@ typedef struct WALSegmentContext
 
 typedef struct XLogReaderState XLogReaderState;
 
+/* Special negative return values for XLogPageReadCB functions */
+#define XLOGPAGEREAD_ERROR		-1
+#define XLOGPAGEREAD_WOULDBLOCK	-2
+
 /* Function type definition for the read_page callback */
 typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
@@ -99,10 +103,13 @@ struct XLogReaderState
 	 * This callback shall read at least reqLen valid bytes of the xlog page
 	 * starting at targetPagePtr, and store them in readBuf.  The callback
 	 * shall return the number of bytes read (never more than XLOG_BLCKSZ), or
-	 * -1 on failure.  The callback shall sleep, if necessary, to wait for the
-	 * requested bytes to become available.  The callback will not be invoked
-	 * again for the same page unless more than the returned number of bytes
-	 * are needed.
+	 * XLOGPAGEREAD_ERROR on failure.  The callback may either sleep or return
+	 * XLOGPAGEREAD_WOULDBLOCK, if necessary, to wait for the requested bytes
+	 * to become available.  If a callback that can return
+	 * XLOGPAGEREAD_WOULDBLOCK is installed, the reader client must expect to
+	 * fail to read when there is not enough data.  The callback will not be
+	 * invoked again for the same page unless more than the returned number of
+	 * bytes are needed.
 	 *
 	 * targetRecPtr is the position of the WAL record we're reading.  Usually
 	 * it is equal to targetPagePtr + reqLen, but sometimes xlogreader needs
@@ -126,6 +133,11 @@ struct XLogReaderState
 	 */
 	void	   *private_data;
 
+	/*
+	 * Opaque data for callbacks to use.  Not used by XLogReader.
+	 */
+	void	   *read_page_data;
+
 	/*
 	 * Start and end point of last record read.  EndRecPtr is also used as the
 	 * position to read next.  Calling XLogBeginRead() sets EndRecPtr to the
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5181a077d9..89c9ce90f8 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,6 +47,29 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
 
+/*
+ * A pointer to an XLogReadLocalOptions struct can be supplied as the read_page_data
+ * for an XLogReader, causing read_local_xlog_page() to modify its behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+	/* Don't block waiting for new WAL to arrive. */
+	bool		nowait;
+
+	/*
+	 * For XLRO_WALRCV_WRITTEN and XLRO_END modes, the timeline ID must be
+	 * provided.
+	 */
+	TimeLineID	tli;
+
+	/* How far to read. */
+	enum {
+		XLRO_STANDARD,
+		XLRO_WALRCV_WRITTEN,
+		XLRO_END
+	} read_upto_policy;
+} XLogReadLocalOptions;
+
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
 								 XLogRecPtr targetRecPtr, char *cur_page);
-- 
2.20.1

v8-0003-Prefetch-referenced-blocks-during-recovery.patch (text/x-patch)
From 4e5ac5a6dbaa3ff519fbd2d8acf9b7d9756ad2cb Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Apr 2020 23:04:51 +1200
Subject: [PATCH v8 3/3] Prefetch referenced blocks during recovery.

Introduce a new GUC max_recovery_prefetch_distance.  If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks.  The
goal is to avoid I/O stalls and benefit from concurrent I/O.  The number
of concurrency asynchronous reads is capped by the existing
maintenance_io_concurrency GUC.  The feature is enabled by default for
now, but we might reconsider that before release.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  45 +
 doc/src/sgml/monitoring.sgml                  |  81 ++
 doc/src/sgml/wal.sgml                         |  13 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             |  16 +
 src/backend/access/transam/xlogprefetch.c     | 905 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  14 +
 src/backend/postmaster/pgstat.c               |  96 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  47 +-
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/access/xlogprefetch.h             |  85 ++
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/pgstat.h                          |  27 +
 src/include/utils/guc.h                       |   4 +
 src/test/regress/expected/rules.out           |  11 +
 16 files changed, 1359 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetch.c
 create mode 100644 src/include/access/xlogprefetch.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a0da4aabac..18979d0496 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3121,6 +3121,51 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-recovery-prefetch-distance" xreflabel="max_recovery_prefetch_distance">
+      <term><varname>max_recovery_prefetch_distance</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_recovery_prefetch_distance</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        The maximum distance to look ahead in the WAL during recovery, to find
+        blocks to prefetch.  Prefetching blocks that will soon be needed can
+        reduce I/O wait times.  The number of concurrent prefetches is limited
+        by this setting as well as
+        <xref linkend="guc-maintenance-io-concurrency"/>.  Setting it too high
+        might be counterproductive, if it means that data falls out of the
+        kernel cache before it is needed.  If this value is specified without
+        units, it is taken as bytes.  A setting of -1 disables prefetching
+        during recovery.
+        The default is 256kB on systems that support
+        <function>posix_fadvise</function>, and otherwise -1.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+      <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks that were logged with full page images,
+        during recovery.  Often this doesn't help, since such blocks will not
+        be read the first time they are needed and might remain in the buffer
+        pool after that.  However, on file systems with a block size larger
+        than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a
+        costly read-before-write when blocks are later written.  This
+        setting has no effect unless
+        <xref linkend="guc-max-recovery-prefetch-distance"/> is set to a positive
+        number.  The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c50b72137f..ddf2ee1f96 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2223,6 +2230,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    connected server.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+    </row>
+    <row>
+     <entry><structfield>skip_seq</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+    </row>
+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_distance</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>How far ahead of recovery the prefetcher is on average, while recovery is not idle</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_queue_depth</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>Average number of prefetches in flight while recovery is not idle</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-max-recovery-prefetch-distance"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-max-recovery-prefetch-distance"/>,
+   <xref linkend="guc-recovery-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="3">
@@ -3446,6 +3525,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        counters shown in the <structname>pg_stat_bgwriter</structname> view.
        Calling <literal>pg_stat_reset_shared('archiver')</literal> will zero all the
        counters shown in the <structname>pg_stat_archiver</structname> view.
+       Calling <literal>pg_stat_reset_shared('prefetch_recovery')</literal> will zero all the
+       counters shown in the <structname>pg_stat_prefetch_recovery</structname> view.
       </entry>
      </row>
 
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index bd9fae544c..38fc8149a8 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -719,6 +719,19 @@
    <acronym>WAL</acronym> call being logged to the server log. This
    option might be replaced by a more general mechanism in the future.
   </para>
+
+  <para>
+   The <xref linkend="guc-max-recovery-prefetch-distance"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed, in combination with the
+   <xref linkend="guc-maintenance-io-concurrency"/> parameter.  The
+   prefetching mechanism is most likely to be effective on systems
+   with <varname>full_page_writes</varname> set to
+   <varname>off</varname> (where that is safe), and where the working
+   set is larger than RAM.  By default, prefetching in recovery is enabled,
+   but it can be disabled by setting the distance to -1.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..39f9d4e77d 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetch.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c38bc1412d..05a1c0ded8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetch.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -7143,6 +7144,7 @@ StartupXLOG(void)
 		{
 			ErrorContextCallback errcallback;
 			TimestampTz xtime;
+			XLogPrefetchState prefetch;
 
 			InRedo = true;
 
@@ -7150,6 +7152,9 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
 
+			/* Prepare to prefetch, if configured. */
+			XLogPrefetchBegin(&prefetch);
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7179,6 +7184,12 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/* Perform WAL prefetching, if enabled. */
+				XLogPrefetch(&prefetch,
+							 ThisTimeLineID,
+							 xlogreader->ReadRecPtr,
+							 currentSource == XLOG_FROM_STREAM);
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -7350,6 +7361,9 @@ StartupXLOG(void)
 					 */
 					if (switchedTLI && AllowCascadeReplication())
 						WalSndWakeup();
+
+					/* Reset the prefetcher. */
+					XLogPrefetchReconfigure();
 				}
 
 				/* Exit loop if we reached inclusive recovery target */
@@ -7366,6 +7380,7 @@ StartupXLOG(void)
 			/*
 			 * end of main redo apply loop
 			 */
+			XLogPrefetchEnd(&prefetch);
 
 			if (reachedRecoveryTarget)
 			{
@@ -12094,6 +12109,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					currentSource = XLOG_FROM_STREAM;
 					startWalReceiver = true;
+					XLogPrefetchReconfigure();
 					break;
 
 				case XLOG_FROM_STREAM:
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
new file mode 100644
index 0000000000..7d3aea53f7
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -0,0 +1,905 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetch.c
+ *
+ * The goal of this module is to read future WAL records and issue
+ * PrefetchSharedBuffer() calls for referenced blocks, so that we avoid I/O
+ * stalls in the main recovery loop.  Currently, this is achieved by using a
+ * separate XLogReader to read ahead.  In future, we should find a way to
+ * avoid reading and decoding each record twice.
+ *
+ * When examining a WAL record from the future, we need to consider that a
+ * referenced block or segment file might not exist on disk until this record
+ * or some earlier record has been replayed.  After a crash, a file might also
+ * be missing because it was dropped by a later WAL record; in that case, it
+ * will be recreated when this record is replayed.  These cases are handled by
+ * recognizing them and adding a "filter" that prevents all prefetching of a
+ * certain block range until the present WAL record has been replayed.  Blocks
+ * skipped for these reasons are counted as "skip_new" (that is, cases where we
+ * didn't try to prefetch "new" blocks).
+ *
+ * Blocks found in the buffer pool already are counted as "skip_hit".
+ * Repeated access to the same buffer is detected and skipped, and this is
+ * counted with "skip_seq".  Blocks that were logged with FPWs are skipped if
+ * recovery_prefetch_fpw is off, since on most systems there will be no I/O
+ * stall; this is counted with "skip_fpw".
+ *
+ * The only way we currently have to know that an I/O initiated with
+ * PrefetchSharedBuffer() has completed is to call ReadBuffer().  Therefore,
+ * we track the number of potentially in-flight I/Os by using a circular
+ * buffer of LSNs.  When it's full, we have to wait for recovery to replay
+ * records so that the queue depth can be reduced, before we can do any more
+ * prefetching.  Ideally, this keeps us the right distance ahead to respect
+ * maintenance_io_concurrency.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetch.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log
+ * message that appears at the end of crash recovery.  It's also used to send
+ * messages periodically to the stats collector, to save the counters on disk.
+ */
+#define XLOGPREFETCHER_SAMPLE_DISTANCE 0x40000
+
+/* GUCs */
+int			max_recovery_prefetch_distance = -1;
+bool		recovery_prefetch_fpw = false;
+
+int			XLogPrefetchReconfigureCount;
+
+/*
+ * A prefetcher object.  There is at most one of these in existence at a time,
+ * recreated whenever there is a configuration change.
+ */
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	XLogReadLocalOptions options;
+	bool			have_record;
+	bool			shutdown;
+	int				next_block_id;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation	last_reln;
+	RelFileNode		last_rnode;
+	BlockNumber		last_blkno;
+
+	/* Online averages. */
+	uint64			samples;
+	double			avg_queue_depth;
+	double			avg_distance;
+	XLogRecPtr		next_sample_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB		   *filter_table;
+	dlist_head		filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	int				prefetch_head;
+	int				prefetch_tail;
+	int				prefetch_queue_size;
+	XLogRecPtr		prefetch_queue[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode		rnode;
+	XLogRecPtr		filter_until_replayed;
+	BlockNumber		filter_from_block;
+	dlist_node		link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time; /* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 skip_hit;	/* Blocks already buffered. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+	pg_atomic_uint64 skip_seq;	/* Repeat blocks skipped. */
+	float			avg_distance;
+	float			avg_queue_depth;
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32			reset_handled;
+
+	/* Dynamic values */
+	int				distance;	/* Number of bytes ahead in the WAL. */
+	int				queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+static void XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher,
+									  XLogRecPtr replaying_lsn);
+static bool XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher);
+static void XLogPrefetchSaveStats(void);
+static void XLogPrefetchRestoreStats(void);
+
+static XLogPrefetchStats *Stats;
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&Stats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&Stats->prefetch, 0);
+	pg_atomic_write_u64(&Stats->skip_hit, 0);
+	pg_atomic_write_u64(&Stats->skip_new, 0);
+	pg_atomic_write_u64(&Stats->skip_fpw, 0);
+	pg_atomic_write_u64(&Stats->skip_seq, 0);
+	Stats->avg_distance = 0;
+	Stats->avg_queue_depth = 0;
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	Stats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+	if (!found)
+	{
+		pg_atomic_init_u32(&Stats->reset_request, 0);
+		Stats->reset_handled = 0;
+		pg_atomic_init_u64(&Stats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&Stats->prefetch, 0);
+		pg_atomic_init_u64(&Stats->skip_hit, 0);
+		pg_atomic_init_u64(&Stats->skip_new, 0);
+		pg_atomic_init_u64(&Stats->skip_fpw, 0);
+		pg_atomic_init_u64(&Stats->skip_seq, 0);
+		Stats->avg_distance = 0;
+		Stats->avg_queue_depth = 0;
+		Stats->distance = 0;
+		Stats->queue_depth = 0;
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&Stats->reset_request, 1);
+}
+
+/*
+ * Tell the stats collector to serialize the shared memory counters into the
+ * stats file.
+ */
+static void
+XLogPrefetchSaveStats(void)
+{
+	PgStat_RecoveryPrefetchStats serialized = {
+		.prefetch = pg_atomic_read_u64(&Stats->prefetch),
+		.skip_hit = pg_atomic_read_u64(&Stats->skip_hit),
+		.skip_new = pg_atomic_read_u64(&Stats->skip_new),
+		.skip_fpw = pg_atomic_read_u64(&Stats->skip_fpw),
+		.skip_seq = pg_atomic_read_u64(&Stats->skip_seq),
+		.stat_reset_timestamp = pg_atomic_read_u64(&Stats->reset_time)
+	};
+
+	pgstat_send_recoveryprefetch(&serialized);
+}
+
+/*
+ * Try to restore the shared memory counters from the stats file.
+ */
+static void
+XLogPrefetchRestoreStats(void)
+{
+	PgStat_RecoveryPrefetchStats *serialized = pgstat_fetch_recoveryprefetch();
+
+	if (serialized->stat_reset_timestamp != 0)
+	{
+		pg_atomic_write_u64(&Stats->prefetch, serialized->prefetch);
+		pg_atomic_write_u64(&Stats->skip_hit, serialized->skip_hit);
+		pg_atomic_write_u64(&Stats->skip_new, serialized->skip_new);
+		pg_atomic_write_u64(&Stats->skip_fpw, serialized->skip_fpw);
+		pg_atomic_write_u64(&Stats->skip_seq, serialized->skip_seq);
+		pg_atomic_write_u64(&Stats->reset_time, serialized->stat_reset_timestamp);
+	}
+}
+
+/*
+ * Initialize an XLogPrefetchState object and restore the last saved
+ * statistics from disk.
+ */
+void
+XLogPrefetchBegin(XLogPrefetchState *state)
+{
+	XLogPrefetchRestoreStats();
+
+	/* We'll reconfigure on the first call to XLogPrefetch(). */
+	state->prefetcher = NULL;
+	state->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+}
+
+/*
+ * Shut down the prefetching infrastructure, if configured.
+ */
+void
+XLogPrefetchEnd(XLogPrefetchState *state)
+{
+	XLogPrefetchSaveStats();
+
+	if (state->prefetcher)
+		XLogPrefetcherFree(state->prefetcher);
+	state->prefetcher = NULL;
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(TimeLineID tli, XLogRecPtr lsn, bool streaming)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	/*
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.  We add one to the size
+	 * because our circular buffer has a gap between head and tail when full.
+	 */
+	prefetcher = palloc0(offsetof(XLogPrefetcher, prefetch_queue) +
+						 sizeof(XLogRecPtr) * (maintenance_io_concurrency + 1));
+	prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
+	prefetcher->options.tli = tli;
+	prefetcher->options.nowait = true;
+	if (streaming)
+	{
+		/*
+		 * We're only allowed to read as far as the WAL receiver has written.
+		 * We don't have to wait for it to be flushed, though, as recovery
+		 * does, so that gives us a chance to get a bit further ahead.
+		 */
+		prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+	}
+	else
+	{
+		/* Read as far as we can. */
+		prefetcher->options.read_upto_policy = XLRO_END;
+	}
+	prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+											NULL,
+											read_local_xlog_page,
+											NULL);
+	prefetcher->reader->read_page_data = &prefetcher->options;
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	/* Prepare to read at the given LSN. */
+	ereport(LOG,
+			(errmsg("recovery started prefetching on timeline %u at %X/%X",
+					tli,
+					(uint32) (lsn >> 32), (uint32) lsn)));
+	XLogBeginRead(prefetcher->reader, lsn);
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	/* Log final statistics. */
+	ereport(LOG,
+			(errmsg("recovery finished prefetching at %X/%X; "
+					"prefetch = " UINT64_FORMAT ", "
+					"skip_hit = " UINT64_FORMAT ", "
+					"skip_new = " UINT64_FORMAT ", "
+					"skip_fpw = " UINT64_FORMAT ", "
+					"skip_seq = " UINT64_FORMAT ", "
+					"avg_distance = %f, "
+					"avg_queue_depth = %f",
+			 (uint32) (prefetcher->reader->EndRecPtr >> 32),
+			 (uint32) (prefetcher->reader->EndRecPtr),
+			 pg_atomic_read_u64(&Stats->prefetch),
+			 pg_atomic_read_u64(&Stats->skip_hit),
+			 pg_atomic_read_u64(&Stats->skip_new),
+			 pg_atomic_read_u64(&Stats->skip_fpw),
+			 pg_atomic_read_u64(&Stats->skip_seq),
+			 Stats->avg_distance,
+			 Stats->avg_queue_depth)));
+	XLogReaderFree(prefetcher->reader);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Called when recovery is replaying a new LSN, to check if we can read ahead.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	uint32		reset_request;
+
+	/* If an error has occurred or we've hit the end of the WAL, do nothing. */
+	if (prefetcher->shutdown)
+		return;
+
+	/*
+	 * Have any in-flight prefetches definitely completed, judging by the LSN
+	 * that is currently being replayed?
+	 */
+	XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+	/*
+	 * Do we already have the maximum permitted number of I/Os running
+	 * (according to the information we have)?  If so, we have to wait for at
+	 * least one to complete, so give up early and let recovery catch up.
+	 */
+	if (XLogPrefetcherSaturated(prefetcher))
+		return;
+
+	/*
+	 * Can we drop any filters yet?  This happens when the LSN that is
+	 * currently being replayed has moved past a record that prevents
+	 * prefetching of a block range, such as relation extension.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&Stats->reset_request);
+	if (reset_request != Stats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		Stats->reset_handled = reset_request;
+		prefetcher->avg_distance = 0;
+		prefetcher->avg_queue_depth = 0;
+		prefetcher->samples = 0;
+	}
+
+	/* OK, we can now try reading ahead. */
+	XLogPrefetcherScanRecords(prefetcher, replaying_lsn);
+}
+
+/*
+ * Read ahead as far as we are allowed to, considering the LSN that recovery
+ * is currently replaying.
+ */
+static void
+XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	XLogReaderState *reader = prefetcher->reader;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	for (;;)
+	{
+		char *error;
+		int64		distance;
+
+		/* If we don't already have a record, then try to read one. */
+		if (!prefetcher->have_record)
+		{
+			if (!XLogReadRecord(reader, &error))
+			{
+				/* If we got an error, log it and give up. */
+				if (error)
+				{
+					ereport(LOG, (errmsg("recovery no longer prefetching: %s", error)));
+					prefetcher->shutdown = true;
+					Stats->queue_depth = 0;
+					Stats->distance = 0;
+				}
+				/* Otherwise, we'll try again later when more data is here. */
+				return;
+			}
+			prefetcher->have_record = true;
+			prefetcher->next_block_id = 0;
+		}
+
+		/* How far ahead of replay are we now? */
+		distance = prefetcher->reader->ReadRecPtr - replaying_lsn;
+
+		/* Update distance shown in shm. */
+		Stats->distance = distance;
+
+		/* Periodically recompute some statistics. */
+		if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+		{
+			/* Compute online averages. */
+			prefetcher->samples++;
+			if (prefetcher->samples == 1)
+			{
+				prefetcher->avg_distance = Stats->distance;
+				prefetcher->avg_queue_depth = Stats->queue_depth;
+			}
+			else
+			{
+				prefetcher->avg_distance +=
+					(Stats->distance - prefetcher->avg_distance) /
+					prefetcher->samples;
+				prefetcher->avg_queue_depth +=
+					(Stats->queue_depth - prefetcher->avg_queue_depth) /
+					prefetcher->samples;
+			}
+
+			/* Expose it in shared memory. */
+			Stats->avg_distance = prefetcher->avg_distance;
+			Stats->avg_queue_depth = prefetcher->avg_queue_depth;
+
+			/* Also periodically save the simple counters. */
+			XLogPrefetchSaveStats();
+
+			prefetcher->next_sample_lsn =
+				replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
+		}
+
+		/* Are we too far ahead of replay? */
+		if (distance >= max_recovery_prefetch_distance)
+			break;
+
+		/* Are we not far enough ahead? */
+		if (distance <= 0)
+		{
+			prefetcher->have_record = false;	/* skip this record */
+			continue;
+		}
+
+		/*
+		 * If this is a record that creates a new SMGR relation, we'll avoid
+		 * prefetching anything from that rnode until it has been replayed.
+		 */
+		if (replaying_lsn < reader->ReadRecPtr &&
+			XLogRecGetRmid(reader) == RM_SMGR_ID &&
+			(XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+		{
+			xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+			XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+									reader->ReadRecPtr);
+		}
+
+		/* Scan the record's block references. */
+		if (!XLogPrefetcherScanBlocks(prefetcher))
+			return;
+
+		/* Advance to the next record. */
+		prefetcher->have_record = false;
+	}
+}
+
+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+	XLogReaderState *reader = prefetcher->reader;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	/*
+	 * We might already have been partway through processing this record when
+	 * our queue became saturated, so we need to start where we left off.
+	 */
+	for (int block_id = prefetcher->next_block_id;
+		 block_id <= reader->max_block_id;
+		 ++block_id)
+	{
+		PrefetchBufferResult prefetch;
+		DecodedBkpBlock *block = &reader->blocks[block_id];
+		SMgrRelation reln;
+
+		/* Ignore everything but the main fork for now. */
+		if (block->forknum != MAIN_FORKNUM)
+			continue;
+
+		/*
+		 * If there is a full page image attached, we won't be reading the
+		 * page, so you might think we should skip it.  However, if the
+		 * underlying filesystem uses larger logical blocks than us, it
+		 * might still need to perform a read-before-write some time later.
+		 * Therefore, only prefetch if configured to do so.
+		 */
+		if (block->has_image && !recovery_prefetch_fpw)
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
+			continue;
+		}
+
+		/*
+		 * If this block will initialize a new page then it's probably an
+		 * extension.  Since it might create a new segment, we can't try
+		 * to prefetch this block until the record has been replayed, or we
+		 * might try to open a file that doesn't exist yet.
+		 */
+		if (block->flags & BKPBLOCK_WILL_INIT)
+		{
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+									reader->ReadRecPtr);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Should we skip this block due to a filter? */
+		if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Fast path for repeated references to the same relation. */
+		if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+		{
+			/*
+			 * If this is a repeat access to the same block, then skip it.
+			 *
+			 * XXX We could also check for last_blkno + 1 and update
+			 * last_blkno; it's not clear if the kernel would do a better job
+			 * of sequential prefetching.
+			 */
+			if (block->blkno == prefetcher->last_blkno)
+			{
+				pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
+				continue;
+			}
+
+			/* We can avoid calling smgropen(). */
+			reln = prefetcher->last_reln;
+		}
+		else
+		{
+			/* Otherwise we have to open it. */
+			reln = smgropen(block->rnode, InvalidBackendId);
+			prefetcher->last_rnode = block->rnode;
+			prefetcher->last_reln = reln;
+		}
+		prefetcher->last_blkno = block->blkno;
+
+		/* Try to prefetch this block! */
+		prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+		if (BufferIsValid(prefetch.recent_buffer))
+		{
+			/*
+			 * It was already cached, so do nothing.  Perhaps in future we
+			 * could remember the buffer so that recovery doesn't have to look
+			 * it up again.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
+		}
+		else if (prefetch.initiated_io)
+		{
+			/*
+			 * I/O has possibly been initiated (though we don't know if it
+			 * was already cached by the kernel, so we just have to assume
+			 * that it has due to lack of better information).  Record
+			 * this as an I/O in progress until eventually we replay this
+			 * LSN.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->prefetch, 1);
+			XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+			/*
+			 * If the queue is now full, we'll have to wait before processing
+			 * any more blocks from this record, or move to a new record if
+			 * that was the last block.
+			 */
+			if (XLogPrefetcherSaturated(prefetcher))
+			{
+				prefetcher->next_block_id = block_id + 1;
+				return false;
+			}
+		}
+		else
+		{
+			/*
+			 * Neither cached nor initiated.  The underlying segment file
+			 * doesn't exist.  Presumably it will be unlinked by a later WAL
+			 * record.  When recovery reads this block, it will use the
+			 * EXTENSION_CREATE_RECOVERY flag.  We certainly don't want to do
+			 * that sort of thing while merely prefetching, so let's just
+			 * ignore references to this relation until this record is
+			 * replayed, and let recovery create the dummy file or complain if
+			 * something is wrong.
+			 */
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+									reader->ReadRecPtr);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&Stats->reset_request) != Stats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&Stats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&Stats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_new));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_fpw));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_seq));
+	values[6] = Int32GetDatum(Stats->distance);
+	values[7] = Int32GetDatum(Stats->queue_depth);
+	values[8] = Float4GetDatum(Stats->avg_distance);
+	values[9] = Float4GetDatum(Stats->avg_queue_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the (presumably lower) block
+		 * number there because we don't want to have to track individual
+		 * blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of the
+	 * time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+												   HASH_FIND, NULL);
+
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Insert an LSN into the queue.  The queue must not be full already.  This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+						  XLogRecPtr prefetching_lsn)
+{
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+	prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+	prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+	Stats->queue_depth++;
+	Assert(Stats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet?  That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches.  For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+		   prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+	{
+		prefetcher->prefetch_tail++;
+		prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+		Stats->queue_depth--;
+		Assert(Stats->queue_depth >= 0);
+	}
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+	return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+		prefetcher->prefetch_tail;
+}
+
+void
+assign_max_recovery_prefetch_distance(int new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	max_recovery_prefetch_distance = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
+
+void
+assign_recovery_prefetch_fpw(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch_fpw = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d406ea8118..3b15f5ef8e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -825,6 +825,20 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.skip_hit,
+            s.skip_new,
+            s.skip_fpw,
+            s.skip_seq,
+            s.distance,
+            s.queue_depth,
+            s.avg_distance,
+            s.avg_queue_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 50eea2e8a8..6c9ac5b29b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "access/xlogprefetch.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -276,6 +277,7 @@ static int	localNumBackends = 0;
 static PgStat_ArchiverStats archiverStats;
 static PgStat_GlobalStats globalStats;
 static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
+static PgStat_RecoveryPrefetchStats recoveryPrefetchStats;
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -348,6 +350,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
+static void pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1364,11 +1367,20 @@ pgstat_reset_shared_counters(const char *target)
 		msg.m_resettarget = RESET_ARCHIVER;
 	else if (strcmp(target, "bgwriter") == 0)
 		msg.m_resettarget = RESET_BGWRITER;
+	else if (strcmp(target, "prefetch_recovery") == 0)
+	{
+		/*
+		 * We can't ask the stats collector to do this for us as it is not
+		 * attached to shared memory.
+		 */
+		XLogPrefetchRequestResetStats();
+		return;
+	}
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("unrecognized reset target: \"%s\"", target),
-				 errhint("Target must be \"archiver\" or \"bgwriter\".")));
+				 errhint("Target must be \"archiver\", \"bgwriter\" or \"prefetch_recovery\".")));
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
 	pgstat_send(&msg, sizeof(msg));
@@ -2690,6 +2702,22 @@ pgstat_fetch_slru(void)
 }
 
 
+/*
+ * ---------
+ * pgstat_fetch_recoveryprefetch() -
+ *
+ *	Support function for restoring the counters managed by xlogprefetch.c.
+ * ---------
+ */
+PgStat_RecoveryPrefetchStats *
+pgstat_fetch_recoveryprefetch(void)
+{
+	backend_read_statsfile();
+
+	return &recoveryPrefetchStats;
+}
+
+
 /* ------------------------------------------------------------
  * Functions for management of the shared-memory PgBackendStatus array
  * ------------------------------------------------------------
@@ -4440,6 +4468,23 @@ pgstat_send_slru(void)
 }
 
 
+/* ----------
+ * pgstat_send_recoveryprefetch() -
+ *
+ *		Send recovery prefetch statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats)
+{
+	PgStat_MsgRecoveryPrefetch msg;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYPREFETCH);
+	msg.m_stats = *stats;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
 /* ----------
  * PgstatCollectorMain() -
  *
@@ -4636,6 +4681,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_slru(&msg.msg_slru, len);
 					break;
 
+				case PGSTAT_MTYPE_RECOVERYPREFETCH:
+					pgstat_recv_recoveryprefetch(&msg.msg_recoveryprefetch, len);
+					break;
+
 				case PGSTAT_MTYPE_FUNCSTAT:
 					pgstat_recv_funcstat(&msg.msg_funcstat, len);
 					break;
@@ -4911,6 +4960,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
 	rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
 	(void) rc;					/* we'll check for error with ferror */
 
+	/*
+	 * Write recovery prefetch stats struct
+	 */
+	rc = fwrite(&recoveryPrefetchStats, sizeof(recoveryPrefetchStats), 1,
+				fpout);
+	(void) rc;					/* we'll check for error with ferror */
+
 	/*
 	 * Walk through the database table.
 	 */
@@ -5170,6 +5226,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 	memset(&globalStats, 0, sizeof(globalStats));
 	memset(&archiverStats, 0, sizeof(archiverStats));
 	memset(&slruStats, 0, sizeof(slruStats));
+	memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
 
 	/*
 	 * Set the current timestamp (will be kept only in case we can't load an
@@ -5257,6 +5314,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 		goto done;
 	}
 
+	/*
+	 * Read recoveryPrefetchStats struct
+	 */
+	if (fread(&recoveryPrefetchStats, 1, sizeof(recoveryPrefetchStats),
+			  fpin) != sizeof(recoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
+		goto done;
+	}
+
 	/*
 	 * We found an existing collector stats file. Read it and put all the
 	 * hashtable entries into place.
@@ -5556,6 +5625,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 	PgStat_GlobalStats myGlobalStats;
 	PgStat_ArchiverStats myArchiverStats;
 	PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
+	PgStat_RecoveryPrefetchStats myRecoveryPrefetchStats;
 	FILE	   *fpin;
 	int32		format_id;
 	const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -5621,6 +5691,18 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 		return false;
 	}
 
+	/*
+	 * Read recovery prefetch stats struct
+	 */
+	if (fread(&myRecoveryPrefetchStats, 1, sizeof(myRecoveryPrefetchStats),
+			  fpin) != sizeof(myRecoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		FreeFile(fpin);
+		return false;
+	}
+
 	/* By default, we're going to return the timestamp of the global file. */
 	*ts = myGlobalStats.stats_timestamp;
 
@@ -6420,6 +6502,18 @@ pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
 	slruStats[msg->m_index].truncate += msg->m_truncate;
 }
 
+/* ----------
+ * pgstat_recv_recoveryprefetch() -
+ *
+ *	Process a recovery prefetch message.
+ * ----------
+ */
+static void
+pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len)
+{
+	recoveryPrefetchStats = msg->m_stats;
+}
+
 /* ----------
  * pgstat_recv_recoveryconflict() -
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..221081bddc 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/xlogprefetch.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -124,6 +125,7 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, PredicateLockShmemSize());
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
+		size = add_size(size, XLogPrefetchShmemSize());
 		size = add_size(size, CLOGShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
@@ -212,6 +214,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5bdc02fce2..5ed7ed13e8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -34,6 +34,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetch.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/storage.h"
@@ -198,6 +199,7 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1272,6 +1274,18 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"recovery_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch blocks that have full page images in the WAL."),
+			gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+						 "entirely overwritten, but if the logical page size of the filesystem is "
+						 "larger than PostgreSQL's, this can be beneficial.  This option has no "
+						 "effect unless max_recovery_prefetch_distance is set to a positive number.")
+		},
+		&recovery_prefetch_fpw,
+		false,
+		NULL, assign_recovery_prefetch_fpw, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2649,6 +2663,22 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"max_recovery_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum number of bytes to read ahead in the WAL to prefetch referenced blocks."),
+			gettext_noop("Set to -1 to disable prefetching during recovery."),
+			GUC_UNIT_BYTE
+		},
+		&max_recovery_prefetch_distance,
+#ifdef USE_PREFETCH
+		256 * 1024,
+#else
+		-1,
+#endif
+		-1, INT_MAX,
+		NULL, assign_max_recovery_prefetch_distance, NULL
+	},
+
 	{
 		{"wal_keep_segments", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the number of WAL files held for standby servers."),
@@ -2968,7 +2998,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -11586,6 +11617,20 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 995b6ca155..55cce90763 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -230,6 +230,11 @@
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
 
+# - Prefetching during recovery -
+
+#max_recovery_prefetch_distance = 256kB	# -1 disables prefetching
+#recovery_prefetch_fpw = off	# whether to prefetch pages logged with FPW
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlogprefetch.h b/src/include/access/xlogprefetch.h
new file mode 100644
index 0000000000..d8e2e1ca50
--- /dev/null
+++ b/src/include/access/xlogprefetch.h
@@ -0,0 +1,85 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCH_H
+#define XLOGPREFETCH_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern int	max_recovery_prefetch_distance;
+extern bool recovery_prefetch_fpw;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern int XLogPrefetchReconfigureCount;
+
+typedef struct XLogPrefetchState
+{
+	XLogPrefetcher *prefetcher;
+	int			reconfigure_count;
+} XLogPrefetchState;
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchReconfigure(void);
+extern void XLogPrefetchRequestResetStats(void);
+
+extern void XLogPrefetchBegin(XLogPrefetchState *state);
+extern void XLogPrefetchEnd(XLogPrefetchState *state);
+
+/* Functions exposed only for the use of XLogPrefetch(). */
+extern XLogPrefetcher *XLogPrefetcherAllocate(TimeLineID tli,
+											  XLogRecPtr lsn,
+											  bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch,
+									XLogRecPtr replaying_lsn);
+
+/*
+ * Tell the prefetching module that we are now replaying a given LSN, so that
+ * it can decide how far ahead to read in the WAL, if configured.
+ */
+static inline void
+XLogPrefetch(XLogPrefetchState *state,
+			 TimeLineID replaying_tli,
+			 XLogRecPtr replaying_lsn,
+			 bool from_stream)
+{
+	/*
+	 * Handle any configuration changes.  Rather than trying to deal with
+	 * various parameter changes, we just tear down and set up a new
+	 * prefetcher if anything we depend on changes.
+	 */
+	if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
+	{
+		/* If we had a prefetcher, tear it down. */
+		if (state->prefetcher)
+		{
+			XLogPrefetcherFree(state->prefetcher);
+			state->prefetcher = NULL;
+		}
+		/* If we want a prefetcher, set it up. */
+		if (max_recovery_prefetch_distance > 0)
+			state->prefetcher = XLogPrefetcherAllocate(replaying_tli,
+													   replaying_lsn,
+													   from_stream);
+		state->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	if (state->prefetcher)
+		XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
+}
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4bce3ad8de..9f5f0ed4c8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6138,6 +6138,14 @@
   prorettype => 'bool', proargtypes => '',
   prosrc => 'pg_is_wal_replay_paused' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,float4,float4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth,avg_distance,avg_queue_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9988..105c2e77d2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -63,6 +63,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_SLRU,
+	PGSTAT_MTYPE_RECOVERYPREFETCH,
 	PGSTAT_MTYPE_FUNCSTAT,
 	PGSTAT_MTYPE_FUNCPURGE,
 	PGSTAT_MTYPE_RECOVERYCONFLICT,
@@ -183,6 +184,19 @@ typedef struct PgStat_TableXactStatus
 	struct PgStat_TableXactStatus *next;	/* next of same subxact */
 } PgStat_TableXactStatus;
 
+/*
+ * Recovery prefetching statistics persisted on disk by pgstat.c, but kept in
+ * shared memory by xlogprefetch.c.
+ */
+typedef struct PgStat_RecoveryPrefetchStats
+{
+	PgStat_Counter prefetch;
+	PgStat_Counter skip_hit;
+	PgStat_Counter skip_new;
+	PgStat_Counter skip_fpw;
+	PgStat_Counter skip_seq;
+	TimestampTz stat_reset_timestamp;
+} PgStat_RecoveryPrefetchStats;
 
 /* ------------------------------------------------------------
  * Message formats follow
@@ -454,6 +468,16 @@ typedef struct PgStat_MsgSLRU
 	PgStat_Counter m_truncate;
 } PgStat_MsgSLRU;
 
+/* ----------
+ * PgStat_MsgRecoveryPrefetch			Sent by XLogPrefetch to save statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgRecoveryPrefetch
+{
+	PgStat_MsgHdr m_hdr;
+	PgStat_RecoveryPrefetchStats m_stats;
+} PgStat_MsgRecoveryPrefetch;
+
 /* ----------
  * PgStat_MsgRecoveryConflict	Sent by the backend upon recovery conflict
  * ----------
@@ -598,6 +622,7 @@ typedef union PgStat_Msg
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgSLRU msg_slru;
+	PgStat_MsgRecoveryPrefetch msg_recoveryprefetch;
 	PgStat_MsgFuncstat msg_funcstat;
 	PgStat_MsgFuncpurge msg_funcpurge;
 	PgStat_MsgRecoveryConflict msg_recoveryconflict;
@@ -1464,6 +1489,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1479,6 +1505,7 @@ extern int	pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_RecoveryPrefetchStats *pgstat_fetch_recoveryprefetch(void);
 
 extern void pgstat_count_slru_page_zeroed(SlruCtl ctl);
 extern void pgstat_count_slru_page_hit(SlruCtl ctl);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 2819282181..976cf8b116 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -440,4 +440,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetch.c */
+extern void assign_max_recovery_prefetch_distance(int new_value, void *extra);
+extern void assign_recovery_prefetch_fpw(bool new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac31840739..942a07ffee 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1857,6 +1857,17 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.skip_hit,
+    s.skip_new,
+    s.skip_fpw,
+    s.skip_seq,
+    s.distance,
+    s.queue_depth,
+    s.avg_distance,
+    s.avg_queue_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth, avg_distance, avg_queue_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.20.1

#17Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Thomas Munro (#16)
Re: WIP: WAL prefetch (another approach)

On Thu, Apr 09, 2020 at 09:55:25AM +1200, Thomas Munro wrote:
Thanks. Here's a rebase.

Thanks for working on this patch, it seems like a great feature. I'm
probably a bit late to the party, but I still want to make a couple of
comments.

The patch indeed looks good; I couldn't find any significant issues so
far, and almost all the questions I had while reading it were actually
answered in this thread. I'm still busy with benchmarking, mostly to see
how prefetching behaves with different workload distributions and how
much the kernel will actually prefetch.

In the meantime I have a few questions:

On Wed, Feb 12, 2020 at 07:52:42PM +1300, Thomas Munro wrote:

On Fri, Jan 3, 2020 at 7:10 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Could we instead specify the number of blocks to prefetch? We'd probably
need to track additional details needed to determine number of blocks to
prefetch (essentially LSN for all prefetch requests).

Here is a new WIP version of the patch set that does that. Changes:

1. It now uses effective_io_concurrency to control how many
concurrent prefetches to allow. It's possible that we should have a
different GUC to control "maintenance" users of concurrency I/O as
discussed elsewhere[1], but I'm staying out of that for now; if we
agree to do that for VACUUM etc, we can change it easily here. Note
that the value is percolated through the ComputeIoConcurrency()
function which I think we should discuss, but again that's off topic,
I just want to use the standard infrastructure here.

This totally makes sense. I believe the question "how much to prefetch"
ultimately depends equally on the type of workload (which correlates with
how far ahead in the WAL to read) and on how many resources are available
for prefetching (which correlates with queue depth). But in the
documentation it looks like maintenance-io-concurrency is just an
"unimportant" option, and I'm almost sure it will be overlooked by many
readers:

The maximum distance to look ahead in the WAL during recovery, to find
blocks to prefetch. Prefetching blocks that will soon be needed can
reduce I/O wait times. The number of concurrent prefetches is limited
by this setting as well as
<xref linkend="guc-maintenance-io-concurrency"/>. Setting it too high
might be counterproductive, if it means that data falls out of the
kernel cache before it is needed. If this value is specified without
units, it is taken as bytes. A setting of -1 disables prefetching
during recovery.

Maybe it also makes sense to emphasize that maintenance-io-concurrency
directly affects resource consumption and is a "primary control"?

On Wed, Mar 18, 2020 at 06:18:44PM +1300, Thomas Munro wrote:

Here's a new version that changes that part just a bit more, after a
brief chat with Andres about his async I/O plans. It seems clear that
returning an enum isn't very extensible, so I decided to try making
PrefetchBufferResult a struct whose contents can be extended in the
future. In this patch set it's still just used to distinguish 3 cases
(hit, miss, no file), but it's now expressed as a buffer and a flag to
indicate whether I/O was initiated. You could imagine that the second
thing might be replaced by a pointer to an async I/O handle you can
wait on or some other magical thing from the future.

I like the idea of an extensible PrefetchBufferResult. Just one comment:
if I understand correctly, the way it is used together with
prefetch_queue assumes one I/O operation at a time. This limits potential
extension of the underlying code; e.g. one can't implement some sort of
buffering of requests and submission of an iovec to a syscall, because
then prefetch_queue would no longer correctly represent in-flight I/O.
Also, taking into account that "we don't have any awareness of when I/O
really completes", maybe in the future it makes sense to reconsider
having the queue in the prefetcher itself, and instead to ask the
underlying code for this information?
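
To make the concern concrete: the patch's circular buffer stores one LSN per
PrefetchSharedBuffer() call, so each queue slot stands for exactly one
possibly-in-flight I/O. A purely hypothetical sketch (the struct and its
fields are invented here, not part of the patch) of what a queue entry might
have to become if several block references were submitted in one syscall:

/*
 * Hypothetical queue entry for a batched submission, for illustration only.
 * One slot would then account for several prefetched blocks, so queue_depth
 * would have to count blocks rather than slots.
 */
typedef struct XLogPrefetchBatchEntry
{
	XLogRecPtr	lsn;		/* replaying past this LSN "completes" the batch */
	int			nblocks;	/* number of blocks submitted together */
} XLogPrefetchBatchEntry;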

On Wed, Apr 08, 2020 at 04:24:21AM +1200, Thomas Munro wrote:

Is there a way we could have a "historical" version of at least some of
these? An average queue depth, or such?

Ok, I added simple online averages for distance and queue depth that
take a sample every time recovery advances by 256kB.

Maybe this was discussed in other threads in the past. But if I
understand correctly, this implementation weights all samples equally.
Since at the moment it depends directly on replay speed (so a lot of I/O
is involved), couldn't a single outlier at the beginning skew this value
and make it less useful? Does it make sense to decay old values?
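
For illustration only, here is a minimal sketch (not taken from the patch;
EWMA_ALPHA is an invented constant) contrasting the cumulative running
average the patch computes with an exponentially weighted moving average
that would let an early outlier decay:

#include <stdint.h>

/*
 * Cumulative running average, as in XLogPrefetcherScanRecords(): every
 * sample carries equal weight, so an early outlier never fades away.
 */
static double
running_avg(double avg, double sample, uint64_t nsamples)
{
	if (nsamples == 1)
		return sample;
	return avg + (sample - avg) / nsamples;
}

/*
 * Hypothetical alternative: exponentially weighted moving average, where
 * EWMA_ALPHA controls how quickly old samples stop influencing the result.
 */
#define EWMA_ALPHA 0.05

static double
ewma(double avg, double sample, uint64_t nsamples)
{
	if (nsamples == 1)
		return sample;
	return avg + EWMA_ALPHA * (sample - avg);
}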

#18Thomas Munro
thomas.munro@gmail.com
In reply to: Dmitry Dolgov (#17)
Re: WIP: WAL prefetch (another approach)

On Sun, Apr 19, 2020 at 11:46 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Thanks for working on this patch, it seems like a great feature. I'm
probably a bit late to the party, but still want to make couple of
commentaries.

Hi Dmitry,

Thanks for your feedback and your interest in this work!

The patch indeed looks good, I couldn't find any significant issues so
far and almost all my questions I had while reading it were actually
answered in this thread. I'm still busy with benchmarking, mostly to see
how prefetching would work with different workload distributions and how
much the kernel will actually prefetch.

Cool.

One report I heard recently said that if you get rid of I/O stalls,
pread() becomes cheap enough that the much higher frequency lseek()
calls I've complained about elsewhere[1]/messages/by-id/CA+hUKG+NPZeEdLXAcNr+w0YOZVb0Un0_MwTBpgmmVDh7No2jbg@mail.gmail.com become the main thing
recovery is doing, at least on some systems, but I haven't pieced
together the conditions required yet. I'd be interested to know if
you see that.

In the meantime I have a few questions:

1. It now uses effective_io_concurrency to control how many
concurrent prefetches to allow. It's possible that we should have a
different GUC to control "maintenance" users of concurrency I/O as
discussed elsewhere[1], but I'm staying out of that for now; if we
agree to do that for VACUUM etc, we can change it easily here. Note
that the value is percolated through the ComputeIoConcurrency()
function which I think we should discuss, but again that's off topic,
I just want to use the standard infrastructure here.

This totally makes sense, I believe the question "how much to prefetch"
eventually depends equally on a type of workload (correlates with how
far in WAL to read) and how much resources are available for prefetching
(correlates with queue depth). But in the documentation it looks like
maintenance-io-concurrency is just an "unimportant" option, and I'm
almost sure will be overlooked by many readers:

The maximum distance to look ahead in the WAL during recovery, to find
blocks to prefetch. Prefetching blocks that will soon be needed can
reduce I/O wait times. The number of concurrent prefetches is limited
by this setting as well as
<xref linkend="guc-maintenance-io-concurrency"/>. Setting it too high
might be counterproductive, if it means that data falls out of the
kernel cache before it is needed. If this value is specified without
units, it is taken as bytes. A setting of -1 disables prefetching
during recovery.

Maybe it makes also sense to emphasize that maintenance-io-concurrency
directly affects resource consumption and it's a "primary control"?

You're right. I will add something in the next version to emphasise that.

On Wed, Mar 18, 2020 at 06:18:44PM +1300, Thomas Munro wrote:

Here's a new version that changes that part just a bit more, after a
brief chat with Andres about his async I/O plans. It seems clear that
returning an enum isn't very extensible, so I decided to try making
PrefetchBufferResult a struct whose contents can be extended in the
future. In this patch set it's still just used to distinguish 3 cases
(hit, miss, no file), but it's now expressed as a buffer and a flag to
indicate whether I/O was initiated. You could imagine that the second
thing might be replaced by a pointer to an async I/O handle you can
wait on or some other magical thing from the future.

I like the idea of an extensible PrefetchBufferResult. Just one comment:
if I understand correctly, the way it is used together with prefetch_queue
assumes one IO operation at a time. This limits potential extension of the
underlying code; e.g. one can't implement some sort of buffering of requests
and submit an iovec to a syscall, because then prefetch_queue would no longer
correctly represent in-flight IO. Also, taking into account that "we don't
have any awareness of when I/O really completes", maybe in the future it
makes sense to reconsider having the queue in the prefetcher itself and
rather ask for this information from the underlying code?

Yeah, you're right that it'd be good to be able to do some kind of
batching up of these requests to reduce system calls. Of course
posix_fadvise() doesn't support that, but clearly in the AIO future[2]https://anarazel.de/talks/2020-01-31-fosdem-aio/aio.pdf
it would indeed make sense to buffer up a few of these and then make a
single call to io_uring_enter() on Linux[3]https://kernel.dk/io_uring.pdf or lio_listio() on a
hypothetical POSIX AIO implementation[4]https://pubs.opengroup.org/onlinepubs/009695399/functions/lio_listio.html. (I'm not sure if there is a
thing like that on Windows; at a glance, ReadFileScatter() is
asynchronous ("overlapped") but works only on a single handle so it's
like a hypothetical POSIX aio_readv(), not like POSIX lio_listio()).

Perhaps there could be an extra call PrefetchBufferSubmit() that you'd
call at appropriate times, but you obviously can't call it too
infrequently.
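
To make that concrete, here is a minimal sketch of what a gather/submit
style of interface might look like from the caller's side.  It is purely
hypothetical (PrefetchBufferGather() and PrefetchBufferSubmit() are made-up
names, not part of this patch set), and with posix_fadvise() the submit step
still degenerates into one syscall per request; the point is only that call
sites wouldn't have to change if the submit step later became a single
io_uring_enter() or lio_listio() call:

/* Hypothetical sketch only; not part of the patches in this thread. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_BATCH 16

typedef struct PrefetchRequest
{
	int			fd;
	off_t		offset;
	size_t		length;
} PrefetchRequest;

static PrefetchRequest pending[MAX_BATCH];
static int	npending = 0;

static void PrefetchBufferSubmit(void);

/* Queue one request; flush first if the batch is already full. */
static void
PrefetchBufferGather(int fd, off_t offset, size_t length)
{
	if (npending == MAX_BATCH)
		PrefetchBufferSubmit();
	pending[npending].fd = fd;
	pending[npending].offset = offset;
	pending[npending].length = length;
	npending++;
}

/* Issue all queued requests.  Today: one posix_fadvise() per request. */
static void
PrefetchBufferSubmit(void)
{
	for (int i = 0; i < npending; i++)
		(void) posix_fadvise(pending[i].fd,
							 pending[i].offset,
							 pending[i].length,
							 POSIX_FADV_WILLNEED);
	npending = 0;
}

int
main(void)
{
	int			fd = open("some_relation_file", O_RDONLY);

	if (fd < 0)
		return 1;
	PrefetchBufferGather(fd, 0, 8192);
	PrefetchBufferGather(fd, 10 * 8192, 8192);
	PrefetchBufferSubmit();		/* one call site, however many I/Os */
	close(fd);
	return 0;
}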

As for how to make the prefetch queue a reusable component, rather
than having a custom thing like that for each part of our system that
wants to support prefetching: that's a really good question. I didn't
see how to do it, but maybe I didn't try hard enough. I looked at the
three users I'm aware of, namely this patch, a btree prefetching patch
I haven't shared yet, and the existing bitmap heap scan code, and they
all needed to have their own custom bookkeeping for this, and I
couldn't figure out how to share more infrastructure. In the case of
this patch, you currently need to do LSN-based bookkeeping to
simulate "completion", and that doesn't make sense for other users.
Maybe it'll become clearer when we have support for completion
notification?
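
For anyone following along, the LSN-based bookkeeping mentioned above works
roughly like the simplified, self-contained sketch below of the 0003 patch's
prefetch_queue (the real code lives in xlogprefetch.c and differs in the
details): each initiated prefetch records the LSN of the record that will
eventually read the block, and entries are retired as replay advances past
them, which is how "completion" is approximated.

/* Simplified sketch of the prefetch_queue idea; not the actual patch code. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

#define QUEUE_SIZE 10			/* roughly maintenance_io_concurrency */

static XLogRecPtr queue[QUEUE_SIZE];
static int	head = 0;
static int	tail = 0;
static int	depth = 0;

/* Record that a prefetch was initiated for a block referenced at 'lsn'. */
static void
initiated_io(XLogRecPtr lsn)
{
	assert(depth < QUEUE_SIZE);
	queue[head] = lsn;
	head = (head + 1) % QUEUE_SIZE;
	depth++;
}

/*
 * As recovery replays up to 'replaying_lsn', assume that prefetches issued
 * for records at or before that point have completed, because the redo
 * routine's ReadBuffer() will have waited for them.
 */
static void
completed_io(XLogRecPtr replaying_lsn)
{
	while (depth > 0 && queue[tail] <= replaying_lsn)
	{
		tail = (tail + 1) % QUEUE_SIZE;
		depth--;
	}
}

/* When saturated, stop prefetching until replay catches up. */
static int
saturated(void)
{
	return depth == QUEUE_SIZE;
}

int
main(void)
{
	initiated_io(1000);
	initiated_io(1042);
	printf("depth=%d saturated=%d\n", depth, saturated());
	completed_io(1000);
	printf("depth=%d\n", depth);
	return 0;
}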

Some related questions are why all these parts of our system that know
how to prefetch are allowed to do so independently without any kind of
shared accounting, and why we don't give each tablespace (= our model
of a device?) its own separate queue. I think it's OK to put these
questions off a bit longer until we have more infrastructure and
experience. Our current non-answer is at least consistent with our
lack of an approach to system-wide memory and CPU accounting... I
personally think that a better XLogReader that can be used for
prefetching AND recovery would be a higher priority than that.

On Wed, Apr 08, 2020 at 04:24:21AM +1200, Thomas Munro wrote:

Is there a way we could have a "historical" version of at least some of
these? An average queue depth, or such?

Ok, I added simple online averages for distance and queue depth that
take a sample every time recovery advances by 256kB.

Maybe this was discussed in the past in other threads. But if I understand
correctly, this implementation weights all the samples equally. Since at the
moment it depends directly on replay speed (so a lot of IO involved),
couldn't a single outlier at the beginning skew this value and make it less
useful? Does it make sense to decay old values?

Hmm.

I wondered about reporting one or perhaps three exponential moving
averages (like Unix 1/5/15 minute load averages), but I didn't propose
it because: (1) in crash recovery, you can't query it, you just get
the log message at the end, and an unweighted mean seems OK in that case,
no? (you are not more interested in the I/O saturation at the end of
recovery than at the start of recovery, are you?), and (2) on a
streaming replica, if you want to sample the instantaneous depth and
compute an exponential moving average or some more exotic statistical
concoction in your monitoring tool, you're free to do so. I suppose
(2) is an argument for removing the existing average completely from
the stat view; I put it in there at Andres's suggestion, but I'm not
sure I really believe in it. Where is our average replication lag,
and why don't we compute the stddev of X, Y or Z? I think we should
provide primary measurements and let people compute derived statistics
from those.
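
Just to illustrate the difference being discussed (this is not from the
patch), an unweighted online mean gives every sample the same weight
forever, while an exponentially decaying mean forgets old samples at a
configurable rate:

/* Minimal sketch, not from the patch: unweighted mean vs. decaying mean. */
#include <stdio.h>

typedef struct DepthAverage
{
	double		mean;			/* unweighted online mean of all samples */
	double		ema;			/* exponential moving average */
	double		alpha;			/* weight of the newest sample, e.g. 0.05 */
	long		n;
} DepthAverage;

static void
sample(DepthAverage *a, double queue_depth)
{
	a->n++;
	/* every sample counts equally, so early outliers never fade */
	a->mean += (queue_depth - a->mean) / a->n;
	/* old samples decay geometrically */
	a->ema = a->alpha * queue_depth + (1.0 - a->alpha) * a->ema;
}

int
main(void)
{
	DepthAverage a = {0.0, 0.0, 0.05, 0};

	/* a burst of deep queues at the start, then mostly shallow ones */
	for (int i = 0; i < 10; i++)
		sample(&a, 50.0);
	for (int i = 0; i < 1000; i++)
		sample(&a, 2.0);
	printf("mean=%.2f ema=%.2f\n", a.mean, a.ema);
	return 0;
}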

I suppose the reason for this request was the analogy with Linux
iostat -x's "aqu-sz", which is the primary way that people understand
device queue depth on that OS. This number is actually computed by
iostat, not the kernel, so by analogy I could argue that a
hypothetical pg_iostat program could compute that for you from raw
ingredients.  AFAIK iostat computes the *unweighted* average queue
depth during the time between output lines, by observing changes in
the "aveq" ("the sum of how long all requests have spent in flight, in
milliseconds") and "use" ("how many milliseconds there has been at
least one IO in flight") fields of /proc/diskstats.  But it's OK that
it's unweighted, because it computes a new value for every line it
outputs (i.e. every 5 seconds or whatever you asked for).  It's not too
clear how to do something like that here, but all suggestions are
welcome.
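
As a concrete illustration of that calculation (my reading of iostat's
behaviour, so treat the details as an assumption rather than gospel), the
average queue depth over an interval falls out of two snapshots of that
weighted-time counter:

#include <stdio.h>

/*
 * Sketch: average queue depth over an interval is the increase in the
 * "aveq" counter (total milliseconds spent in flight, summed over all
 * requests) divided by the length of the interval in milliseconds.
 */
static double
average_queue_depth(unsigned long long aveq_before_ms,
					unsigned long long aveq_after_ms,
					double interval_ms)
{
	return (double) (aveq_after_ms - aveq_before_ms) / interval_ms;
}

int
main(void)
{
	/* 40000 ms of summed in-flight time accrued over a 5 second interval */
	printf("aqu-sz ~= %.1f\n", average_queue_depth(100000, 140000, 5000.0));
	return 0;
}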

Or maybe we'll have something more general that makes this more
specific thing irrelevant, in future AIO infrastructure work.

On a more superficial note, one thing I don't like about the last
version of the patch is the difference in the ordering of the words in
the GUC recovery_prefetch_distance and the view
pg_stat_prefetch_recovery. Hrmph.

[1]: /messages/by-id/CA+hUKG+NPZeEdLXAcNr+w0YOZVb0Un0_MwTBpgmmVDh7No2jbg@mail.gmail.com
[2]: https://anarazel.de/talks/2020-01-31-fosdem-aio/aio.pdf
[3]: https://kernel.dk/io_uring.pdf
[4]: https://pubs.opengroup.org/onlinepubs/009695399/functions/lio_listio.html

#19Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Thomas Munro (#18)
Re: WIP: WAL prefetch (another approach)

On Tue, Apr 21, 2020 at 05:26:52PM +1200, Thomas Munro wrote:

One report I heard recently said that if you get rid of I/O stalls,
pread() becomes cheap enough that the much higher frequency lseek()
calls I've complained about elsewhere[1] become the main thing
recovery is doing, at least on some systems, but I haven't pieced
together the conditions required yet. I'd be interested to know if
you see that.

At the moment I've performed a couple of tests for replication in the case
when almost everything is in memory (mostly by mistake; I was expecting
that a postgres replica within a badly memory-limited cgroup would cause
more IO, but it looks like the kernel does not evict the pages anyway). Not
sure if that's what you mean by getting rid of IO stalls, but in these tests
profiling shows lseek & pread appearing in a similar number of samples.

If I understand correctly, eventually one can measure the influence of
prefetching by looking at redo function execution times (assuming that the
data they operate on is already prefetched, they should be faster). I still
have to clarify what the exact reason is, but even in the situation described
above (in memory) there is some visible difference, e.g.

# with prefetch
Function = b'heap2_redo' [8064]
nsecs : count distribution
4096 -> 8191 : 1213 | |
8192 -> 16383 : 66639 |****************************************|
16384 -> 32767 : 27846 |**************** |
32768 -> 65535 : 873 | |

# without prefetch
Function = b'heap2_redo' [17980]
nsecs : count distribution
4096 -> 8191 : 1 | |
8192 -> 16383 : 66997 |****************************************|
16384 -> 32767 : 30966 |****************** |
32768 -> 65535 : 1602 | |

# with prefetch
Function = b'btree_redo' [8064]
nsecs : count distribution
2048 -> 4095 : 0 | |
4096 -> 8191 : 246 |****************************************|
8192 -> 16383 : 5 | |
16384 -> 32767 : 2 | |

# without prefetch
Function = b'btree_redo' [17980]
nsecs : count distribution
2048 -> 4095 : 0 | |
4096 -> 8191 : 82 |******************** |
8192 -> 16383 : 19 |**** |
16384 -> 32767 : 160 |****************************************|

Of course it doesn't take into account the time we spend doing extra
syscalls for prefetching, but it can still give some interesting
information.

I like the idea of an extensible PrefetchBufferResult. Just one comment:
if I understand correctly, the way it is used together with prefetch_queue
assumes one IO operation at a time. This limits potential extension of the
underlying code; e.g. one can't implement some sort of buffering of requests
and submit an iovec to a syscall, because then prefetch_queue would no longer
correctly represent in-flight IO. Also, taking into account that "we don't
have any awareness of when I/O really completes", maybe in the future it
makes sense to reconsider having the queue in the prefetcher itself and
rather ask for this information from the underlying code?

Yeah, you're right that it'd be good to be able to do some kind of
batching up of these requests to reduce system calls. Of course
posix_fadvise() doesn't support that, but clearly in the AIO future[2]
it would indeed make sense to buffer up a few of these and then make a
single call to io_uring_enter() on Linux[3] or lio_listio() on a
hypothetical POSIX AIO implementation[4]. (I'm not sure if there is a
thing like that on Windows; at a glance, ReadFileScatter() is
asynchronous ("overlapped") but works only on a single handle so it's
like a hypothetical POSIX aio_readv(), not like POSIX lio_listio()).

Perhaps there could be an extra call PrefetchBufferSubmit() that you'd
call at appropriate times, but you obviously can't call it too
infrequently.

As for how to make the prefetch queue a reusable component, rather
than having a custom thing like that for each part of our system that
wants to support prefetching: that's a really good question. I didn't
see how to do it, but maybe I didn't try hard enough. I looked at the
three users I'm aware of, namely this patch, a btree prefetching patch
I haven't shared yet, and the existing bitmap heap scan code, and they
all needed to have their own custom bookkeeping for this, and I
couldn't figure out how to share more infrastructure. In the case of
this patch, you currently need to do LSN-based bookkeeping to
simulate "completion", and that doesn't make sense for other users.
Maybe it'll become clearer when we have support for completion
notification?

Yes, definitely.

Some related questions are why all these parts of our system that know
how to prefetch are allowed to do so independently without any kind of
shared accounting, and why we don't give each tablespace (= our model
of a device?) its own separate queue. I think it's OK to put these
questions off a bit longer until we have more infrastructure and
experience. Our current non-answer is at least consistent with our
lack of an approach to system-wide memory and CPU accounting... I
personally think that a better XLogReader that can be used for
prefetching AND recovery would be a higher priority than that.

Sure, this patch is quite valuable as it is, and the questions I've
mentioned mostly target future development.

Maybe this was discussed in the past in other threads. But if I understand
correctly, this implementation weights all the samples equally. Since at the
moment it depends directly on replay speed (so a lot of IO involved),
couldn't a single outlier at the beginning skew this value and make it less
useful? Does it make sense to decay old values?

Hmm.

I wondered about reporting one or perhaps three exponential moving
averages (like Unix 1/5/15 minute load averages), but I didn't propose
it because: (1) in crash recovery, you can't query it, you just get
the log message at the end, and an unweighted mean seems OK in that case,
no? (you are not more interested in the I/O saturation at the end of
recovery than at the start of recovery, are you?), and (2) on a
streaming replica, if you want to sample the instantaneous depth and
compute an exponential moving average or some more exotic statistical
concoction in your monitoring tool, you're free to do so. I suppose
(2) is an argument for removing the existing average completely from
the stat view; I put it in there at Andres's suggestion, but I'm not
sure I really believe in it. Where is our average replication lag,
and why don't we compute the stddev of X, Y or Z? I think we should
provide primary measurements and let people compute derived statistics
from those.

For once I disagree, since I believe this very approach, widely applied,
leads to a slightly chaotic situation with monitoring. But of course
you're right, it has nothing to do with the patch itself. I would also
be in favour of removing the existing averages, unless Andres has more
arguments for keeping them.

#20Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Dmitry Dolgov (#19)
Re: WIP: WAL prefetch (another approach)

On Sat, Apr 25, 2020 at 09:19:35PM +0200, Dmitry Dolgov wrote:

On Tue, Apr 21, 2020 at 05:26:52PM +1200, Thomas Munro wrote:

One report I heard recently said that if you get rid of I/O stalls,
pread() becomes cheap enough that the much higher frequency lseek()
calls I've complained about elsewhere[1] become the main thing
recovery is doing, at least on some systems, but I haven't pieced
together the conditions required yet. I'd be interested to know if
you see that.

At the moment I've performed a couple of tests for replication in the case
when almost everything is in memory (mostly by mistake; I was expecting
that a postgres replica within a badly memory-limited cgroup would cause
more IO, but it looks like the kernel does not evict the pages anyway). Not
sure if that's what you mean by getting rid of IO stalls, but in these tests
profiling shows lseek & pread appearing in a similar number of samples.

If I understand correctly, eventually one can measure the influence of
prefetching by looking at redo function execution times (assuming that the
data they operate on is already prefetched, they should be faster). I still
have to clarify what the exact reason is, but even in the situation described
above (in memory) there is some visible difference, e.g.

I've finally performed a couple of tests involving more IO. A
not-that-big dataset of 1.5 GB for the replica, with enough memory to fit
~1/6 of it, default prefetching parameters and an update workload with
uniform distribution. Rather a small setup, but it causes stable reading
into the page cache on the replica and makes the influence of the patch
visible (more measurement samples tend to happen at lower latencies):

# with patch
Function = b'heap_redo' [206]
nsecs : count distribution
1024 -> 2047 : 0 | |
2048 -> 4095 : 32833 |********************** |
4096 -> 8191 : 59476 |****************************************|
8192 -> 16383 : 18617 |************ |
16384 -> 32767 : 3992 |** |
32768 -> 65535 : 425 | |
65536 -> 131071 : 5 | |
131072 -> 262143 : 326 | |
262144 -> 524287 : 6 | |

# without patch
Function = b'heap_redo' [130]
nsecs : count distribution
1024 -> 2047 : 0 | |
2048 -> 4095 : 20062 |*********** |
4096 -> 8191 : 70662 |****************************************|
8192 -> 16383 : 12895 |******* |
16384 -> 32767 : 9123 |***** |
32768 -> 65535 : 560 | |
65536 -> 131071 : 1 | |
131072 -> 262143 : 460 | |
262144 -> 524287 : 3 | |

Not that there were any doubts, but at the same time it was surprising to
me how well Linux readahead works in this situation. The results above are
shown with readahead disabled for the filesystem and the device; without
that there was almost no difference, since a lot of IO was avoided by
readahead (which in fact accounted for the majority of all reads):

# with patch
flags = Read
usecs : count distribution
16 -> 31 : 0 | |
32 -> 63 : 1 |******** |
64 -> 127 : 5 |****************************************|

flags = ReadAhead-Read
usecs : count distribution
32 -> 63 : 0 | |
64 -> 127 : 131 |****************************************|
128 -> 255 : 12 |*** |
256 -> 511 : 6 |* |

# without patch
flags = Read
usecs : count distribution
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 4 |****************************************|

flags = ReadAhead-Read
usecs : count distribution
32 -> 63 : 0 | |
64 -> 127 : 143 |****************************************|
128 -> 255 : 20 |***** |

The numbers of reads in this case were similar with and without the patch,
which means the difference can't be attributed to pages being read too
early, evicted, and then read again later.

#21Thomas Munro
thomas.munro@gmail.com
In reply to: Dmitry Dolgov (#20)
3 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Sun, May 3, 2020 at 3:12 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

I've finally performed a couple of tests involving more IO. A
not-that-big dataset of 1.5 GB for the replica, with enough memory to fit
~1/6 of it, default prefetching parameters and an update workload with
uniform distribution. Rather a small setup, but it causes stable reading
into the page cache on the replica and makes the influence of the patch
visible (more measurement samples tend to happen at lower latencies):

Thanks for these tests Dmitry. You didn't mention the details of the
workload, but one thing I'd recommend for a uniform/random workload
that's generating a lot of misses on the primary server using N
backends is to make sure that maintenance_io_concurrency is set to a
number like N*2 or higher, and to look at the queue depth on both
systems with iostat -x 1. Then you can experiment with ALTER SYSTEM
SET maintenance_io_concurrency = X; SELECT pg_reload_conf(); to try to
understand the way it works; there is a point where you've set it high
enough and the replica is able to handle the same rate of concurrent
I/Os as the primary. The default of 10 is actually pretty low unless
you've only got ~4 backends generating random updates on the primary.
That's with full_page_writes=off; if you leave it on, it takes a while
to get into a scenario where it has much effect.

Here's a rebase, after the recent XLogReader refactoring.

Attachments:

v9-0001-Add-pg_atomic_unlocked_add_fetch_XXX.patch (text/x-patch)
From a7fd3f728d64c3c94387e9e424dba507b166bcab Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 28 Mar 2020 11:42:59 +1300
Subject: [PATCH v9 1/3] Add pg_atomic_unlocked_add_fetch_XXX().

Add a variant of pg_atomic_add_fetch_XXX with no barrier semantics, for
cases where you only want to avoid the possibility that a concurrent
pg_atomic_read_XXX() sees a torn/partial value.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/include/port/atomics.h         | 24 ++++++++++++++++++++++
 src/include/port/atomics/generic.h | 33 ++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 4956ec55cb..2abb852893 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -389,6 +389,21 @@ pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
 	return pg_atomic_add_fetch_u32_impl(ptr, add_);
 }
 
+/*
+ * pg_atomic_unlocked_add_fetch_u32 - atomically add to variable
+ *
+ * Like pg_atomic_unlocked_write_u32, guarantees only that partial values
+ * cannot be observed.
+ *
+ * No barrier semantics.
+ */
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	AssertPointerAlignment(ptr, 4);
+	return pg_atomic_unlocked_add_fetch_u32_impl(ptr, add_);
+}
+
 /*
  * pg_atomic_sub_fetch_u32 - atomically subtract from variable
  *
@@ -519,6 +534,15 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
 }
 
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+	return pg_atomic_unlocked_add_fetch_u64_impl(ptr, add_);
+}
+
 #undef INSIDE_ATOMICS_H
 
 #endif							/* ATOMICS_H */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index d3ba89a58f..1683653ca6 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -234,6 +234,16 @@ pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 }
 #endif
 
+#if !defined(PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32)
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	ptr->value += add_;
+	return ptr->value;
+}
+#endif
+
 #if !defined(PG_HAVE_ATOMIC_SUB_FETCH_U32) && defined(PG_HAVE_ATOMIC_FETCH_SUB_U32)
 #define PG_HAVE_ATOMIC_SUB_FETCH_U32
 static inline uint32
@@ -399,3 +409,26 @@ pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
 }
 #endif
+
+#if defined(PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY) && \
+	!defined(PG_HAVE_ATOMIC_U64_SIMULATION)
+
+#ifndef PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	ptr->value += val;
+	return ptr->value;
+}
+#endif
+
+#else
+
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	return pg_atomic_add_fetch_u64_impl(ptr, val);
+}
+
+#endif
-- 
2.20.1

v9-0002-Allow-XLogReadRecord-to-be-non-blocking.patch (text/x-patch)
From 6ed95fffba6751ddc9607659183c072cb11fa4a8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 7 Apr 2020 22:56:27 +1200
Subject: [PATCH v9 2/3] Allow XLogReadRecord() to be non-blocking.

Extend read_local_xlog_page() to support non-blocking modes:

1. Reading as far as the WAL receiver has written so far.
2. Reading all the way to the end, when the end LSN is unknown.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/access/transam/xlogreader.c |  37 ++++--
 src/backend/access/transam/xlogutils.c  | 149 +++++++++++++++++-------
 src/backend/replication/walsender.c     |   2 +-
 src/include/access/xlogreader.h         |  14 ++-
 src/include/access/xlogutils.h          |  26 +++++
 5 files changed, 173 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5995798b58..897efaf682 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -259,6 +259,9 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  * If the reading fails for some other reason, NULL is also returned, and
  * *errormsg is set to a string with details of the failure.
  *
+ * If the read_page callback is one that returns XLOGPAGEREAD_WOULDBLOCK rather
+ * than waiting for WAL to arrive, NULL is also returned in that case.
+ *
  * The returned pointer (or *errormsg) points to an internal buffer that's
  * valid until the next call to XLogReadRecord.
  */
@@ -548,10 +551,11 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 err:
 
 	/*
-	 * Invalidate the read state. We might read from a different source after
-	 * failure.
+	 * Invalidate the read state, if this was an error. We might read from a
+	 * different source after failure.
 	 */
-	XLogReaderInvalReadState(state);
+	if (readOff != XLOGPAGEREAD_WOULDBLOCK)
+		XLogReaderInvalReadState(state);
 
 	if (state->errormsg_buf[0] != '\0')
 		*errormsg = state->errormsg_buf;
@@ -563,8 +567,9 @@ err:
  * Read a single xlog page including at least [pageptr, reqLen] of valid data
  * via the page_read() callback.
  *
- * Returns -1 if the required page cannot be read for some reason; errormsg_buf
- * is set in that case (unless the error occurs in the page_read callback).
+ * Returns XLOGPAGEREAD_ERROR or XLOGPAGEREAD_WOULDBLOCK if the required page
+ * cannot be read for some reason; errormsg_buf is set in the former case
+ * (unless the error occurs in the page_read callback).
  *
  * We fetch the page from a reader-local cache if we know we have the required
  * data and if there hasn't been any error since caching the data.
@@ -661,8 +666,11 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
+	if (readLen == XLOGPAGEREAD_WOULDBLOCK)
+		return XLOGPAGEREAD_WOULDBLOCK;
+
 	XLogReaderInvalReadState(state);
-	return -1;
+	return XLOGPAGEREAD_ERROR;
 }
 
 /*
@@ -941,6 +949,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	XLogRecPtr	found = InvalidXLogRecPtr;
 	XLogPageHeader header;
 	char	   *errormsg;
+	int			readLen;
 
 	Assert(!XLogRecPtrIsInvalid(RecPtr));
 
@@ -954,7 +963,6 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		XLogRecPtr	targetPagePtr;
 		int			targetRecOff;
 		uint32		pageHeaderSize;
-		int			readLen;
 
 		/*
 		 * Compute targetRecOff. It should typically be equal or greater than
@@ -1035,7 +1043,8 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	}
 
 err:
-	XLogReaderInvalReadState(state);
+	if (readLen != XLOGPAGEREAD_WOULDBLOCK)
+		XLogReaderInvalReadState(state);
 
 	return InvalidXLogRecPtr;
 }
@@ -1094,8 +1103,16 @@ WALRead(XLogReaderState *state,
 			XLByteToSeg(recptr, nextSegNo, state->segcxt.ws_segsize);
 			state->routine.segment_open(state, nextSegNo, &tli);
 
-			/* This shouldn't happen -- indicates a bug in segment_open */
-			Assert(state->seg.ws_file >= 0);
+			/* callback reported that there was no such file */
+			if (state->seg.ws_file < 0)
+			{
+				errinfo->wre_errno = errno;
+				errinfo->wre_req = 0;
+				errinfo->wre_read = 0;
+				errinfo->wre_off = startoff;
+				errinfo->wre_seg = state->seg;
+				return false;
+			}
 
 			/* Update the current segment info. */
 			state->seg.ws_tli = tli;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 322b0e8ff5..18aa499831 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
@@ -808,6 +809,29 @@ wal_segment_open(XLogReaderState *state, XLogSegNo nextSegNo,
 						path)));
 }
 
+/*
+ * XLogReaderRoutine->segment_open callback that reports missing files rather
+ * than raising an error.
+ */
+void
+wal_segment_try_open(XLogReaderState *state, XLogSegNo nextSegNo,
+					 TimeLineID *tli_p)
+{
+	TimeLineID	tli = *tli_p;
+	char		path[MAXPGPATH];
+
+	XLogFilePath(path, tli, nextSegNo, state->segcxt.ws_segsize);
+	state->seg.ws_file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+	if (state->seg.ws_file >= 0)
+		return;
+
+	if (errno != ENOENT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
 /* stock XLogReaderRoutine->segment_close callback */
 void
 wal_segment_close(XLogReaderState *state)
@@ -823,6 +847,10 @@ wal_segment_close(XLogReaderState *state)
  * Public because it would likely be very helpful for someone writing another
  * output method outside walsender, e.g. in a bgworker.
  *
+ * A pointer to an XLogReadLocalOptions struct may be passed in as
+ * XLogReaderRoutine->page_read_private to control the behavior of this
+ * function.
+ *
  * TODO: The walsender has its own version of this, but it relies on the
  * walsender's latch being set whenever WAL is flushed. No such infrastructure
  * exists for normal backends, so we have to do a check/sleep/repeat style of
@@ -837,58 +865,89 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	TimeLineID	tli;
 	int			count;
 	WALReadError errinfo;
+	XLogReadLocalOptions *options =
+		(XLogReadLocalOptions *) state->routine.page_read_private;
 
 	loc = targetPagePtr + reqLen;
 
 	/* Loop waiting for xlog to be available if necessary */
 	while (1)
 	{
-		/*
-		 * Determine the limit of xlog we can currently read to, and what the
-		 * most recent timeline is.
-		 *
-		 * RecoveryInProgress() will update ThisTimeLineID when it first
-		 * notices recovery finishes, so we only have to maintain it for the
-		 * local process until recovery ends.
-		 */
-		if (!RecoveryInProgress())
-			read_upto = GetFlushRecPtr();
-		else
-			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
-		tli = ThisTimeLineID;
+		switch (options ? options->read_upto_policy : -1)
+		{
+		case XLRO_WALRCV_WRITTEN:
+			/*
+			 * We'll try to read as far as has been written by the WAL
+			 * receiver, on the requested timeline.  When we run out of valid
+			 * data, we'll return an error.  This is used by xlogprefetch.c
+			 * while streaming.
+			 */
+			read_upto = GetWalRcvWriteRecPtr();
+			state->currTLI = tli = options->tli;
+			break;
 
-		/*
-		 * Check which timeline to get the record from.
-		 *
-		 * We have to do it each time through the loop because if we're in
-		 * recovery as a cascading standby, the current timeline might've
-		 * become historical. We can't rely on RecoveryInProgress() because in
-		 * a standby configuration like
-		 *
-		 * A => B => C
-		 *
-		 * if we're a logical decoding session on C, and B gets promoted, our
-		 * timeline will change while we remain in recovery.
-		 *
-		 * We can't just keep reading from the old timeline as the last WAL
-		 * archive in the timeline will get renamed to .partial by
-		 * StartupXLOG().
-		 *
-		 * If that happens after our caller updated ThisTimeLineID but before
-		 * we actually read the xlog page, we might still try to read from the
-		 * old (now renamed) segment and fail. There's not much we can do
-		 * about this, but it can only happen when we're a leaf of a cascading
-		 * standby whose master gets promoted while we're decoding, so a
-		 * one-off ERROR isn't too bad.
-		 */
-		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+		case XLRO_END:
+			/*
+			 * We'll try to read as far as we can on one timeline.  This is
+			 * used by xlogprefetch.c for crash recovery.
+			 */
+			read_upto = (XLogRecPtr) -1;
+			state->currTLI = tli = options->tli;
+			break;
+
+		default:
+			/*
+			 * Determine the limit of xlog we can currently read to, and what the
+			 * most recent timeline is.
+			 *
+			 * RecoveryInProgress() will update ThisTimeLineID when it first
+			 * notices recovery finishes, so we only have to maintain it for
+			 * the local process until recovery ends.
+			 */
+			if (!RecoveryInProgress())
+				read_upto = GetFlushRecPtr();
+			else
+				read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
+			tli = ThisTimeLineID;
+
+			/*
+			 * Check which timeline to get the record from.
+			 *
+			 * We have to do it each time through the loop because if we're in
+			 * recovery as a cascading standby, the current timeline might've
+			 * become historical. We can't rely on RecoveryInProgress()
+			 * because in a standby configuration like
+			 *
+			 * A => B => C
+			 *
+			 * if we're a logical decoding session on C, and B gets promoted,
+			 * our timeline will change while we remain in recovery.
+			 *
+			 * We can't just keep reading from the old timeline as the last
+			 * WAL archive in the timeline will get renamed to .partial by
+			 * StartupXLOG().
+			 *
+			 * If that happens after our caller updated ThisTimeLineID but
+			 * before we actually read the xlog page, we might still try to
+			 * read from the old (now renamed) segment and fail. There's not
+			 * much we can do about this, but it can only happen when we're a
+			 * leaf of a cascading standby whose master gets promoted while
+			 * we're decoding, so a one-off ERROR isn't too bad.
+			 */
+			XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+			break;
+		}
 
-		if (state->currTLI == ThisTimeLineID)
+		if (state->currTLI == tli)
 		{
 
 			if (loc <= read_upto)
 				break;
 
+			/* not enough data there, but we were asked not to wait */
+			if (options && options->nowait)
+				return XLOGPAGEREAD_WOULDBLOCK;
+
 			CHECK_FOR_INTERRUPTS();
 			pg_usleep(1000L);
 		}
@@ -930,7 +989,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	else if (targetPagePtr + reqLen > read_upto)
 	{
 		/* not enough data there */
-		return -1;
+		return XLOGPAGEREAD_ERROR;
 	}
 	else
 	{
@@ -945,7 +1004,17 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 */
 	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
 				 &errinfo))
+	{
+		/*
+		 * When not following timeline changes, we may read past the end of
+		 * available segments.  Report a missing file with an error return
+		 * code rather than raising an error.
+		 */
+		if (errinfo.wre_errno == ENOENT)
+			return XLOGPAGEREAD_ERROR;
+
 		WALReadRaiseError(&errinfo);
+	}
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 86847cbb54..448c83b684 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -835,7 +835,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 
 	/* fail if not (implies we are going to shut down) */
 	if (flushptr < targetPagePtr + reqLen)
-		return -1;
+		return XLOGPAGEREAD_ERROR;
 
 	if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
 		count = XLOG_BLCKSZ;	/* more than one block available */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index d930fe957d..3a5ab4b3ce 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -57,6 +57,10 @@ typedef struct WALSegmentContext
 
 typedef struct XLogReaderState XLogReaderState;
 
+/* Special negative return values for XLogPageReadCB functions */
+#define XLOGPAGEREAD_ERROR		-1
+#define XLOGPAGEREAD_WOULDBLOCK	-2
+
 /* Function type definitions for various xlogreader interactions */
 typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
@@ -76,10 +80,11 @@ typedef struct XLogReaderRoutine
 	 * This callback shall read at least reqLen valid bytes of the xlog page
 	 * starting at targetPagePtr, and store them in readBuf.  The callback
 	 * shall return the number of bytes read (never more than XLOG_BLCKSZ), or
-	 * -1 on failure.  The callback shall sleep, if necessary, to wait for the
-	 * requested bytes to become available.  The callback will not be invoked
-	 * again for the same page unless more than the returned number of bytes
-	 * are needed.
+	 * XLOGPAGEREAD_ERROR on failure.  The callback shall either sleep, if
+	 * necessary, to wait for the requested bytes to become available, or
+	 * return XLOGPAGEREAD_WOULDBLOCK.  The callback will not be invoked again
+	 * for the same page unless more than the returned number of bytes are
+	 * needed.
 	 *
 	 * targetRecPtr is the position of the WAL record we're reading.  Usually
 	 * it is equal to targetPagePtr + reqLen, but sometimes xlogreader needs
@@ -91,6 +96,7 @@ typedef struct XLogReaderRoutine
 	 * read from.
 	 */
 	XLogPageReadCB page_read;
+	void	   *page_read_private;
 
 	/*
 	 * Callback to open the specified WAL segment for reading.  ->seg.ws_file
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index e59b6cf3a9..6325c23dc2 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,12 +47,38 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
 
+/*
+ * A pointer to an XLogReadLocalOptions struct can be supplied as the private data
+ * for an XLogReader, causing read_local_xlog_page() to modify its behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+	/* Don't block waiting for new WAL to arrive. */
+	bool		nowait;
+
+	/*
+	 * For XLRO_WALRCV_WRITTEN and XLRO_END modes, the timeline ID must be
+	 * provided.
+	 */
+	TimeLineID	tli;
+
+	/* How far to read. */
+	enum {
+		XLRO_STANDARD,
+		XLRO_WALRCV_WRITTEN,
+		XLRO_END
+	} read_upto_policy;
+} XLogReadLocalOptions;
+
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
 								 XLogRecPtr targetRecPtr, char *cur_page);
 extern void wal_segment_open(XLogReaderState *state,
 							 XLogSegNo nextSegNo,
 							 TimeLineID *tli_p);
+extern void wal_segment_try_open(XLogReaderState *state,
+								 XLogSegNo nextSegNo,
+								 TimeLineID *tli_p);
 extern void wal_segment_close(XLogReaderState *state);
 
 extern void XLogReadDetermineTimeline(XLogReaderState *state,
-- 
2.20.1

v9-0003-Prefetch-referenced-blocks-during-recovery.patch (text/x-patch)
From 68cbfa9e553359a57a4806cab8af60b0450f7e5b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Apr 2020 23:04:51 +1200
Subject: [PATCH v9 3/3] Prefetch referenced blocks during recovery.

Introduce a new GUC max_recovery_prefetch_distance.  If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks.  The
goal is to avoid I/O stalls and benefit from concurrent I/O.  The number
of concurrent asynchronous reads is capped by the existing
maintenance_io_concurrency GUC.  The feature is enabled by default for
now, but we might reconsider that before release.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  45 +
 doc/src/sgml/monitoring.sgml                  |  85 +-
 doc/src/sgml/wal.sgml                         |  13 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             |  16 +
 src/backend/access/transam/xlogprefetch.c     | 910 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  14 +
 src/backend/postmaster/pgstat.c               |  96 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  47 +-
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/access/xlogprefetch.h             |  85 ++
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/pgstat.h                          |  27 +
 src/include/utils/guc.h                       |   4 +
 src/test/regress/expected/rules.out           |  11 +
 16 files changed, 1366 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetch.c
 create mode 100644 src/include/access/xlogprefetch.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a2694e548a..0c9842b0f9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3121,6 +3121,51 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-recovery-prefetch-distance" xreflabel="max_recovery_prefetch_distance">
+      <term><varname>max_recovery_prefetch_distance</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_recovery_prefetch_distance</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        The maximum distance to look ahead in the WAL during recovery, to find
+        blocks to prefetch.  Prefetching blocks that will soon be needed can
+        reduce I/O wait times.  The number of concurrent prefetches is limited
+        by this setting as well as
+        <xref linkend="guc-maintenance-io-concurrency"/>.  Setting it too high
+        might be counterproductive, if it means that data falls out of the
+        kernel cache before it is needed.  If this value is specified without
+        units, it is taken as bytes.  A setting of -1 disables prefetching
+        during recovery.
+        The default is 256kB on systems that support
+        <function>posix_fadvise</function>, and otherwise -1.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+      <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks that were logged with full page images,
+        during recovery.  Often this doesn't help, since such blocks will not
+        be read the first time they are needed and might remain in the buffer
+        pool after that.  However, on file systems with a block size larger
+        than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a
+        costly read-before-write when blocks are later written.  This
+        setting has no effect unless
+        <xref linkend="guc-max-recovery-prefetch-distance"/> is set to a positive
+        number.  The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 49d4bb13b9..0ab278e087 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2674,6 +2681,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    connected server.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+    </row>
+    <row>
+     <entry><structfield>skip_seq</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+    </row>
+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_distance</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>How far ahead of recovery the prefetcher is on average, while recovery is not idle</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_queue_depth</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>Average number of prefetches in flight while recovery is not idle</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-max-recovery-prefetch-distance"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-max-recovery-prefetch-distance"/>,
+   <xref linkend="guc-recovery-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="1">
@@ -4494,8 +4573,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         argument.  The argument can be <literal>bgwriter</literal> to reset
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
-        view,or <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view.
+        view, <literal>archiver</literal> to reset all the counters shown in
+        the <structname>pg_stat_archiver</structname> view, and
+        <literal>prefetch_recovery</literal> to reset all the counters shown
+        in the <structname>pg_stat_prefetch_recovery</structname> view.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index bd9fae544c..38fc8149a8 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -719,6 +719,19 @@
    <acronym>WAL</acronym> call being logged to the server log. This
    option might be replaced by a more general mechanism in the future.
   </para>
+
+  <para>
+   The <xref linkend="guc-max-recovery-prefetch-distance"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed, in combination with the
+   <xref linkend="guc-maintenance-io-concurrency"/> parameter.  The
+   prefetching mechanism is most likely to be effective on systems
+   with <varname>full_page_writes</varname> set to
+   <varname>off</varname> (where that is safe), and where the working
+   set is larger than RAM.  By default, prefetching in recovery is enabled,
+   but it can be disabled by setting the distance to -1.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..39f9d4e77d 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetch.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ca09d81b08..81147d5f59 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetch.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -7169,6 +7170,7 @@ StartupXLOG(void)
 		{
 			ErrorContextCallback errcallback;
 			TimestampTz xtime;
+			XLogPrefetchState prefetch;
 
 			InRedo = true;
 
@@ -7176,6 +7178,9 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
 
+			/* Prepare to prefetch, if configured. */
+			XLogPrefetchBegin(&prefetch);
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7205,6 +7210,12 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/* Perform WAL prefetching, if enabled. */
+				XLogPrefetch(&prefetch,
+							 ThisTimeLineID,
+							 xlogreader->ReadRecPtr,
+							 currentSource == XLOG_FROM_STREAM);
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -7376,6 +7387,9 @@ StartupXLOG(void)
 					 */
 					if (switchedTLI && AllowCascadeReplication())
 						WalSndWakeup();
+
+					/* Reset the prefetcher. */
+					XLogPrefetchReconfigure();
 				}
 
 				/* Exit loop if we reached inclusive recovery target */
@@ -7392,6 +7406,7 @@ StartupXLOG(void)
 			/*
 			 * end of main redo apply loop
 			 */
+			XLogPrefetchEnd(&prefetch);
 
 			if (reachedRecoveryTarget)
 			{
@@ -12138,6 +12153,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					currentSource = XLOG_FROM_STREAM;
 					startWalReceiver = true;
+					XLogPrefetchReconfigure();
 					break;
 
 				case XLOG_FROM_STREAM:
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
new file mode 100644
index 0000000000..6d8cff12c6
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -0,0 +1,910 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetch.c
+ *
+ * The goal of this module is to read future WAL records and issue
+ * PrefetchSharedBuffer() calls for referenced blocks, so that we avoid I/O
+ * stalls in the main recovery loop.  Currently, this is achieved by using a
+ * separate XLogReader to read ahead.  In future, we should find a way to
+ * avoid reading and decoding each record twice.
+ *
+ * When examining a WAL record from the future, we need to consider that a
+ * referenced block or segment file might not exist on disk until this record
+ * or some earlier record has been replayed.  After a crash, a file might also
+ * be missing because it was dropped by a later WAL record; in that case, it
+ * will be recreated when this record is replayed.  These cases are handled by
+ * recognizing them and adding a "filter" that prevents all prefetching of a
+ * certain block range until the present WAL record has been replayed.  Blocks
+ * skipped for these reasons are counted as "skip_new" (that is, cases where we
+ * didn't try to prefetch "new" blocks).
+ *
+ * Blocks found in the buffer pool already are counted as "skip_hit".
+ * Repeated access to the same buffer is detected and skipped, and this is
+ * counted with "skip_seq".  Blocks that were logged with FPWs are skipped if
+ * recovery_prefetch_fpw is off, since on most systems there will be no I/O
+ * stall; this is counted with "skip_fpw".
+ *
+ * The only way we currently have to know that an I/O initiated with
+ * PrefetchSharedBuffer() has completed is to call ReadBuffer().  Therefore,
+ * we track the number of potentially in-flight I/Os by using a circular
+ * buffer of LSNs.  When it's full, we have to wait for recovery to replay
+ * records so that the queue depth can be reduced, before we can do any more
+ * prefetching.  Ideally, this keeps us the right distance ahead to respect
+ * maintenance_io_concurrency.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetch.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log
+ * message that appears at the end of crash recovery.  It's also used to send
+ * messages periodically to the stats collector, to save the counters on disk.
+ */
+#define XLOGPREFETCHER_SAMPLE_DISTANCE 0x40000
+
+/* GUCs */
+int			max_recovery_prefetch_distance = -1;
+bool		recovery_prefetch_fpw = false;
+
+int			XLogPrefetchReconfigureCount;
+
+/*
+ * A prefetcher object.  There is at most one of these in existence at a time,
+ * recreated whenever there is a configuration change.
+ */
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	XLogReadLocalOptions options;
+	bool			have_record;
+	bool			shutdown;
+	int				next_block_id;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation	last_reln;
+	RelFileNode		last_rnode;
+	BlockNumber		last_blkno;
+
+	/* Online averages. */
+	uint64			samples;
+	double			avg_queue_depth;
+	double			avg_distance;
+	XLogRecPtr		next_sample_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB		   *filter_table;
+	dlist_head		filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	int				prefetch_head;
+	int				prefetch_tail;
+	int				prefetch_queue_size;
+	XLogRecPtr		prefetch_queue[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode		rnode;
+	XLogRecPtr		filter_until_replayed;
+	BlockNumber		filter_from_block;
+	dlist_node		link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time; /* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 skip_hit;	/* Blocks already buffered. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+	pg_atomic_uint64 skip_seq;	/* Repeat blocks skipped. */
+	float			avg_distance;
+	float			avg_queue_depth;
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32			reset_handled;
+
+	/* Dynamic values */
+	int				distance;	/* Number of bytes ahead in the WAL. */
+	int				queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+static void XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher,
+									  XLogRecPtr replaying_lsn);
+static bool XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher);
+static void XLogPrefetchSaveStats(void);
+static void XLogPrefetchRestoreStats(void);
+
+static XLogPrefetchStats *Stats;
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&Stats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&Stats->prefetch, 0);
+	pg_atomic_write_u64(&Stats->skip_hit, 0);
+	pg_atomic_write_u64(&Stats->skip_new, 0);
+	pg_atomic_write_u64(&Stats->skip_fpw, 0);
+	pg_atomic_write_u64(&Stats->skip_seq, 0);
+	Stats->avg_distance = 0;
+	Stats->avg_queue_depth = 0;
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	Stats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+	if (!found)
+	{
+		pg_atomic_init_u32(&Stats->reset_request, 0);
+		Stats->reset_handled = 0;
+		pg_atomic_init_u64(&Stats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&Stats->prefetch, 0);
+		pg_atomic_init_u64(&Stats->skip_hit, 0);
+		pg_atomic_init_u64(&Stats->skip_new, 0);
+		pg_atomic_init_u64(&Stats->skip_fpw, 0);
+		pg_atomic_init_u64(&Stats->skip_seq, 0);
+		Stats->avg_distance = 0;
+		Stats->avg_queue_depth = 0;
+		Stats->distance = 0;
+		Stats->queue_depth = 0;
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&Stats->reset_request, 1);
+}
+
+/*
+ * Tell the stats collector to serialize the shared memory counters into the
+ * stats file.
+ */
+static void
+XLogPrefetchSaveStats(void)
+{
+	PgStat_RecoveryPrefetchStats serialized = {
+		.prefetch = pg_atomic_read_u64(&Stats->prefetch),
+		.skip_hit = pg_atomic_read_u64(&Stats->skip_hit),
+		.skip_new = pg_atomic_read_u64(&Stats->skip_new),
+		.skip_fpw = pg_atomic_read_u64(&Stats->skip_fpw),
+		.skip_seq = pg_atomic_read_u64(&Stats->skip_seq),
+		.stat_reset_timestamp = pg_atomic_read_u64(&Stats->reset_time)
+	};
+
+	pgstat_send_recoveryprefetch(&serialized);
+}
+
+/*
+ * Try to restore the shared memory counters from the stats file.
+ */
+static void
+XLogPrefetchRestoreStats(void)
+{
+	PgStat_RecoveryPrefetchStats *serialized = pgstat_fetch_recoveryprefetch();
+
+	if (serialized->stat_reset_timestamp != 0)
+	{
+		pg_atomic_write_u64(&Stats->prefetch, serialized->prefetch);
+		pg_atomic_write_u64(&Stats->skip_hit, serialized->skip_hit);
+		pg_atomic_write_u64(&Stats->skip_new, serialized->skip_new);
+		pg_atomic_write_u64(&Stats->skip_fpw, serialized->skip_fpw);
+		pg_atomic_write_u64(&Stats->skip_seq, serialized->skip_seq);
+		pg_atomic_write_u64(&Stats->reset_time, serialized->stat_reset_timestamp);
+	}
+}
+
+/*
+ * Initialize an XLogPrefetchState object and restore the last saved
+ * statistics from disk.
+ */
+void
+XLogPrefetchBegin(XLogPrefetchState *state)
+{
+	XLogPrefetchRestoreStats();
+
+	/* We'll reconfigure on the first call to XLogPrefetch(). */
+	state->prefetcher = NULL;
+	state->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+}
+
+/*
+ * Shut down the prefetching infrastructure, if configured.
+ */
+void
+XLogPrefetchEnd(XLogPrefetchState *state)
+{
+	XLogPrefetchSaveStats();
+
+	if (state->prefetcher)
+		XLogPrefetcherFree(state->prefetcher);
+	state->prefetcher = NULL;
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(TimeLineID tli, XLogRecPtr lsn, bool streaming)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+	XLogReaderRoutine reader_routines = {
+		.page_read = read_local_xlog_page,
+		.segment_open = wal_segment_try_open,
+		.segment_close = wal_segment_close
+	};
+
+	/*
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.  We add one to the size
+	 * because our circular buffer has a gap between head and tail when full.
+	 */
+	prefetcher = palloc0(offsetof(XLogPrefetcher, prefetch_queue) +
+						 sizeof(XLogRecPtr) * (maintenance_io_concurrency + 1));
+	prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
+	prefetcher->options.tli = tli;
+	prefetcher->options.nowait = true;
+	if (streaming)
+	{
+		/*
+		 * We're only allowed to read as far as the WAL receiver has written.
+		 * We don't have to wait for it to be flushed, though, as recovery
+		 * does, so that gives us a chance to get a bit further ahead.
+		 */
+		prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+	}
+	else
+	{
+		/* Read as far as we can. */
+		prefetcher->options.read_upto_policy = XLRO_END;
+	}
+	reader_routines.page_read_private = &prefetcher->options;
+	prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+											NULL,
+											&reader_routines,
+											NULL);
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	/* Prepare to read at the given LSN. */
+	ereport(LOG,
+			(errmsg("recovery started prefetching on timeline %u at %X/%X",
+					tli,
+					(uint32) (lsn >> 32), (uint32) lsn)));
+	XLogBeginRead(prefetcher->reader, lsn);
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	/* Log final statistics. */
+	ereport(LOG,
+			(errmsg("recovery finished prefetching at %X/%X; "
+					"prefetch = " UINT64_FORMAT ", "
+					"skip_hit = " UINT64_FORMAT ", "
+					"skip_new = " UINT64_FORMAT ", "
+					"skip_fpw = " UINT64_FORMAT ", "
+					"skip_seq = " UINT64_FORMAT ", "
+					"avg_distance = %f, "
+					"avg_queue_depth = %f",
+			 (uint32) (prefetcher->reader->EndRecPtr >> 32),
+			 (uint32) (prefetcher->reader->EndRecPtr),
+			 pg_atomic_read_u64(&Stats->prefetch),
+			 pg_atomic_read_u64(&Stats->skip_hit),
+			 pg_atomic_read_u64(&Stats->skip_new),
+			 pg_atomic_read_u64(&Stats->skip_fpw),
+			 pg_atomic_read_u64(&Stats->skip_seq),
+			 Stats->avg_distance,
+			 Stats->avg_queue_depth)));
+	XLogReaderFree(prefetcher->reader);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Called when recovery is replaying a new LSN, to check if we can read ahead.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	uint32		reset_request;
+
+	/* If an error has occurred or we've hit the end of the WAL, do nothing. */
+	if (prefetcher->shutdown)
+		return;
+
+	/*
+	 * Have any in-flight prefetches definitely completed, judging by the LSN
+	 * that is currently being replayed?
+	 */
+	XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+	/*
+	 * Do we already have the maximum permitted number of I/Os running
+	 * (according to the information we have)?  If so, we have to wait for at
+	 * least one to complete, so give up early and let recovery catch up.
+	 */
+	if (XLogPrefetcherSaturated(prefetcher))
+		return;
+
+	/*
+	 * Can we drop any filters yet?  This happens when the LSN that is
+	 * currently being replayed has moved past a record that prevents
+	 * prefetching of a block range, such as relation extension.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&Stats->reset_request);
+	if (reset_request != Stats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		Stats->reset_handled = reset_request;
+		prefetcher->avg_distance = 0;
+		prefetcher->avg_queue_depth = 0;
+		prefetcher->samples = 0;
+	}
+
+	/* OK, we can now try reading ahead. */
+	XLogPrefetcherScanRecords(prefetcher, replaying_lsn);
+}
+
+/*
+ * Read ahead as far as we are allowed to, considering the LSN that recovery
+ * is currently replaying.
+ */
+static void
+XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	XLogReaderState *reader = prefetcher->reader;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	for (;;)
+	{
+		char *error;
+		int64		distance;
+
+		/* If we don't already have a record, then try to read one. */
+		if (!prefetcher->have_record)
+		{
+			if (!XLogReadRecord(reader, &error))
+			{
+				/* If we got an error, log it and give up. */
+				if (error)
+				{
+					ereport(LOG, (errmsg("recovery no longer prefetching: %s", error)));
+					prefetcher->shutdown = true;
+					Stats->queue_depth = 0;
+					Stats->distance = 0;
+				}
+				/* Otherwise, we'll try again later when more data is here. */
+				return;
+			}
+			prefetcher->have_record = true;
+			prefetcher->next_block_id = 0;
+		}
+
+		/* How far ahead of replay are we now? */
+		distance = prefetcher->reader->ReadRecPtr - replaying_lsn;
+
+		/* Update distance shown in shm. */
+		Stats->distance = distance;
+
+		/* Periodically recompute some statistics. */
+		if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+		{
+			/* Compute online averages. */
+			prefetcher->samples++;
+			if (prefetcher->samples == 1)
+			{
+				prefetcher->avg_distance = Stats->distance;
+				prefetcher->avg_queue_depth = Stats->queue_depth;
+			}
+			else
+			{
+				prefetcher->avg_distance +=
+					(Stats->distance - prefetcher->avg_distance) /
+					prefetcher->samples;
+				prefetcher->avg_queue_depth +=
+					(Stats->queue_depth - prefetcher->avg_queue_depth) /
+					prefetcher->samples;
+			}
+
+			/* Expose it in shared memory. */
+			Stats->avg_distance = prefetcher->avg_distance;
+			Stats->avg_queue_depth = prefetcher->avg_queue_depth;
+
+			/* Also periodically save the simple counters. */
+			XLogPrefetchSaveStats();
+
+			prefetcher->next_sample_lsn =
+				replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
+		}
+
+		/* Are we too far ahead of replay? */
+		if (distance >= max_recovery_prefetch_distance)
+			break;
+
+		/* Are we not far enough ahead? */
+		if (distance <= 0)
+		{
+			prefetcher->have_record = false;	/* skip this record */
+			continue;
+		}
+
+		/*
+		 * If this is a record that creates a new SMGR relation, we'll avoid
+		 * prefetching anything from that rnode until it has been replayed.
+		 */
+		if (replaying_lsn < reader->ReadRecPtr &&
+			XLogRecGetRmid(reader) == RM_SMGR_ID &&
+			(XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+		{
+			xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+			XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+									reader->ReadRecPtr);
+		}
+
+		/* Scan the record's block references. */
+		if (!XLogPrefetcherScanBlocks(prefetcher))
+			return;
+
+		/* Advance to the next record. */
+		prefetcher->have_record = false;
+	}
+}
+
+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+	XLogReaderState *reader = prefetcher->reader;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	/*
+	 * We might already have been partway through processing this record when
+	 * our queue became saturated, so we need to start where we left off.
+	 */
+	for (int block_id = prefetcher->next_block_id;
+		 block_id <= reader->max_block_id;
+		 ++block_id)
+	{
+		PrefetchBufferResult prefetch;
+		DecodedBkpBlock *block = &reader->blocks[block_id];
+		SMgrRelation reln;
+
+		/* Ignore everything but the main fork for now. */
+		if (block->forknum != MAIN_FORKNUM)
+			continue;
+
+		/*
+		 * If there is a full page image attached, we won't be reading the
+		 * page, so you might think we should skip it.  However, if the
+		 * underlying filesystem uses larger logical blocks than us, it
+		 * might still need to perform a read-before-write some time later.
+		 * Therefore, only prefetch if configured to do so.
+		 */
+		if (block->has_image && !recovery_prefetch_fpw)
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
+			continue;
+		}
+
+		/*
+		 * If this block will initialize a new page then it's probably an
+		 * extension.  Since it might create a new segment, we can't try
+		 * to prefetch this block until the record has been replayed, or we
+		 * might try to open a file that doesn't exist yet.
+		 */
+		if (block->flags & BKPBLOCK_WILL_INIT)
+		{
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+									reader->ReadRecPtr);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Should we skip this block due to a filter? */
+		if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Fast path for repeated references to the same relation. */
+		if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+		{
+			/*
+			 * If this is a repeat access to the same block, then skip it.
+			 *
+			 * XXX We could also check for last_blkno + 1 too, and also update
+			 * last_blkno; it's not clear if the kernel would do a better job
+			 * of sequential prefetching.
+			 */
+			if (block->blkno == prefetcher->last_blkno)
+			{
+				pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
+				continue;
+			}
+
+			/* We can avoid calling smgropen(). */
+			reln = prefetcher->last_reln;
+		}
+		else
+		{
+			/* Otherwise we have to open it. */
+			reln = smgropen(block->rnode, InvalidBackendId);
+			prefetcher->last_rnode = block->rnode;
+			prefetcher->last_reln = reln;
+		}
+		prefetcher->last_blkno = block->blkno;
+
+		/* Try to prefetch this block! */
+		prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+		if (BufferIsValid(prefetch.recent_buffer))
+		{
+			/*
+			 * It was already cached, so do nothing.  Perhaps in future we
+			 * could remember the buffer so that recovery doesn't have to look
+			 * it up again.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
+		}
+		else if (prefetch.initiated_io)
+		{
+			/*
+			 * I/O has possibly been initiated (we don't know whether the
+			 * kernel already had the page cached, so for lack of better
+			 * information we assume an I/O was really started).  Record
+			 * this as an I/O in progress until we eventually replay this
+			 * LSN.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->prefetch, 1);
+			XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+			/*
+			 * If the queue is now full, we'll have to wait before processing
+			 * any more blocks from this record, or move to a new record if
+			 * that was the last block.
+			 */
+			if (XLogPrefetcherSaturated(prefetcher))
+			{
+				prefetcher->next_block_id = block_id + 1;
+				return false;
+			}
+		}
+		else
+		{
+			/*
+			 * Neither cached nor initiated.  The underlying segment file
+			 * doesn't exist.  Presumably it will be unlinked by a later WAL
+			 * record.  When recovery reads this block, it will use the
+			 * EXTENSION_CREATE_RECOVERY flag.  We certainly don't want to do
+			 * that sort of thing while merely prefetching, so let's just
+			 * ignore references to this relation until this record is
+			 * replayed, and let recovery create the dummy file or complain if
+			 * something is wrong.
+			 */
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+									reader->ReadRecPtr);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&Stats->reset_request) != Stats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&Stats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&Stats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_new));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_fpw));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_seq));
+	values[6] = Int32GetDatum(Stats->distance);
+	values[7] = Int32GetDatum(Stats->queue_depth);
+	values[8] = Float4GetDatum(Stats->avg_distance);
+	values[9] = Float4GetDatum(Stats->avg_queue_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the (presumably lower) block
+		 * number there because we don't want to have to track individual
+		 * blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of the
+	 * time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+												   HASH_FIND, NULL);
+
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Insert an LSN into the queue.  The queue must not be full already.  This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+						  XLogRecPtr prefetching_lsn)
+{
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+	prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+	prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+	Stats->queue_depth++;
+	Assert(Stats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet?  That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches.  For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+		   prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+	{
+		prefetcher->prefetch_tail++;
+		prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+		Stats->queue_depth--;
+		Assert(Stats->queue_depth >= 0);
+	}
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+	return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+		prefetcher->prefetch_tail;
+}
+
+void
+assign_max_recovery_prefetch_distance(int new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	max_recovery_prefetch_distance = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
+
+void
+assign_recovery_prefetch_fpw(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch_fpw = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 56420bbc9d..6c39b9ad48 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -826,6 +826,20 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.skip_hit,
+            s.skip_new,
+            s.skip_fpw,
+            s.skip_seq,
+            s.distance,
+            s.queue_depth,
+            s.avg_distance,
+            s.avg_queue_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d7f99d9944..5ac3fed4c6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "access/xlogprefetch.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -282,6 +283,7 @@ static int	localNumBackends = 0;
 static PgStat_ArchiverStats archiverStats;
 static PgStat_GlobalStats globalStats;
 static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
+static PgStat_RecoveryPrefetchStats recoveryPrefetchStats;
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -354,6 +356,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
+static void pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1370,11 +1373,20 @@ pgstat_reset_shared_counters(const char *target)
 		msg.m_resettarget = RESET_ARCHIVER;
 	else if (strcmp(target, "bgwriter") == 0)
 		msg.m_resettarget = RESET_BGWRITER;
+	else if (strcmp(target, "prefetch_recovery") == 0)
+	{
+		/*
+		 * We can't ask the stats collector to do this for us as it is not
+		 * attached to shared memory.
+		 */
+		XLogPrefetchRequestResetStats();
+		return;
+	}
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("unrecognized reset target: \"%s\"", target),
-				 errhint("Target must be \"archiver\" or \"bgwriter\".")));
+				 errhint("Target must be \"archiver\", \"bgwriter\" or \"prefetch_recovery\".")));
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
 	pgstat_send(&msg, sizeof(msg));
@@ -2696,6 +2708,22 @@ pgstat_fetch_slru(void)
 }
 
 
+/*
+ * ---------
+ * pgstat_fetch_recoveryprefetch() -
+ *
+ *	Support function for restoring the counters managed by xlogprefetch.c.
+ * ---------
+ */
+PgStat_RecoveryPrefetchStats *
+pgstat_fetch_recoveryprefetch(void)
+{
+	backend_read_statsfile();
+
+	return &recoveryPrefetchStats;
+}
+
+
 /* ------------------------------------------------------------
  * Functions for management of the shared-memory PgBackendStatus array
  * ------------------------------------------------------------
@@ -4444,6 +4472,23 @@ pgstat_send_slru(void)
 }
 
 
+/* ----------
+ * pgstat_send_recoveryprefetch() -
+ *
+ *		Send recovery prefetch statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats)
+{
+	PgStat_MsgRecoveryPrefetch msg;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYPREFETCH);
+	msg.m_stats = *stats;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
 /* ----------
  * PgstatCollectorMain() -
  *
@@ -4640,6 +4685,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_slru(&msg.msg_slru, len);
 					break;
 
+				case PGSTAT_MTYPE_RECOVERYPREFETCH:
+					pgstat_recv_recoveryprefetch(&msg.msg_recoveryprefetch, len);
+					break;
+
 				case PGSTAT_MTYPE_FUNCSTAT:
 					pgstat_recv_funcstat(&msg.msg_funcstat, len);
 					break;
@@ -4915,6 +4964,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
 	rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
 	(void) rc;					/* we'll check for error with ferror */
 
+	/*
+	 * Write recovery prefetch stats struct
+	 */
+	rc = fwrite(&recoveryPrefetchStats, sizeof(recoveryPrefetchStats), 1,
+				fpout);
+	(void) rc;					/* we'll check for error with ferror */
+
 	/*
 	 * Walk through the database table.
 	 */
@@ -5174,6 +5230,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 	memset(&globalStats, 0, sizeof(globalStats));
 	memset(&archiverStats, 0, sizeof(archiverStats));
 	memset(&slruStats, 0, sizeof(slruStats));
+	memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
 
 	/*
 	 * Set the current timestamp (will be kept only in case we can't load an
@@ -5261,6 +5318,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 		goto done;
 	}
 
+	/*
+	 * Read recoveryPrefetchStats struct
+	 */
+	if (fread(&recoveryPrefetchStats, 1, sizeof(recoveryPrefetchStats),
+			  fpin) != sizeof(recoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
+		goto done;
+	}
+
 	/*
 	 * We found an existing collector stats file. Read it and put all the
 	 * hashtable entries into place.
@@ -5560,6 +5629,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 	PgStat_GlobalStats myGlobalStats;
 	PgStat_ArchiverStats myArchiverStats;
 	PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
+	PgStat_RecoveryPrefetchStats myRecoveryPrefetchStats;
 	FILE	   *fpin;
 	int32		format_id;
 	const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -5625,6 +5695,18 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 		return false;
 	}
 
+	/*
+	 * Read recovery prefetch stats struct
+	 */
+	if (fread(&myRecoveryPrefetchStats, 1, sizeof(myRecoveryPrefetchStats),
+			  fpin) != sizeof(myRecoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		FreeFile(fpin);
+		return false;
+	}
+
 	/* By default, we're going to return the timestamp of the global file. */
 	*ts = myGlobalStats.stats_timestamp;
 
@@ -6422,6 +6504,18 @@ pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
 	slruStats[msg->m_index].truncate += msg->m_truncate;
 }
 
+/* ----------
+ * pgstat_recv_recoveryprefetch() -
+ *
+ *	Process a recovery prefetch message.
+ * ----------
+ */
+static void
+pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len)
+{
+	recoveryPrefetchStats = msg->m_stats;
+}
+
 /* ----------
  * pgstat_recv_recoveryconflict() -
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..221081bddc 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/xlogprefetch.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -124,6 +125,7 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, PredicateLockShmemSize());
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
+		size = add_size(size, XLogPrefetchShmemSize());
 		size = add_size(size, CLOGShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
@@ -212,6 +214,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2f3e0a70e0..2fea5f3dcd 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -34,6 +34,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetch.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/storage.h"
@@ -198,6 +199,7 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1272,6 +1274,18 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"recovery_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch blocks that have full page images in the WAL."),
+			gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+						 "entirely overwritten, but if the logical page size of the filesystem is "
+						 "larger than PostgreSQL's, this can be beneficial.  This option has no "
+						 "effect unless max_recovery_prefetch_distance is set to a positive number.")
+		},
+		&recovery_prefetch_fpw,
+		false,
+		NULL, assign_recovery_prefetch_fpw, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2649,6 +2663,22 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"max_recovery_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum number of bytes to read ahead in the WAL to prefetch referenced blocks."),
+			gettext_noop("Set to -1 to disable prefetching during recovery."),
+			GUC_UNIT_BYTE
+		},
+		&max_recovery_prefetch_distance,
+#ifdef USE_PREFETCH
+		256 * 1024,
+#else
+		-1,
+#endif
+		-1, INT_MAX,
+		NULL, assign_max_recovery_prefetch_distance, NULL
+	},
+
 	{
 		{"wal_keep_segments", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the number of WAL files held for standby servers."),
@@ -2968,7 +2998,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -11586,6 +11617,20 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 81055edde7..38763f88b0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -230,6 +230,11 @@
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
 
+# - Prefetching during recovery -
+
+#max_recovery_prefetch_distance = 256kB	# -1 disables prefetching
+#recovery_prefetch_fpw = off	# whether to prefetch pages logged with FPW
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlogprefetch.h b/src/include/access/xlogprefetch.h
new file mode 100644
index 0000000000..d8e2e1ca50
--- /dev/null
+++ b/src/include/access/xlogprefetch.h
@@ -0,0 +1,85 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCH_H
+#define XLOGPREFETCH_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern int	max_recovery_prefetch_distance;
+extern bool recovery_prefetch_fpw;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern int XLogPrefetchReconfigureCount;
+
+typedef struct XLogPrefetchState
+{
+	XLogPrefetcher *prefetcher;
+	int			reconfigure_count;
+} XLogPrefetchState;
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchReconfigure(void);
+extern void XLogPrefetchRequestResetStats(void);
+
+extern void XLogPrefetchBegin(XLogPrefetchState *state);
+extern void XLogPrefetchEnd(XLogPrefetchState *state);
+
+/* Functions exposed only for the use of XLogPrefetch(). */
+extern XLogPrefetcher *XLogPrefetcherAllocate(TimeLineID tli,
+											  XLogRecPtr lsn,
+											  bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch,
+									XLogRecPtr replaying_lsn);
+
+/*
+ * Tell the prefetching module that we are now replaying a given LSN, so that
+ * it can decide how far ahead to read in the WAL, if configured.
+ */
+static inline void
+XLogPrefetch(XLogPrefetchState *state,
+			 TimeLineID replaying_tli,
+			 XLogRecPtr replaying_lsn,
+			 bool from_stream)
+{
+	/*
+	 * Handle any configuration changes.  Rather than trying to deal with
+	 * various parameter changes, we just tear down and set up a new
+	 * prefetcher if anything we depend on changes.
+	 */
+	if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
+	{
+		/* If we had a prefetcher, tear it down. */
+		if (state->prefetcher)
+		{
+			XLogPrefetcherFree(state->prefetcher);
+			state->prefetcher = NULL;
+		}
+		/* If we want a prefetcher, set it up. */
+		if (max_recovery_prefetch_distance > 0)
+			state->prefetcher = XLogPrefetcherAllocate(replaying_tli,
+													   replaying_lsn,
+													   from_stream);
+		state->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	if (state->prefetcher)
+		XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
+}
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 61f2c2f5b4..56b48bf2ad 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6136,6 +6136,14 @@
   prorettype => 'bool', proargtypes => '',
   prosrc => 'pg_is_wal_replay_paused' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,float4,float4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth,avg_distance,avg_queue_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481c..0dcd3c377a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -62,6 +62,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_SLRU,
+	PGSTAT_MTYPE_RECOVERYPREFETCH,
 	PGSTAT_MTYPE_FUNCSTAT,
 	PGSTAT_MTYPE_FUNCPURGE,
 	PGSTAT_MTYPE_RECOVERYCONFLICT,
@@ -182,6 +183,19 @@ typedef struct PgStat_TableXactStatus
 	struct PgStat_TableXactStatus *next;	/* next of same subxact */
 } PgStat_TableXactStatus;
 
+/*
+ * Recovery prefetching statistics persisted on disk by pgstat.c, but kept in
+ * shared memory by xlogprefetch.c.
+ */
+typedef struct PgStat_RecoveryPrefetchStats
+{
+	PgStat_Counter prefetch;
+	PgStat_Counter skip_hit;
+	PgStat_Counter skip_new;
+	PgStat_Counter skip_fpw;
+	PgStat_Counter skip_seq;
+	TimestampTz stat_reset_timestamp;
+} PgStat_RecoveryPrefetchStats;
 
 /* ------------------------------------------------------------
  * Message formats follow
@@ -453,6 +467,16 @@ typedef struct PgStat_MsgSLRU
 	PgStat_Counter m_truncate;
 } PgStat_MsgSLRU;
 
+/* ----------
+ * PgStat_MsgRecoveryPrefetch			Sent by XLogPrefetch to save statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgRecoveryPrefetch
+{
+	PgStat_MsgHdr m_hdr;
+	PgStat_RecoveryPrefetchStats m_stats;
+} PgStat_MsgRecoveryPrefetch;
+
 /* ----------
  * PgStat_MsgRecoveryConflict	Sent by the backend upon recovery conflict
  * ----------
@@ -597,6 +621,7 @@ typedef union PgStat_Msg
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgSLRU msg_slru;
+	PgStat_MsgRecoveryPrefetch msg_recoveryprefetch;
 	PgStat_MsgFuncstat msg_funcstat;
 	PgStat_MsgFuncpurge msg_funcpurge;
 	PgStat_MsgRecoveryConflict msg_recoveryconflict;
@@ -1458,6 +1483,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1473,6 +1499,7 @@ extern int	pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_RecoveryPrefetchStats *pgstat_fetch_recoveryprefetch(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 2819282181..976cf8b116 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -440,4 +440,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetch.c */
+extern void assign_max_recovery_prefetch_distance(int new_value, void *extra);
+extern void assign_recovery_prefetch_fpw(bool new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b813e32215..74dd8c604c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1857,6 +1857,17 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.skip_hit,
+    s.skip_new,
+    s.skip_fpw,
+    s.skip_seq,
+    s.distance,
+    s.queue_depth,
+    s.avg_distance,
+    s.avg_queue_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth, avg_distance, avg_queue_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.20.1

#22Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Thomas Munro (#21)
Re: WIP: WAL prefetch (another approach)

Thomas Munro wrote:

@@ -1094,8 +1103,16 @@ WALRead(XLogReaderState *state,
XLByteToSeg(recptr, nextSegNo, state->segcxt.ws_segsize);
state->routine.segment_open(state, nextSegNo, &tli);

-			/* This shouldn't happen -- indicates a bug in segment_open */
-			Assert(state->seg.ws_file >= 0);
+			/* callback reported that there was no such file */
+			if (state->seg.ws_file < 0)
+			{
+				errinfo->wre_errno = errno;
+				errinfo->wre_req = 0;
+				errinfo->wre_read = 0;
+				errinfo->wre_off = startoff;
+				errinfo->wre_seg = state->seg;
+				return false;
+			}

Ah, this is what Michael was saying ... we need to fix WALRead so that
it doesn't depend on segment_open always returning a good FD. This needs
a fix everywhere, not just here, along with an improved error reporting
interface.
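
For illustration, a caller that is prepared for segment_open to fail could
then do something like this (a sketch only, not taken from the patch; it
assumes the errinfo fields from the hunk above, the existing
WALReadRaiseError() helper, and the variable names used in
read_local_xlog_page()):

	WALReadError errinfo;

	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &errinfo))
	{
		/*
		 * Hypothetical handling: with the change quoted above, a missing
		 * segment shows up here as wre_errno == ENOENT instead of a
		 * failed Assert, so a prefetching caller can back off instead of
		 * erroring out.
		 */
		if (errinfo.wre_errno == ENOENT)
			return -1;			/* page_read convention for "no data" */

		WALReadRaiseError(&errinfo);
	}

Ordinary callers would keep raising the error unconditionally.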

Maybe it does make sense to get it fixed in pg13 and avoid a break
later.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Thomas Munro (#21)
3 attachment(s)
Re: WIP: WAL prefetch (another approach)

Hi,

I've spent some time testing this, mostly from the performance point of
view. I've done a very simple thing, in order to have a reproducible test:

1) I've initialized pgbench with scale 8000 (so ~120GB on a machine with
only 64GB of RAM)

2) created a physical backup, enabled WAL archiving

3) did 1h pgbench run with 32 clients

4) disabled full-page writes and did another 1h pgbench run

Once I had this, I did a recovery using the physical backup and WAL
archive, measuring how long it took to apply each WAL segment. First
without any prefetching (current master), then twice with prefetching:
first with default values (m_io_c=10, distance=256kB) and then with
higher values (100 + 2MB).
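
For reference, the view and reset target added by the patch give a way to
watch the prefetcher during a run and to clear the counters between runs
(just a sketch, assuming the recovering server accepts hot-standby
connections; this was not part of the procedure above):

    -- how far ahead the prefetcher is, and what it has skipped so far
    SELECT prefetch, skip_hit, skip_new, skip_fpw, skip_seq,
           distance, queue_depth, avg_distance, avg_queue_depth
      FROM pg_stat_prefetch_recovery;

    -- clear the counters before the next measured run
    SELECT pg_stat_reset_shared('prefetch_recovery');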

I did this on two storage systems I have in the system - NVME SSD and
SATA RAID (3 x 7.2k drives). So, a fast one and a slow one.

1) NVME

On the NVME, this generates ~26k WAL segments (~400GB), and each of the
pgbench runs generates ~120M transactions (~33k tps). Of course, the vast
majority of the WAL segments (~16k) comes from the first run, because
there's a lot of FPIs due to the random nature of the workload.

I did not expect a significant improvement from the prefetching, as
the NVME is pretty good at handling random I/O. The total duration looks
like this:

   no prefetch      prefetch     prefetch2
         10618         10385          9403

So the default is a tiny bit faster, and the more aggressive config
makes it about 10% faster. Not bad, considering the expectations.

Attached is a chart comparing the three runs. There are three clearly
visible parts - first the 1h run with f_p_w=on, with two checkpoints.
That's the first ~16k segments. Then there's a bit of a gap before the
second pgbench run was started - I think it's mostly autovacuum etc. And
then at segment ~23k the second pgbench (f_p_w=off) starts.

I think this shows the prefetching starts to help as the number of FPIs
decreases. It's subtle, but it's there.

2) SATA

On SATA it's just ~550 segments (~8.5GB), and the pgbench runs generate
only about 1M transactions. Again, the vast majority of the segments comes
from the first run, due to FPIs.

In this case, I don't have complete results, but after processing 542
segments (out of the ~550) it looks like this:

   no prefetch      prefetch     prefetch2
          6644          6635          8282

So the no prefetch and "default" prefetch are roughly on par, but the
"aggressive" prefetch is way slower. I'll get back to this shortly, but
I'd like to point out this is entirely due to the "no FPI" pgbench,
because after the first ~525 segments it looks like this:

   no prefetch      prefetch     prefetch2
            58            65            57

So it goes very fast through the initial segments with plenty of FPIs, and
then we get to the "no FPI" segments and the prefetch either does not
help or makes it slower.

Looking at how long it takes to apply the last few segments, it looks
like this:

   no prefetch      prefetch     prefetch2
           280           298           478

which is not particularly great, I guess. However, there seems to be
something wrong, because with prefetching I see this in the log:

prefetch:
2020-06-05 02:47:25.970 CEST 1591318045.970 [22961] LOG: recovery no
longer prefetching: unexpected pageaddr 108/E8000000 in log segment
0000000100000108000000FF, offset 0

prefetch2:
2020-06-05 15:29:23.895 CEST 1591363763.895 [26676] LOG: recovery no
longer prefetching: unexpected pageaddr 108/E8000000 in log segment
000000010000010900000001, offset 0

That seems pretty suspicious, but I have no idea what's wrong. I admit
the archive/restore commands are a bit hacky, but I've only seen this
with prefetching on the SATA storage, while all other cases seem to be
just fine. I haven't seen it on NVME (which processes much more WAL).
And the SATA baseline (no prefetching) also worked fine.

Moreover, the pageaddr value is the same in both cases, but the WAL
segments are different (but just one segment apart). Seems strange.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

nvme-prefetch.png (image/png)
]����������`��}��:��^�|;O�+w��s�7d.�D@���G����V��u�q�F�L
y��J_�����g����t�F�}jm��z���l����,:�fH_8��N���7Z����������M��c�q�k:YHc'�o�]��a�;��
�V��N�sp��������M���	�*t���`��b�B!'`������{��'o
@��������0U�)���Ys����}��^U��R�h��=���=g�Li�Ji�.i��n[��Yg�����=��|e���o~�Z�?:�������s���R6����	���),�p���;��;�Nco�o/��#��rP���������/U��?������f�y�E�Z�W-��:�l��w�ni�"o��=�0'���[_��u�
U���A����wn0K?�m�O�,�w!O�t�/�����u����
�cN�M�i}����9@�H(�<��_y����s����~��N@��Kg;A}�w���&�Io@o������{��7�Yt�d-��r�v�Ywxf�]����+���n���_�7����t[������0}��.��D��m���-����~���z]�t�5����z�V���������`�9���c}�-�O��
��A��1�6+��������G����jz�EK>J9h�������,�����:'���*K��z[��5�^o+�u�	�]e����-���~�7����]�8��geO�-{x���RTs��[k���(b$�5O�=I��r�H�&�z�NQ������4���-5h�}��
��X%���/wz����9)[�e'�O���������Z��+J���Wop��u>������L�	���ym���xcf��C3U���[����	��;��'���8����N`��;~�n��"g[!�2d��t^5�����|�O~������t��z�����Z����������}@Vq��x[���v�q��?�����/�
����Y��[���'���	�ae�}B@�0E��������>Y����v8�_��������spi����!����|���~]��[[������{/�B@�����jSBfyj����C���J�����3��{���slzL�{����]��t{���^���Q-�~tso���O ��P��i�?�^�Y���:��8�^�Ig\��zR?�j�t`���������	�<<Ol��6��
�[9���k#��ej��($�I���=���g������p���hSb��r�;��jo7��f���W�jN`_�;Uf���g��}x�m]g}��yA��"�l�N������:�I�9�W�Wk�@��������s��x������
RTc�o���}�6���6�����9��Pb�P���S���~��K�X1J�FN���o3����	���M�����S�doPo���96U&>�\�P�J�I�G��{/}��{�yO}$5o�]^o����{o���M'���7�v�q�d[��.@�G@!�x�^C>����f���9���S�r�wy�����|��9�F{��X+��Vo���B`���W~����r�r�������t��1��}�K[��\������ �����m���K/���#�u�g��w��e���f��:t����s\��#����������_z��8��o]~�
�?��["�($A����ST��)���������~��i��i������o���h[��-��
��8������_?��U�<R�F�����>��P�5�H����������!B�|���Z�M7i��M��)S�(::Z���w�]�tQ��=5v�Xw� ��R<�D���f7���-�3�����)�8��M�b�J�M�b*zg�Y�'K���;�;5'	@�IHH�=���o��V�~�i��%K��U�V~�Z���U�j�m\rJ�.~t�_�o~N�P���-�2Q�P�Yp��uw:���Xw�T�o��Hr1a�������?�s�9G+V�p��c�[��o�A�:���k��5nA��������`�?9M7�^���t�����(�*�RM�q��J�2�]���{�_�������0�����[	@.z������������7�xC����7o���/������7��i���JJJ����Z�my�>��Y@���c�N�����"�MH�����m'�<=V�����'����_�����O+.��,��Q� �|���n��&M��������{��G5l�0y<�����,X��;�u�����?oT����[w���}�Xo��������y��^v��w|;��.��w`*9 "�d���Ok��
�4�)))�����/���j����4i�F�����m����M��o������u�g��`��$e"����J�G�v.�^j���
@HJ����N�9h� 7�3f��������]��d���x����o����u�����wI�<�k�����|����O�%�H��c�=�Wn���f���������6��'��������<���3V'xvyW<��t�M��z�MC������������{_�I������Z����N=���u��f��S�y��o��gip��K� �����e�t�c�hG���o��[=Ia��?_�.��oc�>ieO�P�o]��R��N���@���������]����2i�����Zh�����������p`���V)u�o]D%��GN��f��y/}5KO��];������>O�y~T�������Kn�?p����A����$�{��kGY6���S_h���~�O�l�����Dm�Z�*=�����C��(��+������KU�@���H!g��m��>�6%��U���>Wa�$����]��C�xi�=���c��=)����4@�X�v�n���\���mm<�59�#�P�t�����%��v�Y~�
����`X�t�4)�b�
	����d����$��p�Q��j�^J����Uj�U���R�6�;Y������]������-��@ �J����4l�����[�o�,��,�Z����X:���N�I�A�Y����wRt�"�;%@�b3�������)�0���gy��A���������~&�v�{9�C�'��G�X&��������@	D(&�_��>�]��[�����N������(�k,�{O���'O��������M��$E5-��@`�J,{x���?��)9�s�g�y���g����t�(��.�;�������$����W;��Wy(&$����4i����/M���>1J�=�Yz0}������<��������E�����}'	����\.�}\��\����P"��z�>��T�M\�u��������t�g�������R���;��"m~BJ^��I�^+��W��[�W����\�Q?�^�'��=�}b�W��/���j����� ]r��<�J�v��6�K�$}��:�����M��Z�A�T����~Y����K�w��u�������z����tkZ��e�����G����X7��=�0������t�b�����6h��9�h��\�=��V�y��r�
�S�t�UR��;zP������i���O�����.J@�mOL��,��yk���%n��h��������@{�y/�
�����|p�w@��_�|�S�b���������!9{ ���-��[4��u����� ��=��[�T�s����R����&G�|��g����{%*]�}�W�F�q%P��
]���n����5~�
������37�����p�U���	��@��J�]#�n��������8���C��?�=����R��K1���@)@8nv�����i�>}6e�1�{0����L+t�g�w������Q�kO����������7��K�]���K��K�1�((�H�b�y�-��I��j�"������;����UQ��_�H)5E�����K5c};� ���	�'HI3���'�*���a�s���2
$����k���,=�o��������\�fh���S�U�y�u����,�
t�.uv^^'������xg������*o���T�)�B�/B	 ���{����Z�f�>��Xw�=j����H�)E����kt�6��'���()%Y���t��R��������_N��������{��e?u�T�T�)�l���E!�����p������Vow���;p(O��P�N�����w�^�Y�(�E��@�m'��s��
��������{���k/U<�y��,�sV���2G �R�f�Y�a���?g���x�����=I�y>�=m��g�:i�N�lQm����n�t��R�Z�e�I��K��>��H{^�����v��"�I�8�3���;�����k	�"��?m�&-Y��
����,�T����<���t�g�:k��:[h���d3������r��iR���`���'H{�H��I�?�{�������S����p|H��y�g/���+6%����h�v�����e��z��n�~7'����j�����!�s4i#u�X:��tR�x�O�i�3���y{�'Q���
�cN���x��<�p�H ��H<��6i����k����yn|��cs�w�l��N�����-n�o-���YO���^��s��w,�?�X���7���*WI�'x��!�u.(f6���m�n��U��l�.������?��C�~
�W7�/TG'�?���-g�\y�Zm����ix�����9���wI�<�(w���y9����:�?���@���s���bc���������>_���:1}�Nv��-n�������9�>���H7�Kj�N�WI:�L�7������!)=)�V��N�s�Q���(D�������Z?��+���?��J�h�v����><�������:�K=���v3;W�!��(5m+��/�+%������iE�a��xg�)��'���_�jR�����&@�!�|���_k�������n��aY�ug��m9�����R��{U?i�NK��fJPK�TsO�"c`o�*���R���A�������u�H��:��D'�HZ���,���o��N�����E���`EG�>�k�&��E�n;[w����k���JN)X��5���C�d��o���_����Y�Zi����@`��[��n���~#'H�#���	�78�������(�r[�[:;x���VN�Y�+;!�Z�����K7),,L����]�a��u'5-��ifY&\��Dk��$��8]mnP��s�\�t�g�"�9����E;A��$���R�fR��R��Nv��	�gH�;��c���-;�^�25�}�m*N�[�d��@�!PjY���w��n�o�����iN���S�>���u�����6��tb�.5?�EMv-S'�/��p��2'
�<���o����m*�9�	�#�3�\*%����)-�'�/���s�N��9g;�����{��R�@���/�4���')��V��F�r���������O�Rv���Y:�	��fk�OZ����Zm:9�~�z�}'o��9����T)���k-��4��iR����
R����	�Rc��������>���F�+��y�T�Q�.
�J��0u��$�U�9�Ki;��
P�[�R5��s��~��w=�y�x��zo�����w����7U��>E7���u?<��'������{
��/�������Au'���V�b��	�clS��
�\3C������W��|v	r�M��N9���\��S��Tv��<��,�v���B����]XD%'�����\��Qq��;"PJ<��dw���]QS�kro'_�@�<��g'�����c�uZ������N�_�f�Y#�n�.��y����M�#�1f�����Q��W3�\��^.p���~@����j��}�gf��g���}�NQ��\�7'�o*��$u�,U���6��<'��%m;K�	�=�w�}�V��R�8o�_����u%�u��������7���h`����iz��N@����o�R69�|��Lo_��g�W6���E�Z�M����q�Y�9wC�\;��%��}�������TI�����l��s�zN ^����Y��VJ�w�yR��NP^V����{sYG�/�.�>5�l}o��b$Af��)z���b�
U�\Y������^w����u�wh��e�Y����>}��6�$��`����@{u�g��������vK���������gsOn���?'A(w��7�����.U����+��0$Ad�������F�����^���W��]��cG�r�)�����~��i��i������o���h[��-�����0E��,�\�sP������$%}����L;��,�����oSk��;��	@9t��^~�e7�7���S�V�������+::Z��{��K���Sc��u��m��!�s�@!�t��J�w0�<���*(E�"F���_ma���6�����A��5�}���\�(f$A�V�Z����3����w����:K&Lp��������Z�j��%��?-������;<s���U�!����X���y[�m0.�p ����e���z���u�u�i���n?����:���k��5nA��������Z�5I�]�Yn��z-�G)�	�����[���������l���_� ��8��5K��s��d�"`������4}�t%%%��SO-���^����YY��s@���nf����g��
�i��W�������a�W�\<�`��$A������Z|���<�����[���a���x��[�`�;@�����$��g��?�h����u���Z�/=X�	���>h��+�VGJ� b���^{�F����n��)22R/������'k��I1b�bcc�
(I.|�c-^�#��Z���/=RY���1'KMlV��b�#%	@���o�v�Zw�������k��q�T��
r�1c�(..�������`�3�����r5��\��JR]{��R�����rEDz���v���M:c��B����V���3��<	Z���tK������=��#���<"�~�k��~����������"n��N,+U�-�>+Z��;Ai���:��2��������E����>��gD���@������o���������*�('U8�	��SL��d#T&�_����o���_������?R���T��b�%	�������1���r%�'�_����K'����T��b�!%	�����u��_2��������N����7RT�b�!�	�be3��q��s���u1J���c���������K���BB���;pH���?�'e����9q����4)��R�k����>$�����������e��/����N�b�&����Qo|�~��p�����W�����I���"c��r�r$fwB��~�����������J�y?�� ��P�H�Go~��>]�$���H���V_���c�&_K��c

$�T����}��������+�{�j}x�dU=�~���R�: �/.���|@=����)�We�Mo��Q����M��?F�py<����z`�*MVC�M-#wi��K���;��WSm$
�������9.^���tV�����4���9��!'���\���8N)�Z��������z������V�<�B����>��M�F�������E���n������s_��c����($���.M�I�_{Z�o9U���]jz���mQ�������B1T�����[.}��~�}��9p�~
�$���_��{�|La�Q� �+���(}��v����^SWc�J�zIaG�zR�!�=H���+�`���$�~� ���>��/�N��9��c����+��Xa�l��uA���h�������z�	�w����G�d=~�5:�e�W �/M��	��S��Y�Z�����]���R�=|u���Su��l]@�"B�������oGi����]XS�7��fE���!�7
��[.V��N���	PZ���KfI���fNP����,��>q^?F\����o��[���*�TY($@i��4�G����;�:PQ��7�p��������CW��.��V
��� �H��,=MZ:[�}�4�K�_�M�;�F��3Q9��kU)���B��kDB	P�l]/M�^�����/���5>����,M�h����+F���O���W�fuTaLH�`�i�4g�4q������;*iIX
}�RSu���so�oU?F�wm�k�j��Mk�� h������-��N�>.%�����KXC��A�tf���g(��.h���[����\_s�Cd��}���������Jw�s�������n���M���kZ�L�.h_]=�����(�v���@h#%-UZ8MZ9E����x����75�oa
�s���K5�~���-+����������*�� �H���i��+M�RZ�J)���[j]�R}�Ts�Y��yk�?�}
EGU�?zw��'6P�1����(;VHs?����,P��D�M��Ea5�CX�SkM
����v��-����[������U��D��@�c ���&m�,-�EZ�@�c��47�����l��a�i�*H�>�u��[5Z'6����>�����K(2$@�<�>'�_5EZ9MZ�T��������P
�W-�f��k�*�)������IMj����T�|���&��zi��/�*��+m�����h��Z��'��PS�j�*��}e�������>N'7��S��Q��
T�z����c @�H�!�_,-��y��V.��5�Z��������>?K��t���6p�X.9��������n�H�����	J����	�WH�'Kk~�����xQ���k����%,N[��j�NV����E�T.�
1�t^��j���NoYOm�j���$(���HJ��Jkg(e�<-_�,nH���-�WUs��h�N���;.�~\�
:�e��4����*�W���_����-e���HZ����������ai���W��=1�9Y���C����#q�8�N�sHl�����Z9���=t�etB��Es=�����ri�,�q���Vi��D-�QQ��*iyJUm
+�-������0O�p)�j�:���r�j���z��L
jUR8sn�Bf���;��C��-S��55t�P���'�9�]��&k���Z�d�f����me�:��6���RUW��:;��\>���Qa��m���U�Fu��Q
�����V�.�K(�HJ���d���K�V�~�4m�4u��]���W��-�
�n���S�|�"�X�V�6'iqb-8XM[=1���Js^�	�>u�<�����Za�W��Z�j���U��m}7��V�\�_@)CP�M�2E�������[����z����c�j��!�>_���Z�x��������i����O�h���������e��8-
\��uv�V��:�M5h�Xm��U�Z���'n�	@)�d��j��o���/\�0O�:�G����{�7mO���{����S��ub�>�.����R��'�N��:�M�����>@E� &L(�*7������w-���Z�fM��o��Zyh�����(s@m�U�\��V�����:�	�kU���Q��+e�V��W��@��=zw����7o������+)))��W��i��)���Ud��FR�jR��T�fS���'uT���Y&�2KlJ��E����{W2q�J.�]�
�$�\���5l�0y<���r�����c������T&����?]��Zkp��CP�u��M���z��5p�@M�<Y�&M��#�|��n���*��"(�,�7n�;��A��1c�(..����b@��k�3fw5HrP�J���
�{Wrq�J.�]��}+��w%W0�;�t���������\����{W2q�J.�]���!$@!B	BH�R	��?��GyDk��Q�j����j������g��;���e�T�fM
:T}��9�m8~S�Lq���+T�re������{�m������_�����[����~�^~�e�]z��7��h��a�S�3�~��;���/��������)I���I6m�����J�|��.��b��5K��uS��u��'�W�^<x����i���{��j���7n\�m-[�,�K.�v���=zh�������5�|w�,�g��r
�,�]q�:x�`f���C����{����wA����g��i��o=�������;�r�n�n���]������]�$�G������SO=U-Z����K������h�������K���Sc��u��m��!�s���}q,c�/��'�j��������=+A�}�Y�_��;w�O?���2}��u��u������;�r�wG�;\r�n5k���]������U0�B&���u0)))���o�y�f�s�9��������eU.T��U�
��V�Z����3����w�`g�u�&L��=+!6n����r��Eny��%�� cdv����]V|�On�-�w�-���U��B&���G��np�����nb��Qg��/�6.�"]z��n��&M�p�J���{�m��~��{W2�,���#�[�m�����]��Jm0z�h=�����g��q����68������9s���B�
JLL�;��xU�X���Pxl��5�\�����us�J���$�5J��W�:�]��w���;���n���{<r�w%�{Wj�-+3e�������|�r]~��*S��N;�4w�u���6{��s�Y�`A�`��lC��>}���}��:��33���p����V�F��ZF�w%�����`������6�]p�������+�	��<b��?���|��������xw��q���"##���/�S�M�<Y�&M��#�.B���g���^���s��{V2����Yq�J��%�����w��]p���d(	��R��F`���+�J6��R�JnB`������.B�m��A��=f������t����~��k������i���=+�d�:u������w�#--M���w�m��_��m�����^��;|h�����f�����c�&�$|�B&0��r����M�4c��B�����wo�'��p��������z�]��������H|������;����7�w��X�����]H%@�#B	BH�B� ��!�!$@!B	BH�B� ��!�!$@!��7o�z����+W�=���N}��z�����W/w�I'��G}T}��q�����Z���z���uT�*U�`�5o��]��K%%%���!55U���Z�~������
�w�yG��rK���^f��������S��(�H�����)"""���;V����N<���u^x�~����`��i�P��&M���n����S�Nf��h�"U�XQ�j�������;�U���Gy�X����o��'(.$J�������oV�=��o�i���z���u�E����K�.u��Z�8���/��)S�����_]����+t��g��W_u��?��}�������o��s�9����7��_���:�&M�h����[�������}��m����q��ge�z��'����������-[��+�����u��W����{]��.��2����_����]����:�/���[r����������Kb�8��5J��������a���k��������g�yF������s:n���+���m���|��qz�����{V��:��
6�����
h����������Z������+ns�����g� ���?����YT�V���PH�J���p�B7�|��������'�t��g��_�_���3g�-��>��c���ZP������w�U����|�rm��E7�p��fl������_�]l,�����k��{�u������W�f�����������/�t��L�8��{N�n��i�&�.qqqn�o��g�}�&���n�q�����[�k�<x���)���[�!K`�����v����>����nb��i��{��%z����_,�����??��%\9�?�����o����s����]�-��k�.�^Z]- �eKF����]��#��%v�����-'���b�������]��n���<�#�
	�R�|���������ukm��!O�Yz�#o,����Z����[�f�����k�E��[o��;���?�S�Nn@g,��@��o;.===����~��
D-�7vn;611Q�+W����r��kW��u�]���q��'����r��e����s������cGU�T�-[���	t�UW�����,(���o���/�Z�s�nk��:������D*����a���&/�;��ge	��b`��2>���~;s�}v��\z���gl��X����?+������e���:�0�(�28�(���U7&&��l�������g���d��q��b����Z�|���}�vw��]����8;g������c���7j�H��~�['Z-�Y��\f����X�k����V��$�pX�����t�5�����{���n`����d����X�k^�wNu��~�����,�|��7��3��A�}?����vM�K����!����@�����d��wo��c}��<��o��8���b{�������:�k
���c]],�`��������'n�d�7������h���N�<�
,��-�����Y��5�<���^vN��d�cr�nK`���6�P~�s=��?����m��u��{a��1c��)�}~���O�=�����\�����~X3l��5���!r�_������<�^���sEdj�w�����Z�m�
`��w�/�
"}���r=��W�A��?��yz��g��lg���X�iA�M�i,��pk����u}���=�����X+���m�iF�����!C���������v���v�;��~	�n26�6�
����Kt,��4��geu�y�g�Z�-��{��>kdmI���6m�����p��-��d]�
�h��0k��@��W�������X�g}��M~Xp��U+�����^k����X����n5Zo3���_~���/
��$/A����g���6�d�s���fp�����w�Q�[?u�� #���Xa��f�A�6�����66���;�p���t����i���uYv���\-�����H�8�����ge�
X��-�{b���X0nc7l`�M�j������-����];w0t��k�z��Bp�H�J��=� �#��<��{��'s?�Q'�M�U����f�16s��,�����u���)����H��$K&����4���[{��]�rR�F
�?>�m9]��m2>s�����(?�y^?�#�6�6�%�\���y���#?+k�76��}Fv����9m��,c��SO=��a^�}f6���2d��@~����1��}�u������"�$)/	7@��V���Y�)��b3���P,{�dY���g ��!�!$@!B	BH�B� ��!�!$@!B	BH�B� ��!�!$@!B	BH�B� ��!�!$@!B	BH�B����+%%Ee���~����
������C����j������;V\p��U���~[�lQ��u�w�^U�P�0�["�g<e�=���Z�b�*W�����_��{�q]����
	
4����	��������0����>�H����n���c�����)��r��4��g�{�n���C#G����_�����k�������t�R��(�/!��{��w�V�={�������/����}���z���������
�����P��w����������/���:w���o��in���]��GyD�]w��{�_�>�KZ�h��_]c����M���C����n�����U�V�M�6z������_��|���k��Z�n���m�'�|�my^�`���dw�W\q��B���K�-����?���{k���������3���k��o����;��ccc� ���N����_���~����gO�����=.	J�g|��!�;d��i���Z�j�e���!�@�-^��
4�m���O>Y��uS�^��`��?���^�vO8����V��s�����s�������~�O>��M�%����7u��Wj���n��I'�����g�����:x���n���'�]$�6m�I�&��@N��`����*99YIII���_�o�>�x��:��3����u;w�t������-[����{/r��?��c�������Y3w_���~��{/�{�=]~��np�a��yZ�p�{/Z�n����$(����7����%'��<������`$@������9���m�����w�}���[�N7��<y����7h�T����%��Y�F���s�k���K.��}�����
a��i�����D��rK�������+Vt���k�*!!!���)�������~D������~���.%��Z�����6c�����gx�5���uM����Z�je��~_|�8g��g���[t��5W�m����S)�T�U%�3��q��=�_*���~$����Wwo��[o�A�u�1�M���+����v�y�����;/����8�w���^���e��Z�����d$������:_0��7�u������&?���}���}�5k����j��~���4%�3�5k��|2���	���.���w���s�54~�x�r���n_}���\b}������\��Q��`�4o���c}$�(�r;_Id��4l��]���>��R�s���k�&���k��(���q���p�	��`UR?�����cp>������A������}�Q�F���w�%����+�h���nki�F����8Z�d}�mk���lp��c�@��{������{���~w@��w�����W����+�����=��}v6�>kGa����s���kI���u�2�-����q��~��}�����.w0�qk����QHJ�gl�l�����	��D `���
��N,iA�u?�����
���[�����}��[�_|�\����X`o���u��0a�;c�=���&��i&�E~�v���>k������Y[�uV�������%}��q�y��'�������<��o����
x-mS��������dc���>��?A�t�[@P��v�/��`%+���O?��������>��)��S�N�����Z��_
�.��������1�WYK�w���.������.]��}mZ���9�w�#��$%�3>r��A�A^��3P��!�J��]�P4���T$@!B	B�[�T���IEND�B`�
sata-prefetch.png (image/png)
sata-prefetch-log.png (image/png)
#24Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#23)
Re: WIP: WAL prefetch (another approach)

On Fri, Jun 05, 2020 at 05:20:52PM +0200, Tomas Vondra wrote:

...

which is not particularly great, I guess. There however seems to be
something wrong, because with the prefetching I see this in the log:

prefetch:
2020-06-05 02:47:25.970 CEST 1591318045.970 [22961] LOG: recovery no
longer prefetching: unexpected pageaddr 108/E8000000 in log segment
0000000100000108000000FF, offset 0

prefetch2:
2020-06-05 15:29:23.895 CEST 1591363763.895 [26676] LOG: recovery no
longer prefetching: unexpected pageaddr 108/E8000000 in log segment
000000010000010900000001, offset 0

Which seems pretty suspicious, but I have no idea what's wrong. I admit
the archive/restore commands are a bit hacky, but I've only seen this
with prefetching on the SATA storage, while all other cases seem to be
just fine. I haven't seen it on NVMe (which processes much more WAL).
And the SATA baseline (no prefetching) also worked fine.

Moreover, the pageaddr value is the same in both cases, but the WAL
segments are different (but just one segment apart). Seems strange.

I suspected it might be due to a somewhat hackish restore_command that
prefetches some of the WAL segments, so I tried again with a much
simpler restore_command - essentially just:

restore_command = 'cp /archive/%f %p.tmp && mv %p.tmp %p'

which I think should be fine for testing purposes. And I got this:

LOG: recovery no longer prefetching: unexpected pageaddr 108/57000000
in log segment 0000000100000108000000FF, offset 0
LOG: restored log file "0000000100000108000000FF" from archive

which is the same segment as in the earlier examples, but with a
different pageaddr value. Of course, there's no such pageaddr in the WAL
segment (and recovery of that segment succeeds).

So I think there's something broken ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#25Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#24)
Re: WIP: WAL prefetch (another approach)

On Fri, Jun 05, 2020 at 10:04:14PM +0200, Tomas Vondra wrote:

On Fri, Jun 05, 2020 at 05:20:52PM +0200, Tomas Vondra wrote:

...

which is not particularly great, I guess. There however seems to be
something wrong, because with the prefetching I see this in the log:

prefetch:
2020-06-05 02:47:25.970 CEST 1591318045.970 [22961] LOG: recovery no
longer prefetching: unexpected pageaddr 108/E8000000 in log segment
0000000100000108000000FF, offset 0

prefetch2:
2020-06-05 15:29:23.895 CEST 1591363763.895 [26676] LOG: recovery no
longer prefetching: unexpected pageaddr 108/E8000000 in log segment
000000010000010900000001, offset 0

Which seems pretty suspicious, but I have no idea what's wrong. I admit
the archive/restore commands are a bit hacky, but I've only seen this
with prefetching on the SATA storage, while all other cases seem to be
just fine. I haven't seen it on NVMe (which processes much more WAL).
And the SATA baseline (no prefetching) also worked fine.

Moreover, the pageaddr value is the same in both cases, but the WAL
segments are different (but just one segment apart). Seems strange.

I suspected it might be due to a somewhat hackish restore_command that
prefetches some of the WAL segments, so I tried again with a much
simpler restore_command - essentially just:

restore_command = 'cp /archive/%f %p.tmp && mv %p.tmp %p'

which I think should be fine for testing purposes. And I got this:

LOG: recovery no longer prefetching: unexpected pageaddr 108/57000000
in log segment 0000000100000108000000FF, offset 0
LOG: restored log file "0000000100000108000000FF" from archive

which is the same segment as in the earlier examples, but with a
different pageaddr value. Of course, there's no such pageaddr in the WAL
segment (and recovery of that segment succeeds).

So I think there's something broken ...

BTW in all three cases it happens right after the first restart point in
the WAL stream:

LOG: restored log file "0000000100000108000000FD" from archive
LOG: restartpoint starting: time
LOG: restored log file "0000000100000108000000FE" from archive
LOG: restartpoint complete: wrote 236092 buffers (22.5%); 0 WAL ...
LOG: recovery restart point at 108/FC000028
DETAIL: Last completed transaction was at log time 2020-06-04
15:27:00.95139+02.
LOG: recovery no longer prefetching: unexpected pageaddr
108/57000000 in log segment 0000000100000108000000FF, offset 0
LOG: restored log file "0000000100000108000000FF" from archive

It looks exactly like this in all 3 failures ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#26Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#25)
Re: WIP: WAL prefetch (another approach)

On Sat, Jun 6, 2020 at 8:41 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

BTW in all three cases it happens right after the first restart point in
the WAL stream:

LOG: restored log file "0000000100000108000000FD" from archive
LOG: restartpoint starting: time
LOG: restored log file "0000000100000108000000FE" from archive
LOG: restartpoint complete: wrote 236092 buffers (22.5%); 0 WAL ...
LOG: recovery restart point at 108/FC000028
DETAIL: Last completed transaction was at log time 2020-06-04
15:27:00.95139+02.
LOG: recovery no longer prefetching: unexpected pageaddr
108/57000000 in log segment 0000000100000108000000FF, offset 0
LOG: restored log file "0000000100000108000000FF" from archive

It looks exactly like this in all 3 failures ...

Huh. Thanks! I'll try to reproduce this here.

#27Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Thomas Munro (#21)
Re: WIP: WAL prefetch (another approach)

Hi,

I wonder if we can collect some stats to measure how effective the
prefetching actually is. Ultimately we want something like cache hit
ratio, but we're only preloading into page cache, so we can't easily
measure that. Perhaps we could measure I/O timings in redo, though?
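
E.g. something as simple as wrapping the buffer reads in the redo path might
be enough as a first cut (just a sketch -- rnode/forknum/blkno come from the
record being replayed, and the accumulator at the end is made up):

instr_time  io_start, io_time;

INSTR_TIME_SET_CURRENT(io_start);
buffer = XLogReadBufferExtended(rnode, forknum, blkno, RBM_NORMAL);
INSTR_TIME_SET_CURRENT(io_time);
INSTR_TIME_SUBTRACT(io_time, io_start);
redo_read_time_us += INSTR_TIME_GET_MICROSEC(io_time);  /* hypothetical counter */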

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#28Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#27)
Re: WIP: WAL prefetch (another approach)

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

I wonder if we can collect some stats to measure how effective the
prefetching actually is. Ultimately we want something like cache hit
ratio, but we're only preloading into page cache, so we can't easily
measure that. Perhaps we could measure I/O timings in redo, though?

That would certainly be interesting, particularly as this optimization
seems likely to be useful on some platforms (eg, zfs, where the
filesystem block size is larger than ours..) and less on others
(traditional systems which have a smaller block size).

Thanks,

Stephen

#29Thomas Munro
thomas.munro@gmail.com
In reply to: Stephen Frost (#28)
Re: WIP: WAL prefetch (another approach)

On Sat, Jun 6, 2020 at 12:36 PM Stephen Frost <sfrost@snowman.net> wrote:

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

I wonder if we can collect some stats to measure how effective the
prefetching actually is. Ultimately we want something like cache hit
ratio, but we're only preloading into page cache, so we can't easily
measure that. Perhaps we could measure I/O timings in redo, though?

That would certainly be interesting, particularly as this optimization
seems likely to be useful on some platforms (eg, zfs, where the
filesystem block size is larger than ours..) and less on others
(traditional systems which have a smaller block size).

I know one way to get information about cache hit ratios without the
page cache fuzz factor: if you combine this patch with Andres's
still-in-development AIO prototype and tell it to use direct IO, you
get the undiluted truth about hits and misses by looking at the
"prefetch" and "skip_hit" columns of the stats view. I'm hoping to
have a bit more to say about how this patch works as a client of that
new magic soon, but I also don't want to make this dependent on that
(it's mostly orthogonal, apart from the "how deep is the queue" part
which will improve with better information).
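
Just to illustrate, with direct IO the ratio then falls straight out of those
two columns (the view name below is only a placeholder for whatever the stats
view ends up being called; I'm reading "skip_hit" as hits and "prefetch" as
misses):

SELECT skip_hit,
       prefetch,
       round(skip_hit::numeric / nullif(skip_hit + prefetch, 0), 3) AS hit_ratio
FROM pg_stat_prefetch_recovery;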

FYI I am still trying to reproduce and understand the problem Tomas
reported; more soon.

#30Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Thomas Munro (#29)
Re: WIP: WAL prefetch (another approach)

On Thu, Jul 02, 2020 at 03:09:29PM +1200, Thomas Munro wrote:

On Sat, Jun 6, 2020 at 12:36 PM Stephen Frost <sfrost@snowman.net> wrote:

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

I wonder if we can collect some stats to measure how effective the
prefetching actually is. Ultimately we want something like cache hit
ratio, but we're only preloading into page cache, so we can't easily
measure that. Perhaps we could measure I/O timings in redo, though?

That would certainly be interesting, particularly as this optimization
seems likely to be useful on some platforms (eg, zfs, where the
filesystem block size is larger than ours..) and less on others
(traditional systems which have a smaller block size).

I know one way to get information about cache hit ratios without the
page cache fuzz factor: if you combine this patch with Andres's
still-in-development AIO prototype and tell it to use direct IO, you
get the undiluted truth about hits and misses by looking at the
"prefetch" and "skip_hit" columns of the stats view. I'm hoping to
have a bit more to say about how this patch works as a client of that
new magic soon, but I also don't want to make this dependent on that
(it's mostly orthogonal, apart from the "how deep is the queue" part
which will improve with better information).

FYI I am still trying to reproduce and understand the problem Tomas
reported; more soon.

Any luck trying to reproduce things? Should I try again and collect some
additional debug info?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#31Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#30)
Re: WIP: WAL prefetch (another approach)

On Tue, Aug 4, 2020 at 3:47 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Thu, Jul 02, 2020 at 03:09:29PM +1200, Thomas Munro wrote:

FYI I am still trying to reproduce and understand the problem Tomas
reported; more soon.

Any luck trying to reproduce things? Should I try again and collect some
additional debug info?

No luck. I'm working on it now, and also trying to reduce the
overheads so that we're not doing extra work when it doesn't help.

By the way, I also looked into recovery I/O stalls *other* than
relation buffer cache misses, and created
https://commitfest.postgresql.org/29/2669/ to fix what I found. If
you avoid both kinds of stalls then crash recovery is finally CPU
bound (to go faster after that we'll need parallel replay).

#32Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Thomas Munro (#31)
Re: WIP: WAL prefetch (another approach)

On Thu, Aug 06, 2020 at 02:58:44PM +1200, Thomas Munro wrote:

On Tue, Aug 4, 2020 at 3:47 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Thu, Jul 02, 2020 at 03:09:29PM +1200, Thomas Munro wrote:

FYI I am still trying to reproduce and understand the problem Tomas
reported; more soon.

Any luck trying to reproduce things? Should I try again and collect some
additional debug info?

No luck. I'm working on it now, and also trying to reduce the
overheads so that we're not doing extra work when it doesn't help.

OK, I'll see if I can still reproduce it.

By the way, I also looked into recovery I/O stalls *other* than
relation buffer cache misses, and created
https://commitfest.postgresql.org/29/2669/ to fix what I found. If
you avoid both kinds of stalls then crash recovery is finally CPU
bound (to go faster after that we'll need parallel replay).

Yeah, I noticed. I'll take a look and do some testing in the next CF.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#33Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#32)
3 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Thu, Aug 6, 2020 at 10:47 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Thu, Aug 06, 2020 at 02:58:44PM +1200, Thomas Munro wrote:

On Tue, Aug 4, 2020 at 3:47 AM Tomas Vondra

Any luck trying to reproduce things? Should I try again and collect some
additional debug info?

No luck. I'm working on it now, and also trying to reduce the
overheads so that we're not doing extra work when it doesn't help.

OK, I'll see if I can still reproduce it.

Since someone else asked me off-list, here's a rebase, with no
functional changes. Soon I'll post a new improved version, but this
version just fixes the bitrot and hopefully turns cfbot green.

Attachments:

v10-0001-Add-pg_atomic_unlocked_add_fetch_XXX.patch (text/x-patch)
From 630b329de06705c09ab2372f2fb8f102d1f1f701 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 28 Mar 2020 11:42:59 +1300
Subject: [PATCH v10 1/3] Add pg_atomic_unlocked_add_fetch_XXX().

Add a variant of pg_atomic_add_fetch_XXX with no barrier semantics, for
cases where you only want to avoid the possibility that a concurrent
pg_atomic_read_XXX() sees a torn/partial value.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/include/port/atomics.h         | 24 ++++++++++++++++++++++
 src/include/port/atomics/generic.h | 33 ++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 4956ec55cb..2abb852893 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -389,6 +389,21 @@ pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
 	return pg_atomic_add_fetch_u32_impl(ptr, add_);
 }
 
+/*
+ * pg_atomic_unlocked_add_fetch_u32 - atomically add to variable
+ *
+ * Like pg_atomic_unlocked_write_u32, guarantees only that partial values
+ * cannot be observed.
+ *
+ * No barrier semantics.
+ */
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	AssertPointerAlignment(ptr, 4);
+	return pg_atomic_unlocked_add_fetch_u32_impl(ptr, add_);
+}
+
 /*
  * pg_atomic_sub_fetch_u32 - atomically subtract from variable
  *
@@ -519,6 +534,15 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
 }
 
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+	return pg_atomic_unlocked_add_fetch_u64_impl(ptr, add_);
+}
+
 #undef INSIDE_ATOMICS_H
 
 #endif							/* ATOMICS_H */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index d60a0d9e7f..3e1598d8ff 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -234,6 +234,16 @@ pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 }
 #endif
 
+#if !defined(PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32)
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	ptr->value += add_;
+	return ptr->value;
+}
+#endif
+
 #if !defined(PG_HAVE_ATOMIC_SUB_FETCH_U32) && defined(PG_HAVE_ATOMIC_FETCH_SUB_U32)
 #define PG_HAVE_ATOMIC_SUB_FETCH_U32
 static inline uint32
@@ -399,3 +409,26 @@ pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
 }
 #endif
+
+#if defined(PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY) && \
+	!defined(PG_HAVE_ATOMIC_U64_SIMULATION)
+
+#ifndef PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	ptr->value += val;
+	return ptr->value;
+}
+#endif
+
+#else
+
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	return pg_atomic_add_fetch_u64_impl(ptr, val);
+}
+
+#endif
-- 
2.20.1
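
(Illustrative only, not part of the patch: the intended pattern is a single
updating process using the unlocked add, with other backends reading the
value concurrently via pg_atomic_read_u64() and never seeing a torn value.
The stats struct and field below are made up.)

/* updater side, e.g. in the startup process */
pg_atomic_unlocked_add_fetch_u64(&PrefetchStats->blocks_prefetched, 1);

/* reader side, any backend */
uint64		n = pg_atomic_read_u64(&PrefetchStats->blocks_prefetched);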

v10-0002-Allow-XLogReadRecord-to-be-non-blocking.patch (text/x-patch)
From cdfaedd530b5b60171952dc21ad3e2a5a3e6451b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 7 Apr 2020 22:56:27 +1200
Subject: [PATCH v10 2/3] Allow XLogReadRecord() to be non-blocking.

Extend read_local_xlog_page() to support non-blocking modes:

1. Reading as far as the WAL receiver has written so far.
2. Reading all the way to the end, when the end LSN is unknown.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/access/transam/xlogreader.c |  37 ++++--
 src/backend/access/transam/xlogutils.c  | 149 +++++++++++++++++-------
 src/backend/replication/walsender.c     |   2 +-
 src/include/access/xlogreader.h         |  14 ++-
 src/include/access/xlogutils.h          |  26 +++++
 5 files changed, 173 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 67996018da..aad9fc2ce1 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -261,6 +261,9 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  * If the reading fails for some other reason, NULL is also returned, and
  * *errormsg is set to a string with details of the failure.
  *
+ * If the read_page callback is one that returns XLOGPAGEREAD_WOULDBLOCK rather
+ * than waiting for WAL to arrive, NULL is also returned in that case.
+ *
  * The returned pointer (or *errormsg) points to an internal buffer that's
  * valid until the next call to XLogReadRecord.
  */
@@ -550,10 +553,11 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 err:
 
 	/*
-	 * Invalidate the read state. We might read from a different source after
-	 * failure.
+	 * Invalidate the read state, if this was an error. We might read from a
+	 * different source after failure.
 	 */
-	XLogReaderInvalReadState(state);
+	if (readOff != XLOGPAGEREAD_WOULDBLOCK)
+		XLogReaderInvalReadState(state);
 
 	if (state->errormsg_buf[0] != '\0')
 		*errormsg = state->errormsg_buf;
@@ -565,8 +569,9 @@ err:
  * Read a single xlog page including at least [pageptr, reqLen] of valid data
  * via the page_read() callback.
  *
- * Returns -1 if the required page cannot be read for some reason; errormsg_buf
- * is set in that case (unless the error occurs in the page_read callback).
+ * Returns XLOGPAGEREAD_ERROR or XLOGPAGEREAD_WOULDBLOCK if the required page
+ * cannot be read for some reason; errormsg_buf is set in the former case
+ * (unless the error occurs in the page_read callback).
  *
  * We fetch the page from a reader-local cache if we know we have the required
  * data and if there hasn't been any error since caching the data.
@@ -663,8 +668,11 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
+	if (readLen == XLOGPAGEREAD_WOULDBLOCK)
+		return XLOGPAGEREAD_WOULDBLOCK;
+
 	XLogReaderInvalReadState(state);
-	return -1;
+	return XLOGPAGEREAD_ERROR;
 }
 
 /*
@@ -943,6 +951,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	XLogRecPtr	found = InvalidXLogRecPtr;
 	XLogPageHeader header;
 	char	   *errormsg;
+	int			readLen;
 
 	Assert(!XLogRecPtrIsInvalid(RecPtr));
 
@@ -956,7 +965,6 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		XLogRecPtr	targetPagePtr;
 		int			targetRecOff;
 		uint32		pageHeaderSize;
-		int			readLen;
 
 		/*
 		 * Compute targetRecOff. It should typically be equal or greater than
@@ -1037,7 +1045,8 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	}
 
 err:
-	XLogReaderInvalReadState(state);
+	if (readLen != XLOGPAGEREAD_WOULDBLOCK)
+		XLogReaderInvalReadState(state);
 
 	return InvalidXLogRecPtr;
 }
@@ -1096,8 +1105,16 @@ WALRead(XLogReaderState *state,
 			XLByteToSeg(recptr, nextSegNo, state->segcxt.ws_segsize);
 			state->routine.segment_open(state, nextSegNo, &tli);
 
-			/* This shouldn't happen -- indicates a bug in segment_open */
-			Assert(state->seg.ws_file >= 0);
+			/* callback reported that there was no such file */
+			if (state->seg.ws_file < 0)
+			{
+				errinfo->wre_errno = errno;
+				errinfo->wre_req = 0;
+				errinfo->wre_read = 0;
+				errinfo->wre_off = startoff;
+				errinfo->wre_seg = state->seg;
+				return false;
+			}
 
 			/* Update the current segment info. */
 			state->seg.ws_tli = tli;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b2ca0cd4cf..3bc647eff1 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
@@ -808,6 +809,29 @@ wal_segment_open(XLogReaderState *state, XLogSegNo nextSegNo,
 						path)));
 }
 
+/*
+ * XLogReaderRoutine->segment_open callback that reports missing files rather
+ * than raising an error.
+ */
+void
+wal_segment_try_open(XLogReaderState *state, XLogSegNo nextSegNo,
+					 TimeLineID *tli_p)
+{
+	TimeLineID	tli = *tli_p;
+	char		path[MAXPGPATH];
+
+	XLogFilePath(path, tli, nextSegNo, state->segcxt.ws_segsize);
+	state->seg.ws_file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+	if (state->seg.ws_file >= 0)
+		return;
+
+	if (errno != ENOENT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m",
+						path)));
+}
+
 /* stock XLogReaderRoutine->segment_close callback */
 void
 wal_segment_close(XLogReaderState *state)
@@ -823,6 +847,10 @@ wal_segment_close(XLogReaderState *state)
  * Public because it would likely be very helpful for someone writing another
  * output method outside walsender, e.g. in a bgworker.
  *
+ * A pointer to an XLogReadLocalOptions struct may be passed in as
+ * XLogReaderRoutine->page_read_private to control the behavior of this
+ * function.
+ *
  * TODO: The walsender has its own version of this, but it relies on the
  * walsender's latch being set whenever WAL is flushed. No such infrastructure
  * exists for normal backends, so we have to do a check/sleep/repeat style of
@@ -837,58 +865,89 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	TimeLineID	tli;
 	int			count;
 	WALReadError errinfo;
+	XLogReadLocalOptions *options =
+		(XLogReadLocalOptions *) state->routine.page_read_private;
 
 	loc = targetPagePtr + reqLen;
 
 	/* Loop waiting for xlog to be available if necessary */
 	while (1)
 	{
-		/*
-		 * Determine the limit of xlog we can currently read to, and what the
-		 * most recent timeline is.
-		 *
-		 * RecoveryInProgress() will update ThisTimeLineID when it first
-		 * notices recovery finishes, so we only have to maintain it for the
-		 * local process until recovery ends.
-		 */
-		if (!RecoveryInProgress())
-			read_upto = GetFlushRecPtr();
-		else
-			read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
-		tli = ThisTimeLineID;
+		switch (options ? options->read_upto_policy : -1)
+		{
+		case XLRO_WALRCV_WRITTEN:
+			/*
+			 * We'll try to read as far as has been written by the WAL
+			 * receiver, on the requested timeline.  When we run out of valid
+			 * data, we'll return an error.  This is used by xlogprefetch.c
+			 * while streaming.
+			 */
+			read_upto = GetWalRcvWriteRecPtr();
+			state->currTLI = tli = options->tli;
+			break;
 
-		/*
-		 * Check which timeline to get the record from.
-		 *
-		 * We have to do it each time through the loop because if we're in
-		 * recovery as a cascading standby, the current timeline might've
-		 * become historical. We can't rely on RecoveryInProgress() because in
-		 * a standby configuration like
-		 *
-		 * A => B => C
-		 *
-		 * if we're a logical decoding session on C, and B gets promoted, our
-		 * timeline will change while we remain in recovery.
-		 *
-		 * We can't just keep reading from the old timeline as the last WAL
-		 * archive in the timeline will get renamed to .partial by
-		 * StartupXLOG().
-		 *
-		 * If that happens after our caller updated ThisTimeLineID but before
-		 * we actually read the xlog page, we might still try to read from the
-		 * old (now renamed) segment and fail. There's not much we can do
-		 * about this, but it can only happen when we're a leaf of a cascading
-		 * standby whose primary gets promoted while we're decoding, so a
-		 * one-off ERROR isn't too bad.
-		 */
-		XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+		case XLRO_END:
+			/*
+			 * We'll try to read as far as we can on one timeline.  This is
+			 * used by xlogprefetch.c for crash recovery.
+			 */
+			read_upto = (XLogRecPtr) -1;
+			state->currTLI = tli = options->tli;
+			break;
+
+		default:
+			/*
+			 * Determine the limit of xlog we can currently read to, and what the
+			 * most recent timeline is.
+			 *
+			 * RecoveryInProgress() will update ThisTimeLineID when it first
+			 * notices recovery finishes, so we only have to maintain it for
+			 * the local process until recovery ends.
+			 */
+			if (!RecoveryInProgress())
+				read_upto = GetFlushRecPtr();
+			else
+				read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
+			tli = ThisTimeLineID;
+
+			/*
+			 * Check which timeline to get the record from.
+			 *
+			 * We have to do it each time through the loop because if we're in
+			 * recovery as a cascading standby, the current timeline might've
+			 * become historical. We can't rely on RecoveryInProgress()
+			 * because in a standby configuration like
+			 *
+			 * A => B => C
+			 *
+			 * if we're a logical decoding session on C, and B gets promoted,
+			 * our timeline will change while we remain in recovery.
+			 *
+			 * We can't just keep reading from the old timeline as the last
+			 * WAL archive in the timeline will get renamed to .partial by
+			 * StartupXLOG().
+			 *
+			 * If that happens after our caller updated ThisTimeLineID but
+			 * before we actually read the xlog page, we might still try to
+			 * read from the old (now renamed) segment and fail. There's not
+			 * much we can do about this, but it can only happen when we're a
+			 * leaf of a cascading standby whose primary gets promoted while
+			 * we're decoding, so a one-off ERROR isn't too bad.
+			 */
+			XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+			break;
+		}
 
-		if (state->currTLI == ThisTimeLineID)
+		if (state->currTLI == tli)
 		{
 
 			if (loc <= read_upto)
 				break;
 
+			/* not enough data there, but we were asked not to wait */
+			if (options && options->nowait)
+				return XLOGPAGEREAD_WOULDBLOCK;
+
 			CHECK_FOR_INTERRUPTS();
 			pg_usleep(1000L);
 		}
@@ -930,7 +989,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	else if (targetPagePtr + reqLen > read_upto)
 	{
 		/* not enough data there */
-		return -1;
+		return XLOGPAGEREAD_ERROR;
 	}
 	else
 	{
@@ -945,7 +1004,17 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 */
 	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
 				 &errinfo))
+	{
+		/*
+		 * When not following timeline changes, we may read past the end of
+		 * available segments.  Report a missing segment file by returning an
+		 * error code to the caller, rather than raising an ERROR.
+		 */
+		if (errinfo.wre_errno == ENOENT)
+			return XLOGPAGEREAD_ERROR;
+
 		WALReadRaiseError(&errinfo);
+	}
 
 	/* number of valid bytes in the buffer */
 	return count;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 460ca3f947..e6a3b5073b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -830,7 +830,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 
 	/* fail if not (implies we are going to shut down) */
 	if (flushptr < targetPagePtr + reqLen)
-		return -1;
+		return XLOGPAGEREAD_ERROR;
 
 	if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
 		count = XLOG_BLCKSZ;	/* more than one block available */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b976882229..ede9b71b64 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -57,6 +57,10 @@ typedef struct WALSegmentContext
 
 typedef struct XLogReaderState XLogReaderState;
 
+/* Special negative return values for XLogPageReadCB functions */
+#define XLOGPAGEREAD_ERROR		-1
+#define XLOGPAGEREAD_WOULDBLOCK	-2
+
 /* Function type definitions for various xlogreader interactions */
 typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
@@ -76,10 +80,11 @@ typedef struct XLogReaderRoutine
 	 * This callback shall read at least reqLen valid bytes of the xlog page
 	 * starting at targetPagePtr, and store them in readBuf.  The callback
 	 * shall return the number of bytes read (never more than XLOG_BLCKSZ), or
-	 * -1 on failure.  The callback shall sleep, if necessary, to wait for the
-	 * requested bytes to become available.  The callback will not be invoked
-	 * again for the same page unless more than the returned number of bytes
-	 * are needed.
+	 * XLOGPAGEREAD_ERROR on failure.  The callback shall either sleep, if
+	 * necessary, to wait for the requested bytes to become available, or
+	 * return XLOGPAGEREAD_WOULDBLOCK.  The callback will not be invoked again
+	 * for the same page unless more than the returned number of bytes are
+	 * needed.
 	 *
 	 * targetRecPtr is the position of the WAL record we're reading.  Usually
 	 * it is equal to targetPagePtr + reqLen, but sometimes xlogreader needs
@@ -91,6 +96,7 @@ typedef struct XLogReaderRoutine
 	 * read from.
 	 */
 	XLogPageReadCB page_read;
+	void	   *page_read_private;
 
 	/*
 	 * Callback to open the specified WAL segment for reading.  ->seg.ws_file
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index e59b6cf3a9..6325c23dc2 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,12 +47,38 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
 
+/*
+ * A pointer to an XLogReadLocalOptions struct can be supplied as the private data
+ * for an XLogReader, causing read_local_xlog_page() to modify its behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+	/* Don't block waiting for new WAL to arrive. */
+	bool		nowait;
+
+	/*
+	 * For XLRO_WALRCV_WRITTEN and XLRO_END modes, the timeline ID must be
+	 * provided.
+	 */
+	TimeLineID	tli;
+
+	/* How far to read. */
+	enum {
+		XLRO_STANDARD,
+		XLRO_WALRCV_WRITTEN,
+		XLRO_END
+	} read_upto_policy;
+} XLogReadLocalOptions;
+
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
 								 XLogRecPtr targetRecPtr, char *cur_page);
 extern void wal_segment_open(XLogReaderState *state,
 							 XLogSegNo nextSegNo,
 							 TimeLineID *tli_p);
+extern void wal_segment_try_open(XLogReaderState *state,
+								 XLogSegNo nextSegNo,
+								 TimeLineID *tli_p);
 extern void wal_segment_close(XLogReaderState *state);
 
 extern void XLogReadDetermineTimeline(XLogReaderState *state,
-- 
2.20.1

v10-0003-Prefetch-referenced-blocks-during-recovery.patch (text/x-patch, charset US-ASCII)
From c4606290029e395b1c124931be1d097aef6ca3d2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Apr 2020 23:04:51 +1200
Subject: [PATCH v10 3/3] Prefetch referenced blocks during recovery.

Introduce a new GUC max_recovery_prefetch_distance.  If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks.  The
goal is to avoid I/O stalls and benefit from concurrent I/O.  The number
of concurrent asynchronous reads is capped by the existing
maintenance_io_concurrency GUC.  The feature is enabled by default for
now.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  45 +
 doc/src/sgml/monitoring.sgml                  |  85 +-
 doc/src/sgml/wal.sgml                         |  13 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             |  16 +
 src/backend/access/transam/xlogprefetch.c     | 910 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  14 +
 src/backend/postmaster/pgstat.c               |  96 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  47 +-
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/access/xlogprefetch.h             |  85 ++
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/pgstat.h                          |  27 +
 src/include/utils/guc.h                       |   4 +
 src/test/regress/expected/rules.out           |  11 +
 16 files changed, 1366 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetch.c
 create mode 100644 src/include/access/xlogprefetch.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7a7177c550..a3b6d5babd 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3255,6 +3255,51 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-recovery-prefetch-distance" xreflabel="max_recovery_prefetch_distance">
+      <term><varname>max_recovery_prefetch_distance</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_recovery_prefetch_distance</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        The maximum distance to look ahead in the WAL during recovery, to find
+        blocks to prefetch.  Prefetching blocks that will soon be needed can
+        reduce I/O wait times.  The number of concurrent prefetches is limited
+        by this setting as well as
+        <xref linkend="guc-maintenance-io-concurrency"/>.  Setting it too high
+        might be counterproductive, if it means that data falls out of the
+        kernel cache before it is needed.  If this value is specified without
+        units, it is taken as bytes.  A setting of -1 disables prefetching
+        during recovery.
+        The default is 256kB on systems that support
+        <function>posix_fadvise</function>, and -1 otherwise.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+      <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks that were logged with full page images
+        during recovery.  Often this doesn't help, since such blocks will not
+        be read the first time they are needed and might remain in the buffer
+        pool after that.  However, on file systems with a block size larger
+        than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a
+        costly read-before-write when a block is later written.  This
+        setting has no effect unless
+        <xref linkend="guc-max-recovery-prefetch-distance"/> is set to a positive
+        number.  The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 7dcddf478a..f59b78abab 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -323,6 +323,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2702,6 +2709,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    copy of the subscribed tables.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+    </row>
+    <row>
+     <entry><structfield>skip_seq</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+    </row>
+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_distance</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>How far ahead of recovery the prefetcher is on average, while recovery is not idle</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_queue_depth</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>Average number of prefetches in flight while recovery is not idle</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-max-recovery-prefetch-distance"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-max-recovery-prefetch-distance"/>,
+   <xref linkend="guc-recovery-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="1">
@@ -4632,8 +4711,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         argument.  The argument can be <literal>bgwriter</literal> to reset
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
-        view,or <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view.
+        view, <literal>archiver</literal> to reset all the counters shown in
+        the <structname>pg_stat_archiver</structname> view, and
+        <literal>prefetch_recovery</literal> to reset all the counters shown
+        in the <structname>pg_stat_prefetch_recovery</structname> view.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 7a13d8d502..06a700676a 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -720,6 +720,19 @@
    <acronym>WAL</acronym> call being logged to the server log. This
    option might be replaced by a more general mechanism in the future.
   </para>
+
+  <para>
+   The <xref linkend="guc-max-recovery-prefetch-distance"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed, in combination with the
+   <xref linkend="guc-maintenance-io-concurrency"/> parameter.  The
+   prefetching mechanism is most likely to be effective on systems
+   with <varname>full_page_writes</varname> set to
+   <literal>off</literal> (where that is safe), and where the working set is
+   larger than RAM.  By default, prefetching in recovery is enabled on systems
+   that support <function>posix_fadvise</function>; a distance of -1 disables it.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..39f9d4e77d 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetch.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09c01ed4ae..510f1f079a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetch.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -7169,6 +7170,7 @@ StartupXLOG(void)
 		{
 			ErrorContextCallback errcallback;
 			TimestampTz xtime;
+			XLogPrefetchState prefetch;
 
 			InRedo = true;
 
@@ -7176,6 +7178,9 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
 
+			/* Prepare to prefetch, if configured. */
+			XLogPrefetchBegin(&prefetch);
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7205,6 +7210,12 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/* Perform WAL prefetching, if enabled. */
+				XLogPrefetch(&prefetch,
+							 ThisTimeLineID,
+							 xlogreader->ReadRecPtr,
+							 currentSource == XLOG_FROM_STREAM);
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -7376,6 +7387,9 @@ StartupXLOG(void)
 					 */
 					if (AllowCascadeReplication())
 						WalSndWakeup();
+
+					/* Reset the prefetcher. */
+					XLogPrefetchReconfigure();
 				}
 
 				/* Exit loop if we reached inclusive recovery target */
@@ -7392,6 +7406,7 @@ StartupXLOG(void)
 			/*
 			 * end of main redo apply loop
 			 */
+			XLogPrefetchEnd(&prefetch);
 
 			if (reachedRecoveryTarget)
 			{
@@ -12150,6 +12165,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					currentSource = XLOG_FROM_STREAM;
 					startWalReceiver = true;
+					XLogPrefetchReconfigure();
 					break;
 
 				case XLOG_FROM_STREAM:
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
new file mode 100644
index 0000000000..6d8cff12c6
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -0,0 +1,910 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetch.c
+ *
+ * The goal of this module is to read future WAL records and issue
+ * PrefetchSharedBuffer() calls for referenced blocks, so that we avoid I/O
+ * stalls in the main recovery loop.  Currently, this is achieved by using a
+ * separate XLogReader to read ahead.  In future, we should find a way to
+ * avoid reading and decoding each record twice.
+ *
+ * When examining a WAL record from the future, we need to consider that a
+ * referenced block or segment file might not exist on disk until this record
+ * or some earlier record has been replayed.  After a crash, a file might also
+ * be missing because it was dropped by a later WAL record; in that case, it
+ * will be recreated when this record is replayed.  These cases are handled by
+ * recognizing them and adding a "filter" that prevents all prefetching of a
+ * certain block range until the present WAL record has been replayed.  Blocks
+ * skipped for these reasons are counted as "skip_new" (that is, cases where we
+ * didn't try to prefetch "new" blocks).
+ *
+ * Blocks found in the buffer pool already are counted as "skip_hit".
+ * Repeated access to the same buffer is detected and skipped, and this is
+ * counted with "skip_seq".  Blocks that were logged with FPWs are skipped if
+ * recovery_prefetch_fpw is off, since on most systems there will be no I/O
+ * stall; this is counted with "skip_fpw".
+ *
+ * The only way we currently have to know that an I/O initiated with
+ * PrefetchSharedBuffer() has completed is to call ReadBuffer().  Therefore,
+ * we track the number of potentially in-flight I/Os by using a circular
+ * buffer of LSNs.  When it's full, we have to wait for recovery to replay
+ * records so that the queue depth can be reduced, before we can do any more
+ * prefetching.  Ideally, this keeps us the right distance ahead to respect
+ * maintenance_io_concurrency.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetch.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log
+ * message that appears at the end of crash recovery.  It's also used to send
+ * messages periodically to the stats collector, to save the counters on disk.
+ */
+#define XLOGPREFETCHER_SAMPLE_DISTANCE 0x40000
+
+/* GUCs */
+int			max_recovery_prefetch_distance = -1;
+bool		recovery_prefetch_fpw = false;
+
+int			XLogPrefetchReconfigureCount;
+
+/*
+ * A prefetcher object.  There is at most one of these in existence at a time,
+ * recreated whenever there is a configuration change.
+ */
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	XLogReadLocalOptions options;
+	bool			have_record;
+	bool			shutdown;
+	int				next_block_id;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation	last_reln;
+	RelFileNode		last_rnode;
+	BlockNumber		last_blkno;
+
+	/* Online averages. */
+	uint64			samples;
+	double			avg_queue_depth;
+	double			avg_distance;
+	XLogRecPtr		next_sample_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB		   *filter_table;
+	dlist_head		filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	int				prefetch_head;
+	int				prefetch_tail;
+	int				prefetch_queue_size;
+	XLogRecPtr		prefetch_queue[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode		rnode;
+	XLogRecPtr		filter_until_replayed;
+	BlockNumber		filter_from_block;
+	dlist_node		link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time; /* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 skip_hit;	/* Blocks already buffered. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+	pg_atomic_uint64 skip_seq;	/* Repeat blocks skipped. */
+	float			avg_distance;
+	float			avg_queue_depth;
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32			reset_handled;
+
+	/* Dynamic values */
+	int				distance;	/* Number of bytes ahead in the WAL. */
+	int				queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+static void XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher,
+									  XLogRecPtr replaying_lsn);
+static bool XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher);
+static void XLogPrefetchSaveStats(void);
+static void XLogPrefetchRestoreStats(void);
+
+static XLogPrefetchStats *Stats;
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&Stats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&Stats->prefetch, 0);
+	pg_atomic_write_u64(&Stats->skip_hit, 0);
+	pg_atomic_write_u64(&Stats->skip_new, 0);
+	pg_atomic_write_u64(&Stats->skip_fpw, 0);
+	pg_atomic_write_u64(&Stats->skip_seq, 0);
+	Stats->avg_distance = 0;
+	Stats->avg_queue_depth = 0;
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	Stats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+	if (!found)
+	{
+		pg_atomic_init_u32(&Stats->reset_request, 0);
+		Stats->reset_handled = 0;
+		pg_atomic_init_u64(&Stats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&Stats->prefetch, 0);
+		pg_atomic_init_u64(&Stats->skip_hit, 0);
+		pg_atomic_init_u64(&Stats->skip_new, 0);
+		pg_atomic_init_u64(&Stats->skip_fpw, 0);
+		pg_atomic_init_u64(&Stats->skip_seq, 0);
+		Stats->avg_distance = 0;
+		Stats->avg_queue_depth = 0;
+		Stats->distance = 0;
+		Stats->queue_depth = 0;
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&Stats->reset_request, 1);
+}
+
+/*
+ * Tell the stats collector to serialize the shared memory counters into the
+ * stats file.
+ */
+static void
+XLogPrefetchSaveStats(void)
+{
+	PgStat_RecoveryPrefetchStats serialized = {
+		.prefetch = pg_atomic_read_u64(&Stats->prefetch),
+		.skip_hit = pg_atomic_read_u64(&Stats->skip_hit),
+		.skip_new = pg_atomic_read_u64(&Stats->skip_new),
+		.skip_fpw = pg_atomic_read_u64(&Stats->skip_fpw),
+		.skip_seq = pg_atomic_read_u64(&Stats->skip_seq),
+		.stat_reset_timestamp = pg_atomic_read_u64(&Stats->reset_time)
+	};
+
+	pgstat_send_recoveryprefetch(&serialized);
+}
+
+/*
+ * Try to restore the shared memory counters from the stats file.
+ */
+static void
+XLogPrefetchRestoreStats(void)
+{
+	PgStat_RecoveryPrefetchStats *serialized = pgstat_fetch_recoveryprefetch();
+
+	if (serialized->stat_reset_timestamp != 0)
+	{
+		pg_atomic_write_u64(&Stats->prefetch, serialized->prefetch);
+		pg_atomic_write_u64(&Stats->skip_hit, serialized->skip_hit);
+		pg_atomic_write_u64(&Stats->skip_new, serialized->skip_new);
+		pg_atomic_write_u64(&Stats->skip_fpw, serialized->skip_fpw);
+		pg_atomic_write_u64(&Stats->skip_seq, serialized->skip_seq);
+		pg_atomic_write_u64(&Stats->reset_time, serialized->stat_reset_timestamp);
+	}
+}
+
+/*
+ * Initialize an XLogPrefetchState object and restore the last saved
+ * statistics from disk.
+ */
+void
+XLogPrefetchBegin(XLogPrefetchState *state)
+{
+	XLogPrefetchRestoreStats();
+
+	/* We'll reconfigure on the first call to XLogPrefetch(). */
+	state->prefetcher = NULL;
+	state->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+}
+
+/*
+ * Shut down the prefetching infrastructure, if configured.
+ */
+void
+XLogPrefetchEnd(XLogPrefetchState *state)
+{
+	XLogPrefetchSaveStats();
+
+	if (state->prefetcher)
+		XLogPrefetcherFree(state->prefetcher);
+	state->prefetcher = NULL;
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(TimeLineID tli, XLogRecPtr lsn, bool streaming)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+	XLogReaderRoutine reader_routines = {
+		.page_read = read_local_xlog_page,
+		.segment_open = wal_segment_try_open,
+		.segment_close = wal_segment_close
+	};
+
+	/*
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.  We add one to the size
+	 * because our circular buffer has a gap between head and tail when full.
+	 */
+	prefetcher = palloc0(offsetof(XLogPrefetcher, prefetch_queue) +
+						 sizeof(XLogRecPtr) * (maintenance_io_concurrency + 1));
+	prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
+	prefetcher->options.tli = tli;
+	prefetcher->options.nowait = true;
+	if (streaming)
+	{
+		/*
+		 * We're only allowed to read as far as the WAL receiver has written.
+		 * We don't have to wait for it to be flushed, though, as recovery
+		 * does, so that gives us a chance to get a bit further ahead.
+		 */
+		prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+	}
+	else
+	{
+		/* Read as far as we can. */
+		prefetcher->options.read_upto_policy = XLRO_END;
+	}
+	reader_routines.page_read_private = &prefetcher->options;
+	prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+											NULL,
+											&reader_routines,
+											NULL);
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	/* Prepare to read at the given LSN. */
+	ereport(LOG,
+			(errmsg("recovery started prefetching on timeline %u at %X/%X",
+					tli,
+					(uint32) (lsn >> 32), (uint32) lsn)));
+	XLogBeginRead(prefetcher->reader, lsn);
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	/* Log final statistics. */
+	ereport(LOG,
+			(errmsg("recovery finished prefetching at %X/%X; "
+					"prefetch = " UINT64_FORMAT ", "
+					"skip_hit = " UINT64_FORMAT ", "
+					"skip_new = " UINT64_FORMAT ", "
+					"skip_fpw = " UINT64_FORMAT ", "
+					"skip_seq = " UINT64_FORMAT ", "
+					"avg_distance = %f, "
+					"avg_queue_depth = %f",
+			 (uint32) (prefetcher->reader->EndRecPtr >> 32),
+			 (uint32) (prefetcher->reader->EndRecPtr),
+			 pg_atomic_read_u64(&Stats->prefetch),
+			 pg_atomic_read_u64(&Stats->skip_hit),
+			 pg_atomic_read_u64(&Stats->skip_new),
+			 pg_atomic_read_u64(&Stats->skip_fpw),
+			 pg_atomic_read_u64(&Stats->skip_seq),
+			 Stats->avg_distance,
+			 Stats->avg_queue_depth)));
+	XLogReaderFree(prefetcher->reader);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Called when recovery is replaying a new LSN, to check if we can read ahead.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	uint32		reset_request;
+
+	/* If an error has occurred or we've hit the end of the WAL, do nothing. */
+	if (prefetcher->shutdown)
+		return;
+
+	/*
+	 * Have any in-flight prefetches definitely completed, judging by the LSN
+	 * that is currently being replayed?
+	 */
+	XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+	/*
+	 * Do we already have the maximum permitted number of I/Os running
+	 * (according to the information we have)?  If so, we have to wait for at
+	 * least one to complete, so give up early and let recovery catch up.
+	 */
+	if (XLogPrefetcherSaturated(prefetcher))
+		return;
+
+	/*
+	 * Can we drop any filters yet?  This happens when the LSN that is
+	 * currently being replayed has moved past a record that prevents
+	 * prefetching of a block range, such as relation extension.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&Stats->reset_request);
+	if (reset_request != Stats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		Stats->reset_handled = reset_request;
+		prefetcher->avg_distance = 0;
+		prefetcher->avg_queue_depth = 0;
+		prefetcher->samples = 0;
+	}
+
+	/* OK, we can now try reading ahead. */
+	XLogPrefetcherScanRecords(prefetcher, replaying_lsn);
+}
+
+/*
+ * Read ahead as far as we are allowed to, considering the LSN that recovery
+ * is currently replaying.
+ */
+static void
+XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	XLogReaderState *reader = prefetcher->reader;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	for (;;)
+	{
+		char *error;
+		int64		distance;
+
+		/* If we don't already have a record, then try to read one. */
+		if (!prefetcher->have_record)
+		{
+			if (!XLogReadRecord(reader, &error))
+			{
+				/* If we got an error, log it and give up. */
+				if (error)
+				{
+					ereport(LOG, (errmsg("recovery no longer prefetching: %s", error)));
+					prefetcher->shutdown = true;
+					Stats->queue_depth = 0;
+					Stats->distance = 0;
+				}
+				/* Otherwise, we'll try again later when more data is here. */
+				return;
+			}
+			prefetcher->have_record = true;
+			prefetcher->next_block_id = 0;
+		}
+
+		/* How far ahead of replay are we now? */
+		distance = prefetcher->reader->ReadRecPtr - replaying_lsn;
+
+		/* Update distance shown in shm. */
+		Stats->distance = distance;
+
+		/* Periodically recompute some statistics. */
+		if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+		{
+			/* Compute online averages. */
+			prefetcher->samples++;
+			if (prefetcher->samples == 1)
+			{
+				prefetcher->avg_distance = Stats->distance;
+				prefetcher->avg_queue_depth = Stats->queue_depth;
+			}
+			else
+			{
+				prefetcher->avg_distance +=
+					(Stats->distance - prefetcher->avg_distance) /
+					prefetcher->samples;
+				prefetcher->avg_queue_depth +=
+					(Stats->queue_depth - prefetcher->avg_queue_depth) /
+					prefetcher->samples;
+			}
+
+			/* Expose it in shared memory. */
+			Stats->avg_distance = prefetcher->avg_distance;
+			Stats->avg_queue_depth = prefetcher->avg_queue_depth;
+
+			/* Also periodically save the simple counters. */
+			XLogPrefetchSaveStats();
+
+			prefetcher->next_sample_lsn =
+				replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
+		}
+
+		/* Are we too far ahead of replay? */
+		if (distance >= max_recovery_prefetch_distance)
+			break;
+
+		/* Are we not far enough ahead? */
+		if (distance <= 0)
+		{
+			prefetcher->have_record = false;	/* skip this record */
+			continue;
+		}
+
+		/*
+		 * If this is a record that creates a new SMGR relation, we'll avoid
+		 * prefetching anything from that rnode until it has been replayed.
+		 */
+		if (replaying_lsn < reader->ReadRecPtr &&
+			XLogRecGetRmid(reader) == RM_SMGR_ID &&
+			(XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+		{
+			xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+			XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+									reader->ReadRecPtr);
+		}
+
+		/* Scan the record's block references. */
+		if (!XLogPrefetcherScanBlocks(prefetcher))
+			return;
+
+		/* Advance to the next record. */
+		prefetcher->have_record = false;
+	}
+}
+
+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+	XLogReaderState *reader = prefetcher->reader;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	/*
+	 * We might already have been partway through processing this record when
+	 * our queue became saturated, so we need to start where we left off.
+	 */
+	for (int block_id = prefetcher->next_block_id;
+		 block_id <= reader->max_block_id;
+		 ++block_id)
+	{
+		PrefetchBufferResult prefetch;
+		DecodedBkpBlock *block = &reader->blocks[block_id];
+		SMgrRelation reln;
+
+		/* Ignore everything but the main fork for now. */
+		if (block->forknum != MAIN_FORKNUM)
+			continue;
+
+		/*
+		 * If there is a full page image attached, we won't be reading the
+		 * page, so you might think we should skip it.  However, if the
+		 * underlying filesystem uses larger logical blocks than us, it
+		 * might still need to perform a read-before-write some time later.
+		 * Therefore, only prefetch if configured to do so.
+		 */
+		if (block->has_image && !recovery_prefetch_fpw)
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
+			continue;
+		}
+
+		/*
+		 * If this block will initialize a new page then it's probably an
+		 * extension.  Since it might create a new segment, we can't try
+		 * to prefetch this block until the record has been replayed, or we
+		 * might try to open a file that doesn't exist yet.
+		 */
+		if (block->flags & BKPBLOCK_WILL_INIT)
+		{
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+									reader->ReadRecPtr);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Should we skip this block due to a filter? */
+		if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Fast path for repeated references to the same relation. */
+		if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+		{
+			/*
+			 * If this is a repeat access to the same block, then skip it.
+			 *
+			 * XXX We could also check for last_blkno + 1 too, and also update
+			 * last_blkno; it's not clear if the kernel would do a better job
+			 * of sequential prefetching.
+			 */
+			if (block->blkno == prefetcher->last_blkno)
+			{
+				pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
+				continue;
+			}
+
+			/* We can avoid calling smgropen(). */
+			reln = prefetcher->last_reln;
+		}
+		else
+		{
+			/* Otherwise we have to open it. */
+			reln = smgropen(block->rnode, InvalidBackendId);
+			prefetcher->last_rnode = block->rnode;
+			prefetcher->last_reln = reln;
+		}
+		prefetcher->last_blkno = block->blkno;
+
+		/* Try to prefetch this block! */
+		prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+		if (BufferIsValid(prefetch.recent_buffer))
+		{
+			/*
+			 * It was already cached, so do nothing.  Perhaps in future we
+			 * could remember the buffer so that recovery doesn't have to look
+			 * it up again.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
+		}
+		else if (prefetch.initiated_io)
+		{
+			/*
+			 * An I/O has possibly been initiated (the block might already
+			 * have been cached by the kernel, but we have no way to know
+			 * that, so we assume an I/O was started).  Record this as an
+			 * I/O in progress until we eventually replay this LSN, at
+			 * which point it must have completed.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->prefetch, 1);
+			XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+			/*
+			 * If the queue is now full, we'll have to wait before processing
+			 * any more blocks from this record, or move to a new record if
+			 * that was the last block.
+			 */
+			if (XLogPrefetcherSaturated(prefetcher))
+			{
+				prefetcher->next_block_id = block_id + 1;
+				return false;
+			}
+		}
+		else
+		{
+			/*
+			 * Neither cached nor initiated.  The underlying segment file
+			 * doesn't exist.  Presumably it will be unlinked by a later WAL
+			 * record.  When recovery reads this block, it will use the
+			 * EXTENSION_CREATE_RECOVERY flag.  We certainly don't want to do
+			 * that sort of thing while merely prefetching, so let's just
+			 * ignore references to this relation until this record is
+			 * replayed, and let recovery create the dummy file or complain if
+			 * something is wrong.
+			 */
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+									reader->ReadRecPtr);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&Stats->reset_request) != Stats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&Stats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&Stats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_new));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_fpw));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_seq));
+	values[6] = Int32GetDatum(Stats->distance);
+	values[7] = Int32GetDatum(Stats->queue_depth);
+	values[8] = Float4GetDatum(Stats->avg_distance);
+	values[9] = Float4GetDatum(Stats->avg_queue_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the (presumably lower) block
+		 * number there because we don't want to have to track individual
+		 * blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of the
+	 * time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+												   HASH_FIND, NULL);
+
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Insert an LSN into the queue.  The queue must not be full already.  This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+						  XLogRecPtr prefetching_lsn)
+{
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+	prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+	prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+	Stats->queue_depth++;
+	Assert(Stats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet?  That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches.  For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+		   prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+	{
+		prefetcher->prefetch_tail++;
+		prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+		Stats->queue_depth--;
+		Assert(Stats->queue_depth >= 0);
+	}
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+	return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+		prefetcher->prefetch_tail;
+}
+
+void
+assign_max_recovery_prefetch_distance(int new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	max_recovery_prefetch_distance = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
+
+void
+assign_recovery_prefetch_fpw(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch_fpw = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8625cbeab6..d17036fdae 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -823,6 +823,20 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.skip_hit,
+            s.skip_new,
+            s.skip_fpw,
+            s.skip_seq,
+            s.distance,
+            s.queue_depth,
            s.avg_distance,
            s.avg_queue_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 73ce944fb1..3eda4ee590 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "access/xlogprefetch.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -282,6 +283,7 @@ static int	localNumBackends = 0;
 static PgStat_ArchiverStats archiverStats;
 static PgStat_GlobalStats globalStats;
 static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
+static PgStat_RecoveryPrefetchStats recoveryPrefetchStats;
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -354,6 +356,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
+static void pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1370,11 +1373,20 @@ pgstat_reset_shared_counters(const char *target)
 		msg.m_resettarget = RESET_ARCHIVER;
 	else if (strcmp(target, "bgwriter") == 0)
 		msg.m_resettarget = RESET_BGWRITER;
+	else if (strcmp(target, "prefetch_recovery") == 0)
+	{
+		/*
+		 * We can't ask the stats collector to do this for us as it is not
+		 * attached to shared memory.
+		 */
+		XLogPrefetchRequestResetStats();
+		return;
+	}
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("unrecognized reset target: \"%s\"", target),
-				 errhint("Target must be \"archiver\" or \"bgwriter\".")));
+				 errhint("Target must be \"archiver\", \"bgwriter\" or \"prefetch_recovery\".")));
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
 	pgstat_send(&msg, sizeof(msg));
@@ -2692,6 +2704,22 @@ pgstat_fetch_slru(void)
 }
 
 
+/*
+ * ---------
+ * pgstat_fetch_recoveryprefetch() -
+ *
+ *	Support function for restoring the counters managed by xlogprefetch.c.
+ * ---------
+ */
+PgStat_RecoveryPrefetchStats *
+pgstat_fetch_recoveryprefetch(void)
+{
+	backend_read_statsfile();
+
+	return &recoveryPrefetchStats;
+}
+
+
 /* ------------------------------------------------------------
  * Functions for management of the shared-memory PgBackendStatus array
  * ------------------------------------------------------------
@@ -4443,6 +4471,23 @@ pgstat_send_slru(void)
 }
 
 
+/* ----------
+ * pgstat_send_recoveryprefetch() -
+ *
+ *		Send recovery prefetch statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats)
+{
+	PgStat_MsgRecoveryPrefetch msg;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYPREFETCH);
+	msg.m_stats = *stats;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
 /* ----------
  * PgstatCollectorMain() -
  *
@@ -4647,6 +4692,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_slru(&msg.msg_slru, len);
 					break;
 
+				case PGSTAT_MTYPE_RECOVERYPREFETCH:
+					pgstat_recv_recoveryprefetch(&msg.msg_recoveryprefetch, len);
+					break;
+
 				case PGSTAT_MTYPE_FUNCSTAT:
 					pgstat_recv_funcstat(&msg.msg_funcstat, len);
 					break;
@@ -4918,6 +4967,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
 	rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
 	(void) rc;					/* we'll check for error with ferror */
 
+	/*
+	 * Write recovery prefetch stats struct
+	 */
+	rc = fwrite(&recoveryPrefetchStats, sizeof(recoveryPrefetchStats), 1,
+				fpout);
+	(void) rc;					/* we'll check for error with ferror */
+
 	/*
 	 * Walk through the database table.
 	 */
@@ -5177,6 +5233,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 	memset(&globalStats, 0, sizeof(globalStats));
 	memset(&archiverStats, 0, sizeof(archiverStats));
 	memset(&slruStats, 0, sizeof(slruStats));
+	memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
 
 	/*
 	 * Set the current timestamp (will be kept only in case we can't load an
@@ -5264,6 +5321,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 		goto done;
 	}
 
+	/*
+	 * Read recoveryPrefetchStats struct
+	 */
+	if (fread(&recoveryPrefetchStats, 1, sizeof(recoveryPrefetchStats),
+			  fpin) != sizeof(recoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
+		goto done;
+	}
+
 	/*
 	 * We found an existing collector stats file. Read it and put all the
 	 * hashtable entries into place.
@@ -5563,6 +5632,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 	PgStat_GlobalStats myGlobalStats;
 	PgStat_ArchiverStats myArchiverStats;
 	PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
+	PgStat_RecoveryPrefetchStats myRecoveryPrefetchStats;
 	FILE	   *fpin;
 	int32		format_id;
 	const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -5628,6 +5698,18 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 		return false;
 	}
 
+	/*
+	 * Read recovery prefetch stats struct
+	 */
+	if (fread(&myRecoveryPrefetchStats, 1, sizeof(myRecoveryPrefetchStats),
+			  fpin) != sizeof(myRecoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		FreeFile(fpin);
+		return false;
+	}
+
 	/* By default, we're going to return the timestamp of the global file. */
 	*ts = myGlobalStats.stats_timestamp;
 
@@ -6425,6 +6507,18 @@ pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
 	slruStats[msg->m_index].truncate += msg->m_truncate;
 }
 
+/* ----------
+ * pgstat_recv_recoveryprefetch() -
+ *
+ *	Process a recovery prefetch message.
+ * ----------
+ */
+static void
+pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len)
+{
+	recoveryPrefetchStats = msg->m_stats;
+}
+
 /* ----------
  * pgstat_recv_recoveryconflict() -
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..912a8cfcb6 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/xlogprefetch.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -126,6 +127,7 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, PredicateLockShmemSize());
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
+		size = add_size(size, XLogPrefetchShmemSize());
 		size = add_size(size, CLOGShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
@@ -216,6 +218,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de87ad6ef7..d80c9079a6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -37,6 +37,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetch.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/storage.h"
@@ -202,6 +203,7 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1248,6 +1250,18 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"recovery_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch blocks that have full page images in the WAL"),
+			gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+						 "entirely overwritten, but if the logical page size of the filesystem is "
+						 "larger than PostgreSQL's, this can be beneficial.  This option has no "
+						 "effect unless max_recovery_prefetch_distance is set to a positive number.")
+		},
+		&recovery_prefetch_fpw,
+		false,
+		NULL, assign_recovery_prefetch_fpw, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2636,6 +2650,22 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"max_recovery_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum number of bytes to read ahead in the WAL to prefetch referenced blocks."),
+			gettext_noop("Set to -1 to disable prefetching during recovery."),
+			GUC_UNIT_BYTE
+		},
+		&max_recovery_prefetch_distance,
+#ifdef USE_PREFETCH
+		256 * 1024,
+#else
+		-1,
+#endif
+		-1, INT_MAX,
+		NULL, assign_max_recovery_prefetch_distance, NULL
+	},
+
 	{
 		{"wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the size of WAL files held for standby servers."),
@@ -2956,7 +2986,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -11608,6 +11639,20 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..e6412ad517 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -234,6 +234,11 @@
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
 
+# - Prefetching during recovery -
+
+#max_recovery_prefetch_distance = 256kB	# -1 disables prefetching
+#recovery_prefetch_fpw = off	# whether to prefetch pages logged with FPW
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlogprefetch.h b/src/include/access/xlogprefetch.h
new file mode 100644
index 0000000000..d8e2e1ca50
--- /dev/null
+++ b/src/include/access/xlogprefetch.h
@@ -0,0 +1,85 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCH_H
+#define XLOGPREFETCH_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern int	max_recovery_prefetch_distance;
+extern bool recovery_prefetch_fpw;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern int XLogPrefetchReconfigureCount;
+
+typedef struct XLogPrefetchState
+{
+	XLogPrefetcher *prefetcher;
+	int			reconfigure_count;
+} XLogPrefetchState;
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchReconfigure(void);
+extern void XLogPrefetchRequestResetStats(void);
+
+extern void XLogPrefetchBegin(XLogPrefetchState *state);
+extern void XLogPrefetchEnd(XLogPrefetchState *state);
+
+/* Functions exposed only for the use of XLogPrefetch(). */
+extern XLogPrefetcher *XLogPrefetcherAllocate(TimeLineID tli,
+											  XLogRecPtr lsn,
+											  bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch,
+									XLogRecPtr replaying_lsn);
+
+/*
+ * Tell the prefetching module that we are now replaying a given LSN, so that
+ * it can decide how far ahead to read in the WAL, if configured.
+ */
+static inline void
+XLogPrefetch(XLogPrefetchState *state,
+			 TimeLineID replaying_tli,
+			 XLogRecPtr replaying_lsn,
+			 bool from_stream)
+{
+	/*
+	 * Handle any configuration changes.  Rather than trying to deal with
+	 * various parameter changes, we just tear down and set up a new
+	 * prefetcher if anything we depend on changes.
+	 */
+	if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
+	{
+		/* If we had a prefetcher, tear it down. */
+		if (state->prefetcher)
+		{
+			XLogPrefetcherFree(state->prefetcher);
+			state->prefetcher = NULL;
+		}
+		/* If we want a prefetcher, set it up. */
+		if (max_recovery_prefetch_distance > 0)
+			state->prefetcher = XLogPrefetcherAllocate(replaying_tli,
+													   replaying_lsn,
+													   from_stream);
+		state->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	if (state->prefetcher)
+		XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
+}
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 082a11f270..8de6bf7b20 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6148,6 +6148,14 @@
   prorettype => 'bool', proargtypes => '',
   prosrc => 'pg_is_wal_replay_paused' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,float4,float4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth,avg_distance,avg_queue_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1387201382..65624a0159 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -62,6 +62,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_SLRU,
+	PGSTAT_MTYPE_RECOVERYPREFETCH,
 	PGSTAT_MTYPE_FUNCSTAT,
 	PGSTAT_MTYPE_FUNCPURGE,
 	PGSTAT_MTYPE_RECOVERYCONFLICT,
@@ -182,6 +183,19 @@ typedef struct PgStat_TableXactStatus
 	struct PgStat_TableXactStatus *next;	/* next of same subxact */
 } PgStat_TableXactStatus;
 
+/*
+ * Recovery prefetching statistics persisted on disk by pgstat.c, but kept in
+ * shared memory by xlogprefetch.c.
+ */
+typedef struct PgStat_RecoveryPrefetchStats
+{
+	PgStat_Counter prefetch;
+	PgStat_Counter skip_hit;
+	PgStat_Counter skip_new;
+	PgStat_Counter skip_fpw;
+	PgStat_Counter skip_seq;
+	TimestampTz stat_reset_timestamp;
+} PgStat_RecoveryPrefetchStats;
 
 /* ------------------------------------------------------------
  * Message formats follow
@@ -453,6 +467,16 @@ typedef struct PgStat_MsgSLRU
 	PgStat_Counter m_truncate;
 } PgStat_MsgSLRU;
 
+/* ----------
+ * PgStat_MsgRecoveryPrefetch			Sent by XLogPrefetch to save statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgRecoveryPrefetch
+{
+	PgStat_MsgHdr m_hdr;
+	PgStat_RecoveryPrefetchStats m_stats;
+} PgStat_MsgRecoveryPrefetch;
+
 /* ----------
  * PgStat_MsgRecoveryConflict	Sent by the backend upon recovery conflict
  * ----------
@@ -597,6 +621,7 @@ typedef union PgStat_Msg
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgSLRU msg_slru;
+	PgStat_MsgRecoveryPrefetch msg_recoveryprefetch;
 	PgStat_MsgFuncstat msg_funcstat;
 	PgStat_MsgFuncpurge msg_funcpurge;
 	PgStat_MsgRecoveryConflict msg_recoveryconflict;
@@ -1459,6 +1484,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1474,6 +1500,7 @@ extern int	pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_RecoveryPrefetchStats *pgstat_fetch_recoveryprefetch(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 2819282181..976cf8b116 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -440,4 +440,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetch.c */
+extern void assign_max_recovery_prefetch_distance(int new_value, void *extra);
+extern void assign_recovery_prefetch_fpw(bool new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 601734a6f1..dea7194bc2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1859,6 +1859,17 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.skip_hit,
+    s.skip_new,
+    s.skip_fpw,
+    s.skip_seq,
+    s.distance,
+    s.queue_depth,
+    s.avg_distance,
+    s.avg_queue_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth, avg_distance, avg_queue_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.20.1

#34Sait Talha Nisanci
Sait.Nisanci@microsoft.com
In reply to: Thomas Munro (#33)
RE: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

I have run some benchmarks for this patch. Overall it seems that there is a good improvement with the patch on recovery times:

The VMs I used have 32GB RAM; pgbench is initialized with a scale factor of 3000 (so it doesn't fit in memory, ~45GB).

In order to avoid checkpoints during the benchmark, max_wal_size (200GB) and checkpoint_timeout (200 mins) are set to high values.

The run is cancelled when there is a reasonable amount of WAL (> 25GB). The recovery times are measured from the REDO logs.

I have tried combinations of SSD, HDD, full_page_writes = on/off and max_io_concurrency = 10/50; the recovery times are as follows (in seconds):

                            | No prefetch | Default prefetch values | Default + max_io_concurrency = 50
SSD, full_page_writes = on  |         852 |                     301 |                                197
SSD, full_page_writes = off |        1642 |                    1359 |                               1391
HDD, full_page_writes = on  |        6027 |                    6345 |                               6390
HDD, full_page_writes = off |         738 |                     275 |                                192

Default prefetch values:
- Max_recovery_prefetch_distance = 256KB
- Max_io_concurrency = 10

It probably makes sense to compare each row separately as the size of WAL can be different.
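
Spelled out as a configuration fragment, the non-default settings above amount to roughly the following (assuming "max_io_concurrency" refers to the maintenance_io_concurrency setting the patch uses for the prefetch queue; anything not listed was presumably left at its default):

max_wal_size = 200GB                      # avoid checkpoints during the run
checkpoint_timeout = 200min               # avoid checkpoints during the run
full_page_writes = on                     # or off, depending on the row
max_recovery_prefetch_distance = 256kB    # the patch's default
maintenance_io_concurrency = 10           # or 50 for the last column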

Talha.


#35Robert Haas
robertmhaas@gmail.com
In reply to: Sait Talha Nisanci (#34)
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

On Wed, Aug 26, 2020 at 9:42 AM Sait Talha Nisanci
<Sait.Nisanci@microsoft.com> wrote:

I have tried combinations of SSD, HDD, full_page_writes = on/off and max_io_concurrency = 10/50; the recovery times are as follows (in seconds):

                            | No prefetch | Default prefetch values | Default + max_io_concurrency = 50
SSD, full_page_writes = on  |         852 |                     301 |                                197
SSD, full_page_writes = off |        1642 |                    1359 |                               1391
HDD, full_page_writes = on  |        6027 |                    6345 |                               6390
HDD, full_page_writes = off |         738 |                     275 |                                192

The regression on HDD with full_page_writes=on is interesting. I don't
know why that should happen, and I wonder if there is anything that
can be done to mitigate it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#36Stephen Frost
sfrost@snowman.net
In reply to: Sait Talha Nisanci (#34)
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

Greetings,

* Sait Talha Nisanci (Sait.Nisanci@microsoft.com) wrote:

I have run some benchmarks for this patch. Overall it seems that there is a good improvement with the patch on recovery times:

Maybe I missed it somewhere, but what's the OS/filesystem being used
here..? What's the filesystem block size..?

Thanks,

Stephen

#37Sait Talha Nisanci
Sait.Nisanci@microsoft.com
In reply to: Stephen Frost (#36)
RE: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

Hi Stephen,

OS version is Ubuntu 18.04.5 LTS.
Filesystem is ext4 and block size is 4KB.

Talha.


#38Stephen Frost
sfrost@snowman.net
In reply to: Sait Talha Nisanci (#37)
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

Greetings,

* Sait Talha Nisanci (Sait.Nisanci@microsoft.com) wrote:

OS version is Ubuntu 18.04.5 LTS.
Filesystem is ext4 and block size is 4KB.

[...]

* Sait Talha Nisanci (Sait.Nisanci@microsoft.com) wrote:

I have run some benchmarks for this patch. Overall it seems that there is a good improvement with the patch on recovery times:

The VMs I used have 32GB RAM; pgbench is initialized with a scale factor of 3000 (so it doesn't fit in memory, ~45GB).

In order to avoid checkpoints during the benchmark, max_wal_size (200GB) and checkpoint_timeout (200 mins) are set to high values.

The run is cancelled when there is a reasonable amount of WAL (> 25GB). The recovery times are measured from the REDO logs.

I have tried combinations of SSD, HDD, full_page_writes = on/off and max_io_concurrency = 10/50; the recovery times are as follows (in seconds):

                            | No prefetch | Default prefetch values | Default + max_io_concurrency = 50
SSD, full_page_writes = on  |         852 |                     301 |                                197
SSD, full_page_writes = off |        1642 |                    1359 |                               1391
HDD, full_page_writes = on  |        6027 |                    6345 |                               6390
HDD, full_page_writes = off |         738 |                     275 |                                192

Default prefetch values:
- Max_recovery_prefetch_distance = 256KB
- Max_io_concurrency = 10

It probably makes sense to compare each row separately as the size of WAL can be different.

Is WAL FPW compression enabled..? I'm trying to figure out how, given
what's been shared here, replaying 25GB of WAL is being helped out
by 2.5x thanks to prefetch in the SSD case. That prefetch is hurting in
the HDD case entirely makes sense to me- we're spending time reading
pages from the HDD, which is entirely pointless work given that we're
just going to write over those pages entirely with FPWs.

Further, if there's 32GB of RAM, and WAL compression isn't enabled and
the WAL is only 25GB, then it's very likely that every page touched by
the WAL ends up in memory (shared buffers or fs cache), and with FPWs we
shouldn't ever need to actually read from the storage to get those
pages, right? So how is prefetch helping so much..?

I'm not sure that the 'full_page_writes = off' tests are very
interesting in this case, since you're going to get torn pages and
therefore corruption and hopefully no one is running with that
configuration with this OS/filesystem.

Thanks,

Stephen

#39Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#38)
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

Hi,

On August 27, 2020 11:26:42 AM PDT, Stephen Frost <sfrost@snowman.net> wrote:

Is WAL FPW compression enabled..? I'm trying to figure out how, given
what's been shared here, replaying 25GB of WAL is being helped out
by 2.5x thanks to prefetch in the SSD case. That prefetch is hurting in
the HDD case entirely makes sense to me- we're spending time reading
pages from the HDD, which is entirely pointless work given that we're
just going to write over those pages entirely with FPWs.

Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affecting the same or if not in s_b anymore.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#40Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#39)
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

Greetings,

* Andres Freund (andres@anarazel.de) wrote:

On August 27, 2020 11:26:42 AM PDT, Stephen Frost <sfrost@snowman.net> wrote:

Is WAL FPW compression enabled..? I'm trying to figure out how, given
what's been shared here, replaying 25GB of WAL is being helped out
by 2.5x thanks to prefetch in the SSD case. That prefetch is hurting in
the HDD case entirely makes sense to me- we're spending time reading
pages from the HDD, which is entirely pointless work given that we're
just going to write over those pages entirely with FPWs.

Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affecting the same or if not in s_b anymore.

We don't actually read the page when we're replaying an FPW though..?
If we don't read it, and we entirely write the page from the FPW, how is
pre-fetching helping..? I understood how it could be helpful for
filesystems which have a larger block size than ours (eg: zfs w/ 16kb
block sizes where the kernel needs to get the whole 16kb block when we
only write 8kb to it), but that's apparently not the case here.

So- what is it that pre-fetching is doing to result in such an
improvement? Is there something lower level where the SSD physical
block size is coming into play, which is typically larger..? I wouldn't
have thought so, but perhaps that's the case..

Thanks,

Stephen

#41Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#40)
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost <sfrost@snowman.net> wrote:

Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affecting the same or if not in s_b anymore.

We don't actually read the page when we're replaying an FPW though..?
If we don't read it, and we entirely write the page from the FPW, how is
pre-fetching helping..?

Suppose there is a checkpoint. Then we replay a record with an FPW,
pre-fetching nothing. Then the buffer gets evicted from
shared_buffers, and maybe the OS cache too. Then, before the next
checkpoint, we again replay a record for the same page. At this point,
pre-fetching should be helpful.

Admittedly, I don't quite understand whether that is what is happening
in this test case, or why SDD vs. HDD should make any difference. But
there doesn't seem to be any reason why it doesn't make sense in
theory.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#42Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#41)
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:

On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost <sfrost@snowman.net> wrote:

Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affecting the same or if not in s_b anymore.

We don't actually read the page when we're replaying an FPW though..?
If we don't read it, and we entirely write the page from the FPW, how is
pre-fetching helping..?

Suppose there is a checkpoint. Then we replay a record with an FPW,
pre-fetching nothing. Then the buffer gets evicted from
shared_buffers, and maybe the OS cache too. Then, before the next
checkpoint, we again replay a record for the same page. At this point,
pre-fetching should be helpful.

Sure- but if we're talking about 25GB of WAL, on a server that's got
32GB, then why would those pages end up getting evicted from memory
entirely? Particularly, enough of them to end up with such a huge
difference in replay time..

I do agree that if we've got more outstanding WAL between checkpoints
than the system's got memory then that certainly changes things, but
that wasn't what I understood the case to be here.

Admittedly, I don't quite understand whether that is what is happening
in this test case, or why SDD vs. HDD should make any difference. But
there doesn't seem to be any reason why it doesn't make sense in
theory.

I agree that this could be a reason, but it doesn't seem to quite fit in
this particular case given the amount of memory and WAL. I'm suspecting
that it's something else and I'd very much like to know if it's a
general "this applies to all (most? a lot of?) SSDs because the
hardware has a larger than 8KB page size and therefore the kernel has to
read it", or if it's something odd about this particular system and
doesn't apply generally.

Thanks,

Stephen

#43Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Stephen Frost (#42)
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote:

Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:

On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost <sfrost@snowman.net> wrote:

Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affecting the same or if not in s_b anymore.

We don't actually read the page when we're replaying an FPW though..?
If we don't read it, and we entirely write the page from the FPW, how is
pre-fetching helping..?

Suppose there is a checkpoint. Then we replay a record with an FPW,
pre-fetching nothing. Then the buffer gets evicted from
shared_buffers, and maybe the OS cache too. Then, before the next
checkpoint, we again replay a record for the same page. At this point,
pre-fetching should be helpful.

Sure- but if we're talking about 25GB of WAL, on a server that's got
32GB, then why would those pages end up getting evicted from memory
entirely? Particularly, enough of them to end up with such a huge
difference in replay time..

I do agree that if we've got more outstanding WAL between checkpoints
than the system's got memory then that certainly changes things, but
that wasn't what I understood the case to be here.

I don't think it's very clear how much WAL there actually was in each
case - the message only said there was more than 25GB, but who knows how
many checkpoints that covers? In the cases with FPW=on this may easily
be much less than one checkpoint (because with scale 45GB an update to
every page will log 45GB of full-page images). It'd be interesting to
see some stats from pg_waldump etc.
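
Back-of-the-envelope, using the usual ~15MB per pgbench scale unit, that claim works out roughly as:

   scale 3000  ~  3000 x ~15MB  ~  45GB of data  ~  5.9M pages of 8kB
   one full-page image per page touched after a checkpoint, ~8kB each
   => updating every page once logs  ~  5.9M x 8kB  ~  45GB of FPIs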

Admittedly, I don't quite understand whether that is what is happening
in this test case, or why SDD vs. HDD should make any difference. But
there doesn't seem to be any reason why it doesn't make sense in
theory.

I agree that this could be a reason, but it doesn't seem to quite fit in
this particular case given the amount of memory and WAL. I'm suspecting
that it's something else and I'd very much like to know if it's a
general "this applies to all (most? a lot of?) SSDs because the
hardware has a larger than 8KB page size and therefore the kernel has to
read it", or if it's something odd about this particular system and
doesn't apply generally.

Not sure. I doubt it has anything to do with the hardware page size,
that's mostly transparent to the kernel anyway. But it might be that the
prefetching on a particular SSD has more overhead than what it saves.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#44Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#43)
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote:

* Robert Haas (robertmhaas@gmail.com) wrote:

On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost <sfrost@snowman.net> wrote:

Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affecting the same or if not in s_b anymore.

We don't actually read the page when we're replaying an FPW though..?
If we don't read it, and we entirely write the page from the FPW, how is
pre-fetching helping..?

Suppose there is a checkpoint. Then we replay a record with an FPW,
pre-fetching nothing. Then the buffer gets evicted from
shared_buffers, and maybe the OS cache too. Then, before the next
checkpoint, we again replay a record for the same page. At this point,
pre-fetching should be helpful.

Sure- but if we're talking about 25GB of WAL, on a server that's got
32GB, then why would those pages end up getting evicted from memory
entirely? Particularly, enough of them to end up with such a huge
difference in replay time..

I do agree that if we've got more outstanding WAL between checkpoints
than the system's got memory then that certainly changes things, but
that wasn't what I understood the case to be here.

I don't think it's very clear how much WAL there actually was in each
case - the message only said there was more than 25GB, but who knows how
many checkpoints that covers? In the cases with FPW=on this may easily
be much less than one checkpoint (because with scale 45GB an update to
every page will log 45GB of full-page images). It'd be interesting to
see some stats from pg_waldump etc.

Also in the message was this:

--
In order to avoid checkpoints during the benchmark, max_wal_size (200GB) and
checkpoint_timeout (200 mins) are set to high values.
--

Which led me to suspect, at least, that this was much less than a
checkpoint, as you suggest. Also, given that the comment was 'the run is
cancelled when there is a reasonable amount of WAL (>25GB)', it seems likely
that it's at least *around* there.

Ultimately though, there just isn't enough information provided to
really be able to understand what's going on. I agree, pg_waldump stats
would be useful.
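
(The per-record breakdown being wished for here is what pg_waldump's --stats mode prints; segment names below are placeholders, not from the thread:)

pg_waldump --stats=record $PGDATA/pg_wal/<first segment> <last segment>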

Admittedly, I don't quite understand whether that is what is happening
in this test case, or why SDD vs. HDD should make any difference. But
there doesn't seem to be any reason why it doesn't make sense in
theory.

I agree that this could be a reason, but it doesn't seem to quite fit in
this particular case given the amount of memory and WAL. I'm suspecting
that it's something else and I'd very much like to know if it's a
general "this applies to all (most? a lot of?) SSDs because the
hardware has a larger than 8KB page size and therefore the kernel has to
read it", or if it's something odd about this particular system and
doesn't apply generally.

Not sure. I doubt it has anything to do with the hardware page size,
that's mostly transparent to the kernel anyway. But it might be that the
prefetching on a particular SSD has more overhead than what it saves.

Right- I wouldn't have thought the hardware page size would matter
either, but it's entirely possible that assumption is wrong and that it
does matter for some reason- perhaps with just some SSDs, or maybe with
a lot of them, or maybe there's something else entirely going on. About
all I feel like I can say at the moment is that I'm very interested in
ways to make WAL replay go faster and it'd be great to get more
information about what's going on here to see if there's something we
can do to generally improve WAL replay.

Thanks,

Stephen

#45Sait Talha Nisanci
Sait.Nisanci@microsoft.com
In reply to: Stephen Frost (#44)
RE: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

Hi,

The WAL size for "SSD, full_page_writes=on" was 36GB. I currently don't have the exact size for the other rows because my test VMs got auto-deleted. I can possibly redo the benchmark to get pg_waldump stats for each row.

Best,
Talha.


#46Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Thomas Munro (#33)
1 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Thu, Aug 13, 2020 at 06:57:20PM +1200, Thomas Munro wrote:

On Thu, Aug 6, 2020 at 10:47 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Thu, Aug 06, 2020 at 02:58:44PM +1200, Thomas Munro wrote:

On Tue, Aug 4, 2020 at 3:47 AM Tomas Vondra

Any luck trying to reproduce thigs? Should I try again and collect some
additional debug info?

No luck. I'm working on it now, and also trying to reduce the
overheads so that we're not doing extra work when it doesn't help.

OK, I'll see if I can still reproduce it.

Since someone else ask me off-list, here's a rebase, with no
functional changes. Soon I'll post a new improved version, but this
version just fixes the bitrot and hopefully turns cfbot green.

I've decided to do some tests with this patch version, but I immediately
ran into issues. What I did was initialize a 32GB pgbench database,
back it up (shutdown + tar) and then run 2h of pgbench with archiving.
Then I restored the backed-up data directory and instructed it to
replay WAL from the archive. There are about 16k WAL segments, so about
256GB of WAL.
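
(Spelled out, that procedure is roughly the following; the paths, scale factor and pgbench options are guesses on my part, not values taken from this message:)

pgbench -i -s 2200 bench                  # roughly a 32GB database
pg_ctl -D $PGDATA stop -m fast
tar czf base.tar.gz -C $PGDATA/.. data    # cold backup; assumes the dir is named "data"
pg_ctl -D $PGDATA start
pgbench -T 7200 -c 16 -j 16 bench         # 2h of updates, with archive_mode = on

pg_ctl -D $PGDATA stop -m fast
rm -rf $PGDATA && tar xzf base.tar.gz -C $PGDATA/..
echo "restore_command = 'cp /archive/%f %p'" >> $PGDATA/postgresql.auto.conf
touch $PGDATA/recovery.signal
pg_ctl -D $PGDATA start                   # replays the ~16k archived segments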

Unfortunately, the very first thing that happens after starting the
recovery is this:

LOG: starting archive recovery
LOG: restored log file "000000010000001600000080" from archive
LOG: consistent recovery state reached at 16/800000A0
LOG: redo starts at 16/800000A0
LOG: database system is ready to accept read only connections
LOG: recovery started prefetching on timeline 1 at 0/800000A0
LOG: recovery no longer prefetching: unexpected pageaddr 8/84000000 in log segment 000000010000001600000081, offset 0
LOG: restored log file "000000010000001600000081" from archive
LOG: restored log file "000000010000001600000082" from archive

So we start applying 000000010000001600000081 and it fails almost
immediately on the first segment. This is confirmed by prefetch stats,
which look like this:

-[ RECORD 1 ]---+-----------------------------
stats_reset     | 2020-09-01 15:02:31.18766+02
prefetch        | 1044
skip_hit        | 1995
skip_new        | 87
skip_fpw        | 2108
skip_seq        | 27
distance        | 0
queue_depth     | 0
avg_distance    | 135838.95
avg_queue_depth | 8.852459

So we do a little bit of prefetching and then it gets disabled :-(
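
(These counters come from the pg_stat_prefetch_recovery view added by the patch, and can be queried and reset using the names the patch defines:)

SELECT * FROM pg_stat_prefetch_recovery;
SELECT pg_stat_reset_shared('prefetch_recovery');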

The segment looks perfectly fine when inspected using pg_waldump, see
the attached file.

I've tested this applied on 6ca547cf75ef6e922476c51a3fb5e253eef5f1b6,
and the failure seems fairly similar to what I reported before, except
that now it happened right at the very beginning.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

000000010000001600000081.log.gz (application/gzip)
��j-�l����+�|1�2�o-<�=BAp�O\����YE��kZ��
>��j�*��)������V��H�1"����������z��k��o�N���'EjU�!�w�XA��K�k4�)�zPF���f��*�����A�M�E���Wy�Q�����]e�5v@�O�K��t^c��a����o����^|�(j*�(j�I�VP�O��'W]�0_�`m4�,���B���t�i5��5�Zn?W�_�h�B�����7
/����S���������H
���P*3�h�?I�j�G���^AUw2�AF������}q!��S�����6�"�I�J����E-+sq?]LRtu��{I.R�����iq�M.��(�����F"�X1�V\�+m�����~�Z����B(��������
��
|��?O1���P������)Uf����t*�<b�2����d���{�(�@
w:b���m�������'B@�S5�WA*X5�T�5�w�����A�W�'�	�~�O-����%�(n7C��xX5�]hV&���!���Q��_�	����Cz�M12�D��������Se�q����H�>Z��������	i�����}(&+v��B
y���,W6 ��B(�	K���[Y^�����v1!����8x���!kt|=��F��H�ah���uH�\h>��]g�.7���%��}��s�
2!�u@�5��hs�����gJ6���S�X�2�C�$^1��U����l�Cyb
g��],b�9�o[�G���3�{y��.���i��Kw�������S�D
�&j���~�|�-:?�k-�����Ri��CO���y?G���Y�{��/s����-���f�)�}�;�g�z?Na�v��������
���������Sn.W�u��F��`����y]����Lk�f;������(�����o���'�F��?X8Q����Z��LO	w�u��MA�(�j�)C�y��E���zj�0|:�N��������h��y�y��r�����Z�����^�&o����2���?�*#o�5\}�w��cC�;o�Z�����z;��E�?@Q����w��#����R���"_v�HOP"����Y��T[60M��w�u�����_:��=S��N����������,:5ZE��=f���es�����4Q1�7��-���XqsWwD����KU����F�h����y����Ure����0j��>���E��d��g���=E��2j��.��:������M�i������*���^�*�e���M�9����x�n��������J�K���)��Y��X�����<��M
E�����_Vps�����������J�qG��+��Q1T����tn�h�2�A|F����W��U�~d��P��VfU���X��XAUq�(��J���1c�~�O�'n����<����lno���t�)��sd�
�5}�>�O@�)���sN� W���lS����h|���F#Cc>�}��bG�f�)���-w�Lr�����\�i��TCO
69>@K���'#�s��5����������`WJR(�I��������}�w��k�$��5�x�>f�Z��b�������x�N��Z�sq2
�&<�=P�~D�n{���io�.&��Iv;�a��8�OI-.iB�
V�U��W\;znk!�GIe7F��;P�a #���3zG�c�Q�JFr��[\���r�H����t�Xp'����K�E_�C/H����������\������@x)���m7'�T_�Ht����U'�@]�fTW�����c�?�In��D^��~�Z�f]�5
-�(Jv��i��R�f�2%�`	__1r&
?q�{�L9>Z������D��'	1}V�lV��^�K���z"���dCyd�S��W�P"�����t���E�X�q@}��k���z�0�]��:���#���F�`��������P��RA��<C�t�8h�0!�q|��q�vM�@�����H�/���+�} ����������s�E�k�?������z����`	���/����{����`��v���.l������[<�b��TD�C@9��wq?W��nA���^��.��%��Pn���A����tI
bE-����n���1VYw�A���R��"{�xU��C�l50�O������Y�1@�g�Gm��:�D�X;�.����]���:;�9n���e�h+�#J�;/d�j�6�~�N����3������5�Og�������)h|!{���/��� MS�!f�H�_��\	Wd7�<@>ED@�kn�6@(��kY��|�^��������gu��*D�.;�_�Ff���2������O
�j�2�h�zY�)7�%�x!�?Y��5�R��v��!������������~����vp�+�p��E��Z��Dx�g�����wv�lPx�@�������=��-U5;����S[��=\��5^��|������%.���w��w���{���r<�C��	�������[I�5�V��
S�Y�����l1��h�v�� ���x��"�a�wn�x����Z�
T+���Gn�s��2�=R����s���(���-���e�8�k����su��*�.�{���@��5#A�����)U�2�o�;r�
%���[7�4fP�)����w-�\|^��9�@7�d����t&G�@g�W���7���7���^,QV9Ta�^q�~x��t(_ D�x�����uj[���y����!E�D�n��%[&/^���5�1o�
?U�������I���*�_SF}\)$*�<s������R�]#���������Bb�����0V�]]������h��|��N���_zQ~�����6�/Tf�f���E��_7������j�R��#5��
�U���RD�!���A1�j}X��X�m3�S�*���^EkO�I��r��	&�{GP�-���_�Y��b�E�[R^%��t�r^���1�w-�>G��G6k3cg�$��b���H������N+����V!���r�(i*��a�����2m���eG	���An�����v�@�;���4�*���\���@���-�����	R~^5�b-d�_�ZZ��rD��cMU��	��b���T#w��M
�� �?�3i�1�Ipr��*e9C��#5f���B��5�A�S��{�|Fn�jq$f��F�W#z(��!Z�}]]^���'ihJ����
��']d�f��P~`��7�rY��5��RGu��P��i��32)n�p(l��b3Lu���)r_/^Y�T����]q����r{�-��E�aByc��Aq�4���RQ��^P���{	C��M��[u���5C�E���C$�F^?��k���&$"i�D��;�������������U�G�S:��t'#�]�{��m��)~��B��K���*~rZ�a��)^�f�`u.v��!�L)��{H��U���F~
-l��h(Z�+q�xJ�O��9*�9��S���R��+�� ��0%bSy00�t���v�/��k�U��e}o��]�Y�^��W���kD�&�F{�d���
;p�:��s�ZY�[wW����mINa�_w#W�'D ��o��C��=��0e�O�YQ�U�,�]�����\Kw��}�#�E�:|����2���)ty�p�x��g�61
w�An�V:�Z
��&��H���b���i���K��:���B��2B�u������5�k���C�mE���L RH�M-���b�f������%�
;Kp&�y�s�S7P��
�W �{|=z�Z{�l���b�#�����h���`��A�R�a���� ��/O9+p(  ��+��
��b��)�����B9J�b0rR!�)x9����vU�<@�fB@�vU���������6uH�!���U�Yv�C��Bd����1e|~��bo�^�s��S+�6�yA��sG��;�YHv�Ew��������/�e�O��AR���>������q��uB?���������pO�N^�~����;����u|��"����B����L������i4�E"^o[�qH��H���tsx��F<�S�y�>��
M����P�n�&���}���"������YvY����G��Wj����3��j���������m�uSi����*�)MV��]M���J���J�9�AZef���������&���_��8f8��������h�{�p�B����7�����OF9��'6^�e���:���&�����C�w$}��
9[�������
������v��ak������������'L�7�T����l����#F�QA�%f���f������H�a/#H_@#7f����T[�HY1w�V}x�k��tLe7f`���k�{|ck�{��������wg[����3��Y��w�4qI�H�k�<�.�H�W�U�$�� �`�7^�RF�����T?VF����������0����#u�����$z����A�1�t'����`�#�%����>!E�l��Je�
�����)���@v��'�����E^\��q��
?e52����y|�����E[7���j� ���]�R��8��t��[��ucFG�;{�<��r����0�R�"�Q�myB��m���K
��x����U��^�G�_Hz�����qA�\N_�
?M8���6*���_�"�k��O�|�'����m����[2���h=������k�dY{�������������[�n����NY ��)UfP
����L�`��6l���N����b`6�)GyT)LW�����4Eg�4he�!{}���@��7���bZg!��w�^M������o���g��+(+�RkF)���R�������B��6`m���L
T�m������e����c�+3:���j�r��v�J2i���<RK#wq�VZ`B��J�u]��1	M�J"���:s)��]���	!�6�=��<�/�8�J��OYL�j����iU��&�x[�����R�����W�W]vNH�M@y�qN�yN.>�_"�k��0��.U����'���"�i���l"�����I���b*vx�l�4���o���5j����b#�h��+@Y^�
>���[������D�h�];�����b�SX������+���8�4��6�"l��h��l����<h���+B$�0�l��v�����Mg������o=�T�&�k��
l����:��(���$g:��S�<u�4"Rw�#����0�������vB�O�W�g
����;T;~�*
���(�9�?%(H����>w
L�Ia;l'�<z!��%�_;1���4�b���
�H��u*��C�}��'[/)���C@��B�h��
#��S
��n��X�����Hwn�5�P�]G<��9?�����`�C������ <��9?���n���<T|�~|u��E�f������R�|?RTb�(�O���f���>Pv�NVT�eAO$=A��1���L�YW������zyt+|�� h��l/V_���Y;�RH�8��i���Y�\��$��k@�c�7����z��a�+���G�{���F"%e�))C1���k,���9�G��l�*�y��a��^B��Ch�����S�3���[^�g���#��f^2���^x�?F^<�����������&����q;��Q����w�N@����#��{�6�d
����k���	�q#��O��V/�7n2
�:pBem�H���y7����Zn�z=N�A�
�����������3/���u
o�
�����5���zKe����(�����[W|���j��q��������-�-j6�� �?�5vf��I���������5E�|WR��_��B}\i�3:w-F��[��MI���������[��>#�����+_��������;�D�"k`�E�>:�a& }���AMQ�X��xp^>bt���e�M"na
����"�wWi�2�~�����J�H5p��!3[��h��b1��7f�O�\����0B:����P�I_�������������mn�z�2i���W�H_{���kG-�����|l_�w���H����&|t���|�"�$E ?����F2p�F+��n*tr���^(j*���U�&�94��UL?e��>���}r%� ��$�6@V���:E�=�q�v|�7�@>������������|���x�>���o	2G�O2���~=l�aF ��~3��~T�V~#�b4��j����n����X���#	R�QP�v�)����1�H�����1��(<��KG�FP��b�������g�k������J���������& �����A�e��T��B���gk�����b�H�Gf��
R�1�]&(cn� f�3�'?n ?EL���sZW(;j��b��j7#�kZ.xBNy����J�%x�~�	���
���T�i${���T�TvN)���J�0���������P���\�1�)%9���
.B�|�O�f�I�1����h��������T4�lJ�O�A��/�5�B���k���.��'$kQ����
<��I��y��v
"�UX�i���3r�c�b�EU������!�_�����i;�b{C��:d������D����RP��'��}���6�B��@�tX��<7Be���
!����pY�'�E�������y*���h(`�A�A�������*MR� 7�$���Hs��3��7�
��\�}R�6������C��#6��/>�rI{�~�x|U��J��C�c;G��l�o's�d���^aEf�Z`��7�@�g��}���n`����d�h���	������!;]W:g�p�����CEQf�v���}R��;R:'�0{��I
�t;�[��C\_Cj.,[jy�� �s���\��\w�By0w�����uz�����Y�8t��!�-I����'�tb*����3���4���OT��@|��/��^��(A�{�5��[������MX9%�X�c�Y�������]
S������$b���������t�@���#Z3J��C�g��Y���J����-B��);K��������<wY%�6���h@�����qju#�
����� ��O)������K3
a���[�\���t��������s[K�J�G���=�����\�qju����ZIq?���9���V7����YY�}�AA_��<�����9�"i�j����Y�������8A!7���VN�M4
�O��y���Bn4���VUg��$/9NW?�<���*��sZ����<�� Pz��o$O��'�[��)�u�
��l����B���Rf������<!�P��XA��d-�7Gw��1H�U�$c����Fv��TVktN������9��h�-�����8���W������[�to� dHb�
&NA���U�%�C8����w6�d��tW�7r7>_��wHO6Q'�iY��������`W��C����{�����5�����]�p�Q:��q������:�S�)(\�����}�(�z�>����_.'�`-~#�������3���T]�����N(h���?�����LE���8u���S/@���1�'�s�Tc����H�N�������r3�����>Q��#�-n�}u`Ev
�t�����K��!H[�	��Y�5��p
?�*�m4><v������@�����#3!D�$	�Xn:DF���P��xA��d�1\�Q�>��,@�������swfNizpA�@��P!�4\4�tlWm�N1�x�*�A��������QM�#����|
���H9�4^��IV������y`d���W
>5���t|4�%b=�X�#>�����#d���]����G=;����K�'�
��tU�
����vo����N���AA�&/Lk�{��� �����Aq�������i����49�d�B����u��G,�7���x�9���K�/P���S�����sHr=���o�	�:��[;������8l�h�����Y�$�� 	x�Y�*����)nA�' ���4�l���wU<�v�>��r���b�[��)g�I�����L��(+�Dw��9D �*��ra7G��T�A��8�~�8t���\���oAj�8�&v��U+k"����
<���W�5�����3^����-������l���o~�2M�Q�aC��#M����}vo�v���
]�_����oW�
�GYooF�{v�.d��L�4"�n����%�\o����j�����g���*v
��^
)���s�{g`dM�*ZAK�
,�����j����3��\M����i���$_�6�
M~�\;'d5\�R4l~�-k���\'$i�m�4��9!+�	y�����];'d�:!y��C����	Y���@
[W;U�M��������N
a��A����rh��8yzbQR�Yh�n����,����Z��)Do��YHP�:sn����Y�����:n��R/�n#�;���T.7�cc0����"-�V�X�����J��i��r������F!�9~�V��5)�}0�^����H6�d�F��9��E\��v5���ON�J����1��_�SF8��u�8����\���D_� ��z]d�L�#_Q�T^?���/u��X���UqEO��������|^�ZV�IhW� �%9Em�a'_��n~;1�����f��w�{�[�]e��t<K���4ft��;
U�8R���������������'���h����O��;��U7�ha�m>Em�C8�y03������l4b'�G�h�����%�H�a�E�����eK��D�����.�F��0�!:��A��EmS��r���3Ro�]�Y0[��0���k6��);��'�5�dX�h��!;���g�N������H��)zn��|Oa����Q�TQx�i��&��_��/�#��H���[�mL|0h�d���x9�q�O�Z����4���*�_wv51��s;"����
{�H��]���3���N0��QFz�q�~cFo���+F�=!U���G�w�<���G#h5&��0)�+ �RX;����8*�5���C$�����])B�@��:�����&�S`����6��s:��jY�H������9��)Fo4^P�8����MV� ���5
�0�b�2=��A��2���l�
n��_�[��5hkJ��{K ,��6|&�D��0;a�KS�����L�b��S]{�A�y
F�^�o�X���j�#�������~�4���,���1��V����Xyr���`�	
�~*Mz|�����J���x�b���h��*�Y���Sz��K��l
�=��]�lL
�k8{�
�p5���������G;�����B��,��Fc���-���P��������2	]�r|&��:,����H2�J�/;t�y��;�o����?>B�)���R"n����n����rJ���NV���wt��D��'�*0���'�U(X���|�Ua�L�T�7�=��>�����}u���r�����#�������	|�������:��H!���3�D�+�GaCG���:��������a�����iG��,��>�.����r�8�����B��<�����CZ�d��R�#i�Y�K�3��Y��M������h\�C��Y�L[��bgEC��i%k_+(����a�@*l_+�������h^������Z;��}���e��n;��mz>e����������@QA@Z��C�����_������A�_����G5��<N�;��{�V�������
"PQ�Z}v�Z}�*l���`'��*D �����Nh�U�<@G���q�k��A����e'
��@-�����4"Rso���HEV,\�i�C:`�z�u�yj�)����l�^G��f��$wd���9d���>�'��l�X�&n��9�3Y�V���f������7
1��O���f]�)xq`wB�����H��}|�ZH����8e�;$�[eE�T�'�����-�\����b�x�8��1C8�����x�wN�����1��r{��4%�����
Y��3�������l�������"�+�}��#!�����3���/;L&�6H�4f�a~jM�:x��d�eNE��fi��?�O&�y?�Iff#_&G��;��T7w�qv��;����K�T��������oT�����������������~fb����������
�����%:���.F�'�`�^�/[(�l$A���uG���q�8��-��5�fDw�2�G��X������d�'�4r���r|��zq6��&�nq���Q~����k��ys�..P,���oeF9SID1��FmlD,�^���yx�/f�h�
��������%�	�
>��,�*�%qCb'�w����{�����M�x���*�e��������>J�!�F2�b�=��KJ�(G	�G�����_w�_����B��}��D&-�� ��W�hEm�a��������sG������Hm1q��8��b�tw�h�h��I��N�)�X�=�$AW7�x\ i�h�����
�X7e/_��dO4r���yn�,���Iq/�8��
~�8�zC(Fzv����9�n�����1�jV9>@�����f/o���7?3�gI�_8*���U�����C�J��f�1��z���;k|*4�n�����K�d������0g���~?� �oJ&!"�U���A������EB�i�
�F#q�����U��]�W�v�>�o������=&��Qr��+Z?_�7~
-3q�)^��tW�1E�m������7Z������
���Rt�j�������8��g#��k���2�#���2�������D*�
�U��H�@z�;�o�������H����L���/I���jZf������g��c���g'�Q��`��*�d%pG������vJY�
J�9�)U��pr8�����;��%#�_6�F�%���/���jA�����?�Z0�����{���U�E�Iv��R���������Sv�B�C���t`S�a��U�������u#�������-��FV�]��FL<�w5�i��b]����a����"P��2!����}S��Y��-!�z �F�o����.V�����7:��������97����7��1>s�4��&�����:��k -����iF����v����������O��/�������@�z{������1�������L.���V&���D��+��CN
�����~C�X7�E+�18l&H}c��L��(t|D�?�=q�I����,>����
H�����8���s�w�����^l��{��C+�C�I������p��?���^����	\���F�Nt���	\��*�������������H�1�{;�'p�R�hS�]�x����:8��Uk�����?A�O����d���������d0(�����}���K2Dik��`j%_J�HG��F��fn�E�)��������x�s�,�7juSpV>��#(vF���Q:�
UX<���6�U�EGi'���o�%}��[�I��G�
��,�G�)d��J��p��uU��E<4@+||I�d�V��9�����PK_�4�2����q`c(�����&��E�����}�W�:h�U���p����:���E�\i��b:�(����f���,#x��.V�A$�W#O�{��W�Nj����C�U��*8�c����{*{uI��yc-B�j��sV��]�6�NKG��r�#j+CK��R�?���vz�6�"D�N>��T�%{$��A!V���nt���Dj��T��`e\##6���J#9��j����}�Z!�K����i$R���eJx�/����L��������I��'#���m���F�>J��l.�5�_��?)�g6�����jESl|�	�gz�qf�B�Fc_��1	���#�����U��� �u���rB�S��	���]���
)zM7��������w��[xp.��n4�~�Y"m�`���&��:e�9�/�5��\�6������;Q���
H��V���>���rwn�C��2��a�EMe�o2U@>���a�rUd��"�/�|�I��j�Zc�9h��(0
���������BG���5f/����2��qY!���A�&f��w��xr�L����O5�3+�c|O�K��?������
�OD�j��2EDs>e�G =CD�1�����h	s���K�]Q�%k1Hy�e��CuS�
1��;���=U��-�9&"�5��������	L�=A�Mc���m�>y�1h(o�
h�W����>�F��*��3�%#��)1�m�rqr��/����3�o�������i�:"�>�/��?2�
��������`-G�yF�4��:��;z�P>L����:�a�)Q��������=���\qO�\Q�fW��g�5�*M!�S|��
<�W����!��Z�qB���,7�}�^>�b~	��|zL�������9-50";��r�5^u(��W���P|�������X?�����d*U�A�h�m��:�����&���!�$R�H�Bn�=�������m�Y��+�v���
"�|)W���� �E�>��_��8��.��(���*�X4�,�e��|g�,&�c1�0�=�56Y4�� �P��7lvh�5:�����E~�f�^�Rm��|�
�x
��z������Cz��n�����������m+�P[�HA�U��������/�ez�T��������s�Q�
}k�8��
l?��� �F�N������� ���^
f�lV�<@
��������C
���L�K�!����,��l��~�m���%�x���V:�'����e��6(�R�nQ�
������B
���d�~�N���o�<@������;��0��4i�w9w��<�8tz�^��kA�>����p>����k�*�6!Y7)d���	g7��xa����.{���� lG:4�4f)Hn�4��`�&y����{�CF�w�gG�S�4S(E����V@�h�6��R�6���'y���
��F#����Z��NH��.cyR���g��#��������zEe����X����K���&T�?X�Y��L0"�X���q/s��;�j�<�	Wt�����N�&��O!���PD,�����z?GUC���1s�M��
I��Mc5��yPvr��]D��kX!lX���c��MD����l��o&]���'�e�Q�$���JC��T�Il����O��������zR2�O�������	g
>[<d� �?[���|��|}������A�
>��O������r\>�Tqt;r���y��H������Rj
�RaE#\���N�|j9��>��	�v�dg��r�x/4�Qdyx�����s�B�� ��Hn5�D-�_���U�L���
���R��gqO%��*�
��(��BiU!�yw�t{�>�CK?1�(z�����@�QH����j�����s)<BeO��HN:�j��)L��Cx^�k��B�):���Z�����$)��T��A���H:I^��3$g/�L�	�Fb�x����*I�6��!2�-����_<.O���v1���ew��;���r�����`]z�5��\�dv�3�F�T��%�����/�*�[�2h�����W��u�#�>���JP���@�
�~c�������}�'@��HNauH�z���
����-&)���pMi����F<�s���J����K�q����*�SYv���H�a-���]��`A:�����b���+?y��%
&kZt4��LU�T3�Hd������tx�k�wR��7���BOUF'n/W2�
�� �������z�Avd�/����&{�x�K�H�ev#w��9�"���)�=��������NZ./��4J��]m�����RTi�eN�����l��N����� =A6�1����C9� �I	2";��'2��$J�{5v�yr0�N�^��?<��h\Y^
$�H��m��M�S����s�U�w�Hw�no����S���>�*Je��!�/$���,wrBI��;��u�	����=��w}p5DoT��Z�/�|4.)�������&�qg�w|O��y���:���c�mA�
��*4X4�*��I�zD�b�EC��U��d#a��	��\�M)�=�DAm
����Hja E[�E���=�l���p�p#_$bz�Q|Wf���,��&�'N���w[��������l��Cl�1�Q��MVl����A�����}Y�1D�\{�Tk���oe��C���(O�'���B���t4�@�����mY�rTx�m��[���l\�G1X\��v����"���@�i������C;�S�)s�y�����P���GY�
��@��8t��|@��F?��w@Mu��U�c�B��.��*���*�;��2��y�#x�����y1w���&-t���Ha�A'}_��s�V|�P���]�^��=�������Z;1{�c	
��i��qh�u(4��R)�@X:�g�t��TW�OkCv

�LR�tv�rS/�����U?h�I&�������B���K<(.�d'����6�$��f��`<�4M��,,��c���4��eAq�%���^EY�	��Z:�qL�������K)yv*��
T�T�"3����_���X9C���'���Z���^=.�Y@����TIz��6�w�E���!��Ai�f��;�'q�������9�K�
��y���V����=U�#R�[+v4@����c�������:�����l���:��^����.������$3'��N�L��m��L2��y��
��h���C��s.����(	8�
"��h5u��OV��)���Qt;����L����k�������i�`����������Dj���;���Ua�[�wfod_(7��)�
���U�o���-TB6� ���hu��L
Gt����B�������g	j���{H�������g�o��H��E���b�Tu�])�b������g��}��d���.���,C�F�����|�O>�p���&A�F#J��;R������<���{YQ�F#
�������Ak)I��y���H��E�	�E_�K�A��"V�7���i�)nLe[fU��2�X��x���-�����
��g��C�~���H��e-������s��������{[��t�},��Z���^;2����~=�9�(�_8���������o����4��?����<|���VA.s����	"Z�!���,w��<�	"Z
��5�h�G�Z��}��� 6W�5�*�Xa�}�
�uen�H�:yh'���n��mQBt���g��|���h_w�h�V��,1����Yx���@�t�|�B�BDS��S�NQ��^��b�T�������:wHp�v��-^^�2�J�A��,���H�i`����]����h���J�q�B^�������������.��	i��L�����rl	����^��?�	&NF� ��{��Y1�����^k�~2����xP�!D�?7z����wB������
>a�%"1�e�����������R�����%]�1/�&q�E@�(�*~���F�XN����n����������������?����:�2�9T�^CG=�7~�����)�� tM�C�k/���(_��%�H��m�1�y>�\�I.�,_*�Rs�h.V5L�|d�� <�����s�M"��G�3w�Y=+���H=�S�G_���I/Y��_��++��]���g���V�P���+�.��G*������A����i��)Z���t��W���'�!I+�xXg�(�������F"!n�mO��@�7eb��� d�;$����.Si>��HX���f&,����;V7V#���9g:/���7��'rv�����/?zM��k*h-����'��$�j�)�8���H�k�WR~l9��������W��7�e)��j��181�l�xC���5����i4Y���J��6�w+$g�X<Z!�� 
��s6�����>,6����1x~���=(��%G����yx����P�����Dy:�:A�G����iRp��XK�H6&�����pV����s�|G�I#�t#��	j��_:��0�g��z21jv7�#�����L��A�Gd�#2�_v+fp�{���u�����g]��d�%B"��N��e�g�3i�w��Ay$�d�"�&W���B:Q��uy��DT�^���J]/����(R��2�tMDm~���o�'��U$e��k������Fc�>A�>j��1s>�z��C�W0�/��L���2ey
/h�-��N*~����'���,��9��jLQE�q�F��a��6�x����W2�
�0g�~�*G��=�_!�}Ml�kB�^7#~$��vd�d'/���������k4E��j�j!��0/���0��>x}N-HD���J:n�}��"��(7,�i��(?U_&�T7_��]�����3�����&�m*�5���*B������{w�S�Bi�X(�4v72M���./X�4%y����
"qo
3�JrE!�������%��uz�M�?p-(+�j��������Q��������B5C����=kI�d�~D��5�u��4=(4"5��s�,��D���N�v�|/�0Hx����x��Y�|z���x�(�S���H�(�:��F����L!-�4#Rw;�}����u���+s�{�>(${�cF�Q!�����iz�qhgZ0>)��y����G?���������"A��A�NS||��jDQ0s=�x����f��
���;�����$��X�7��1�����s�P�Of+Q�4�f��4���'���~���m���9�Q�7�F���vN�7�Q�gTyA�1������4��B�i�P�U�]�B�v���W�
��sZV�����XS�,��Cw��'�,��D��K`&{�2ie��
(�����,���3���?= fp���/&����
�H��y�IcBi�:��"fh?�V����I��;�+EaE�C>��U;���E�f�3J��*�}��#���1������eB�7�����s�j������#��Jq�I��2�GI��hfp�]d���K\v4�G�N�LL�#��S��p�^��t�������*�s|����		_#^YA�������f�|M�\�NU���/�E&Ec0�����-���5<@��i��tO��X~d�P����AW�}�g �)rvP�NV�,��'�)���a�A��Nu����!���a�C�F���=��Kz2b�b^B@����Y��?��C��XTZ,r@����M���S��Z�
�2	x����`LA������D�SAe�7���=�<e�fj��b�g�i�,����K�u��
�R����
����o��B{�Sf��C��}PNI�;�(���h�[h�Z�L�B��;nkO�^v��������x����p�;|���xe��M_I�O�	�"Q?��v��kp��(�4����6���8y
���o4���P�����������Qg��(���%V�����S�����Hw�qE�#�3���vi��8���zp~*��WQ��5.��0�����d�G��M8
S
:n�gr�������'�>�(�x&�q6�z���(�g�xA�2��
���P0I����4����P!�� R��v���,OQ"}uHc�����V�RRIF>�	�{�z��gK|������w#^v����
0���d>��1��W�+Js��`����Q�	T_�8hP~;�&�I���>z�!:54�s��t�2?��B���X��)��2��	)��~��f-W�&����lg;�P��I��)��`�l��T6��,1�1uq��):�QHL��7/x'+D���l
Gw������:�P|��n��\����g���q�~1��D���%�����Uca���Z����&|u��.x����@O�����qh�M����U����?�@�(@��.A�EN�c��j#�3�^��Y(H8S6V�yw��3��1����%������P��-z�y��I!�v�,����*�T;	R��t.�$�sG�,L\�&�7����B����$���z�2��(-�\��EM<	'�$G�&����d9�E�����!��� *������cb�����G�h�N�C���H���h�F#J�sO�J�Sl�����\@7����L��%�Lv���.������3��GT�<|v�{�|2�R���������B�*��<f��^R��#��#�.9�fOb|�#�_�������gJ�����[Qf��-q�EU���}B�!�?@otp�����5��'�o��H�U�?@�P$Y_����3�����h�^}h����{��iM�<e��YZ����������?�cj�.�!w��H�����aj��������1����w
H8Lg�&p�}� 4��k�����)h����z�t�b]B@k{���;
�H���ag��h��  �gy_����dOD�����NYV�v��k���[���*����G,h�9����F��(�";��U�=��'�L�����������8�:��<�'�$��^�F3uH_ZM�����	��L�c��{w�����p��h���h�8tW���`��>���Q�c�S�
*�Jc����D����]�^��Y�H�E�_yA�}2��]��`">u��yO_v|�?��y�V4 
q��@���uVW��.q�U��J0^:t�?�u7@���iv��y��K����);���C����:�<J�x�L����������$�Gy�$Q���d�����jx�	K)'0��y�Z)XW�P�-�����5�0U<��(fy��Ld��2���p���
a"�0\������6�h�m�$|^�09&f��]R�9�j�F4q=RQ��������IJ��	�b��AGu�O��Uob�	�r���!l;�E}~p~b���-���.xH�q~�������H��7_���D�/������
����A���7	�)��Z�	���{?�	�gRY����%R�#���������G>����NS@����k�cZ�[�{�������-L�t�w���
_#�-�uH���a���a���d��:�� ^� ��K�i�A������5�&�!#OqD@��:�,NH��q����]��w1��:0���#
�
��;�dX�;��<���p4�$L
�~����b��hj@�!����[�;���Fb9�u���zRIHe��8t���h�4����wE�c^V8Md^vx���wz��Y����r(����gKo�V8~���\)�&����!��c���#!M�(�/�e�qG9T�7^n>���������f<���	�~��?=0g��h}?�0�W��T����J#&�I�V��������s�	b
�p�h�3�N����y2I!s���(k$��:g�U�#��}mTF9\�!��x���I9Q�:B9�C�^d�%v@yBASyAL�|��%.)��:�����'���O�Iu������7|�/l<i� �T�;W�����A���V����K��
�����]�7,���w����GI�y$j5
�x��v <�d,>�J#9R��j����{g�nx%��+a.3��Sh+�{�����������q����O7AY}���N\6�9'fpL~��m��0S�!�2s
�������r���j��&S����'(��H*�����'��n^�/�vH��He������G�1�^5~�3>����1�g�������@�H����a��h��O��``��J�C4� �U|�A<�����J����w0f|]����3L��
����
b��%��i)^�����?B�Ub>������2.�+���}^m J/i��
�74�	Yr�����_2k/��X>���&�5����,�#���GF4��bw'��4��Y@�����l�d��:hcD�G@P�� j�\Y���-����Y��
+}B@�k���9�%�l#:�����Fx��p�L��(��g?�[����[��(��gw7p`�H��]�ly2T�Wo(O! ������lHP����j�{��	�u��q�
���$���QCG9=��9�N�>��R:^��l��4���D�|���	1[b
E������2���0�#�G��5��L#
D���Qp�<��d�xw	��(������o�D'�3����/N0�(
f0��}^a�m���9���m��S�����b��;.�G
b�<NP�xAu����LW&h���9��6*�����V�Ds����+�wX���\�P��\N��?:��=�o�l
`wP����'�k������4�����'
7����}��d�V��Y��^���#A%_��?K����t�\�_���X���c����� �t�����8��JIF����w�a��T':���w�g����l�{'�y�����g����"]���B�B��gm�0=*�k�u�����r��O�9���O���C[|������!���J��!��������>7��&�=!��H_�c����B�W������8�2Q�-pH������G��z���Ku$�no�h��h�';��Z^��r�D��S�|�n1Fa&Z���Bc����

3���P�JV�,�����n�g��m�6@(W��B���5L�����C���� ���������(^"uH���,r�u�Rr6���*�������3-�
����N����8�|��XI�������F����H�)�x��tKpOP��S���O�H�UCd9F�fP%^Fx�_Qh���������������Q�2����>��9r"�2B9w�V}�2�/��"��M������@D74�=��h��6��d�-���`j�O��w��Yy���ZG������'U����=���|xR�z�����3K$�F53E����~D��5^h����Y�H���s�n���3l�$�x���;rho�qS�1�I��_����+|�}<���V���>�D��L���lk$�b�Y��{��3{]�OF6��s�R4Ftd����M$d���_g.+��H�	w��(��+�o.���~�H�<�����B�K��x����S&���������k��7(�i��)?��J�)����y�R�7�no4���� ����W�e���t|$����$��j7��@I���a��b��x���i6�w�������U7���<s6�(������������/w,q�;�	���K���:��V��T��
�<(�h�9��6�,��5��Fo�C~ ��t�m�5��at���k��=f��&����PnT��-�R[�O���w���}M�}i$�8l��#��K��V��a�%.'+_�k���]�tDI+y���N!�t}/��.����:5'��2d�q����"{+(A]x����-��Q$���b{D�b�,���ze�
�b��7�����J ��������g1��h�f_���m2I�n^.�l���&ZcC�
�>��9�;��{?�|&T���3��R@2�����{i��[�>6����=��M�nhG���'�=yn�Q)S�����p|�)�n1��,����k�P�����U�=�$V�X!8�����@Y4���Ti�b�����s��U� y[���
)|�-��� w;���s�,QA�<��n����N�M@�u�X����8qZIdo��qZ�]���u		r����Gd� �p ����]c0s�uI���tH���~�|�Z�1L��pq(d����y��KTHD��FU4�T�!3$m�-�|%J�n�v����O[��#�7�YE��rx
���[~.���;sD�vDQ0[x���=<s4O/xO&��Lw	-D�D����I�*��xo��"��'������1F�h�
��
?������B�+�
>|�����@j��z��hURe�'��n���^{���V���&�K���m��VE�F�m��V�EQ��7?Z��c2-����v2���o�#��f�4�w�+����\���os?��E��]PhY��FS�����H#"]pb����S����PQrT�*����7z���f�q��N�����mx�<�qR0���P6� �}����s�!nq@L
{����D ��y�uG�|]��#@
�K��X7��+D
U!��:�H"�n�}�;�bK���{�(L��v,m4��w�y/��zD
#tv�$��-���g��=i�^o�j6����>w��pT���-��@��'#��;��3~��w� �������e/�"����'��
�����	��
;GQ$�d�%�� ��=%�V��+�����_��BG=X�df�i�q�UD�*������f�����ixW�v`���S�Pi#����u�d���/p+X��:[d���I�@�5�T
�_���
��^(�K~!�^\�!��|��p�E�x�G?��e��(�JA�^��@
�cF�����s��:�eP�;W�p�]���_m��N
b_#3P�#k�v����zS~o��N�:<�h��	z�4�o�<�+;�4q ��
rW�eG��r�C�x�G�q����M�����w<�Gp6�8���'�.qd�� h������Q��#V�2���d7/���30ww^<��S�f,���LF;����(�N
v`B@�sD�����o�1�w��n.�!���}�H"��KS�-
}�p�������N��MU}�/gV�� �yHy�@F������~~W�FT.^r�L*���>2�(���g��h�TvNH��1���[j�����kxl+��
?F�N����BY�B����V�^�I������������AaV����S�y�.r�D�N��'%�wH�xJ�X#|��z��t�Z>&�_�.�j-Rdo������"��)w�0)���/l;�6��9�^+Q��w�<�
E��n�m4��>�Nm4�d�(���3��4Q~��Z:a*oFl:�d��� R��/�>;=�2}�f�k�U1"#�g��f�=�}@%G�3�����t�c�XcF���idO��GH�6�V�$#�"�/$�h&+'�D5��s{���y�(�����������\
bo�}d�p����[�"���8��1�J�2�������`����p�e�OU3��dCz�cH�h�����W�����;Km+P������E5��N*���cn�Y,�Zx���6R�O��s��q��X��M|���>xS<1K
���N���������%�<�u��]�_�@vN)�����?���x���%��0�1�pM.~��f�0'�f���K�0���'���Jqw
%�;�3`�?����N�����'y�)�6w�:!��FO���47o���f���+3�0��W���;/�p�����R�Xnix���W�?C)�7����%����hf�t;+�'�4�	(��%��	*g�(���S!jL�)�+�!M���p}�)���(�+��V:�7��9c�A+���tx�r��\	Z7d%��k��]2!g������T�rDr2�|�x;�}��n?x����*!5B:3Y��L����/�|�=��s�dS��zRm���^�q�}Fk�c�w�
�����7|2���|���-*x"����F#�JML���)\+�W�h����g�Ya�^���!���TvgD�0#����n��r�ta�.��F���yx�$.N���8�E*7H�A@�W!��yY���B��!���ph�a�
^�t��0��,�)��T�&10{|)���T|���9e{� <`���,�h:6���8��?'�:s5�}�+$y�xr
���l�k��
l����4���� ����\����c�
=M��|E�w/�(���Rr@�^^9�b�]W�B���8�\�~R�
�����q��v����A@�����s� TQ0��"��u����C������KSY����T� �*k���wPg�1;��F�V����
Haj�d��OT���6�j��{�8��s���C���c��!7���9@���t}�>Z��~
�LN�
�5)��Q�"b��IdqY������O�� ��\����L��;(�Ai�`��u��`\�b��Q6�1���dt������Q��`=�7�^�
������&�o�����XP��! �A_����3����N6� _��2];�����~�V?�{]�}����pD���u������
�������<e�{$�neFG�9T8P�#�un3Y=x�-�x�L���2}��ZLt����wy
�X����oM({R�V2����6��K�V4�DY
��S�jE��B,
~o]�[�5�d����P�:*�(��[�E����"������d7��!�/�T�1g��C��k��z�
��z���Y������x�J�tCO%�������G���2�/���G������M�(�%�k���TI��,����%y�Y�z�vw�K*��T��U�k���)"����9��jm4�V�9�1P%g#
�Y����z$����v�jB�����u����sKG���O�KT,��60��);L�{�5�p��,��Z�?s�r������"#k�)k���t��2W(���������yFA�c%:����G	R��~����Y#UR6b�rK�������������e�R��a
�
���jS�0=�P�Tf�p��]jE�����"4W3��
�J!5gl1j�0L���P���Cu��(h*����Z1�CDW$
(�=��Y���	�s�TpqD�pwT�!�����J�����9}����M������B����5��G��s���P������A;j�E���w$�7�	g�iGf�@�H��Q�?B"��|��Wh��pA��(C�O{�	����Gy���q��(ngw���9Z0}���T����-������W;��En��9cRQm�o8���z[S��I����2_�
�[��P�~��N~�n&Z�wa��H���~�`v5���r*��Y�]v�I��R�����D�f��n����L��B�*��B��Xjs��Av����[���������De�G�h�y�����@x�f����m�q�H��������	Q�G���D�g�p����b��$�:��o���)N�y����>�+���7���wF�w
S5=e������o^���+�|��i,�����y��6pG�����#�*������f�1'\���������=
�MU���FaA�%a��@ k���ZP_U��
�lZ{z����Y����)��8�~�8D����TYa�2�������>���`�HQRV�*�Y�\	�����z��.���Q����A�]����I��~�����Y]���{������k�8�\�wR������k���e{���^.|����2���2i��  4c+D
�	GF�W�5��6��t������uHo��Kx�T61"�\^]��9G�
O��5#��d�0]���w�hE���g=rg������Rg�v����D�u��v?�q�0e/���S���\��(T��[�l��!��d�������Aa�D�w�Od7+ |"{����k)��0L����Z�;-��D �`�����C��"*�P �8��AdJw�# �6��;J	
&������!��.�KAa�%�� �F�/#d#U�?��a��T�eQc�Y7^Q����0��(��I`#�>�����������Lr9��_#E{gdVi>HT�A���J��0-�s�3�0?��V�A��+�����fC]_I����>��v1/oYNW�-���
?d�B<i5��k����>���Fb��>�0UH3"��}���(��_/���S}%��b�rf�F=w�%v@�:!��]���k���PB�w
�%Nk$���:������+8���'^b�-S�
���+�#)����LIT3@�s���+�#�@y�GRy����|$Z�7���@��]����.�F"Y���zoR���w-��S���xM7[�����Zo]�0������*�H��&��j������	���Wt�^Jo4������U������dgY�������Q�m�y?6#��K�(�!��������j��8+���x�%���_*
�G��n�����e���r�����Y-"�'S.���F��NI%���4F�c�����N��=�\��}�
�>R��4�`:]M��H��m��I?��U>�|�9���C��^�"��6^�9t*\:��lYi�CFyB�W�<�{�C'J�G�n�vU32�*��sG9����1��t5��&����za
��j��K�����D�W�*�J�@z�hc�>*Q+���)�D����G�"]�~�'���6y�"_c"��gR�c��l��Za��*M�Wk_�\v�[�h�����gy�����������](���O9��N~m��%��ycs���9c=r��w�#Y����z�^])VWY�g���h�#�5A�F�9���U� ��>������y���1+����*�d���X���A&��V8���c�7��Q�\�h����Y�6���i����d�v����X��"nT���^+}E[��t�</e�5��o�!Tf
|��?���p����E�U�����S�H�%�uN�6]����j�b=h���0h7�w�N�yy8����_���
|�\+fv�	W�A9$���B�2�����2vf,�S��,�h:s�Ol:��xC����h��Q���)/����$9D�FY[ ��`��z��1��d�=W�N&nS�&(3����v
b��9�C!k���y�G}�_��)�NCf�������n���7�	�Q�c^:���s���p�����d�n���a��H�K� ��D�e�l?7�����;��l?3�s`h/��~��s�
�B!NV2v94�-���P�n��gk�5�
�� Tz[:��;�Ce�	/���U��5����#��`�aj���ER6h�U�<@
�������q(�����4eO=�l���U��)�k�h�U��SVh�d�E�B�RX���f;w��Y���qPD@h�U��:9k���Y����@Y������7`��:�1��~�J������C���g�Fk�pMO���Tfw�<F�cv���x�x�p�J�8h0/�FNDo����������s�=�� #�@��1���M�Y�s����v������J�,������*��y�XW��/��}P��Eg�����=��Z�A��]b��0�:+����� �
�h�$oQ�#�'�}�a�����C�3Uh[�`���y��2��j/���s�Z�$��b�B�����Yz����TM�Aa>=<������x��x�r�!�L��w'#n��@#7f�e���
o;��7�������{�ZcUs��(�a��L�9���S`������Z�������Y
?V���Lmcm	fz0=�"�\}���`
T#*��s�����@�{8����Fb_Hs>{��aD�1{Z�q��� �P��x���s�j�]K�A��/��zu�V����,��U�u��&����������h�8omQ���s_P�T{}�9L�h�y�*[Z����tH9�/Pg�6p���6������._fAISi$J�
Wg;��r�Fx�����@k�?�,�0�5��
	�I�OYE-��^Q��
��+a�a#�W��!^��������[��eD���4f�n�Vh&�7����5�*��t�Db?�S���Z{P�_�K�N8�.�j4�2Y����V
fI=N�Q& <��y��b�9,E+��&��B
-X�5���4r>sW����G9h�=@|��Gk���W�\I�.�x��KRt�� c�h$<�X�7�:�r�B]5���[�Tc��9�1��fg�D�������8>@�B���y��%Y��5i^(�P���v���$#� ��1h����sb������h�DFz�]�H�n���r��#��o��L�D��{�����9q��z���_
��(hj�[T����]��X����t�t���W�+3~�[������$R���p���H��w�������F��L���	��z-.w�����9�K���sSiU��(��� ��T+f����!�8_�
?�bVo����M�e�h
>60E~H61��
l/�J*~�0���v_�r�^
��p���/����{>n����j��F��|���0�d
��N`�Ga����?�DU���j�	�51�d��b.��Wp���������;�Dl�h�5��9�d�	�����RFs;��?�A'lq��/e4���C�{����Xs� �<ha�v%e�vv��F12�;��hY�T�
�;�nEMwB!�A�;����C+<"�[,�^��_Y�*da����X��OV�����Cz�Z������d�V����C�5�;�r��k�+[k�r?wP� q�oP�[�l���.V%"���$@���iF8����=��P�����;q
r($�~qVA7�Sq�v����-k��DaD_$�����s���~89<�x?�g����?��c����1?�9�N^�\��uE�F�+FBb�\�x��4��\�N��I\W��I5��M�+�����X������"��d��)}9 |����AdJ
�����y@��d~�b@���+
������X�� b�L:��� �".l�(�X��h�����%��C�I�a�_��R�P��A�r��.;'^Vv��W�sDo�������zZe��L�����-z��}�z:�QS����mq�|�a�s��4����
�a�������A�@�m��o�a{�U�kR�E����Mc�F"�����!�0x�6��J
_;��DT��qvI@��)!)�@9V*5^����r����O�6��k*��x�~G�=]e��7�/�P���:����}����YZ���f��
[�!��������j�A�OQ����vd%�7�>��(e����J�,Rc�r*4^�D���jD��!�	EY�1��C���6��O�m�9����
?�9+A>W����Y7�*�c���N�g������s�(a�)��Cx�����$�vD�	��NDp����q���Km�����BG7dU��)'[�D���uj���S�����vX�E�w�?�^�Q�F��A�wD����$��Ly0��4="�����P?>B�Z������e��k�q6z�Eot}(�lK�6
��%��I_������a�.��A�
o���R�HU7�L��B�!�{��UnEO�����M
�QP��n	�������D���8�0�&me���/HJ7V��aN����>�CX��x��o���#E/�S��(����t����Oo�]���f��7��5�hZ�!Yy�\m���0���H���^�s��>(�@��f�]6�.��{o��
��:���Y����z��)X��D��~� h��������KgJ���ss�L���~�JQ��>"�,���$��]�X�\����$����hF����U���2�P��T�������)�M��a�+E!�H�������F	�?�O����t��H����!2D�fd��Y��%��\����<�Qn�i%CF^0�	n����A�y�����O�
Ua2��g����Q>!�/�|�-ppgj�}�#��9����2�P�B�Uy���S���3gz�������������s2;�]yo0��"P��\�F���R�cW���cbV
4M,;'��i����jBs{���Xq�x�8Krp�*~j��Z��|�e'�_>o+~�������w�w�y5h�T�kw9s�\(r��`��
X[��������A�Wh�NV��h��.&'�����X�
>%�!z�������{p���������t��:�n���s�:�	����
8��:�!�
1����{B�x%}B�?>�a�4d;w`��:���QA���|u������4:��k�+����g�"��z�`��M���_(x2���o���}g|�:����og�3�$�d4��c��u�r�d�w?�(N
�� ���P��w�����eB�{��������c�����C��!i�e��QPQ�����@q���ND�[�p�NVx�� ���7D�����!�^��R ����v��O��6@��{�+�_���`����j��������W��n�9��V��=�O[����q|u��ke�X��Fq�q�`h�<.h�����h{�E��o�-�[U/�,������2�>��3?�k �X�`-%��l.�
V�7��sm�dw���
V��;����J�eY���N��U;}�����dsgcV
QN�*�+�<��L
!8�e�@�<�v}}��������������^�p������PN��G�/3�U�]&|��;k���j[f������i�H��E�5�V�q��J<��]i���r�O��Y��&��(^8�d��F�/YzK�'��J$�|��^t�:�1}�xA�S��������z|�;x%
>Yg��s�>��(�R��_�,U4"��%A�|��C�=�t3���������
����;�����\=�b���.N����h6`�6{��s�4��0Ki���t@��l|cf<'Y��3����I�5Zh�#E�f�-'Ej����Ias���1�fP%W����"�r�����*����?(r��h@�����m�}��>�8AQt���5�o4����V8/����� ���Y�*�2K��-
�����:8.l�}�c�U��m�����E� �]�����N#�s���;4{�|��2�#�Si���+�|�N��>_��_�^%������H�~�����Dxxd� ��K����jl�'����7?3�����5������]F:(�l��o������6���~�������J�of��v��Z���M&�$��E�=�
)��d{�<���|���
����?����sZn�;DgH�X����n�:�F#��g,�i�DL�l��c���Q�F#��V���}%���
���W�O}�Tfa ���Y��&;��/P��xA�0C��]�7�`�:�����'�SL�����6����&|H�h�{�����T�f�8`��N���	I_���r~���-���5O:(������$����f����l5Fh����
|��;+g��i������db�������D���sf�w
��P��`��7zB>f��o|����C~��J^��>s�DT��3��!��mGuR����9*T1�0�5}	�W��474^����6pD�`�_��rWk[F9h��@�Q��/���$Q��R���d�aI��t����k��5� �w������|W�l�{��,>�
��I�:rS���~�v4�*~�����lc��}x�����7�����1�wT���M����h�����wA��N*FBL`V4�*c�e
� ��Ih�9u�)�7�p:��LD�X���������$e��9�
9�S��v���?�����*�����
������og�_��>���jjA�\��L��sC��l�m>/���
����C�\5��dD@��r�_�t����!����&�h+�lu33����K�G�j��F�C��_
<�F�'-h�\h�'2�7�����2/�!�<���Y��EC���x
	)#�nL|������g.#����vHW�0��"
���[^,�S0�;���+R!�y�"�^0/����@����C���o6���n�e�k��,������t:>}���B�y�*��Y���*�:)�v����</�Q:
{��!�m����u��m/IAvu��[}C4�7Egf����%V��yS��v���A���Yt��=��;�g����q��V�e!��t4�+3�q���4�[�.,����~,B�\=�Y������E`�,X��h���)�P���Q�����6%�H��q��DM�)"�����)#�r[P�o�����IDR3���#����������b��-�j89�{'�8N+J�J#��b��9?Y�
W��hgP�
_���y�,��G��S��%;����^k����
�0T'#���1���S�?L�6�������K�X������T�i�2�\��A�r�w3�TN�� �*x��c�oV�e^��#>�������;���T�o�p7!f{y����uv��|�w��LYT\��AT)	���P��d��pOY^���C�>_�������*�FjL���r�
"�F�;������o[aL�+(��C7; ���U�>;n��
H
����HP��j�^mQ�JP������y�j�8��$H�=-W�|B������!��(������&��`�[�2���%�!��h�L��Q�E�#+�����z�l����5V������bO���J��[�|G��������S�rF�^�1__���: <��7�(�#&�V��;���^b�_o�-��H��0[K�).+����8�,������e����Y����<�#��f&���kek�;	 �rC�5O���}��#��.��vf�[�s2��,gv���K��'�T#J���S�]F8��:$�~&Z8���b[{9i������
I�718�y�Z)/W�{��\���^��Q���33��&NF�����i���=9��k��\���|J;�5�� �NZs����|'�?�j����5Sj�i<��d��t�5fP�G��r�4
��U)g�kZ|�$�Nh�G�p��25��
�2}��f��Y������e���(=�l��.$����"��h�)K��
���'Q/*H|M\�'#o+��Gj�SA�#�C���M����h��T�HV�x�
�x\C/�?��\���G}�`<���\LS��=Z��C@�����zfC�16�c��7�����h�5:$�{
{�(�+x���
Md9�B
�J���������\��r�izU�^�G�����z��_4�VtMv����I"����):=�<=��53>*,(h�9�</��vO
��>�vb������A�RW�ch{�2P���CB
5 ��:��f� ;f�.X������9�{�B�D �6������u9j�L�]X�� �=�qHz�G���!�O�^���nRJ: ���0tQiT��b6R�/�Gx���[�VU�������}��u*�G��O��8S����Y��':�0�X?���i�3�Jzl1D�,W����
)�Qf�3�|��T���arw���,����������[���Whn���������9�@��m��(��T�P��a�FcG�s���9�Q� RI	��������ch��$�U����s'��''�5���[���
?��s�k���9L���7�O���=w��Rc$�����^v�(���T^`���8��E�d�{�������)�oF�b�V���
q���%cVg����!	/��B����[���@�0~���u��V�n
�m��Oc�JG�,[�����*x�No�*�8����c��n���c�e����2Ve�����p�-�����[�qs>:X���8�K7��U���	�F��}4I�bq@@P=� ��A��*���L���������3����_�B�,��nl\#�7��*���a�[�'#;��/ Y��c�>��d��-�/3)���9/)Z>��@��v����*�@���4as��������o�w����R��my���%<�yY:��+�������_�k'����R^!6��Awjf�H����6!��`R�8�����������uNA�_
�yA��}VA�<Q��h��@�o+J���~.�c�x�u�Mr�W���^��j���"����^�!��9\Y���2�|���}���U��(�@y�g�x�Q������H$����oJvC=��B�������8��
6xL>B����^3f�al�G"�sz�m�O5�pS
S�4$���ye#f~���F��r4��-�`���U0�������9Nn�CU�"�' }G�1�gAy�JX;_Bx��)za~����MgEy���r��/{uI��QcDL��5s<;y3��@]t�O�_���}���Q�.y�3>����Z���
��X��Y�Q�+,����vP����Z�b��wx)+D�"�a������o��|^�&�TD��{�F������c�
�O�u���cX��g����>R�XK6��8S?Q���8k���
-�e7{���G��b`^�]������t�K6f��+����S�I�
<U�����KB�$���YNn�ok3V�H���r
�
������4�o`{���f �/� ����^��z��w��E�A���Z�������`�A��o�b����C������/��qh�r�M
�P�`>�A����[r���u�s;����C���+�A�o��q�v��1�{}�c�)��"����{w3� ��
���eP����v�������Z������gb/j�s��S��W����a��Z�����������u�d��I��yeE�u;sO��Ir�\����<�
-��"���.��0^����*����#Lo�����gI��,�n�n�1��D�����%#�W��\3���������Az���O�E`��o�F<�� /j����� `
��p����4�Y��� R�C#yf�P����L@9n�o���y�Z��M�y�<���
	*��8|��(e|��CdO'�z�2�#P^@�4^���,��T��Ho�g����8x�n�Hz��1w���.x��1<��@7��
=�Is1����O�1��<�P^yqL�[SJJQn6�sJ�J|�N�;�'��|�x�� ��#�~r�C�*�t�9�R��L^�I�FY��R�����`:��k�e]v0[~���Hm_L|�$UHU���vT'��_�1D���{f�����OF:�>�:���%�G�*�h��w�������C��1C8M N)Y��`���k�|�����'����x������b�{#	:��Jx��!��i�X��$nSd�fzK���t��l�_~�JZ�_�������>���X���u%�<q$2�]����d���q�EO�I�f����zt�+���%vt������V�;�h�9�79/	Z�t�9
�G�W�F��*L
<��%*
0�~s��*kd����W<���>�N��D����l9�WT��P��������K\����7�)�)-/0�z�9��
�(�LY4�[
���p���p�jlf,|�5�1B������|���}��E��p����Yy����(,l�	�����r o�5������5�9�1i%O�q�C�'O����B����w]�L"�>������"+���qgGc�����������5�dj.�
>��MF>�K������8GM&j1�@j����Kv�@g������]��GM�K��
�����4KqnR�8�f3#��}N��%"�;$���o}7�s�73[(Hl���7wwk�O���2���_{B�6����j����A��0$h�������}�!"���!���J������&��5^93nS�s1z#7��vt[~�p�����<iF�o�����L�������NK�p���"��Q\�=�9�x��M���gh��E.)�����9����J;�B-�E,>��7�����7���^�h&�&��*48|����{�5�^:>,�4��z�Q�[�{>�i�
�H��{$��qh���ldaA@{���q�yc������;����T��tsMx��gR���Y�9�&�m��v�R�o��t��?w8����j�f:�8d�0-H��F�����Q�%�v���l��m��L�--!S@����}�������F���\�Yo��?��A�<t(FB��Z���z���8��@�"7A����&�4�ju����>Y����}H�?U��� 5a.�7���=���9����p����};s�L��(S�]�����"PT������G@nV�,@�}���p�B4Xn���=��m���)��:��{��B�q�(��dUF|R�>@�
��.OQN��o'L�����N��P	��_�~P���1���JUF���}M������������-���2:�c������h�C�����7$���Fo��s63�E���b�6�$�g������)�!6y-�V������z����A����h]0��267�d��N��������S�-!�]a��G����R����d����a��G����l���!+����s�Ev�������?8�p{"�s���H�C]��13������ow��P����.f�8�S��2 :��(����|��6[/�����kg�m�����
��������#�����Z}��)�S�h Q
pQH�����������M�'�e���/JU~�o���Tw�������=��I�)1�D�n��&��f�L���r���!w ;���v����|����A�J������471�"t�dG/���[�����GO��9������=���y��	%�w��o������S��
l`�q���m��l����n3����������=D�����f�y�
��p��3��	�~?T�9z���n��7Ov�
�n�)�2����U��StF#f���l�%�7N��Dj#<��>��9hG�H��e��v��^y&/��k>d3>j�����UQr����L+��������kd_���*G����������$��j0���1#�����m�~!���?����dl�'�pE��@#�����W_ze-�[��}��������2�`���L�������9�DO(�E�p��5����d+��r
�D� :v^�2������� 9p���@t�m�QFw�UQ��cC4
�!�;��(�~����+��JX#i
sLA���&(8�v�A����;��+�����b������y�1�2�,���^��P� �2FdL%�p
�������
���������� '��e���3���]��`ys��wR���J�%x����\�fZf�
�������0�Y*3Tx@K��2.�����y�B�.W����O����m1r*9��9B�$A������g������xq��F�����4����K����;����9��MY�[;e ��{h�. �!�t��	�gq�7!�Cay��QX��������a[�md���"����eP��A?�#����wH<��������I��]RXx����}g�����(|L���FC����7^���&?!����0���i(�`!l �9!i*����z�blzN���Z�#���>�+p4��� �%�<@
� �>~>���m0p�d�G��\K��v�����H!c��5
�[�����I!�	�F�jo�v�`%c�56����{y�z�3h��lo��#�GaT�F����$���������qC�w����#��3@�qs�!n���������_
�/j����.���X5o�L���;����
��cz�T������!�)�|������;�j�F���q
����|L�������{A�87L�����^j���6�&~�P�.�X��<��-�K�,���#�Lj����	!�
�C�����;ju�{���!x� 	�t�U�E)2k�69+u����o;�3�������J��Pp�)@l����+L�O��
q��y�R�c���)j]�6�U�����	�H���9�>��Y<t��t��e���l0'������y�!�"C���A7\E��H�����������*�������f���X.	��K��R#9f���������R�+I��`!���vZ���?�����(�t���v�>���f��/��D_ �'e���~L-"��G|��P�tM�����$#O�����&D��[�>"JG��L���x�R�X������.�:�G�����>��3 �)�l�?O��L�4� ���U*�E�f�������Ln�;��%L�h�O�;L���fl�;}w����[^��|�g���O�$c�G�W�8�W.�WN��d�������rkF\E���2��l�%(���\�=�5��{t���-?+���t�;D��K2�
��Dj������w%�������sJ�U�-�����kh�c����k���?��$�oc�k]>&���������u����J�}�XQH"v�*�Xk���z���W#_�MH�>��N�b��(n�Z-�Lt�v/�H��6�����Q#�F&��L�Lt���(�����	���#�@��;���P�. ����7���ik����I��acaWQ���d9?d���a���Z5�c��`���8������#Uz6S��n�$�(������]>%���F
���%	���A�������������C�$~�m��
����s,3�y���,b4��8H�	�1��B��5�6�����V��<�a���u�g
��m���6�a1��G��)���k8���N~6�oC��W|~������}�����E��s7�&k�B�y�3R\ �!��`�#�d/3�����"	~��K��=�"������	�a��%c=����`�q@gE�;�0j�>�_���xpV��w����f��A��w�foH� �7|���uY�%��
���lY��n�Fw����1��JRh�M(��,:�a�lGt��^
���y��l'|�U��r�Lm��U8w�;�pD �<�?�zs�H����?&��P�*��Q�+�ePwa����DZ9���l
��9IL4	,��z��P�ds+;�x��@�9g�(�l��
i
Q�a)#!7I�\������Fl^���v�5�R�G�*��6!�����E?9�C��5J�6X������Mz�`o��}��Nyv �
j��-��-�{��+"��=n
����^>��I����#���y�bD���0��&1H����������$���.&�	��)L+���
�u���M>���b�|6M.�&��|D����#�I)?i|2�95�f:���c�AdR�o�����A�����3i�Aw��FC�T��e�>O~Fo4��9�{��5N�Snb�|�:8;�Z����g~ST�	>�q��U��mQ�<���h��/G��|O���-m�N��7��^��H��]?����H>���\8����]���'4G�'H�1���/X_��m�\C��I��c����iuv�m��� ������M�d����W��-�������\��<����c��Ho�C��A���N��!'��������q���zDe�;\2)�ml9[��f����e��p�,��2Z��;E�M���cv����9<����Mq��1w���GN.wa��6�t����lj�%$+&���*�y
��������k�'XXsM��Y�����)?��F��b��A����kW�
�'Q��O�h����Y�,����������h��2�I����>f�70S������j4���q����^S�w�
��������]����S�]�J��P� �}�����\J� i,�k���dl����i�����������745U=&��*8�#BQ�	�k�SV��Ov�I��M���{5�Ap�o�z����f�h����K6���f���d�S<7���>��=�jv/h\&�$���(�������Ov�<H��<)�z��<yjI�{�$����j�
%Z�n��spy��fr@(4����7'/�N�GVYI�%���Iv�J��t��_����8��S������|��F�7��aF��e}1�����]��*�� ��i:�b��2�,���TS"2#�^3�dRF�r��b�y/��[�?�5�+0��#���K��f���- :��2Z���=���*&����W(������F�������������9yp�k�����FL����x!f_��G����l�-y$�d��5U�	�@��MR�W��V��H�$�R
$��8��a]UF'�a>�:���G�`G^&`�S�w"fnm;D�B|5�/�k�=����a��"'��c~y�eF���H�8r����+����=�,���n��$9��=����Y�<qr�F��\���9.)�s�<�C��G�N�N3�B���������54C_�kAcH�o����EC�a��
�|K�{Nt42�<L��*N ��D���@������I������QQ��*cR4'C]���1�����!�� �9j�<��v��c����"��Cj���N����]���&p���1z5�q0=�c	
0��R;|KZ R���b���j���A!G��������2�(�L`{�C�-���z�
����Q4o�������@&�n���5P��f'�U��yB�^�1@������f|<�K��N���XZX��@�p���I�3�e^�����,�|7���jE�:�^�Hb�j���}B:0�@��m\��4P=3u�A!�oB�W��������k=����	�(F
����O�()B�`UoY�YGNgC��C�w�� ,M�p������-M�d�����^k	��em4t3�����j5���z9��W#����v����	�vg�aT�e�n�|����$1qPJT�]�k8�L3f�>��~[��H
.�� ��]-cL�RO����h�>�u/|!?�1'�����t�	|���Mo~N��, �=�?��7��5�G;�I��g�w��u;4>�
����7b����'Yf����E�(z������2|��]J�ld6�G4����K2G�$9���pn��f�!��e��%�nB�!U"e`�t�����)>G��)Q��J4��6���z-�#�q\�[x�}�3��LkREo0�y�]v^a�k��l8����[:�N���K�Q�q�]��u���$b�����B�cZ#���5���$�"��]�_���V�[��s`V��c�D���0���F>��]�k4�z�(��C��c]x��v�z��l�e�dl]�I)p�v���-n;l�2�J�*��O�q��#K��=�dFd��)�vn$�����7�[���������3�sRF�����=f��qg2�������	%w7����c�����!V
��Z�Zg��D~j%:�!N������Eyj��xq����<21W��0��bh?�	z	'�d����Z��5E_�
���.��i�S
N!�C���_E�DM�?�TU�o#{u8�{�3>��H�������rR�����!�z<���U\�^5��kR@������9�����M�V������-�%*��(�uV�F�������j���<�������3+)���$� 9�I��g��u^���&i&�����T�[C��������a@'�$-4"/��?��^�v�!����Gdm��#�������~z��S6g#���5$�V�S&A�s���*�sJF!�'HH����~}�MY���k?r@��~s�����zOjEsnq�<��;%��������Scl�
�	���pq�oKI�����������?6B�
Qr�5"]����wC�ZW�-� ���e&����o���������'���:��NYv�N�����}'�L�@O&��Gg��#Y���Ewp��Sa�=��7����@�����W��7����Wc���6��bgl����e����M���>'"�������� �7Vy��#;���������g2+p`Z����y���A���s��{��^�������@���������V��A�9��m������+��c�5@a��;x�8~���^83��|�3�d��"`2��x%B���K	��}FF�o���\F��G�jF�������0���@�6X;����������Zm������Y����Y����F�
Hak]sG��Zyk�@Ea�`���
%�� cwb�d�Q�^�n�����A��������C+�E<x��d�
(L�����6�sBS���+�1q������������8�u�]�Gt?>��
�/���A�4x�0
��4h�fe��0E���*���
�����#��k�rQ���N���iP��?��5aT!�i,a����0
�
>�Y�4�&;7=�^�<;g�������D���wz2���>���J/�����u�����h\q^���C��m�V�p����iQ)�����]W�K�1OS�U�o�Z�����`GT��<v�����<���<B2��Ux���A���Fp�5 �V�!���c��O������J��9�$�A���/�����z��]�9���W�q6��
���V^o)��f,Q�J������������E���I�����@sYA�*"
t2�VO�K��k�b�h�'����!����:�GL8�s�~O��p��9������M���/�9Q�K�wSv�3��BKS�����P�jC=�hbR�]����hi���1>��b�2�m�n�Lt����6�6g��c�k�gr����;�
�K>��$3�����;��"W{c&����ke�g��&�4\(c�������av��))���
}v���=��qu^!�+W���>�����X��`GO�0]%��	����3r
Z=�?\�'�W=��U����2hu�����`�D�T�,��$c7Z���c���.x��	"W�����[����M_����oiAt�+)����i�cm�(���h'�Q s[G��w|o��s�#<@���clPt�4��?�����1��q��j�d�;���������C��fGV�2�u�~�.p�]@��|�0�=�A�0bR)Ea!{���
>(���"��R}PS���5cA%��5$c'�H�F�`�$�������]������*������u:~�4��I�"/s�+��+�h��f��������eDyU�[^h�*��P}#����. sr
%2s�@�e��.0G;�G�9���w������`_���:G�������%�2�/(P����yf9�g��3���Qo���c��p���������o+���f�=�T�����e�'�	)�MD�1��SG2y���?FB&y�Si$�n%?G]��s�	��)X�&��iA��Q��T�����LzP������O�6wvA>o=����H���
�C�g�������>��m`G�)'Gf*M4[���S��a����j30+>�[��s���bl��
���<<q��DC������0n��>��.�~�
�d�+����4��f�D �5��\y�|A
�+F&W�h��Xy�����8���;�*+>�9���_��;�+>Y��[s?fB@2��bz����j�%��!C�+8 _Z�����V|��>t�N^F@�.����@�����iB@���1P�a�*�li�v���/������p�����sp������z�5"Tc���|��l=2�������@�%��C��;5Xo�y���2F?f���jo�u��IeQ&�����x��j���se11(����m������1��\F�w��:;�G�����@�kL��|�B�����0�Y���^����W�!�v�7 �tcf�A"�B(�����LU�y��#Gd����zL�*x�����jt�iP�8h�M�u?7�&;���'0h���)��|*)J��]^���'!0tz���w�$��$��x4#et
����u>>y��=0l������v������O��!���(����?�jc��`]P���~"o,���_���W���:=�CrL�i���oiPr�����g�rTk���5
����.�D�5;���5��`jW�����:R	>�����Q������fm�*���;�����U:Mc��v�9��c[O�L��]�i�L����#v���<2��T#*�qB�
)��h8["co����.���^v�����w��o�ZT�A�~�Q���.fi$��{��������v/<z��9�����n���s��O���E���hz����v�W=Z�5��xQ�f���q8�57�i�&�x���m�'���3a�Z��b��|�s�	�4�E�2vYiQ�6`��m:=�2[ x%;'�0C=���"���
�����Rp��=vhK�e�9~���Qt�(�8z�(W���.j�,o���rn�inZn�����Z����&'_���!'^I�?b���<�w�I�$S"��D���I�#8����x��gt%��_����T��M�Q����,"�dI�'�������Z�9�d���}�d@���������6ih��f���G���a�hi����)i�M9r_��t�@�e?"�&
&���L�N^B_��py�D���c�������4\�w���O����7���n�H���l&S/�ND�P�&et�_����$�~!��KpK����#���^U�+��	D���Ix��i�.�,���I���|A�����s�q�M\��]�]]�Lt�
1�F����g��z������n#"�|��E�C���[�J�:����U��A�(���G�u��0�B1�b��k�$f����;#��|���"�[ �4����Rp�R���B���b��T��h�4e\� ����8|���1�"�f0���0_���}%4f+N�&]t�3v��!��`<��1/�Z�$A��tY0����L���V���������I�5�I.B�2��1��{k�u�)�p_Sp����:��X1�=�C�F���L	)���y��'��kl����s��u����
�b
A���@��*<���LV�*lz��F��6P�+�W��}���	<w����\u����=pCC���q��##6'���������3#�� �ybG�1��AY����|����!w;��]j���k��z{5P;���pm-�SB��o
�	���$$;x���F�(1(@�AT��
��=�_��mz�w)�8���l��|r��1��jB���B 	�x����4P;���X��k��R����W�XQf����&1��� ;���A�Mb#����t?�W������t��L����z��6��7�q����� M�������I	�]b��69(\����&I�9S9(��i,06�����{�&
����_�4�z~l_"��3@�A�M��iP�ASl9!��������h�f���!2��$/��1���(Q�ML����~��i����$������1�^0����
hd�k}�.�w�$�t^����K�Nj�M�����y���$#J�}�� ��
�.�'�#��~�����z�Gz��8��2�#����*���r�k{1#���?^fE��7�����0�|S�����t
����������f-���S���ATv<2��}=�V��yY���t�,��2��!�9G!��Q�%�+�{�kX��;U�����	mpQ�"�CW<t ��m��7h��������'T&yF���F���CFs�C���p/K@CSu������8<���^b��x�K�(z�����{��#q
wP��2�Q�E�����ss����nz4K/���Ge�?Y�=���L�=���4��-I�s���`�����9, �B��J���_���61Om
���,�k����U���s�dV�~e��Q��Z�*>��������E��!Y� ���[b���D� ����Z���0VU�����7$�|�\D�{�{F\3��*��D/PH"e������`�����1�����D%�}L.<���<#��������I�Pr�4Um8~�b������R�at�e����siD��I��nYo�G*�	�[�k�V�9�Dsi%:�YmA��y<��r���T{vv<�\V�G�.a�$x��W��6����H��=��E�!��c]��������"�'DG�!)��ox��n`��G��]���C@�B���<l�H��\^�35G������G���s����^��M�t�����Vq���h�'�Hv%j��S�����{����]#����=���
�{�k�GK���k�+l�l I&e���7���������0�@��hB�_(��/&LL���.*�A������3��q���>�O�Df3��C'&)���W�zY4%9��2��u�(�������`>���>�o_����w9l4�=���&���2�I����5S|E��N�9X�@��	)��Cf��@�J��*�N.A=��$p���)>�'��3_��W��a'�7���[�e�j�A���h������������s���Kx�;���j���S���*2;<�3'�-�����x�;o��.d�d���2D��\	�l3�g���p<�3����3#�� ����E
���}#T)I�zh�{�B7�sJ�l�xsI�dA�9����������qIGJ�?&�,"�W�
P?y��/����([`�Xi�vX��/��D	5���`�]/�1���b
�k��n�a���p��\2B-8��zH�����N��
��{ ,�1/3���u������;��,�@s57���\n}V0��Q�w_�8h�c���|d�f+���M��H�3|5���a�?� �����WD�/&����@m����G^L��
�[����W�!n��-��I�^/��������.M���A�(>jL���A��n����%�����u���+B��uir������P��n����?��Di	�D�c����K�y�y��������Wx�$et:j,�
98��b��_,���Y��m%'�����1���b�WfV�_�\AQ���]��*l��o��r/��}?I8���;��e��*��EGw���v�|R�z��1�����>cq�J�>����K�iw�Srwi��,��p��Z�������N[~������C-&�K��;<.��0�I���dqn	�����&x�x[����R2�@p��E�@��
����}���snO�k��d����Em��+1?�)t%�&HK���IF4�g�LQ[��d���	O]���I����^G�H3(�FCGT�1����*��������ZC�K#���=M�;�P����,�\zY<}�n�Y��7�S�
	�5_v�4�-�������|�H�53��%)|�C���v 6=�����At���?���\�~{vuu�s�&��o(��Y�i>����E����3Y~h��%rH�!�i%�G4��bG�x^C�5x�5$c'���7�id�
�2Iq����Q
��!e�XiL��)����w���	�6���;�����#I������)�Z,�(�����)�V!���?������
#�S5"����rV$��j�A�D���~���3� _7l�'�e �['��������n�����!�M��)q���kv�M+�����n>��ZAA1�<X��=y�d�+��C��F
�������h�������^:��M=��H���W�(����[i$����W�������_�j ,37D�=���Q��A��� ���o�	�_���0pF��W��e�;}�.RF�u�go(��tB.	�Fc�#�P�%_�#�23���F���?"6c�e��s3+)������D���2�`���I\�nB�9@�*��N(pS�(=t��\i�Kd_	��1&����)nHB,r�4����[�k����4[c�s%a����x��I����=�y��G�����s��wc��tz\��h�7(/��v��=����9�����Qq'`&��������Z���}�!�����56�)�r,?����\�`= �:������ld����@
\����;����AU�����
Tl�#�,�i0/��w��1���}#��e�Kx>���-=xd��V�[�0��q�� L&bk^������gFf�g�Nc�.d���g������4G#J������.���'�"x"��@�Y�9kx��������=��r�14<��
��(��"�%�����:�B�����M�V�@���7]������ ��C�~�W��C���zK��.��6�~����*��m����d����$�mb0�I,Z���`��n�G����������>�(�#��7���MH�5b@�-���`g����@]��PC�gk(*|2��{���L��HNm���4�f�x
����M�3�[9��|��B��	:���!/���Xw?a�Y!l���E	"�M��C����MA�$�8�dl�0����ke����
\���dQt�
L���d�3x�y��B�^i���������p���O�z��(������_��89�d�|������6�|�N{�t#�-L�}O���-_�����>F~E��s���|��^fE��7����ma�1��Bv�4Z��i]�q��w^N�k:�?�����&��������Z������@��d
��*c������w�4$��������QF��pv�s�%t�LW�.]��~��5�@����d�@!E����o2�[H���Uz��4 y6 y�|��}d���d�3��W�L	���q�I�0j�Ib1�����p���D)��z�����E��~w�
��
8�S2Zs\�l�������#cc�f_���iH���z�d��UZC�W�����G,7��7�{';�	D����2��}��?���^��@{�(P�q���������a���������1����M��oE���W��{}�����E���G���D�{}dD��)�V����cE9{�^�aD&�(��vS��Za�-����	W���@�*#~���Wc��;��>�3/��DU��'�Y��}0������c�d����.��UZ�Y�U���5V�t^s^����r�7�T�_���t��|���8U�?l/� m�ia?�Fv+��v�*������9��W�)�,�[������n�{�������C��\��|�������
���q'#V�!�PD�����2z��Z6h�<Bb�~���~�o���Z��~����$�F���H������kK:�%�����bGa�L��(��2��B�)��?����=erG�{A���������>�>�����U��`b�G�2�~����?�@j�>�
u�������	��b���d��k�*�N��� ]��p'�.E��S�#B���6��ZrV�z�m�!$!�=��y`(�:L�3�=g�~!U��P!�����{P�f����Y��[c	+����,��"[�`.W�]T��x�6{�(�>���1��f�;�f�^��C`]J�zj��-gV�������_-[�"o���*\pq���@���Q4e�V��3��$�,@�I�j�c�v�7s�h$��;�t|�st�'�f�Fw?���tg��z������FC����$R2�b���-�<�E�%�����}����D\V����6�|�2����e���@�]�7�����X���^4H+xy,>y�H+xE�4O��3�'��cru���h��@H�BKA�5�����gl����
Y��������<�9�|5�A���
��s#�*�6�bV��8#�!��mk1��Y7=�������
q�����-,�a��}��r��� ��W���ZI�AX����)�;�0��@y����q�Y��������pK�!.�V��!g���R��Tx�h�X
56�| �=Y����H�G��������Q�j��@'�F����D�o�����i.OI�i�U|YE�(
����l�������MZ�{��8L��qx:O�VC���� ���&��2�-��y�z�J#-�\"j_�1=���@�7]���1������%y_���w������.d���z|L.}������}J$c���;��T<�w	�����6���K�v�!+�]�\O�Lt8y�)�2�hZ��V�3�l��c*;�(:�)�2zM+�V��}V ���a,�o����5�;{
��yaii�5�����
Q��N�G�����$�_����503�M##�e3��PX e`����8��(,rm>f�)�{�/�y�.�I~�@����"�m��������d�7�c��^A���@���_2��������2��JpD�s#�
�����;+��8��Ze��%[��9%�"�e�Y>��^ko~-��x�������h���{�^�LA2������g�/�m��'Tv:JU
�~���TK��OQ�~\�.������x�Ny�Jh&�|9��������(�9���q�_�&��`tZ!h�$�@_
]�1�Nr��T'=��3RF+�7�����q��V;+�)����s�j=� �3�Q����Q�����G�-�I��T�jsd�&����.�#�C1�V���r���7����W#�{���?�`�������=y�<�+��,VgO#�q��A��?)�X:�a1��L�D��3J�3�{�
��cLy@w��<���U W�~��N6���:���cs2����0���z)�g"��mv���o��+=������yM��Mf�Rq�
����&$bk��y�z��������I�e'�������N�=pMD��)��k�����w�z&��$|���&�1�����}�U�
x�^��q|d�2��B���I��������%�2x�9����oJ����_��;���i2Ji�������-H���a��'Qx�"B����#uu�S����Fr�3�)/��r�4���f����I�Z�Vo.��I�{��L��8�g[<UK��a�����FL��?4�N@"��1@���9\��7^�K��kA����[�=lz9A�� �<��tj!��	`��z�<m`c��R}���z�;m��.��5�]�:�Q�O�|f$xD�����|L$xD �^Xl8~~n4tK��2u*�����!x�l��`(J�	)��8���*$�9t$H� ���`l	�aF�[��h;�G���C�x� Z����/b��V)$�Q������k�G�����P2IS���;��/��h@C@�abz����5F2$<w�I����Z-a�
1�� ��0^K��[�����-$�4:M�4X�����EK~[7#B����������/6�HdN��qR`�6f-�
��T�����A�9n��@�v�4��h�3n�y�,����&
J���|>�@F�n^j���M���8wh�pqA��j�]��Z��������te�h��fG@���h�s�{��g����_5~����(��n%����
u��U�i*�-#$��S���H���{j5�������b^��J$#�{�� ���J��U���o3��}G��6���=G:��T��Aep1cAKS��!�p�J\�N�<y��26+�n���b*8��{�(�;yx��L���+9Ii���?d�'�^<���K7' E�2:k�A�a���'���[<^�
���G������������c�V��{[f~���VZ�P[��S������+�R���;��6{��ug���"���E�dtrO�k ��_e�p���A�uEA�k:��f�D����Z����8�~9s?Q���dl������olO�����[F�|��(���wu��j�6.o�|�#ME��~���zQ?�p���\H>��$nx-*�u)�I! �F���@�@2�D'9��%��d�A6�Z
��$#�}8�z��mo��p��5"���C��;l(���}6��O]�Fd�`�����+�Vn�M^a�&�A�-��B2���g;����F��1Y�4�Dw(:���2���i��c�������A�55�/�������bm ���X�1�"��(�Y}���U��O����|�,���'}E����w�b=�i��_������}���7��o��U
��]����k<]�mH����3�b�=�lS�-��l�	�\RF�z�]�����"��	�6��Y8
zs[��I>
}GYQ��O
wO����|���������i����0MV�[��k�4��w���`��I^������T���`�l�����&('��B9���1o�w(�C�E��|j�I��"���H�c��&���f�����eo���$g�?(�f���}��~�/g7A)�u�����(5����-R�Q��t ��b)�����M�4���]|�`Jd�;X|�Q�##�s�uqj ����`��=lYv�	���������s�}oO�H��9���2�$cI�T���u��,�w&{�����������/������1A��w�q��>�����`�p�����K�r���	>����0�PB����iG�w������d�@��i����'�u�n�VsD����V�k���{Z�&t��4�2AFi�k���.�hn}7���m�Dn�� �,@eRh�-����R�8���BrA�z���w�=*��n�{�W�8�P�;�m
�{��yNg���&�@e��������
dYD����E��[�@l��(#X���w �-�,#X6���s$��ay"��p�����#)�Kcj�e�oU�*�GH�����	}�2e,�6�N�o�������'ax���|�V���zK���IGQ`b�����C��V��m����A�2��������V�,@^!�n,:��&�<
)0������#37Q�d�#y5�����X����[��|���Gl�
B��j���llT��6N��j4�$��S4����A/&�����7�q��i���$��=�`��4��r������
~��$9I	$zMK�-����'����YE��*���{�2:M��,���BK����l����c���	j���SNr"S�:��/��A2vY2Am]`*9�'��k��:��S�����2�L�J�p�r����]�$���<������*Ng���A����&t��Dg�l�Q�I�Pr�4U�D�f��9��.�z�e.�;���2z�F����?(�3�	�3�k��/���!�Pp����
������G���z
Z�����I�}�%[�[�����������geL����-�����c�zFQ����N��1\���:\IF���l����~�������[�q��-��d������H��p�uI���;��.���2b�wp`��\L'YfE������~#2w�w���#?�p��������&��P�����)��$��I��\�	9�)x9e7^K=K�%��F��G-r��~2xE]����2��$#��S ��L��k�s
E��@vLb2�Hmc�;k+/5�����$c�L�.���G����z@e���,�I��"a��!�Z����9�4U�~Rf����[f Y�	�q��	�'&/�����k���wP�*�7$g���\���!��d�,�+�Z-{�6w��S��sH2�C�I>M.(D��uM+�]$b��i%�����#`���@�E2�2yn�$+_�Nt}�0��w�>��|7��P�W���	�p?��s(m�J�M>���:Y��{�wv����#�W��0��d�e��IU�_ALo��
�;p����$�������@a��W����|�L����U�y�~�����g������D� :���2�z�)yo����
&U���2�
�B����>�����#)�|s���j�e�m�VUF��r��)�$���/���&��)V����X)@���������*3��(S����3�I��q�����0x���i-s���BP�����A	��MIa&��\#��+�q�����fI����N�������R
��-)�ii���.h	�
��y�'��jl��6?��jLEa�t=l;�
�ZH����rO����
�������@������D��H��sNd�F��?3V9"��I(��f���v���'c�6��$��;��1�-����Ir4��	U���1fX���a��&c&c�� �����@e�����P_����o!;#��_�>y�P��$���
�����P1�Y�[2�p�^�Ut1����	xn���j��>SeQ��PB]�=��A��z����.
����buA��[�����G�{��{�9t�[�_=��vk6�~�
V���P���q�[�l�((�kg���	��P�F�&�5���)pG�7����GJP��6��@D���7�tO���YG���6���D �v���&
��iPI���eM�sI��52bvmK�N�����%w6�&-��.�"o�F�6L2�og�c�M�������/���;&e�����P�U�����b�V�����<zZmZ�83�_��]=�n1�%�
�I���e39�t���
�dl	��p��k?i�`����I��o��I4N�^�����������~t�I=�>���Y[��5��#��"���Imx�����H<ywl����Hf�v<~Ga���V%���(��i�r����?�J��Pt|�!e�x��5	1��qO����$�'�|sO���hqu��9=C�=z ����T�8�{�zSTF:��]�$��c�R���7��%�>�I>i%]r�e�pJ(���H�V��u.�i-���L6]K`{��-*>�"�L�|E��5t�����\���&E�n�KT�E?�DU�.	$i���|���+��w�*P��"����M��0#�^�|�j!d�2�����U�"���[��-�Z��n%
�#��S�����dE3MMUF��d���y����~L05K@SSe����z���?&{�B�K�oiA�@�����/����*�q+N��t�h�hj��X�:�vY�j��&�`�o�)���2v�D|��A ��Kp�L��-�oy���������v���'�PE�A����
�\K��h�F�x����MLN��oS��-����}���&�#���)�u�m�.��C5�����{R@ro����������zF�_(�~��2����c��,~��8����_�,��Q����g��=w�G��d�]4D���
/�f������k%�d3�@���}&?Y�r�M�N3�h��iy��=����C�$�h! ym����3-�I
�G�=��_�u�����[�+��znI�^�g2(���M'��Ce�S��}���'��{AU�����N=���wb�����2��� xU�y����Mu��:�5��Pd�\�\ly0���Dz�Y��pM�
A�y[�,[����
��`0���SrrJ,\��pA�E�{s�������d�����B�����>�{�so/��h4�������AtA��������-��Q4x��C������;^�{������E;����ez�?��1c�-��V�7 ��������,d�^�n`�o�b9e�K-#�%�P�PV�7Pr��`�����Ch>3�J>y�VF5X%�,@�.�����y���O���m����9��&(�. �����e�,[!
�v~�~��X��P�?�bt7�jLt�.����
T�j`�����
�4@h�V�P#@}M8�v�*��A�����\����;���i�s��H���}~aE� 409�PS�p��?�t_&7P�w�Pxe�K3/��H�ju ����n���
��{"�������
#47Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#46)
Re: WIP: WAL prefetch (another approach)

On Wed, Sep 2, 2020 at 1:14 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

from the archive

Ahh, so perhaps that's the key.

I've tested this applied on 6ca547cf75ef6e922476c51a3fb5e253eef5f1b6,
and the failure seems fairly similar to what I reported before, except
that now it happened right at the very beginning.

Thanks, will see if I can work out why. My newer version probably has
the same problem.

#48Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Thomas Munro (#47)
Re: WIP: WAL prefetch (another approach)

On Wed, Sep 02, 2020 at 02:05:10AM +1200, Thomas Munro wrote:

On Wed, Sep 2, 2020 at 1:14 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

from the archive

Ahh, so perhaps that's the key.

Maybe. For the record, the commands look like this:

archive_command = 'gzip -1 -c %p > /mnt/raid/wal-archive/%f.gz'

restore_command = 'gunzip -c /mnt/raid/wal-archive/%f.gz > %p.tmp && mv %p.tmp %p'

I've tested this applied on 6ca547cf75ef6e922476c51a3fb5e253eef5f1b6,
and the failure seems fairly similar to what I reported before, except
that now it happened right at the very beginning.

Thanks, will see if I can work out why. My newer version probably has
the same problem.

OK.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#49Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#48)
Re: WIP: WAL prefetch (another approach)

On Wed, Sep 2, 2020 at 2:18 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Sep 02, 2020 at 02:05:10AM +1200, Thomas Munro wrote:

On Wed, Sep 2, 2020 at 1:14 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

from the archive

Ahh, so perhaps that's the key.

Maybe. For the record, the commands look like this:

archive_command = 'gzip -1 -c %p > /mnt/raid/wal-archive/%f.gz'

restore_command = 'gunzip -c /mnt/raid/wal-archive/%f.gz > %p.tmp && mv %p.tmp %p'

Yeah, sorry, I goofed here by not considering archive recovery
properly. I have special handling for crash recovery from files in
pg_wal (XLRO_END, means read until you run out of files) and streaming
replication (XLRO_WALRCV_WRITTEN, means read only as far as the wal
receiver has advertised as written in shared memory), as a way to
control the ultimate limit on how far ahead to read when
maintenance_io_concurrency and max_recovery_prefetch_distance don't
limit you first. But if you recover from a base backup with a WAL
archive, it uses the XLRO_END policy which can run out of files just
because a new file hasn't been restored yet, so it gives up
prefetching too soon, as you're seeing. That doesn't cause any
damage, but it stops doing anything useful because the prefetcher
thinks its job is finished.
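
For illustration only, here is a minimal sketch of the limit policy described
above (XLRO_END and XLRO_WALRCV_WRITTEN are the names from this message; the
enum and function below are hypothetical, not the actual patch code):

    typedef enum
    {
        XLRO_END,               /* crash recovery: read until no more files */
        XLRO_WALRCV_WRITTEN     /* streaming: stop at walreceiver's write LSN */
    } XLogReadAheadPolicy;

    /*
     * Hypothetical sketch: may the prefetcher read a record beginning at
     * 'lsn'?  Archive recovery currently falls under XLRO_END, so a file
     * that simply hasn't been restored yet makes it give up too soon.
     */
    static bool
    prefetch_may_read(XLogReadAheadPolicy policy, XLogRecPtr lsn)
    {
        if (policy == XLRO_WALRCV_WRITTEN)
            return lsn < GetWalRcvWriteRecPtr();    /* advertised in shmem */

        return true;    /* XLRO_END: limited only by the files on disk */
    }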

It'd be possible to fix this somehow in the two-XLogReader design, but
since I'm testing a new version that has a unified
XLogReader-with-read-ahead I'm not going to try to do that. I've
added a basebackup-with-archive recovery to my arsenal of test
workloads to make sure I don't forget about archive recovery mode
again, but I think it's actually harder to get this wrong in the new
design. In the meantime, if you are still interested in studying the
potential speed-up from WAL prefetching using the most recently shared
two-XLogReader patch, you'll need to unpack all your archived WAL
files into pg_wal manually beforehand.

#50Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Thomas Munro (#49)
Re: WIP: WAL prefetch (another approach)

On Sat, Sep 05, 2020 at 12:05:52PM +1200, Thomas Munro wrote:

On Wed, Sep 2, 2020 at 2:18 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Wed, Sep 02, 2020 at 02:05:10AM +1200, Thomas Munro wrote:

On Wed, Sep 2, 2020 at 1:14 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

from the archive

Ahh, so perhaps that's the key.

Maybe. For the record, the commands look like this:

archive_command = 'gzip -1 -c %p > /mnt/raid/wal-archive/%f.gz'

restore_command = 'gunzip -c /mnt/raid/wal-archive/%f.gz > %p.tmp && mv %p.tmp %p'

Yeah, sorry, I goofed here by not considering archive recovery
properly. I have special handling for crash recovery from files in
pg_wal (XLRO_END, means read until you run out of files) and streaming
replication (XLRO_WALRCV_WRITTEN, means read only as far as the wal
receiver has advertised as written in shared memory), as a way to
control the ultimate limit on how far ahead to read when
maintenance_io_concurrency and max_recovery_prefetch_distance don't
limit you first. But if you recover from a base backup with a WAL
archive, it uses the XLRO_END policy which can run out of files just
because a new file hasn't been restored yet, so it gives up
prefetching too soon, as you're seeing. That doesn't cause any
damage, but it stops doing anything useful because the prefetcher
thinks its job is finished.

It'd be possible to fix this somehow in the two-XLogReader design, but
since I'm testing a new version that has a unified
XLogReader-with-read-ahead I'm not going to try to do that. I've
added a basebackup-with-archive recovery to my arsenal of test
workloads to make sure I don't forget about archive recovery mode
again, but I think it's actually harder to get this wrong in the new
design. In the meantime, if you are still interested in studying the
potential speed-up from WAL prefetching using the most recently shared
two-XLogReader patch, you'll need to unpack all your archived WAL
files into pg_wal manually beforehand.

OK, thanks for looking into this. I guess I'll wait for an updated patch
before testing this further. The storage has limited capacity so I'd
have to either reduce the amount of data/WAL or juggle with the WAL
segments somehow. Doesn't seem worth it.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#51Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#50)
5 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Wed, Sep 9, 2020 at 11:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

OK, thanks for looking into this. I guess I'll wait for an updated patch
before testing this further. The storage has limited capacity so I'd
have to either reduce the amount of data/WAL or juggle with the WAL
segments somehow. Doesn't seem worth it.

Here's a new WIP version that works for archive-based recovery in my tests.

The main change I have been working on is that there is now just a
single XLogReaderState, so no more double-reading and double-decoding
of the WAL. It provides XLogReadRecord(), as before, but now you can
also read further ahead with XLogReadAhead(). The user interface is
much like before, except that the GUCs changed a bit. They are now:

recovery_prefetch=on
recovery_prefetch_fpw=off
wal_decode_buffer_size=256kB
maintenance_io_concurrency=10

I recommend setting maintenance_io_concurrency and
wal_decode_buffer_size much higher than those defaults.
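
To make the new interface concrete, here is a minimal sketch of how a caller
might drive the single reader with its two cursors (greatly simplified;
PrefetchReferencedBlocks() and ApplyWalRecord() are hypothetical stand-ins for
the prefetching and redo logic, not functions from the attached patches):

    static void
    recovery_loop_sketch(XLogReaderState *xlogreader)
    {
        for (;;)
        {
            DecodedXLogRecord *ahead;
            XLogRecord *record;
            char       *errormsg;

            /*
             * Opportunistically decode further ahead and start prefetches,
             * until the decode buffer is full or no more WAL is available
             * without waiting.
             */
            while ((ahead = XLogReadAhead(xlogreader, &errormsg)) != NULL)
                PrefetchReferencedBlocks(ahead);    /* hypothetical */

            /* Read and replay the next record, much as before. */
            record = XLogReadRecord(xlogreader, &errormsg);
            if (record == NULL)
                break;
            ApplyWalRecord(xlogreader, record);     /* hypothetical */
        }
    }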

There are a few TODOs and questions remaining. One issue I'm
wondering about is whether it is OK that bulky FPI data is now
memcpy'd into the decode buffer, whereas before we avoided that
sometimes, when it didn't happen to cross a page boundary; I have some
ideas on how to do better (basically two levels of ring buffer) but I
haven't looked into that yet. Another issue is the new 'nowait' API
for the page-read callback; I'm trying to figure out if that is
sufficient, or whether something more sophisticated, perhaps including a
different return value, is required. Another thing I'm wondering about
is whether I have timeline changes adequately handled.

This design opens up a lot of possibilities for future performance
improvements. Some examples:

1. By adding some workspace to decoded records, the prefetcher can
leave breadcrumbs for XLogReadBufferForRedoExtended(), so that it
usually avoids the need for a second buffer mapping table lookup.
Incidentally this also skips the hot smgropen() calls that Jakub
complained about. I have added an experimental patch like that,
but I need to look into the interlocking some more.

2. By inspecting future records in the record->next chain, a redo
function could merge work in various ways in quite a simple and
localised way. A couple of examples:
2.1. If there is a sequence of records of the same type touching the
same page, you could process all of them while you have the page lock.
2.2. If there is a sequence of relation extensions (say, a sequence
of multi-tuple inserts to the end of a relation, as commonly seen in
bulk data loads) then instead of generating many pwrite(8KB of
zeroes) syscalls record-by-record to extend the relation, a single
posix_fallocate(1MB) could extend the file in one shot (see the sketch
after this list). Assuming the bgwriter is running and doing a good job,
this would remove most of the system calls from bulk-load recovery.

3. More sophisticated analysis could find records to merge that are a
bit further apart, under carefully controlled conditions; for example
if you have a sequence like heap-insert, btree-insert, heap-insert,
btree-insert, ... then a simple next-record system like 2 won't see
the opportunities, but something a teensy bit smarter could.

4. Since the decoding buffer can be placed in shared memory (decoded
records contain pointers, but they don't point to any other memory
region, with the exception of clearly marked oversized records), we
could begin to contemplate handing work off to other processes, given
a clever dependency analysis scheme and some more infrastructure.
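
As a rough illustration of idea 2.2 (a sketch under stated assumptions, not
anything in the attached patches): suppose a hypothetical helper has counted
how many upcoming records in the record->next chain extend the same relation
fork. Then the redo side could pre-extend the file once:

    #include <fcntl.h>      /* posix_fallocate */

    /*
     * Hypothetical: if look-ahead shows 'nblocks_ahead' consecutive
     * extensions of the same relation, extend the file once with
     * posix_fallocate() instead of issuing one pwrite() of 8KB of zeroes
     * per record.
     */
    static void
    preextend_relation(int fd, BlockNumber current_nblocks, int nblocks_ahead)
    {
        if (nblocks_ahead > 1)
            (void) posix_fallocate(fd,
                                   (off_t) current_nblocks * BLCKSZ,
                                   (off_t) nblocks_ahead * BLCKSZ);
    }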

Attachments:

v11-0001-Add-pg_atomic_unlocked_add_fetch_XXX.patch (text/x-patch)
From 7e3c960b798d12f385dc0643f530c2700242b05a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 28 Mar 2020 11:42:59 +1300
Subject: [PATCH v11 1/6] Add pg_atomic_unlocked_add_fetch_XXX().

Add a variant of pg_atomic_add_fetch_XXX with no barrier semantics, for
cases where you only want to avoid the possibility that a concurrent
pg_atomic_read_XXX() sees a torn/partial value.  On modern
architectures, this is simply value++, but there is a fallback to
spinlock emulation.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/include/port/atomics.h         | 24 ++++++++++++++++++++++
 src/include/port/atomics/generic.h | 33 ++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 4956ec55cb..2abb852893 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -389,6 +389,21 @@ pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
 	return pg_atomic_add_fetch_u32_impl(ptr, add_);
 }
 
+/*
+ * pg_atomic_unlocked_add_fetch_u32 - atomically add to variable
+ *
+ * Like pg_atomic_unlocked_write_u32, guarantees only that partial values
+ * cannot be observed.
+ *
+ * No barrier semantics.
+ */
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	AssertPointerAlignment(ptr, 4);
+	return pg_atomic_unlocked_add_fetch_u32_impl(ptr, add_);
+}
+
 /*
  * pg_atomic_sub_fetch_u32 - atomically subtract from variable
  *
@@ -519,6 +534,15 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
 }
 
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+	return pg_atomic_unlocked_add_fetch_u64_impl(ptr, add_);
+}
+
 #undef INSIDE_ATOMICS_H
 
 #endif							/* ATOMICS_H */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index d60a0d9e7f..3e1598d8ff 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -234,6 +234,16 @@ pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 }
 #endif
 
+#if !defined(PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32)
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	ptr->value += add_;
+	return ptr->value;
+}
+#endif
+
 #if !defined(PG_HAVE_ATOMIC_SUB_FETCH_U32) && defined(PG_HAVE_ATOMIC_FETCH_SUB_U32)
 #define PG_HAVE_ATOMIC_SUB_FETCH_U32
 static inline uint32
@@ -399,3 +409,26 @@ pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
 }
 #endif
+
+#if defined(PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY) && \
+	!defined(PG_HAVE_ATOMIC_U64_SIMULATION)
+
+#ifndef PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	ptr->value += val;
+	return ptr->value;
+}
+#endif
+
+#else
+
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	return pg_atomic_add_fetch_u64_impl(ptr, val);
+}
+
+#endif
-- 
2.20.1

v11-0002-Improve-information-about-received-WAL.patch (text/x-patch)
From 3f2cd160120613c925e078b8e1b38155d9b17c78 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 10 Aug 2020 17:06:21 +1200
Subject: [PATCH v11 2/6] Improve information about received WAL.

In commit d140f2f3, we cleaned up the distinction between flushed and
written LSN positions.  Go further, and expose the written location in a
way that allows for the associated timeline ID to be read consistently.
Without that, it might be difficult to know the path of the file that
has been written, without data races.

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/replication/walreceiver.c      | 10 ++++--
 src/backend/replication/walreceiverfuncs.c | 41 +++++++++++++++++-----
 src/include/replication/walreceiver.h      | 30 +++++++++-------
 3 files changed, 56 insertions(+), 25 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 17f1a49f87..fc7311b5eb 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -894,6 +894,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 {
 	int			startoff;
 	int			byteswritten;
+	WalRcvData *walrcv = WalRcv;
 
 	while (nbytes > 0)
 	{
@@ -985,7 +986,10 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	}
 
 	/* Update shared-memory status */
-	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+	SpinLockAcquire(&walrcv->mutex);
+	pg_atomic_write_u64(&walrcv->writtenUpto, LogstreamResult.Write);
+	walrcv->writtenTLI = ThisTimeLineID;
+	SpinLockRelease(&walrcv->mutex);
 }
 
 /*
@@ -1011,7 +1015,7 @@ XLogWalRcvFlush(bool dying)
 		{
 			walrcv->latestChunkStart = walrcv->flushedUpto;
 			walrcv->flushedUpto = LogstreamResult.Flush;
-			walrcv->receivedTLI = ThisTimeLineID;
+			walrcv->flushedTLI = ThisTimeLineID;
 		}
 		SpinLockRelease(&walrcv->mutex);
 
@@ -1351,7 +1355,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	receive_start_tli = WalRcv->receiveStartTLI;
 	written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto);
 	flushed_lsn = WalRcv->flushedUpto;
-	received_tli = WalRcv->receivedTLI;
+	received_tli = WalRcv->flushedTLI;
 	last_send_time = WalRcv->lastMsgSendTime;
 	last_receipt_time = WalRcv->lastMsgReceiptTime;
 	latest_end_lsn = WalRcv->latestWalEnd;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index e675757301..7d7a776ce8 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -284,10 +284,12 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 	 * If this is the first startup of walreceiver (on this timeline),
 	 * initialize flushedUpto and latestChunkStart to the starting point.
 	 */
-	if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
+	if (walrcv->receiveStart == 0 || walrcv->flushedTLI != tli)
 	{
+		pg_atomic_write_u64(&walrcv->writtenUpto, recptr);
+		walrcv->writtenTLI = tli;
 		walrcv->flushedUpto = recptr;
-		walrcv->receivedTLI = tli;
+		walrcv->flushedTLI = tli;
 		walrcv->latestChunkStart = recptr;
 	}
 	walrcv->receiveStart = recptr;
@@ -309,10 +311,10 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
  * Optionally, returns the previous chunk start, that is the first byte
  * written in the most recent walreceiver flush cycle.  Callers not
  * interested in that value may pass NULL for latestChunkStart. Same for
- * receiveTLI.
+ * flushedTLI.
  */
 XLogRecPtr
-GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *flushedTLI)
 {
 	WalRcvData *walrcv = WalRcv;
 	XLogRecPtr	recptr;
@@ -321,8 +323,8 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 	recptr = walrcv->flushedUpto;
 	if (latestChunkStart)
 		*latestChunkStart = walrcv->latestChunkStart;
-	if (receiveTLI)
-		*receiveTLI = walrcv->receivedTLI;
+	if (flushedTLI)
+		*flushedTLI = walrcv->flushedTLI;
 	SpinLockRelease(&walrcv->mutex);
 
 	return recptr;
@@ -330,14 +332,35 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 
 /*
  * Returns the last+1 byte position that walreceiver has written.
- * This returns a recently written value without taking a lock.
+ *
+ * The other arguments are similar to GetWalRcvFlushRecPtr()'s.
  */
 XLogRecPtr
-GetWalRcvWriteRecPtr(void)
+GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *writtenTLI)
 {
 	WalRcvData *walrcv = WalRcv;
+	XLogRecPtr	recptr;
+
+	SpinLockAcquire(&walrcv->mutex);
+	recptr = pg_atomic_read_u64(&walrcv->writtenUpto);
+	if (latestChunkStart)
+		*latestChunkStart = walrcv->latestChunkStart;
+	if (writtenTLI)
+		*writtenTLI = walrcv->writtenTLI;
+	SpinLockRelease(&walrcv->mutex);
 
-	return pg_atomic_read_u64(&walrcv->writtenUpto);
+	return recptr;
+}
+
+/*
+ * For callers that don't need a consistent LSN, TLI pair, and that don't mind
+ * a potentially slightly out of date value in exchange for speed, this
+ * version provides an unlocked view of the latest written location.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtrUnlocked(void)
+{
+	return pg_atomic_read_u64(&WalRcv->writtenUpto);
 }
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 1b05b39df4..84f84567cd 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -74,14 +74,25 @@ typedef struct
 	TimeLineID	receiveStartTLI;
 
 	/*
-	 * flushedUpto-1 is the last byte position that has already been received,
-	 * and receivedTLI is the timeline it came from.  At the first startup of
+	 * flushedUpto-1 is the last byte position that has already been flushed,
+	 * and flushedTLI is the timeline it came from.  At the first startup of
 	 * walreceiver, these are set to receiveStart and receiveStartTLI. After
 	 * that, walreceiver updates these whenever it flushes the received WAL to
 	 * disk.
 	 */
 	XLogRecPtr	flushedUpto;
-	TimeLineID	receivedTLI;
+	TimeLineID	flushedTLI;
+
+	/*
+	 * writtenUpto-1 is like flushedUpto-1, except that it's updated
+	 * without waiting for the flush, after the data has been written to disk
+	 * and available for reading.  It is an atomic type so that we can read it
+	 * without locks.  We still acquire the spinlock in cases where it is
+	 * written or read along with the TLI, so that they can be accessed
+	 * together consistently.
+	 */
+	pg_atomic_uint64 writtenUpto;
+	TimeLineID	writtenTLI;
 
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
@@ -142,14 +153,6 @@ typedef struct
 
 	slock_t		mutex;			/* locks shared variables shown above */
 
-	/*
-	 * Like flushedUpto, but advanced after writing and before flushing,
-	 * without the need to acquire the spin lock.  Data can be read by another
-	 * process up to this point, but shouldn't be used for data integrity
-	 * purposes.
-	 */
-	pg_atomic_uint64 writtenUpto;
-
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.  But we do need atomic fetch and
@@ -457,8 +460,9 @@ extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname,
 								 bool create_temp_slot);
-extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
-extern XLogRecPtr GetWalRcvWriteRecPtr(void);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *flushedTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *writtenTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtrUnlocked(void);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.20.1

v11-0003-Provide-XLogReadAhead-to-decode-future-WAL-recor.patch (text/x-patch)
From d90e5e09974b47704d16dd29feea04f5c2c5cc13 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 10 Aug 2020 17:06:21 +1200
Subject: [PATCH v11 3/6] Provide XLogReadAhead() to decode future WAL records.

Teach xlogreader.c to decode its output into a circular buffer, to
support a future prefetching patch.  Provides two new interfaces:

 * XLogReadRecord() works as before, except that it returns a pointer to
   a new decoded record object rather than just the header

 * XLogReadAhead() implements a second cursor that allows you to read
   further ahead, as long as there is enough space in the circular decoding
   buffer

To support existing callers of XLogReadRecord(), the most recently
returned record also becomes the "current" record, for the purpose of
calls to XLogRecGetXXX() macros and functions, so that the multi-record
nature of the WAL decoder is hidden from code paths that don't need to
care about this change.

To support opportunistic readahead, the page-read callback function
gains a "noblock" parameter.  This allows for calls to XLogReadAhead()
to return without waiting if there is currently no data available, in
particular in the case of streaming replication.  For non-blocking
XLogReadAhead() to work, a page-read callback that understands "noblock"
must be supplied.  Existing callbacks that ignore it work as before, as
long as you only use the XLogReadRecord() interface.

The main XLogPageRead() routine used by recovery is extended to respect
noblock mode when the WAL source is a walreceiver.

Very large records that don't fit in the circular buffer are marked as
"oversized" and allocated and freed piecemeal.  The decoding buffer can
be placed in shared memory, for potential future work on parallelizing
recovery.

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/access/transam/generic_xlog.c |   6 +-
 src/backend/access/transam/xlog.c         | 103 +++-
 src/backend/access/transam/xlogreader.c   | 620 +++++++++++++++++-----
 src/backend/access/transam/xlogutils.c    |   5 +-
 src/backend/postmaster/pgstat.c           |   3 +
 src/backend/replication/logical/decode.c  |   2 +-
 src/backend/replication/walsender.c       |   2 +-
 src/bin/pg_rewind/parsexlog.c             |   8 +-
 src/bin/pg_waldump/pg_waldump.c           |  24 +-
 src/include/access/xlogreader.h           | 127 +++--
 src/include/access/xlogutils.h            |   3 +-
 src/include/pgstat.h                      |   1 +
 12 files changed, 698 insertions(+), 206 deletions(-)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 5164a1c2f3..5f6df896ad 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -482,10 +482,10 @@ generic_redo(XLogReaderState *record)
 	uint8		block_id;
 
 	/* Protect limited size of buffers[] array */
-	Assert(record->max_block_id < MAX_GENERIC_XLOG_PAGES);
+	Assert(XLogRecMaxBlockId(record) < MAX_GENERIC_XLOG_PAGES);
 
 	/* Iterate over blocks */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		XLogRedoAction action;
 
@@ -525,7 +525,7 @@ generic_redo(XLogReaderState *record)
 	}
 
 	/* Changes are done: unlock and release all buffers */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (BufferIsValid(buffers[block_id]))
 			UnlockReleaseBuffer(buffers[block_id]);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61754312e2..29d21ac438 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -211,7 +211,8 @@ static XLogRecPtr LastRec;
 
 /* Local copy of WalRcv->flushedUpto */
 static XLogRecPtr flushedUpto = 0;
-static TimeLineID receiveTLI = 0;
+static XLogRecPtr writtenUpto = 0;
+static TimeLineID writtenTLI = 0;
 
 /*
  * During recovery, lastFullPageWrites keeps track of full_page_writes that
@@ -911,9 +912,11 @@ static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
+						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+						 bool nowait);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-										bool fetching_ckpt, XLogRecPtr tliRecPtr);
+										bool fetching_ckpt, XLogRecPtr tliRecPtr,
+										bool nowait);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
@@ -1416,7 +1419,7 @@ checkXLogConsistency(XLogReaderState *record)
 
 	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		Buffer		buf;
 		Page		page;
@@ -4345,6 +4348,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 		record = XLogReadRecord(xlogreader, &errormsg);
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
+
 		if (record == NULL)
 		{
 			if (readFile >= 0)
@@ -4388,6 +4392,42 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 
 		if (record)
 		{
+			if (readSource == XLOG_FROM_STREAM)
+			{
+				/*
+				 * In streaming mode, we allow ourselves to read records that
+				 * have been written but not yet flushed, for increased
+				 * concurrency.  We still have to wait until the record has
+				 * been flushed before allowing it to be replayed.
+				 *
+				 * XXX This logic preserves the traditional behaviour where we
+				 * didn't replay records until the walreceiver flushed them,
+				 * except that now we read and decode them sooner.  Could it
+				 * be relaxed even more?  Isn't the real data integrity
+				 * requirement for _writeback_ to stall until the WAL is
+				 * durable, not recovery, just as on a primary?
+				 *
+				 * XXX Are there any circumstances in which this should be
+				 * interruptible?
+				 *
+				 * XXX We don't replicate the XLogReceiptTime etc logic from
+				 * WaitForWALToBecomeAvailable() here...  probably need to
+				 * refactor/share code?
+				 */
+				if (EndRecPtr < flushedUpto)
+				{
+					while (EndRecPtr < (flushedUpto = GetWalRcvFlushRecPtr(NULL, NULL)))
+					{
+						(void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
+										 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+										 -1,
+										 WAIT_EVENT_RECOVERY_WAL_FLUSH);
+						CHECK_FOR_INTERRUPTS();
+						ResetLatch(&XLogCtl->recoveryWakeupLatch);
+					}
+				}
+			}
+
 			/* Great, got a record */
 			return record;
 		}
@@ -10126,7 +10166,7 @@ xlog_redo(XLogReaderState *record)
 		 * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info
 		 * code just to distinguish them for statistics purposes.
 		 */
-		for (uint8 block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (uint8 block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			Buffer		buffer;
 
@@ -11873,7 +11913,7 @@ CancelBackup(void)
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
-			 XLogRecPtr targetRecPtr, char *readBuf)
+			 XLogRecPtr targetRecPtr, char *readBuf, bool nowait)
 {
 	XLogPageReadPrivate *private =
 	(XLogPageReadPrivate *) xlogreader->private_data;
@@ -11885,6 +11925,15 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
 
+	/*
+	 * If streaming and asked not to wait, return as quickly as possible if
+	 * the data we want isn't available immediately.  Use an unlocked read of
+	 * the latest written position.
+	 */
+	if (readSource == XLOG_FROM_STREAM && nowait &&
+		GetWalRcvWriteRecPtrUnlocked() < targetPagePtr + reqLen)
+		return -1;
+
 	/*
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
@@ -11895,6 +11944,9 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		/*
 		 * Request a restartpoint if we've replayed too much xlog since the
 		 * last one.
+		 *
+		 * XXX Why is this here?  Move it to recovery loop, since it's based
+		 * on replay position, not read position?
 		 */
 		if (bgwriterLaunched)
 		{
@@ -11917,12 +11969,13 @@ retry:
 	/* See if we need to retrieve more data */
 	if (readFile < 0 ||
 		(readSource == XLOG_FROM_STREAM &&
-		 flushedUpto < targetPagePtr + reqLen))
+		 writtenUpto < targetPagePtr + reqLen))
 	{
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
 										 private->randAccess,
 										 private->fetching_ckpt,
-										 targetRecPtr))
+										 targetRecPtr,
+										 nowait))
 		{
 			if (readFile >= 0)
 				close(readFile);
@@ -11948,10 +12001,10 @@ retry:
 	 */
 	if (readSource == XLOG_FROM_STREAM)
 	{
-		if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
+		if (((targetPagePtr) / XLOG_BLCKSZ) != (writtenUpto / XLOG_BLCKSZ))
 			readLen = XLOG_BLCKSZ;
 		else
-			readLen = XLogSegmentOffset(flushedUpto, wal_segment_size) -
+			readLen = XLogSegmentOffset(writtenUpto, wal_segment_size) -
 				targetPageOff;
 	}
 	else
@@ -12071,7 +12124,8 @@ next_record_is_invalid:
  */
 static bool
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-							bool fetching_ckpt, XLogRecPtr tliRecPtr)
+							bool fetching_ckpt, XLogRecPtr tliRecPtr,
+							bool nowait)
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
@@ -12174,6 +12228,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * hope...
 					 */
 
+					/* If we were asked not to wait, give up immediately. */
+					if (nowait)
+						return false;
+
 					/*
 					 * We should be able to move to XLOG_FROM_STREAM only in
 					 * standby mode.
@@ -12290,6 +12348,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				if (readFile >= 0)
 					return true;	/* success! */
 
+				/* If we were asked not to wait, give up immediately. */
+				if (nowait)
+					return false;
+
 				/*
 				 * Nope, not found in archive or pg_wal.
 				 */
@@ -12367,7 +12429,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
 											 PrimarySlotName,
 											 wal_receiver_create_temp_slot);
-						flushedUpto = 0;
+						writtenUpto = 0;
 					}
 
 					/*
@@ -12390,15 +12452,16 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * be updated on each cycle. When we are behind,
 					 * XLogReceiptTime will not advance, so the grace time
 					 * allotted to conflicting queries will decrease.
+					 *
 					 */
-					if (RecPtr < flushedUpto)
+					if (RecPtr < writtenUpto)
 						havedata = true;
 					else
 					{
 						XLogRecPtr	latestChunkStart;
 
-						flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
-						if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
+						writtenUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &writtenTLI);
+						if (RecPtr < writtenUpto && writtenTLI == curFileTLI)
 						{
 							havedata = true;
 							if (latestChunkStart <= RecPtr)
@@ -12424,9 +12487,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						if (readFile < 0)
 						{
 							if (!expectedTLEs)
-								expectedTLEs = readTimeLineHistory(receiveTLI);
+								expectedTLEs = readTimeLineHistory(writtenTLI);
 							readFile = XLogFileRead(readSegNo, PANIC,
-													receiveTLI,
+													writtenTLI,
 													XLOG_FROM_STREAM, false);
 							Assert(readFile >= 0);
 						}
@@ -12440,6 +12503,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						break;
 					}
 
+					/* If we were asked not to wait, give up immediately. */
+					if (nowait)
+						return false;
+
 					/*
 					 * Data not here yet. Check for trigger, then wait for
 					 * walreceiver to wake us up when new WAL arrives.
@@ -12476,6 +12543,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * Wait for more WAL to arrive. Time out after 5 seconds
 					 * to react to a trigger file promptly and to check if the
 					 * WAL receiver is still active.
+					 *
+					 * XXX This is signalled on *flush*, not on write.  Oops.
 					 */
 					(void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
 									 WL_LATCH_SET | WL_TIMEOUT |
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a63ad8cfd0..22e5d5ff64 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -37,7 +37,9 @@ static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
 			pg_attribute_printf(2, 3);
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
-							 int reqLen);
+							 int reqLen, bool nowait);
+size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+static DecodedXLogRecord *XLogReadRecordInternal(XLogReaderState *state, bool force);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
@@ -50,6 +52,8 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 /* size of the buffer allocated for error message. */
 #define MAX_ERRORMSG_LEN 1000
 
+#define DEFAULT_DECODE_BUFFER_SIZE 0x10000
+
 /*
  * Construct a string in state->errormsg_buf explaining what's wrong with
  * the current record being read.
@@ -64,6 +68,8 @@ report_invalid_record(XLogReaderState *state, const char *fmt,...)
 	va_start(args, fmt);
 	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
 	va_end(args);
+
+	state->errormsg_deferred = true;
 }
 
 /*
@@ -86,8 +92,6 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	/* initialize caller-provided support functions */
 	state->routine = *routine;
 
-	state->max_block_id = -1;
-
 	/*
 	 * Permanently allocate readBuf.  We do it this way, rather than just
 	 * making a static array, for two reasons: (1) no need to waste the
@@ -138,18 +142,11 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 void
 XLogReaderFree(XLogReaderState *state)
 {
-	int			block_id;
-
 	if (state->seg.ws_file != -1)
 		state->routine.segment_close(state);
 
-	for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
-	{
-		if (state->blocks[block_id].data)
-			pfree(state->blocks[block_id].data);
-	}
-	if (state->main_data)
-		pfree(state->main_data);
+	if (state->decode_buffer && state->free_decode_buffer)
+		pfree(state->decode_buffer);
 
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
@@ -158,6 +155,22 @@ XLogReaderFree(XLogReaderState *state)
 	pfree(state);
 }
 
+/*
+ * Set the size of the decoding buffer.  A pointer to a caller supplied memory
+ * region may also be passed in, in which case non-oversized records will be
+ * decoded there.
+ */
+void
+XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size)
+{
+	Assert(state->decode_buffer == NULL);
+
+	state->decode_buffer = buffer;
+	state->decode_buffer_size = size;
+	state->decode_buffer_head = buffer;
+	state->decode_buffer_tail = buffer;
+}
+
 /*
  * Allocate readRecordBuf to fit a record of at least the given length.
  * Returns true if successful, false if out of memory.
@@ -245,7 +258,9 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
 
 	/* Begin at the passed-in record pointer. */
 	state->EndRecPtr = RecPtr;
+	state->NextRecPtr = RecPtr;
 	state->ReadRecPtr = InvalidXLogRecPtr;
+	state->DecodeRecPtr = InvalidXLogRecPtr;
 }
 
 /*
@@ -266,6 +281,261 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  */
 XLogRecord *
 XLogReadRecord(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *record;
+
+	/* We can release the most recently returned record. */
+	if (state->record)
+	{
+		/*
+		 * Remove it from the decoded record queue.  It must be the oldest
+		 * item decoded, decode_queue_tail.
+		 */
+		record = state->record;
+		Assert(record == state->decode_queue_tail);
+		state->record = NULL;
+		state->decode_queue_tail = record->next;
+
+		/* It might also be the newest item decoded, decode_queue_head. */
+		if (state->decode_queue_head == record)
+			state->decode_queue_head = NULL;
+
+		/* Release the space. */
+		if (unlikely(record->oversized))
+		{
+			/* It's not in the decode buffer, so free it to release space. */
+			pfree(record);
+		}
+		else
+		{
+			/* It must be the tail record in the decode buffer. */
+			Assert(state->decode_buffer_tail == (char *) record);
+
+			/*
+			 * We need to update tail to point to the next record that is in
+			 * the decode buffer, if any, being careful to skip oversized ones
+			 * (they're not in the decode buffer).
+			 */
+			record = record->next;
+			while (unlikely(record && record->oversized))
+				record = record->next;
+			if (record)
+			{
+				/* Adjust tail to release space. */
+				state->decode_buffer_tail = (char *) record;
+			}
+			else
+			{
+				/* Nothing else in the decode buffer, so just reset it. */
+				state->decode_buffer_tail = state->decode_buffer;
+				state->decode_buffer_head = state->decode_buffer;
+			}
+		}
+	}
+
+	for (;;)
+	{
+		/* We can now return the tail item in the read queue, if there is one. */
+		if (state->decode_queue_tail)
+		{
+			/*
+			 * Is this record at the LSN that the caller expects?  If it
+			 * isn't, this indicates that EndRecPtr has been moved to a new
+			 * position by the caller, so we'd better reset our read queue and
+			 * move to the new location.
+			 */
+
+			/*
+			 * Record this as the most recent record returned, so that we'll
+			 * release it next time.  This also exposes it to the
+			 * XLogRecXXX(decoder) macros, which pass in the decode rather
+			 * than the record for historical reasons.
+			 */
+			state->record = state->decode_queue_tail;
+
+			/*
+			 * It should be immediately after the last record returned by
+			 * XLogReadRecord(), or at the position set by XLogBeginRead() if
+			 * XLogReadRecord() hasn't been called yet.  It may be after a
+			 * page header, though.
+			 */
+			Assert(state->record->lsn == state->EndRecPtr ||
+				   (state->EndRecPtr % XLOG_BLCKSZ == 0 &&
+					(state->record->lsn == state->EndRecPtr + SizeOfXLogShortPHD ||
+					 state->record->lsn == state->EndRecPtr + SizeOfXLogLongPHD)));
+
+			/*
+			 * Likewise, set ReadRecPtr and EndRecPtr to correspond to that
+			 * record.
+			 *
+			 * XXX Calling code should perhaps access these through the
+			 * returned decoded record, but for now we'll update them directly
+			 * here, for the benefit of existing code that thinks there's only
+			 * one record in the decoder.
+			 */
+			state->ReadRecPtr = state->record->lsn;
+			state->EndRecPtr = state->record->next_lsn;
+
+			/* XXX can't return pointer to header, will be given back to XLogDecodeRecord()! */
+			*errormsg = NULL;
+			return &state->record->header;
+		}
+		else if (state->errormsg_deferred)
+		{
+			/*
+			 * If we've run out of records, but we have a deferred error, now
+			 * is the time to report it.
+			 */
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			else
+				*errormsg = NULL;
+			state->errormsg_deferred = false;
+
+			/* Report the location of the error. */
+			state->ReadRecPtr = state->DecodeRecPtr;
+			state->EndRecPtr = state->NextRecPtr;
+
+			return NULL;
+		}
+
+		/* We need to get a decoded record into our queue first. */
+		XLogReadRecordInternal(state, true /* wait */ );
+
+		/*
+		 * If that produced neither a queued record nor a queued error, then
+		 * we're at the end (for example, archive recovery with no more files
+		 * available).
+		 */
+		if (state->decode_queue_tail == NULL && !state->errormsg_deferred)
+		{
+			state->EndRecPtr = state->NextRecPtr;
+			*errormsg = NULL;
+			return NULL;
+		}
+	}
+
+	/* unreachable */
+	return NULL;
+}
+
+/*
+ * Try to decode the next available record, adding it to the decode queue.  It
+ * will eventually also be returned by XLogReadRecord().
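+ *
+ * A caller can therefore interleave calls to XLogReadAhead(), to look ahead
+ * and queue up decoded records (for example to drive prefetching), with calls
+ * to XLogReadRecord(), which consumes the queued records in order.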
+ */
+DecodedXLogRecord *
+XLogReadAhead(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *record = NULL;
+
+	if (!state->errormsg_deferred)
+	{
+		record = XLogReadRecordInternal(state, false);
+		if (state->errormsg_deferred)
+		{
+			/*
+			 * Report the error once, but don't consume it, so that
+			 * XLogReadRecord() can report it too.
+			 */
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			else
+				*errormsg = NULL;
+			return NULL;
+		}
+	}
+	*errormsg = NULL;
+
+	return record;
+}
+
+/*
+ * Allocate space for a decoded record.  The only member of the returned
+ * object that is initialized is the 'oversized' flag, which indicates whether
+ * the record had to be allocated outside the decode buffer and must
+ * eventually be freed explicitly.
+ *
+ * Return NULL if there is no space in the decode buffer and allow_oversized
+ * is false, or if memory allocation fails for an oversized buffer.
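+ *
+ * Space is carved out of the circular decode buffer: the record is placed at
+ * decode_buffer_head if it fits between the head and the end of the buffer,
+ * or wrapped around to the start of the buffer if it fits before
+ * decode_buffer_tail.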
+ */
+static DecodedXLogRecord *
+XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)
+{
+	size_t		required_space = DecodeXLogRecordRequiredSpace(xl_tot_len);
+	DecodedXLogRecord *decoded = NULL;
+
+	/* Allocate a circular decode buffer if we don't have one already. */
+	if (unlikely(state->decode_buffer == NULL))
+	{
+		if (state->decode_buffer_size == 0)
+			state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+		state->decode_buffer = palloc(state->decode_buffer_size);
+		state->decode_buffer_head = state->decode_buffer;
+		state->decode_buffer_tail = state->decode_buffer;
+		state->free_decode_buffer = true;
+	}
+	if (state->decode_buffer_head >= state->decode_buffer_tail)
+	{
+		/* Empty, or head is to the right of tail. */
+		if (state->decode_buffer_head + required_space <=
+			state->decode_buffer + state->decode_buffer_size)
+		{
+			/* There is space between head and end. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+		else if (state->decode_buffer + required_space <
+				 state->decode_buffer_tail)
+		{
+			/* There is space between start and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+	else
+	{
+		/* Head is to the left of tail. */
+		if (state->decode_buffer_head + required_space <
+			state->decode_buffer_tail)
+		{
+			/* There is space between head and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+
+	/* Not enough space in the decode buffer.  Are we allowed to allocate? */
+	if (allow_oversized)
+	{
+		decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
+		if (decoded == NULL)
+			return NULL;
+		decoded->oversized = true;
+		return decoded;
+	}
+
+	return decoded;
+}
+
+/*
+ * Try to read and decode the next record and add it to the head of the
+ * decoded record queue.
+ *
+ * If "force" is true, then wait for data to become available, and read a
+ * record even if it doesn't fit in the decode buffer, using overflow storage.
+ *
+ * If "force" is false, then return immediately if we'd have to wait for more
+ * data to become available, or if there isn't enough space in the decode
+ * buffer.
+ *
+ * Return the decoded record, or NULL if there was an error, or, when "force"
+ * is false, if we would have had to wait for more data or for decode buffer
+ * space.
+ */
+static DecodedXLogRecord *
+XLogReadRecordInternal(XLogReaderState *state, bool force)
 {
 	XLogRecPtr	RecPtr;
 	XLogRecord *record;
@@ -277,6 +547,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	uint32		pageHeaderSize;
 	bool		gotheader;
 	int			readOff;
+	DecodedXLogRecord *decoded;
+	char	   *errormsg; /* not used */
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -286,19 +558,17 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	randAccess = false;
 
 	/* reset error state */
-	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	decoded = NULL;
 
-	ResetDecoder(state);
-
-	RecPtr = state->EndRecPtr;
+	RecPtr = state->NextRecPtr;
 
-	if (state->ReadRecPtr != InvalidXLogRecPtr)
+	if (state->DecodeRecPtr != InvalidXLogRecPtr)
 	{
 		/* read the record after the one we just read */
 
 		/*
-		 * EndRecPtr is pointing to end+1 of the previous WAL record.  If
+		 * NextRecPtr is pointing to end+1 of the previous WAL record.  If
 		 * we're at a page boundary, no more records can fit on the current
 		 * page. We must skip over the page header, but we can't do that until
 		 * we've read in the page, since the header size is variable.
@@ -309,7 +579,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		/*
 		 * Caller supplied a position to start at.
 		 *
-		 * In this case, EndRecPtr should already be pointing to a valid
+		 * In this case, NextRecPtr should already be pointing to a valid
 		 * record starting position.
 		 */
 		Assert(XRecOffIsValid(RecPtr));
@@ -327,7 +597,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	 * fits on the same page.
 	 */
 	readOff = ReadPageInternal(state, targetPagePtr,
-							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
+							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ),
+							   !force);
 	if (readOff < 0)
 		goto err;
 
@@ -374,6 +645,19 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
 	total_len = record->xl_tot_len;
 
+	/* Find space to decode this record. */
+	decoded = XLogReadRecordAlloc(state, total_len, force);
+	if (decoded == NULL)
+	{
+		/*
+		 * We couldn't get space.  Usually this means that the decode buffer
+		 * was full, while trying to read ahead (that is, !force).  It's also
+		 * remotely possible for palloc() to have failed to allocate memory
+		 * for an oversized record.
+		 */
+		goto err;
+	}
+
 	/*
 	 * If the whole record header is on this page, validate it immediately.
 	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
@@ -384,7 +668,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	 */
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
-		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
+		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
 								   randAccess))
 			goto err;
 		gotheader = true;
@@ -439,7 +723,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			/* Wait for the next page to become available */
 			readOff = ReadPageInternal(state, targetPagePtr,
 									   Min(total_len - gotlen + SizeOfXLogShortPHD,
-										   XLOG_BLCKSZ));
+										   XLOG_BLCKSZ),
+									   !force);
 
 			if (readOff < 0)
 				goto err;
@@ -476,7 +761,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
 			if (readOff < pageHeaderSize)
 				readOff = ReadPageInternal(state, targetPagePtr,
-										   pageHeaderSize);
+										   pageHeaderSize, !force);
 
 			Assert(pageHeaderSize <= readOff);
 
@@ -487,7 +772,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
 			if (readOff < pageHeaderSize + len)
 				readOff = ReadPageInternal(state, targetPagePtr,
-										   pageHeaderSize + len);
+										   pageHeaderSize + len,
+										   !force);
 
 			memcpy(buffer, (char *) contdata, len);
 			buffer += len;
@@ -497,7 +783,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			if (!gotheader)
 			{
 				record = (XLogRecord *) state->readRecordBuf;
-				if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
+				if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr,
 										   record, randAccess))
 					goto err;
 				gotheader = true;
@@ -511,15 +797,16 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			goto err;
 
 		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
-		state->ReadRecPtr = RecPtr;
-		state->EndRecPtr = targetPagePtr + pageHeaderSize
+		state->DecodeRecPtr = RecPtr;
+		state->NextRecPtr = targetPagePtr + pageHeaderSize
 			+ MAXALIGN(pageHeader->xlp_rem_len);
 	}
 	else
 	{
 		/* Wait for the record data to become available */
 		readOff = ReadPageInternal(state, targetPagePtr,
-								   Min(targetRecOff + total_len, XLOG_BLCKSZ));
+								   Min(targetRecOff + total_len, XLOG_BLCKSZ),
+								   !force);
 		if (readOff < 0)
 			goto err;
 
@@ -527,9 +814,9 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		if (!ValidXLogRecord(state, record, RecPtr))
 			goto err;
 
-		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+		state->NextRecPtr = RecPtr + MAXALIGN(total_len);
 
-		state->ReadRecPtr = RecPtr;
+		state->DecodeRecPtr = RecPtr;
 	}
 
 	/*
@@ -539,25 +826,55 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->segcxt.ws_segsize - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->segcxt.ws_segsize);
+		state->NextRecPtr += state->segcxt.ws_segsize - 1;
+		state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
 	}
 
-	if (DecodeXLogRecord(state, record, errormsg))
-		return record;
-	else
-		return NULL;
+	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
+	{
+		/* Record the location of the next record. */
+		decoded->next_lsn = state->NextRecPtr;
+
+		/*
+		 * If it's in the decode buffer, mark the decode buffer space as
+		 * occupied.
+		 */
+		if (!decoded->oversized)
+		{
+			/* The new decode buffer head must be MAXALIGNed. */
+			Assert(decoded->size == MAXALIGN(decoded->size));
+			if ((char *) decoded == state->decode_buffer)
+				state->decode_buffer_head = state->decode_buffer + decoded->size;
+			else
+				state->decode_buffer_head += decoded->size;
+		}
+
+		/* Insert it into the queue of decoded records. */
+		Assert(state->decode_queue_head != decoded);
+		if (state->decode_queue_head)
+			state->decode_queue_head->next = decoded;
+		state->decode_queue_head = decoded;
+		if (!state->decode_queue_tail)
+			state->decode_queue_tail = decoded;
+		return decoded;
+	}
 
 err:
+	if (decoded && decoded->oversized)
+		pfree(decoded);
 
 	/*
-	 * Invalidate the read state. We might read from a different source after
-	 * failure.
+	 * Invalidate the read state, if this was an error. We might read from a
+	 * different source after failure.
 	 */
-	XLogReaderInvalReadState(state);
+	if (readOff < 0 || state->errormsg_buf[0] != '\0')
+		XLogReaderInvalReadState(state);
 
-	if (state->errormsg_buf[0] != '\0')
-		*errormsg = state->errormsg_buf;
+	/*
+	 * If an error was written to errormsg_buf, it'll be returned to the
+	 * caller of XLogReadRecord() after all successfully decoded records have
+	 * been drained from the read queue.
+	 */
 
 	return NULL;
 }
@@ -573,7 +890,8 @@ err:
  * data and if there hasn't been any error since caching the data.
  */
 static int
-ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
+ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen,
+				 bool nowait)
 {
 	int			readLen;
 	uint32		targetPageOff;
@@ -608,7 +926,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 		readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
 										   state->currRecPtr,
-										   state->readBuf);
+										   state->readBuf,
+										   nowait);
 		if (readLen < 0)
 			goto err;
 
@@ -626,7 +945,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 */
 	readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 									   state->currRecPtr,
-									   state->readBuf);
+									   state->readBuf,
+									   nowait);
 	if (readLen < 0)
 		goto err;
 
@@ -645,7 +965,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	{
 		readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
 										   state->currRecPtr,
-										   state->readBuf);
+										   state->readBuf,
+										   nowait);
 		if (readLen < 0)
 			goto err;
 	}
@@ -664,7 +985,11 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
-	XLogReaderInvalReadState(state);
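+	/*
+	 * If the page_read callback reported an error, defer it so that it can be
+	 * returned to the caller of XLogReadRecord() later.  A -1 return with no
+	 * error message simply means that no more data is available yet, e.g.
+	 * because "nowait" was true.
+	 */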
+	if (state->errormsg_buf[0] != '\0')
+	{
+		state->errormsg_deferred = true;
+		XLogReaderInvalReadState(state);
+	}
 	return -1;
 }
 
@@ -974,7 +1299,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		targetPagePtr = tmpRecPtr - targetRecOff;
 
 		/* Read the page containing the record */
-		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff);
+		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff, false);
 		if (readLen < 0)
 			goto err;
 
@@ -983,7 +1308,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		pageHeaderSize = XLogPageHeaderSize(header);
 
 		/* make sure we have enough data for the page header */
-		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize);
+		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize, false);
 		if (readLen < 0)
 			goto err;
 
@@ -1147,34 +1472,83 @@ WALRead(XLogReaderState *state,
  * ----------------------------------------
  */
 
-/* private function to reset the state between records */
+/*
+ * Private function to reset the state, forgetting all decoded records, if we
+ * are asked to move to a new read position.
+ */
 static void
 ResetDecoder(XLogReaderState *state)
 {
-	int			block_id;
+	DecodedXLogRecord *r;
 
-	state->decoded_record = NULL;
-
-	state->main_data_len = 0;
-
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	/* Reset the decoded record queue, freeing any oversized records. */
+	while ((r = state->decode_queue_tail))
 	{
-		state->blocks[block_id].in_use = false;
-		state->blocks[block_id].has_image = false;
-		state->blocks[block_id].has_data = false;
-		state->blocks[block_id].apply_image = false;
+		state->decode_queue_tail = r->next;
+		if (r->oversized)
+			pfree(r);
 	}
-	state->max_block_id = -1;
+	state->decode_queue_head = NULL;
+	state->decode_queue_tail = NULL;
+	state->record = NULL;
+
+	/* Reset the decode buffer to empty. */
+	state->decode_buffer_head = state->decode_buffer;
+	state->decode_buffer_tail = state->decode_buffer;
+
+	/* Clear error state. */
+	state->errormsg_buf[0] = '\0';
+	state->errormsg_deferred = false;
 }
 
 /*
- * Decode the previously read record.
+ * Compute the maximum possible amount of space that could be required to
+ * decode a record, given xl_tot_len from the record's header.  This is the
+ * amount of output buffer space that we need to decode a record, though we
+ * might not end up using it all.
+ *
+ * This computation is pessimistic and assumes the maximum possible number of
+ * blocks, due to lack of better information.
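+ *
+ * For example (illustrative figures only, assuming MAXIMUM_ALIGNOF is 8 and
+ * XLR_MAX_BLOCK_ID is 32): a record with xl_tot_len of 100 reserves the fixed
+ * part of the struct, 33 DecodedBkpBlock entries, the 100 raw bytes, and up
+ * to 35 * 7 bytes of possible alignment padding.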
+ */
+size_t
+DecodeXLogRecordRequiredSpace(size_t xl_tot_len)
+{
+	size_t size = 0;
+
+	/* Account for the fixed size part of the decoded record struct. */
+	size += offsetof(DecodedXLogRecord, blocks[0]);
+	/* Account for the flexible blocks array of maximum possible size. */
+	size += sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1);
+	/* Account for all the raw main and block data. */
+	size += xl_tot_len;
+	/* We might insert padding before main_data. */
+	size += (MAXIMUM_ALIGNOF - 1);
+	/* We might insert padding before each block's data. */
+	size += (MAXIMUM_ALIGNOF - 1) * (XLR_MAX_BLOCK_ID + 1);
+	/* We might insert padding at the end. */
+	size += (MAXIMUM_ALIGNOF - 1);
+
+	return size;
+}
+
+/*
+ * Decode a record.  "decoded" must point to a MAXALIGNed memory area that has
+ * space for at least DecodeXLogRecordRequiredSpace(record->xl_tot_len) bytes.
+ * On success, decoded->size contains the actual space occupied by the decoded
+ * record, which may turn out to be less.
+ *
+ * Only the decoded->oversized member must be initialized by the caller; it
+ * will not be modified here.  Other members are initialized as required.
  *
  * On error, a human-readable error message is returned in *errormsg, and
  * the return value is false.
  */
 bool
-DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
+DecodeXLogRecord(XLogReaderState *state,
+				 DecodedXLogRecord *decoded,
+				 XLogRecord *record,
+				 XLogRecPtr lsn,
+				 char **errormsg)
 {
 	/*
 	 * read next _size bytes from record buffer, but check for overrun first.
@@ -1189,17 +1563,20 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	} while(0)
 
 	char	   *ptr;
+	char	   *out;
 	uint32		remaining;
 	uint32		datatotal;
 	RelFileNode *rnode = NULL;
 	uint8		block_id;
 
-	ResetDecoder(state);
-
-	state->decoded_record = record;
-	state->record_origin = InvalidRepOriginId;
-	state->toplevel_xid = InvalidTransactionId;
-
+	decoded->header = *record;
+	decoded->lsn = lsn;
+	decoded->next = NULL;
+	decoded->record_origin = InvalidRepOriginId;
+	decoded->toplevel_xid = InvalidTransactionId;
+	decoded->main_data = NULL;
+	decoded->main_data_len = 0;
+	decoded->max_block_id = -1;
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
 	remaining = record->xl_tot_len - SizeOfXLogRecord;
@@ -1217,7 +1594,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint8));
 
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
@@ -1228,18 +1605,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			uint32		main_data_len;
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint32));
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
 		}
 		else if (block_id == XLR_BLOCK_ID_ORIGIN)
 		{
-			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
+			COPY_HEADER_FIELD(&decoded->record_origin, sizeof(RepOriginId));
 		}
 		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
 		{
-			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+			COPY_HEADER_FIELD(&decoded->toplevel_xid, sizeof(TransactionId));
 		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
@@ -1247,7 +1624,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			DecodedBkpBlock *blk;
 			uint8		fork_flags;
 
-			if (block_id <= state->max_block_id)
+			/* mark any intervening block IDs as not in use */
+			for (int i = decoded->max_block_id + 1; i < block_id; ++i)
+				decoded->blocks[i].in_use = false;
+
+			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
 									  "out-of-order block_id %u at %X/%X",
@@ -1256,9 +1637,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 									  (uint32) state->ReadRecPtr);
 				goto err;
 			}
-			state->max_block_id = block_id;
+			decoded->max_block_id = block_id;
 
-			blk = &state->blocks[block_id];
+			blk = &decoded->blocks[block_id];
 			blk->in_use = true;
 			blk->apply_image = false;
 
@@ -1404,17 +1785,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	/*
 	 * Ok, we've parsed the fragment headers, and verified that the total
 	 * length of the payload in the fragments is equal to the amount of data
-	 * left. Copy the data of each fragment to a separate buffer.
-	 *
-	 * We could just set up pointers into readRecordBuf, but we want to align
-	 * the data for the convenience of the callers. Backup images are not
-	 * copied, however; they don't need alignment.
+	 * left.  Copy the data of each fragment to contiguous space after the
+	 * blocks array, inserting alignment padding before the data fragments so
+	 * they can be cast to struct pointers by REDO routines.
 	 */
+	out = ((char *) decoded) +
+		offsetof(DecodedXLogRecord, blocks) +
+		sizeof(decoded->blocks[0]) * (decoded->max_block_id + 1);
 
 	/* block data first */
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	for (block_id = 0; block_id <= decoded->max_block_id; block_id++)
 	{
-		DecodedBkpBlock *blk = &state->blocks[block_id];
+		DecodedBkpBlock *blk = &decoded->blocks[block_id];
 
 		if (!blk->in_use)
 			continue;
@@ -1423,58 +1805,37 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (blk->has_image)
 		{
-			blk->bkp_image = ptr;
+			/* no need to align image */
+			blk->bkp_image = out;
+			memcpy(out, ptr, blk->bimg_len);
 			ptr += blk->bimg_len;
+			out += blk->bimg_len;
 		}
 		if (blk->has_data)
 		{
-			if (!blk->data || blk->data_len > blk->data_bufsz)
-			{
-				if (blk->data)
-					pfree(blk->data);
-
-				/*
-				 * Force the initial request to be BLCKSZ so that we don't
-				 * waste time with lots of trips through this stanza as a
-				 * result of WAL compression.
-				 */
-				blk->data_bufsz = MAXALIGN(Max(blk->data_len, BLCKSZ));
-				blk->data = palloc(blk->data_bufsz);
-			}
+			out = (char *) MAXALIGN(out);
+			blk->data = out;
 			memcpy(blk->data, ptr, blk->data_len);
 			ptr += blk->data_len;
+			out += blk->data_len;
 		}
 	}
 
 	/* and finally, the main data */
-	if (state->main_data_len > 0)
+	if (decoded->main_data_len > 0)
 	{
-		if (!state->main_data || state->main_data_len > state->main_data_bufsz)
-		{
-			if (state->main_data)
-				pfree(state->main_data);
-
-			/*
-			 * main_data_bufsz must be MAXALIGN'ed.  In many xlog record
-			 * types, we omit trailing struct padding on-disk to save a few
-			 * bytes; but compilers may generate accesses to the xlog struct
-			 * that assume that padding bytes are present.  If the palloc
-			 * request is not large enough to include such padding bytes then
-			 * we'll get valgrind complaints due to otherwise-harmless fetches
-			 * of the padding bytes.
-			 *
-			 * In addition, force the initial request to be reasonably large
-			 * so that we don't waste time with lots of trips through this
-			 * stanza.  BLCKSZ / 2 seems like a good compromise choice.
-			 */
-			state->main_data_bufsz = MAXALIGN(Max(state->main_data_len,
-												  BLCKSZ / 2));
-			state->main_data = palloc(state->main_data_bufsz);
-		}
-		memcpy(state->main_data, ptr, state->main_data_len);
-		ptr += state->main_data_len;
+		out = (char *) MAXALIGN(out);
+		decoded->main_data = out;
+		memcpy(decoded->main_data, ptr, decoded->main_data_len);
+		ptr += decoded->main_data_len;
+		out += decoded->main_data_len;
 	}
 
+	/* Report the actual size we used. */
+	decoded->size = MAXALIGN(out - (char *) decoded);
+	Assert(DecodeXLogRecordRequiredSpace(record->xl_tot_len) >=
+		   decoded->size);
+
 	return true;
 
 shortdata_err:
@@ -1500,10 +1861,11 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	if (rnode)
 		*rnode = bkpb->rnode;
 	if (forknum)
@@ -1523,10 +1885,11 @@ XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return NULL;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 
 	if (!bkpb->has_data)
 	{
@@ -1554,12 +1917,13 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	char	   *ptr;
 	PGAlignedBlock tmp;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
-	if (!record->blocks[block_id].has_image)
+	if (!record->record->blocks[block_id].has_image)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	ptr = bkpb->bkp_image;
 
 	if (bkpb->bimg_info & BKPIMAGE_IS_COMPRESSED)
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 7e915bcadf..db0c801456 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -351,7 +351,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * going to initialize it. And vice versa.
 	 */
 	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
-	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
+	willinit = (record->record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
 	if (!willinit && zeromode)
@@ -829,7 +829,8 @@ wal_segment_close(XLogReaderState *state)
  */
 int
 read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
+					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
+					 bool nowait)
 {
 	XLogRecPtr	read_upto,
 				loc;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e6be2b7836..885530c4c0 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3863,6 +3863,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_RECOVERY_PAUSE:
 			event_name = "RecoveryPause";
 			break;
+		case WAIT_EVENT_RECOVERY_WAL_FLUSH:
+			event_name = "RecoveryWalFlush";
+			break;
 		case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
 			event_name = "ReplicationOriginDrop";
 			break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f21f61d5e1..d86092f47f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -111,7 +111,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 	{
 		ReorderBufferAssignChild(ctx->reorder,
 								 txid,
-								 record->decoded_record->xl_xid,
+								 XLogRecGetXid(record),
 								 buf.origptr);
 	}
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7c9d1b67df..2846766312 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -812,7 +812,7 @@ StartReplication(StartReplicationCmd *cmd)
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-					   XLogRecPtr targetRecPtr, char *cur_page)
+					   XLogRecPtr targetRecPtr, char *cur_page, bool nowait)
 {
 	XLogRecPtr	flushptr;
 	int			count;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 2229c86f9a..38ef72f318 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -49,7 +49,8 @@ typedef struct XLogPageReadPrivate
 
 static int	SimpleXLogPageRead(XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
-							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
+							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+							   bool nowait);
 
 /*
  * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline
@@ -239,7 +240,8 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 /* XLogReader callback function, to read a WAL page */
 static int
 SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
+				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+				   bool nowait)
 {
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 	uint32		targetPageOff;
@@ -423,7 +425,7 @@ extractPageInfo(XLogReaderState *record)
 				 RmgrNames[rmid], info);
 	}
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 31e99c2a6d..7259559036 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -333,7 +333,7 @@ WALDumpCloseSegment(XLogReaderState *state)
 /* pg_waldump's XLogReaderRoutine->page_read callback */
 static int
 WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-				XLogRecPtr targetPtr, char *readBuff)
+				XLogRecPtr targetPtr, char *readBuff, bool nowait)
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
@@ -392,10 +392,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
 	 * add an accessor macro for this.
 	 */
 	*fpi_len = 0;
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (XLogRecHasBlockImage(record, block_id))
-			*fpi_len += record->blocks[block_id].bimg_len;
+			*fpi_len += record->record->blocks[block_id].bimg_len;
 	}
 
 	/*
@@ -484,7 +484,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	if (!config->bkp_details)
 	{
 		/* print block references (short format) */
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -515,7 +515,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	{
 		/* print block references (detailed format) */
 		putchar('\n');
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -528,26 +528,26 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				   blk);
 			if (XLogRecHasBlockImage(record, block_id))
 			{
-				if (record->blocks[block_id].bimg_info &
+				if (record->record->blocks[block_id].bimg_info &
 					BKPIMAGE_IS_COMPRESSED)
 				{
 					printf(" (FPW%s); hole: offset: %u, length: %u, "
 						   "compression saved: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length,
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length,
 						   BLCKSZ -
-						   record->blocks[block_id].hole_length -
-						   record->blocks[block_id].bimg_len);
+						   record->record->blocks[block_id].hole_length -
+						   record->record->blocks[block_id].bimg_len);
 				}
 				else
 				{
 					printf(" (FPW%s); hole: offset: %u, length: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length);
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length);
 				}
 			}
 			putchar('\n');
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b976882229..ad77c04d0f 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -62,7 +62,8 @@ typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
 							   int reqLen,
 							   XLogRecPtr targetRecPtr,
-							   char *readBuf);
+							   char *readBuf,
+							   bool nowait);
 typedef void (*WALSegmentOpenCB) (XLogReaderState *xlogreader,
 								  XLogSegNo nextSegNo,
 								  TimeLineID *tli_p);
@@ -144,6 +145,30 @@ typedef struct
 	uint16		data_bufsz;
 } DecodedBkpBlock;
 
+/*
+ * The decoded contents of a record.  This occupies a contiguous region of
+ * memory, with main_data and blocks[n].data pointing to memory after the
+ * members declared here.
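+ *
+ * The layout of a decoded record is therefore approximately:
+ *
+ *   [struct members][blocks[0..max_block_id]][block images and data][main_data]
+ *
+ * with MAXALIGN padding inserted before each block's data and before
+ * main_data.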
+ */
+typedef struct DecodedXLogRecord
+{
+	/* Private member used for resource management. */
+	size_t		size;			/* total size of decoded record */
+	bool		oversized;		/* outside the regular decode buffer? */
+	struct DecodedXLogRecord *next;	/* decoded record queue link */
+
+	/* Public members. */
+	XLogRecPtr	lsn;			/* location */
+	XLogRecPtr	next_lsn;		/* location of next record */
+	XLogRecord	header;			/* header */
+	RepOriginId record_origin;
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+	char	   *main_data;		/* record's main data portion */
+	uint32		main_data_len;	/* main data portion's length */
+	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	DecodedBkpBlock blocks[FLEXIBLE_ARRAY_MEMBER];
+} DecodedXLogRecord;
+
 struct XLogReaderState
 {
 	/*
@@ -168,35 +193,25 @@ struct XLogReaderState
 	void	   *private_data;
 
 	/*
-	 * Start and end point of last record read.  EndRecPtr is also used as the
-	 * position to read next.  Calling XLogBeginRead() sets EndRecPtr to the
-	 * starting position and ReadRecPtr to invalid.
+	 * Start and end point of last record returned by XLogReadRecord().
+	 *
+	 * XXX These are also available as record->lsn and record->next_lsn,
+	 * but since these were part of the public interface...
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
 
-
-	/* ----------------------------------------
-	 * Decoded representation of current record
-	 *
-	 * Use XLogRecGet* functions to investigate the record; these fields
-	 * should not be accessed directly.
-	 * ----------------------------------------
+	/*
+	 * Start and end point of the last record read and decoded by
+	 * XLogReadRecordInternal().  NextRecPtr is also used as the position to
+	 * decode next.  Calling XLogBeginRead() sets NextRecPtr and EndRecPtr to
+	 * the requested starting position.
 	 */
-	XLogRecord *decoded_record; /* currently decoded record */
-
-	char	   *main_data;		/* record's main data portion */
-	uint32		main_data_len;	/* main data portion's length */
-	uint32		main_data_bufsz;	/* allocated size of the buffer */
-
-	RepOriginId record_origin;
+	XLogRecPtr	DecodeRecPtr;	/* start of last record decoded */
+	XLogRecPtr	NextRecPtr;		/* end+1 of last record decoded */
 
-	TransactionId toplevel_xid; /* XID of top-level transaction */
-
-	/* information about blocks referenced by the record. */
-	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
-
-	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	/* Last record returned by XLogReadRecord. */
+	DecodedXLogRecord *record;
 
 	/* ----------------------------------------
 	 * private/internal state
@@ -210,6 +225,26 @@ struct XLogReaderState
 	char	   *readBuf;
 	uint32		readLen;
 
+	/*
+	 * Buffer for decoded records.  This is a circular buffer, though
+	 * individual records can't be split in the middle, so some space is often
+	 * wasted at the end.  Oversized records that don't fit in this space are
+	 * allocated separately.
+	 */
+	char	   *decode_buffer;
+	size_t		decode_buffer_size;
+	bool		free_decode_buffer;		/* need to free? */
+	char	   *decode_buffer_head;		/* write head */
+	char	   *decode_buffer_tail;		/* read head */
+
+	/*
+	 * Queue of records that have been decoded.  This is a linked list that
+	 * usually consists of consecutive records in decode_buffer, but may also
+	 * contain oversized records allocated with palloc().
+	 */
+	DecodedXLogRecord *decode_queue_head;	/* newest decoded record */
+	DecodedXLogRecord *decode_queue_tail;	/* oldest decoded record */
+
 	/* last read XLOG position for data currently in readBuf */
 	WALSegmentContext segcxt;
 	WALOpenSegment seg;
@@ -252,6 +287,7 @@ struct XLogReaderState
 
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
+	bool		errormsg_deferred;
 };
 
 /* Get a new XLogReader */
@@ -264,6 +300,11 @@ extern XLogReaderRoutine *LocalXLogReaderRoutine(void);
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
+/* Optionally provide a circular decoding buffer to allow readahead. */
+extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
+									  void *buffer,
+									  size_t size);
+
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 #ifdef FRONTEND
@@ -274,6 +315,10 @@ extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
 										 char **errormsg);
 
+/* Try to read ahead, if there is space in the decoding buffer. */
+extern DecodedXLogRecord *XLogReadAhead(XLogReaderState *state,
+										char **errormsg);
+
 /* Validate a page */
 extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 										 XLogRecPtr recptr, char *phdr);
@@ -297,25 +342,31 @@ extern bool WALRead(XLogReaderState *state,
 
 /* Functions for decoding an XLogRecord */
 
-extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
+extern size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+extern bool DecodeXLogRecord(XLogReaderState *state,
+							 DecodedXLogRecord *decoded,
+							 XLogRecord *record,
+							 XLogRecPtr lsn,
 							 char **errmsg);
 
-#define XLogRecGetTotalLen(decoder) ((decoder)->decoded_record->xl_tot_len)
-#define XLogRecGetPrev(decoder) ((decoder)->decoded_record->xl_prev)
-#define XLogRecGetInfo(decoder) ((decoder)->decoded_record->xl_info)
-#define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
-#define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
-#define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
-#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
-#define XLogRecGetData(decoder) ((decoder)->main_data)
-#define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
-#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
+#define XLogRecGetTotalLen(decoder) ((decoder)->record->header.xl_tot_len)
+#define XLogRecGetPrev(decoder) ((decoder)->record->header.xl_prev)
+#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
+#define XLogRecGetRmid(decoder) ((decoder)->record->header.xl_rmid)
+#define XLogRecGetXid(decoder) ((decoder)->record->header.xl_xid)
+#define XLogRecGetOrigin(decoder) ((decoder)->record->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->record->toplevel_xid)
+#define XLogRecGetData(decoder) ((decoder)->record->main_data)
+#define XLogRecGetDataLen(decoder) ((decoder)->record->main_data_len)
+#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->record->max_block_id >= 0)
+#define XLogRecMaxBlockId(decoder) ((decoder)->record->max_block_id)
+#define XLogRecGetBlock(decoder, i) (&(decoder)->record->blocks[(i)])
 #define XLogRecHasBlockRef(decoder, block_id) \
-	((decoder)->blocks[block_id].in_use)
+	((decoder)->record->blocks[block_id].in_use)
 #define XLogRecHasBlockImage(decoder, block_id) \
-	((decoder)->blocks[block_id].has_image)
+	((decoder)->record->blocks[block_id].has_image)
 #define XLogRecBlockImageApply(decoder, block_id) \
-	((decoder)->blocks[block_id].apply_image)
+	((decoder)->record->blocks[block_id].apply_image)
 
 #ifndef FRONTEND
 extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index e59b6cf3a9..374c1b16ce 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -49,7 +49,8 @@ extern void FreeFakeRelcacheEntry(Relation fakerel);
 
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
-								 XLogRecPtr targetRecPtr, char *cur_page);
+								 XLogRecPtr targetRecPtr, char *cur_page,
+								 bool nowait);
 extern void wal_segment_open(XLogReaderState *state,
 							 XLogSegNo nextSegNo,
 							 TimeLineID *tli_p);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0dfbac46b4..2349b7b78b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -883,6 +883,7 @@ typedef enum
 	WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
 	WAIT_EVENT_RECOVERY_CONFLICT_TABLESPACE,
 	WAIT_EVENT_RECOVERY_PAUSE,
+	WAIT_EVENT_RECOVERY_WAL_FLUSH,
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-- 
2.20.1

v11-0004-Prefetch-referenced-blocks-during-recovery.patchtext/x-patch; charset=US-ASCII; name=v11-0004-Prefetch-referenced-blocks-during-recovery.patchDownload
From 62d64fc3ec8a4ab9bc90f6ee10855a5cb2bf5d26 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Apr 2020 23:04:51 +1200
Subject: [PATCH v11 4/6] Prefetch referenced blocks during recovery.

Introduce a new GUC recovery_prefetch.  If it is enabled (the default),
recovery reads ahead in the WAL and tries to initiate asynchronous reading
of referenced blocks that will soon be needed but are not yet cached in our
buffer pool.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size is used to limit the maximum distance we
are prepared to read ahead in the WAL to find uncached blocks.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  58 ++
 doc/src/sgml/monitoring.sgml                  |  85 +-
 doc/src/sgml/wal.sgml                         |  17 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             |  23 +-
 src/backend/access/transam/xlogprefetch.c     | 895 ++++++++++++++++++
 src/backend/access/transam/xlogreader.c       |   2 +
 src/backend/catalog/system_views.sql          |  14 +
 src/backend/postmaster/pgstat.c               | 103 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  56 +-
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/access/xlog.h                     |   1 +
 src/include/access/xlogprefetch.h             |  79 ++
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/pgstat.h                          |  27 +
 src/include/utils/guc.h                       |   4 +
 src/test/regress/expected/rules.out           |  11 +
 18 files changed, 1387 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetch.c
 create mode 100644 src/include/access/xlogprefetch.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8eabf93834..6e2c8dd201 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3257,6 +3257,64 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-recovery-prefetch" xreflabel="recovery_prefetch">
+      <term><varname>recovery_prefetch</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to try to prefetch blocks that are referenced in the WAL but
+        are not yet in the buffer pool, during recovery.  Prefetching blocks
+        that will soon be needed can reduce I/O wait times.
+        See also the <xref linkend="guc-wal-decode-buffer-size"/> and
+        <xref linkend="guc-maintenance-io-concurrency"/> settings, which limit
+        prefetching activity.
+        This setting is enabled
+        by default on systems that support <function>posix_fadvise</function>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+      <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks that were logged with full page images
+        during recovery.  Often this doesn't help, since such blocks will not
+        be read the first time they are needed and might remain in the buffer
+        pool after that.  However, on file systems with a block size larger
+        than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a
+        costly read-before-write when blocks are later written out.
+        The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-decode-buffer-size" xreflabel="wal_decode_buffer_size">
+      <term><varname>wal_decode_buffer_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_decode_buffer_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        A limit on how far ahead the server can look in the WAL, to find
+        blocks to prefetch.  Setting it too high might be counterproductive,
+        if it means that data falls out of the
+        kernel cache before it is needed.  If this value is specified without
+        units, it is taken as bytes.
+        The default is 256kB.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4e0193a967..1aecb19c2f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -323,6 +323,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2738,6 +2745,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    copy of the subscribed tables.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+    </row>
+    <row>
+     <entry><structfield>skip_seq</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+    </row>
+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_distance</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>How far ahead of recovery the prefetcher is on average, while recovery is not idle</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_queue_depth</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>Average number of prefetches in flight while recovery is not idle</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-recovery-prefetch"/>,
+   <xref linkend="guc-recovery-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="1">
@@ -4668,8 +4747,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         argument.  The argument can be <literal>bgwriter</literal> to reset
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
-        view, or <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view.
+        view, <literal>archiver</literal> to reset all the counters shown in
+        the <structname>pg_stat_archiver</structname> view, and
+        <literal>prefetch_recovery</literal> to reset all the counters shown
+        in the <structname>pg_stat_prefetch_recovery</structname> view.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index d1c3893b14..c51c431398 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -720,6 +720,23 @@
    <acronym>WAL</acronym> call being logged to the server log. This
    option might be replaced by a more general mechanism in the future.
   </para>
+
+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.  The
+   prefetching mechanism is most likely to be effective on systems
+   with <varname>full_page_writes</varname> set to
+   <literal>off</literal> (where that is safe), and where the working
+   set is larger than RAM.  By default, prefetching in recovery is enabled
+   on operating systems that have <function>posix_fadvise</function>
+   support.
+  </para>
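+
+  <para>
+   For example, a standby that spends much of its recovery time waiting on
+   reads might use settings along these lines (illustrative values only):
+<programlisting>
+recovery_prefetch = on
+wal_decode_buffer_size = 512kB
+maintenance_io_concurrency = 10
+</programlisting>
+  </para>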
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..39f9d4e77d 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetch.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 29d21ac438..5f929de671 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetch.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -109,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+int			wal_decode_buffer_size = 0x80000;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -3684,7 +3686,7 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
 					 xlogfname);
 			set_ps_display(activitymsg);
-
+			fprintf(stderr, "XXX will try to restore [%s]\n", xlogfname);
 			restoredFromArchive = RestoreArchivedFile(path, xlogfname,
 													  "RECOVERYXLOG",
 													  wal_segment_size,
@@ -6533,6 +6535,12 @@ StartupXLOG(void)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Set the WAL decode buffer size.  This limits how far ahead we can read
+	 * in the WAL.
+	 */
+	XLogReaderSetDecodeBuffer(xlogreader, NULL, wal_decode_buffer_size);
+
 	/*
 	 * Allocate two page buffers dedicated to WAL consistency checks.  We do
 	 * it this way, rather than just making static arrays, for two reasons:
@@ -7210,6 +7218,7 @@ StartupXLOG(void)
 		{
 			ErrorContextCallback errcallback;
 			TimestampTz xtime;
+			XLogPrefetchState prefetch;
 			PGRUsage	ru0;
 
 			pg_rusage_init(&ru0);
@@ -7220,6 +7229,9 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
 
+			/* Prepare to prefetch, if configured. */
+			XLogPrefetchBegin(&prefetch, xlogreader);
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7249,6 +7261,9 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/* Perform WAL prefetching, if enabled. */
+				XLogPrefetch(&prefetch, xlogreader->ReadRecPtr);
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -7420,6 +7435,9 @@ StartupXLOG(void)
 					 */
 					if (AllowCascadeReplication())
 						WalSndWakeup();
+
+					/* Reset the prefetcher. */
+					XLogPrefetchReconfigure();
 				}
 
 				/* Exit loop if we reached inclusive recovery target */
@@ -7436,6 +7454,7 @@ StartupXLOG(void)
 			/*
 			 * end of main redo apply loop
 			 */
+			XLogPrefetchEnd(&prefetch);
 
 			if (reachedRecoveryTarget)
 			{
@@ -12209,6 +12228,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					currentSource = XLOG_FROM_STREAM;
 					startWalReceiver = true;
+					XLogPrefetchReconfigure();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -12473,6 +12493,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						else
 							havedata = false;
 					}
+
 					if (havedata)
 					{
 						/*
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
new file mode 100644
index 0000000000..a8149b946c
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -0,0 +1,895 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetch.c
+ *
+ * The goal of this module is to read future WAL records and issue
+ * PrefetchSharedBuffer() calls for referenced blocks, so that we avoid I/O
+ * stalls in the main recovery loop.
+ *
+ * When examining a WAL record from the future, we need to consider that a
+ * referenced block or segment file might not exist on disk until this record
+ * or some earlier record has been replayed.  After a crash, a file might also
+ * be missing because it was dropped by a later WAL record; in that case, it
+ * will be recreated when this record is replayed.  These cases are handled by
+ * recognizing them and adding a "filter" that prevents all prefetching of a
+ * certain block range until the present WAL record has been replayed.  Blocks
+ * skipped for these reasons are counted as "skip_new" (that is, cases where we
+ * didn't try to prefetch "new" blocks).
+ *
+ * Blocks found in the buffer pool already are counted as "skip_hit".
+ * Repeated access to the same buffer is detected and skipped, and this is
+ * counted with "skip_seq".  Blocks that were logged with FPWs are skipped if
+ * recovery_prefetch_fpw is off, since on most systems there will be no I/O
+ * stall; this is counted with "skip_fpw".
+ *
+ * The only way we currently have to know that an I/O initiated with
+ * PrefetchSharedBuffer() has completed is that recovery will eventually call
+ * ReadBuffer() for the same block and perform a synchronous read.  Therefore,
+ * we track the number of
+ * potentially in-flight I/Os by using a circular buffer of LSNs.  When it's
+ * full, we have to wait for recovery to replay records so that the queue
+ * depth can be reduced, before we can do any more prefetching.  Ideally, this
+ * keeps us the right distance ahead to respect maintenance_io_concurrency.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetch.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log
+ * message that appears at the end of crash recovery.  It's also used to send
+ * messages periodically to the stats collector, to save the counters on disk.
+ */
+#define XLOGPREFETCHER_SAMPLE_DISTANCE 0x40000
+
+/* GUCs */
+bool		recovery_prefetch = true;
+bool		recovery_prefetch_fpw = false;
+
+int			XLogPrefetchReconfigureCount;
+
+/*
+ * A prefetcher object.  There is at most one of these in existence at a time,
+ * recreated whenever there is a configuration change.
+ */
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	DecodedXLogRecord *record;
+	int				next_block_id;
+	bool			shutdown;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation	last_reln;
+	RelFileNode		last_rnode;
+	BlockNumber		last_blkno;
+
+	/* Online averages. */
+	uint64			samples;
+	double			avg_queue_depth;
+	double			avg_distance;
+	XLogRecPtr		next_sample_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB		   *filter_table;
+	dlist_head		filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	int				prefetch_head;
+	int				prefetch_tail;
+	int				prefetch_queue_size;
+	XLogRecPtr		prefetch_queue[MAX_IO_CONCURRENCY + 1];
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode		rnode;
+	XLogRecPtr		filter_until_replayed;
+	BlockNumber		filter_from_block;
+	dlist_node		link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time; /* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 skip_hit;	/* Blocks already buffered. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+	pg_atomic_uint64 skip_seq;	/* Repeat blocks skipped. */
+	float			avg_distance;
+	float			avg_queue_depth;
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32			reset_handled;
+
+	/* Dynamic values */
+	int				distance;	/* Number of bytes ahead in the WAL. */
+	int				queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+static void XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher,
+									  XLogRecPtr replaying_lsn);
+static bool XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher);
+static void XLogPrefetchSaveStats(void);
+static void XLogPrefetchRestoreStats(void);
+
+static XLogPrefetchStats *Stats;
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&Stats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&Stats->prefetch, 0);
+	pg_atomic_write_u64(&Stats->skip_hit, 0);
+	pg_atomic_write_u64(&Stats->skip_new, 0);
+	pg_atomic_write_u64(&Stats->skip_fpw, 0);
+	pg_atomic_write_u64(&Stats->skip_seq, 0);
+	Stats->avg_distance = 0;
+	Stats->avg_queue_depth = 0;
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	Stats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+	if (!found)
+	{
+		pg_atomic_init_u32(&Stats->reset_request, 0);
+		Stats->reset_handled = 0;
+		pg_atomic_init_u64(&Stats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&Stats->prefetch, 0);
+		pg_atomic_init_u64(&Stats->skip_hit, 0);
+		pg_atomic_init_u64(&Stats->skip_new, 0);
+		pg_atomic_init_u64(&Stats->skip_fpw, 0);
+		pg_atomic_init_u64(&Stats->skip_seq, 0);
+		Stats->avg_distance = 0;
+		Stats->avg_queue_depth = 0;
+		Stats->distance = 0;
+		Stats->queue_depth = 0;
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&Stats->reset_request, 1);
+}
+
+/*
+ * Tell the stats collector to serialize the shared memory counters into the
+ * stats file.
+ */
+static void
+XLogPrefetchSaveStats(void)
+{
+	PgStat_RecoveryPrefetchStats serialized = {
+		.prefetch = pg_atomic_read_u64(&Stats->prefetch),
+		.skip_hit = pg_atomic_read_u64(&Stats->skip_hit),
+		.skip_new = pg_atomic_read_u64(&Stats->skip_new),
+		.skip_fpw = pg_atomic_read_u64(&Stats->skip_fpw),
+		.skip_seq = pg_atomic_read_u64(&Stats->skip_seq),
+		.stat_reset_timestamp = pg_atomic_read_u64(&Stats->reset_time)
+	};
+
+	pgstat_send_recoveryprefetch(&serialized);
+}
+
+/*
+ * Try to restore the shared memory counters from the stats file.
+ */
+static void
+XLogPrefetchRestoreStats(void)
+{
+	PgStat_RecoveryPrefetchStats *serialized = pgstat_fetch_recoveryprefetch();
+
+	if (serialized->stat_reset_timestamp != 0)
+	{
+		pg_atomic_write_u64(&Stats->prefetch, serialized->prefetch);
+		pg_atomic_write_u64(&Stats->skip_hit, serialized->skip_hit);
+		pg_atomic_write_u64(&Stats->skip_new, serialized->skip_new);
+		pg_atomic_write_u64(&Stats->skip_fpw, serialized->skip_fpw);
+		pg_atomic_write_u64(&Stats->skip_seq, serialized->skip_seq);
+		pg_atomic_write_u64(&Stats->reset_time, serialized->stat_reset_timestamp);
+	}
+}
+
+/*
+ * Initialize an XLogPrefetchState object and restore the last saved
+ * statistics from disk.
+ */
+void
+XLogPrefetchBegin(XLogPrefetchState *state, XLogReaderState *reader)
+{
+	XLogPrefetchRestoreStats();
+
+	/* We'll reconfigure on the first call to XLogPrefetch(). */
+	state->reader = reader;
+	state->prefetcher = NULL;
+	state->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+}
+
+/*
+ * Shut down the prefetching infrastructure, if configured.
+ */
+void
+XLogPrefetchEnd(XLogPrefetchState *state)
+{
+	XLogPrefetchSaveStats();
+
+	if (state->prefetcher)
+		XLogPrefetcherFree(state->prefetcher);
+	state->prefetcher = NULL;
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogReaderState *reader)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	/*
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.  The queue has space for up
+	 * general GUC to rate-limit all prefetching.  The queue has space for up to
+	 * has a gap between head and tail when full.
+	 */
+	prefetcher = palloc0(sizeof(XLogPrefetcher));
+	prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
+	prefetcher->reader = reader;
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	/* Log final statistics. */
+	ereport(LOG,
+			(errmsg("recovery finished prefetching at %X/%X; "
+					"prefetch = " UINT64_FORMAT ", "
+					"skip_hit = " UINT64_FORMAT ", "
+					"skip_new = " UINT64_FORMAT ", "
+					"skip_fpw = " UINT64_FORMAT ", "
+					"skip_seq = " UINT64_FORMAT ", "
+					"avg_distance = %f, "
+					"avg_queue_depth = %f",
+			 (uint32) (prefetcher->reader->EndRecPtr >> 32),
+			 (uint32) (prefetcher->reader->EndRecPtr),
+			 pg_atomic_read_u64(&Stats->prefetch),
+			 pg_atomic_read_u64(&Stats->skip_hit),
+			 pg_atomic_read_u64(&Stats->skip_new),
+			 pg_atomic_read_u64(&Stats->skip_fpw),
+			 pg_atomic_read_u64(&Stats->skip_seq),
+			 Stats->avg_distance,
+			 Stats->avg_queue_depth)));
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Called when recovery is replaying a new LSN, to check if we can read ahead.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	uint32		reset_request;
+
+	/* If an error has occurred or we've hit the end of the WAL, do nothing. */
+	if (prefetcher->shutdown)
+		return;
+
+	/*
+	 * Have any in-flight prefetches definitely completed, judging by the LSN
+	 * that is currently being replayed?
+	 */
+	XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+	/*
+	 * Do we already have the maximum permitted number of I/Os running
+	 * (according to the information we have)?  If so, we have to wait for at
+	 * least one to complete, so give up early and let recovery catch up.
+	 */
+	if (XLogPrefetcherSaturated(prefetcher))
+		return;
+
+	/*
+	 * Can we drop any filters yet?  This happens when the LSN that is
+	 * currently being replayed has moved past a record that prevents
+	 * prefetching of a block range, such as relation extension.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&Stats->reset_request);
+	if (reset_request != Stats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		Stats->reset_handled = reset_request;
+		prefetcher->avg_distance = 0;
+		prefetcher->avg_queue_depth = 0;
+		prefetcher->samples = 0;
+	}
+
+	/* OK, we can now try reading ahead. */
+	XLogPrefetcherScanRecords(prefetcher, replaying_lsn);
+}
+
+/*
+ * Read ahead as far as we are allowed to, considering the LSN that recovery
+ * is currently replaying.
+ */
+static void
+XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	XLogReaderState *reader = prefetcher->reader;
+	DecodedXLogRecord *record;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	for (;;)
+	{
+		char *error;
+		int64		distance;
+
+		/* If we don't already have a record, then try to read one. */
+		if (prefetcher->record == NULL)
+		{
+			record = XLogReadAhead(reader, &error);
+			if (record == NULL)
+			{
+				/* If we got an error, log it and give up. */
+				if (error)
+				{
+					ereport(LOG, (errmsg("recovery no longer prefetching: %s", error)));
+					prefetcher->shutdown = true;
+					Stats->queue_depth = 0;
+					Stats->distance = 0;
+				}
+				/* Otherwise, we'll try again later when more data is here. */
+				return;
+			}
+			prefetcher->record = record;
+			prefetcher->next_block_id = 0;
+		}
+		else
+		{
+			/*
+			 * We ran out of I/O queue while part way through a record.  We'll
+			 * carry on where we left off, according to next_block_id.
+			 */
+			record = prefetcher->record;
+		}
+
+		/* How far ahead of replay are we now? */
+		distance = record->lsn - replaying_lsn;
+
+		/* Update distance shown in shm. */
+		Stats->distance = distance;
+
+		/* Periodically recompute some statistics. */
+		if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+		{
+			/* Compute online averages. */
+			prefetcher->samples++;
+			if (prefetcher->samples == 1)
+			{
+				prefetcher->avg_distance = Stats->distance;
+				prefetcher->avg_queue_depth = Stats->queue_depth;
+			}
+			else
+			{
+				prefetcher->avg_distance +=
+					(Stats->distance - prefetcher->avg_distance) /
+					prefetcher->samples;
+				prefetcher->avg_queue_depth +=
+					(Stats->queue_depth - prefetcher->avg_queue_depth) /
+					prefetcher->samples;
+			}
+
+			/* Expose it in shared memory. */
+			Stats->avg_distance = prefetcher->avg_distance;
+			Stats->avg_queue_depth = prefetcher->avg_queue_depth;
+
+			/* Also periodically save the simple counters. */
+			XLogPrefetchSaveStats();
+
+			prefetcher->next_sample_lsn =
+				replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
+		}
+
+		/* Are we not far enough ahead? */
+		if (distance <= 0)
+		{
+			/* XXX Is this still possible? */
+			prefetcher->record = NULL;		/* skip this record */
+			continue;
+		}
+
+		/*
+		 * If this is a record that creates a new SMGR relation, we'll avoid
+		 * prefetching anything from that rnode until it has been replayed.
+		 */
+		if (replaying_lsn < record->lsn &&
+			record->header.xl_rmid == RM_SMGR_ID &&
+			(record->header.xl_info & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+		{
+			xl_smgr_create *xlrec = (xl_smgr_create *) record->main_data;
+
+			XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0, record->lsn);
+		}
+
+		/* Scan the record's block references. */
+		if (!XLogPrefetcherScanBlocks(prefetcher))
+			return;
+
+		/* Advance to the next record. */
+		prefetcher->record = NULL;
+	}
+}
+
+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+	DecodedXLogRecord *record = prefetcher->record;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	/*
+	 * We might already have been partway through processing this record when
+	 * our queue became saturated, so we need to start where we left off.
+	 */
+	for (int block_id = prefetcher->next_block_id;
+		 block_id <= record->max_block_id;
+		 ++block_id)
+	{
+		DecodedBkpBlock *block = &record->blocks[block_id];
+		PrefetchBufferResult prefetch;
+		SMgrRelation reln;
+
+		/* Ignore everything but the main fork for now. */
+		if (block->forknum != MAIN_FORKNUM)
+			continue;
+
+		/*
+		 * If there is a full page image attached, we won't be reading the
+		 * page, so you might think we should skip it.  However, if the
+		 * underlying filesystem uses larger logical blocks than us, it
+		 * might still need to perform a read-before-write some time later.
+		 * Therefore, only prefetch if configured to do so.
+		 */
+		if (block->has_image && !recovery_prefetch_fpw)
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
+			continue;
+		}
+
+		/*
+		 * If this block will initialize a new page then it's probably a
+		 * relation extension.  Since that might create a new segment, we
+		 * can't try to prefetch this block until the record has been
+		 * replayed, or we might try to open a file that doesn't exist yet.
+		 */
+		if (block->flags & BKPBLOCK_WILL_INIT)
+		{
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+									record->lsn);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Should we skip this block due to a filter? */
+		if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Fast path for repeated references to the same relation. */
+		if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+		{
+			/*
+			 * If this is a repeat access to the same block, then skip it.
+			 *
+			 * XXX We could also check for last_blkno + 1 too, and also update
+			 * last_blkno; it's not clear if the kernel would do a better job
+			 * of sequential prefetching.
+			 */
+			if (block->blkno == prefetcher->last_blkno)
+			{
+				pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
+				continue;
+			}
+
+			/* We can avoid calling smgropen(). */
+			reln = prefetcher->last_reln;
+		}
+		else
+		{
+			/* Otherwise we have to open it. */
+			reln = smgropen(block->rnode, InvalidBackendId);
+			prefetcher->last_rnode = block->rnode;
+			prefetcher->last_reln = reln;
+		}
+		prefetcher->last_blkno = block->blkno;
+
+		/* Try to prefetch this block! */
+		prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+		if (BufferIsValid(prefetch.recent_buffer))
+		{
+			/*
+			 * It was already cached, so do nothing.  Perhaps in future we
+			 * could remember the buffer so that recovery doesn't have to look
+			 * it up again.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
+		}
+		else if (prefetch.initiated_io)
+		{
+			/*
+			 * I/O has possibly been initiated (we don't know whether the block
+			 * was already cached by the kernel, so for lack of better
+			 * information we just assume that an I/O was started).  Record
+			 * this as an I/O in progress until eventually we replay this
+			 * LSN.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->prefetch, 1);
+			XLogPrefetcherInitiatedIO(prefetcher, record->lsn);
+			/*
+			 * If the queue is now full, we'll have to wait before processing
+			 * any more blocks from this record, or move to a new record if
+			 * that was the last block.
+			 */
+			if (XLogPrefetcherSaturated(prefetcher))
+			{
+				prefetcher->next_block_id = block_id + 1;
+				return false;
+			}
+		}
+		else
+		{
+			/*
+			 * Neither cached nor initiated.  The underlying segment file
+			 * doesn't exist.  Presumably it will be unlinked by a later WAL
+			 * record.  When recovery reads this block, it will use the
+			 * EXTENSION_CREATE_RECOVERY flag.  We certainly don't want to do
+			 * that sort of thing while merely prefetching, so let's just
+			 * ignore references to this relation until this record is
+			 * replayed, and let recovery create the dummy file or complain if
+			 * something is wrong.
+			 */
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+									record->lsn);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&Stats->reset_request) != Stats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&Stats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&Stats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_new));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_fpw));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_seq));
+	values[6] = Int32GetDatum(Stats->distance);
+	values[7] = Int32GetDatum(Stats->queue_depth);
+	values[8] = Float4GetDatum(Stats->avg_distance);
+	values[9] = Float4GetDatum(Stats->avg_queue_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Compute (n + 1) % prefetch_queue_size, assuming n < prefetch_queue_size,
+ * without using division.
+ */
+static inline int
+XLogPrefetcherNext(XLogPrefetcher *prefetcher, int n)
+{
+	int		next = n + 1;
+
+	return next == prefetcher->prefetch_queue_size ? 0 : next;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the (presumably lower) block
+		 * number there because we don't want to have to track individual
+		 * blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of the
+	 * time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+												   HASH_FIND, NULL);
+
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Insert an LSN into the queue.  The queue must not be full already.  This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+						  XLogRecPtr prefetching_lsn)
+{
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+	prefetcher->prefetch_queue[prefetcher->prefetch_head] = prefetching_lsn;
+	prefetcher->prefetch_head =
+		XLogPrefetcherNext(prefetcher, prefetcher->prefetch_head);
+	Stats->queue_depth++;
+	Assert(Stats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet?  That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches.  For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+		   prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+	{
+		prefetcher->prefetch_tail =
+			XLogPrefetcherNext(prefetcher, prefetcher->prefetch_tail);
+		Stats->queue_depth--;
+		Assert(Stats->queue_depth >= 0);
+	}
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+	int		next = XLogPrefetcherNext(prefetcher, prefetcher->prefetch_head);
+
+	return next == prefetcher->prefetch_tail;
+}
+
+void
+assign_recovery_prefetch(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
+
+void
+assign_recovery_prefetch_fpw(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch_fpw = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
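
To make the queue-depth accounting above easier to follow, here is a minimal
standalone sketch of the ring-buffer technique used by
XLogPrefetcherInitiatedIO(), XLogPrefetcherCompletedIO() and
XLogPrefetcherSaturated(): an array of N + 1 slots with one gap kept between
head and tail can track at most N in-flight prefetches, and the wrap-around
needs no division.  The constants, types and the little driver below are
illustrative only, not part of the patch.

/* Standalone sketch of the bounded LSN ring used to cap in-flight prefetches. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_IN_FLIGHT 4			/* stands in for maintenance_io_concurrency */

typedef uint64_t LSN;

typedef struct
{
	LSN		queue[MAX_IN_FLIGHT + 1];	/* one spare slot: full when next(head) == tail */
	int		head;
	int		tail;
} PrefetchQueue;

static int
next_slot(int n)
{
	return n == MAX_IN_FLIGHT ? 0 : n + 1;	/* (n + 1) % size, without division */
}

static bool
saturated(PrefetchQueue *q)
{
	return next_slot(q->head) == q->tail;
}

/* Record that a prefetch was initiated for a block referenced at 'lsn'. */
static void
initiated_io(PrefetchQueue *q, LSN lsn)
{
	assert(!saturated(q));
	q->queue[q->head] = lsn;
	q->head = next_slot(q->head);
}

/* Forget prefetches whose records have been replayed; those I/Os must be done. */
static void
completed_io(PrefetchQueue *q, LSN replaying_lsn)
{
	while (q->head != q->tail && q->queue[q->tail] < replaying_lsn)
		q->tail = next_slot(q->tail);
}

int
main(void)
{
	PrefetchQueue q = {0};

	for (LSN lsn = 100; lsn < 110; lsn++)
	{
		if (saturated(&q))
			completed_io(&q, lsn - 2);	/* pretend replay has caught up to lsn - 2 */
		if (!saturated(&q))
			initiated_io(&q, lsn);
	}
	printf("head=%d tail=%d\n", q.head, q.tail);
	return 0;
}
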
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 22e5d5ff64..fb0d80e7c7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -866,6 +866,8 @@ err:
 	/*
 	 * Invalidate the read state, if this was an error. We might read from a
 	 * different source after failure.
+	 *
+	 * XXX !?!
 	 */
 	if (readOff < 0 || state->errormsg_buf[0] != '\0')
 		XLogReaderInvalReadState(state);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ed4f3f142d..7bde0639b4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -829,6 +829,20 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.skip_hit,
+            s.skip_new,
+            s.skip_fpw,
+            s.skip_seq,
+            s.distance,
+            s.queue_depth,
+            s.avg_distance,
+            s.avg_queue_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 885530c4c0..007d298877 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "access/xlogprefetch.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -282,6 +283,7 @@ static int	localNumBackends = 0;
 static PgStat_ArchiverStats archiverStats;
 static PgStat_GlobalStats globalStats;
 static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
+static PgStat_RecoveryPrefetchStats recoveryPrefetchStats;
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -354,6 +356,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
 static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
+static void pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1370,11 +1373,20 @@ pgstat_reset_shared_counters(const char *target)
 		msg.m_resettarget = RESET_ARCHIVER;
 	else if (strcmp(target, "bgwriter") == 0)
 		msg.m_resettarget = RESET_BGWRITER;
+	else if (strcmp(target, "prefetch_recovery") == 0)
+	{
+		/*
+		 * We can't ask the stats collector to do this for us as it is not
+		 * attached to shared memory.
+		 */
+		XLogPrefetchRequestResetStats();
+		return;
+	}
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("unrecognized reset target: \"%s\"", target),
-				 errhint("Target must be \"archiver\" or \"bgwriter\".")));
+				 errhint("Target must be \"archiver\", \"bgwriter\" or \"prefetch_recovery\".")));
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
 	pgstat_send(&msg, sizeof(msg));
@@ -2692,6 +2704,22 @@ pgstat_fetch_slru(void)
 }
 
 
+/*
+ * ---------
+ * pgstat_fetch_recoveryprefetch() -
+ *
+ *	Support function for restoring the counters managed by xlogprefetch.c.
+ * ---------
+ */
+PgStat_RecoveryPrefetchStats *
+pgstat_fetch_recoveryprefetch(void)
+{
+	backend_read_statsfile();
+
+	return &recoveryPrefetchStats;
+}
+
+
 /* ------------------------------------------------------------
  * Functions for management of the shared-memory PgBackendStatus array
  * ------------------------------------------------------------
@@ -4461,6 +4489,23 @@ pgstat_send_slru(void)
 }
 
 
+/* ----------
+ * pgstat_send_recoveryprefetch() -
+ *
+ *		Send recovery prefetch statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats)
+{
+	PgStat_MsgRecoveryPrefetch msg;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYPREFETCH);
+	msg.m_stats = *stats;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
 /* ----------
  * PgstatCollectorMain() -
  *
@@ -4665,6 +4710,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_slru(&msg.msg_slru, len);
 					break;
 
+				case PGSTAT_MTYPE_RECOVERYPREFETCH:
+					pgstat_recv_recoveryprefetch(&msg.msg_recoveryprefetch, len);
+					break;
+
 				case PGSTAT_MTYPE_FUNCSTAT:
 					pgstat_recv_funcstat(&msg.msg_funcstat, len);
 					break;
@@ -4936,6 +4985,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
 	rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
 	(void) rc;					/* we'll check for error with ferror */
 
+	/*
+	 * Write recovery prefetch stats struct
+	 */
+	rc = fwrite(&recoveryPrefetchStats, sizeof(recoveryPrefetchStats), 1,
+				fpout);
+	(void) rc;					/* we'll check for error with ferror */
+
 	/*
 	 * Walk through the database table.
 	 */
@@ -5195,6 +5251,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 	memset(&globalStats, 0, sizeof(globalStats));
 	memset(&archiverStats, 0, sizeof(archiverStats));
 	memset(&slruStats, 0, sizeof(slruStats));
+	memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
 
 	/*
 	 * Set the current timestamp (will be kept only in case we can't load an
@@ -5282,6 +5339,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 		goto done;
 	}
 
+	/*
+	 * Read recoveryPrefetchStats struct
+	 */
+	if (fread(&recoveryPrefetchStats, 1, sizeof(recoveryPrefetchStats),
+			  fpin) != sizeof(recoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
+		goto done;
+	}
+
 	/*
 	 * We found an existing collector stats file. Read it and put all the
 	 * hashtable entries into place.
@@ -5582,6 +5651,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 	PgStat_GlobalStats myGlobalStats;
 	PgStat_ArchiverStats myArchiverStats;
 	PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
+	PgStat_RecoveryPrefetchStats myRecoveryPrefetchStats;
 	FILE	   *fpin;
 	int32		format_id;
 	const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -5647,6 +5717,18 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 		return false;
 	}
 
+	/*
+	 * Read recovery prefetch stats struct
+	 */
+	if (fread(&myRecoveryPrefetchStats, 1, sizeof(myRecoveryPrefetchStats),
+			  fpin) != sizeof(myRecoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		FreeFile(fpin);
+		return false;
+	}
+
 	/* By default, we're going to return the timestamp of the global file. */
 	*ts = myGlobalStats.stats_timestamp;
 
@@ -5813,6 +5895,13 @@ backend_read_statsfile(void)
 		if (ok && file_ts >= min_ts)
 			break;
 
+		/*
+		 * If we're in crash recovery, the collector may not even be running,
+		 * so work with what we have.
+		 */
+		if (InRecovery)
+			break;
+
 		/* Not there or too old, so kick the collector and wait a bit */
 		if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
 			pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
@@ -6448,6 +6537,18 @@ pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
 	slruStats[msg->m_index].truncate += msg->m_truncate;
 }
 
+/* ----------
+ * pgstat_recv_recoveryprefetch() -
+ *
+ *	Process a recovery prefetch message.
+ * ----------
+ */
+static void
+pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len)
+{
+	recoveryPrefetchStats = msg->m_stats;
+}
+
 /* ----------
  * pgstat_recv_recoveryconflict() -
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..912a8cfcb6 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/xlogprefetch.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -126,6 +127,7 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, PredicateLockShmemSize());
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
+		size = add_size(size, XLogPrefetchShmemSize());
 		size = add_size(size, CLOGShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
@@ -216,6 +218,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 596bcb7b84..a2a54e9bc6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -37,6 +37,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetch.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/storage.h"
@@ -202,6 +203,7 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1248,6 +1250,32 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch referenced blocks during recovery."),
+			gettext_noop("Read ahead of the current replay position to find uncached blocks.")
+		},
+		&recovery_prefetch,
+		/* No point in enabling this on systems without a suitable API. */
+#ifdef USE_PREFETCH
+		true,
+#else
+		false,
+#endif
+		NULL, assign_recovery_prefetch, NULL
+	},
+	{
+		{"recovery_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch blocks that have full page images in the WAL."),
+			gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+						 "entirely overwritten, but if the logical page size of the filesystem is "
+						 "larger than PostgreSQL's, this can be beneficial.  This option has no "
+						 "effect unless recovery_prefetch is enabled.")
+		},
+		&recovery_prefetch_fpw,
+		false,
+		NULL, assign_recovery_prefetch_fpw, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2636,6 +2664,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
+			gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
+			GUC_UNIT_BYTE
+		},
+		&wal_decode_buffer_size,
+		0x80000, 0x10000, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the size of WAL files held for standby servers."),
@@ -2956,7 +2995,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -11608,6 +11648,20 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..e6412ad517 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -234,6 +234,11 @@
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
 
+# - Prefetching during recovery -
+
+#recovery_prefetch = on		# whether to prefetch referenced blocks during recovery
+#recovery_prefetch_fpw = off	# whether to prefetch pages logged with FPW
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..4f58fa029a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern int	wal_decode_buffer_size;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/access/xlogprefetch.h b/src/include/access/xlogprefetch.h
new file mode 100644
index 0000000000..8c04ff8bce
--- /dev/null
+++ b/src/include/access/xlogprefetch.h
@@ -0,0 +1,79 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCH_H
+#define XLOGPREFETCH_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern bool recovery_prefetch;
+extern bool recovery_prefetch_fpw;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern int XLogPrefetchReconfigureCount;
+
+typedef struct XLogPrefetchState
+{
+	XLogReaderState *reader;
+	XLogPrefetcher *prefetcher;
+	int			reconfigure_count;
+} XLogPrefetchState;
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchReconfigure(void);
+extern void XLogPrefetchRequestResetStats(void);
+
+extern void XLogPrefetchBegin(XLogPrefetchState *state, XLogReaderState *reader);
+extern void XLogPrefetchEnd(XLogPrefetchState *state);
+
+/* Functions exposed only for the use of XLogPrefetch(). */
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogReaderState *reader);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch,
+									XLogRecPtr replaying_lsn);
+
+/*
+ * Tell the prefetching module that we are now replaying a given LSN, so that
+ * it can decide how far ahead to read in the WAL, if configured.
+ */
+static inline void
+XLogPrefetch(XLogPrefetchState *state, XLogRecPtr replaying_lsn)
+{
+	/*
+	 * Handle any configuration changes.  Rather than trying to deal with
+	 * various parameter changes, we just tear down and set up a new
+	 * prefetcher if anything we depend on changes.
+	 */
+	if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
+	{
+		/* If we had a prefetcher, tear it down. */
+		if (state->prefetcher)
+		{
+			XLogPrefetcherFree(state->prefetcher);
+			state->prefetcher = NULL;
+		}
+		/* If we want a prefetcher, set it up. */
+		if (recovery_prefetch)
+			state->prefetcher = XLogPrefetcherAllocate(state->reader);
+		state->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	if (state->prefetcher)
+		XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
+}
+
+#endif
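
Putting the pieces together, the call pattern established by the xlog.c hunks
and this header is roughly the following standalone sketch: the startup
process calls the prefetch hook once per replayed record, and a bumped
reconfigure counter (as set by the GUC assign hooks) causes the next call to
tear down and rebuild the prefetcher.  Everything declared below is an
illustrative stand-in, not part of the patch.

/* Standalone sketch of how the redo loop drives the prefetcher. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint64_t LSN;

static int	reconfigure_count = 0;	/* bumped by the "GUC assign hooks" */
static bool prefetch_enabled = true;

typedef struct
{
	int			seen_reconfigure_count;
	void	   *prefetcher;			/* stands in for XLogPrefetcher */
} PrefetchState;

static void
prefetch_sketch(PrefetchState *state, LSN replaying_lsn)
{
	/* Rebuild the prefetcher whenever any setting it depends on has changed. */
	if (state->seen_reconfigure_count != reconfigure_count)
	{
		free(state->prefetcher);
		state->prefetcher = prefetch_enabled ? malloc(1) : NULL;
		state->seen_reconfigure_count = reconfigure_count;
	}
	if (state->prefetcher != NULL)
		printf("read ahead of %llu\n", (unsigned long long) replaying_lsn);
}

int
main(void)
{
	PrefetchState state = {.seen_reconfigure_count = -1};	/* force setup on first call */

	for (LSN lsn = 1; lsn <= 5; lsn++)
	{
		if (lsn == 3)
		{
			/* e.g. SIGHUP turned recovery_prefetch off */
			prefetch_enabled = false;
			reconfigure_count++;
		}
		prefetch_sketch(&state, lsn);	/* called once per replayed record */
	}
	free(state.prefetcher);
	return 0;
}
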
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f48f5fb4d9..f808c175bd 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6154,6 +6154,14 @@
   prorettype => 'bool', proargtypes => '',
   prosrc => 'pg_is_wal_replay_paused' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,float4,float4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth,avg_distance,avg_queue_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2349b7b78b..4f16f1973b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -62,6 +62,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_ARCHIVER,
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_SLRU,
+	PGSTAT_MTYPE_RECOVERYPREFETCH,
 	PGSTAT_MTYPE_FUNCSTAT,
 	PGSTAT_MTYPE_FUNCPURGE,
 	PGSTAT_MTYPE_RECOVERYCONFLICT,
@@ -182,6 +183,19 @@ typedef struct PgStat_TableXactStatus
 	struct PgStat_TableXactStatus *next;	/* next of same subxact */
 } PgStat_TableXactStatus;
 
+/*
+ * Recovery prefetching statistics persisted on disk by pgstat.c, but kept in
+ * shared memory by xlogprefetch.c.
+ */
+typedef struct PgStat_RecoveryPrefetchStats
+{
+	PgStat_Counter prefetch;
+	PgStat_Counter skip_hit;
+	PgStat_Counter skip_new;
+	PgStat_Counter skip_fpw;
+	PgStat_Counter skip_seq;
+	TimestampTz stat_reset_timestamp;
+} PgStat_RecoveryPrefetchStats;
 
 /* ------------------------------------------------------------
  * Message formats follow
@@ -453,6 +467,16 @@ typedef struct PgStat_MsgSLRU
 	PgStat_Counter m_truncate;
 } PgStat_MsgSLRU;
 
+/* ----------
+ * PgStat_MsgRecoveryPrefetch			Sent by XLogPrefetch to save statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgRecoveryPrefetch
+{
+	PgStat_MsgHdr m_hdr;
+	PgStat_RecoveryPrefetchStats m_stats;
+} PgStat_MsgRecoveryPrefetch;
+
 /* ----------
  * PgStat_MsgRecoveryConflict	Sent by the backend upon recovery conflict
  * ----------
@@ -597,6 +621,7 @@ typedef union PgStat_Msg
 	PgStat_MsgArchiver msg_archiver;
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgSLRU msg_slru;
+	PgStat_MsgRecoveryPrefetch msg_recoveryprefetch;
 	PgStat_MsgFuncstat msg_funcstat;
 	PgStat_MsgFuncpurge msg_funcpurge;
 	PgStat_MsgRecoveryConflict msg_recoveryconflict;
@@ -1465,6 +1490,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1480,6 +1506,7 @@ extern int	pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_RecoveryPrefetchStats *pgstat_fetch_recoveryprefetch(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 2819282181..03177a7949 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -440,4 +440,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetch.c */
+extern void assign_recovery_prefetch(bool new_value, void *extra);
+extern void assign_recovery_prefetch_fpw(bool new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2a18dc423e..1d3fb52d1e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1869,6 +1869,17 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.skip_hit,
+    s.skip_new,
+    s.skip_fpw,
+    s.skip_seq,
+    s.distance,
+    s.queue_depth,
+    s.avg_distance,
+    s.avg_queue_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth, avg_distance, avg_queue_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.20.1

v11-0005-WIP-Avoid-extra-buffer-lookup-when-prefetching-W.patchtext/x-patch; charset=US-ASCII; name=v11-0005-WIP-Avoid-extra-buffer-lookup-when-prefetching-W.patchDownload
From 08f6818351edce949b6cf37add8f59410d0d4a01 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 14 Sep 2020 23:20:55 +1200
Subject: [PATCH v11 5/6] WIP: Avoid extra buffer lookup when prefetching WAL
 blocks.

Provide some workspace in decoded WAL records, so that we can remember
which buffer we recently found a block cached in, for later use when
replaying the record.  Provide a new way to look up a recently-known
buffer and check whether it's still valid and has the right tag.

XXX Needs review to figure out if it's safe or steamrolling over subtleties
---
 src/backend/access/transam/xlog.c         |  2 +-
 src/backend/access/transam/xlogprefetch.c |  6 ++--
 src/backend/access/transam/xlogreader.c   | 13 ++++++++
 src/backend/access/transam/xlogutils.c    | 23 ++++++++++---
 src/backend/storage/buffer/bufmgr.c       | 40 +++++++++++++++++++++++
 src/backend/storage/freespace/freespace.c |  3 +-
 src/include/access/xlogreader.h           |  7 ++++
 src/include/access/xlogutils.h            |  3 +-
 src/include/storage/bufmgr.h              |  2 ++
 9 files changed, 89 insertions(+), 10 deletions(-)
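
The idea can be illustrated with a toy buffer table, independent of
PostgreSQL's buffer manager: the prefetch step records which buffer it found a
block in, and replay only trusts that hint after re-checking the buffer's tag,
falling back to a normal lookup otherwise.  The real implementation is in the
ReadRecentBuffer() and XLogReadBufferExtended() changes below; everything in
this sketch is illustrative only.

/* Standalone sketch of the "recent buffer" hint with tag re-validation. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NBUFFERS 8

typedef struct
{
	uint32_t	relnode;		/* which relation this buffer caches */
	uint32_t	blkno;			/* which block of that relation */
	bool		valid;
} BufferTag;

static BufferTag buffers[NBUFFERS];

/* Slow path: scan for the block (stands in for the buffer mapping table). */
static int
lookup_buffer(uint32_t relnode, uint32_t blkno)
{
	for (int i = 0; i < NBUFFERS; i++)
		if (buffers[i].valid &&
			buffers[i].relnode == relnode &&
			buffers[i].blkno == blkno)
			return i;
	return -1;
}

/*
 * Fast path: re-check the tag before trusting the hint, because the buffer
 * could have been evicted and reused for another block since the hint was
 * recorded at prefetch time.
 */
static int
read_with_hint(uint32_t relnode, uint32_t blkno, int recent_buffer)
{
	if (recent_buffer >= 0 &&
		buffers[recent_buffer].valid &&
		buffers[recent_buffer].relnode == relnode &&
		buffers[recent_buffer].blkno == blkno)
		return recent_buffer;	/* hint still good: no lookup needed */
	return lookup_buffer(relnode, blkno);
}

int
main(void)
{
	buffers[3] = (BufferTag) {.relnode = 16384, .blkno = 7, .valid = true};

	int			hint = lookup_buffer(16384, 7);	/* "prefetch" remembers the hint */

	buffers[3].blkno = 9;		/* buffer got recycled in the meantime */
	printf("found at %d\n", read_with_hint(16384, 7, hint));	/* falls back: -1 */
	return 0;
}
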

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5f929de671..475abe9e10 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1452,7 +1452,7 @@ checkXLogConsistency(XLogReaderState *record)
 		 * temporary page.
 		 */
 		buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									 RBM_NORMAL_NO_LOG);
+									 RBM_NORMAL_NO_LOG, InvalidBuffer);
 		if (!BufferIsValid(buf))
 			continue;
 
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
index a8149b946c..948a63f25d 100644
--- a/src/backend/access/transam/xlogprefetch.c
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -624,10 +624,10 @@ XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
 		if (BufferIsValid(prefetch.recent_buffer))
 		{
 			/*
-			 * It was already cached, so do nothing.  Perhaps in future we
-			 * could remember the buffer so that recovery doesn't have to look
-			 * it up again.
+			 * It was already cached, so do nothing.  We'll remember the
+			 * buffer, so that recovery can try to avoid looking it up again.
 			 */
+			block->recent_buffer = prefetch.recent_buffer;
 			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
 		}
 		else if (prefetch.initiated_io)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index fb0d80e7c7..9640899ea7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1651,6 +1651,8 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
 
+			blk->recent_buffer = InvalidBuffer;
+
 			COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 			/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
 			if (blk->has_data && blk->data_len == 0)
@@ -1860,6 +1862,15 @@ err:
 bool
 XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 				   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+	return XLogRecGetRecentBuffer(record, block_id, rnode, forknum, blknum,
+								  NULL);
+}
+
+bool
+XLogRecGetRecentBuffer(XLogReaderState *record, uint8 block_id,
+					   RelFileNode *rnode, ForkNumber *forknum,
+					   BlockNumber *blknum, Buffer *recent_buffer)
 {
 	DecodedBkpBlock *bkpb;
 
@@ -1874,6 +1885,8 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 		*forknum = bkpb->forknum;
 	if (blknum)
 		*blknum = bkpb->blkno;
+	if (recent_buffer)
+		*recent_buffer = bkpb->recent_buffer;
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index db0c801456..8a7eac65cf 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -336,11 +336,13 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber blkno;
+	Buffer		recent_buffer;
 	Page		page;
 	bool		zeromode;
 	bool		willinit;
 
-	if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+	if (!XLogRecGetRecentBuffer(record, block_id, &rnode, &forknum, &blkno,
+								&recent_buffer))
 	{
 		/* Caller specified a bogus block_id */
 		elog(PANIC, "failed to locate backup block with ID %d", block_id);
@@ -362,7 +364,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	{
 		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
+									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
+									  recent_buffer);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			elog(ERROR, "failed to restore block image");
@@ -391,7 +394,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	}
 	else
 	{
-		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode);
+		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode,
+									  recent_buffer);
 		if (BufferIsValid(*buf))
 		{
 			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
@@ -439,7 +443,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  */
 Buffer
 XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-					   BlockNumber blkno, ReadBufferMode mode)
+					   BlockNumber blkno, ReadBufferMode mode,
+					   Buffer recent_buffer)
 {
 	BlockNumber lastblock;
 	Buffer		buffer;
@@ -447,6 +452,15 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 
 	Assert(blkno != P_NEW);
 
+	/* Do we have a clue where the buffer might be already? */
+	if (BufferIsValid(recent_buffer) &&
+		mode == RBM_NORMAL &&
+		ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+	{
+		buffer = recent_buffer;
+		goto recent_buffer_fast_path;
+	}
+
 	/* Open the relation at smgr level */
 	smgr = smgropen(rnode, InvalidBackendId);
 
@@ -505,6 +519,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 		}
 	}
 
+recent_buffer_fast_path:
 	if (mode == RBM_NORMAL)
 	{
 		/* check that page has been initialized */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a2a963bd5b..c8a755fb09 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -598,6 +598,46 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 	}
 }
 
+/*
+ * ReadRecentBuffer -- try to refind a buffer that we suspect holds a given
+ *		block
+ *
+ * Return true if the buffer is valid, has the correct tag, and we managed
+ * to pin it.
+ */
+bool
+ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum,
+				 Buffer recent_buffer)
+{
+	BufferDesc *bufHdr;
+	BufferTag	tag;
+
+	Assert(BufferIsValid(recent_buffer));
+
+	/* Look up the header by index, and try to pin if shared. */
+	if (BufferIsLocal(recent_buffer))
+		bufHdr = GetBufferDescriptor(-recent_buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(recent_buffer - 1);
+		ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+		if (!PinBuffer(bufHdr, NULL))
+		{
+			/* Not valid, couldn't pin it. */
+			UnpinBuffer(bufHdr, true);
+			return false;
+		}
+	}
+
+	/* Does the tag match? */
+	INIT_BUFFERTAG(tag, rnode, forkNum, blockNum);
+	if (BUFFERTAGS_EQUAL(tag, bufHdr->tag))
+		return true;
+
+	/* Nope -- this isn't the block we seek. */
+	UnpinBuffer(bufHdr, true);
+	return false;
+}
 
 /*
  * ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 6a96126b0c..c998b52c13 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -210,7 +210,8 @@ XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 	blkno = fsm_logical_to_physical(addr);
 
 	/* If the page doesn't exist already, extend */
-	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR);
+	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR,
+								 InvalidBuffer);
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(buf);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index ad77c04d0f..84c5fa744b 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -39,6 +39,7 @@
 #endif
 
 #include "access/xlogrecord.h"
+#include "storage/buf.h"
 
 /* WALOpenSegment represents a WAL segment being read. */
 typedef struct WALOpenSegment
@@ -126,6 +127,9 @@ typedef struct
 	ForkNumber	forknum;
 	BlockNumber blkno;
 
+	/* Workspace for remembering last known buffer holding this block. */
+	Buffer		recent_buffer;
+
 	/* copy of the fork_flags field from the XLogRecordBlockHeader */
 	uint8		flags;
 
@@ -377,5 +381,8 @@ extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *
 extern bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 							   RelFileNode *rnode, ForkNumber *forknum,
 							   BlockNumber *blknum);
+extern bool XLogRecGetRecentBuffer(XLogReaderState *record, uint8 block_id,
+								   RelFileNode *rnode, ForkNumber *forknum,
+								   BlockNumber *blknum, Buffer *recent_buffer);
 
 #endif							/* XLOGREADER_H */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 374c1b16ce..a0c2b60c57 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -42,7 +42,8 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 													Buffer *buf);
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-									 BlockNumber blkno, ReadBufferMode mode);
+									 BlockNumber blkno, ReadBufferMode mode,
+									 Buffer recent_buffer);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..c3280b754e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -176,6 +176,8 @@ extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_r
 												 BlockNumber blockNum);
 extern PrefetchBufferResult PrefetchBuffer(Relation reln, ForkNumber forkNum,
 										   BlockNumber blockNum);
+extern bool ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum,
+							 BlockNumber blockNum, Buffer recent_buffer);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
 extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BlockNumber blockNum, ReadBufferMode mode,
-- 
2.20.1

#52Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#51)
5 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Thu, Sep 24, 2020 at 11:38 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Sep 9, 2020 at 11:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

OK, thanks for looking into this. I guess I'll wait for an updated patch
before testing this further. The storage has limited capacity so I'd
have to either reduce the amount of data/WAL or juggle with the WAL
segments somehow. Doesn't seem worth it.

Here's a new WIP version that works for archive-based recovery in my tests.

Rebased over recent merge conflicts in xlog.c. I also removed a stray
debugging message.

One problem with the current patch is that if you use something like
pg_standby, that is, a restore command that waits for more data to
arrive, then the prefetcher blocks waiting for WAL while it's reading
ahead, which means that replay itself is delayed. I'm not sure what to
think about that yet.
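
To illustrate the shape of the problem, here's a toy sketch of the
non-blocking read-ahead idea as it applies on the streaming side, where
the prefetcher only looks at WAL that the walreceiver has already
written and simply stops when it runs out rather than waiting.  This is
not the patch's code and every identifier in it is made up; it just
models the control flow:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t lsn_t;

/* Pretend these are maintained elsewhere (walreceiver, recovery loop). */
static lsn_t written_upto = 4096;	/* WAL already written out to disk */
static lsn_t prefetched_upto = 0;	/* how far read-ahead has got */

/*
 * Try to prefetch the record ending at record_end.  Give up immediately
 * instead of waiting if the WAL hasn't been written yet.
 */
static bool
prefetch_one_record(lsn_t record_end)
{
	if (record_end > written_upto)
		return false;			/* would have to wait, so stop here */
	/* ... real code would decode the record and call PrefetchBuffer() ... */
	prefetched_upto = record_end;
	return true;
}

int
main(void)
{
	/* Read ahead in pretend 512-byte records until we'd have to wait. */
	for (lsn_t next = 512;; next += 512)
	{
		if (!prefetch_one_record(next))
			break;
	}
	printf("read ahead to %llu, written up to %llu\n",
		   (unsigned long long) prefetched_upto,
		   (unsigned long long) written_upto);
	return 0;
}

The streaming case gets exactly that kind of cheap early exit in the
0003 patch below, via an unlocked read of the written-up-to position;
there's no equivalent test when the source is a restore command that
blocks, which is where the stall comes from.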

Attachments:

v12-0001-Add-pg_atomic_unlocked_add_fetch_XXX.patch (text/x-patch; charset=US-ASCII)
From 7e1ca6e7c471b038cf145a23c1aa17cd17aa1a6e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 28 Mar 2020 11:42:59 +1300
Subject: [PATCH v12 1/6] Add pg_atomic_unlocked_add_fetch_XXX().

Add a variant of pg_atomic_add_fetch_XXX with no barrier semantics, for
cases where you only want to avoid the possibility that a concurrent
pg_atomic_read_XXX() sees a torn/partial value.  On modern
architectures, this is simply value++, but there is a fallback to
spinlock emulation.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/include/port/atomics.h         | 24 ++++++++++++++++++++++
 src/include/port/atomics/generic.h | 33 ++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 4956ec55cb..2abb852893 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -389,6 +389,21 @@ pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
 	return pg_atomic_add_fetch_u32_impl(ptr, add_);
 }
 
+/*
+ * pg_atomic_unlocked_add_fetch_u32 - atomically add to variable
+ *
+ * Like pg_atomic_unlocked_write_u32, guarantees only that partial values
+ * cannot be observed.
+ *
+ * No barrier semantics.
+ */
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	AssertPointerAlignment(ptr, 4);
+	return pg_atomic_unlocked_add_fetch_u32_impl(ptr, add_);
+}
+
 /*
  * pg_atomic_sub_fetch_u32 - atomically subtract from variable
  *
@@ -519,6 +534,15 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
 }
 
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+	return pg_atomic_unlocked_add_fetch_u64_impl(ptr, add_);
+}
+
 #undef INSIDE_ATOMICS_H
 
 #endif							/* ATOMICS_H */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index d60a0d9e7f..3e1598d8ff 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -234,6 +234,16 @@ pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 }
 #endif
 
+#if !defined(PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32)
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	ptr->value += add_;
+	return ptr->value;
+}
+#endif
+
 #if !defined(PG_HAVE_ATOMIC_SUB_FETCH_U32) && defined(PG_HAVE_ATOMIC_FETCH_SUB_U32)
 #define PG_HAVE_ATOMIC_SUB_FETCH_U32
 static inline uint32
@@ -399,3 +409,26 @@ pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
 }
 #endif
+
+#if defined(PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY) && \
+	!defined(PG_HAVE_ATOMIC_U64_SIMULATION)
+
+#ifndef PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	ptr->value += val;
+	return ptr->value;
+}
+#endif
+
+#else
+
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	return pg_atomic_add_fetch_u64_impl(ptr, val);
+}
+
+#endif
-- 
2.20.1

v12-0002-Improve-information-about-received-WAL.patch (text/x-patch; charset=US-ASCII)
From 8efda35d207f07f7f4cb742501d0bb369ac372be Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 10 Aug 2020 17:06:21 +1200
Subject: [PATCH v12 2/6] Improve information about received WAL.

In commit d140f2f3, we cleaned up the distinction between flushed and
written LSN positions.  Go further, and expose the written location in a
way that allows for the associated timeline ID to be read consistently.
Without that, it might be difficult to know the path of the file that
has been written without risking data races.

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/replication/walreceiver.c      | 10 ++++--
 src/backend/replication/walreceiverfuncs.c | 41 +++++++++++++++++-----
 src/include/replication/walreceiver.h      | 30 +++++++++-------
 3 files changed, 56 insertions(+), 25 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index bb1d44ccb7..7ab56b6dc2 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -903,6 +903,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 {
 	int			startoff;
 	int			byteswritten;
+	WalRcvData *walrcv = WalRcv;
 
 	while (nbytes > 0)
 	{
@@ -994,7 +995,10 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	}
 
 	/* Update shared-memory status */
-	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+	SpinLockAcquire(&walrcv->mutex);
+	pg_atomic_write_u64(&walrcv->writtenUpto, LogstreamResult.Write);
+	walrcv->writtenTLI = ThisTimeLineID;
+	SpinLockRelease(&walrcv->mutex);
 }
 
 /*
@@ -1020,7 +1024,7 @@ XLogWalRcvFlush(bool dying)
 		{
 			walrcv->latestChunkStart = walrcv->flushedUpto;
 			walrcv->flushedUpto = LogstreamResult.Flush;
-			walrcv->receivedTLI = ThisTimeLineID;
+			walrcv->flushedTLI = ThisTimeLineID;
 		}
 		SpinLockRelease(&walrcv->mutex);
 
@@ -1360,7 +1364,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	receive_start_tli = WalRcv->receiveStartTLI;
 	written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto);
 	flushed_lsn = WalRcv->flushedUpto;
-	received_tli = WalRcv->receivedTLI;
+	received_tli = WalRcv->flushedTLI;
 	last_send_time = WalRcv->lastMsgSendTime;
 	last_receipt_time = WalRcv->lastMsgReceiptTime;
 	latest_end_lsn = WalRcv->latestWalEnd;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index e675757301..7d7a776ce8 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -284,10 +284,12 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 	 * If this is the first startup of walreceiver (on this timeline),
 	 * initialize flushedUpto and latestChunkStart to the starting point.
 	 */
-	if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
+	if (walrcv->receiveStart == 0 || walrcv->flushedTLI != tli)
 	{
+		pg_atomic_write_u64(&walrcv->writtenUpto, recptr);
+		walrcv->writtenTLI = tli;
 		walrcv->flushedUpto = recptr;
-		walrcv->receivedTLI = tli;
+		walrcv->flushedTLI = tli;
 		walrcv->latestChunkStart = recptr;
 	}
 	walrcv->receiveStart = recptr;
@@ -309,10 +311,10 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
  * Optionally, returns the previous chunk start, that is the first byte
  * written in the most recent walreceiver flush cycle.  Callers not
  * interested in that value may pass NULL for latestChunkStart. Same for
- * receiveTLI.
+ * flushedTLI.
  */
 XLogRecPtr
-GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *flushedTLI)
 {
 	WalRcvData *walrcv = WalRcv;
 	XLogRecPtr	recptr;
@@ -321,8 +323,8 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 	recptr = walrcv->flushedUpto;
 	if (latestChunkStart)
 		*latestChunkStart = walrcv->latestChunkStart;
-	if (receiveTLI)
-		*receiveTLI = walrcv->receivedTLI;
+	if (flushedTLI)
+		*flushedTLI = walrcv->flushedTLI;
 	SpinLockRelease(&walrcv->mutex);
 
 	return recptr;
@@ -330,14 +332,35 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 
 /*
  * Returns the last+1 byte position that walreceiver has written.
- * This returns a recently written value without taking a lock.
+ *
+ * The other arguments are similar to GetWalRcvFlushRecPtr()'s.
  */
 XLogRecPtr
-GetWalRcvWriteRecPtr(void)
+GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *writtenTLI)
 {
 	WalRcvData *walrcv = WalRcv;
+	XLogRecPtr	recptr;
+
+	SpinLockAcquire(&walrcv->mutex);
+	recptr = pg_atomic_read_u64(&walrcv->writtenUpto);
+	if (latestChunkStart)
+		*latestChunkStart = walrcv->latestChunkStart;
+	if (writtenTLI)
+		*writtenTLI = walrcv->writtenTLI;
+	SpinLockRelease(&walrcv->mutex);
 
-	return pg_atomic_read_u64(&walrcv->writtenUpto);
+	return recptr;
+}
+
+/*
+ * For callers that don't need a consistent LSN, TLI pair, and that don't mind
+ * a potentially slightly out of date value in exchange for speed, this
+ * version provides an unlocked view of the latest written location.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtrUnlocked(void)
+{
+	return pg_atomic_read_u64(&WalRcv->writtenUpto);
 }
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 1b05b39df4..84f84567cd 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -74,14 +74,25 @@ typedef struct
 	TimeLineID	receiveStartTLI;
 
 	/*
-	 * flushedUpto-1 is the last byte position that has already been received,
-	 * and receivedTLI is the timeline it came from.  At the first startup of
+	 * flushedUpto-1 is the last byte position that has already been flushed,
+	 * and flushedTLI is the timeline it came from.  At the first startup of
 	 * walreceiver, these are set to receiveStart and receiveStartTLI. After
 	 * that, walreceiver updates these whenever it flushes the received WAL to
 	 * disk.
 	 */
 	XLogRecPtr	flushedUpto;
-	TimeLineID	receivedTLI;
+	TimeLineID	flushedTLI;
+
+	/*
+	 * writtenUpto-1 is like flushedUpto-1, except that it's updated
+	 * without waiting for the flush, after the data has been written to disk
+	 * and available for reading.  It is an atomic type so that we can read it
+	 * without locks.  We still acquire the spinlock in cases where it is
+	 * written or read along with the TLI, so that they can be accessed
+	 * together consistently.
+	 */
+	pg_atomic_uint64 writtenUpto;
+	TimeLineID	writtenTLI;
 
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
@@ -142,14 +153,6 @@ typedef struct
 
 	slock_t		mutex;			/* locks shared variables shown above */
 
-	/*
-	 * Like flushedUpto, but advanced after writing and before flushing,
-	 * without the need to acquire the spin lock.  Data can be read by another
-	 * process up to this point, but shouldn't be used for data integrity
-	 * purposes.
-	 */
-	pg_atomic_uint64 writtenUpto;
-
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.  But we do need atomic fetch and
@@ -457,8 +460,9 @@ extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname,
 								 bool create_temp_slot);
-extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
-extern XLogRecPtr GetWalRcvWriteRecPtr(void);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *flushedTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *writtenTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtrUnlocked(void);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.20.1

v12-0003-Provide-XLogReadAhead-to-decode-future-WAL-recor.patch (text/x-patch; charset=US-ASCII)
From dc9dc6c670e5aeec5647fb17b6f9fc656b58323d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 10 Aug 2020 17:06:21 +1200
Subject: [PATCH v12 3/6] Provide XLogReadAhead() to decode future WAL records.

Teach xlogreader.c to decode its output into a circular buffer, to
support a future prefetching patch.  Provides two new interfaces:

 * XLogReadRecord() works as before, except that it returns a pointer to
   a new decoded record object rather than just the header

 * XLogReadAhead() implements a second cursor that allows you to read
   further ahead, as long as there is enough space in the circular decoding
   buffer

To support existing callers of XLogReadRecord(), the most recently
returned record also becomes the "current" record, for the purpose of
calls to XLogRecGetXXX() macros and functions, so that the multi-record
nature of the WAL decoder is hidden from code paths that don't need to
care about this change.

To support opportunistic readahead, the page-read callback function
gains a "noblock" parameter.  This allows for calls to XLogReadAhead()
to return without waiting if there is currently no data available, in
particular in the case of streaming replication.  For non-blocking
XLogReadAhead() to work, a page-read callback that understands "noblock"
must be supplied.  Existing callbacks that ignore it work as before, as
long as you only use the XLogReadRecord() interface.

The main XLogPageRead() routine used by recovery is extended to respect
noblock mode when the WAL source is a walreceiver.

Very large records that don't fit in the circular buffer are marked as
"oversized" and allocated and freed piecemeal.  The decoding buffer can
be placed in shared memory, for potential future work on parallelizing
recovery.

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/access/transam/generic_xlog.c |   6 +-
 src/backend/access/transam/xlog.c         | 105 +++-
 src/backend/access/transam/xlogreader.c   | 620 +++++++++++++++++-----
 src/backend/access/transam/xlogutils.c    |   5 +-
 src/backend/postmaster/pgstat.c           |   3 +
 src/backend/replication/logical/decode.c  |   2 +-
 src/backend/replication/walsender.c       |   2 +-
 src/bin/pg_rewind/parsexlog.c             |   8 +-
 src/bin/pg_waldump/pg_waldump.c           |  24 +-
 src/include/access/xlogreader.h           | 127 +++--
 src/include/access/xlogutils.h            |   3 +-
 src/include/pgstat.h                      |   1 +
 12 files changed, 699 insertions(+), 207 deletions(-)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 5164a1c2f3..5f6df896ad 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -482,10 +482,10 @@ generic_redo(XLogReaderState *record)
 	uint8		block_id;
 
 	/* Protect limited size of buffers[] array */
-	Assert(record->max_block_id < MAX_GENERIC_XLOG_PAGES);
+	Assert(XLogRecMaxBlockId(record) < MAX_GENERIC_XLOG_PAGES);
 
 	/* Iterate over blocks */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		XLogRedoAction action;
 
@@ -525,7 +525,7 @@ generic_redo(XLogReaderState *record)
 	}
 
 	/* Changes are done: unlock and release all buffers */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (BufferIsValid(buffers[block_id]))
 			UnlockReleaseBuffer(buffers[block_id]);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f11b1b9de..f446210684 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -211,7 +211,8 @@ static XLogRecPtr LastRec;
 
 /* Local copy of WalRcv->flushedUpto */
 static XLogRecPtr flushedUpto = 0;
-static TimeLineID receiveTLI = 0;
+static XLogRecPtr writtenUpto = 0;
+static TimeLineID writtenTLI = 0;
 
 /*
  * During recovery, lastFullPageWrites keeps track of full_page_writes that
@@ -911,9 +912,11 @@ static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
+						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+						 bool nowait);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-										bool fetching_ckpt, XLogRecPtr tliRecPtr);
+										bool fetching_ckpt, XLogRecPtr tliRecPtr,
+										bool nowait);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
@@ -1417,7 +1420,7 @@ checkXLogConsistency(XLogReaderState *record)
 
 	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		Buffer		buf;
 		Page		page;
@@ -4347,6 +4350,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 		record = XLogReadRecord(xlogreader, &errormsg);
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
+
 		if (record == NULL)
 		{
 			if (readFile >= 0)
@@ -4390,6 +4394,42 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 
 		if (record)
 		{
+			if (readSource == XLOG_FROM_STREAM)
+			{
+				/*
+				 * In streaming mode, we allow ourselves to read records that
+				 * have been written but not yet flushed, for increased
+				 * concurrency.  We still have to wait until the record has
+				 * been flushed before allowing it to be replayed.
+				 *
+				 * XXX This logic preserves the traditional behaviour where we
+				 * didn't replay records until the walreceiver flushed them,
+				 * except that now we read and decode them sooner.  Could it
+				 * be relaxed even more?  Isn't the real data integrity
+				 * requirement for _writeback_ to stall until the WAL is
+				 * durable, not recovery, just as on a primary?
+				 *
+				 * XXX Are there any circumstances in which this should be
+				 * interruptible?
+				 *
+				 * XXX We don't replicate the XLogReceiptTime etc logic from
+				 * WaitForWALToBecomeAvailable() here...  probably need to
+				 * refactor/share code?
+				 */
+				if (EndRecPtr < flushedUpto)
+				{
+					while (EndRecPtr < (flushedUpto = GetWalRcvFlushRecPtr(NULL, NULL)))
+					{
+						(void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
+										 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+										 -1,
+										 WAIT_EVENT_RECOVERY_WAL_FLUSH);
+						CHECK_FOR_INTERRUPTS();
+						ResetLatch(&XLogCtl->recoveryWakeupLatch);
+					}
+				}
+			}
+
 			/* Great, got a record */
 			return record;
 		}
@@ -10136,7 +10176,7 @@ xlog_redo(XLogReaderState *record)
 		 * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info
 		 * code just to distinguish them for statistics purposes.
 		 */
-		for (uint8 block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (uint8 block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			Buffer		buffer;
 
@@ -10274,7 +10314,7 @@ xlog_block_info(StringInfo buf, XLogReaderState *record)
 	int			block_id;
 
 	/* decode block references */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
@@ -11896,7 +11936,7 @@ CancelBackup(void)
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
-			 XLogRecPtr targetRecPtr, char *readBuf)
+			 XLogRecPtr targetRecPtr, char *readBuf, bool nowait)
 {
 	XLogPageReadPrivate *private =
 	(XLogPageReadPrivate *) xlogreader->private_data;
@@ -11908,6 +11948,15 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
 
+	/*
+	 * If streaming and asked not to wait, return as quickly as possible if
+	 * the data we want isn't available immediately.  Use an unlocked read of
+	 * the latest written position.
+	 */
+	if (readSource == XLOG_FROM_STREAM && nowait &&
+		GetWalRcvWriteRecPtrUnlocked() < targetPagePtr + reqLen)
+		return -1;
+
 	/*
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
@@ -11918,6 +11967,9 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		/*
 		 * Request a restartpoint if we've replayed too much xlog since the
 		 * last one.
+		 *
+		 * XXX Why is this here?  Move it to recovery loop, since it's based
+		 * on replay position, not read position?
 		 */
 		if (bgwriterLaunched)
 		{
@@ -11940,12 +11992,13 @@ retry:
 	/* See if we need to retrieve more data */
 	if (readFile < 0 ||
 		(readSource == XLOG_FROM_STREAM &&
-		 flushedUpto < targetPagePtr + reqLen))
+		 writtenUpto < targetPagePtr + reqLen))
 	{
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
 										 private->randAccess,
 										 private->fetching_ckpt,
-										 targetRecPtr))
+										 targetRecPtr,
+										 nowait))
 		{
 			if (readFile >= 0)
 				close(readFile);
@@ -11971,10 +12024,10 @@ retry:
 	 */
 	if (readSource == XLOG_FROM_STREAM)
 	{
-		if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
+		if (((targetPagePtr) / XLOG_BLCKSZ) != (writtenUpto / XLOG_BLCKSZ))
 			readLen = XLOG_BLCKSZ;
 		else
-			readLen = XLogSegmentOffset(flushedUpto, wal_segment_size) -
+			readLen = XLogSegmentOffset(writtenUpto, wal_segment_size) -
 				targetPageOff;
 	}
 	else
@@ -12094,7 +12147,8 @@ next_record_is_invalid:
  */
 static bool
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-							bool fetching_ckpt, XLogRecPtr tliRecPtr)
+							bool fetching_ckpt, XLogRecPtr tliRecPtr,
+							bool nowait)
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
@@ -12197,6 +12251,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * hope...
 					 */
 
+					/* If we were asked not to wait, give up immediately. */
+					if (nowait)
+						return false;
+
 					/*
 					 * We should be able to move to XLOG_FROM_STREAM only in
 					 * standby mode.
@@ -12313,6 +12371,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				if (readFile >= 0)
 					return true;	/* success! */
 
+				/* If we were asked not to wait, give up immediately. */
+				if (nowait)
+					return false;
+
 				/*
 				 * Nope, not found in archive or pg_wal.
 				 */
@@ -12390,7 +12452,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
 											 PrimarySlotName,
 											 wal_receiver_create_temp_slot);
-						flushedUpto = 0;
+						writtenUpto = 0;
 					}
 
 					/*
@@ -12413,15 +12475,16 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * be updated on each cycle. When we are behind,
 					 * XLogReceiptTime will not advance, so the grace time
 					 * allotted to conflicting queries will decrease.
+					 *
 					 */
-					if (RecPtr < flushedUpto)
+					if (RecPtr < writtenUpto)
 						havedata = true;
 					else
 					{
 						XLogRecPtr	latestChunkStart;
 
-						flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
-						if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
+						writtenUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &writtenTLI);
+						if (RecPtr < writtenUpto && writtenTLI == curFileTLI)
 						{
 							havedata = true;
 							if (latestChunkStart <= RecPtr)
@@ -12447,9 +12510,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						if (readFile < 0)
 						{
 							if (!expectedTLEs)
-								expectedTLEs = readTimeLineHistory(receiveTLI);
+								expectedTLEs = readTimeLineHistory(writtenTLI);
 							readFile = XLogFileRead(readSegNo, PANIC,
-													receiveTLI,
+													writtenTLI,
 													XLOG_FROM_STREAM, false);
 							Assert(readFile >= 0);
 						}
@@ -12463,6 +12526,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						break;
 					}
 
+					/* If we were asked not to wait, give up immediately. */
+					if (nowait)
+						return false;
+
 					/*
 					 * Data not here yet. Check for trigger, then wait for
 					 * walreceiver to wake us up when new WAL arrives.
@@ -12499,6 +12566,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * Wait for more WAL to arrive. Time out after 5 seconds
 					 * to react to a trigger file promptly and to check if the
 					 * WAL receiver is still active.
+					 *
+					 * XXX This is signalled on *flush*, not on write.  Oops.
 					 */
 					(void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
 									 WL_LATCH_SET | WL_TIMEOUT |
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a63ad8cfd0..22e5d5ff64 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -37,7 +37,9 @@ static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
 			pg_attribute_printf(2, 3);
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
-							 int reqLen);
+							 int reqLen, bool nowait);
+size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+static DecodedXLogRecord *XLogReadRecordInternal(XLogReaderState *state, bool force);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
@@ -50,6 +52,8 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 /* size of the buffer allocated for error message. */
 #define MAX_ERRORMSG_LEN 1000
 
+#define DEFAULT_DECODE_BUFFER_SIZE 0x10000
+
 /*
  * Construct a string in state->errormsg_buf explaining what's wrong with
  * the current record being read.
@@ -64,6 +68,8 @@ report_invalid_record(XLogReaderState *state, const char *fmt,...)
 	va_start(args, fmt);
 	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
 	va_end(args);
+
+	state->errormsg_deferred = true;
 }
 
 /*
@@ -86,8 +92,6 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	/* initialize caller-provided support functions */
 	state->routine = *routine;
 
-	state->max_block_id = -1;
-
 	/*
 	 * Permanently allocate readBuf.  We do it this way, rather than just
 	 * making a static array, for two reasons: (1) no need to waste the
@@ -138,18 +142,11 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 void
 XLogReaderFree(XLogReaderState *state)
 {
-	int			block_id;
-
 	if (state->seg.ws_file != -1)
 		state->routine.segment_close(state);
 
-	for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
-	{
-		if (state->blocks[block_id].data)
-			pfree(state->blocks[block_id].data);
-	}
-	if (state->main_data)
-		pfree(state->main_data);
+	if (state->decode_buffer && state->free_decode_buffer)
+		pfree(state->decode_buffer);
 
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
@@ -158,6 +155,22 @@ XLogReaderFree(XLogReaderState *state)
 	pfree(state);
 }
 
+/*
+ * Set the size of the decoding buffer.  A pointer to a caller supplied memory
+ * region may also be passed in, in which case non-oversized records will be
+ * decoded there.
+ */
+void
+XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size)
+{
+	Assert(state->decode_buffer == NULL);
+
+	state->decode_buffer = buffer;
+	state->decode_buffer_size = size;
+	state->decode_buffer_head = buffer;
+	state->decode_buffer_tail = buffer;
+}
+
 /*
  * Allocate readRecordBuf to fit a record of at least the given length.
  * Returns true if successful, false if out of memory.
@@ -245,7 +258,9 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
 
 	/* Begin at the passed-in record pointer. */
 	state->EndRecPtr = RecPtr;
+	state->NextRecPtr = RecPtr;
 	state->ReadRecPtr = InvalidXLogRecPtr;
+	state->DecodeRecPtr = InvalidXLogRecPtr;
 }
 
 /*
@@ -266,6 +281,261 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  */
 XLogRecord *
 XLogReadRecord(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *record;
+
+	/* We can release the most recently returned record. */
+	if (state->record)
+	{
+		/*
+		 * Remove it from the decoded record queue.  It must be the oldest
+		 * item decoded, decode_queue_tail.
+		 */
+		record = state->record;
+		Assert(record == state->decode_queue_tail);
+		state->record = NULL;
+		state->decode_queue_tail = record->next;
+
+		/* It might also be the newest item decoded, decode_queue_head. */
+		if (state->decode_queue_head == record)
+			state->decode_queue_head = NULL;
+
+		/* Release the space. */
+		if (unlikely(record->oversized))
+		{
+			/* It's not in the decode buffer, so free it to release space. */
+			pfree(record);
+		}
+		else
+		{
+			/* It must be the tail record in the decode buffer. */
+			Assert(state->decode_buffer_tail == (char *) record);
+
+			/*
+			 * We need to update tail to point to the next record that is in
+			 * the decode buffer, if any, being careful to skip oversized ones
+			 * (they're not in the decode buffer).
+			 */
+			record = record->next;
+			while (unlikely(record && record->oversized))
+				record = record->next;
+			if (record)
+			{
+				/* Adjust tail to release space. */
+				state->decode_buffer_tail = (char *) record;
+			}
+			else
+			{
+				/* Nothing else in the decode buffer, so just reset it. */
+				state->decode_buffer_tail = state->decode_buffer;
+				state->decode_buffer_head = state->decode_buffer;
+			}
+		}
+	}
+
+	for (;;)
+	{
+		/* We can now return the tail item in the read queue, if there is one. */
+		if (state->decode_queue_tail)
+		{
+			/*
+			 * Is this record at the LSN that the caller expects?  If it
+			 * isn't, this indicates that EndRecPtr has been moved to a new
+			 * position by the caller, so we'd better reset our read queue and
+			 * move to the new location.
+			 */
+
+
+			/*
+			 * Record this as the most recent record returned, so that we'll
+			 * release it next time.  This also exposes it to the
+			 * XLogRecXXX(decoder) macros, which pass in the decoder rather
+			 * than the record for historical reasons.
+			 */
+			state->record = state->decode_queue_tail;
+
+			/*
+			 * It should be immediately after the last record returned by
+			 * XLogReadRecord(), or at the position set by XLogBeginRead() if
+			 * XLogReadRecord() hasn't been called yet.  It may be after a
+			 * page header, though.
+			 */
+			Assert(state->record->lsn == state->EndRecPtr ||
+				   (state->EndRecPtr % XLOG_BLCKSZ == 0 &&
+					(state->record->lsn == state->EndRecPtr + SizeOfXLogShortPHD ||
+					 state->record->lsn == state->EndRecPtr + SizeOfXLogLongPHD)));
+
+			/*
+			 * Likewise, set ReadRecPtr and EndRecPtr to correspond to that
+			 * record.
+			 *
+			 * XXX Calling code should perhaps access these through the
+			 * returned decoded record, but for now we'll update them directly
+			 * here, for the benefit of existing code that thinks there's only
+			 * one record in the decoder.
+			 */
+			state->ReadRecPtr = state->record->lsn;
+			state->EndRecPtr = state->record->next_lsn;
+
+			/* XXX can't return pointer to header, will be given back to XLogDecodeRecord()! */
+			*errormsg = NULL;
+			return &state->record->header;
+		}
+		else if (state->errormsg_deferred)
+		{
+			/*
+			 * If we've run out of records, but we have a deferred error, now
+			 * is the time to report it.
+			 */
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			else
+				*errormsg = NULL;
+			state->errormsg_deferred = false;
+
+			/* Report the location of the error. */
+			state->ReadRecPtr = state->DecodeRecPtr;
+			state->EndRecPtr = state->NextRecPtr;
+
+			return NULL;
+		}
+
+		/* We need to get a decoded record into our queue first. */
+		XLogReadRecordInternal(state, true /* wait */ );
+
+		/*
+		 * If that produced neither a queued record nor a queued error, then
+		 * we're at the end (for example, archive recovery with no more files
+		 * available).
+		 */
+		if (state->decode_queue_tail == NULL && !state->errormsg_deferred)
+		{
+			state->EndRecPtr = state->NextRecPtr;
+			*errormsg = NULL;
+			return NULL;
+		}
+	}
+
+	/* unreachable */
+	return NULL;
+}
+
+/*
+ * Try to decode the next available record.  The next record will also be
+ * returned by XLogReadRecord().
+ */
+DecodedXLogRecord *
+XLogReadAhead(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *record = NULL;
+
+	if (!state->errormsg_deferred)
+	{
+		record = XLogReadRecordInternal(state, false);
+		if (state->errormsg_deferred)
+		{
+			/*
+			 * Report the error once, but don't consume it, so that
+			 * XLogReadRecord() can report it too.
+			 */
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			else
+				*errormsg = NULL;
+			return NULL;
+		}
+	}
+	*errormsg = NULL;
+
+	return record;
+}
+
+/*
+ * Allocate space for a decoded record.  The only member of the returned
+ * object that is initialized is the 'oversized' flag; when set, it means the
+ * decoded record didn't fit in the decode buffer and must eventually be
+ * freed explicitly.
+ *
+ * Return NULL if there is no space in the decode buffer and allow_oversized
+ * is false, or if memory allocation fails for an oversized buffer.
+ */
+static DecodedXLogRecord *
+XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)
+{
+	size_t		required_space = DecodeXLogRecordRequiredSpace(xl_tot_len);
+	DecodedXLogRecord *decoded = NULL;
+
+	/* Allocate a circular decode buffer if we don't have one already. */
+	if (unlikely(state->decode_buffer == NULL))
+	{
+		if (state->decode_buffer_size == 0)
+			state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+		state->decode_buffer = palloc(state->decode_buffer_size);
+		state->decode_buffer_head = state->decode_buffer;
+		state->decode_buffer_tail = state->decode_buffer;
+		state->free_decode_buffer = true;
+	}
+	if (state->decode_buffer_head >= state->decode_buffer_tail)
+	{
+		/* Empty, or head is to the right of tail. */
+		if (state->decode_buffer_head + required_space <=
+			state->decode_buffer + state->decode_buffer_size)
+		{
+			/* There is space between head and end. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+		else if (state->decode_buffer + required_space <
+				 state->decode_buffer_tail)
+		{
+			/* There is space between start and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+	else
+	{
+		/* Head is to the left of tail. */
+		if (state->decode_buffer_head + required_space <
+			state->decode_buffer_tail)
+		{
+			/* There is space between head and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+
+	/* Not enough space in the decode buffer.  Are we allowed to allocate? */
+	if (allow_oversized)
+	{
+		decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
+		if (decoded == NULL)
+			return NULL;
+		decoded->oversized = true;
+		return decoded;
+	}
+
+	return decoded;
+}
+
+/*
+ * Try to read and decode the next record and add it to the head of the
+ * decoded record queue.
+ *
+ * If "force" is true, then wait for data to become available, and read a
+ * record even if it doesn't fit in the decode buffer, using overflow storage.
+ *
+ * If "force" is false, then return immediately if we'd have to wait for more
+ * data to become available, or if there isn't enough space in the decode
+ * buffer.
+ *
+ * Return the decoded record, or NULL if there was an error or ... XXX
+ */
+static DecodedXLogRecord *
+XLogReadRecordInternal(XLogReaderState *state, bool force)
 {
 	XLogRecPtr	RecPtr;
 	XLogRecord *record;
@@ -277,6 +547,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	uint32		pageHeaderSize;
 	bool		gotheader;
 	int			readOff;
+	DecodedXLogRecord *decoded;
+	char	   *errormsg; /* not used */
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -286,19 +558,17 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	randAccess = false;
 
 	/* reset error state */
-	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	decoded = NULL;
 
-	ResetDecoder(state);
-
-	RecPtr = state->EndRecPtr;
+	RecPtr = state->NextRecPtr;
 
-	if (state->ReadRecPtr != InvalidXLogRecPtr)
+	if (state->DecodeRecPtr != InvalidXLogRecPtr)
 	{
 		/* read the record after the one we just read */
 
 		/*
-		 * EndRecPtr is pointing to end+1 of the previous WAL record.  If
+		 * NextRecPtr is pointing to end+1 of the previous WAL record.  If
 		 * we're at a page boundary, no more records can fit on the current
 		 * page. We must skip over the page header, but we can't do that until
 		 * we've read in the page, since the header size is variable.
@@ -309,7 +579,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		/*
 		 * Caller supplied a position to start at.
 		 *
-		 * In this case, EndRecPtr should already be pointing to a valid
+		 * In this case, NextRecPtr should already be pointing to a valid
 		 * record starting position.
 		 */
 		Assert(XRecOffIsValid(RecPtr));
@@ -327,7 +597,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	 * fits on the same page.
 	 */
 	readOff = ReadPageInternal(state, targetPagePtr,
-							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
+							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ),
+							   !force);
 	if (readOff < 0)
 		goto err;
 
@@ -374,6 +645,19 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
 	total_len = record->xl_tot_len;
 
+	/* Find space to decode this record. */
+	decoded = XLogReadRecordAlloc(state, total_len, force);
+	if (decoded == NULL)
+	{
+		/*
+		 * We couldn't get space.  Usually this means that the decode buffer
+		 * was full, while trying to read ahead (that is, !force).  It's also
+		 * remotely possible for palloc() to have failed to allocate memory
+		 * for an oversized record.
+		 */
+		goto err;
+	}
+
 	/*
 	 * If the whole record header is on this page, validate it immediately.
 	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
@@ -384,7 +668,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	 */
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
-		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
+		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
 								   randAccess))
 			goto err;
 		gotheader = true;
@@ -439,7 +723,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			/* Wait for the next page to become available */
 			readOff = ReadPageInternal(state, targetPagePtr,
 									   Min(total_len - gotlen + SizeOfXLogShortPHD,
-										   XLOG_BLCKSZ));
+										   XLOG_BLCKSZ),
+									   !force);
 
 			if (readOff < 0)
 				goto err;
@@ -476,7 +761,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
 			if (readOff < pageHeaderSize)
 				readOff = ReadPageInternal(state, targetPagePtr,
-										   pageHeaderSize);
+										   pageHeaderSize, !force);
 
 			Assert(pageHeaderSize <= readOff);
 
@@ -487,7 +772,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
 			if (readOff < pageHeaderSize + len)
 				readOff = ReadPageInternal(state, targetPagePtr,
-										   pageHeaderSize + len);
+										   pageHeaderSize + len,
+										   !force);
 
 			memcpy(buffer, (char *) contdata, len);
 			buffer += len;
@@ -497,7 +783,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			if (!gotheader)
 			{
 				record = (XLogRecord *) state->readRecordBuf;
-				if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
+				if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr,
 										   record, randAccess))
 					goto err;
 				gotheader = true;
@@ -511,15 +797,16 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			goto err;
 
 		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
-		state->ReadRecPtr = RecPtr;
-		state->EndRecPtr = targetPagePtr + pageHeaderSize
+		state->DecodeRecPtr = RecPtr;
+		state->NextRecPtr = targetPagePtr + pageHeaderSize
 			+ MAXALIGN(pageHeader->xlp_rem_len);
 	}
 	else
 	{
 		/* Wait for the record data to become available */
 		readOff = ReadPageInternal(state, targetPagePtr,
-								   Min(targetRecOff + total_len, XLOG_BLCKSZ));
+								   Min(targetRecOff + total_len, XLOG_BLCKSZ),
+								   !force);
 		if (readOff < 0)
 			goto err;
 
@@ -527,9 +814,9 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		if (!ValidXLogRecord(state, record, RecPtr))
 			goto err;
 
-		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+		state->NextRecPtr = RecPtr + MAXALIGN(total_len);
 
-		state->ReadRecPtr = RecPtr;
+		state->DecodeRecPtr = RecPtr;
 	}
 
 	/*
@@ -539,25 +826,55 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->segcxt.ws_segsize - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->segcxt.ws_segsize);
+		state->NextRecPtr += state->segcxt.ws_segsize - 1;
+		state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
 	}
 
-	if (DecodeXLogRecord(state, record, errormsg))
-		return record;
-	else
-		return NULL;
+	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
+	{
+		/* Record the location of the next record. */
+		decoded->next_lsn = state->NextRecPtr;
+
+		/*
+		 * If it's in the decode buffer, mark the decode buffer space as
+		 * occupied.
+		 */
+		if (!decoded->oversized)
+		{
+			/* The new decode buffer head must be MAXALIGNed. */
+			Assert(decoded->size == MAXALIGN(decoded->size));
+			if ((char *) decoded == state->decode_buffer)
+				state->decode_buffer_head = state->decode_buffer + decoded->size;
+			else
+				state->decode_buffer_head += decoded->size;
+		}
+
+		/* Insert it into the queue of decoded records. */
+		Assert(state->decode_queue_head != decoded);
+		if (state->decode_queue_head)
+			state->decode_queue_head->next = decoded;
+		state->decode_queue_head = decoded;
+		if (!state->decode_queue_tail)
+			state->decode_queue_tail = decoded;
+		return decoded;
+	}
 
 err:
+	if (decoded && decoded->oversized)
+		pfree(decoded);
 
 	/*
-	 * Invalidate the read state. We might read from a different source after
-	 * failure.
+	 * Invalidate the read state, if this was an error. We might read from a
+	 * different source after failure.
 	 */
-	XLogReaderInvalReadState(state);
+	if (readOff < 0 || state->errormsg_buf[0] != '\0')
+		XLogReaderInvalReadState(state);
 
-	if (state->errormsg_buf[0] != '\0')
-		*errormsg = state->errormsg_buf;
+	/*
+	 * If an error was written to errmsg_buf, it'll be returned to the caller
+	 * of XLogReadRecord() after all successfully decoded records from the
+	 * read queue.
+	 */
 
 	return NULL;
 }
@@ -573,7 +890,8 @@ err:
  * data and if there hasn't been any error since caching the data.
  */
 static int
-ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
+ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen,
+				 bool nowait)
 {
 	int			readLen;
 	uint32		targetPageOff;
@@ -608,7 +926,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 		readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
 										   state->currRecPtr,
-										   state->readBuf);
+										   state->readBuf,
+										   nowait);
 		if (readLen < 0)
 			goto err;
 
@@ -626,7 +945,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 */
 	readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 									   state->currRecPtr,
-									   state->readBuf);
+									   state->readBuf,
+									   nowait);
 	if (readLen < 0)
 		goto err;
 
@@ -645,7 +965,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	{
 		readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
 										   state->currRecPtr,
-										   state->readBuf);
+										   state->readBuf,
+										   nowait);
 		if (readLen < 0)
 			goto err;
 	}
@@ -664,7 +985,11 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
-	XLogReaderInvalReadState(state);
+	if (state->errormsg_buf[0] != '\0')
+	{
+		state->errormsg_deferred = true;
+		XLogReaderInvalReadState(state);
+	}
 	return -1;
 }
 
@@ -974,7 +1299,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		targetPagePtr = tmpRecPtr - targetRecOff;
 
 		/* Read the page containing the record */
-		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff);
+		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff, false);
 		if (readLen < 0)
 			goto err;
 
@@ -983,7 +1308,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		pageHeaderSize = XLogPageHeaderSize(header);
 
 		/* make sure we have enough data for the page header */
-		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize);
+		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize, false);
 		if (readLen < 0)
 			goto err;
 
@@ -1147,34 +1472,83 @@ WALRead(XLogReaderState *state,
  * ----------------------------------------
  */
 
-/* private function to reset the state between records */
+/*
+ * Private function to reset the state, forgetting all decoded records, if we
+ * are asked to move to a new read position.
+ */
 static void
 ResetDecoder(XLogReaderState *state)
 {
-	int			block_id;
+	DecodedXLogRecord *r;
 
-	state->decoded_record = NULL;
-
-	state->main_data_len = 0;
-
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	/* Reset the decoded record queue, freeing any oversized records. */
+	while ((r = state->decode_queue_tail))
 	{
-		state->blocks[block_id].in_use = false;
-		state->blocks[block_id].has_image = false;
-		state->blocks[block_id].has_data = false;
-		state->blocks[block_id].apply_image = false;
+		state->decode_queue_tail = r->next;
+		if (r->oversized)
+			pfree(r);
 	}
-	state->max_block_id = -1;
+	state->decode_queue_head = NULL;
+	state->decode_queue_tail = NULL;
+	state->record = NULL;
+
+	/* Reset the decode buffer to empty. */
+	state->decode_buffer_head = state->decode_buffer;
+	state->decode_buffer_tail = state->decode_buffer;
+
+	/* Clear error state. */
+	state->errormsg_buf[0] = '\0';
+	state->errormsg_deferred = false;
 }
 
 /*
- * Decode the previously read record.
+ * Compute the maximum possible amount of space that could be required to
+ * decode a record, given xl_tot_len from the record's header.  This is the
+ * amount of output buffer space that we need to decode a record, though we
+ * might not finish up using it all.
+ *
+ * This computation is pessimistic and assumes the maximum possible number of
+ * blocks, due to lack of better information.
+ */
+size_t
+DecodeXLogRecordRequiredSpace(size_t xl_tot_len)
+{
+	size_t size = 0;
+
+	/* Account for the fixed size part of the decoded record struct. */
+	size += offsetof(DecodedXLogRecord, blocks[0]);
+	/* Account for the flexible blocks array of maximum possible size. */
+	size += sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1);
+	/* Account for all the raw main and block data. */
+	size += xl_tot_len;
+	/* We might insert padding before main_data. */
+	size += (MAXIMUM_ALIGNOF - 1);
+	/* We might insert padding before each block's data. */
+	size += (MAXIMUM_ALIGNOF - 1) * (XLR_MAX_BLOCK_ID + 1);
+	/* We might insert padding at the end. */
+	size += (MAXIMUM_ALIGNOF - 1);
+
+	return size;
+}
+
+/*
+ * Decode a record.  "decoded" must point to a MAXALIGNed memory area that has
+ * space for at least DecodeXLogRecordRequiredSpace(record) bytes.  On
+ * success, decoded->size contains the actual space occupied by the decoded
+ * record, which may turn out to be less.
+ *
+ * Only the decoded->oversized member needs to be initialized beforehand; it
+ * will not be modified.  Other members will be initialized as required.
  *
  * On error, a human-readable error message is returned in *errormsg, and
  * the return value is false.
  */
 bool
-DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
+DecodeXLogRecord(XLogReaderState *state,
+				 DecodedXLogRecord *decoded,
+				 XLogRecord *record,
+				 XLogRecPtr lsn,
+				 char **errormsg)
 {
 	/*
 	 * read next _size bytes from record buffer, but check for overrun first.
@@ -1189,17 +1563,20 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	} while(0)
 
 	char	   *ptr;
+	char	   *out;
 	uint32		remaining;
 	uint32		datatotal;
 	RelFileNode *rnode = NULL;
 	uint8		block_id;
 
-	ResetDecoder(state);
-
-	state->decoded_record = record;
-	state->record_origin = InvalidRepOriginId;
-	state->toplevel_xid = InvalidTransactionId;
-
+	decoded->header = *record;
+	decoded->lsn = lsn;
+	decoded->next = NULL;
+	decoded->record_origin = InvalidRepOriginId;
+	decoded->toplevel_xid = InvalidTransactionId;
+	decoded->main_data = NULL;
+	decoded->main_data_len = 0;
+	decoded->max_block_id = -1;
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
 	remaining = record->xl_tot_len - SizeOfXLogRecord;
@@ -1217,7 +1594,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint8));
 
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
@@ -1228,18 +1605,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			uint32		main_data_len;
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint32));
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
 		}
 		else if (block_id == XLR_BLOCK_ID_ORIGIN)
 		{
-			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
+			COPY_HEADER_FIELD(&decoded->record_origin, sizeof(RepOriginId));
 		}
 		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
 		{
-			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+			COPY_HEADER_FIELD(&decoded->toplevel_xid, sizeof(TransactionId));
 		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
@@ -1247,7 +1624,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			DecodedBkpBlock *blk;
 			uint8		fork_flags;
 
-			if (block_id <= state->max_block_id)
+			/* mark any intervening block IDs as not in use */
+			for (int i = decoded->max_block_id + 1; i < block_id; ++i)
+				decoded->blocks[i].in_use = false;
+
+			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
 									  "out-of-order block_id %u at %X/%X",
@@ -1256,9 +1637,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 									  (uint32) state->ReadRecPtr);
 				goto err;
 			}
-			state->max_block_id = block_id;
+			decoded->max_block_id = block_id;
 
-			blk = &state->blocks[block_id];
+			blk = &decoded->blocks[block_id];
 			blk->in_use = true;
 			blk->apply_image = false;
 
@@ -1404,17 +1785,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	/*
 	 * Ok, we've parsed the fragment headers, and verified that the total
 	 * length of the payload in the fragments is equal to the amount of data
-	 * left. Copy the data of each fragment to a separate buffer.
-	 *
-	 * We could just set up pointers into readRecordBuf, but we want to align
-	 * the data for the convenience of the callers. Backup images are not
-	 * copied, however; they don't need alignment.
+	 * left.  Copy the data of each fragment to contiguous space after the
+	 * blocks array, inserting alignment padding before the data fragments so
+	 * they can be cast to struct pointers by REDO routines.
 	 */
+	out = ((char *) decoded) +
+		offsetof(DecodedXLogRecord, blocks) +
+		sizeof(decoded->blocks[0]) * (decoded->max_block_id + 1);
 
 	/* block data first */
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	for (block_id = 0; block_id <= decoded->max_block_id; block_id++)
 	{
-		DecodedBkpBlock *blk = &state->blocks[block_id];
+		DecodedBkpBlock *blk = &decoded->blocks[block_id];
 
 		if (!blk->in_use)
 			continue;
@@ -1423,58 +1805,37 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (blk->has_image)
 		{
-			blk->bkp_image = ptr;
+			/* no need to align image */
+			blk->bkp_image = out;
+			memcpy(out, ptr, blk->bimg_len);
 			ptr += blk->bimg_len;
+			out += blk->bimg_len;
 		}
 		if (blk->has_data)
 		{
-			if (!blk->data || blk->data_len > blk->data_bufsz)
-			{
-				if (blk->data)
-					pfree(blk->data);
-
-				/*
-				 * Force the initial request to be BLCKSZ so that we don't
-				 * waste time with lots of trips through this stanza as a
-				 * result of WAL compression.
-				 */
-				blk->data_bufsz = MAXALIGN(Max(blk->data_len, BLCKSZ));
-				blk->data = palloc(blk->data_bufsz);
-			}
+			out = (char *) MAXALIGN(out);
+			blk->data = out;
 			memcpy(blk->data, ptr, blk->data_len);
 			ptr += blk->data_len;
+			out += blk->data_len;
 		}
 	}
 
 	/* and finally, the main data */
-	if (state->main_data_len > 0)
+	if (decoded->main_data_len > 0)
 	{
-		if (!state->main_data || state->main_data_len > state->main_data_bufsz)
-		{
-			if (state->main_data)
-				pfree(state->main_data);
-
-			/*
-			 * main_data_bufsz must be MAXALIGN'ed.  In many xlog record
-			 * types, we omit trailing struct padding on-disk to save a few
-			 * bytes; but compilers may generate accesses to the xlog struct
-			 * that assume that padding bytes are present.  If the palloc
-			 * request is not large enough to include such padding bytes then
-			 * we'll get valgrind complaints due to otherwise-harmless fetches
-			 * of the padding bytes.
-			 *
-			 * In addition, force the initial request to be reasonably large
-			 * so that we don't waste time with lots of trips through this
-			 * stanza.  BLCKSZ / 2 seems like a good compromise choice.
-			 */
-			state->main_data_bufsz = MAXALIGN(Max(state->main_data_len,
-												  BLCKSZ / 2));
-			state->main_data = palloc(state->main_data_bufsz);
-		}
-		memcpy(state->main_data, ptr, state->main_data_len);
-		ptr += state->main_data_len;
+		out = (char *) MAXALIGN(out);
+		decoded->main_data = out;
+		memcpy(decoded->main_data, ptr, decoded->main_data_len);
+		ptr += decoded->main_data_len;
+		out += decoded->main_data_len;
 	}
 
+	/* Report the actual size we used. */
+	decoded->size = MAXALIGN(out - (char *) decoded);
+	Assert(DecodeXLogRecordRequiredSpace(record->xl_tot_len) >=
+		   decoded->size);
+
 	return true;
 
 shortdata_err:
@@ -1500,10 +1861,11 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	if (rnode)
 		*rnode = bkpb->rnode;
 	if (forknum)
@@ -1523,10 +1885,11 @@ XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return NULL;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 
 	if (!bkpb->has_data)
 	{
@@ -1554,12 +1917,13 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	char	   *ptr;
 	PGAlignedBlock tmp;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
-	if (!record->blocks[block_id].has_image)
+	if (!record->record->blocks[block_id].has_image)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	ptr = bkpb->bkp_image;
 
 	if (bkpb->bimg_info & BKPIMAGE_IS_COMPRESSED)
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 7e915bcadf..db0c801456 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -351,7 +351,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * going to initialize it. And vice versa.
 	 */
 	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
-	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
+	willinit = (record->record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
 	if (!willinit && zeromode)
@@ -829,7 +829,8 @@ wal_segment_close(XLogReaderState *state)
  */
 int
 read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
+					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
+					 bool nowait)
 {
 	XLogRecPtr	read_upto,
 				loc;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5294c78549..9a23743a96 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3886,6 +3886,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_RECOVERY_PAUSE:
 			event_name = "RecoveryPause";
 			break;
+		case WAIT_EVENT_RECOVERY_WAL_FLUSH:
+			event_name = "RecoveryWalFlush";
+			break;
 		case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
 			event_name = "ReplicationOriginDrop";
 			break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index f21f61d5e1..d86092f47f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -111,7 +111,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 	{
 		ReorderBufferAssignChild(ctx->reorder,
 								 txid,
-								 record->decoded_record->xl_xid,
+								 XLogRecGetXid(record),
 								 buf.origptr);
 	}
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7c9d1b67df..2846766312 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -812,7 +812,7 @@ StartReplication(StartReplicationCmd *cmd)
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-					   XLogRecPtr targetRecPtr, char *cur_page)
+					   XLogRecPtr targetRecPtr, char *cur_page, bool nowait)
 {
 	XLogRecPtr	flushptr;
 	int			count;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 2229c86f9a..38ef72f318 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -49,7 +49,8 @@ typedef struct XLogPageReadPrivate
 
 static int	SimpleXLogPageRead(XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
-							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
+							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+							   bool nowait);
 
 /*
  * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline
@@ -239,7 +240,8 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 /* XLogReader callback function, to read a WAL page */
 static int
 SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
+				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+				   bool nowait)
 {
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 	uint32		targetPageOff;
@@ -423,7 +425,7 @@ extractPageInfo(XLogReaderState *record)
 				 RmgrNames[rmid], info);
 	}
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 31e99c2a6d..7259559036 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -333,7 +333,7 @@ WALDumpCloseSegment(XLogReaderState *state)
 /* pg_waldump's XLogReaderRoutine->page_read callback */
 static int
 WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-				XLogRecPtr targetPtr, char *readBuff)
+				XLogRecPtr targetPtr, char *readBuff, bool nowait)
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
@@ -392,10 +392,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
 	 * add an accessor macro for this.
 	 */
 	*fpi_len = 0;
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (XLogRecHasBlockImage(record, block_id))
-			*fpi_len += record->blocks[block_id].bimg_len;
+			*fpi_len += record->record->blocks[block_id].bimg_len;
 	}
 
 	/*
@@ -484,7 +484,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	if (!config->bkp_details)
 	{
 		/* print block references (short format) */
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -515,7 +515,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	{
 		/* print block references (detailed format) */
 		putchar('\n');
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -528,26 +528,26 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				   blk);
 			if (XLogRecHasBlockImage(record, block_id))
 			{
-				if (record->blocks[block_id].bimg_info &
+				if (record->record->blocks[block_id].bimg_info &
 					BKPIMAGE_IS_COMPRESSED)
 				{
 					printf(" (FPW%s); hole: offset: %u, length: %u, "
 						   "compression saved: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length,
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length,
 						   BLCKSZ -
-						   record->blocks[block_id].hole_length -
-						   record->blocks[block_id].bimg_len);
+						   record->record->blocks[block_id].hole_length -
+						   record->record->blocks[block_id].bimg_len);
 				}
 				else
 				{
 					printf(" (FPW%s); hole: offset: %u, length: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length);
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length);
 				}
 			}
 			putchar('\n');
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index b976882229..ad77c04d0f 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -62,7 +62,8 @@ typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
 							   int reqLen,
 							   XLogRecPtr targetRecPtr,
-							   char *readBuf);
+							   char *readBuf,
+							   bool nowait);
 typedef void (*WALSegmentOpenCB) (XLogReaderState *xlogreader,
 								  XLogSegNo nextSegNo,
 								  TimeLineID *tli_p);
@@ -144,6 +145,30 @@ typedef struct
 	uint16		data_bufsz;
 } DecodedBkpBlock;
 
+/*
+ * The decoded contents of a record.  This occupies a contiguous region of
+ * memory, with main_data and blocks[n].data pointing to memory after the
+ * members declared here.
+ */
+typedef struct DecodedXLogRecord
+{
+	/* Private member used for resource management. */
+	size_t		size;			/* total size of decoded record */
+	bool		oversized;		/* outside the regular decode buffer? */
+	struct DecodedXLogRecord *next;	/* decoded record queue  link */
+
+	/* Public members. */
+	XLogRecPtr	lsn;			/* location */
+	XLogRecPtr	next_lsn;		/* location of next record */
+	XLogRecord	header;			/* header */
+	RepOriginId record_origin;
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+	char	   *main_data;		/* record's main data portion */
+	uint32		main_data_len;	/* main data portion's length */
+	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	DecodedBkpBlock blocks[FLEXIBLE_ARRAY_MEMBER];
+} DecodedXLogRecord;
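To make the contiguous layout concrete, a consumer could walk a decoded record roughly like this (a sketch only; the function name is made up):

    /* Illustrative sketch: visit each block reference of a decoded record. */
    static void
    visit_block_refs(DecodedXLogRecord *record)
    {
        for (int block_id = 0; block_id <= record->max_block_id; block_id++)
        {
            DecodedBkpBlock *blk = &record->blocks[block_id];

            if (!blk->in_use)
                continue;
            /*
             * blk->rnode, blk->forknum and blk->blkno identify the block;
             * blk->data (if has_data) and record->main_data point into the
             * same contiguous allocation as the struct itself.
             */
        }
    }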
+
 struct XLogReaderState
 {
 	/*
@@ -168,35 +193,25 @@ struct XLogReaderState
 	void	   *private_data;
 
 	/*
-	 * Start and end point of last record read.  EndRecPtr is also used as the
-	 * position to read next.  Calling XLogBeginRead() sets EndRecPtr to the
-	 * starting position and ReadRecPtr to invalid.
+	 * Start and end point of last record returned by XLogReadRecord().
+	 *
+	 * XXX These are also available as record->lsn and record->next_lsn,
+	 * but since these were part of the public interface...
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
 
-
-	/* ----------------------------------------
-	 * Decoded representation of current record
-	 *
-	 * Use XLogRecGet* functions to investigate the record; these fields
-	 * should not be accessed directly.
-	 * ----------------------------------------
+	/*
+	 * Start and end point of the last record read and decoded by
+	 * XLogReadRecordInternal().  NextRecPtr is also used as the position to
+	 * decode next.  Calling XLogBeginRead() sets NextRecPtr and EndRecPtr to
+	 * the requested starting position.
 	 */
-	XLogRecord *decoded_record; /* currently decoded record */
-
-	char	   *main_data;		/* record's main data portion */
-	uint32		main_data_len;	/* main data portion's length */
-	uint32		main_data_bufsz;	/* allocated size of the buffer */
-
-	RepOriginId record_origin;
+	XLogRecPtr	DecodeRecPtr;	/* start of last record decoded */
+	XLogRecPtr	NextRecPtr;		/* end+1 of last record decoded */
 
-	TransactionId toplevel_xid; /* XID of top-level transaction */
-
-	/* information about blocks referenced by the record. */
-	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
-
-	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	/* Last record returned by XLogReadRecord. */
+	DecodedXLogRecord *record;
 
 	/* ----------------------------------------
 	 * private/internal state
@@ -210,6 +225,26 @@ struct XLogReaderState
 	char	   *readBuf;
 	uint32		readLen;
 
+	/*
+	 * Buffer for decoded records.  This is a circular buffer, though
+	 * individual records can't be split in the middle, so some space is often
+	 * wasted at the end.  Oversized records that don't fit in this space are
+	 * allocated separately.
+	 */
+	char	   *decode_buffer;
+	size_t		decode_buffer_size;
+	bool		free_decode_buffer;		/* need to free? */
+	char	   *decode_buffer_head;		/* write head */
+	char	   *decode_buffer_tail;		/* read head */
+
+	/*
+	 * Queue of records that have been decoded.  This is a linked list that
+	 * usually consists of consecutive records in decode_buffer, but may also
+	 * contain oversized records allocated with palloc().
+	 */
+	DecodedXLogRecord *decode_queue_head;	/* newest decoded record */
+	DecodedXLogRecord *decode_queue_tail;	/* oldest decoded record */
+
 	/* last read XLOG position for data currently in readBuf */
 	WALSegmentContext segcxt;
 	WALOpenSegment seg;
@@ -252,6 +287,7 @@ struct XLogReaderState
 
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
+	bool		errormsg_deferred;
 };
 
 /* Get a new XLogReader */
@@ -264,6 +300,11 @@ extern XLogReaderRoutine *LocalXLogReaderRoutine(void);
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
+/* Optionally provide a circular decoding buffer to allow readahead. */
+extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
+									  void *buffer,
+									  size_t size);
+
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 #ifdef FRONTEND
@@ -274,6 +315,10 @@ extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
 										 char **errormsg);
 
+/* Try to read ahead, if there is space in the decoding buffer. */
+extern DecodedXLogRecord *XLogReadAhead(XLogReaderState *state,
+										char **errormsg);
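A rough sketch of how a read-ahead consumer might drive these two entry points (illustrative only; "decode_buffer_size" and the error handling are placeholders, and the real prefetcher added in a later patch of this series differs in detail):

    /* Hypothetical: attach a decode buffer, then read ahead of replay. */
    char       *buffer = palloc(decode_buffer_size);
    DecodedXLogRecord *record;
    char       *errormsg = NULL;

    XLogReaderSetDecodeBuffer(xlogreader, buffer, decode_buffer_size);

    while ((record = XLogReadAhead(xlogreader, &errormsg)) != NULL)
    {
        /*
         * Inspect record->lsn and record->blocks[] here; XLogReadRecord()
         * will still return these records to the replay loop later.
         */
    }
    if (errormsg)
        elog(LOG, "read ahead stopped: %s", errormsg);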
+
 /* Validate a page */
 extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 										 XLogRecPtr recptr, char *phdr);
@@ -297,25 +342,31 @@ extern bool WALRead(XLogReaderState *state,
 
 /* Functions for decoding an XLogRecord */
 
-extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
+extern size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+extern bool DecodeXLogRecord(XLogReaderState *state,
+							 DecodedXLogRecord *decoded,
+							 XLogRecord *record,
+							 XLogRecPtr lsn,
 							 char **errmsg);
 
-#define XLogRecGetTotalLen(decoder) ((decoder)->decoded_record->xl_tot_len)
-#define XLogRecGetPrev(decoder) ((decoder)->decoded_record->xl_prev)
-#define XLogRecGetInfo(decoder) ((decoder)->decoded_record->xl_info)
-#define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
-#define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
-#define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
-#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
-#define XLogRecGetData(decoder) ((decoder)->main_data)
-#define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
-#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
+#define XLogRecGetTotalLen(decoder) ((decoder)->record->header.xl_tot_len)
+#define XLogRecGetPrev(decoder) ((decoder)->record->header.xl_prev)
+#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
+#define XLogRecGetRmid(decoder) ((decoder)->record->header.xl_rmid)
+#define XLogRecGetXid(decoder) ((decoder)->record->header.xl_xid)
+#define XLogRecGetOrigin(decoder) ((decoder)->record->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->record->toplevel_xid)
+#define XLogRecGetData(decoder) ((decoder)->record->main_data)
+#define XLogRecGetDataLen(decoder) ((decoder)->record->main_data_len)
+#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->record->max_block_id >= 0)
+#define XLogRecMaxBlockId(decoder) ((decoder)->record->max_block_id)
+#define XLogRecGetBlock(decoder, i) (&(decoder)->record->blocks[(i)])
 #define XLogRecHasBlockRef(decoder, block_id) \
-	((decoder)->blocks[block_id].in_use)
+	((decoder)->record->blocks[block_id].in_use)
 #define XLogRecHasBlockImage(decoder, block_id) \
-	((decoder)->blocks[block_id].has_image)
+	((decoder)->record->blocks[block_id].has_image)
 #define XLogRecBlockImageApply(decoder, block_id) \
-	((decoder)->blocks[block_id].apply_image)
+	((decoder)->record->blocks[block_id].apply_image)
 
 #ifndef FRONTEND
 extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index e59b6cf3a9..374c1b16ce 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -49,7 +49,8 @@ extern void FreeFakeRelcacheEntry(Relation fakerel);
 
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
-								 XLogRecPtr targetRecPtr, char *cur_page);
+								 XLogRecPtr targetRecPtr, char *cur_page,
+								 bool nowait);
 extern void wal_segment_open(XLogReaderState *state,
 							 XLogSegNo nextSegNo,
 							 TimeLineID *tli_p);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 343eef507e..addd85620e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -905,6 +905,7 @@ typedef enum
 	WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
 	WAIT_EVENT_RECOVERY_CONFLICT_TABLESPACE,
 	WAIT_EVENT_RECOVERY_PAUSE,
+	WAIT_EVENT_RECOVERY_WAL_FLUSH,
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-- 
2.20.1

v12-0004-Prefetch-referenced-blocks-during-recovery.patchtext/x-patch; charset=US-ASCII; name=v12-0004-Prefetch-referenced-blocks-during-recovery.patchDownload
From 60d3d48b2dce96fe3e9e79d90458596560730a07 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Apr 2020 23:04:51 +1200
Subject: [PATCH v12 4/6] Prefetch referenced blocks during recovery.

Introduce a new GUC recovery_prefetch.  If it is enabled (the default),
then read ahead in the WAL and try to initiate asynchronous reading of
referenced blocks that will soon be needed but are not yet cached in our
buffer pool.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size is used to limit the maximum distance we
are prepared to read ahead in the WAL to find uncached blocks.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  58 ++
 doc/src/sgml/monitoring.sgml                  |  86 +-
 doc/src/sgml/wal.sgml                         |  17 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             |  22 +-
 src/backend/access/transam/xlogprefetch.c     | 895 ++++++++++++++++++
 src/backend/access/transam/xlogreader.c       |   2 +
 src/backend/catalog/system_views.sql          |  14 +
 src/backend/postmaster/pgstat.c               | 103 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  56 +-
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/access/xlog.h                     |   1 +
 src/include/access/xlogprefetch.h             |  79 ++
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/pgstat.h                          |  27 +
 src/include/utils/guc.h                       |   4 +
 src/test/regress/expected/rules.out           |  11 +
 18 files changed, 1387 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetch.c
 create mode 100644 src/include/access/xlogprefetch.h

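For reference, enabling this in postgresql.conf might look as follows; the values are purely illustrative, not recommendations:

    # Illustrative settings only; defaults depend on the platform.
    recovery_prefetch = on              # prefetch referenced blocks during recovery
    recovery_prefetch_fpw = off         # skip blocks covered by full-page images
    wal_decode_buffer_size = 512kB      # how far ahead to look in the WAL
    maintenance_io_concurrency = 10     # cap on concurrent prefetch I/Os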
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ee914740cc..c7dce9b5d8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3257,6 +3257,64 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-recovery-prefetch" xreflabel="recovery_prefetch">
+      <term><varname>recovery_prefetch</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to try to prefetch blocks that are referenced in the WAL but
+        are not yet in the buffer pool, during recovery.  Prefetching blocks
+        that will soon be needed can reduce I/O wait times.
+        See also the <xref linkend="guc-wal-decode-buffer-size"/> and
+        <xref linkend="guc-maintenance-io-concurrency"/> settings, which limit
+        prefetching activity.
+        This setting is enabled
+        by default on systems that support <function>posix_fadvise</function>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+      <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks that were logged with full page images,
+        during recovery.  Often this doesn't help, since such blocks will not
+        be read the first time they are needed and might remain in the buffer
+        pool after that.  However, on file systems with a block size larger
+        than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a
+        costly read-before-write when blocks are later written.
+        The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-decode-buffer-size" xreflabel="wal_decode_buffer_size">
+      <term><varname>wal_decode_buffer_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_decode_buffer_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        A limit on how far ahead the server can look in the WAL, to find
+        blocks to prefetch.  Setting it too high might be counterproductive,
+        if it means that data falls out of the
+        kernel cache before it is needed.  If this value is specified without
+        units, it is taken as bytes.
+        The default is 256kB.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 171ba7049c..217b3cb9a4 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -323,6 +323,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2746,6 +2753,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    copy of the subscribed tables.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+    </row>
+    <row>
+     <entry><structfield>skip_seq</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+    </row>
+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_distance</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>How far ahead of recovery the prefetcher is on average, while recovery is not idle</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_queue_depth</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>Average number of prefetches in flight while recovery is not idle</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-recovery-prefetch"/>,
+   <xref linkend="guc-recovery-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="1">
@@ -4727,8 +4806,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
         view, <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
-        to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+        the <structname>pg_stat_archiver</structname> view,
+        <literal>wal</literal> to reset all the counters shown in the
+        <structname>pg_stat_wal</structname> view or
+        <literal>prefetch_recovery</literal> to reset all the counters shown
+        in the <structname>pg_stat_prefetch_recovery</structname> view.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index d1c3893b14..c51c431398 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -720,6 +720,23 @@
    <acronym>WAL</acronym> call being logged to the server log. This
    option might be replaced by a more general mechanism in the future.
   </para>
+
+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.  The
+   prefetching mechanism is most likely to be effective on systems
+   with <varname>full_page_writes</varname> set to
+   <varname>off</varname> (where that is safe), and where the working
+   set is larger than RAM.  By default, prefetching in recovery is enabled
+   on operating systems that have <function>posix_fadvise</function>
+   support.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..39f9d4e77d 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetch.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f446210684..b2c3315313 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetch.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -109,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+int			wal_decode_buffer_size = 0x80000;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -3686,7 +3688,6 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
 					 xlogfname);
 			set_ps_display(activitymsg);
-
 			restoredFromArchive = RestoreArchivedFile(path, xlogfname,
 													  "RECOVERYXLOG",
 													  wal_segment_size,
@@ -6535,6 +6536,12 @@ StartupXLOG(void)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Set the WAL decode buffer size.  This limits how far ahead we can read
+	 * in the WAL.
+	 */
+	XLogReaderSetDecodeBuffer(xlogreader, NULL, wal_decode_buffer_size);
+
 	/*
 	 * Allocate two page buffers dedicated to WAL consistency checks.  We do
 	 * it this way, rather than just making static arrays, for two reasons:
@@ -7212,6 +7219,7 @@ StartupXLOG(void)
 		{
 			ErrorContextCallback errcallback;
 			TimestampTz xtime;
+			XLogPrefetchState prefetch;
 			PGRUsage	ru0;
 
 			pg_rusage_init(&ru0);
@@ -7222,6 +7230,9 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
 
+			/* Prepare to prefetch, if configured. */
+			XLogPrefetchBegin(&prefetch, xlogreader);
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7251,6 +7262,9 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/* Perform WAL prefetching, if enabled. */
+				XLogPrefetch(&prefetch, xlogreader->ReadRecPtr);
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -7422,6 +7436,9 @@ StartupXLOG(void)
 					 */
 					if (AllowCascadeReplication())
 						WalSndWakeup();
+
+					/* Reset the prefetcher. */
+					XLogPrefetchReconfigure();
 				}
 
 				/* Exit loop if we reached inclusive recovery target */
@@ -7438,6 +7455,7 @@ StartupXLOG(void)
 			/*
 			 * end of main redo apply loop
 			 */
+			XLogPrefetchEnd(&prefetch);
 
 			if (reachedRecoveryTarget)
 			{
@@ -12232,6 +12250,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					currentSource = XLOG_FROM_STREAM;
 					startWalReceiver = true;
+					XLogPrefetchReconfigure();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -12496,6 +12515,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						else
 							havedata = false;
 					}
+
 					if (havedata)
 					{
 						/*
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
new file mode 100644
index 0000000000..a8149b946c
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -0,0 +1,895 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetch.c
+ *
+ * The goal of this module is to read future WAL records and issue
+ * PrefetchSharedBuffer() calls for referenced blocks, so that we avoid I/O
+ * stalls in the main recovery loop.
+ *
+ * When examining a WAL record from the future, we need to consider that a
+ * referenced block or segment file might not exist on disk until this record
+ * or some earlier record has been replayed.  After a crash, a file might also
+ * be missing because it was dropped by a later WAL record; in that case, it
+ * will be recreated when this record is replayed.  These cases are handled by
+ * recognizing them and adding a "filter" that prevents all prefetching of a
+ * certain block range until the present WAL record has been replayed.  Blocks
+ * skipped for these reasons are counted as "skip_new" (that is, cases where we
+ * didn't try to prefetch "new" blocks).
+ *
+ * Blocks found in the buffer pool already are counted as "skip_hit".
+ * Repeated access to the same buffer is detected and skipped, and this is
+ * counted with "skip_seq".  Blocks that were logged with FPWs are skipped if
+ * recovery_prefetch_fpw is off, since on most systems there will be no I/O
+ * stall; this is counted with "skip_fpw".
+ *
+ * The only way we currently have to know that an I/O initiated with
+ * PrefetchSharedBuffer() has completed is that recovery will eventually call
+ * ReadBuffer() and perform a synchronous read.  Therefore, we track the
+ * number of
+ * potentially in-flight I/Os by using a circular buffer of LSNs.  When it's
+ * full, we have to wait for recovery to replay records so that the queue
+ * depth can be reduced, before we can do any more prefetching.  Ideally, this
+ * keeps us the right distance ahead to respect maintenance_io_concurrency.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetch.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log
+ * message that appears at the end of crash recovery.  It's also used to send
+ * messages periodically to the stats collector, to save the counters on disk.
+ */
+#define XLOGPREFETCHER_SAMPLE_DISTANCE 0x40000
+
+/* GUCs */
+bool		recovery_prefetch = true;
+bool		recovery_prefetch_fpw = false;
+
+int			XLogPrefetchReconfigureCount;
+
+/*
+ * A prefetcher object.  There is at most one of these in existence at a time,
+ * recreated whenever there is a configuration change.
+ */
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	DecodedXLogRecord *record;
+	int				next_block_id;
+	bool			shutdown;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation	last_reln;
+	RelFileNode		last_rnode;
+	BlockNumber		last_blkno;
+
+	/* Online averages. */
+	uint64			samples;
+	double			avg_queue_depth;
+	double			avg_distance;
+	XLogRecPtr		next_sample_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB		   *filter_table;
+	dlist_head		filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	int				prefetch_head;
+	int				prefetch_tail;
+	int				prefetch_queue_size;
+	XLogRecPtr		prefetch_queue[MAX_IO_CONCURRENCY + 1];
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode		rnode;
+	XLogRecPtr		filter_until_replayed;
+	BlockNumber		filter_from_block;
+	dlist_node		link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time; /* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 skip_hit;	/* Blocks already buffered. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+	pg_atomic_uint64 skip_seq;	/* Repeat blocks skipped. */
+	float			avg_distance;
+	float			avg_queue_depth;
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32			reset_handled;
+
+	/* Dynamic values */
+	int				distance;	/* Number of bytes ahead in the WAL. */
+	int				queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+static void XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher,
+									  XLogRecPtr replaying_lsn);
+static bool XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher);
+static void XLogPrefetchSaveStats(void);
+static void XLogPrefetchRestoreStats(void);
+
+static XLogPrefetchStats *Stats;
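To make the scheme described in the file header comment concrete, here is one plausible shape for the three queue helpers declared above; this is only a sketch, and the patch's own definitions appear further down in the file:

    static inline void
    XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
                              XLogRecPtr prefetching_lsn)
    {
        /* Remember the record LSN whose block we asked the kernel to read. */
        prefetcher->prefetch_queue[prefetcher->prefetch_head] = prefetching_lsn;
        prefetcher->prefetch_head =
            (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size;
        Stats->queue_depth++;
    }

    static inline void
    XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
    {
        /* Anything queued for an LSN already replayed must have completed. */
        while (prefetcher->prefetch_tail != prefetcher->prefetch_head &&
               prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
        {
            prefetcher->prefetch_tail =
                (prefetcher->prefetch_tail + 1) % prefetcher->prefetch_queue_size;
            Stats->queue_depth--;
        }
    }

    static inline bool
    XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
    {
        /* Full when advancing the head would collide with the tail. */
        return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
            prefetcher->prefetch_tail;
    }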
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&Stats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&Stats->prefetch, 0);
+	pg_atomic_write_u64(&Stats->skip_hit, 0);
+	pg_atomic_write_u64(&Stats->skip_new, 0);
+	pg_atomic_write_u64(&Stats->skip_fpw, 0);
+	pg_atomic_write_u64(&Stats->skip_seq, 0);
+	Stats->avg_distance = 0;
+	Stats->avg_queue_depth = 0;
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	Stats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+	if (!found)
+	{
+		pg_atomic_init_u32(&Stats->reset_request, 0);
+		Stats->reset_handled = 0;
+		pg_atomic_init_u64(&Stats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&Stats->prefetch, 0);
+		pg_atomic_init_u64(&Stats->skip_hit, 0);
+		pg_atomic_init_u64(&Stats->skip_new, 0);
+		pg_atomic_init_u64(&Stats->skip_fpw, 0);
+		pg_atomic_init_u64(&Stats->skip_seq, 0);
+		Stats->avg_distance = 0;
+		Stats->avg_queue_depth = 0;
+		Stats->distance = 0;
+		Stats->queue_depth = 0;
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&Stats->reset_request, 1);
+}
+
+/*
+ * Tell the stats collector to serialize the shared memory counters into the
+ * stats file.
+ */
+static void
+XLogPrefetchSaveStats(void)
+{
+	PgStat_RecoveryPrefetchStats serialized = {
+		.prefetch = pg_atomic_read_u64(&Stats->prefetch),
+		.skip_hit = pg_atomic_read_u64(&Stats->skip_hit),
+		.skip_new = pg_atomic_read_u64(&Stats->skip_new),
+		.skip_fpw = pg_atomic_read_u64(&Stats->skip_fpw),
+		.skip_seq = pg_atomic_read_u64(&Stats->skip_seq),
+		.stat_reset_timestamp = pg_atomic_read_u64(&Stats->reset_time)
+	};
+
+	pgstat_send_recoveryprefetch(&serialized);
+}
+
+/*
+ * Try to restore the shared memory counters from the stats file.
+ */
+static void
+XLogPrefetchRestoreStats(void)
+{
+	PgStat_RecoveryPrefetchStats *serialized = pgstat_fetch_recoveryprefetch();
+
+	if (serialized->stat_reset_timestamp != 0)
+	{
+		pg_atomic_write_u64(&Stats->prefetch, serialized->prefetch);
+		pg_atomic_write_u64(&Stats->skip_hit, serialized->skip_hit);
+		pg_atomic_write_u64(&Stats->skip_new, serialized->skip_new);
+		pg_atomic_write_u64(&Stats->skip_fpw, serialized->skip_fpw);
+		pg_atomic_write_u64(&Stats->skip_seq, serialized->skip_seq);
+		pg_atomic_write_u64(&Stats->reset_time, serialized->stat_reset_timestamp);
+	}
+}
+
+/*
+ * Initialize an XLogPrefetchState object and restore the last saved
+ * statistics from disk.
+ */
+void
+XLogPrefetchBegin(XLogPrefetchState *state, XLogReaderState *reader)
+{
+	XLogPrefetchRestoreStats();
+
+	/* We'll reconfigure on the first call to XLogPrefetch(). */
+	state->reader = reader;
+	state->prefetcher = NULL;
+	state->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+}
+
+/*
+ * Shut down the prefetching infrastructure, if configured.
+ */
+void
+XLogPrefetchEnd(XLogPrefetchState *state)
+{
+	XLogPrefetchSaveStats();
+
+	if (state->prefetcher)
+		XLogPrefetcherFree(state->prefetcher);
+	state->prefetcher = NULL;
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogReaderState *reader)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	/*
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.  The queue has space for up
+	 * general GUC to rate-limit all prefetching.  The queue has space for up
+	 * to the highest possible value of the GUC + 1, because our circular
+	 * buffer has a gap between head and tail when full.
+	prefetcher = palloc0(sizeof(XLogPrefetcher));
+	prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
+	prefetcher->reader = reader;
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	/* Log final statistics. */
+	ereport(LOG,
+			(errmsg("recovery finished prefetching at %X/%X; "
+					"prefetch = " UINT64_FORMAT ", "
+					"skip_hit = " UINT64_FORMAT ", "
+					"skip_new = " UINT64_FORMAT ", "
+					"skip_fpw = " UINT64_FORMAT ", "
+					"skip_seq = " UINT64_FORMAT ", "
+					"avg_distance = %f, "
+					"avg_queue_depth = %f",
+			 (uint32) (prefetcher->reader->EndRecPtr >> 32),
+			 (uint32) (prefetcher->reader->EndRecPtr),
+			 pg_atomic_read_u64(&Stats->prefetch),
+			 pg_atomic_read_u64(&Stats->skip_hit),
+			 pg_atomic_read_u64(&Stats->skip_new),
+			 pg_atomic_read_u64(&Stats->skip_fpw),
+			 pg_atomic_read_u64(&Stats->skip_seq),
+			 Stats->avg_distance,
+			 Stats->avg_queue_depth)));
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Called when recovery is replaying a new LSN, to check if we can read ahead.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	uint32		reset_request;
+
+	/* If an error has occurred or we've hit the end of the WAL, do nothing. */
+	if (prefetcher->shutdown)
+		return;
+
+	/*
+	 * Have any in-flight prefetches definitely completed, judging by the LSN
+	 * that is currently being replayed?
+	 */
+	XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+	/*
+	 * Do we already have the maximum permitted number of I/Os running
+	 * (according to the information we have)?  If so, we have to wait for at
+	 * least one to complete, so give up early and let recovery catch up.
+	 */
+	if (XLogPrefetcherSaturated(prefetcher))
+		return;
+
+	/*
+	 * Can we drop any filters yet?  This happens when the LSN that is
+	 * currently being replayed has moved past a record that prevents
+	 * prefetching of a block range, such as relation extension.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&Stats->reset_request);
+	if (reset_request != Stats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		Stats->reset_handled = reset_request;
+		prefetcher->avg_distance = 0;
+		prefetcher->avg_queue_depth = 0;
+		prefetcher->samples = 0;
+	}
+
+	/* OK, we can now try reading ahead. */
+	XLogPrefetcherScanRecords(prefetcher, replaying_lsn);
+}
+
+/*
+ * Read ahead as far as we are allowed to, considering the LSN that recovery
+ * is currently replaying.
+ */
+static void
+XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	XLogReaderState *reader = prefetcher->reader;
+	DecodedXLogRecord *record;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	for (;;)
+	{
+		char *error;
+		int64		distance;
+
+		/* If we don't already have a record, then try to read one. */
+		if (prefetcher->record == NULL)
+		{
+			record = XLogReadAhead(reader, &error);
+			if (record == NULL)
+			{
+				/* If we got an error, log it and give up. */
+				if (error)
+				{
+					ereport(LOG, (errmsg("recovery no longer prefetching: %s", error)));
+					prefetcher->shutdown = true;
+					Stats->queue_depth = 0;
+					Stats->distance = 0;
+				}
+				/* Otherwise, we'll try again later when more data is here. */
+				return;
+			}
+			prefetcher->record = record;
+			prefetcher->next_block_id = 0;
+		}
+		else
+		{
+			/*
+			 * We ran out of I/O queue space partway through a record.  We'll
+			 * carry on where we left off, according to next_block_id.
+			 */
+			record = prefetcher->record;
+		}
+
+		/* How far ahead of replay are we now? */
+		distance = record->lsn - replaying_lsn;
+
+		/* Update distance shown in shm. */
+		Stats->distance = distance;
+
+		/* Periodically recompute some statistics. */
+		if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+		{
+			/* Compute online averages. */
+			prefetcher->samples++;
+			if (prefetcher->samples == 1)
+			{
+				prefetcher->avg_distance = Stats->distance;
+				prefetcher->avg_queue_depth = Stats->queue_depth;
+			}
+			else
+			{
+				prefetcher->avg_distance +=
+					(Stats->distance - prefetcher->avg_distance) /
+					prefetcher->samples;
+				prefetcher->avg_queue_depth +=
+					(Stats->queue_depth - prefetcher->avg_queue_depth) /
+					prefetcher->samples;
+			}
+
+			/* Expose it in shared memory. */
+			Stats->avg_distance = prefetcher->avg_distance;
+			Stats->avg_queue_depth = prefetcher->avg_queue_depth;
+
+			/* Also periodically save the simple counters. */
+			XLogPrefetchSaveStats();
+
+			prefetcher->next_sample_lsn =
+				replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
+		}
+
+		/* Are we not far enough ahead? */
+		if (distance <= 0)
+		{
+			/* XXX Is this still possible? */
+			prefetcher->record = NULL;		/* skip this record */
+			continue;
+		}
+
+		/*
+		 * If this is a record that creates a new SMGR relation, we'll avoid
+		 * prefetching anything from that rnode until it has been replayed.
+		 */
+		if (replaying_lsn < record->lsn &&
+			record->header.xl_rmid == RM_SMGR_ID &&
+			(record->header.xl_info & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+		{
+			xl_smgr_create *xlrec = (xl_smgr_create *) record->main_data;
+
+			XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0, record->lsn);
+		}
+
+		/* Scan the record's block references. */
+		if (!XLogPrefetcherScanBlocks(prefetcher))
+			return;
+
+		/* Advance to the next record. */
+		prefetcher->record = NULL;
+	}
+}
+
+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+	DecodedXLogRecord *record = prefetcher->record;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	/*
+	 * We might already have been partway through processing this record when
+	 * our queue became saturated, so we need to start where we left off.
+	 */
+	for (int block_id = prefetcher->next_block_id;
+		 block_id <= record->max_block_id;
+		 ++block_id)
+	{
+		DecodedBkpBlock *block = &record->blocks[block_id];
+		PrefetchBufferResult prefetch;
+		SMgrRelation reln;
+
+		/* Ignore everything but the main fork for now. */
+		if (block->forknum != MAIN_FORKNUM)
+			continue;
+
+		/*
+		 * If there is a full page image attached, we won't be reading the
+		 * page, so you might think we should skip it.  However, if the
+		 * underlying filesystem uses larger logical blocks than us, it
+		 * might still need to perform a read-before-write some time later.
+		 * Therefore, only prefetch if configured to do so.
+		 */
+		if (block->has_image && !recovery_prefetch_fpw)
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
+			continue;
+		}
+
+		/*
+		 * If this block will initialize a new page then it's probably a
+		 * relation extension.  Since that might create a new segment, we
+		 * can't try to prefetch this block until the record has been
+		 * replayed, or we might try to open a file that doesn't exist yet.
+		 */
+		if (block->flags & BKPBLOCK_WILL_INIT)
+		{
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+									record->lsn);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Should we skip this block due to a filter? */
+		if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Fast path for repeated references to the same relation. */
+		if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+		{
+			/*
+			 * If this is a repeat access to the same block, then skip it.
+			 *
+			 * XXX We could also check for last_blkno + 1, and update last_blkno;
+			 * it's not clear whether the kernel would then do a better job of
+			 * sequential prefetching.
+			 */
+			if (block->blkno == prefetcher->last_blkno)
+			{
+				pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
+				continue;
+			}
+
+			/* We can avoid calling smgropen(). */
+			reln = prefetcher->last_reln;
+		}
+		else
+		{
+			/* Otherwise we have to open it. */
+			reln = smgropen(block->rnode, InvalidBackendId);
+			prefetcher->last_rnode = block->rnode;
+			prefetcher->last_reln = reln;
+		}
+		prefetcher->last_blkno = block->blkno;
+
+		/* Try to prefetch this block! */
+		prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+		if (BufferIsValid(prefetch.recent_buffer))
+		{
+			/*
+			 * It was already cached, so do nothing.  Perhaps in future we
+			 * could remember the buffer so that recovery doesn't have to look
+			 * it up again.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
+		}
+		else if (prefetch.initiated_io)
+		{
+			/*
+			 * I/O has possibly been initiated (we don't know whether the
+			 * kernel already had the block cached, so for lack of better
+			 * information we assume an I/O was started).  Record
+			 * this as an I/O in progress until eventually we replay this
+			 * LSN.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->prefetch, 1);
+			XLogPrefetcherInitiatedIO(prefetcher, record->lsn);
+			/*
+			 * If the queue is now full, we'll have to wait before processing
+			 * any more blocks from this record, or move to a new record if
+			 * that was the last block.
+			 */
+			if (XLogPrefetcherSaturated(prefetcher))
+			{
+				prefetcher->next_block_id = block_id + 1;
+				return false;
+			}
+		}
+		else
+		{
+			/*
+			 * Neither cached nor initiated.  The underlying segment file
+			 * doesn't exist.  Presumably it will be unlinked by a later WAL
+			 * record.  When recovery reads this block, it will use the
+			 * EXTENSION_CREATE_RECOVERY flag.  We certainly don't want to do
+			 * that sort of thing while merely prefetching, so let's just
+			 * ignore references to this relation until this record is
+			 * replayed, and let recovery create the dummy file or complain if
+			 * something is wrong.
+			 */
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+									record->lsn);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&Stats->reset_request) != Stats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&Stats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&Stats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_new));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_fpw));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_seq));
+	values[6] = Int32GetDatum(Stats->distance);
+	values[7] = Int32GetDatum(Stats->queue_depth);
+	values[8] = Float4GetDatum(Stats->avg_distance);
+	values[9] = Float4GetDatum(Stats->avg_queue_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Compute (n + 1) % prefetch_queue_size, assuming n < prefetch_queue_size,
+ * without using division.
+ */
+static inline int
+XLogPrefetcherNext(XLogPrefetcher *prefetcher, int n)
+{
+	int		next = n + 1;
+
+	return next == prefetcher->prefetch_queue_size ? 0 : next;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the (presumably lower) block
+		 * number there because we don't want to have to track individual
+		 * blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of the
+	 * time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+												   HASH_FIND, NULL);
+
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Insert an LSN into the queue.  The queue must not be full already.  This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+						  XLogRecPtr prefetching_lsn)
+{
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+	prefetcher->prefetch_queue[prefetcher->prefetch_head] = prefetching_lsn;
+	prefetcher->prefetch_head =
+		XLogPrefetcherNext(prefetcher, prefetcher->prefetch_head);
+	Stats->queue_depth++;
+	Assert(Stats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet?  That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches.  For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+		   prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+	{
+		prefetcher->prefetch_tail =
+			XLogPrefetcherNext(prefetcher, prefetcher->prefetch_tail);
+		Stats->queue_depth--;
+		Assert(Stats->queue_depth >= 0);
+	}
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+	int		next = XLogPrefetcherNext(prefetcher, prefetcher->prefetch_head);
+
+	return next == prefetcher->prefetch_tail;
+}
+
+void
+assign_recovery_prefetch(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
+
+void
+assign_recovery_prefetch_fpw(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch_fpw = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 22e5d5ff64..fb0d80e7c7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -866,6 +866,8 @@ err:
 	/*
 	 * Invalidate the read state, if this was an error. We might read from a
 	 * different source after failure.
+	 *
+	 * XXX !?!
 	 */
 	if (readOff < 0 || state->errormsg_buf[0] != '\0')
 		XLogReaderInvalReadState(state);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 923c2e2be1..e05075f546 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -829,6 +829,20 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.skip_hit,
+            s.skip_new,
+            s.skip_fpw,
+            s.skip_seq,
+            s.distance,
+            s.queue_depth,
+            s.avg_distance,
+            s.avg_queue_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 9a23743a96..524d8c395d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "access/xlogprefetch.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -284,6 +285,7 @@ static PgStat_ArchiverStats archiverStats;
 static PgStat_GlobalStats globalStats;
 static PgStat_WalStats walStats;
 static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
+static PgStat_RecoveryPrefetchStats recoveryPrefetchStats;
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -357,6 +359,7 @@ static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
 static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
+static void pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1378,11 +1381,20 @@ pgstat_reset_shared_counters(const char *target)
 		msg.m_resettarget = RESET_BGWRITER;
 	else if (strcmp(target, "wal") == 0)
 		msg.m_resettarget = RESET_WAL;
+	else if (strcmp(target, "prefetch_recovery") == 0)
+	{
+		/*
+		 * We can't ask the stats collector to do this for us as it is not
+		 * attached to shared memory.
+		 */
+		XLogPrefetchRequestResetStats();
+		return;
+	}
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("unrecognized reset target: \"%s\"", target),
-				 errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
+				 errhint("Target must be \"archiver\", \"bgwriter\", \"wal\" or \"prefetch_recovery\".")));
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
 	pgstat_send(&msg, sizeof(msg));
@@ -2715,6 +2727,22 @@ pgstat_fetch_slru(void)
 }
 
 
+/*
+ * ---------
+ * pgstat_fetch_recoveryprefetch() -
+ *
+ *	Support function for restoring the counters managed by xlogprefetch.c.
+ * ---------
+ */
+PgStat_RecoveryPrefetchStats *
+pgstat_fetch_recoveryprefetch(void)
+{
+	backend_read_statsfile();
+
+	return &recoveryPrefetchStats;
+}
+
+
 /* ------------------------------------------------------------
  * Functions for management of the shared-memory PgBackendStatus array
  * ------------------------------------------------------------
@@ -4516,6 +4544,23 @@ pgstat_send_slru(void)
 }
 
 
+/* ----------
+ * pgstat_send_recoveryprefetch() -
+ *
+ *		Send recovery prefetch statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats)
+{
+	PgStat_MsgRecoveryPrefetch msg;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYPREFETCH);
+	msg.m_stats = *stats;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
 /* ----------
  * PgstatCollectorMain() -
  *
@@ -4724,6 +4769,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_slru(&msg.msg_slru, len);
 					break;
 
+				case PGSTAT_MTYPE_RECOVERYPREFETCH:
+					pgstat_recv_recoveryprefetch(&msg.msg_recoveryprefetch, len);
+					break;
+
 				case PGSTAT_MTYPE_FUNCSTAT:
 					pgstat_recv_funcstat(&msg.msg_funcstat, len);
 					break;
@@ -5001,6 +5050,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
 	rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
 	(void) rc;					/* we'll check for error with ferror */
 
+	/*
+	 * Write recovery prefetch stats struct
+	 */
+	rc = fwrite(&recoveryPrefetchStats, sizeof(recoveryPrefetchStats), 1,
+				fpout);
+	(void) rc;					/* we'll check for error with ferror */
+
 	/*
 	 * Walk through the database table.
 	 */
@@ -5261,6 +5317,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 	memset(&archiverStats, 0, sizeof(archiverStats));
 	memset(&walStats, 0, sizeof(walStats));
 	memset(&slruStats, 0, sizeof(slruStats));
+	memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
 
 	/*
 	 * Set the current timestamp (will be kept only in case we can't load an
@@ -5360,6 +5417,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 		goto done;
 	}
 
+	/*
+	 * Read recoveryPrefetchStats struct
+	 */
+	if (fread(&recoveryPrefetchStats, 1, sizeof(recoveryPrefetchStats),
+			  fpin) != sizeof(recoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
+		goto done;
+	}
+
 	/*
 	 * We found an existing collector stats file. Read it and put all the
 	 * hashtable entries into place.
@@ -5661,6 +5730,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 	PgStat_ArchiverStats myArchiverStats;
 	PgStat_WalStats myWalStats;
 	PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
+	PgStat_RecoveryPrefetchStats myRecoveryPrefetchStats;
 	FILE	   *fpin;
 	int32		format_id;
 	const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -5737,6 +5807,18 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 		return false;
 	}
 
+	/*
+	 * Read recovery prefetch stats struct
+	 */
+	if (fread(&myRecoveryPrefetchStats, 1, sizeof(myRecoveryPrefetchStats),
+			  fpin) != sizeof(myRecoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		FreeFile(fpin);
+		return false;
+	}
+
 	/* By default, we're going to return the timestamp of the global file. */
 	*ts = myGlobalStats.stats_timestamp;
 
@@ -5903,6 +5985,13 @@ backend_read_statsfile(void)
 		if (ok && file_ts >= min_ts)
 			break;
 
+		/*
+		 * If we're in crash recovery, the collector may not even be running,
+		 * so work with what we have.
+		 */
+		if (InRecovery)
+			break;
+
 		/* Not there or too old, so kick the collector and wait a bit */
 		if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
 			pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
@@ -6556,6 +6645,18 @@ pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
 	slruStats[msg->m_index].truncate += msg->m_truncate;
 }
 
+/* ----------
+ * pgstat_recv_recoveryprefetch() -
+ *
+ *	Process a recovery prefetch message.
+ * ----------
+ */
+static void
+pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len)
+{
+	recoveryPrefetchStats = msg->m_stats;
+}
+
 /* ----------
  * pgstat_recv_recoveryconflict() -
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..912a8cfcb6 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/xlogprefetch.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -126,6 +127,7 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, PredicateLockShmemSize());
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
+		size = add_size(size, XLogPrefetchShmemSize());
 		size = add_size(size, CLOGShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
@@ -216,6 +218,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 596bcb7b84..a2a54e9bc6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -37,6 +37,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetch.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/storage.h"
@@ -202,6 +203,7 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1248,6 +1250,32 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch referenced blocks during recovery."),
+			gettext_noop("Read ahead of the current replay position to find uncached blocks.")
+		},
+		&recovery_prefetch,
+		/* No point in enabling this on systems without a suitable API. */
+#ifdef USE_PREFETCH
+		true,
+#else
+		false,
+#endif
+		NULL, assign_recovery_prefetch, NULL
+	},
+	{
+		{"recovery_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch blocks that have full page images in the WAL."),
+			gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+						 "entirely overwritten, but if the logical page size of the filesystem is "
+						 "larger than PostgreSQL's, this can be beneficial.  This option has no "
+						 "effect unless recovery_prefetch is enabled.")
+		},
+		&recovery_prefetch_fpw,
+		false,
+		NULL, assign_recovery_prefetch_fpw, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2636,6 +2664,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
+			gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
+			GUC_UNIT_BYTE
+		},
+		&wal_decode_buffer_size,
+		0x80000, 0x10000, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the size of WAL files held for standby servers."),
@@ -2956,7 +2995,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -11608,6 +11648,20 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..e6412ad517 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -234,6 +234,11 @@
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
 
+# - Prefetching during recovery -
+
+#max_recovery_prefetch_distance = 256kB	# -1 disables prefetching
+#recovery_prefetch_fpw = off	# whether to prefetch pages logged with FPW
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..4f58fa029a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern int	wal_decode_buffer_size;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/access/xlogprefetch.h b/src/include/access/xlogprefetch.h
new file mode 100644
index 0000000000..8c04ff8bce
--- /dev/null
+++ b/src/include/access/xlogprefetch.h
@@ -0,0 +1,79 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCH_H
+#define XLOGPREFETCH_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern bool recovery_prefetch;
+extern bool recovery_prefetch_fpw;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern int XLogPrefetchReconfigureCount;
+
+typedef struct XLogPrefetchState
+{
+	XLogReaderState *reader;
+	XLogPrefetcher *prefetcher;
+	int			reconfigure_count;
+} XLogPrefetchState;
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchReconfigure(void);
+extern void XLogPrefetchRequestResetStats(void);
+
+extern void XLogPrefetchBegin(XLogPrefetchState *state, XLogReaderState *reader);
+extern void XLogPrefetchEnd(XLogPrefetchState *state);
+
+/* Functions exposed only for the use of XLogPrefetch(). */
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogReaderState *reader);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch,
+									XLogRecPtr replaying_lsn);
+
+/*
+ * Tell the prefetching module that we are now replaying a given LSN, so that
+ * it can decide how far ahead to read in the WAL, if configured.
+ */
+static inline void
+XLogPrefetch(XLogPrefetchState *state, XLogRecPtr replaying_lsn)
+{
+	/*
+	 * Handle any configuration changes.  Rather than trying to deal with
+	 * various parameter changes, we just tear down and set up a new
+	 * prefetcher if anything we depend on changes.
+	 */
+	if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
+	{
+		/* If we had a prefetcher, tear it down. */
+		if (state->prefetcher)
+		{
+			XLogPrefetcherFree(state->prefetcher);
+			state->prefetcher = NULL;
+		}
+		/* If we want a prefetcher, set it up. */
+		if (recovery_prefetch > 0)
+			state->prefetcher = XLogPrefetcherAllocate(state->reader);
+		state->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	if (state->prefetcher)
+		XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
+}
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d6f3e2d286..a4683052f5 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6162,6 +6162,14 @@
   prorettype => 'bool', proargtypes => '',
   prosrc => 'pg_is_wal_replay_paused' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,float4,float4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth,avg_distance,avg_queue_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index addd85620e..2346591264 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -63,6 +63,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_WAL,
 	PGSTAT_MTYPE_SLRU,
+	PGSTAT_MTYPE_RECOVERYPREFETCH,
 	PGSTAT_MTYPE_FUNCSTAT,
 	PGSTAT_MTYPE_FUNCPURGE,
 	PGSTAT_MTYPE_RECOVERYCONFLICT,
@@ -184,6 +185,19 @@ typedef struct PgStat_TableXactStatus
 	struct PgStat_TableXactStatus *next;	/* next of same subxact */
 } PgStat_TableXactStatus;
 
+/*
+ * Recovery prefetching statistics persisted on disk by pgstat.c, but kept in
+ * shared memory by xlogprefetch.c.
+ */
+typedef struct PgStat_RecoveryPrefetchStats
+{
+	PgStat_Counter prefetch;
+	PgStat_Counter skip_hit;
+	PgStat_Counter skip_new;
+	PgStat_Counter skip_fpw;
+	PgStat_Counter skip_seq;
+	TimestampTz stat_reset_timestamp;
+} PgStat_RecoveryPrefetchStats;
 
 /* ------------------------------------------------------------
  * Message formats follow
@@ -465,6 +479,16 @@ typedef struct PgStat_MsgSLRU
 	PgStat_Counter m_truncate;
 } PgStat_MsgSLRU;
 
+/* ----------
+ * PgStat_MsgRecoveryPrefetch			Sent by XLogPrefetch to save statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgRecoveryPrefetch
+{
+	PgStat_MsgHdr m_hdr;
+	PgStat_RecoveryPrefetchStats m_stats;
+} PgStat_MsgRecoveryPrefetch;
+
 /* ----------
  * PgStat_MsgRecoveryConflict	Sent by the backend upon recovery conflict
  * ----------
@@ -610,6 +634,7 @@ typedef union PgStat_Msg
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgWal msg_wal;
 	PgStat_MsgSLRU msg_slru;
+	PgStat_MsgRecoveryPrefetch msg_recoveryprefetch;
 	PgStat_MsgFuncstat msg_funcstat;
 	PgStat_MsgFuncpurge msg_funcpurge;
 	PgStat_MsgRecoveryConflict msg_recoveryconflict;
@@ -1493,6 +1518,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
 extern void pgstat_send_wal(void);
+extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1509,6 +1535,7 @@ extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_RecoveryPrefetchStats *pgstat_fetch_recoveryprefetch(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 073c8f3e06..6007a81e95 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -441,4 +441,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetch.c */
+extern void assign_recovery_prefetch(bool new_value, void *extra);
+extern void assign_recovery_prefetch_fpw(bool new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index af4192f9a8..44531df144 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1869,6 +1869,17 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.skip_hit,
+    s.skip_new,
+    s.skip_fpw,
+    s.skip_seq,
+    s.distance,
+    s.queue_depth,
+    s.avg_distance,
+    s.avg_queue_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth, avg_distance, avg_queue_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.20.1

v12-0005-WIP-Avoid-extra-buffer-lookup-when-prefetching-W.patchtext/x-patch; charset=US-ASCII; name=v12-0005-WIP-Avoid-extra-buffer-lookup-when-prefetching-W.patchDownload
From a3bedca00e76a78cfa28f7778aab76972dbc1399 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 14 Sep 2020 23:20:55 +1200
Subject: [PATCH v12 5/6] WIP: Avoid extra buffer lookup when prefetching WAL
 blocks.

Provide some workspace in decoded WAL records, so that we can remember
which buffer we recently found a block cached in, for later use when
replaying the record.  Provide a new way to look up a
recently-known buffer and check if it's still valid and has the right
tag.

XXX Needs review to figure out if it's safe or steamrolling over subtleties
---
 src/backend/access/transam/xlog.c         |  2 +-
 src/backend/access/transam/xlogprefetch.c |  6 ++--
 src/backend/access/transam/xlogreader.c   | 13 ++++++++
 src/backend/access/transam/xlogutils.c    | 23 ++++++++++---
 src/backend/storage/buffer/bufmgr.c       | 40 +++++++++++++++++++++++
 src/backend/storage/freespace/freespace.c |  3 +-
 src/include/access/xlogreader.h           |  7 ++++
 src/include/access/xlogutils.h            |  3 +-
 src/include/storage/bufmgr.h              |  2 ++
 9 files changed, 89 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b2c3315313..b552e07c00 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1453,7 +1453,7 @@ checkXLogConsistency(XLogReaderState *record)
 		 * temporary page.
 		 */
 		buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									 RBM_NORMAL_NO_LOG);
+									 RBM_NORMAL_NO_LOG, InvalidBuffer);
 		if (!BufferIsValid(buf))
 			continue;
 
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
index a8149b946c..948a63f25d 100644
--- a/src/backend/access/transam/xlogprefetch.c
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -624,10 +624,10 @@ XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
 		if (BufferIsValid(prefetch.recent_buffer))
 		{
 			/*
-			 * It was already cached, so do nothing.  Perhaps in future we
-			 * could remember the buffer so that recovery doesn't have to look
-			 * it up again.
+			 * It was already cached, so do nothing.  We'll remember the
+			 * buffer, so that recovery can try to avoid looking it up again.
 			 */
+			block->recent_buffer = prefetch.recent_buffer;
 			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
 		}
 		else if (prefetch.initiated_io)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index fb0d80e7c7..9640899ea7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1651,6 +1651,8 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
 
+			blk->recent_buffer = InvalidBuffer;
+
 			COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 			/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
 			if (blk->has_data && blk->data_len == 0)
@@ -1860,6 +1862,15 @@ err:
 bool
 XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 				   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+	return XLogRecGetRecentBuffer(record, block_id, rnode, forknum, blknum,
+								  NULL);
+}
+
+bool
+XLogRecGetRecentBuffer(XLogReaderState *record, uint8 block_id,
+					   RelFileNode *rnode, ForkNumber *forknum,
+					   BlockNumber *blknum, Buffer *recent_buffer)
 {
 	DecodedBkpBlock *bkpb;
 
@@ -1874,6 +1885,8 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 		*forknum = bkpb->forknum;
 	if (blknum)
 		*blknum = bkpb->blkno;
+	if (recent_buffer)
+		*recent_buffer = bkpb->recent_buffer;
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index db0c801456..8a7eac65cf 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -336,11 +336,13 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber blkno;
+	Buffer		recent_buffer;
 	Page		page;
 	bool		zeromode;
 	bool		willinit;
 
-	if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+	if (!XLogRecGetRecentBuffer(record, block_id, &rnode, &forknum, &blkno,
+								&recent_buffer))
 	{
 		/* Caller specified a bogus block_id */
 		elog(PANIC, "failed to locate backup block with ID %d", block_id);
@@ -362,7 +364,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	{
 		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
+									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
+									  recent_buffer);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			elog(ERROR, "failed to restore block image");
@@ -391,7 +394,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	}
 	else
 	{
-		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode);
+		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode,
+									  recent_buffer);
 		if (BufferIsValid(*buf))
 		{
 			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
@@ -439,7 +443,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  */
 Buffer
 XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-					   BlockNumber blkno, ReadBufferMode mode)
+					   BlockNumber blkno, ReadBufferMode mode,
+					   Buffer recent_buffer)
 {
 	BlockNumber lastblock;
 	Buffer		buffer;
@@ -447,6 +452,15 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 
 	Assert(blkno != P_NEW);
 
+	/* Do we have a clue where the buffer might be already? */
+	if (BufferIsValid(recent_buffer) &&
+		mode == RBM_NORMAL &&
+		ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+	{
+		buffer = recent_buffer;
+		goto recent_buffer_fast_path;
+	}
+
 	/* Open the relation at smgr level */
 	smgr = smgropen(rnode, InvalidBackendId);
 
@@ -505,6 +519,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 		}
 	}
 
+recent_buffer_fast_path:
 	if (mode == RBM_NORMAL)
 	{
 		/* check that page has been initialized */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e549fa1d30..97ccb34f57 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -598,6 +598,46 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 	}
 }
 
+/*
+ * ReadRecentBuffer -- try to refind a buffer that we suspect holds a given
+ *		block
+ *
+ * Return true if the buffer is valid, has the correct tag, and we managed
+ * to pin it.
+ */
+bool
+ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum,
+				 Buffer recent_buffer)
+{
+	BufferDesc *bufHdr;
+	BufferTag	tag;
+
+	Assert(BufferIsValid(recent_buffer));
+
+	/* Look up the header by index, and try to pin if shared. */
+	if (BufferIsLocal(recent_buffer))
+		bufHdr = GetBufferDescriptor(-recent_buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(recent_buffer - 1);
+		ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+		if (!PinBuffer(bufHdr, NULL))
+		{
+			/* Not valid, couldn't pin it. */
+			UnpinBuffer(bufHdr, true);
+			return false;
+		}
+	}
+
+	/* Does the tag match? */
+	INIT_BUFFERTAG(tag, rnode, forkNum, blockNum);
+	if (BUFFERTAGS_EQUAL(tag, bufHdr->tag))
+		return true;
+
+	/* Nope -- this isn't the block we seek. */
+	UnpinBuffer(bufHdr, true);
+	return false;
+}
 
 /*
  * ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 6a96126b0c..c998b52c13 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -210,7 +210,8 @@ XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 	blkno = fsm_logical_to_physical(addr);
 
 	/* If the page doesn't exist already, extend */
-	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR);
+	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR,
+								 InvalidBuffer);
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(buf);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index ad77c04d0f..84c5fa744b 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -39,6 +39,7 @@
 #endif
 
 #include "access/xlogrecord.h"
+#include "storage/buf.h"
 
 /* WALOpenSegment represents a WAL segment being read. */
 typedef struct WALOpenSegment
@@ -126,6 +127,9 @@ typedef struct
 	ForkNumber	forknum;
 	BlockNumber blkno;
 
+	/* Workspace for remembering last known buffer holding this block. */
+	Buffer		recent_buffer;
+
 	/* copy of the fork_flags field from the XLogRecordBlockHeader */
 	uint8		flags;
 
@@ -377,5 +381,8 @@ extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *
 extern bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 							   RelFileNode *rnode, ForkNumber *forknum,
 							   BlockNumber *blknum);
+extern bool XLogRecGetRecentBuffer(XLogReaderState *record, uint8 block_id,
+								   RelFileNode *rnode, ForkNumber *forknum,
+								   BlockNumber *blknum, Buffer *recent_buffer);
 
 #endif							/* XLOGREADER_H */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 374c1b16ce..a0c2b60c57 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -42,7 +42,8 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 													Buffer *buf);
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-									 BlockNumber blkno, ReadBufferMode mode);
+									 BlockNumber blkno, ReadBufferMode mode,
+									 Buffer recent_buffer);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..c3280b754e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -176,6 +176,8 @@ extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_r
 												 BlockNumber blockNum);
 extern PrefetchBufferResult PrefetchBuffer(Relation reln, ForkNumber forkNum,
 										   BlockNumber blockNum);
+extern bool ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum,
+							 BlockNumber blockNum, Buffer recent_buffer);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
 extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BlockNumber blockNum, ReadBufferMode mode,
-- 
2.20.1

#53Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Thomas Munro (#51)
Re: WIP: WAL prefetch (another approach)

On Thu, Sep 24, 2020 at 11:38:45AM +1200, Thomas Munro wrote:

On Wed, Sep 9, 2020 at 11:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

OK, thanks for looking into this. I guess I'll wait for an updated patch
before testing this further. The storage has limited capacity so I'd
have to either reduce the amount of data/WAL or juggle with the WAL
segments somehow. Doesn't seem worth it.

Here's a new WIP version that works for archive-based recovery in my tests.

The main change I have been working on is that there is now just a
single XLogReaderState, so no more double-reading and double-decoding
of the WAL. It provides XLogReadRecord(), as before, but now you can
also read further ahead with XLogReadAhead(). The user interface is
much like before, except that the GUCs changed a bit. They are now:

recovery_prefetch=on
recovery_prefetch_fpw=off
wal_decode_buffer_size=256kB
maintenance_io_concurrency=10

I recommend setting maintenance_io_concurrency and
wal_decode_buffer_size much higher than those defaults.

I think you've left the original GUC (replaced by the buffer size) in
the postgresql.conf.sample file. Confused me for a bit ;-)

I've done a bit of testing and so far it seems to work with WAL archive,
so I'll do more testing and benchmarking over the next couple days.
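
One way to flip these for testing without hand-editing postgresql.conf,
as a rough sketch using the GUC names from this patch version
(recovery_prefetch and recovery_prefetch_fpw are SIGHUP;
wal_decode_buffer_size is PGC_POSTMASTER, so it only takes effect after a
restart):

    ALTER SYSTEM SET recovery_prefetch = on;
    ALTER SYSTEM SET recovery_prefetch_fpw = off;
    ALTER SYSTEM SET maintenance_io_concurrency = 100;
    ALTER SYSTEM SET wal_decode_buffer_size = '4MB';  -- applies after restart
    SELECT pg_reload_conf();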

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#54Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Thomas Munro (#52)
2 attachment(s)
Re: WIP: WAL prefetch (another approach)

Hi,

I repeated the same testing I did before - I started with a 32GB pgbench
database with archiving, ran a pgbench for 1h to generate plenty of WAL,
and then performed recovery from a snapshot + archived WAL on different
storage types. The instance was running on NVMe SSD, allowing it to
generate ~200GB of WAL in 1h.

The recovery was done on two storage types - SATA RAID0 with 3 x 7.2k
spinning drives and NVMe SSD. On each storage I tested three configs -
disabled prefetching, defaults and increased values:

wal_decode_buffer_size = 4MB (so 8x the default)
maintenance_io_concurrency = 100 (so 10x the default)

FWIW there's a bunch of issues with the GUCs - the .conf.sample file
does not include e.g. recovery_prefetch, and instead includes
#max_recovery_prefetch_distance which was however replaced by
wal_decode_buffer_size. Another thing is that the actual default values
differ from the docs - e.g. the docs say that wal_decode_buffer_size is
256kB by default, when in fact it's 512kB.
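
For the record, the compiled-in defaults are easy to check from SQL,
since boot_val in pg_settings is the built-in default:

    SELECT name, setting, unit, boot_val
      FROM pg_settings
     WHERE name IN ('recovery_prefetch', 'recovery_prefetch_fpw',
                    'wal_decode_buffer_size', 'maintenance_io_concurrency');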

Now, some results ...

1) NVMe

For the fast storage, there's a modest improvement. The times it took to
recover the ~13k WAL segments are these:

no prefetch: 5532s
default: 4613s
increased: 4549s

So the speedup from enabling prefetch is ~20%, but increasing the values
to make it more aggressive has little effect. Fair enough, the NVMe
is probably fast enough not to benefit from longer I/O queues here.

This is a bit misleading though, because the effectiveness of prefetching
very much depends on the fraction of FPI in the WAL stream - and right
after checkpoint that's most of the WAL, which makes the prefetching
less efficient. We still have to parse the WAL etc. without actually
prefetching anything, so it's pure overhead.
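
This is visible in the counters of the new pg_stat_prefetch_recovery view
while replay is running (on a server that already accepts connections,
e.g. a hot standby): right after a checkpoint one would expect skip_fpw
to dominate, with the prefetch counter taking over later.  Roughly:

    SELECT prefetch, skip_hit, skip_new, skip_fpw, skip_seq,
           distance, queue_depth, avg_distance, avg_queue_depth
      FROM pg_stat_prefetch_recovery;

    -- the counters can be zeroed between runs with
    SELECT pg_stat_reset_shared('prefetch_recovery');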

So I've also generated a chart showing time (in milliseconds) needed to
apply individual WAL segments. It clearly shows that there are 3
checkpoints, and that for each checkpoint it's initially very cheap
(thanks to FPI) and as the fraction of FPIs drops the redo gets more
expensive. At which point the prefetch actually helps, by up to 30% in
some cases (so a bit more than the overall speedup). All of this is
expected, of course.

2) 3 x 7.2k SATA RAID0

For the spinning rust, I had to make some compromises. It's not feasible
to apply all the 200GB of WAL - it would take way too long. I only
applied ~2600 segments for each configuration (so not even one whole
checkpoint), and even that took ~20h in each case.

The durations look like this:

no prefetch: 72446s
default: 73653s
increased: 55409s

So in this case the default setting is way too low - it actually makes
the recovery a bit slower, while with increased values there's ~25%
speedup, which is nice. I assume that if a larger number of WAL segments
was applied (e.g. the whole checkpoint), the prefetch numbers would be
a bit better - the initial FPI part would play smaller role.

From the attached "average per segment" chart you can see that the basic
behavior is about the same as for NVMe - initially it's slower due to
FPIs in the WAL stream, and then it gets ~30% faster.

Overall I think it looks good. I haven't looked at the code very much,
and I can't comment on the potential optimizations mentioned a couple
days ago yet.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

nvme.pngimage/pngDownload
sata.pngimage/pngDownload
#55Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#54)
5 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Sun, Oct 11, 2020 at 12:29 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I repeated the same testing I did before - I started with a 32GB pgbench
database with archiving, ran a pgbench for 1h to generate plenty of WAL,
and then performed recovery from a snapshot + archived WAL on different
storage types. The instance was running on NVMe SSD, allowing it to
generate ~200GB of WAL in 1h.

Thanks for running these tests! And sorry for the delay in replying.

The recovery was done on two storage types - SATA RAID0 with 3 x 7.2k
spinning drives and NVMe SSD. On each storage I tested three configs -
disabled prefetching, defaults and increased values:

wal_decode_buffer_size = 4MB (so 8x the default)
maintenance_io_concurrency = 100 (so 10x the default)

FWIW there's a bunch of issues with the GUCs - the .conf.sample file
does not include e.g. recovery_prefetch, and instead includes
#max_recovery_prefetch_distance which was however replaced by
wal_decode_buffer_size. Another thing is that the actual default values
differ from the docs - e.g. the docs say that wal_decode_buffer_size is
256kB by default, when in fact it's 512kB.

Oops. Fixed, and rebased.

Now, some results ...

1) NVMe

For the fast storage, there's a modest improvement. The times it took to
recover the ~13k WAL segments are these:

no prefetch: 5532s
default: 4613s
increased: 4549s

So the speedup from enabling prefetch is ~20%, but increasing the values
to make it more aggressive has little effect. Fair enough, the NVMe
is probably fast enough not to benefit from longer I/O queues here.

This is a bit misleading though, because the effectiveness of prefetching
very much depends on the fraction of FPI in the WAL stream - and right
after checkpoint that's most of the WAL, which makes the prefetching
less efficient. We still have to parse the WAL etc. without actually
prefetching anything, so it's pure overhead.

Yeah. I've tried to reduce that overhead as much as possible,
decoding once and looking up the buffer only once. The extra overhead
caused by making posix_fadvise() calls is unfortunate (especially if
they aren't helping due to small shared buffers but huge page cache),
but should be fixed by switching to proper AIO, independently of this
patch, which will batch those and remove the pread().

So I've also generated a chart showing time (in milliseconds) needed to
apply individual WAL segments. It clearly shows that there are 3
checkpoints, and that for each checkpoint it's initially very cheap
(thanks to FPI) and as the fraction of FPIs drops the redo gets more
expensive. At which point the prefetch actually helps, by up to 30% in
some cases (so a bit more than the overall speedup). All of this is
expected, of course.

That is a nice way to see the effect of FPI on recovery.

2) 3 x 7.2k SATA RAID0

For the spinning rust, I had to make some compromises. It's not feasible
to apply all the 200GB of WAL - it would take way too long. I only
applied ~2600 segments for each configuration (so not even one whole
checkpoint), and even that took ~20h in each case.

The durations look like this:

no prefetch: 72446s
default: 73653s
increased: 55409s

So in this case the default setting is way too low - it actually makes
the recovery a bit slower, while with increased values there's ~25%
speedup, which is nice. I assume that if a larger number of WAL segments
was applied (e.g. the whole checkpoint), the prefetch numbers would be
a bit better - the initial FPI part would play smaller role.

Huh. Interesting.

From the attached "average per segment" chart you can see that the basic
behavior is about the same as for NVMe - initially it's slower due to
FPIs in the WAL stream, and then it gets ~30% faster.

Yeah. I expect that one day not too far away we'll figure out how to
get rid of FPIs (through a good enough double-write log or
O_ATOMIC)...

Overall I think it looks good. I haven't looked at the code very much,
and I can't comment on the potential optimizations mentioned a couple
days ago yet.

Thanks!

I'm not really sure what to do about archive restore scripts that
block. That seems to be fundamentally incompatible with what I'm
doing here.

Attachments:

v13-0002-Improve-information-about-received-WAL.patchtext/x-patch; charset=US-ASCII; name=v13-0002-Improve-information-about-received-WAL.patchDownload
From 8fa43c9c577b19a2d4b7bc0efbe180912bac37b1 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 10 Aug 2020 17:06:21 +1200
Subject: [PATCH v13 2/6] Improve information about received WAL.

In commit d140f2f3, we cleaned up the distinction between flushed and
written LSN positions.  Go further, and expose the written location in a
way that allows for the associated timeline ID to be read consistently.
Without that, it might be difficult to know the path of the file that
has been written, without data races.

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/replication/walreceiver.c      | 10 ++++--
 src/backend/replication/walreceiverfuncs.c | 41 +++++++++++++++++-----
 src/include/replication/walreceiver.h      | 30 +++++++++-------
 3 files changed, 56 insertions(+), 25 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index babee386c4..ba42f59d6c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -870,6 +870,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 {
 	int			startoff;
 	int			byteswritten;
+	WalRcvData *walrcv = WalRcv;
 
 	while (nbytes > 0)
 	{
@@ -961,7 +962,10 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	}
 
 	/* Update shared-memory status */
-	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+	SpinLockAcquire(&walrcv->mutex);
+	pg_atomic_write_u64(&walrcv->writtenUpto, LogstreamResult.Write);
+	walrcv->writtenTLI = ThisTimeLineID;
+	SpinLockRelease(&walrcv->mutex);
 }
 
 /*
@@ -987,7 +991,7 @@ XLogWalRcvFlush(bool dying)
 		{
 			walrcv->latestChunkStart = walrcv->flushedUpto;
 			walrcv->flushedUpto = LogstreamResult.Flush;
-			walrcv->receivedTLI = ThisTimeLineID;
+			walrcv->flushedTLI = ThisTimeLineID;
 		}
 		SpinLockRelease(&walrcv->mutex);
 
@@ -1327,7 +1331,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	receive_start_tli = WalRcv->receiveStartTLI;
 	written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto);
 	flushed_lsn = WalRcv->flushedUpto;
-	received_tli = WalRcv->receivedTLI;
+	received_tli = WalRcv->flushedTLI;
 	last_send_time = WalRcv->lastMsgSendTime;
 	last_receipt_time = WalRcv->lastMsgReceiptTime;
 	latest_end_lsn = WalRcv->latestWalEnd;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index c3e317df9f..3bd1fadbd3 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -284,10 +284,12 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 	 * If this is the first startup of walreceiver (on this timeline),
 	 * initialize flushedUpto and latestChunkStart to the starting point.
 	 */
-	if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
+	if (walrcv->receiveStart == 0 || walrcv->flushedTLI != tli)
 	{
+		pg_atomic_write_u64(&walrcv->writtenUpto, recptr);
+		walrcv->writtenTLI = tli;
 		walrcv->flushedUpto = recptr;
-		walrcv->receivedTLI = tli;
+		walrcv->flushedTLI = tli;
 		walrcv->latestChunkStart = recptr;
 	}
 	walrcv->receiveStart = recptr;
@@ -309,10 +311,10 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
  * Optionally, returns the previous chunk start, that is the first byte
  * written in the most recent walreceiver flush cycle.  Callers not
  * interested in that value may pass NULL for latestChunkStart. Same for
- * receiveTLI.
+ * flushedTLI.
  */
 XLogRecPtr
-GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *flushedTLI)
 {
 	WalRcvData *walrcv = WalRcv;
 	XLogRecPtr	recptr;
@@ -321,8 +323,8 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 	recptr = walrcv->flushedUpto;
 	if (latestChunkStart)
 		*latestChunkStart = walrcv->latestChunkStart;
-	if (receiveTLI)
-		*receiveTLI = walrcv->receivedTLI;
+	if (flushedTLI)
+		*flushedTLI = walrcv->flushedTLI;
 	SpinLockRelease(&walrcv->mutex);
 
 	return recptr;
@@ -330,14 +332,35 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 
 /*
  * Returns the last+1 byte position that walreceiver has written.
- * This returns a recently written value without taking a lock.
+ *
+ * The other arguments are similar to GetWalRcvFlushRecPtr()'s.
  */
 XLogRecPtr
-GetWalRcvWriteRecPtr(void)
+GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *writtenTLI)
 {
 	WalRcvData *walrcv = WalRcv;
+	XLogRecPtr	recptr;
+
+	SpinLockAcquire(&walrcv->mutex);
+	recptr = pg_atomic_read_u64(&walrcv->writtenUpto);
+	if (latestChunkStart)
+		*latestChunkStart = walrcv->latestChunkStart;
+	if (writtenTLI)
+		*writtenTLI = walrcv->writtenTLI;
+	SpinLockRelease(&walrcv->mutex);
 
-	return pg_atomic_read_u64(&walrcv->writtenUpto);
+	return recptr;
+}
+
+/*
+ * For callers that don't need a consistent LSN, TLI pair, and that don't mind
+ * a potentially slightly out of date value in exchange for speed, this
+ * version provides an unlocked view of the latest written location.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtrUnlocked(void)
+{
+	return pg_atomic_read_u64(&WalRcv->writtenUpto);
 }
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 1b05b39df4..84f84567cd 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -74,14 +74,25 @@ typedef struct
 	TimeLineID	receiveStartTLI;
 
 	/*
-	 * flushedUpto-1 is the last byte position that has already been received,
-	 * and receivedTLI is the timeline it came from.  At the first startup of
+	 * flushedUpto-1 is the last byte position that has already been flushed,
+	 * and flushedTLI is the timeline it came from.  At the first startup of
 	 * walreceiver, these are set to receiveStart and receiveStartTLI. After
 	 * that, walreceiver updates these whenever it flushes the received WAL to
 	 * disk.
 	 */
 	XLogRecPtr	flushedUpto;
-	TimeLineID	receivedTLI;
+	TimeLineID	flushedTLI;
+
+	/*
+	 * writtenUpto-1 is like flushedUpto-1, except that it is updated without
+	 * waiting for the flush, as soon as the data has been written to disk and
+	 * is available for reading.  It is an atomic type so that we can read it
+	 * without locks.  We still acquire the spinlock in cases where it is
+	 * written or read along with the TLI, so that they can be accessed
+	 * together consistently.
+	 */
+	pg_atomic_uint64 writtenUpto;
+	TimeLineID	writtenTLI;
 
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
@@ -142,14 +153,6 @@ typedef struct
 
 	slock_t		mutex;			/* locks shared variables shown above */
 
-	/*
-	 * Like flushedUpto, but advanced after writing and before flushing,
-	 * without the need to acquire the spin lock.  Data can be read by another
-	 * process up to this point, but shouldn't be used for data integrity
-	 * purposes.
-	 */
-	pg_atomic_uint64 writtenUpto;
-
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.  But we do need atomic fetch and
@@ -457,8 +460,9 @@ extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname,
 								 bool create_temp_slot);
-extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
-extern XLogRecPtr GetWalRcvWriteRecPtr(void);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *flushedTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *writtenTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtrUnlocked(void);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.20.1

v13-0003-Provide-XLogReadAhead-to-decode-future-WAL-recor.patch (text/x-patch)
From 6e60d070002820adf61529dcbc64723d1466699a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 10 Aug 2020 17:06:21 +1200
Subject: [PATCH v13 3/6] Provide XLogReadAhead() to decode future WAL records.

Teach xlogreader.c to decode its output into a circular buffer, to
support a future prefetching patch.  Provides two new interfaces:

 * XLogReadRecord() works as before, except that it returns a pointer to
   a new decoded record object rather than just the header

 * XLogReadAhead() implements a second cursor that allows you to read
   further ahead, as long as there is enough space in the circular decoding
   buffer

To support existing callers of XLogReadRecord(), the most recently
returned record also becomes the "current" record, for the purpose of
calls to XLogRecGetXXX() macros and functions, so that the multi-record
nature of the WAL decoder is hidden from code paths that don't need to
care about this change.

To support opportunistic readahead, the page-read callback function
gains a "noblock" parameter.  This allows calls to XLogReadAhead() to
return without waiting if there is currently no data available, in
particular in the case of streaming replication.  For non-blocking
XLogReadAhead() to work, a page-read callback that understands "noblock"
must be supplied.  Existing callbacks that ignore it work as before, as
long as you only use the XLogReadRecord() interface.

The main XLogPageRead() routine used by recovery is extended to respect
noblock mode when the WAL source is a walreceiver.

Very large records that don't fit in the circular buffer are marked as
"oversized" and allocated and freed individually.  The decoding buffer can
be placed in shared memory, for potential future work on parallelizing
recovery.

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/access/transam/generic_xlog.c |   6 +-
 src/backend/access/transam/xlog.c         | 105 +++-
 src/backend/access/transam/xlogreader.c   | 620 +++++++++++++++++-----
 src/backend/access/transam/xlogutils.c    |   5 +-
 src/backend/postmaster/pgstat.c           |   3 +
 src/backend/replication/logical/decode.c  |   2 +-
 src/backend/replication/walsender.c       |   2 +-
 src/bin/pg_rewind/parsexlog.c             |   8 +-
 src/bin/pg_waldump/pg_waldump.c           |  24 +-
 src/include/access/xlogreader.h           | 127 +++--
 src/include/access/xlogutils.h            |   3 +-
 src/include/pgstat.h                      |   1 +
 12 files changed, 699 insertions(+), 207 deletions(-)
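
Usage notes (illustrative only, not code from this patch): the reader now has
two cursors over the same WAL stream.  XLogReadRecord() keeps its old contract
and returns records in order, while XLogReadAhead() peeks further ahead
without waiting, provided the page-read callback honours "noblock" and the
decoded records fit in the decode buffer.  A later patch in this series drives
prefetching from the read-ahead cursor roughly as sketched here (hypothetical
fragment; error handling and block iteration omitted):

    DecodedXLogRecord *ahead;
    XLogRecord *record;
    char       *errormsg;

    /* Read-ahead cursor: returns NULL when no data is available yet, when
     * the decode buffer is full, or on error (reported again later). */
    while ((ahead = XLogReadAhead(xlogreader, &errormsg)) != NULL)
    {
        /* ... call PrefetchBuffer() for blocks referenced by "ahead" ... */
    }

    /* Replay cursor: unchanged interface; also consumes decoded records. */
    record = XLogReadRecord(xlogreader, &errormsg);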

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 5164a1c2f3..5f6df896ad 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -482,10 +482,10 @@ generic_redo(XLogReaderState *record)
 	uint8		block_id;
 
 	/* Protect limited size of buffers[] array */
-	Assert(record->max_block_id < MAX_GENERIC_XLOG_PAGES);
+	Assert(XLogRecMaxBlockId(record) < MAX_GENERIC_XLOG_PAGES);
 
 	/* Iterate over blocks */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		XLogRedoAction action;
 
@@ -525,7 +525,7 @@ generic_redo(XLogReaderState *record)
 	}
 
 	/* Changes are done: unlock and release all buffers */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (BufferIsValid(buffers[block_id]))
 			UnlockReleaseBuffer(buffers[block_id]);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index aa63f37615..2bdbebbb91 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -211,7 +211,8 @@ static XLogRecPtr LastRec;
 
 /* Local copy of WalRcv->flushedUpto */
 static XLogRecPtr flushedUpto = 0;
-static TimeLineID receiveTLI = 0;
+static XLogRecPtr writtenUpto = 0;
+static TimeLineID writtenTLI = 0;
 
 /*
  * During recovery, lastFullPageWrites keeps track of full_page_writes that
@@ -911,9 +912,11 @@ static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
+						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+						 bool nowait);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-										bool fetching_ckpt, XLogRecPtr tliRecPtr);
+										bool fetching_ckpt, XLogRecPtr tliRecPtr,
+										bool nowait);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
@@ -1417,7 +1420,7 @@ checkXLogConsistency(XLogReaderState *record)
 
 	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		Buffer		buf;
 		Page		page;
@@ -4347,6 +4350,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 		record = XLogReadRecord(xlogreader, &errormsg);
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
+
 		if (record == NULL)
 		{
 			if (readFile >= 0)
@@ -4390,6 +4394,42 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 
 		if (record)
 		{
+			if (readSource == XLOG_FROM_STREAM)
+			{
+				/*
+				 * In streaming mode, we allow ourselves to read records that
+				 * have been written but not yet flushed, for increased
+				 * concurrency.  We still have to wait until the record has
+				 * been flushed before allowing it to be replayed.
+				 *
+				 * XXX This logic preserves the traditional behaviour where we
+				 * didn't replay records until the walreceiver flushed them,
+				 * except that now we read and decode them sooner.  Could it
+				 * be relaxed even more?  Isn't the real data integrity
+				 * requirement for _writeback_ to stall until the WAL is
+				 * durable, not recovery, just as on a primary?
+				 *
+				 * XXX Are there any circumstances in which this should be
+				 * interruptible?
+				 *
+				 * XXX We don't replicate the XLogReceiptTime etc logic from
+				 * WaitForWALToBecomeAvailable() here...  probably need to
+				 * refactor/share code?
+				 */
+				if (EndRecPtr < flushedUpto)
+				{
+					while (EndRecPtr < (flushedUpto = GetWalRcvFlushRecPtr(NULL, NULL)))
+					{
+						(void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
+										 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+										 -1,
+										 WAIT_EVENT_RECOVERY_WAL_FLUSH);
+						CHECK_FOR_INTERRUPTS();
+						ResetLatch(&XLogCtl->recoveryWakeupLatch);
+					}
+				}
+			}
+
 			/* Great, got a record */
 			return record;
 		}
@@ -10115,7 +10155,7 @@ xlog_redo(XLogReaderState *record)
 		 * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info
 		 * code just to distinguish them for statistics purposes.
 		 */
-		for (uint8 block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (uint8 block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			Buffer		buffer;
 
@@ -10251,7 +10291,7 @@ xlog_block_info(StringInfo buf, XLogReaderState *record)
 	int			block_id;
 
 	/* decode block references */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
@@ -11873,7 +11913,7 @@ CancelBackup(void)
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
-			 XLogRecPtr targetRecPtr, char *readBuf)
+			 XLogRecPtr targetRecPtr, char *readBuf, bool nowait)
 {
 	XLogPageReadPrivate *private =
 	(XLogPageReadPrivate *) xlogreader->private_data;
@@ -11885,6 +11925,15 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
 
+	/*
+	 * If streaming and asked not to wait, return as quickly as possible if
+	 * the data we want isn't available immediately.  Use an unlocked read of
+	 * the latest written position.
+	 */
+	if (readSource == XLOG_FROM_STREAM && nowait &&
+		GetWalRcvWriteRecPtrUnlocked() < targetPagePtr + reqLen)
+		return -1;
+
 	/*
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
@@ -11895,6 +11944,9 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		/*
 		 * Request a restartpoint if we've replayed too much xlog since the
 		 * last one.
+		 *
+		 * XXX Why is this here?  Move it to recovery loop, since it's based
+		 * on replay position, not read position?
 		 */
 		if (bgwriterLaunched)
 		{
@@ -11917,12 +11969,13 @@ retry:
 	/* See if we need to retrieve more data */
 	if (readFile < 0 ||
 		(readSource == XLOG_FROM_STREAM &&
-		 flushedUpto < targetPagePtr + reqLen))
+		 writtenUpto < targetPagePtr + reqLen))
 	{
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
 										 private->randAccess,
 										 private->fetching_ckpt,
-										 targetRecPtr))
+										 targetRecPtr,
+										 nowait))
 		{
 			if (readFile >= 0)
 				close(readFile);
@@ -11948,10 +12001,10 @@ retry:
 	 */
 	if (readSource == XLOG_FROM_STREAM)
 	{
-		if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
+		if (((targetPagePtr) / XLOG_BLCKSZ) != (writtenUpto / XLOG_BLCKSZ))
 			readLen = XLOG_BLCKSZ;
 		else
-			readLen = XLogSegmentOffset(flushedUpto, wal_segment_size) -
+			readLen = XLogSegmentOffset(writtenUpto, wal_segment_size) -
 				targetPageOff;
 	}
 	else
@@ -12071,7 +12124,8 @@ next_record_is_invalid:
  */
 static bool
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-							bool fetching_ckpt, XLogRecPtr tliRecPtr)
+							bool fetching_ckpt, XLogRecPtr tliRecPtr,
+							bool nowait)
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
@@ -12174,6 +12228,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * hope...
 					 */
 
+					/* If we were asked not to wait, give up immediately. */
+					if (nowait)
+						return false;
+
 					/*
 					 * We should be able to move to XLOG_FROM_STREAM only in
 					 * standby mode.
@@ -12287,6 +12345,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				if (readFile >= 0)
 					return true;	/* success! */
 
+				/* If we were asked not to wait, give up immediately. */
+				if (nowait)
+					return false;
+
 				/*
 				 * Nope, not found in archive or pg_wal.
 				 */
@@ -12364,7 +12426,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
 											 PrimarySlotName,
 											 wal_receiver_create_temp_slot);
-						flushedUpto = 0;
+						writtenUpto = 0;
 					}
 
 					/*
@@ -12387,15 +12449,16 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * be updated on each cycle. When we are behind,
 					 * XLogReceiptTime will not advance, so the grace time
 					 * allotted to conflicting queries will decrease.
+					 *
 					 */
-					if (RecPtr < flushedUpto)
+					if (RecPtr < writtenUpto)
 						havedata = true;
 					else
 					{
 						XLogRecPtr	latestChunkStart;
 
-						flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
-						if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
+						writtenUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &writtenTLI);
+						if (RecPtr < writtenUpto && writtenTLI == curFileTLI)
 						{
 							havedata = true;
 							if (latestChunkStart <= RecPtr)
@@ -12421,9 +12484,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						if (readFile < 0)
 						{
 							if (!expectedTLEs)
-								expectedTLEs = readTimeLineHistory(receiveTLI);
+								expectedTLEs = readTimeLineHistory(writtenTLI);
 							readFile = XLogFileRead(readSegNo, PANIC,
-													receiveTLI,
+													writtenTLI,
 													XLOG_FROM_STREAM, false);
 							Assert(readFile >= 0);
 						}
@@ -12437,6 +12500,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						break;
 					}
 
+					/* If we were asked not to wait, give up immediately. */
+					if (nowait)
+						return false;
+
 					/*
 					 * Data not here yet. Check for trigger, then wait for
 					 * walreceiver to wake us up when new WAL arrives.
@@ -12473,6 +12540,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * Wait for more WAL to arrive. Time out after 5 seconds
 					 * to react to a trigger file promptly and to check if the
 					 * WAL receiver is still active.
+					 *
+					 * XXX This is signalled on *flush*, not on write.  Oops.
 					 */
 					(void) WaitLatch(MyLatch,
 									 WL_LATCH_SET | WL_TIMEOUT |
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a63ad8cfd0..22e5d5ff64 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -37,7 +37,9 @@ static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
 			pg_attribute_printf(2, 3);
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
-							 int reqLen);
+							 int reqLen, bool nowait);
+size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+static DecodedXLogRecord *XLogReadRecordInternal(XLogReaderState *state, bool force);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
@@ -50,6 +52,8 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 /* size of the buffer allocated for error message. */
 #define MAX_ERRORMSG_LEN 1000
 
+#define DEFAULT_DECODE_BUFFER_SIZE 0x10000
+
 /*
  * Construct a string in state->errormsg_buf explaining what's wrong with
  * the current record being read.
@@ -64,6 +68,8 @@ report_invalid_record(XLogReaderState *state, const char *fmt,...)
 	va_start(args, fmt);
 	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
 	va_end(args);
+
+	state->errormsg_deferred = true;
 }
 
 /*
@@ -86,8 +92,6 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	/* initialize caller-provided support functions */
 	state->routine = *routine;
 
-	state->max_block_id = -1;
-
 	/*
 	 * Permanently allocate readBuf.  We do it this way, rather than just
 	 * making a static array, for two reasons: (1) no need to waste the
@@ -138,18 +142,11 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 void
 XLogReaderFree(XLogReaderState *state)
 {
-	int			block_id;
-
 	if (state->seg.ws_file != -1)
 		state->routine.segment_close(state);
 
-	for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
-	{
-		if (state->blocks[block_id].data)
-			pfree(state->blocks[block_id].data);
-	}
-	if (state->main_data)
-		pfree(state->main_data);
+	if (state->decode_buffer && state->free_decode_buffer)
+		pfree(state->decode_buffer);
 
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
@@ -158,6 +155,22 @@ XLogReaderFree(XLogReaderState *state)
 	pfree(state);
 }
 
+/*
+ * Set the size of the decoding buffer.  A pointer to a caller supplied memory
+ * region may also be passed in, in which case non-oversized records will be
+ * decoded there.
+ */
+void
+XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size)
+{
+	Assert(state->decode_buffer == NULL);
+
+	state->decode_buffer = buffer;
+	state->decode_buffer_size = size;
+	state->decode_buffer_head = buffer;
+	state->decode_buffer_tail = buffer;
+}
+
 /*
  * Allocate readRecordBuf to fit a record of at least the given length.
  * Returns true if successful, false if out of memory.
@@ -245,7 +258,9 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
 
 	/* Begin at the passed-in record pointer. */
 	state->EndRecPtr = RecPtr;
+	state->NextRecPtr = RecPtr;
 	state->ReadRecPtr = InvalidXLogRecPtr;
+	state->DecodeRecPtr = InvalidXLogRecPtr;
 }
 
 /*
@@ -266,6 +281,261 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  */
 XLogRecord *
 XLogReadRecord(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *record;
+
+	/* We can release the most recently returned record. */
+	if (state->record)
+	{
+		/*
+		 * Remove it from the decoded record queue.  It must be the oldest
+		 * item decoded, decode_queue_tail.
+		 */
+		record = state->record;
+		Assert(record == state->decode_queue_tail);
+		state->record = NULL;
+		state->decode_queue_tail = record->next;
+
+		/* It might also be the newest item decoded, decode_queue_head. */
+		if (state->decode_queue_head == record)
+			state->decode_queue_head = NULL;
+
+		/* Release the space. */
+		if (unlikely(record->oversized))
+		{
+			/* It's not in the decode buffer, so free it to release space. */
+			pfree(record);
+		}
+		else
+		{
+			/* It must be the tail record in the decode buffer. */
+			Assert(state->decode_buffer_tail == (char *) record);
+
+			/*
+			 * We need to update tail to point to the next record that is in
+			 * the decode buffer, if any, being careful to skip oversized ones
+			 * (they're not in the decode buffer).
+			 */
+			record = record->next;
+			while (unlikely(record && record->oversized))
+				record = record->next;
+			if (record)
+			{
+				/* Adjust tail to release space. */
+				state->decode_buffer_tail = (char *) record;
+			}
+			else
+			{
+				/* Nothing else in the decode buffer, so just reset it. */
+				state->decode_buffer_tail = state->decode_buffer;
+				state->decode_buffer_head = state->decode_buffer;
+			}
+		}
+	}
+
+	for (;;)
+	{
+		/* We can now return the tail item in the read queue, if there is one. */
+		if (state->decode_queue_tail)
+		{
+			/*
+			 * Is this record at the LSN that the caller expects?  If it
+			 * isn't, this indicates that EndRecPtr has been moved to a new
+			 * position by the caller, so we'd better reset our read queue and
+			 * move to the new location.
+			 */
+
+
+			/*
+			 * Record this as the most recent record returned, so that we'll
+			 * release it next time.  This also exposes it to the
+			 * XLogRecXXX(decoder) macros, which are passed the decoder
+			 * rather than the record, for historical reasons.
+			 */
+			state->record = state->decode_queue_tail;
+
+			/*
+			 * It should be immediately after the last record returned by
+			 * XLogReadRecord(), or at the position set by XLogBeginRead() if
+			 * XLogReadRecord() hasn't been called yet.  It may be after a
+			 * page header, though.
+			 */
+			Assert(state->record->lsn == state->EndRecPtr ||
+				   (state->EndRecPtr % XLOG_BLCKSZ == 0 &&
+					(state->record->lsn == state->EndRecPtr + SizeOfXLogShortPHD ||
+					 state->record->lsn == state->EndRecPtr + SizeOfXLogLongPHD)));
+
+			/*
+			 * Likewise, set ReadRecPtr and EndRecPtr to correspond to that
+			 * record.
+			 *
+			 * XXX Calling code should perhaps access these through the
+			 * returned decoded record, but for now we'll update them directly
+			 * here, for the benefit of existing code that thinks there's only
+			 * one record in the decoder.
+			 */
+			state->ReadRecPtr = state->record->lsn;
+			state->EndRecPtr = state->record->next_lsn;
+
+			/* XXX can't return pointer to header, will be given back to XLogDecodeRecord()! */
+			*errormsg = NULL;
+			return &state->record->header;
+		}
+		else if (state->errormsg_deferred)
+		{
+			/*
+			 * If we've run out of records, but we have a deferred error, now
+			 * is the time to report it.
+			 */
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			else
+				*errormsg = NULL;
+			state->errormsg_deferred = false;
+
+			/* Report the location of the error. */
+			state->ReadRecPtr = state->DecodeRecPtr;
+			state->EndRecPtr = state->NextRecPtr;
+
+			return NULL;
+		}
+
+		/* We need to get a decoded record into our queue first. */
+		XLogReadRecordInternal(state, true /* wait */ );
+
+		/*
+		 * If that produced neither a queued record nor a queued error, then
+		 * we're at the end (for example, archive recovery with no more files
+		 * available).
+		 */
+		if (state->decode_queue_tail == NULL && !state->errormsg_deferred)
+		{
+			state->EndRecPtr = state->NextRecPtr;
+			*errormsg = NULL;
+			return NULL;
+		}
+	}
+
+	/* unreachable */
+	return NULL;
+}
+
+/*
+ * Try to decode the next available record.  The record will eventually also
+ * be returned by XLogReadRecord().
+ */
+DecodedXLogRecord *
+XLogReadAhead(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *record = NULL;
+
+	if (!state->errormsg_deferred)
+	{
+		record = XLogReadRecordInternal(state, false);
+		if (state->errormsg_deferred)
+		{
+			/*
+			 * Report the error once, but don't consume it, so that
+			 * XLogReadRecord() can report it too.
+			 */
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			else
+				*errormsg = NULL;
+			return NULL;
+		}
+	}
+	*errormsg = NULL;
+
+	return record;
+}
+
+/*
+ * Allocate space for a decoded record.  The only member of the returned
+ * object that is initialized is the 'oversized' flag, which indicates whether
+ * the decoded record didn't fit in the decode buffer and must eventually be
+ * freed explicitly.
+ *
+ * Return NULL if there is no space in the decode buffer and allow_oversized
+ * is false, or if memory allocation fails for an oversized buffer.
+ */
+static DecodedXLogRecord *
+XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)
+{
+	size_t		required_space = DecodeXLogRecordRequiredSpace(xl_tot_len);
+	DecodedXLogRecord *decoded = NULL;
+
+	/* Allocate a circular decode buffer if we don't have one already. */
+	if (unlikely(state->decode_buffer == NULL))
+	{
+		if (state->decode_buffer_size == 0)
+			state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+		state->decode_buffer = palloc(state->decode_buffer_size);
+		state->decode_buffer_head = state->decode_buffer;
+		state->decode_buffer_tail = state->decode_buffer;
+		state->free_decode_buffer = true;
+	}
+	if (state->decode_buffer_head >= state->decode_buffer_tail)
+	{
+		/* Empty, or head is to the right of tail. */
+		if (state->decode_buffer_head + required_space <=
+			state->decode_buffer + state->decode_buffer_size)
+		{
+			/* There is space between head and end. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+		else if (state->decode_buffer + required_space <
+				 state->decode_buffer_tail)
+		{
+			/* There is space between start and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+	else
+	{
+		/* Head is to the left of tail. */
+		if (state->decode_buffer_head + required_space <
+			state->decode_buffer_tail)
+		{
+			/* There is space between head and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+
+	/* Not enough space in the decode buffer.  Are we allowed to allocate? */
+	if (allow_oversized)
+	{
+		decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
+		if (decoded == NULL)
+			return NULL;
+		decoded->oversized = true;
+		return decoded;
+	}
+
+	return decoded;
+}
+
+/*
+ * Try to read and decode the next record and add it to the head of the
+ * decoded record queue.
+ *
+ * If "force" is true, then wait for data to become available, and read a
+ * record even if it doesn't fit in the decode buffer, using overflow storage.
+ *
+ * If "force" is false, then return immediately if we'd have to wait for more
+ * data to become available, or if there isn't enough space in the decode
+ * buffer.
+ *
+ * Return the decoded record, or NULL if there was an error or ... XXX
+ */
+static DecodedXLogRecord *
+XLogReadRecordInternal(XLogReaderState *state, bool force)
 {
 	XLogRecPtr	RecPtr;
 	XLogRecord *record;
@@ -277,6 +547,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	uint32		pageHeaderSize;
 	bool		gotheader;
 	int			readOff;
+	DecodedXLogRecord *decoded;
+	char	   *errormsg; /* not used */
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -286,19 +558,17 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	randAccess = false;
 
 	/* reset error state */
-	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	decoded = NULL;
 
-	ResetDecoder(state);
-
-	RecPtr = state->EndRecPtr;
+	RecPtr = state->NextRecPtr;
 
-	if (state->ReadRecPtr != InvalidXLogRecPtr)
+	if (state->DecodeRecPtr != InvalidXLogRecPtr)
 	{
 		/* read the record after the one we just read */
 
 		/*
-		 * EndRecPtr is pointing to end+1 of the previous WAL record.  If
+		 * NextRecPtr is pointing to end+1 of the previous WAL record.  If
 		 * we're at a page boundary, no more records can fit on the current
 		 * page. We must skip over the page header, but we can't do that until
 		 * we've read in the page, since the header size is variable.
@@ -309,7 +579,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		/*
 		 * Caller supplied a position to start at.
 		 *
-		 * In this case, EndRecPtr should already be pointing to a valid
+		 * In this case, NextRecPtr should already be pointing to a valid
 		 * record starting position.
 		 */
 		Assert(XRecOffIsValid(RecPtr));
@@ -327,7 +597,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	 * fits on the same page.
 	 */
 	readOff = ReadPageInternal(state, targetPagePtr,
-							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
+							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ),
+							   !force);
 	if (readOff < 0)
 		goto err;
 
@@ -374,6 +645,19 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
 	total_len = record->xl_tot_len;
 
+	/* Find space to decode this record. */
+	decoded = XLogReadRecordAlloc(state, total_len, force);
+	if (decoded == NULL)
+	{
+		/*
+		 * We couldn't get space.  Usually this means that the decode buffer
+		 * was full, while trying to read ahead (that is, !force).  It's also
+		 * remotely possible for palloc() to have failed to allocate memory
+		 * for an oversized record.
+		 */
+		goto err;
+	}
+
 	/*
 	 * If the whole record header is on this page, validate it immediately.
 	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
@@ -384,7 +668,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	 */
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
-		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
+		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
 								   randAccess))
 			goto err;
 		gotheader = true;
@@ -439,7 +723,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			/* Wait for the next page to become available */
 			readOff = ReadPageInternal(state, targetPagePtr,
 									   Min(total_len - gotlen + SizeOfXLogShortPHD,
-										   XLOG_BLCKSZ));
+										   XLOG_BLCKSZ),
+									   !force);
 
 			if (readOff < 0)
 				goto err;
@@ -476,7 +761,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
 			if (readOff < pageHeaderSize)
 				readOff = ReadPageInternal(state, targetPagePtr,
-										   pageHeaderSize);
+										   pageHeaderSize, !force);
 
 			Assert(pageHeaderSize <= readOff);
 
@@ -487,7 +772,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
 			if (readOff < pageHeaderSize + len)
 				readOff = ReadPageInternal(state, targetPagePtr,
-										   pageHeaderSize + len);
+										   pageHeaderSize + len,
+										   !force);
 
 			memcpy(buffer, (char *) contdata, len);
 			buffer += len;
@@ -497,7 +783,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			if (!gotheader)
 			{
 				record = (XLogRecord *) state->readRecordBuf;
-				if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
+				if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr,
 										   record, randAccess))
 					goto err;
 				gotheader = true;
@@ -511,15 +797,16 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			goto err;
 
 		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
-		state->ReadRecPtr = RecPtr;
-		state->EndRecPtr = targetPagePtr + pageHeaderSize
+		state->DecodeRecPtr = RecPtr;
+		state->NextRecPtr = targetPagePtr + pageHeaderSize
 			+ MAXALIGN(pageHeader->xlp_rem_len);
 	}
 	else
 	{
 		/* Wait for the record data to become available */
 		readOff = ReadPageInternal(state, targetPagePtr,
-								   Min(targetRecOff + total_len, XLOG_BLCKSZ));
+								   Min(targetRecOff + total_len, XLOG_BLCKSZ),
+								   !force);
 		if (readOff < 0)
 			goto err;
 
@@ -527,9 +814,9 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		if (!ValidXLogRecord(state, record, RecPtr))
 			goto err;
 
-		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+		state->NextRecPtr = RecPtr + MAXALIGN(total_len);
 
-		state->ReadRecPtr = RecPtr;
+		state->DecodeRecPtr = RecPtr;
 	}
 
 	/*
@@ -539,25 +826,55 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->segcxt.ws_segsize - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->segcxt.ws_segsize);
+		state->NextRecPtr += state->segcxt.ws_segsize - 1;
+		state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
 	}
 
-	if (DecodeXLogRecord(state, record, errormsg))
-		return record;
-	else
-		return NULL;
+	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
+	{
+		/* Record the location of the next record. */
+		decoded->next_lsn = state->NextRecPtr;
+
+		/*
+		 * If it's in the decode buffer, mark the decode buffer space as
+		 * occupied.
+		 */
+		if (!decoded->oversized)
+		{
+			/* The new decode buffer head must be MAXALIGNed. */
+			Assert(decoded->size == MAXALIGN(decoded->size));
+			if ((char *) decoded == state->decode_buffer)
+				state->decode_buffer_head = state->decode_buffer + decoded->size;
+			else
+				state->decode_buffer_head += decoded->size;
+		}
+
+		/* Insert it into the queue of decoded records. */
+		Assert(state->decode_queue_head != decoded);
+		if (state->decode_queue_head)
+			state->decode_queue_head->next = decoded;
+		state->decode_queue_head = decoded;
+		if (!state->decode_queue_tail)
+			state->decode_queue_tail = decoded;
+		return decoded;
+	}
 
 err:
+	if (decoded && decoded->oversized)
+		pfree(decoded);
 
 	/*
-	 * Invalidate the read state. We might read from a different source after
-	 * failure.
+	 * Invalidate the read state, if this was an error. We might read from a
+	 * different source after failure.
 	 */
-	XLogReaderInvalReadState(state);
+	if (readOff < 0 || state->errormsg_buf[0] != '\0')
+		XLogReaderInvalReadState(state);
 
-	if (state->errormsg_buf[0] != '\0')
-		*errormsg = state->errormsg_buf;
+	/*
+	 * If an error was written to errormsg_buf, it'll be returned to the caller
+	 * of XLogReadRecord() after all successfully decoded records have been
+	 * returned from the read queue.
+	 */
 
 	return NULL;
 }
@@ -573,7 +890,8 @@ err:
  * data and if there hasn't been any error since caching the data.
  */
 static int
-ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
+ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen,
+				 bool nowait)
 {
 	int			readLen;
 	uint32		targetPageOff;
@@ -608,7 +926,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 		readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
 										   state->currRecPtr,
-										   state->readBuf);
+										   state->readBuf,
+										   nowait);
 		if (readLen < 0)
 			goto err;
 
@@ -626,7 +945,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 */
 	readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 									   state->currRecPtr,
-									   state->readBuf);
+									   state->readBuf,
+									   nowait);
 	if (readLen < 0)
 		goto err;
 
@@ -645,7 +965,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	{
 		readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
 										   state->currRecPtr,
-										   state->readBuf);
+										   state->readBuf,
+										   nowait);
 		if (readLen < 0)
 			goto err;
 	}
@@ -664,7 +985,11 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
-	XLogReaderInvalReadState(state);
+	if (state->errormsg_buf[0] != '\0')
+	{
+		state->errormsg_deferred = true;
+		XLogReaderInvalReadState(state);
+	}
 	return -1;
 }
 
@@ -974,7 +1299,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		targetPagePtr = tmpRecPtr - targetRecOff;
 
 		/* Read the page containing the record */
-		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff);
+		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff, false);
 		if (readLen < 0)
 			goto err;
 
@@ -983,7 +1308,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		pageHeaderSize = XLogPageHeaderSize(header);
 
 		/* make sure we have enough data for the page header */
-		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize);
+		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize, false);
 		if (readLen < 0)
 			goto err;
 
@@ -1147,34 +1472,83 @@ WALRead(XLogReaderState *state,
  * ----------------------------------------
  */
 
-/* private function to reset the state between records */
+/*
+ * Private function to reset the state, forgetting all decoded records, if we
+ * are asked to move to a new read position.
+ */
 static void
 ResetDecoder(XLogReaderState *state)
 {
-	int			block_id;
+	DecodedXLogRecord *r;
 
-	state->decoded_record = NULL;
-
-	state->main_data_len = 0;
-
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	/* Reset the decoded record queue, freeing any oversized records. */
+	while ((r = state->decode_queue_tail))
 	{
-		state->blocks[block_id].in_use = false;
-		state->blocks[block_id].has_image = false;
-		state->blocks[block_id].has_data = false;
-		state->blocks[block_id].apply_image = false;
+		state->decode_queue_tail = r->next;
+		if (r->oversized)
+			pfree(r);
 	}
-	state->max_block_id = -1;
+	state->decode_queue_head = NULL;
+	state->decode_queue_tail = NULL;
+	state->record = NULL;
+
+	/* Reset the decode buffer to empty. */
+	state->decode_buffer_head = state->decode_buffer;
+	state->decode_buffer_tail = state->decode_buffer;
+
+	/* Clear error state. */
+	state->errormsg_buf[0] = '\0';
+	state->errormsg_deferred = false;
 }
 
 /*
- * Decode the previously read record.
+ * Compute the maximum possible amount of space that could be required to
+ * decode a record, given xl_tot_len from the record's header.  This is the
+ * amount of output buffer space that we need to decode a record, though we
+ * might not end up using it all.
+ *
+ * This computation is pessimistic and assumes the maximum possible number of
+ * blocks, due to lack of better information.
+ */
+size_t
+DecodeXLogRecordRequiredSpace(size_t xl_tot_len)
+{
+	size_t size = 0;
+
+	/* Account for the fixed size part of the decoded record struct. */
+	size += offsetof(DecodedXLogRecord, blocks[0]);
+	/* Account for the flexible blocks array of maximum possible size. */
+	size += sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1);
+	/* Account for all the raw main and block data. */
+	size += xl_tot_len;
+	/* We might insert padding before main_data. */
+	size += (MAXIMUM_ALIGNOF - 1);
+	/* We might insert padding before each block's data. */
+	size += (MAXIMUM_ALIGNOF - 1) * (XLR_MAX_BLOCK_ID + 1);
+	/* We might insert padding at the end. */
+	size += (MAXIMUM_ALIGNOF - 1);
+
+	return size;
+}
+
+/*
+ * Decode a record.  "decoded" must point to a MAXALIGNed memory area that has
+ * space for at least DecodeXLogRecordRequiredSpace(record->xl_tot_len) bytes.
+ * On success, decoded->size contains the actual space occupied by the decoded
+ * record, which may turn out to be less.
+ *
+ * Only the decoded->oversized member needs to be initialized by the caller; it
+ * will not be modified.  Other members will be initialized as required.
  *
  * On error, a human-readable error message is returned in *errormsg, and
  * the return value is false.
  */
 bool
-DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
+DecodeXLogRecord(XLogReaderState *state,
+				 DecodedXLogRecord *decoded,
+				 XLogRecord *record,
+				 XLogRecPtr lsn,
+				 char **errormsg)
 {
 	/*
 	 * read next _size bytes from record buffer, but check for overrun first.
@@ -1189,17 +1563,20 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	} while(0)
 
 	char	   *ptr;
+	char	   *out;
 	uint32		remaining;
 	uint32		datatotal;
 	RelFileNode *rnode = NULL;
 	uint8		block_id;
 
-	ResetDecoder(state);
-
-	state->decoded_record = record;
-	state->record_origin = InvalidRepOriginId;
-	state->toplevel_xid = InvalidTransactionId;
-
+	decoded->header = *record;
+	decoded->lsn = lsn;
+	decoded->next = NULL;
+	decoded->record_origin = InvalidRepOriginId;
+	decoded->toplevel_xid = InvalidTransactionId;
+	decoded->main_data = NULL;
+	decoded->main_data_len = 0;
+	decoded->max_block_id = -1;
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
 	remaining = record->xl_tot_len - SizeOfXLogRecord;
@@ -1217,7 +1594,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint8));
 
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
@@ -1228,18 +1605,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			uint32		main_data_len;
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint32));
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
 		}
 		else if (block_id == XLR_BLOCK_ID_ORIGIN)
 		{
-			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
+			COPY_HEADER_FIELD(&decoded->record_origin, sizeof(RepOriginId));
 		}
 		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
 		{
-			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+			COPY_HEADER_FIELD(&decoded->toplevel_xid, sizeof(TransactionId));
 		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
@@ -1247,7 +1624,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			DecodedBkpBlock *blk;
 			uint8		fork_flags;
 
-			if (block_id <= state->max_block_id)
+			/* mark any intervening block IDs as not in use */
+			for (int i = decoded->max_block_id + 1; i < block_id; ++i)
+				decoded->blocks[i].in_use = false;
+
+			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
 									  "out-of-order block_id %u at %X/%X",
@@ -1256,9 +1637,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 									  (uint32) state->ReadRecPtr);
 				goto err;
 			}
-			state->max_block_id = block_id;
+			decoded->max_block_id = block_id;
 
-			blk = &state->blocks[block_id];
+			blk = &decoded->blocks[block_id];
 			blk->in_use = true;
 			blk->apply_image = false;
 
@@ -1404,17 +1785,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	/*
 	 * Ok, we've parsed the fragment headers, and verified that the total
 	 * length of the payload in the fragments is equal to the amount of data
-	 * left. Copy the data of each fragment to a separate buffer.
-	 *
-	 * We could just set up pointers into readRecordBuf, but we want to align
-	 * the data for the convenience of the callers. Backup images are not
-	 * copied, however; they don't need alignment.
+	 * left.  Copy the data of each fragment to contiguous space after the
+	 * blocks array, inserting alignment padding before the data fragments so
+	 * they can be cast to struct pointers by REDO routines.
 	 */
+	out = ((char *) decoded) +
+		offsetof(DecodedXLogRecord, blocks) +
+		sizeof(decoded->blocks[0]) * (decoded->max_block_id + 1);
 
 	/* block data first */
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	for (block_id = 0; block_id <= decoded->max_block_id; block_id++)
 	{
-		DecodedBkpBlock *blk = &state->blocks[block_id];
+		DecodedBkpBlock *blk = &decoded->blocks[block_id];
 
 		if (!blk->in_use)
 			continue;
@@ -1423,58 +1805,37 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (blk->has_image)
 		{
-			blk->bkp_image = ptr;
+			/* no need to align image */
+			blk->bkp_image = out;
+			memcpy(out, ptr, blk->bimg_len);
 			ptr += blk->bimg_len;
+			out += blk->bimg_len;
 		}
 		if (blk->has_data)
 		{
-			if (!blk->data || blk->data_len > blk->data_bufsz)
-			{
-				if (blk->data)
-					pfree(blk->data);
-
-				/*
-				 * Force the initial request to be BLCKSZ so that we don't
-				 * waste time with lots of trips through this stanza as a
-				 * result of WAL compression.
-				 */
-				blk->data_bufsz = MAXALIGN(Max(blk->data_len, BLCKSZ));
-				blk->data = palloc(blk->data_bufsz);
-			}
+			out = (char *) MAXALIGN(out);
+			blk->data = out;
 			memcpy(blk->data, ptr, blk->data_len);
 			ptr += blk->data_len;
+			out += blk->data_len;
 		}
 	}
 
 	/* and finally, the main data */
-	if (state->main_data_len > 0)
+	if (decoded->main_data_len > 0)
 	{
-		if (!state->main_data || state->main_data_len > state->main_data_bufsz)
-		{
-			if (state->main_data)
-				pfree(state->main_data);
-
-			/*
-			 * main_data_bufsz must be MAXALIGN'ed.  In many xlog record
-			 * types, we omit trailing struct padding on-disk to save a few
-			 * bytes; but compilers may generate accesses to the xlog struct
-			 * that assume that padding bytes are present.  If the palloc
-			 * request is not large enough to include such padding bytes then
-			 * we'll get valgrind complaints due to otherwise-harmless fetches
-			 * of the padding bytes.
-			 *
-			 * In addition, force the initial request to be reasonably large
-			 * so that we don't waste time with lots of trips through this
-			 * stanza.  BLCKSZ / 2 seems like a good compromise choice.
-			 */
-			state->main_data_bufsz = MAXALIGN(Max(state->main_data_len,
-												  BLCKSZ / 2));
-			state->main_data = palloc(state->main_data_bufsz);
-		}
-		memcpy(state->main_data, ptr, state->main_data_len);
-		ptr += state->main_data_len;
+		out = (char *) MAXALIGN(out);
+		decoded->main_data = out;
+		memcpy(decoded->main_data, ptr, decoded->main_data_len);
+		ptr += decoded->main_data_len;
+		out += decoded->main_data_len;
 	}
 
+	/* Report the actual size we used. */
+	decoded->size = MAXALIGN(out - (char *) decoded);
+	Assert(DecodeXLogRecordRequiredSpace(record->xl_tot_len) >=
+		   decoded->size);
+
 	return true;
 
 shortdata_err:
@@ -1500,10 +1861,11 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	if (rnode)
 		*rnode = bkpb->rnode;
 	if (forknum)
@@ -1523,10 +1885,11 @@ XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return NULL;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 
 	if (!bkpb->has_data)
 	{
@@ -1554,12 +1917,13 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	char	   *ptr;
 	PGAlignedBlock tmp;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
-	if (!record->blocks[block_id].has_image)
+	if (!record->record->blocks[block_id].has_image)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	ptr = bkpb->bkp_image;
 
 	if (bkpb->bimg_info & BKPIMAGE_IS_COMPRESSED)
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 7e915bcadf..db0c801456 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -351,7 +351,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * going to initialize it. And vice versa.
 	 */
 	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
-	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
+	willinit = (record->record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
 	if (!willinit && zeromode)
@@ -829,7 +829,8 @@ wal_segment_close(XLogReaderState *state)
  */
 int
 read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
+					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
+					 bool nowait)
 {
 	XLogRecPtr	read_upto,
 				loc;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e76e627c6b..083174f692 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4009,6 +4009,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_RECOVERY_PAUSE:
 			event_name = "RecoveryPause";
 			break;
+		case WAIT_EVENT_RECOVERY_WAL_FLUSH:
+			event_name = "RecoveryWalFlush";
+			break;
 		case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
 			event_name = "ReplicationOriginDrop";
 			break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee99b8..4bc22deddb 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -111,7 +111,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 	{
 		ReorderBufferAssignChild(ctx->reorder,
 								 txid,
-								 record->decoded_record->xl_xid,
+								 XLogRecGetXid(record),
 								 buf.origptr);
 	}
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5d1b1a16be..86e10c7316 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -812,7 +812,7 @@ StartReplication(StartReplicationCmd *cmd)
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-					   XLogRecPtr targetRecPtr, char *cur_page)
+					   XLogRecPtr targetRecPtr, char *cur_page, bool nowait)
 {
 	XLogRecPtr	flushptr;
 	int			count;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index eae1797f94..39797488d3 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -49,7 +49,8 @@ typedef struct XLogPageReadPrivate
 
 static int	SimpleXLogPageRead(XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
-							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
+							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+							   bool nowait);
 
 /*
  * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline
@@ -239,7 +240,8 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 /* XLogReader callback function, to read a WAL page */
 static int
 SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
+				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+				   bool nowait)
 {
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 	uint32		targetPageOff;
@@ -423,7 +425,7 @@ extractPageInfo(XLogReaderState *record)
 				 RmgrNames[rmid], info);
 	}
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 31e99c2a6d..7259559036 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -333,7 +333,7 @@ WALDumpCloseSegment(XLogReaderState *state)
 /* pg_waldump's XLogReaderRoutine->page_read callback */
 static int
 WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-				XLogRecPtr targetPtr, char *readBuff)
+				XLogRecPtr targetPtr, char *readBuff, bool nowait)
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
@@ -392,10 +392,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
 	 * add an accessor macro for this.
 	 */
 	*fpi_len = 0;
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (XLogRecHasBlockImage(record, block_id))
-			*fpi_len += record->blocks[block_id].bimg_len;
+			*fpi_len += record->record->blocks[block_id].bimg_len;
 	}
 
 	/*
@@ -484,7 +484,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	if (!config->bkp_details)
 	{
 		/* print block references (short format) */
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -515,7 +515,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	{
 		/* print block references (detailed format) */
 		putchar('\n');
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -528,26 +528,26 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				   blk);
 			if (XLogRecHasBlockImage(record, block_id))
 			{
-				if (record->blocks[block_id].bimg_info &
+				if (record->record->blocks[block_id].bimg_info &
 					BKPIMAGE_IS_COMPRESSED)
 				{
 					printf(" (FPW%s); hole: offset: %u, length: %u, "
 						   "compression saved: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length,
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length,
 						   BLCKSZ -
-						   record->blocks[block_id].hole_length -
-						   record->blocks[block_id].bimg_len);
+						   record->record->blocks[block_id].hole_length -
+						   record->record->blocks[block_id].bimg_len);
 				}
 				else
 				{
 					printf(" (FPW%s); hole: offset: %u, length: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length);
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length);
 				}
 			}
 			putchar('\n');
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 0b6d00dd7d..44f8847030 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -62,7 +62,8 @@ typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
 							   int reqLen,
 							   XLogRecPtr targetRecPtr,
-							   char *readBuf);
+							   char *readBuf,
+							   bool nowait);
 typedef void (*WALSegmentOpenCB) (XLogReaderState *xlogreader,
 								  XLogSegNo nextSegNo,
 								  TimeLineID *tli_p);
@@ -144,6 +145,30 @@ typedef struct
 	uint16		data_bufsz;
 } DecodedBkpBlock;
 
+/*
+ * The decoded contents of a record.  This occupies a contiguous region of
+ * memory, with main_data and blocks[n].data pointing to memory after the
+ * members declared here.
+ */
+typedef struct DecodedXLogRecord
+{
+	/* Private member used for resource management. */
+	size_t		size;			/* total size of decoded record */
+	bool		oversized;		/* outside the regular decode buffer? */
+	struct DecodedXLogRecord *next;	/* decoded record queue link */
+
+	/* Public members. */
+	XLogRecPtr	lsn;			/* location */
+	XLogRecPtr	next_lsn;		/* location of next record */
+	XLogRecord	header;			/* header */
+	RepOriginId record_origin;
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+	char	   *main_data;		/* record's main data portion */
+	uint32		main_data_len;	/* main data portion's length */
+	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	DecodedBkpBlock blocks[FLEXIBLE_ARRAY_MEMBER];
+} DecodedXLogRecord;
+
 struct XLogReaderState
 {
 	/*
@@ -168,35 +193,25 @@ struct XLogReaderState
 	void	   *private_data;
 
 	/*
-	 * Start and end point of last record read.  EndRecPtr is also used as the
-	 * position to read next.  Calling XLogBeginRead() sets EndRecPtr to the
-	 * starting position and ReadRecPtr to invalid.
+	 * Start and end point of last record returned by XLogReadRecord().
+	 *
+	 * XXX These are also available as record->lsn and record->next_lsn,
+	 * but since these were part of the public interface...
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
 
-
-	/* ----------------------------------------
-	 * Decoded representation of current record
-	 *
-	 * Use XLogRecGet* functions to investigate the record; these fields
-	 * should not be accessed directly.
-	 * ----------------------------------------
+	/*
+	 * Start and end point of the last record read and decoded by
+	 * XLogReadRecordInternal().  NextRecPtr is also used as the position to
+	 * decode next.  Calling XLogBeginRead() sets NextRecPtr and EndRecPtr to
+	 * the requested starting position.
 	 */
-	XLogRecord *decoded_record; /* currently decoded record */
-
-	char	   *main_data;		/* record's main data portion */
-	uint32		main_data_len;	/* main data portion's length */
-	uint32		main_data_bufsz;	/* allocated size of the buffer */
-
-	RepOriginId record_origin;
+	XLogRecPtr	DecodeRecPtr;	/* start of last record decoded */
+	XLogRecPtr	NextRecPtr;		/* end+1 of last record decoded */
 
-	TransactionId toplevel_xid; /* XID of top-level transaction */
-
-	/* information about blocks referenced by the record. */
-	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
-
-	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	/* Last record returned by XLogReadRecord. */
+	DecodedXLogRecord *record;
 
 	/* ----------------------------------------
 	 * private/internal state
@@ -210,6 +225,26 @@ struct XLogReaderState
 	char	   *readBuf;
 	uint32		readLen;
 
+	/*
+	 * Buffer for decoded records.  This is a circular buffer, though
+	 * individual records can't be split in the middle, so some space is often
+	 * wasted at the end.  Oversized records that don't fit in this space are
+	 * allocated separately.
+	 */
+	char	   *decode_buffer;
+	size_t		decode_buffer_size;
+	bool		free_decode_buffer;		/* need to free? */
+	char	   *decode_buffer_head;		/* write head */
+	char	   *decode_buffer_tail;		/* read head */
+
+	/*
+	 * Queue of records that have been decoded.  This is a linked list that
+	 * usually consists of consecutive records in decode_buffer, but may also
+	 * contain oversized records allocated with palloc().
+	 */
+	DecodedXLogRecord *decode_queue_head;	/* newest decoded record */
+	DecodedXLogRecord *decode_queue_tail;	/* oldest decoded record */
+
 	/* last read XLOG position for data currently in readBuf */
 	WALSegmentContext segcxt;
 	WALOpenSegment seg;
@@ -252,6 +287,7 @@ struct XLogReaderState
 
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
+	bool		errormsg_deferred;
 };
 
 /* Get a new XLogReader */
@@ -264,6 +300,11 @@ extern XLogReaderRoutine *LocalXLogReaderRoutine(void);
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
+/* Optionally provide a circular decoding buffer to allow readahead. */
+extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
+									  void *buffer,
+									  size_t size);
+
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 #ifdef FRONTEND
@@ -274,6 +315,10 @@ extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
 										 char **errormsg);
 
+/* Try to read ahead, if there is space in the decoding buffer. */
+extern DecodedXLogRecord *XLogReadAhead(XLogReaderState *state,
+										char **errormsg);
+
 /* Validate a page */
 extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 										 XLogRecPtr recptr, char *phdr);
@@ -297,25 +342,31 @@ extern bool WALRead(XLogReaderState *state,
 
 /* Functions for decoding an XLogRecord */
 
-extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
+extern size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+extern bool DecodeXLogRecord(XLogReaderState *state,
+							 DecodedXLogRecord *decoded,
+							 XLogRecord *record,
+							 XLogRecPtr lsn,
 							 char **errmsg);
 
-#define XLogRecGetTotalLen(decoder) ((decoder)->decoded_record->xl_tot_len)
-#define XLogRecGetPrev(decoder) ((decoder)->decoded_record->xl_prev)
-#define XLogRecGetInfo(decoder) ((decoder)->decoded_record->xl_info)
-#define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
-#define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
-#define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
-#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
-#define XLogRecGetData(decoder) ((decoder)->main_data)
-#define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
-#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
+#define XLogRecGetTotalLen(decoder) ((decoder)->record->header.xl_tot_len)
+#define XLogRecGetPrev(decoder) ((decoder)->record->header.xl_prev)
+#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
+#define XLogRecGetRmid(decoder) ((decoder)->record->header.xl_rmid)
+#define XLogRecGetXid(decoder) ((decoder)->record->header.xl_xid)
+#define XLogRecGetOrigin(decoder) ((decoder)->record->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->record->toplevel_xid)
+#define XLogRecGetData(decoder) ((decoder)->record->main_data)
+#define XLogRecGetDataLen(decoder) ((decoder)->record->main_data_len)
+#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->record->max_block_id >= 0)
+#define XLogRecMaxBlockId(decoder) ((decoder)->record->max_block_id)
+#define XLogRecGetBlock(decoder, i) (&(decoder)->record->blocks[(i)])
 #define XLogRecHasBlockRef(decoder, block_id) \
-	((decoder)->blocks[block_id].in_use)
+	((decoder)->record->blocks[block_id].in_use)
 #define XLogRecHasBlockImage(decoder, block_id) \
-	((decoder)->blocks[block_id].has_image)
+	((decoder)->record->blocks[block_id].has_image)
 #define XLogRecBlockImageApply(decoder, block_id) \
-	((decoder)->blocks[block_id].apply_image)
+	((decoder)->record->blocks[block_id].apply_image)
 
 #ifndef FRONTEND
 extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index e59b6cf3a9..374c1b16ce 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -49,7 +49,8 @@ extern void FreeFakeRelcacheEntry(Relation fakerel);
 
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
-								 XLogRecPtr targetRecPtr, char *cur_page);
+								 XLogRecPtr targetRecPtr, char *cur_page,
+								 bool nowait);
 extern void wal_segment_open(XLogReaderState *state,
 							 XLogSegNo nextSegNo,
 							 TimeLineID *tli_p);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 257e515bfe..c5f763dd44 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -954,6 +954,7 @@ typedef enum
 	WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
 	WAIT_EVENT_RECOVERY_CONFLICT_TABLESPACE,
 	WAIT_EVENT_RECOVERY_PAUSE,
+	WAIT_EVENT_RECOVERY_WAL_FLUSH,
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-- 
2.20.1

Attachment: v13-0001-Add-pg_atomic_unlocked_add_fetch_XXX.patch
From 3af0008331820f1a44ce8f0d229be845365004b6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 28 Mar 2020 11:42:59 +1300
Subject: [PATCH v13 1/6] Add pg_atomic_unlocked_add_fetch_XXX().

Add a variant of pg_atomic_add_fetch_XXX with no barrier semantics, for
cases where you only want to avoid the possibility that a concurrent
pg_atomic_read_XXX() sees a torn/partial value.  On modern
architectures, this is simply value++, but there is a fallback to
spinlock emulation.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/include/port/atomics.h         | 24 ++++++++++++++++++++++
 src/include/port/atomics/generic.h | 33 ++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 4956ec55cb..2abb852893 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -389,6 +389,21 @@ pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
 	return pg_atomic_add_fetch_u32_impl(ptr, add_);
 }
 
+/*
+ * pg_atomic_unlocked_add_fetch_u32 - atomically add to variable
+ *
+ * Like pg_atomic_unlocked_write_u32, guarantees only that partial values
+ * cannot be observed.
+ *
+ * No barrier semantics.
+ */
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	AssertPointerAlignment(ptr, 4);
+	return pg_atomic_unlocked_add_fetch_u32_impl(ptr, add_);
+}
+
 /*
  * pg_atomic_sub_fetch_u32 - atomically subtract from variable
  *
@@ -519,6 +534,15 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
 }
 
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+	return pg_atomic_unlocked_add_fetch_u64_impl(ptr, add_);
+}
+
 #undef INSIDE_ATOMICS_H
 
 #endif							/* ATOMICS_H */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index d60a0d9e7f..3e1598d8ff 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -234,6 +234,16 @@ pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 }
 #endif
 
+#if !defined(PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32)
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	ptr->value += add_;
+	return ptr->value;
+}
+#endif
+
 #if !defined(PG_HAVE_ATOMIC_SUB_FETCH_U32) && defined(PG_HAVE_ATOMIC_FETCH_SUB_U32)
 #define PG_HAVE_ATOMIC_SUB_FETCH_U32
 static inline uint32
@@ -399,3 +409,26 @@ pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
 }
 #endif
+
+#if defined(PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY) && \
+	!defined(PG_HAVE_ATOMIC_U64_SIMULATION)
+
+#ifndef PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	ptr->value += val;
+	return ptr->value;
+}
+#endif
+
+#else
+
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	return pg_atomic_add_fetch_u64_impl(ptr, val);
+}
+
+#endif
-- 
2.20.1

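For reviewers skimming 0001 in isolation, here is a minimal sketch (not part of the patches; the DemoCounters struct and demo_* functions are invented for illustration) of the usage pattern the new primitive is aimed at, as 0004 applies it to the recovery-prefetch counters: a single writer (the startup process) bumps the counters without barriers or locked read-modify-write, and any other backend reads them with pg_atomic_read_u64(), which only promises that no torn value is observed.

/* Illustration only -- DemoCounters and the demo_* functions are invented. */
#include "postgres.h"

#include "port/atomics.h"

typedef struct DemoCounters
{
	pg_atomic_uint64 prefetch;	/* prefetches initiated */
	pg_atomic_uint64 skip_hit;	/* blocks already in the buffer pool */
} DemoCounters;

/* One-time initialization, e.g. during shared memory setup. */
static void
demo_init(DemoCounters *counters)
{
	pg_atomic_init_u64(&counters->prefetch, 0);
	pg_atomic_init_u64(&counters->skip_hit, 0);
}

/* Writer side: only the startup process updates the counters. */
static void
demo_count_prefetch(DemoCounters *counters)
{
	/* Plain increment on most hardware; no barrier, no locked RMW. */
	pg_atomic_unlocked_add_fetch_u64(&counters->prefetch, 1);
}

/* Reader side: any backend; value may be slightly stale but never torn. */
static uint64
demo_read_prefetch(DemoCounters *counters)
{
	return pg_atomic_read_u64(&counters->prefetch);
}
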
Attachment: v13-0004-Prefetch-referenced-blocks-during-recovery.patch
From e94f7a8470d5fd04b0a16456fd8b03524cd04e12 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Apr 2020 23:04:51 +1200
Subject: [PATCH v13 4/6] Prefetch referenced blocks during recovery.

Introduce a new GUC recovery_prefetch.  If it is enabled (the default),
then read ahead in the WAL and try to initiate asynchronous reading of
referenced blocks that will soon be needed but are not yet cached in our
buffer pool.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size is used to limit the maximum distance we
are prepared to read ahead in the WAL to find uncached blocks.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  58 ++
 doc/src/sgml/monitoring.sgml                  |  86 +-
 doc/src/sgml/wal.sgml                         |  17 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             |  22 +-
 src/backend/access/transam/xlogprefetch.c     | 895 ++++++++++++++++++
 src/backend/access/transam/xlogreader.c       |   2 +
 src/backend/catalog/system_views.sql          |  14 +
 src/backend/postmaster/pgstat.c               | 103 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  56 +-
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/access/xlog.h                     |   1 +
 src/include/access/xlogprefetch.h             |  79 ++
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/pgstat.h                          |  26 +
 src/include/utils/guc.h                       |   4 +
 src/test/regress/expected/rules.out           |  11 +
 18 files changed, 1387 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetch.c
 create mode 100644 src/include/access/xlogprefetch.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f043433e31..ef847b38cf 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3341,6 +3341,64 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-recovery-prefetch" xreflabel="recovery_prefetch">
+      <term><varname>recovery_prefetch</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to try to prefetch blocks that are referenced in the WAL but
+        are not yet in the buffer pool, during recovery.  Prefetching blocks
+        that will soon be needed can reduce I/O wait times.
+        See also the <xref linkend="guc-wal-decode-buffer-size"/> and
+        <xref linkend="guc-maintenance-io-concurrency"/> settings, which limit
+        prefetching activity.
+        This setting is enabled
+        by default on systems that support <function>posix_fadvise</function>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+      <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks that were logged with full page images,
+        during recovery.  Often this doesn't help, since such blocks will not
+        be read the first time they are needed and might remain in the buffer
+        pool after that.  However, on file systems with a block size larger
+        than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a
+        costly read-before-write when those blocks are later written.
+        The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-decode-buffer-size" xreflabel="wal_decode_buffer_size">
+      <term><varname>wal_decode_buffer_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_decode_buffer_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        A limit on how far ahead the server can look in the WAL, to find
+        blocks to prefetch.  Setting it too high might be counterproductive,
+        if it means that data falls out of the
+        kernel cache before it is needed.  If this value is specified without
+        units, it is taken as bytes.
+        The default is 512kB.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 98e1995453..c10e30ec91 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -332,6 +332,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2878,6 +2885,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    copy of the subscribed tables.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+    </row>
+    <row>
+     <entry><structfield>skip_seq</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+    </row>
+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_distance</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>How far ahead of recovery the prefetcher is on average, while recovery is not idle</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_queue_depth</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>Average number of prefetches in flight while recovery is not idle</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-recovery-prefetch"/>,
+   <xref linkend="guc-recovery-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="1">
@@ -4859,8 +4938,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
         view, <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
-        to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+        the <structname>pg_stat_archiver</structname> view,
+        <literal>wal</literal> to reset all the counters shown in the
+        <structname>pg_stat_wal</structname> view or
+        <literal>prefetch_recovery</literal> to reset all the counters shown
+        in the <structname>pg_stat_prefetch_recovery</structname> view.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index d1c3893b14..c51c431398 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -720,6 +720,23 @@
    <acronym>WAL</acronym> call being logged to the server log. This
    option might be replaced by a more general mechanism in the future.
   </para>
+
+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.  The
+   prefetching mechanism is most likely to be effective on systems
+   with <varname>full_page_writes</varname> set to
+   <literal>off</literal> (where that is safe), and where the working
+   set is larger than RAM.  By default, prefetching in recovery is enabled
+   on operating systems that have <function>posix_fadvise</function>
+   support.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..39f9d4e77d 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetch.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2bdbebbb91..5691850a74 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetch.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -109,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+int			wal_decode_buffer_size = 512 * 1024;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -3686,7 +3688,6 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
 					 xlogfname);
 			set_ps_display(activitymsg);
-
 			restoredFromArchive = RestoreArchivedFile(path, xlogfname,
 													  "RECOVERYXLOG",
 													  wal_segment_size,
@@ -6528,6 +6529,12 @@ StartupXLOG(void)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Set the WAL decode buffer size.  This limits how far ahead we can read
+	 * in the WAL.
+	 */
+	XLogReaderSetDecodeBuffer(xlogreader, NULL, wal_decode_buffer_size);
+
 	/*
 	 * Allocate two page buffers dedicated to WAL consistency checks.  We do
 	 * it this way, rather than just making static arrays, for two reasons:
@@ -7205,6 +7212,7 @@ StartupXLOG(void)
 		{
 			ErrorContextCallback errcallback;
 			TimestampTz xtime;
+			XLogPrefetchState prefetch;
 			PGRUsage	ru0;
 
 			pg_rusage_init(&ru0);
@@ -7215,6 +7223,9 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
 
+			/* Prepare to prefetch, if configured. */
+			XLogPrefetchBegin(&prefetch, xlogreader);
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7244,6 +7255,9 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/* Perform WAL prefetching, if enabled. */
+				XLogPrefetch(&prefetch, xlogreader->ReadRecPtr);
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -7415,6 +7429,9 @@ StartupXLOG(void)
 					 */
 					if (AllowCascadeReplication())
 						WalSndWakeup();
+
+					/* Reset the prefetcher. */
+					XLogPrefetchReconfigure();
 				}
 
 				/* Exit loop if we reached inclusive recovery target */
@@ -7431,6 +7448,7 @@ StartupXLOG(void)
 			/*
 			 * end of main redo apply loop
 			 */
+			XLogPrefetchEnd(&prefetch);
 
 			if (reachedRecoveryTarget)
 			{
@@ -12209,6 +12227,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					currentSource = XLOG_FROM_STREAM;
 					startWalReceiver = true;
+					XLogPrefetchReconfigure();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -12470,6 +12489,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						else
 							havedata = false;
 					}
+
 					if (havedata)
 					{
 						/*
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
new file mode 100644
index 0000000000..a8149b946c
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -0,0 +1,895 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetch.c
+ *
+ * The goal of this module is to read future WAL records and issue
+ * PrefetchSharedBuffer() calls for referenced blocks, so that we avoid I/O
+ * stalls in the main recovery loop.
+ *
+ * When examining a WAL record from the future, we need to consider that a
+ * referenced block or segment file might not exist on disk until this record
+ * or some earlier record has been replayed.  After a crash, a file might also
+ * be missing because it was dropped by a later WAL record; in that case, it
+ * will be recreated when this record is replayed.  These cases are handled by
+ * recognizing them and adding a "filter" that prevents all prefetching of a
+ * certain block range until the present WAL record has been replayed.  Blocks
+ * skipped for these reasons are counted as "skip_new" (that is, cases where we
+ * didn't try to prefetch "new" blocks).
+ *
+ * Blocks found in the buffer pool already are counted as "skip_hit".
+ * Repeated access to the same buffer is detected and skipped, and this is
+ * counted with "skip_seq".  Blocks that were logged with FPWs are skipped if
+ * recovery_prefetch_fpw is off, since on most systems there will be no I/O
+ * stall; this is counted with "skip_fpw".
+ *
+ * The only way we currently have to know that an I/O initiated with
+ * PrefetchSharedBuffer() has completed is that recovery will eventually call
+ * ReadBuffer() and do a synchronous read.  Therefore, we track the number of
+ * potentially in-flight I/Os by using a circular buffer of LSNs.  When it's
+ * full, we have to wait for recovery to replay records so that the queue
+ * depth can be reduced, before we can do any more prefetching.  Ideally, this
+ * keeps us the right distance ahead to respect maintenance_io_concurrency.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetch.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log
+ * message that appears at the end of crash recovery.  It's also used to send
+ * messages periodically to the stats collector, to save the counters on disk.
+ */
+#define XLOGPREFETCHER_SAMPLE_DISTANCE 0x40000
+
+/* GUCs */
+bool		recovery_prefetch = true;
+bool		recovery_prefetch_fpw = false;
+
+int			XLogPrefetchReconfigureCount;
+
+/*
+ * A prefetcher object.  There is at most one of these in existence at a time,
+ * recreated whenever there is a configuration change.
+ */
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	DecodedXLogRecord *record;
+	int				next_block_id;
+	bool			shutdown;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation	last_reln;
+	RelFileNode		last_rnode;
+	BlockNumber		last_blkno;
+
+	/* Online averages. */
+	uint64			samples;
+	double			avg_queue_depth;
+	double			avg_distance;
+	XLogRecPtr		next_sample_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB		   *filter_table;
+	dlist_head		filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	int				prefetch_head;
+	int				prefetch_tail;
+	int				prefetch_queue_size;
+	XLogRecPtr		prefetch_queue[MAX_IO_CONCURRENCY + 1];
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode		rnode;
+	XLogRecPtr		filter_until_replayed;
+	BlockNumber		filter_from_block;
+	dlist_node		link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time; /* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 skip_hit;	/* Blocks already buffered. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+	pg_atomic_uint64 skip_seq;	/* Repeat blocks skipped. */
+	float			avg_distance;
+	float			avg_queue_depth;
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32			reset_handled;
+
+	/* Dynamic values */
+	int				distance;	/* Number of bytes ahead in the WAL. */
+	int				queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+static void XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher,
+									  XLogRecPtr replaying_lsn);
+static bool XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher);
+static void XLogPrefetchSaveStats(void);
+static void XLogPrefetchRestoreStats(void);
+
+static XLogPrefetchStats *Stats;
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&Stats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&Stats->prefetch, 0);
+	pg_atomic_write_u64(&Stats->skip_hit, 0);
+	pg_atomic_write_u64(&Stats->skip_new, 0);
+	pg_atomic_write_u64(&Stats->skip_fpw, 0);
+	pg_atomic_write_u64(&Stats->skip_seq, 0);
+	Stats->avg_distance = 0;
+	Stats->avg_queue_depth = 0;
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	Stats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+	if (!found)
+	{
+		pg_atomic_init_u32(&Stats->reset_request, 0);
+		Stats->reset_handled = 0;
+		pg_atomic_init_u64(&Stats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&Stats->prefetch, 0);
+		pg_atomic_init_u64(&Stats->skip_hit, 0);
+		pg_atomic_init_u64(&Stats->skip_new, 0);
+		pg_atomic_init_u64(&Stats->skip_fpw, 0);
+		pg_atomic_init_u64(&Stats->skip_seq, 0);
+		Stats->avg_distance = 0;
+		Stats->avg_queue_depth = 0;
+		Stats->distance = 0;
+		Stats->queue_depth = 0;
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&Stats->reset_request, 1);
+}
+
+/*
+ * Tell the stats collector to serialize the shared memory counters into the
+ * stats file.
+ */
+static void
+XLogPrefetchSaveStats(void)
+{
+	PgStat_RecoveryPrefetchStats serialized = {
+		.prefetch = pg_atomic_read_u64(&Stats->prefetch),
+		.skip_hit = pg_atomic_read_u64(&Stats->skip_hit),
+		.skip_new = pg_atomic_read_u64(&Stats->skip_new),
+		.skip_fpw = pg_atomic_read_u64(&Stats->skip_fpw),
+		.skip_seq = pg_atomic_read_u64(&Stats->skip_seq),
+		.stat_reset_timestamp = pg_atomic_read_u64(&Stats->reset_time)
+	};
+
+	pgstat_send_recoveryprefetch(&serialized);
+}
+
+/*
+ * Try to restore the shared memory counters from the stats file.
+ */
+static void
+XLogPrefetchRestoreStats(void)
+{
+	PgStat_RecoveryPrefetchStats *serialized = pgstat_fetch_recoveryprefetch();
+
+	if (serialized->stat_reset_timestamp != 0)
+	{
+		pg_atomic_write_u64(&Stats->prefetch, serialized->prefetch);
+		pg_atomic_write_u64(&Stats->skip_hit, serialized->skip_hit);
+		pg_atomic_write_u64(&Stats->skip_new, serialized->skip_new);
+		pg_atomic_write_u64(&Stats->skip_fpw, serialized->skip_fpw);
+		pg_atomic_write_u64(&Stats->skip_seq, serialized->skip_seq);
+		pg_atomic_write_u64(&Stats->reset_time, serialized->stat_reset_timestamp);
+	}
+}
+
+/*
+ * Initialize an XLogPrefetchState object and restore the last saved
+ * statistics from disk.
+ */
+void
+XLogPrefetchBegin(XLogPrefetchState *state, XLogReaderState *reader)
+{
+	XLogPrefetchRestoreStats();
+
+	/* We'll reconfigure on the first call to XLogPrefetch(). */
+	state->reader = reader;
+	state->prefetcher = NULL;
+	state->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+}
+
+/*
+ * Shut down the prefetching infrastructure, if configured.
+ */
+void
+XLogPrefetchEnd(XLogPrefetchState *state)
+{
+	XLogPrefetchSaveStats();
+
+	if (state->prefetcher)
+		XLogPrefetcherFree(state->prefetcher);
+	state->prefetcher = NULL;
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogReaderState *reader)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	/*
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.  The queue has space for up
+	 * the highest possible value of the GUC + 1, because our circular buffer
+	 * has a gap between head and tail when full.
+	 */
+	prefetcher = palloc0(sizeof(XLogPrefetcher));
+	prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
+	prefetcher->reader = reader;
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	/* Log final statistics. */
+	ereport(LOG,
+			(errmsg("recovery finished prefetching at %X/%X; "
+					"prefetch = " UINT64_FORMAT ", "
+					"skip_hit = " UINT64_FORMAT ", "
+					"skip_new = " UINT64_FORMAT ", "
+					"skip_fpw = " UINT64_FORMAT ", "
+					"skip_seq = " UINT64_FORMAT ", "
+					"avg_distance = %f, "
+					"avg_queue_depth = %f",
+			 (uint32) (prefetcher->reader->EndRecPtr >> 32),
+			 (uint32) (prefetcher->reader->EndRecPtr),
+			 pg_atomic_read_u64(&Stats->prefetch),
+			 pg_atomic_read_u64(&Stats->skip_hit),
+			 pg_atomic_read_u64(&Stats->skip_new),
+			 pg_atomic_read_u64(&Stats->skip_fpw),
+			 pg_atomic_read_u64(&Stats->skip_seq),
+			 Stats->avg_distance,
+			 Stats->avg_queue_depth)));
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Called when recovery is replaying a new LSN, to check if we can read ahead.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	uint32		reset_request;
+
+	/* If an error has occurred or we've hit the end of the WAL, do nothing. */
+	if (prefetcher->shutdown)
+		return;
+
+	/*
+	 * Have any in-flight prefetches definitely completed, judging by the LSN
+	 * that is currently being replayed?
+	 */
+	XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+	/*
+	 * Do we already have the maximum permitted number of I/Os running
+	 * (according to the information we have)?  If so, we have to wait for at
+	 * least one to complete, so give up early and let recovery catch up.
+	 */
+	if (XLogPrefetcherSaturated(prefetcher))
+		return;
+
+	/*
+	 * Can we drop any filters yet?  This happens when the LSN that is
+	 * currently being replayed has moved past a record that prevents
+	 * prefetching of a block range, such as relation extension.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&Stats->reset_request);
+	if (reset_request != Stats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		Stats->reset_handled = reset_request;
+		prefetcher->avg_distance = 0;
+		prefetcher->avg_queue_depth = 0;
+		prefetcher->samples = 0;
+	}
+
+	/* OK, we can now try reading ahead. */
+	XLogPrefetcherScanRecords(prefetcher, replaying_lsn);
+}
+
+/*
+ * Read ahead as far as we are allowed to, considering the LSN that recovery
+ * is currently replaying.
+ */
+static void
+XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	XLogReaderState *reader = prefetcher->reader;
+	DecodedXLogRecord *record;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	for (;;)
+	{
+		char *error;
+		int64		distance;
+
+		/* If we don't already have a record, then try to read one. */
+		if (prefetcher->record == NULL)
+		{
+			record = XLogReadAhead(reader, &error);
+			if (record == NULL)
+			{
+				/* If we got an error, log it and give up. */
+				if (error)
+				{
+					ereport(LOG, (errmsg("recovery no longer prefetching: %s", error)));
+					prefetcher->shutdown = true;
+					Stats->queue_depth = 0;
+					Stats->distance = 0;
+				}
+				/* Otherwise, we'll try again later when more data is here. */
+				return;
+			}
+			prefetcher->record = record;
+			prefetcher->next_block_id = 0;
+		}
+		else
+		{
+			/*
+			 * We ran out of I/O queue while part way through a record.  We'll
+			 * carry on where we left off, according to next_block_id.
+			 */
+			record = prefetcher->record;
+		}
+
+		/* How far ahead of replay are we now? */
+		distance = record->lsn - replaying_lsn;
+
+		/* Update distance shown in shm. */
+		Stats->distance = distance;
+
+		/* Periodically recompute some statistics. */
+		if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+		{
+			/* Compute online averages. */
+			prefetcher->samples++;
+			if (prefetcher->samples == 1)
+			{
+				prefetcher->avg_distance = Stats->distance;
+				prefetcher->avg_queue_depth = Stats->queue_depth;
+			}
+			else
+			{
+				prefetcher->avg_distance +=
+					(Stats->distance - prefetcher->avg_distance) /
+					prefetcher->samples;
+				prefetcher->avg_queue_depth +=
+					(Stats->queue_depth - prefetcher->avg_queue_depth) /
+					prefetcher->samples;
+			}
+
+			/* Expose it in shared memory. */
+			Stats->avg_distance = prefetcher->avg_distance;
+			Stats->avg_queue_depth = prefetcher->avg_queue_depth;
+
+			/* Also periodically save the simple counters. */
+			XLogPrefetchSaveStats();
+
+			prefetcher->next_sample_lsn =
+				replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
+		}
+
+		/* Are we not far enough ahead? */
+		if (distance <= 0)
+		{
+			/* XXX Is this still possible? */
+			prefetcher->record = NULL;		/* skip this record */
+			continue;
+		}
+
+		/*
+		 * If this is a record that creates a new SMGR relation, we'll avoid
+		 * prefetching anything from that rnode until it has been replayed.
+		 */
+		if (replaying_lsn < record->lsn &&
+			record->header.xl_rmid == RM_SMGR_ID &&
+			(record->header.xl_info & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+		{
+			xl_smgr_create *xlrec = (xl_smgr_create *) record->main_data;
+
+			XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0, record->lsn);
+		}
+
+		/* Scan the record's block references. */
+		if (!XLogPrefetcherScanBlocks(prefetcher))
+			return;
+
+		/* Advance to the next record. */
+		prefetcher->record = NULL;
+	}
+}
+
+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+	DecodedXLogRecord *record = prefetcher->record;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	/*
+	 * We might already have been partway through processing this record when
+	 * our queue became saturated, so we need to start where we left off.
+	 */
+	for (int block_id = prefetcher->next_block_id;
+		 block_id <= record->max_block_id;
+		 ++block_id)
+	{
+		DecodedBkpBlock *block = &record->blocks[block_id];
+		PrefetchBufferResult prefetch;
+		SMgrRelation reln;
+
+		/* Ignore everything but the main fork for now. */
+		if (block->forknum != MAIN_FORKNUM)
+			continue;
+
+		/*
+		 * If there is a full page image attached, we won't be reading the
+		 * page, so you might think we should skip it.  However, if the
+		 * underlying filesystem uses larger logical blocks than us, it
+		 * might still need to perform a read-before-write some time later.
+		 * Therefore, only prefetch if configured to do so.
+		 */
+		if (block->has_image && !recovery_prefetch_fpw)
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
+			continue;
+		}
+
+		/*
+		 * If this block will initialize a new page then it's probably a
+		 * relation extension.  Since that might create a new segment, we
+		 * can't try to prefetch this block until the record has been
+		 * replayed, or we might try to open a file that doesn't exist yet.
+		 */
+		if (block->flags & BKPBLOCK_WILL_INIT)
+		{
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+									record->lsn);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Should we skip this block due to a filter? */
+		if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Fast path for repeated references to the same relation. */
+		if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+		{
+			/*
+			 * If this is a repeat access to the same block, then skip it.
+			 *
+			 * XXX We could also check for last_blkno + 1 too, and also update
+			 * last_blkno; it's not clear if the kernel would do a better job
+			 * of sequential prefetching.
+			 */
+			if (block->blkno == prefetcher->last_blkno)
+			{
+				pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
+				continue;
+			}
+
+			/* We can avoid calling smgropen(). */
+			reln = prefetcher->last_reln;
+		}
+		else
+		{
+			/* Otherwise we have to open it. */
+			reln = smgropen(block->rnode, InvalidBackendId);
+			prefetcher->last_rnode = block->rnode;
+			prefetcher->last_reln = reln;
+		}
+		prefetcher->last_blkno = block->blkno;
+
+		/* Try to prefetch this block! */
+		prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+		if (BufferIsValid(prefetch.recent_buffer))
+		{
+			/*
+			 * It was already cached, so do nothing.  Perhaps in future we
+			 * could remember the buffer so that recovery doesn't have to look
+			 * it up again.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
+		}
+		else if (prefetch.initiated_io)
+		{
+			/*
+			 * I/O has possibly been initiated (we can't tell whether the
+			 * kernel already had the page cached, so for lack of better
+			 * information we assume an I/O was started).  Record
+			 * this as an I/O in progress until eventually we replay this
+			 * LSN.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->prefetch, 1);
+			XLogPrefetcherInitiatedIO(prefetcher, record->lsn);
+			/*
+			 * If the queue is now full, we'll have to wait before processing
+			 * any more blocks from this record, or move to a new record if
+			 * that was the last block.
+			 */
+			if (XLogPrefetcherSaturated(prefetcher))
+			{
+				prefetcher->next_block_id = block_id + 1;
+				return false;
+			}
+		}
+		else
+		{
+			/*
+			 * Neither cached nor initiated.  The underlying segment file
+			 * doesn't exist.  Presumably it will be unlinked by a later WAL
+			 * record.  When recovery reads this block, it will use the
+			 * EXTENSION_CREATE_RECOVERY flag.  We certainly don't want to do
+			 * that sort of thing while merely prefetching, so let's just
+			 * ignore references to this relation until this record is
+			 * replayed, and let recovery create the dummy file or complain if
+			 * something is wrong.
+			 */
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+									record->lsn);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mod required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&Stats->reset_request) != Stats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&Stats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&Stats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_new));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_fpw));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_seq));
+	values[6] = Int32GetDatum(Stats->distance);
+	values[7] = Int32GetDatum(Stats->queue_depth);
+	values[8] = Float4GetDatum(Stats->avg_distance);
+	values[9] = Float4GetDatum(Stats->avg_queue_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Compute (n + 1) % prefetch_queue_size, assuming n < prefetch_queue_size,
+ * without using division.
+ */
+static inline int
+XLogPrefetcherNext(XLogPrefetcher *prefetcher, int n)
+{
+	int		next = n + 1;
+
+	return next == prefetcher->prefetch_queue_size ? 0 : next;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the (presumably lower) block
+		 * number there because we don't want to have to track individual
+		 * blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of the
+	 * time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+												   HASH_FIND, NULL);
+
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Insert an LSN into the queue.  The queue must not be full already.  This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+						  XLogRecPtr prefetching_lsn)
+{
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+	prefetcher->prefetch_queue[prefetcher->prefetch_head] = prefetching_lsn;
+	prefetcher->prefetch_head =
+		XLogPrefetcherNext(prefetcher, prefetcher->prefetch_head);
+	Stats->queue_depth++;
+	Assert(Stats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet?  That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches.  For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+		   prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+	{
+		prefetcher->prefetch_tail =
+			XLogPrefetcherNext(prefetcher, prefetcher->prefetch_tail);
+		Stats->queue_depth--;
+		Assert(Stats->queue_depth >= 0);
+	}
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+	int		next = XLogPrefetcherNext(prefetcher, prefetcher->prefetch_head);
+
+	return next == prefetcher->prefetch_tail;
+}
+
+void
+assign_recovery_prefetch(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
+
+void
+assign_recovery_prefetch_fpw(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch_fpw = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 22e5d5ff64..fb0d80e7c7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -866,6 +866,8 @@ err:
 	/*
 	 * Invalidate the read state, if this was an error. We might read from a
 	 * different source after failure.
+	 *
+	 * XXX !?!
 	 */
 	if (readOff < 0 || state->errormsg_buf[0] != '\0')
 		XLogReaderInvalReadState(state);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2e4aa1c4b6..fb3199a8ae 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -841,6 +841,20 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.skip_hit,
+            s.skip_new,
+            s.skip_fpw,
+            s.skip_seq,
+            s.distance,
+            s.queue_depth,
+            s.avg_distance,
+            s.avg_queue_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 083174f692..9434ef9ace 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "access/xlogprefetch.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -287,6 +288,7 @@ static PgStat_WalStats walStats;
 static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
 static PgStat_ReplSlotStats *replSlotStats;
 static int	nReplSlotStats;
+static PgStat_RecoveryPrefetchStats recoveryPrefetchStats;
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -364,6 +366,7 @@ static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
 static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
+static void pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1386,11 +1389,20 @@ pgstat_reset_shared_counters(const char *target)
 		msg.m_resettarget = RESET_BGWRITER;
 	else if (strcmp(target, "wal") == 0)
 		msg.m_resettarget = RESET_WAL;
+	else if (strcmp(target, "prefetch_recovery") == 0)
+	{
+		/*
+		 * We can't ask the stats collector to do this for us as it is not
+		 * attached to shared memory.
+		 */
+		XLogPrefetchRequestResetStats();
+		return;
+	}
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("unrecognized reset target: \"%s\"", target),
-				 errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
+				 errhint("Target must be \"archiver\", \"bgwriter\", \"wal\" or \"prefetch_recovery\".")));
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
 	pgstat_send(&msg, sizeof(msg));
@@ -2838,6 +2850,22 @@ pgstat_fetch_replslot(int *nslots_p)
 	return replSlotStats;
 }
 
+/*
+ * ---------
+ * pgstat_fetch_recoveryprefetch() -
+ *
+ *	Support function for restoring the counters managed by xlogprefetch.c.
+ * ---------
+ */
+PgStat_RecoveryPrefetchStats *
+pgstat_fetch_recoveryprefetch(void)
+{
+	backend_read_statsfile();
+
+	return &recoveryPrefetchStats;
+}
+
+
 /* ------------------------------------------------------------
  * Functions for management of the shared-memory PgBackendStatus array
  * ------------------------------------------------------------
@@ -4639,6 +4667,23 @@ pgstat_send_slru(void)
 }
 
 
+/* ----------
+ * pgstat_send_recoveryprefetch() -
+ *
+ *		Send recovery prefetch statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats)
+{
+	PgStat_MsgRecoveryPrefetch msg;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYPREFETCH);
+	msg.m_stats = *stats;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
 /* ----------
  * PgstatCollectorMain() -
  *
@@ -4852,6 +4897,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_slru(&msg.msg_slru, len);
 					break;
 
+				case PGSTAT_MTYPE_RECOVERYPREFETCH:
+					pgstat_recv_recoveryprefetch(&msg.msg_recoveryprefetch, len);
+					break;
+
 				case PGSTAT_MTYPE_FUNCSTAT:
 					pgstat_recv_funcstat(&msg.msg_funcstat, len);
 					break;
@@ -5134,6 +5183,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
 	rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
 	(void) rc;					/* we'll check for error with ferror */
 
+	/*
+	 * Write recovery prefetch stats struct
+	 */
+	rc = fwrite(&recoveryPrefetchStats, sizeof(recoveryPrefetchStats), 1,
+				fpout);
+	(void) rc;					/* we'll check for error with ferror */
+
 	/*
 	 * Walk through the database table.
 	 */
@@ -5408,6 +5464,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 	memset(&archiverStats, 0, sizeof(archiverStats));
 	memset(&walStats, 0, sizeof(walStats));
 	memset(&slruStats, 0, sizeof(slruStats));
+	memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
 
 	/*
 	 * Set the current timestamp (will be kept only in case we can't load an
@@ -5513,6 +5570,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 		goto done;
 	}
 
+	/*
+	 * Read recoveryPrefetchStats struct
+	 */
+	if (fread(&recoveryPrefetchStats, 1, sizeof(recoveryPrefetchStats),
+			  fpin) != sizeof(recoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
+		goto done;
+	}
+
 	/*
 	 * We found an existing collector stats file. Read it and put all the
 	 * hashtable entries into place.
@@ -5832,6 +5901,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 	PgStat_WalStats myWalStats;
 	PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
 	PgStat_ReplSlotStats myReplSlotStats;
+	PgStat_RecoveryPrefetchStats myRecoveryPrefetchStats;
 	FILE	   *fpin;
 	int32		format_id;
 	const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -5908,6 +5978,18 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 		return false;
 	}
 
+	/*
+	 * Read recovery prefetch stats struct
+	 */
+	if (fread(&myRecoveryPrefetchStats, 1, sizeof(myRecoveryPrefetchStats),
+			  fpin) != sizeof(myRecoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		FreeFile(fpin);
+		return false;
+	}
+
 	/* By default, we're going to return the timestamp of the global file. */
 	*ts = myGlobalStats.stats_timestamp;
 
@@ -6090,6 +6172,13 @@ backend_read_statsfile(void)
 		if (ok && file_ts >= min_ts)
 			break;
 
+		/*
+		 * If we're in crash recovery, the collector may not even be running,
+		 * so work with what we have.
+		 */
+		if (InRecovery)
+			break;
+
 		/* Not there or too old, so kick the collector and wait a bit */
 		if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
 			pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
@@ -6783,6 +6872,18 @@ pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
 	slruStats[msg->m_index].truncate += msg->m_truncate;
 }
 
+/* ----------
+ * pgstat_recv_recoveryprefetch() -
+ *
+ *	Process a recovery prefetch message.
+ * ----------
+ */
+static void
+pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len)
+{
+	recoveryPrefetchStats = msg->m_stats;
+}
+
 /* ----------
  * pgstat_recv_recoveryconflict() -
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..912a8cfcb6 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/xlogprefetch.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -126,6 +127,7 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, PredicateLockShmemSize());
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
+		size = add_size(size, XLogPrefetchShmemSize());
 		size = add_size(size, CLOGShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
@@ -216,6 +218,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index bb34630e8e..ffeb7b0704 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -37,6 +37,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetch.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/storage.h"
@@ -202,6 +203,7 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1248,6 +1250,32 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch referenced blocks during recovery"),
+			gettext_noop("Read ahead of the current replay position to find uncached blocks.")
+		},
+		&recovery_prefetch,
+		/* No point in enabling this on systems without a suitable API. */
+#ifdef USE_PREFETCH
+		true,
+#else
+		false,
+#endif
+		NULL, assign_recovery_prefetch, NULL
+	},
+	{
+		{"recovery_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch blocks that have full page images in the WAL"),
+			gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+						 "entirely overwritten, but if the logical page size of the filesystem is "
+						 "larger than PostgreSQL's, this can be beneficial.  This option has no "
+						 "effect unless recovery_prefetch is enabled.")
+		},
+		&recovery_prefetch_fpw,
+		false,
+		NULL, assign_recovery_prefetch_fpw, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2636,6 +2664,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
+			gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
+			GUC_UNIT_BYTE
+		},
+		&wal_decode_buffer_size,
+		512 * 1024, 64 * 1024, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the size of WAL files held for standby servers."),
@@ -2956,7 +2995,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -11636,6 +11676,20 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..16c5cc4fd7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -234,6 +234,12 @@
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
 
+# - Prefetching during recovery -
+
+#wal_decode_buffer_size = 512kB		# lookahead window used for prefetching
+#recovery_prefetch = on			# whether to prefetch referenced blocks
+#recovery_prefetch_fpw = off		# whether to prefetch pages logged with FPW
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..4f58fa029a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern int	wal_decode_buffer_size;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/access/xlogprefetch.h b/src/include/access/xlogprefetch.h
new file mode 100644
index 0000000000..8c04ff8bce
--- /dev/null
+++ b/src/include/access/xlogprefetch.h
@@ -0,0 +1,79 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCH_H
+#define XLOGPREFETCH_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern bool recovery_prefetch;
+extern bool recovery_prefetch_fpw;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern int XLogPrefetchReconfigureCount;
+
+typedef struct XLogPrefetchState
+{
+	XLogReaderState *reader;
+	XLogPrefetcher *prefetcher;
+	int			reconfigure_count;
+} XLogPrefetchState;
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchReconfigure(void);
+extern void XLogPrefetchRequestResetStats(void);
+
+extern void XLogPrefetchBegin(XLogPrefetchState *state, XLogReaderState *reader);
+extern void XLogPrefetchEnd(XLogPrefetchState *state);
+
+/* Functions exposed only for the use of XLogPrefetch(). */
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogReaderState *reader);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch,
+									XLogRecPtr replaying_lsn);
+
+/*
+ * Tell the prefetching module that we are now replaying a given LSN, so that
+ * it can decide how far ahead to read in the WAL, if configured.
+ */
+static inline void
+XLogPrefetch(XLogPrefetchState *state, XLogRecPtr replaying_lsn)
+{
+	/*
+	 * Handle any configuration changes.  Rather than trying to deal with
+	 * various parameter changes, we just tear down and set up a new
+	 * prefetcher if anything we depend on changes.
+	 */
+	if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
+	{
+		/* If we had a prefetcher, tear it down. */
+		if (state->prefetcher)
+		{
+			XLogPrefetcherFree(state->prefetcher);
+			state->prefetcher = NULL;
+		}
+		/* If we want a prefetcher, set it up. */
+		if (recovery_prefetch > 0)
+			state->prefetcher = XLogPrefetcherAllocate(state->reader);
+		state->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	if (state->prefetcher)
+		XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
+}
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index c01da4bf01..8e028eb35b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6185,6 +6185,14 @@
   prorettype => 'bool', proargtypes => '',
   prosrc => 'pg_is_wal_replay_paused' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,float4,float4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth,avg_distance,avg_queue_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c5f763dd44..01abc4fa2b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -64,6 +64,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_WAL,
 	PGSTAT_MTYPE_SLRU,
+	PGSTAT_MTYPE_RECOVERYPREFETCH,
 	PGSTAT_MTYPE_FUNCSTAT,
 	PGSTAT_MTYPE_FUNCPURGE,
 	PGSTAT_MTYPE_RECOVERYCONFLICT,
@@ -186,6 +187,19 @@ typedef struct PgStat_TableXactStatus
 	struct PgStat_TableXactStatus *next;	/* next of same subxact */
 } PgStat_TableXactStatus;
 
+/*
+ * Recovery prefetching statistics persisted on disk by pgstat.c, but kept in
+ * shared memory by xlogprefetch.c.
+ */
+typedef struct PgStat_RecoveryPrefetchStats
+{
+	PgStat_Counter prefetch;
+	PgStat_Counter skip_hit;
+	PgStat_Counter skip_new;
+	PgStat_Counter skip_fpw;
+	PgStat_Counter skip_seq;
+	TimestampTz stat_reset_timestamp;
+} PgStat_RecoveryPrefetchStats;
 
 /* ------------------------------------------------------------
  * Message formats follow
@@ -497,6 +511,15 @@ typedef struct PgStat_MsgReplSlot
 	PgStat_Counter m_stream_bytes;
 } PgStat_MsgReplSlot;
 
+/* ----------
+ * PgStat_MsgRecoveryPrefetch			Sent by XLogPrefetch to save statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgRecoveryPrefetch
+{
+	PgStat_MsgHdr m_hdr;
+	PgStat_RecoveryPrefetchStats m_stats;
+} PgStat_MsgRecoveryPrefetch;
 
 /* ----------
  * PgStat_MsgRecoveryConflict	Sent by the backend upon recovery conflict
@@ -644,6 +667,7 @@ typedef union PgStat_Msg
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgWal msg_wal;
 	PgStat_MsgSLRU msg_slru;
+	PgStat_MsgRecoveryPrefetch msg_recoveryprefetch;
 	PgStat_MsgFuncstat msg_funcstat;
 	PgStat_MsgFuncpurge msg_funcpurge;
 	PgStat_MsgRecoveryConflict msg_recoveryconflict;
@@ -1546,6 +1570,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
 extern void pgstat_send_wal(void);
+extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1563,6 +1588,7 @@ extern PgStat_GlobalStats *pgstat_fetch_global(void);
 extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
 extern PgStat_ReplSlotStats *pgstat_fetch_replslot(int *nslots_p);
+extern PgStat_RecoveryPrefetchStats *pgstat_fetch_recoveryprefetch(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 073c8f3e06..6007a81e95 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -441,4 +441,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetch.c */
+extern void assign_recovery_prefetch(bool new_value, void *extra);
+extern void assign_recovery_prefetch_fpw(bool new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 097ff5d111..804f4e24b5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1869,6 +1869,17 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.skip_hit,
+    s.skip_new,
+    s.skip_fpw,
+    s.skip_seq,
+    s.distance,
+    s.queue_depth,
+    s.avg_distance,
+    s.avg_queue_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth, avg_distance, avg_queue_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.20.1

v13-0005-WIP-Avoid-extra-buffer-lookup-when-prefetching-W.patch (text/x-patch)
From 67c0dfbebb854be803a9859296e5934094109864 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 14 Sep 2020 23:20:55 +1200
Subject: [PATCH v13 5/6] WIP: Avoid extra buffer lookup when prefetching WAL
 blocks.

Provide some workspace in decoded WAL records, so that we can remember
which buffer we recently found a block cached in, for later
use when replaying the record.  Provide a new way to look up a
recently-known buffer and check if it's still valid and has the right
tag.

XXX Needs review to figure out if it's safe or steamrolling over subtleties
---
 src/backend/access/transam/xlog.c         |  2 +-
 src/backend/access/transam/xlogprefetch.c |  6 ++--
 src/backend/access/transam/xlogreader.c   | 13 ++++++++
 src/backend/access/transam/xlogutils.c    | 23 ++++++++++---
 src/backend/storage/buffer/bufmgr.c       | 40 +++++++++++++++++++++++
 src/backend/storage/freespace/freespace.c |  3 +-
 src/include/access/xlogreader.h           |  7 ++++
 src/include/access/xlogutils.h            |  3 +-
 src/include/storage/bufmgr.h              |  2 ++
 9 files changed, 89 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5691850a74..0628ba7621 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1453,7 +1453,7 @@ checkXLogConsistency(XLogReaderState *record)
 		 * temporary page.
 		 */
 		buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									 RBM_NORMAL_NO_LOG);
+									 RBM_NORMAL_NO_LOG, InvalidBuffer);
 		if (!BufferIsValid(buf))
 			continue;
 
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
index a8149b946c..948a63f25d 100644
--- a/src/backend/access/transam/xlogprefetch.c
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -624,10 +624,10 @@ XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
 		if (BufferIsValid(prefetch.recent_buffer))
 		{
 			/*
-			 * It was already cached, so do nothing.  Perhaps in future we
-			 * could remember the buffer so that recovery doesn't have to look
-			 * it up again.
+			 * It was already cached, so do nothing.  We'll remember the
+			 * buffer, so that recovery can try to avoid looking it up again.
 			 */
+			block->recent_buffer = prefetch.recent_buffer;
 			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
 		}
 		else if (prefetch.initiated_io)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index fb0d80e7c7..9640899ea7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1651,6 +1651,8 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
 
+			blk->recent_buffer = InvalidBuffer;
+
 			COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 			/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
 			if (blk->has_data && blk->data_len == 0)
@@ -1860,6 +1862,15 @@ err:
 bool
 XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 				   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+	return XLogRecGetRecentBuffer(record, block_id, rnode, forknum, blknum,
+								  NULL);
+}
+
+bool
+XLogRecGetRecentBuffer(XLogReaderState *record, uint8 block_id,
+					   RelFileNode *rnode, ForkNumber *forknum,
+					   BlockNumber *blknum, Buffer *recent_buffer)
 {
 	DecodedBkpBlock *bkpb;
 
@@ -1874,6 +1885,8 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 		*forknum = bkpb->forknum;
 	if (blknum)
 		*blknum = bkpb->blkno;
+	if (recent_buffer)
+		*recent_buffer = bkpb->recent_buffer;
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index db0c801456..8a7eac65cf 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -336,11 +336,13 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber blkno;
+	Buffer		recent_buffer;
 	Page		page;
 	bool		zeromode;
 	bool		willinit;
 
-	if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+	if (!XLogRecGetRecentBuffer(record, block_id, &rnode, &forknum, &blkno,
+								&recent_buffer))
 	{
 		/* Caller specified a bogus block_id */
 		elog(PANIC, "failed to locate backup block with ID %d", block_id);
@@ -362,7 +364,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	{
 		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
+									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
+									  recent_buffer);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			elog(ERROR, "failed to restore block image");
@@ -391,7 +394,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	}
 	else
 	{
-		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode);
+		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode,
+									  recent_buffer);
 		if (BufferIsValid(*buf))
 		{
 			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
@@ -439,7 +443,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  */
 Buffer
 XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-					   BlockNumber blkno, ReadBufferMode mode)
+					   BlockNumber blkno, ReadBufferMode mode,
+					   Buffer recent_buffer)
 {
 	BlockNumber lastblock;
 	Buffer		buffer;
@@ -447,6 +452,15 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 
 	Assert(blkno != P_NEW);
 
+	/* Do we have a clue where the buffer might be already? */
+	if (BufferIsValid(recent_buffer) &&
+		mode == RBM_NORMAL &&
+		ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+	{
+		buffer = recent_buffer;
+		goto recent_buffer_fast_path;
+	}
+
 	/* Open the relation at smgr level */
 	smgr = smgropen(rnode, InvalidBackendId);
 
@@ -505,6 +519,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 		}
 	}
 
+recent_buffer_fast_path:
 	if (mode == RBM_NORMAL)
 	{
 		/* check that page has been initialized */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ad0d1a9abc..ece9ec35a2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -598,6 +598,46 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 	}
 }
 
+/*
+ * ReadRecentBuffer -- try to refind a buffer that we suspect holds a given
+ *		block
+ *
+ * Return true if the buffer is valid, has the correct tag, and we managed
+ * to pin it.
+ */
+bool
+ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum,
+				 Buffer recent_buffer)
+{
+	BufferDesc *bufHdr;
+	BufferTag	tag;
+
+	Assert(BufferIsValid(recent_buffer));
+
+	/* Look up the header by index, and try to pin if shared. */
+	if (BufferIsLocal(recent_buffer))
+		bufHdr = GetBufferDescriptor(-recent_buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(recent_buffer - 1);
+		ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+		if (!PinBuffer(bufHdr, NULL))
+		{
+			/* Not valid, couldn't pin it. */
+			UnpinBuffer(bufHdr, true);
+			return false;
+		}
+	}
+
+	/* Does the tag match? */
+	INIT_BUFFERTAG(tag, rnode, forkNum, blockNum);
+	if (BUFFERTAGS_EQUAL(tag, bufHdr->tag))
+		return true;
+
+	/* Nope -- this isn't the block we seek. */
+	UnpinBuffer(bufHdr, true);
+	return false;
+}
 
 /*
  * ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 6a96126b0c..c998b52c13 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -210,7 +210,8 @@ XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 	blkno = fsm_logical_to_physical(addr);
 
 	/* If the page doesn't exist already, extend */
-	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR);
+	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR,
+								 InvalidBuffer);
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(buf);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 44f8847030..616e591259 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -39,6 +39,7 @@
 #endif
 
 #include "access/xlogrecord.h"
+#include "storage/buf.h"
 
 /* WALOpenSegment represents a WAL segment being read. */
 typedef struct WALOpenSegment
@@ -126,6 +127,9 @@ typedef struct
 	ForkNumber	forknum;
 	BlockNumber blkno;
 
+	/* Workspace for remembering last known buffer holding this block. */
+	Buffer		recent_buffer;
+
 	/* copy of the fork_flags field from the XLogRecordBlockHeader */
 	uint8		flags;
 
@@ -377,5 +381,8 @@ extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *
 extern bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 							   RelFileNode *rnode, ForkNumber *forknum,
 							   BlockNumber *blknum);
+extern bool XLogRecGetRecentBuffer(XLogReaderState *record, uint8 block_id,
+								   RelFileNode *rnode, ForkNumber *forknum,
+								   BlockNumber *blknum, Buffer *recent_buffer);
 
 #endif							/* XLOGREADER_H */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 374c1b16ce..a0c2b60c57 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -42,7 +42,8 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 													Buffer *buf);
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-									 BlockNumber blkno, ReadBufferMode mode);
+									 BlockNumber blkno, ReadBufferMode mode,
+									 Buffer recent_buffer);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..c3280b754e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -176,6 +176,8 @@ extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_r
 												 BlockNumber blockNum);
 extern PrefetchBufferResult PrefetchBuffer(Relation reln, ForkNumber forkNum,
 										   BlockNumber blockNum);
+extern bool ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum,
+							 BlockNumber blockNum, Buffer recent_buffer);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
 extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BlockNumber blockNum, ReadBufferMode mode,
-- 
2.20.1

#56Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Thomas Munro (#55)
Re: WIP: WAL prefetch (another approach)

On 11/13/20 3:20 AM, Thomas Munro wrote:

...

I'm not really sure what to do about archive restore scripts that
block. That seems to be fundamentally incompatible with what I'm
doing here.

IMHO we can't do much about that, except for documenting it - if the
prefetch can't work because of a blocking restore script, someone has to
fix/improve the script. No way around that, I'm afraid.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#57Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#56)
Re: WIP: WAL prefetch (another approach)

Greetings,

* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:

On 11/13/20 3:20 AM, Thomas Munro wrote:

I'm not really sure what to do about archive restore scripts that
block. That seems to be fundamentally incompatible with what I'm
doing here.

IMHO we can't do much about that, except for documenting it - if the
prefetch can't work because of a blocking restore script, someone has to
fix/improve the script. No way around that, I'm afraid.

I'm a bit confused about what the issue here is- is the concern that a
restore_command is specified that isn't allowed to run concurrently but
this patch is intending to run more than one concurrently..? There's
another patch that I was looking at for doing pre-fetching of WAL
segments, so if this is also doing that we should figure out which
patch we want..

I don't know that it's needed, but it feels likely that we could provide
a better result if we consider making changes to the restore_command API
(eg: have a way to say "please fetch this many segments ahead, and you
can put them in this directory with these filenames" or something). I
would think we'd be able to continue supporting the existing API and
accept that it might not be as performant.
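
Just to make that concrete, a hypothetical sketch of such an extended
contract (the %n and %D substitutions don't exist today, and the command
name is made up) might look like:

    restore_command = 'fetch_wal %f %p --readahead=%n --spool-dir=%D'

where %n would be the number of segments the server would like fetched
ahead and %D a directory it is prepared to read them back from.  A
command that ignores the extra parameters would behave just like the
existing API.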

Thanks,

Stephen

#58Thomas Munro
thomas.munro@gmail.com
In reply to: Stephen Frost (#57)
5 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Sat, Nov 14, 2020 at 4:13 AM Stephen Frost <sfrost@snowman.net> wrote:

* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:

On 11/13/20 3:20 AM, Thomas Munro wrote:

I'm not really sure what to do about archive restore scripts that
block. That seems to be fundamentally incompatible with what I'm
doing here.

IMHO we can't do much about that, except for documenting it - if the
prefetch can't work because of a blocking restore script, someone has to
fix/improve the script. No way around that, I'm afraid.

I'm a bit confused about what the issue here is- is the concern that a
restore_command is specified that isn't allowed to run concurrently but
this patch is intending to run more than one concurrently..? There's
another patch that I was looking at for doing pre-fetching of WAL
segments, so if this is also doing that we should figure out which
patch we want..

The problem is that the recovery loop tries to look further ahead in
between applying individual records, which causes the restore script
to run, and if that blocks, we won't apply records that we already
have, because we're waiting for the next WAL file to appear. This
behaviour is on by default with my patch, so pg_standby will introduce
weird replay delays. We could think of some ways to fix that, with
meaningful return codes and periodic polling or something, I suppose,
but something feels a bit weird about it.
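
To make the failure mode concrete, here's a toy model (plain C, nothing to
do with the real recovery code) of what a blocking restore step does to
records we already have in hand:

    /* Toy only: pretend look-ahead wants a new segment before each record. */
    #include <stdio.h>
    #include <unistd.h>

    /* Stand-in for a restore_command that blocks until new WAL appears. */
    static void
    blocking_restore_next_segment(int segno)
    {
        printf("restore_command: waiting for segment %d ...\n", segno);
        sleep(1);
    }

    int
    main(void)
    {
        int records_in_hand = 5;   /* records already fetched and decodable */

        for (int i = 1; i <= records_in_hand; i++)
        {
            /* Replay of record i stalls here even though it's available. */
            blocking_restore_next_segment(i + 1);
            printf("applied record %d of %d\n", i, records_in_hand);
        }
        return 0;
    }

Obviously the real code doesn't fetch a segment per record; the point is
just that any synchronous wait in the look-ahead path delays records that
are already sitting in pg_wal.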

I don't know that it's needed, but it feels likely that we could provide
a better result if we consider making changes to the restore_command API
(eg: have a way to say "please fetch this many segments ahead, and you
can put them in this directory with these filenames" or something). I
would think we'd be able to continue supporting the existing API and
accept that it might not be as performant.

Hmm. Every time I try to think of a protocol change for the
restore_command API that would be acceptable, I go around the same
circle of thoughts about event flow and realise that what we really
need for this is ... a WAL receiver...

Here's a rebase over the recent commit "Get rid of the dedicated latch
for signaling the startup process." just to fix cfbot; no other
changes.

Attachments:

v14-0001-Add-pg_atomic_unlocked_add_fetch_XXX.patch (text/x-patch)
From 584f8f09651f554584bb4eeab6a7fe23b7582300 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 28 Mar 2020 11:42:59 +1300
Subject: [PATCH v14 1/6] Add pg_atomic_unlocked_add_fetch_XXX().

Add a variant of pg_atomic_add_fetch_XXX with no barrier semantics, for
cases where you only want to avoid the possibility that a concurrent
pg_atomic_read_XXX() sees a torn/partial value.  On modern
architectures, this is simply value++, but there is a fallback to
spinlock emulation.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/include/port/atomics.h         | 24 ++++++++++++++++++++++
 src/include/port/atomics/generic.h | 33 ++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 4956ec55cb..2abb852893 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -389,6 +389,21 @@ pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
 	return pg_atomic_add_fetch_u32_impl(ptr, add_);
 }
 
+/*
+ * pg_atomic_unlocked_add_fetch_u32 - atomically add to variable
+ *
+ * Like pg_atomic_unlocked_write_u32, guarantees only that partial values
+ * cannot be observed.
+ *
+ * No barrier semantics.
+ */
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	AssertPointerAlignment(ptr, 4);
+	return pg_atomic_unlocked_add_fetch_u32_impl(ptr, add_);
+}
+
 /*
  * pg_atomic_sub_fetch_u32 - atomically subtract from variable
  *
@@ -519,6 +534,15 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
 }
 
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+	return pg_atomic_unlocked_add_fetch_u64_impl(ptr, add_);
+}
+
 #undef INSIDE_ATOMICS_H
 
 #endif							/* ATOMICS_H */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index d60a0d9e7f..3e1598d8ff 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -234,6 +234,16 @@ pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 }
 #endif
 
+#if !defined(PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32)
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	ptr->value += add_;
+	return ptr->value;
+}
+#endif
+
 #if !defined(PG_HAVE_ATOMIC_SUB_FETCH_U32) && defined(PG_HAVE_ATOMIC_FETCH_SUB_U32)
 #define PG_HAVE_ATOMIC_SUB_FETCH_U32
 static inline uint32
@@ -399,3 +409,26 @@ pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
 }
 #endif
+
+#if defined(PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY) && \
+	!defined(PG_HAVE_ATOMIC_U64_SIMULATION)
+
+#ifndef PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	ptr->value += val;
+	return ptr->value;
+}
+#endif
+
+#else
+
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	return pg_atomic_add_fetch_u64_impl(ptr, val);
+}
+
+#endif
-- 
2.20.1

v14-0002-Improve-information-about-received-WAL.patch (text/x-patch)
From 8ef4adb6a88055c0f86a6a81f285e92b2ecd2ce3 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 10 Aug 2020 17:06:21 +1200
Subject: [PATCH v14 2/6] Improve information about received WAL.

In commit d140f2f3, we cleaned up the distinction between flushed and
written LSN positions.  Go further, and expose the written location in a
way that allows for the associated timeline ID to be read consistently.
Without that, it might be difficult to know the path of the file that
has been written, without data races.

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/replication/walreceiver.c      | 10 ++++--
 src/backend/replication/walreceiverfuncs.c | 41 +++++++++++++++++-----
 src/include/replication/walreceiver.h      | 30 +++++++++-------
 3 files changed, 56 insertions(+), 25 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index babee386c4..ba42f59d6c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -870,6 +870,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 {
 	int			startoff;
 	int			byteswritten;
+	WalRcvData *walrcv = WalRcv;
 
 	while (nbytes > 0)
 	{
@@ -961,7 +962,10 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	}
 
 	/* Update shared-memory status */
-	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+	SpinLockAcquire(&walrcv->mutex);
+	pg_atomic_write_u64(&walrcv->writtenUpto, LogstreamResult.Write);
+	walrcv->writtenTLI = ThisTimeLineID;
+	SpinLockRelease(&walrcv->mutex);
 }
 
 /*
@@ -987,7 +991,7 @@ XLogWalRcvFlush(bool dying)
 		{
 			walrcv->latestChunkStart = walrcv->flushedUpto;
 			walrcv->flushedUpto = LogstreamResult.Flush;
-			walrcv->receivedTLI = ThisTimeLineID;
+			walrcv->flushedTLI = ThisTimeLineID;
 		}
 		SpinLockRelease(&walrcv->mutex);
 
@@ -1327,7 +1331,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	receive_start_tli = WalRcv->receiveStartTLI;
 	written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto);
 	flushed_lsn = WalRcv->flushedUpto;
-	received_tli = WalRcv->receivedTLI;
+	received_tli = WalRcv->flushedTLI;
 	last_send_time = WalRcv->lastMsgSendTime;
 	last_receipt_time = WalRcv->lastMsgReceiptTime;
 	latest_end_lsn = WalRcv->latestWalEnd;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index c3e317df9f..3bd1fadbd3 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -284,10 +284,12 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 	 * If this is the first startup of walreceiver (on this timeline),
 	 * initialize flushedUpto and latestChunkStart to the starting point.
 	 */
-	if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
+	if (walrcv->receiveStart == 0 || walrcv->flushedTLI != tli)
 	{
+		pg_atomic_write_u64(&walrcv->writtenUpto, recptr);
+		walrcv->writtenTLI = tli;
 		walrcv->flushedUpto = recptr;
-		walrcv->receivedTLI = tli;
+		walrcv->flushedTLI = tli;
 		walrcv->latestChunkStart = recptr;
 	}
 	walrcv->receiveStart = recptr;
@@ -309,10 +311,10 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
  * Optionally, returns the previous chunk start, that is the first byte
  * written in the most recent walreceiver flush cycle.  Callers not
  * interested in that value may pass NULL for latestChunkStart. Same for
- * receiveTLI.
+ * flushedTLI.
  */
 XLogRecPtr
-GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *flushedTLI)
 {
 	WalRcvData *walrcv = WalRcv;
 	XLogRecPtr	recptr;
@@ -321,8 +323,8 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 	recptr = walrcv->flushedUpto;
 	if (latestChunkStart)
 		*latestChunkStart = walrcv->latestChunkStart;
-	if (receiveTLI)
-		*receiveTLI = walrcv->receivedTLI;
+	if (flushedTLI)
+		*flushedTLI = walrcv->flushedTLI;
 	SpinLockRelease(&walrcv->mutex);
 
 	return recptr;
@@ -330,14 +332,35 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 
 /*
  * Returns the last+1 byte position that walreceiver has written.
- * This returns a recently written value without taking a lock.
+ *
+ * The other arguments are similar to GetWalRcvFlushRecPtr()'s.
  */
 XLogRecPtr
-GetWalRcvWriteRecPtr(void)
+GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *writtenTLI)
 {
 	WalRcvData *walrcv = WalRcv;
+	XLogRecPtr	recptr;
+
+	SpinLockAcquire(&walrcv->mutex);
+	recptr = pg_atomic_read_u64(&walrcv->writtenUpto);
+	if (latestChunkStart)
+		*latestChunkStart = walrcv->latestChunkStart;
+	if (writtenTLI)
+		*writtenTLI = walrcv->writtenTLI;
+	SpinLockRelease(&walrcv->mutex);
 
-	return pg_atomic_read_u64(&walrcv->writtenUpto);
+	return recptr;
+}
+
+/*
+ * For callers that don't need a consistent LSN, TLI pair, and that don't mind
+ * a potentially slightly out of date value in exchange for speed, this
+ * version provides an unlocked view of the latest written location.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtrUnlocked(void)
+{
+	return pg_atomic_read_u64(&WalRcv->writtenUpto);
 }
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 1b05b39df4..84f84567cd 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -74,14 +74,25 @@ typedef struct
 	TimeLineID	receiveStartTLI;
 
 	/*
-	 * flushedUpto-1 is the last byte position that has already been received,
-	 * and receivedTLI is the timeline it came from.  At the first startup of
+	 * flushedUpto-1 is the last byte position that has already been flushed,
+	 * and flushedTLI is the timeline it came from.  At the first startup of
 	 * walreceiver, these are set to receiveStart and receiveStartTLI. After
 	 * that, walreceiver updates these whenever it flushes the received WAL to
 	 * disk.
 	 */
 	XLogRecPtr	flushedUpto;
-	TimeLineID	receivedTLI;
+	TimeLineID	flushedTLI;
+
+	/*
+	 * writtenUpto-1 is like flushedUpto-1, except that it's updated
+	 * without waiting for the flush, after the data has been written to disk
+	 * and available for reading.  It is an atomic type so that we can read it
+	 * without locks.  We still acquire the spinlock in cases where it is
+	 * written or read along with the TLI, so that they can be accessed
+	 * together consistently.
+	 */
+	pg_atomic_uint64 writtenUpto;
+	TimeLineID	writtenTLI;
 
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
@@ -142,14 +153,6 @@ typedef struct
 
 	slock_t		mutex;			/* locks shared variables shown above */
 
-	/*
-	 * Like flushedUpto, but advanced after writing and before flushing,
-	 * without the need to acquire the spin lock.  Data can be read by another
-	 * process up to this point, but shouldn't be used for data integrity
-	 * purposes.
-	 */
-	pg_atomic_uint64 writtenUpto;
-
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.  But we do need atomic fetch and
@@ -457,8 +460,9 @@ extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname,
 								 bool create_temp_slot);
-extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
-extern XLogRecPtr GetWalRcvWriteRecPtr(void);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *flushedTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *writtenTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtrUnlocked(void);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.20.1

v14-0003-Provide-XLogReadAhead-to-decode-future-WAL-recor.patch (text/x-patch)
From c78b9166b7bf8bcb9349c34d7e768fbeefda7c34 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 10 Aug 2020 17:06:21 +1200
Subject: [PATCH v14 3/6] Provide XLogReadAhead() to decode future WAL records.

Teach xlogreader.c to decode its output into a circular buffer, to
support a future prefetching patch.  Provides two new interfaces:

 * XLogReadRecord() works as before, except that it returns a pointer to
   a new decoded record object rather than just the header

 * XLogReadAhead() implements a second cursor that allows you to read
   further ahead, as long as there is enough space in the circular decoding
   buffer

To support existing callers of XLogReadRecord(), the most recently
returned record also becomes the "current" record, for the purpose of
calls to XLogRecGetXXX() macros and functions, so that the multi-record
nature of the WAL decoder is hidden from code paths that don't need to
care about this change.

To support opportunistic readahead, the page-read callback function
gains a "noblock" parameter.  This allows for calls to XLogReadAhead()
to return without waiting if there is currently no data available, in
particular in the case of streaming replication.  For non-blocking
XLogReadAhead() to work, a page-read callback that understands "noblock"
must be supplied.  Existing callbacks that ignore it work as before, as
long as you only use the XLogReadRecord() interface.

The main XLogPageRead() routine used by recovery is extended to respect
noblock mode when the WAL source is a walreceiver.

Very large records that don't fit in the circular buffer are marked as
"oversized" and allocated and freed piecemeal.  The decoding buffer can
be placed in shared memory, for potential future work on parallelizing
recovery.

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/access/transam/generic_xlog.c |   6 +-
 src/backend/access/transam/xlog.c         | 105 +++-
 src/backend/access/transam/xlogreader.c   | 620 +++++++++++++++++-----
 src/backend/access/transam/xlogutils.c    |   5 +-
 src/backend/postmaster/pgstat.c           |   3 +
 src/backend/replication/logical/decode.c  |   2 +-
 src/backend/replication/walsender.c       |   2 +-
 src/bin/pg_rewind/parsexlog.c             |   8 +-
 src/bin/pg_waldump/pg_waldump.c           |  24 +-
 src/include/access/xlogreader.h           | 127 +++--
 src/include/access/xlogutils.h            |   3 +-
 src/include/pgstat.h                      |   1 +
 12 files changed, 699 insertions(+), 207 deletions(-)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 5164a1c2f3..5f6df896ad 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -482,10 +482,10 @@ generic_redo(XLogReaderState *record)
 	uint8		block_id;
 
 	/* Protect limited size of buffers[] array */
-	Assert(record->max_block_id < MAX_GENERIC_XLOG_PAGES);
+	Assert(XLogRecMaxBlockId(record) < MAX_GENERIC_XLOG_PAGES);
 
 	/* Iterate over blocks */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		XLogRedoAction action;
 
@@ -525,7 +525,7 @@ generic_redo(XLogReaderState *record)
 	}
 
 	/* Changes are done: unlock and release all buffers */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (BufferIsValid(buffers[block_id]))
 			UnlockReleaseBuffer(buffers[block_id]);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7d97b96e72..49d8172405 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -211,7 +211,8 @@ static XLogRecPtr LastRec;
 
 /* Local copy of WalRcv->flushedUpto */
 static XLogRecPtr flushedUpto = 0;
-static TimeLineID receiveTLI = 0;
+static XLogRecPtr writtenUpto = 0;
+static TimeLineID writtenTLI = 0;
 
 /*
  * During recovery, lastFullPageWrites keeps track of full_page_writes that
@@ -911,9 +912,11 @@ static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
+						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+						 bool nowait);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-										bool fetching_ckpt, XLogRecPtr tliRecPtr);
+										bool fetching_ckpt, XLogRecPtr tliRecPtr,
+										bool nowait);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
@@ -1417,7 +1420,7 @@ checkXLogConsistency(XLogReaderState *record)
 
 	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		Buffer		buf;
 		Page		page;
@@ -4347,6 +4350,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 		record = XLogReadRecord(xlogreader, &errormsg);
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
+
 		if (record == NULL)
 		{
 			if (readFile >= 0)
@@ -4390,6 +4394,42 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 
 		if (record)
 		{
+			if (readSource == XLOG_FROM_STREAM)
+			{
+				/*
+				 * In streaming mode, we allow ourselves to read records that
+				 * have been written but not yet flushed, for increased
+				 * concurrency.  We still have to wait until the record has
+				 * been flushed before allowing it to be replayed.
+				 *
+				 * XXX This logic preserves the traditional behaviour where we
+				 * didn't replay records until the walreceiver flushed them,
+				 * except that now we read and decode them sooner.  Could it
+				 * be relaxed even more?  Isn't the real data integrity
+				 * requirement for _writeback_ to stall until the WAL is
+				 * durable, not recovery, just as on a primary?
+				 *
+				 * XXX Are there any circumstances in which this should be
+				 * interruptible?
+				 *
+				 * XXX We don't replicate the XLogReceiptTime etc logic from
+				 * WaitForWALToBecomeAvailable() here...  probably need to
+				 * refactor/share code?
+				 */
+				if (EndRecPtr < flushedUpto)
+				{
+					while (EndRecPtr < (flushedUpto = GetWalRcvFlushRecPtr(NULL, NULL)))
+					{
+						(void) WaitLatch(XLogCtl->recoveryWakeupLatch,
+										 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+										 -1,
+										 WAIT_EVENT_RECOVERY_WAL_FLUSH);
+						CHECK_FOR_INTERRUPTS();
+						ResetLatch(XLogCtl->recoveryWakeupLatch);
+					}
+				}
+			}
+
 			/* Great, got a record */
 			return record;
 		}
@@ -10115,7 +10155,7 @@ xlog_redo(XLogReaderState *record)
 		 * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info
 		 * code just to distinguish them for statistics purposes.
 		 */
-		for (uint8 block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (uint8 block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			Buffer		buffer;
 
@@ -10251,7 +10291,7 @@ xlog_block_info(StringInfo buf, XLogReaderState *record)
 	int			block_id;
 
 	/* decode block references */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
@@ -11873,7 +11913,7 @@ CancelBackup(void)
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
-			 XLogRecPtr targetRecPtr, char *readBuf)
+			 XLogRecPtr targetRecPtr, char *readBuf, bool nowait)
 {
 	XLogPageReadPrivate *private =
 	(XLogPageReadPrivate *) xlogreader->private_data;
@@ -11885,6 +11925,15 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
 
+	/*
+	 * If streaming and asked not to wait, return as quickly as possible if
+	 * the data we want isn't available immediately.  Use an unlocked read of
+	 * the latest written position.
+	 */
+	if (readSource == XLOG_FROM_STREAM && nowait &&
+		GetWalRcvWriteRecPtrUnlocked() < targetPagePtr + reqLen)
+		return -1;
+
 	/*
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
@@ -11895,6 +11944,9 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		/*
 		 * Request a restartpoint if we've replayed too much xlog since the
 		 * last one.
+		 *
+		 * XXX Why is this here?  Move it to recovery loop, since it's based
+		 * on replay position, not read position?
 		 */
 		if (bgwriterLaunched)
 		{
@@ -11917,12 +11969,13 @@ retry:
 	/* See if we need to retrieve more data */
 	if (readFile < 0 ||
 		(readSource == XLOG_FROM_STREAM &&
-		 flushedUpto < targetPagePtr + reqLen))
+		 writtenUpto < targetPagePtr + reqLen))
 	{
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
 										 private->randAccess,
 										 private->fetching_ckpt,
-										 targetRecPtr))
+										 targetRecPtr,
+										 nowait))
 		{
 			if (readFile >= 0)
 				close(readFile);
@@ -11948,10 +12001,10 @@ retry:
 	 */
 	if (readSource == XLOG_FROM_STREAM)
 	{
-		if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
+		if (((targetPagePtr) / XLOG_BLCKSZ) != (writtenUpto / XLOG_BLCKSZ))
 			readLen = XLOG_BLCKSZ;
 		else
-			readLen = XLogSegmentOffset(flushedUpto, wal_segment_size) -
+			readLen = XLogSegmentOffset(writtenUpto, wal_segment_size) -
 				targetPageOff;
 	}
 	else
@@ -12071,7 +12124,8 @@ next_record_is_invalid:
  */
 static bool
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-							bool fetching_ckpt, XLogRecPtr tliRecPtr)
+							bool fetching_ckpt, XLogRecPtr tliRecPtr,
+							bool nowait)
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
@@ -12174,6 +12228,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * hope...
 					 */
 
+					/* If we were asked not to wait, give up immediately. */
+					if (nowait)
+						return false;
+
 					/*
 					 * We should be able to move to XLOG_FROM_STREAM only in
 					 * standby mode.
@@ -12290,6 +12348,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				if (readFile >= 0)
 					return true;	/* success! */
 
+				/* If we were asked not to wait, give up immediately. */
+				if (nowait)
+					return false;
+
 				/*
 				 * Nope, not found in archive or pg_wal.
 				 */
@@ -12367,7 +12429,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
 											 PrimarySlotName,
 											 wal_receiver_create_temp_slot);
-						flushedUpto = 0;
+						writtenUpto = 0;
 					}
 
 					/*
@@ -12390,15 +12452,16 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * be updated on each cycle. When we are behind,
 					 * XLogReceiptTime will not advance, so the grace time
 					 * allotted to conflicting queries will decrease.
+					 *
 					 */
-					if (RecPtr < flushedUpto)
+					if (RecPtr < writtenUpto)
 						havedata = true;
 					else
 					{
 						XLogRecPtr	latestChunkStart;
 
-						flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
-						if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
+						writtenUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &writtenTLI);
+						if (RecPtr < writtenUpto && writtenTLI == curFileTLI)
 						{
 							havedata = true;
 							if (latestChunkStart <= RecPtr)
@@ -12424,9 +12487,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						if (readFile < 0)
 						{
 							if (!expectedTLEs)
-								expectedTLEs = readTimeLineHistory(receiveTLI);
+								expectedTLEs = readTimeLineHistory(writtenTLI);
 							readFile = XLogFileRead(readSegNo, PANIC,
-													receiveTLI,
+													writtenTLI,
 													XLOG_FROM_STREAM, false);
 							Assert(readFile >= 0);
 						}
@@ -12440,6 +12503,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						break;
 					}
 
+					/* If we were asked not to wait, give up immediately. */
+					if (nowait)
+						return false;
+
 					/*
 					 * Data not here yet. Check for trigger, then wait for
 					 * walreceiver to wake us up when new WAL arrives.
@@ -12476,6 +12543,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * Wait for more WAL to arrive. Time out after 5 seconds
 					 * to react to a trigger file promptly and to check if the
 					 * WAL receiver is still active.
+					 *
+					 * XXX This is signalled on *flush*, not on write.  Oops.
 					 */
 					(void) WaitLatch(MyLatch,
 									 WL_LATCH_SET | WL_TIMEOUT |
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a63ad8cfd0..22e5d5ff64 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -37,7 +37,9 @@ static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
 			pg_attribute_printf(2, 3);
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
-							 int reqLen);
+							 int reqLen, bool nowait);
+size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+static DecodedXLogRecord *XLogReadRecordInternal(XLogReaderState *state, bool force);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
@@ -50,6 +52,8 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 /* size of the buffer allocated for error message. */
 #define MAX_ERRORMSG_LEN 1000
 
+#define DEFAULT_DECODE_BUFFER_SIZE 0x10000
+
 /*
  * Construct a string in state->errormsg_buf explaining what's wrong with
  * the current record being read.
@@ -64,6 +68,8 @@ report_invalid_record(XLogReaderState *state, const char *fmt,...)
 	va_start(args, fmt);
 	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
 	va_end(args);
+
+	state->errormsg_deferred = true;
 }
 
 /*
@@ -86,8 +92,6 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	/* initialize caller-provided support functions */
 	state->routine = *routine;
 
-	state->max_block_id = -1;
-
 	/*
 	 * Permanently allocate readBuf.  We do it this way, rather than just
 	 * making a static array, for two reasons: (1) no need to waste the
@@ -138,18 +142,11 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 void
 XLogReaderFree(XLogReaderState *state)
 {
-	int			block_id;
-
 	if (state->seg.ws_file != -1)
 		state->routine.segment_close(state);
 
-	for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
-	{
-		if (state->blocks[block_id].data)
-			pfree(state->blocks[block_id].data);
-	}
-	if (state->main_data)
-		pfree(state->main_data);
+	if (state->decode_buffer && state->free_decode_buffer)
+		pfree(state->decode_buffer);
 
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
@@ -158,6 +155,22 @@ XLogReaderFree(XLogReaderState *state)
 	pfree(state);
 }
 
+/*
+ * Set the size of the decoding buffer.  A pointer to a caller supplied memory
+ * region may also be passed in, in which case non-oversized records will be
+ * decoded there.
+ */
+void
+XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size)
+{
+	Assert(state->decode_buffer == NULL);
+
+	state->decode_buffer = buffer;
+	state->decode_buffer_size = size;
+	state->decode_buffer_head = buffer;
+	state->decode_buffer_tail = buffer;
+}
+
 /*
  * Allocate readRecordBuf to fit a record of at least the given length.
  * Returns true if successful, false if out of memory.
@@ -245,7 +258,9 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
 
 	/* Begin at the passed-in record pointer. */
 	state->EndRecPtr = RecPtr;
+	state->NextRecPtr = RecPtr;
 	state->ReadRecPtr = InvalidXLogRecPtr;
+	state->DecodeRecPtr = InvalidXLogRecPtr;
 }
 
 /*
@@ -266,6 +281,261 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  */
 XLogRecord *
 XLogReadRecord(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *record;
+
+	/* We can release the most recently returned record. */
+	if (state->record)
+	{
+		/*
+		 * Remove it from the decoded record queue.  It must be the oldest
+		 * item decoded, decode_queue_tail.
+		 */
+		record = state->record;
+		Assert(record == state->decode_queue_tail);
+		state->record = NULL;
+		state->decode_queue_tail = record->next;
+
+		/* It might also be the newest item decoded, decode_queue_head. */
+		if (state->decode_queue_head == record)
+			state->decode_queue_head = NULL;
+
+		/* Release the space. */
+		if (unlikely(record->oversized))
+		{
+			/* It's not in the decode buffer, so free it to release space. */
+			pfree(record);
+		}
+		else
+		{
+			/* It must be the tail record in the decode buffer. */
+			Assert(state->decode_buffer_tail == (char *) record);
+
+			/*
+			 * We need to update tail to point to the next record that is in
+			 * the decode buffer, if any, being careful to skip oversized ones
+			 * (they're not in the decode buffer).
+			 */
+			record = record->next;
+			while (unlikely(record && record->oversized))
+				record = record->next;
+			if (record)
+			{
+				/* Adjust tail to release space. */
+				state->decode_buffer_tail = (char *) record;
+			}
+			else
+			{
+				/* Nothing else in the decode buffer, so just reset it. */
+				state->decode_buffer_tail = state->decode_buffer;
+				state->decode_buffer_head = state->decode_buffer;
+			}
+		}
+	}
+
+	for (;;)
+	{
+		/* We can now return the tail item in the read queue, if there is one. */
+		if (state->decode_queue_tail)
+		{
+			/*
+			 * Is this record at the LSN that the caller expects?  If it
+			 * isn't, this indicates that EndRecPtr has been moved to a new
+			 * position by the caller, so we'd better reset our read queue and
+			 * move to the new location.
+			 */
+
+
+			/*
+			 * Record this as the most recent record returned, so that we'll
+			 * release it next time.  This also exposes it to the
+			 * XLogRecXXX(decoder) macros, which pass in the decoder rather than
+			 * than the record for historical reasons.
+			 */
+			state->record = state->decode_queue_tail;
+
+			/*
+			 * It should be immediately after the last record returned by
+			 * XLogReadRecord(), or at the position set by XLogBeginRead() if
+			 * XLogReadRecord() hasn't been called yet.  It may be after a
+			 * page header, though.
+			 */
+			Assert(state->record->lsn == state->EndRecPtr ||
+				   (state->EndRecPtr % XLOG_BLCKSZ == 0 &&
+					(state->record->lsn == state->EndRecPtr + SizeOfXLogShortPHD ||
+					 state->record->lsn == state->EndRecPtr + SizeOfXLogLongPHD)));
+
+			/*
+			 * Likewise, set ReadRecPtr and EndRecPtr to correspond to that
+			 * record.
+			 *
+			 * XXX Calling code should perhaps access these through the
+			 * returned decoded record, but for now we'll update them directly
+			 * here, for the benefit of existing code that thinks there's only
+			 * one record in the decoder.
+			 */
+			state->ReadRecPtr = state->record->lsn;
+			state->EndRecPtr = state->record->next_lsn;
+
+			/* XXX can't return pointer to header, will be given back to XLogDecodeRecord()! */
+			*errormsg = NULL;
+			return &state->record->header;
+		}
+		else if (state->errormsg_deferred)
+		{
+			/*
+			 * If we've run out of records, but we have a deferred error, now
+			 * is the time to report it.
+			 */
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			else
+				*errormsg = NULL;
+			state->errormsg_deferred = false;
+
+			/* Report the location of the error. */
+			state->ReadRecPtr = state->DecodeRecPtr;
+			state->EndRecPtr = state->NextRecPtr;
+
+			return NULL;
+		}
+
+		/* We need to get a decoded record into our queue first. */
+		XLogReadRecordInternal(state, true /* wait */ );
+
+		/*
+		 * If that produced neither a queued record nor a queued error, then
+		 * we're at the end (for example, archive recovery with no more files
+		 * available).
+		 */
+		if (state->decode_queue_tail == NULL && !state->errormsg_deferred)
+		{
+			state->EndRecPtr = state->NextRecPtr;
+			*errormsg = NULL;
+			return NULL;
+		}
+	}
+
+	/* unreachable */
+	return NULL;
+}
+
+/*
+ * Try to decode the next available record.  The decoded record will also be
+ * returned by XLogReadRecord() later.
+ */
+DecodedXLogRecord *
+XLogReadAhead(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *record = NULL;
+
+	if (!state->errormsg_deferred)
+	{
+		record = XLogReadRecordInternal(state, false);
+		if (state->errormsg_deferred)
+		{
+			/*
+			 * Report the error once, but don't consume it, so that
+			 * XLogReadRecord() can report it too.
+			 */
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			else
+				*errormsg = NULL;
+			return NULL;
+		}
+	}
+	*errormsg = NULL;
+
+	return record;
+}
+
+/*
+ * Allocate space for a decoded record.  The only member of the returned
+ * object that is initialized is the 'oversized' flag, indicating that the
+ * decoded record wouldn't fit in the decode buffer and must eventually be
+ * freed explicitly.
+ *
+ * Return NULL if there is no space in the decode buffer and allow_oversized
+ * is false, or if memory allocation fails for an oversized buffer.
+ */
+static DecodedXLogRecord *
+XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)
+{
+	size_t		required_space = DecodeXLogRecordRequiredSpace(xl_tot_len);
+	DecodedXLogRecord *decoded = NULL;
+
+	/* Allocate a circular decode buffer if we don't have one already. */
+	if (unlikely(state->decode_buffer == NULL))
+	{
+		if (state->decode_buffer_size == 0)
+			state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+		state->decode_buffer = palloc(state->decode_buffer_size);
+		state->decode_buffer_head = state->decode_buffer;
+		state->decode_buffer_tail = state->decode_buffer;
+		state->free_decode_buffer = true;
+	}
+	if (state->decode_buffer_head >= state->decode_buffer_tail)
+	{
+		/* Empty, or head is to the right of tail. */
+		if (state->decode_buffer_head + required_space <=
+			state->decode_buffer + state->decode_buffer_size)
+		{
+			/* There is space between head and end. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+		else if (state->decode_buffer + required_space <
+				 state->decode_buffer_tail)
+		{
+			/* There is space between start and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+	else
+	{
+		/* Head is to the left of tail. */
+		if (state->decode_buffer_head + required_space <
+			state->decode_buffer_tail)
+		{
+			/* There is space between head and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+
+	/* Not enough space in the decode buffer.  Are we allowed to allocate? */
+	if (allow_oversized)
+	{
+		decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
+		if (decoded == NULL)
+			return NULL;
+		decoded->oversized = true;
+		return decoded;
+	}
+
+	return decoded;
+}
+
+/*
+ * Try to read and decode the next record and add it to the head of the
+ * decoded record queue.
+ *
+ * If "force" is true, then wait for data to become available, and read a
+ * record even if it doesn't fit in the decode buffer, using overflow storage.
+ *
+ * If "force" is false, then return immediately if we'd have to wait for more
+ * data to become available, or if there isn't enough space in the decode
+ * buffer.
+ *
+ * Return the decoded record, or NULL if there was an error or ... XXX
+ */
+static DecodedXLogRecord *
+XLogReadRecordInternal(XLogReaderState *state, bool force)
 {
 	XLogRecPtr	RecPtr;
 	XLogRecord *record;
@@ -277,6 +547,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	uint32		pageHeaderSize;
 	bool		gotheader;
 	int			readOff;
+	DecodedXLogRecord *decoded;
+	char	   *errormsg; /* not used */
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -286,19 +558,17 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	randAccess = false;
 
 	/* reset error state */
-	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	decoded = NULL;
 
-	ResetDecoder(state);
-
-	RecPtr = state->EndRecPtr;
+	RecPtr = state->NextRecPtr;
 
-	if (state->ReadRecPtr != InvalidXLogRecPtr)
+	if (state->DecodeRecPtr != InvalidXLogRecPtr)
 	{
 		/* read the record after the one we just read */
 
 		/*
-		 * EndRecPtr is pointing to end+1 of the previous WAL record.  If
+		 * NextRecPtr is pointing to end+1 of the previous WAL record.  If
 		 * we're at a page boundary, no more records can fit on the current
 		 * page. We must skip over the page header, but we can't do that until
 		 * we've read in the page, since the header size is variable.
@@ -309,7 +579,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		/*
 		 * Caller supplied a position to start at.
 		 *
-		 * In this case, EndRecPtr should already be pointing to a valid
+		 * In this case, NextRecPtr should already be pointing to a valid
 		 * record starting position.
 		 */
 		Assert(XRecOffIsValid(RecPtr));
@@ -327,7 +597,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	 * fits on the same page.
 	 */
 	readOff = ReadPageInternal(state, targetPagePtr,
-							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
+							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ),
+							   !force);
 	if (readOff < 0)
 		goto err;
 
@@ -374,6 +645,19 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
 	total_len = record->xl_tot_len;
 
+	/* Find space to decode this record. */
+	decoded = XLogReadRecordAlloc(state, total_len, force);
+	if (decoded == NULL)
+	{
+		/*
+		 * We couldn't get space.  Usually this means that the decode buffer
+		 * was full, while trying to read ahead (that is, !force).  It's also
+		 * remotely possible for palloc() to have failed to allocate memory
+		 * for an oversized record.
+		 */
+		goto err;
+	}
+
 	/*
 	 * If the whole record header is on this page, validate it immediately.
 	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
@@ -384,7 +668,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	 */
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
-		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
+		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
 								   randAccess))
 			goto err;
 		gotheader = true;
@@ -439,7 +723,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			/* Wait for the next page to become available */
 			readOff = ReadPageInternal(state, targetPagePtr,
 									   Min(total_len - gotlen + SizeOfXLogShortPHD,
-										   XLOG_BLCKSZ));
+										   XLOG_BLCKSZ),
+									   !force);
 
 			if (readOff < 0)
 				goto err;
@@ -476,7 +761,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
 			if (readOff < pageHeaderSize)
 				readOff = ReadPageInternal(state, targetPagePtr,
-										   pageHeaderSize);
+										   pageHeaderSize, !force);
 
 			Assert(pageHeaderSize <= readOff);
 
@@ -487,7 +772,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
 			if (readOff < pageHeaderSize + len)
 				readOff = ReadPageInternal(state, targetPagePtr,
-										   pageHeaderSize + len);
+										   pageHeaderSize + len,
+										   !force);
 
 			memcpy(buffer, (char *) contdata, len);
 			buffer += len;
@@ -497,7 +783,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			if (!gotheader)
 			{
 				record = (XLogRecord *) state->readRecordBuf;
-				if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
+				if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr,
 										   record, randAccess))
 					goto err;
 				gotheader = true;
@@ -511,15 +797,16 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			goto err;
 
 		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
-		state->ReadRecPtr = RecPtr;
-		state->EndRecPtr = targetPagePtr + pageHeaderSize
+		state->DecodeRecPtr = RecPtr;
+		state->NextRecPtr = targetPagePtr + pageHeaderSize
 			+ MAXALIGN(pageHeader->xlp_rem_len);
 	}
 	else
 	{
 		/* Wait for the record data to become available */
 		readOff = ReadPageInternal(state, targetPagePtr,
-								   Min(targetRecOff + total_len, XLOG_BLCKSZ));
+								   Min(targetRecOff + total_len, XLOG_BLCKSZ),
+								   !force);
 		if (readOff < 0)
 			goto err;
 
@@ -527,9 +814,9 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		if (!ValidXLogRecord(state, record, RecPtr))
 			goto err;
 
-		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+		state->NextRecPtr = RecPtr + MAXALIGN(total_len);
 
-		state->ReadRecPtr = RecPtr;
+		state->DecodeRecPtr = RecPtr;
 	}
 
 	/*
@@ -539,25 +826,55 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->segcxt.ws_segsize - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->segcxt.ws_segsize);
+		state->NextRecPtr += state->segcxt.ws_segsize - 1;
+		state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
 	}
 
-	if (DecodeXLogRecord(state, record, errormsg))
-		return record;
-	else
-		return NULL;
+	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
+	{
+		/* Record the location of the next record. */
+		decoded->next_lsn = state->NextRecPtr;
+
+		/*
+		 * If it's in the decode buffer, mark the decode buffer space as
+		 * occupied.
+		 */
+		if (!decoded->oversized)
+		{
+			/* The new decode buffer head must be MAXALIGNed. */
+			Assert(decoded->size == MAXALIGN(decoded->size));
+			if ((char *) decoded == state->decode_buffer)
+				state->decode_buffer_head = state->decode_buffer + decoded->size;
+			else
+				state->decode_buffer_head += decoded->size;
+		}
+
+		/* Insert it into the queue of decoded records. */
+		Assert(state->decode_queue_head != decoded);
+		if (state->decode_queue_head)
+			state->decode_queue_head->next = decoded;
+		state->decode_queue_head = decoded;
+		if (!state->decode_queue_tail)
+			state->decode_queue_tail = decoded;
+		return decoded;
+	}
 
 err:
+	if (decoded && decoded->oversized)
+		pfree(decoded);
 
 	/*
-	 * Invalidate the read state. We might read from a different source after
-	 * failure.
+	 * Invalidate the read state, if this was an error. We might read from a
+	 * different source after failure.
 	 */
-	XLogReaderInvalReadState(state);
+	if (readOff < 0 || state->errormsg_buf[0] != '\0')
+		XLogReaderInvalReadState(state);
 
-	if (state->errormsg_buf[0] != '\0')
-		*errormsg = state->errormsg_buf;
+	/*
+	 * If an error was written to errormsg_buf, it will be returned to the
+	 * caller of XLogReadRecord() after all successfully decoded records in
+	 * the read queue have been returned.
+	 */
 
 	return NULL;
 }
@@ -573,7 +890,8 @@ err:
  * data and if there hasn't been any error since caching the data.
  */
 static int
-ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
+ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen,
+				 bool nowait)
 {
 	int			readLen;
 	uint32		targetPageOff;
@@ -608,7 +926,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 		readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
 										   state->currRecPtr,
-										   state->readBuf);
+										   state->readBuf,
+										   nowait);
 		if (readLen < 0)
 			goto err;
 
@@ -626,7 +945,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 */
 	readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 									   state->currRecPtr,
-									   state->readBuf);
+									   state->readBuf,
+									   nowait);
 	if (readLen < 0)
 		goto err;
 
@@ -645,7 +965,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	{
 		readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
 										   state->currRecPtr,
-										   state->readBuf);
+										   state->readBuf,
+										   nowait);
 		if (readLen < 0)
 			goto err;
 	}
@@ -664,7 +985,11 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
-	XLogReaderInvalReadState(state);
+	if (state->errormsg_buf[0] != '\0')
+	{
+		state->errormsg_deferred = true;
+		XLogReaderInvalReadState(state);
+	}
 	return -1;
 }
 
@@ -974,7 +1299,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		targetPagePtr = tmpRecPtr - targetRecOff;
 
 		/* Read the page containing the record */
-		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff);
+		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff, false);
 		if (readLen < 0)
 			goto err;
 
@@ -983,7 +1308,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		pageHeaderSize = XLogPageHeaderSize(header);
 
 		/* make sure we have enough data for the page header */
-		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize);
+		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize, false);
 		if (readLen < 0)
 			goto err;
 
@@ -1147,34 +1472,83 @@ WALRead(XLogReaderState *state,
  * ----------------------------------------
  */
 
-/* private function to reset the state between records */
+/*
+ * Private function to reset the state, forgetting all decoded records, if we
+ * are asked to move to a new read position.
+ */
 static void
 ResetDecoder(XLogReaderState *state)
 {
-	int			block_id;
+	DecodedXLogRecord *r;
 
-	state->decoded_record = NULL;
-
-	state->main_data_len = 0;
-
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	/* Reset the decoded record queue, freeing any oversized records. */
+	while ((r = state->decode_queue_tail))
 	{
-		state->blocks[block_id].in_use = false;
-		state->blocks[block_id].has_image = false;
-		state->blocks[block_id].has_data = false;
-		state->blocks[block_id].apply_image = false;
+		state->decode_queue_tail = r->next;
+		if (r->oversized)
+			pfree(r);
 	}
-	state->max_block_id = -1;
+	state->decode_queue_head = NULL;
+	state->decode_queue_tail = NULL;
+	state->record = NULL;
+
+	/* Reset the decode buffer to empty. */
+	state->decode_buffer_head = state->decode_buffer;
+	state->decode_buffer_tail = state->decode_buffer;
+
+	/* Clear error state. */
+	state->errormsg_buf[0] = '\0';
+	state->errormsg_deferred = false;
 }
 
 /*
- * Decode the previously read record.
+ * Compute the maximum possible amount of space that could be required to
+ * decode a record, given xl_tot_len from the record's header.  This is the
+ * amount of output buffer space that we need to decode a record, though we
+ * might not finish up using it all.
+ *
+ * This computation is pessimistic and assumes the maximum possible number of
+ * blocks, due to lack of better information.
+ */
+size_t
+DecodeXLogRecordRequiredSpace(size_t xl_tot_len)
+{
+	size_t size = 0;
+
+	/* Account for the fixed size part of the decoded record struct. */
+	size += offsetof(DecodedXLogRecord, blocks[0]);
+	/* Account for the flexible blocks array of maximum possible size. */
+	size += sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1);
+	/* Account for all the raw main and block data. */
+	size += xl_tot_len;
+	/* We might insert padding before main_data. */
+	size += (MAXIMUM_ALIGNOF - 1);
+	/* We might insert padding before each block's data. */
+	size += (MAXIMUM_ALIGNOF - 1) * (XLR_MAX_BLOCK_ID + 1);
+	/* We might insert padding at the end. */
+	size += (MAXIMUM_ALIGNOF - 1);
+
+	return size;
+}
+
+/*
+ * Decode a record.  "decoded" must point to a MAXALIGNed memory area that has
+ * space for at least DecodeXLogRecordRequiredSpace(record->xl_tot_len) bytes.  On
+ * success, decoded->size contains the actual space occupied by the decoded
+ * record, which may turn out to be less.
+ *
+ * Only the 'oversized' member of 'decoded' must be initialized in advance; it
+ * will not be modified.  Other members will be initialized as required.
  *
  * On error, a human-readable error message is returned in *errormsg, and
  * the return value is false.
  */
 bool
-DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
+DecodeXLogRecord(XLogReaderState *state,
+				 DecodedXLogRecord *decoded,
+				 XLogRecord *record,
+				 XLogRecPtr lsn,
+				 char **errormsg)
 {
 	/*
 	 * read next _size bytes from record buffer, but check for overrun first.
@@ -1189,17 +1563,20 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	} while(0)
 
 	char	   *ptr;
+	char	   *out;
 	uint32		remaining;
 	uint32		datatotal;
 	RelFileNode *rnode = NULL;
 	uint8		block_id;
 
-	ResetDecoder(state);
-
-	state->decoded_record = record;
-	state->record_origin = InvalidRepOriginId;
-	state->toplevel_xid = InvalidTransactionId;
-
+	decoded->header = *record;
+	decoded->lsn = lsn;
+	decoded->next = NULL;
+	decoded->record_origin = InvalidRepOriginId;
+	decoded->toplevel_xid = InvalidTransactionId;
+	decoded->main_data = NULL;
+	decoded->main_data_len = 0;
+	decoded->max_block_id = -1;
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
 	remaining = record->xl_tot_len - SizeOfXLogRecord;
@@ -1217,7 +1594,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint8));
 
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
@@ -1228,18 +1605,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			uint32		main_data_len;
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint32));
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
 		}
 		else if (block_id == XLR_BLOCK_ID_ORIGIN)
 		{
-			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
+			COPY_HEADER_FIELD(&decoded->record_origin, sizeof(RepOriginId));
 		}
 		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
 		{
-			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+			COPY_HEADER_FIELD(&decoded->toplevel_xid, sizeof(TransactionId));
 		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
@@ -1247,7 +1624,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			DecodedBkpBlock *blk;
 			uint8		fork_flags;
 
-			if (block_id <= state->max_block_id)
+			/* mark any intervening block IDs as not in use */
+			for (int i = decoded->max_block_id + 1; i < block_id; ++i)
+				decoded->blocks[i].in_use = false;
+
+			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
 									  "out-of-order block_id %u at %X/%X",
@@ -1256,9 +1637,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 									  (uint32) state->ReadRecPtr);
 				goto err;
 			}
-			state->max_block_id = block_id;
+			decoded->max_block_id = block_id;
 
-			blk = &state->blocks[block_id];
+			blk = &decoded->blocks[block_id];
 			blk->in_use = true;
 			blk->apply_image = false;
 
@@ -1404,17 +1785,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	/*
 	 * Ok, we've parsed the fragment headers, and verified that the total
 	 * length of the payload in the fragments is equal to the amount of data
-	 * left. Copy the data of each fragment to a separate buffer.
-	 *
-	 * We could just set up pointers into readRecordBuf, but we want to align
-	 * the data for the convenience of the callers. Backup images are not
-	 * copied, however; they don't need alignment.
+	 * left.  Copy the data of each fragment to contiguous space after the
+	 * blocks array, inserting alignment padding before the data fragments so
+	 * they can be cast to struct pointers by REDO routines.
 	 */
+	out = ((char *) decoded) +
+		offsetof(DecodedXLogRecord, blocks) +
+		sizeof(decoded->blocks[0]) * (decoded->max_block_id + 1);
 
 	/* block data first */
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	for (block_id = 0; block_id <= decoded->max_block_id; block_id++)
 	{
-		DecodedBkpBlock *blk = &state->blocks[block_id];
+		DecodedBkpBlock *blk = &decoded->blocks[block_id];
 
 		if (!blk->in_use)
 			continue;
@@ -1423,58 +1805,37 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (blk->has_image)
 		{
-			blk->bkp_image = ptr;
+			/* no need to align image */
+			blk->bkp_image = out;
+			memcpy(out, ptr, blk->bimg_len);
 			ptr += blk->bimg_len;
+			out += blk->bimg_len;
 		}
 		if (blk->has_data)
 		{
-			if (!blk->data || blk->data_len > blk->data_bufsz)
-			{
-				if (blk->data)
-					pfree(blk->data);
-
-				/*
-				 * Force the initial request to be BLCKSZ so that we don't
-				 * waste time with lots of trips through this stanza as a
-				 * result of WAL compression.
-				 */
-				blk->data_bufsz = MAXALIGN(Max(blk->data_len, BLCKSZ));
-				blk->data = palloc(blk->data_bufsz);
-			}
+			out = (char *) MAXALIGN(out);
+			blk->data = out;
 			memcpy(blk->data, ptr, blk->data_len);
 			ptr += blk->data_len;
+			out += blk->data_len;
 		}
 	}
 
 	/* and finally, the main data */
-	if (state->main_data_len > 0)
+	if (decoded->main_data_len > 0)
 	{
-		if (!state->main_data || state->main_data_len > state->main_data_bufsz)
-		{
-			if (state->main_data)
-				pfree(state->main_data);
-
-			/*
-			 * main_data_bufsz must be MAXALIGN'ed.  In many xlog record
-			 * types, we omit trailing struct padding on-disk to save a few
-			 * bytes; but compilers may generate accesses to the xlog struct
-			 * that assume that padding bytes are present.  If the palloc
-			 * request is not large enough to include such padding bytes then
-			 * we'll get valgrind complaints due to otherwise-harmless fetches
-			 * of the padding bytes.
-			 *
-			 * In addition, force the initial request to be reasonably large
-			 * so that we don't waste time with lots of trips through this
-			 * stanza.  BLCKSZ / 2 seems like a good compromise choice.
-			 */
-			state->main_data_bufsz = MAXALIGN(Max(state->main_data_len,
-												  BLCKSZ / 2));
-			state->main_data = palloc(state->main_data_bufsz);
-		}
-		memcpy(state->main_data, ptr, state->main_data_len);
-		ptr += state->main_data_len;
+		out = (char *) MAXALIGN(out);
+		decoded->main_data = out;
+		memcpy(decoded->main_data, ptr, decoded->main_data_len);
+		ptr += decoded->main_data_len;
+		out += decoded->main_data_len;
 	}
 
+	/* Report the actual size we used. */
+	decoded->size = MAXALIGN(out - (char *) decoded);
+	Assert(DecodeXLogRecordRequiredSpace(record->xl_tot_len) >=
+		   decoded->size);
+
 	return true;
 
 shortdata_err:
@@ -1500,10 +1861,11 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	if (rnode)
 		*rnode = bkpb->rnode;
 	if (forknum)
@@ -1523,10 +1885,11 @@ XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return NULL;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 
 	if (!bkpb->has_data)
 	{
@@ -1554,12 +1917,13 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	char	   *ptr;
 	PGAlignedBlock tmp;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
-	if (!record->blocks[block_id].has_image)
+	if (!record->record->blocks[block_id].has_image)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	ptr = bkpb->bkp_image;
 
 	if (bkpb->bimg_info & BKPIMAGE_IS_COMPRESSED)
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 7e915bcadf..db0c801456 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -351,7 +351,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * going to initialize it. And vice versa.
 	 */
 	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
-	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
+	willinit = (record->record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
 	if (!willinit && zeromode)
@@ -829,7 +829,8 @@ wal_segment_close(XLogReaderState *state)
  */
 int
 read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
+					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
+					 bool nowait)
 {
 	XLogRecPtr	read_upto,
 				loc;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e76e627c6b..083174f692 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4009,6 +4009,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_RECOVERY_PAUSE:
 			event_name = "RecoveryPause";
 			break;
+		case WAIT_EVENT_RECOVERY_WAL_FLUSH:
+			event_name = "RecoveryWalFlush";
+			break;
 		case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
 			event_name = "ReplicationOriginDrop";
 			break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee99b8..4bc22deddb 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -111,7 +111,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 	{
 		ReorderBufferAssignChild(ctx->reorder,
 								 txid,
-								 record->decoded_record->xl_xid,
+								 XLogRecGetXid(record),
 								 buf.origptr);
 	}
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5d1b1a16be..86e10c7316 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -812,7 +812,7 @@ StartReplication(StartReplicationCmd *cmd)
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-					   XLogRecPtr targetRecPtr, char *cur_page)
+					   XLogRecPtr targetRecPtr, char *cur_page, bool nowait)
 {
 	XLogRecPtr	flushptr;
 	int			count;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index eae1797f94..39797488d3 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -49,7 +49,8 @@ typedef struct XLogPageReadPrivate
 
 static int	SimpleXLogPageRead(XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
-							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
+							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+							   bool nowait);
 
 /*
  * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline
@@ -239,7 +240,8 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 /* XLogReader callback function, to read a WAL page */
 static int
 SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
+				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+				   bool nowait)
 {
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 	uint32		targetPageOff;
@@ -423,7 +425,7 @@ extractPageInfo(XLogReaderState *record)
 				 RmgrNames[rmid], info);
 	}
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 31e99c2a6d..7259559036 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -333,7 +333,7 @@ WALDumpCloseSegment(XLogReaderState *state)
 /* pg_waldump's XLogReaderRoutine->page_read callback */
 static int
 WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-				XLogRecPtr targetPtr, char *readBuff)
+				XLogRecPtr targetPtr, char *readBuff, bool nowait)
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
@@ -392,10 +392,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
 	 * add an accessor macro for this.
 	 */
 	*fpi_len = 0;
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (XLogRecHasBlockImage(record, block_id))
-			*fpi_len += record->blocks[block_id].bimg_len;
+			*fpi_len += record->record->blocks[block_id].bimg_len;
 	}
 
 	/*
@@ -484,7 +484,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	if (!config->bkp_details)
 	{
 		/* print block references (short format) */
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -515,7 +515,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	{
 		/* print block references (detailed format) */
 		putchar('\n');
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -528,26 +528,26 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				   blk);
 			if (XLogRecHasBlockImage(record, block_id))
 			{
-				if (record->blocks[block_id].bimg_info &
+				if (record->record->blocks[block_id].bimg_info &
 					BKPIMAGE_IS_COMPRESSED)
 				{
 					printf(" (FPW%s); hole: offset: %u, length: %u, "
 						   "compression saved: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length,
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length,
 						   BLCKSZ -
-						   record->blocks[block_id].hole_length -
-						   record->blocks[block_id].bimg_len);
+						   record->record->blocks[block_id].hole_length -
+						   record->record->blocks[block_id].bimg_len);
 				}
 				else
 				{
 					printf(" (FPW%s); hole: offset: %u, length: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length);
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length);
 				}
 			}
 			putchar('\n');
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 0b6d00dd7d..44f8847030 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -62,7 +62,8 @@ typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
 							   int reqLen,
 							   XLogRecPtr targetRecPtr,
-							   char *readBuf);
+							   char *readBuf,
+							   bool nowait);
 typedef void (*WALSegmentOpenCB) (XLogReaderState *xlogreader,
 								  XLogSegNo nextSegNo,
 								  TimeLineID *tli_p);
@@ -144,6 +145,30 @@ typedef struct
 	uint16		data_bufsz;
 } DecodedBkpBlock;
 
+/*
+ * The decoded contents of a record.  This occupies a contiguous region of
+ * memory, with main_data and blocks[n].data pointing to memory after the
+ * members declared here.
+ */
+typedef struct DecodedXLogRecord
+{
+	/* Private member used for resource management. */
+	size_t		size;			/* total size of decoded record */
+	bool		oversized;		/* outside the regular decode buffer? */
+	struct DecodedXLogRecord *next;	/* decoded record queue  link */
+
+	/* Public members. */
+	XLogRecPtr	lsn;			/* location */
+	XLogRecPtr	next_lsn;		/* location of next record */
+	XLogRecord	header;			/* header */
+	RepOriginId record_origin;
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+	char	   *main_data;		/* record's main data portion */
+	uint32		main_data_len;	/* main data portion's length */
+	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	DecodedBkpBlock blocks[FLEXIBLE_ARRAY_MEMBER];
+} DecodedXLogRecord;
+
 struct XLogReaderState
 {
 	/*
@@ -168,35 +193,25 @@ struct XLogReaderState
 	void	   *private_data;
 
 	/*
-	 * Start and end point of last record read.  EndRecPtr is also used as the
-	 * position to read next.  Calling XLogBeginRead() sets EndRecPtr to the
-	 * starting position and ReadRecPtr to invalid.
+	 * Start and end point of last record returned by XLogReadRecord().
+	 *
+	 * XXX These are also available as record->lsn and record->next_lsn,
+	 * but since these were part of the public interface...
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
 
-
-	/* ----------------------------------------
-	 * Decoded representation of current record
-	 *
-	 * Use XLogRecGet* functions to investigate the record; these fields
-	 * should not be accessed directly.
-	 * ----------------------------------------
+	/*
+	 * Start and end point of the last record read and decoded by
+	 * XLogReadRecordInternal().  NextRecPtr is also used as the position to
+	 * decode next.  Calling XLogBeginRead() sets NextRecPtr and EndRecPtr to
+	 * the requested starting position.
 	 */
-	XLogRecord *decoded_record; /* currently decoded record */
-
-	char	   *main_data;		/* record's main data portion */
-	uint32		main_data_len;	/* main data portion's length */
-	uint32		main_data_bufsz;	/* allocated size of the buffer */
-
-	RepOriginId record_origin;
+	XLogRecPtr	DecodeRecPtr;	/* start of last record decoded */
+	XLogRecPtr	NextRecPtr;		/* end+1 of last record decoded */
 
-	TransactionId toplevel_xid; /* XID of top-level transaction */
-
-	/* information about blocks referenced by the record. */
-	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
-
-	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	/* Last record returned by XLogReadRecord. */
+	DecodedXLogRecord *record;
 
 	/* ----------------------------------------
 	 * private/internal state
@@ -210,6 +225,26 @@ struct XLogReaderState
 	char	   *readBuf;
 	uint32		readLen;
 
+	/*
+	 * Buffer for decoded records.  This is a circular buffer, though
+	 * individual records can't be split in the middle, so some space is often
+	 * wasted at the end.  Oversized records that don't fit in this space are
+	 * allocated separately.
+	 */
+	char	   *decode_buffer;
+	size_t		decode_buffer_size;
+	bool		free_decode_buffer;		/* need to free? */
+	char	   *decode_buffer_head;		/* write head */
+	char	   *decode_buffer_tail;		/* read head */
+
+	/*
+	 * Queue of records that have been decoded.  This is a linked list that
+	 * usually consists of consecutive records in decode_buffer, but may also
+	 * contain oversized records allocated with palloc().
+	 */
+	DecodedXLogRecord *decode_queue_head;	/* newest decoded record */
+	DecodedXLogRecord *decode_queue_tail;	/* oldest decoded record */
+
 	/* last read XLOG position for data currently in readBuf */
 	WALSegmentContext segcxt;
 	WALOpenSegment seg;
@@ -252,6 +287,7 @@ struct XLogReaderState
 
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
+	bool		errormsg_deferred;
 };
 
 /* Get a new XLogReader */
@@ -264,6 +300,11 @@ extern XLogReaderRoutine *LocalXLogReaderRoutine(void);
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
+/* Optionally provide a circular decoding buffer to allow readahead. */
+extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
+									  void *buffer,
+									  size_t size);
+
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 #ifdef FRONTEND
@@ -274,6 +315,10 @@ extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
 										 char **errormsg);
 
+/* Try to read ahead, if there is space in the decoding buffer. */
+extern DecodedXLogRecord *XLogReadAhead(XLogReaderState *state,
+										char **errormsg);
+
 /* Validate a page */
 extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 										 XLogRecPtr recptr, char *phdr);
@@ -297,25 +342,31 @@ extern bool WALRead(XLogReaderState *state,
 
 /* Functions for decoding an XLogRecord */
 
-extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
+extern size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+extern bool DecodeXLogRecord(XLogReaderState *state,
+							 DecodedXLogRecord *decoded,
+							 XLogRecord *record,
+							 XLogRecPtr lsn,
 							 char **errmsg);
 
-#define XLogRecGetTotalLen(decoder) ((decoder)->decoded_record->xl_tot_len)
-#define XLogRecGetPrev(decoder) ((decoder)->decoded_record->xl_prev)
-#define XLogRecGetInfo(decoder) ((decoder)->decoded_record->xl_info)
-#define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
-#define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
-#define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
-#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
-#define XLogRecGetData(decoder) ((decoder)->main_data)
-#define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
-#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
+#define XLogRecGetTotalLen(decoder) ((decoder)->record->header.xl_tot_len)
+#define XLogRecGetPrev(decoder) ((decoder)->record->header.xl_prev)
+#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
+#define XLogRecGetRmid(decoder) ((decoder)->record->header.xl_rmid)
+#define XLogRecGetXid(decoder) ((decoder)->record->header.xl_xid)
+#define XLogRecGetOrigin(decoder) ((decoder)->record->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->record->toplevel_xid)
+#define XLogRecGetData(decoder) ((decoder)->record->main_data)
+#define XLogRecGetDataLen(decoder) ((decoder)->record->main_data_len)
+#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->record->max_block_id >= 0)
+#define XLogRecMaxBlockId(decoder) ((decoder)->record->max_block_id)
+#define XLogRecGetBlock(decoder, i) (&(decoder)->record->blocks[(i)])
 #define XLogRecHasBlockRef(decoder, block_id) \
-	((decoder)->blocks[block_id].in_use)
+	((decoder)->record->blocks[block_id].in_use)
 #define XLogRecHasBlockImage(decoder, block_id) \
-	((decoder)->blocks[block_id].has_image)
+	((decoder)->record->blocks[block_id].has_image)
 #define XLogRecBlockImageApply(decoder, block_id) \
-	((decoder)->blocks[block_id].apply_image)
+	((decoder)->record->blocks[block_id].apply_image)
 
 #ifndef FRONTEND
 extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index e59b6cf3a9..374c1b16ce 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -49,7 +49,8 @@ extern void FreeFakeRelcacheEntry(Relation fakerel);
 
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
-								 XLogRecPtr targetRecPtr, char *cur_page);
+								 XLogRecPtr targetRecPtr, char *cur_page,
+								 bool nowait);
 extern void wal_segment_open(XLogReaderState *state,
 							 XLogSegNo nextSegNo,
 							 TimeLineID *tli_p);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 257e515bfe..c5f763dd44 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -954,6 +954,7 @@ typedef enum
 	WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
 	WAIT_EVENT_RECOVERY_CONFLICT_TABLESPACE,
 	WAIT_EVENT_RECOVERY_PAUSE,
+	WAIT_EVENT_RECOVERY_WAL_FLUSH,
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-- 
2.20.1

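Aside for readers skimming the xlogreader changes above: here is a rough sketch, not part of any patch, of how the new two-cursor interface is intended to be driven.  XLogReadAhead() decodes records into the queue without blocking, which gives the caller a chance to start prefetching the referenced blocks, and XLogReadRecord() later hands back the same records, in order, for replay.  Assume "reader" is an XLogReaderState set up in the usual way; prefetch_block() is a hypothetical stand-in for the buffer-prefetch call used by the later patches.

	for (;;)
	{
		DecodedXLogRecord *ahead;
		XLogRecord *record;
		char	   *errormsg;

		/* Decode ahead as far as the decode buffer and available WAL allow. */
		while ((ahead = XLogReadAhead(reader, &errormsg)) != NULL)
		{
			for (int block_id = 0; block_id <= ahead->max_block_id; block_id++)
			{
				DecodedBkpBlock *blk = &ahead->blocks[block_id];

				/* Blocks with full-page images won't be read at redo time. */
				if (blk->in_use && !blk->has_image)
					prefetch_block(blk->rnode, blk->forknum, blk->blkno);	/* hypothetical */
			}
		}

		/* Consume the oldest decoded record, waiting for more WAL if needed. */
		record = XLogReadRecord(reader, &errormsg);
		if (record == NULL)
			break;

		/* ... redo the record as usual ... */
	}

The prefetching patch below is more careful than this naive loop (it accounts for read-ahead distance and I/O depth), but the division of labour between the two functions is the same.
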
Attachment: v14-0004-Prefetch-referenced-blocks-during-recovery.patch (text/x-patch)
From fd7f3ef96825537cc685072a99ffc0b4c90776e8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Apr 2020 23:04:51 +1200
Subject: [PATCH v14 4/6] Prefetch referenced blocks during recovery.

Introduce a new GUC recovery_prefetch.  If it is enabled (the default),
then read ahead in the WAL and try to initiate asynchronous reading of
referenced blocks that will soon be needed but are not yet cached in our
buffer pool.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size is used to limit the maximum distance we
are prepared to read ahead in the WAL to find uncached blocks.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
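
As a rough illustration (not part of the patch), the new settings and the existing maintenance_io_concurrency GUC would be combined like this; the names and defaults below are taken from this patch, except maintenance_io_concurrency, which is the pre-existing setting:

    # illustrative postgresql.conf fragment
    recovery_prefetch = on            # default where posix_fadvise is available
    recovery_prefetch_fpw = off       # default
    wal_decode_buffer_size = 512kB    # how far ahead to look in the WAL
    maintenance_io_concurrency = 10   # caps concurrent prefetch I/Os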
---
 doc/src/sgml/config.sgml                      |  58 ++
 doc/src/sgml/monitoring.sgml                  |  86 +-
 doc/src/sgml/wal.sgml                         |  17 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             |  22 +-
 src/backend/access/transam/xlogprefetch.c     | 895 ++++++++++++++++++
 src/backend/access/transam/xlogreader.c       |   2 +
 src/backend/catalog/system_views.sql          |  14 +
 src/backend/postmaster/pgstat.c               | 103 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  56 +-
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/access/xlog.h                     |   1 +
 src/include/access/xlogprefetch.h             |  79 ++
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/pgstat.h                          |  26 +
 src/include/utils/guc.h                       |   4 +
 src/test/regress/expected/rules.out           |  11 +
 18 files changed, 1387 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetch.c
 create mode 100644 src/include/access/xlogprefetch.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a632cf98ba..04ae57770c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3344,6 +3344,64 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-recovery-prefetch" xreflabel="recovery_prefetch">
+      <term><varname>recovery_prefetch</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to try to prefetch blocks that are referenced in the WAL but
+        are not yet in the buffer pool, during recovery.  Prefetching blocks
+        that will soon be needed can reduce I/O wait times.
+        See also the <xref linkend="guc-wal-decode-buffer-size"/> and
+        <xref linkend="guc-maintenance-io-concurrency"/> settings, which limit
+        prefetching activity.
+        This setting is enabled
+        by default on systems that support <function>posix_fadvise</function>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+      <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks that were logged with full page images,
+        during recovery.  Often this doesn't help, since such blocks will not
+        be read the first time they are needed and might remain in the buffer
+        pool after that.  However, on file systems with a block size larger
+        than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a
+        costly read-before-write when blocks are later written.
+        The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-decode-buffer-size" xreflabel="wal_decode_buffer_size">
+      <term><varname>wal_decode_buffer_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_decode_buffer_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        A limit on how far ahead the server can look in the WAL, to find
+        blocks to prefetch.  Setting it too high might be counterproductive,
+        if it means that data falls out of the
+        kernel cache before it is needed.  If this value is specified without
+        units, it is taken as bytes.
+        The default is 512kB.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 98e1995453..c10e30ec91 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -332,6 +332,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2878,6 +2885,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    copy of the subscribed tables.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+    </row>
+    <row>
+     <entry><structfield>skip_seq</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+    </row>
+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_distance</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>How far ahead of recovery the prefetcher is on average, while recovery is not idle</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_queue_depth</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>Average number of prefetches in flight while recovery is not idle</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-recovery-prefetch"/>,
+   <xref linkend="guc-recovery-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
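+
+  <para>
+   For example, prefetching activity during recovery might be inspected
+   with a query along these lines:
+<programlisting>
+SELECT prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth
+  FROM pg_stat_prefetch_recovery;
+</programlisting>
+  </para>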
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="1">
@@ -4859,8 +4938,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
         view, <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
-        to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+        the <structname>pg_stat_archiver</structname> view,
+        <literal>wal</literal> to reset all the counters shown in the
+        <structname>pg_stat_wal</structname> view or
+        <literal>prefetch_recovery</literal> to reset all the counters shown
+        in the <structname>pg_stat_prefetch_recovery</structname> view.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index d1c3893b14..c51c431398 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -720,6 +720,23 @@
    <acronym>WAL</acronym> call being logged to the server log. This
    option might be replaced by a more general mechanism in the future.
   </para>
+
+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.  The
+   prefetching mechanism is most likely to be effective on systems
+   with <varname>full_page_writes</varname> set to
+   <literal>off</literal> (where that is safe), and where the working
+   set is larger than RAM.  By default, prefetching in recovery is enabled
+   on operating systems that have <function>posix_fadvise</function>
+   support.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..39f9d4e77d 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetch.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 49d8172405..d0ee4235fb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetch.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -109,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+int			wal_decode_buffer_size = 512 * 1024;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -3686,7 +3688,6 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
 					 xlogfname);
 			set_ps_display(activitymsg);
-
 			restoredFromArchive = RestoreArchivedFile(path, xlogfname,
 													  "RECOVERYXLOG",
 													  wal_segment_size,
@@ -6528,6 +6529,12 @@ StartupXLOG(void)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Set the WAL decode buffer size.  This limits how far ahead we can read
+	 * in the WAL.
+	 */
+	XLogReaderSetDecodeBuffer(xlogreader, NULL, wal_decode_buffer_size);
+
 	/*
 	 * Allocate two page buffers dedicated to WAL consistency checks.  We do
 	 * it this way, rather than just making static arrays, for two reasons:
@@ -7205,6 +7212,7 @@ StartupXLOG(void)
 		{
 			ErrorContextCallback errcallback;
 			TimestampTz xtime;
+			XLogPrefetchState prefetch;
 			PGRUsage	ru0;
 
 			pg_rusage_init(&ru0);
@@ -7215,6 +7223,9 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
 
+			/* Prepare to prefetch, if configured. */
+			XLogPrefetchBegin(&prefetch, xlogreader);
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7244,6 +7255,9 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/* Perform WAL prefetching, if enabled. */
+				XLogPrefetch(&prefetch, xlogreader->ReadRecPtr);
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -7415,6 +7429,9 @@ StartupXLOG(void)
 					 */
 					if (AllowCascadeReplication())
 						WalSndWakeup();
+
+					/* Reset the prefetcher. */
+					XLogPrefetchReconfigure();
 				}
 
 				/* Exit loop if we reached inclusive recovery target */
@@ -7431,6 +7448,7 @@ StartupXLOG(void)
 			/*
 			 * end of main redo apply loop
 			 */
+			XLogPrefetchEnd(&prefetch);
 
 			if (reachedRecoveryTarget)
 			{
@@ -12209,6 +12227,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					currentSource = XLOG_FROM_STREAM;
 					startWalReceiver = true;
+					XLogPrefetchReconfigure();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -12473,6 +12492,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						else
 							havedata = false;
 					}
+
 					if (havedata)
 					{
 						/*
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
new file mode 100644
index 0000000000..a8149b946c
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -0,0 +1,895 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetch.c
+ *
+ * The goal of this module is to read future WAL records and issue
+ * PrefetchSharedBuffer() calls for referenced blocks, so that we avoid I/O
+ * stalls in the main recovery loop.
+ *
+ * When examining a WAL record from the future, we need to consider that a
+ * referenced block or segment file might not exist on disk until this record
+ * or some earlier record has been replayed.  After a crash, a file might also
+ * be missing because it was dropped by a later WAL record; in that case, it
+ * will be recreated when this record is replayed.  These cases are handled by
+ * recognizing them and adding a "filter" that prevents all prefetching of a
+ * certain block range until the present WAL record has been replayed.  Blocks
+ * skipped for these reasons are counted as "skip_new" (that is, cases where we
+ * didn't try to prefetch "new" blocks).
+ *
+ * Blocks found in the buffer pool already are counted as "skip_hit".
+ * Repeated access to the same buffer is detected and skipped, and this is
+ * counted with "skip_seq".  Blocks that were logged with FPWs are skipped if
+ * recovery_prefetch_fpw is off, since on most systems there will be no I/O
+ * stall; this is counted with "skip_fpw".
+ *
+ * The only way we currently have to know that an I/O initiated with
+ * PrefetchSharedBuffer() has completed is that recovery will eventually
+ * call ReadBuffer() and perform a synchronous read.  Therefore, we track
+ * the number of
+ * potentially in-flight I/Os by using a circular buffer of LSNs.  When it's
+ * full, we have to wait for recovery to replay records so that the queue
+ * depth can be reduced, before we can do any more prefetching.  Ideally, this
+ * keeps us the right distance ahead to respect maintenance_io_concurrency.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetch.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log
+ * message that appears at the end of crash recovery.  It's also used to send
+ * messages periodically to the stats collector, to save the counters on disk.
+ */
+#define XLOGPREFETCHER_SAMPLE_DISTANCE 0x40000
+
+/* GUCs */
+bool		recovery_prefetch = true;
+bool		recovery_prefetch_fpw = false;
+
+int			XLogPrefetchReconfigureCount;
+
+/*
+ * A prefetcher object.  There is at most one of these in existence at a time,
+ * recreated whenever there is a configuration change.
+ */
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	DecodedXLogRecord *record;
+	int				next_block_id;
+	bool			shutdown;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation	last_reln;
+	RelFileNode		last_rnode;
+	BlockNumber		last_blkno;
+
+	/* Online averages. */
+	uint64			samples;
+	double			avg_queue_depth;
+	double			avg_distance;
+	XLogRecPtr		next_sample_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB		   *filter_table;
+	dlist_head		filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	int				prefetch_head;
+	int				prefetch_tail;
+	int				prefetch_queue_size;
+	XLogRecPtr		prefetch_queue[MAX_IO_CONCURRENCY + 1];
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode		rnode;
+	XLogRecPtr		filter_until_replayed;
+	BlockNumber		filter_from_block;
+	dlist_node		link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time; /* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 skip_hit;	/* Blocks already buffered. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+	pg_atomic_uint64 skip_seq;	/* Repeat blocks skipped. */
+	float			avg_distance;
+	float			avg_queue_depth;
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32			reset_handled;
+
+	/* Dynamic values */
+	int				distance;	/* Number of bytes ahead in the WAL. */
+	int				queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+static void XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher,
+									  XLogRecPtr replaying_lsn);
+static bool XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher);
+static void XLogPrefetchSaveStats(void);
+static void XLogPrefetchRestoreStats(void);
+
+static XLogPrefetchStats *Stats;
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&Stats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&Stats->prefetch, 0);
+	pg_atomic_write_u64(&Stats->skip_hit, 0);
+	pg_atomic_write_u64(&Stats->skip_new, 0);
+	pg_atomic_write_u64(&Stats->skip_fpw, 0);
+	pg_atomic_write_u64(&Stats->skip_seq, 0);
+	Stats->avg_distance = 0;
+	Stats->avg_queue_depth = 0;
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	Stats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+	if (!found)
+	{
+		pg_atomic_init_u32(&Stats->reset_request, 0);
+		Stats->reset_handled = 0;
+		pg_atomic_init_u64(&Stats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&Stats->prefetch, 0);
+		pg_atomic_init_u64(&Stats->skip_hit, 0);
+		pg_atomic_init_u64(&Stats->skip_new, 0);
+		pg_atomic_init_u64(&Stats->skip_fpw, 0);
+		pg_atomic_init_u64(&Stats->skip_seq, 0);
+		Stats->avg_distance = 0;
+		Stats->avg_queue_depth = 0;
+		Stats->distance = 0;
+		Stats->queue_depth = 0;
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&Stats->reset_request, 1);
+}
+
+/*
+ * Tell the stats collector to serialize the shared memory counters into the
+ * stats file.
+ */
+static void
+XLogPrefetchSaveStats(void)
+{
+	PgStat_RecoveryPrefetchStats serialized = {
+		.prefetch = pg_atomic_read_u64(&Stats->prefetch),
+		.skip_hit = pg_atomic_read_u64(&Stats->skip_hit),
+		.skip_new = pg_atomic_read_u64(&Stats->skip_new),
+		.skip_fpw = pg_atomic_read_u64(&Stats->skip_fpw),
+		.skip_seq = pg_atomic_read_u64(&Stats->skip_seq),
+		.stat_reset_timestamp = pg_atomic_read_u64(&Stats->reset_time)
+	};
+
+	pgstat_send_recoveryprefetch(&serialized);
+}
+
+/*
+ * Try to restore the shared memory counters from the stats file.
+ */
+static void
+XLogPrefetchRestoreStats(void)
+{
+	PgStat_RecoveryPrefetchStats *serialized = pgstat_fetch_recoveryprefetch();
+
+	if (serialized->stat_reset_timestamp != 0)
+	{
+		pg_atomic_write_u64(&Stats->prefetch, serialized->prefetch);
+		pg_atomic_write_u64(&Stats->skip_hit, serialized->skip_hit);
+		pg_atomic_write_u64(&Stats->skip_new, serialized->skip_new);
+		pg_atomic_write_u64(&Stats->skip_fpw, serialized->skip_fpw);
+		pg_atomic_write_u64(&Stats->skip_seq, serialized->skip_seq);
+		pg_atomic_write_u64(&Stats->reset_time, serialized->stat_reset_timestamp);
+	}
+}
+
+/*
+ * Initialize an XLogPrefetchState object and restore the last saved
+ * statistics from disk.
+ */
+void
+XLogPrefetchBegin(XLogPrefetchState *state, XLogReaderState *reader)
+{
+	XLogPrefetchRestoreStats();
+
+	/* We'll reconfigure on the first call to XLogPrefetch(). */
+	state->reader = reader;
+	state->prefetcher = NULL;
+	state->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+}
+
+/*
+ * Shut down the prefetching infrastructure, if configured.
+ */
+void
+XLogPrefetchEnd(XLogPrefetchState *state)
+{
+	XLogPrefetchSaveStats();
+
+	if (state->prefetcher)
+		XLogPrefetcherFree(state->prefetcher);
+	state->prefetcher = NULL;
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogReaderState *reader)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	/*
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.  The queue has space for up
+	 * to the highest possible value of the GUC + 1, because our circular buffer
+	 * has a gap between head and tail when full.
+	 */
+	prefetcher = palloc0(sizeof(XLogPrefetcher));
+	prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
+	prefetcher->reader = reader;
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	/* Log final statistics. */
+	ereport(LOG,
+			(errmsg("recovery finished prefetching at %X/%X; "
+					"prefetch = " UINT64_FORMAT ", "
+					"skip_hit = " UINT64_FORMAT ", "
+					"skip_new = " UINT64_FORMAT ", "
+					"skip_fpw = " UINT64_FORMAT ", "
+					"skip_seq = " UINT64_FORMAT ", "
+					"avg_distance = %f, "
+					"avg_queue_depth = %f",
+			 (uint32) (prefetcher->reader->EndRecPtr >> 32),
+			 (uint32) (prefetcher->reader->EndRecPtr),
+			 pg_atomic_read_u64(&Stats->prefetch),
+			 pg_atomic_read_u64(&Stats->skip_hit),
+			 pg_atomic_read_u64(&Stats->skip_new),
+			 pg_atomic_read_u64(&Stats->skip_fpw),
+			 pg_atomic_read_u64(&Stats->skip_seq),
+			 Stats->avg_distance,
+			 Stats->avg_queue_depth)));
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Called when recovery is replaying a new LSN, to check if we can read ahead.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	uint32		reset_request;
+
+	/* If an error has occurred or we've hit the end of the WAL, do nothing. */
+	if (prefetcher->shutdown)
+		return;
+
+	/*
+	 * Have any in-flight prefetches definitely completed, judging by the LSN
+	 * that is currently being replayed?
+	 */
+	XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+	/*
+	 * Do we already have the maximum permitted number of I/Os running
+	 * (according to the information we have)?  If so, we have to wait for at
+	 * least one to complete, so give up early and let recovery catch up.
+	 */
+	if (XLogPrefetcherSaturated(prefetcher))
+		return;
+
+	/*
+	 * Can we drop any filters yet?  This happens when the LSN that is
+	 * currently being replayed has moved past a record that prevents
+	 * prefetching of a block range, such as relation extension.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&Stats->reset_request);
+	if (reset_request != Stats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		Stats->reset_handled = reset_request;
+		prefetcher->avg_distance = 0;
+		prefetcher->avg_queue_depth = 0;
+		prefetcher->samples = 0;
+	}
+
+	/* OK, we can now try reading ahead. */
+	XLogPrefetcherScanRecords(prefetcher, replaying_lsn);
+}
+
+/*
+ * Read ahead as far as we are allowed to, considering the LSN that recovery
+ * is currently replaying.
+ */
+static void
+XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	XLogReaderState *reader = prefetcher->reader;
+	DecodedXLogRecord *record;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	for (;;)
+	{
+		char *error;
+		int64		distance;
+
+		/* If we don't already have a record, then try to read one. */
+		if (prefetcher->record == NULL)
+		{
+			record = XLogReadAhead(reader, &error);
+			if (record == NULL)
+			{
+				/* If we got an error, log it and give up. */
+				if (error)
+				{
+					ereport(LOG, (errmsg("recovery no longer prefetching: %s", error)));
+					prefetcher->shutdown = true;
+					Stats->queue_depth = 0;
+					Stats->distance = 0;
+				}
+				/* Otherwise, we'll try again later when more data is here. */
+				return;
+			}
+			prefetcher->record = record;
+			prefetcher->next_block_id = 0;
+		}
+		else
+		{
+			/*
+			 * We ran out of I/O queue while part way through a record.  We'll
+			 * carry on where we left off, according to next_block_id.
+			 */
+			record = prefetcher->record;
+		}
+
+		/* How far ahead of replay are we now? */
+		distance = record->lsn - replaying_lsn;
+
+		/* Update distance shown in shm. */
+		Stats->distance = distance;
+
+		/* Periodically recompute some statistics. */
+		if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+		{
+			/* Compute online averages. */
+			prefetcher->samples++;
+			if (prefetcher->samples == 1)
+			{
+				prefetcher->avg_distance = Stats->distance;
+				prefetcher->avg_queue_depth = Stats->queue_depth;
+			}
+			else
+			{
+				prefetcher->avg_distance +=
+					(Stats->distance - prefetcher->avg_distance) /
+					prefetcher->samples;
+				prefetcher->avg_queue_depth +=
+					(Stats->queue_depth - prefetcher->avg_queue_depth) /
+					prefetcher->samples;
+			}
+
+			/* Expose it in shared memory. */
+			Stats->avg_distance = prefetcher->avg_distance;
+			Stats->avg_queue_depth = prefetcher->avg_queue_depth;
+
+			/* Also periodically save the simple counters. */
+			XLogPrefetchSaveStats();
+
+			prefetcher->next_sample_lsn =
+				replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
+		}
+
+		/* Are we not far enough ahead? */
+		if (distance <= 0)
+		{
+			/* XXX Is this still possible? */
+			prefetcher->record = NULL;		/* skip this record */
+			continue;
+		}
+
+		/*
+		 * If this is a record that creates a new SMGR relation, we'll avoid
+		 * prefetching anything from that rnode until it has been replayed.
+		 */
+		if (replaying_lsn < record->lsn &&
+			record->header.xl_rmid == RM_SMGR_ID &&
+			(record->header.xl_info & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+		{
+			xl_smgr_create *xlrec = (xl_smgr_create *) record->main_data;
+
+			XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0, record->lsn);
+		}
+
+		/* Scan the record's block references. */
+		if (!XLogPrefetcherScanBlocks(prefetcher))
+			return;
+
+		/* Advance to the next record. */
+		prefetcher->record = NULL;
+	}
+}
+
+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+	DecodedXLogRecord *record = prefetcher->record;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	/*
+	 * We might already have been partway through processing this record when
+	 * our queue became saturated, so we need to start where we left off.
+	 */
+	for (int block_id = prefetcher->next_block_id;
+		 block_id <= record->max_block_id;
+		 ++block_id)
+	{
+		DecodedBkpBlock *block = &record->blocks[block_id];
+		PrefetchBufferResult prefetch;
+		SMgrRelation reln;
+
+		/* Ignore everything but the main fork for now. */
+		if (block->forknum != MAIN_FORKNUM)
+			continue;
+
+		/*
+		 * If there is a full page image attached, we won't be reading the
+		 * page, so you might think we should skip it.  However, if the
+		 * underlying filesystem uses larger logical blocks than us, it
+		 * might still need to perform a read-before-write some time later.
+		 * Therefore, only prefetch if configured to do so.
+		 */
+		if (block->has_image && !recovery_prefetch_fpw)
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
+			continue;
+		}
+
+		/*
+		 * If this block will initialize a new page then it's probably a
+		 * relation extension.  Since that might create a new segment, we
+		 * can't try to prefetch this block until the record has been
+		 * replayed, or we might try to open a file that doesn't exist yet.
+		 */
+		if (block->flags & BKPBLOCK_WILL_INIT)
+		{
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+									record->lsn);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Should we skip this block due to a filter? */
+		if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Fast path for repeated references to the same relation. */
+		if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+		{
+			/*
+			 * If this is a repeat access to the same block, then skip it.
+			 *
+			 * XXX We could also check for last_blkno + 1 too, and also update
+			 * last_blkno; it's not clear if the kernel would do a better job
+			 * of sequential prefetching.
+			 */
+			if (block->blkno == prefetcher->last_blkno)
+			{
+				pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
+				continue;
+			}
+
+			/* We can avoid calling smgropen(). */
+			reln = prefetcher->last_reln;
+		}
+		else
+		{
+			/* Otherwise we have to open it. */
+			reln = smgropen(block->rnode, InvalidBackendId);
+			prefetcher->last_rnode = block->rnode;
+			prefetcher->last_reln = reln;
+		}
+		prefetcher->last_blkno = block->blkno;
+
+		/* Try to prefetch this block! */
+		prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+		if (BufferIsValid(prefetch.recent_buffer))
+		{
+			/*
+			 * It was already cached, so do nothing.  Perhaps in future we
+			 * could remember the buffer so that recovery doesn't have to look
+			 * it up again.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
+		}
+		else if (prefetch.initiated_io)
+		{
+			/*
+			 * I/O has possibly been initiated (though we don't know if it
+			 * was already cached by the kernel, so we just have to assume
+			 * that it has due to lack of better information).  Record
+			 * this as an I/O in progress until eventually we replay this
+			 * LSN.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->prefetch, 1);
+			XLogPrefetcherInitiatedIO(prefetcher, record->lsn);
+			/*
+			 * If the queue is now full, we'll have to wait before processing
+			 * any more blocks from this record, or move to a new record if
+			 * that was the last block.
+			 */
+			if (XLogPrefetcherSaturated(prefetcher))
+			{
+				prefetcher->next_block_id = block_id + 1;
+				return false;
+			}
+		}
+		else
+		{
+			/*
+			 * Neither cached nor initiated.  The underlying segment file
+			 * doesn't exist.  Presumably it will be unlinked by a later WAL
+			 * record.  When recovery reads this block, it will use the
+			 * EXTENSION_CREATE_RECOVERY flag.  We certainly don't want to do
+			 * that sort of thing while merely prefetching, so let's just
+			 * ignore references to this relation until this record is
+			 * replayed, and let recovery create the dummy file or complain if
+			 * something is wrong.
+			 */
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+									record->lsn);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&Stats->reset_request) != Stats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&Stats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&Stats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_new));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_fpw));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_seq));
+	values[6] = Int32GetDatum(Stats->distance);
+	values[7] = Int32GetDatum(Stats->queue_depth);
+	values[8] = Float4GetDatum(Stats->avg_distance);
+	values[9] = Float4GetDatum(Stats->avg_queue_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Compute (n + 1) % prefetch_queue_size, assuming n < prefetch_queue_size,
+ * without using division.
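+ *
+ * For example, with maintenance_io_concurrency = 2 the queue size is 3 and
+ * the index cycles 0 -> 1 -> 2 -> 0; one slot is left unused so that a full
+ * queue can be distinguished from an empty one.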
+ */
+static inline int
+XLogPrefetcherNext(XLogPrefetcher *prefetcher, int n)
+{
+	int		next = n + 1;
+
+	return next == prefetcher->prefetch_queue_size ? 0 : next;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the (presumably lower) block
+		 * number there because we don't want to have to track individual
+		 * blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of the
+	 * time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+												   HASH_FIND, NULL);
+
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Insert an LSN into the queue.  The queue must not be full already.  This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+						  XLogRecPtr prefetching_lsn)
+{
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+	prefetcher->prefetch_queue[prefetcher->prefetch_head] = prefetching_lsn;
+	prefetcher->prefetch_head =
+		XLogPrefetcherNext(prefetcher, prefetcher->prefetch_head);
+	Stats->queue_depth++;
+	Assert(Stats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet?  That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches.  For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+		   prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+	{
+		prefetcher->prefetch_tail =
+			XLogPrefetcherNext(prefetcher, prefetcher->prefetch_tail);
+		Stats->queue_depth--;
+		Assert(Stats->queue_depth >= 0);
+	}
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+	int		next = XLogPrefetcherNext(prefetcher, prefetcher->prefetch_head);
+
+	return next == prefetcher->prefetch_tail;
+}
+
+void
+assign_recovery_prefetch(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
+
+void
+assign_recovery_prefetch_fpw(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch_fpw = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 22e5d5ff64..fb0d80e7c7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -866,6 +866,8 @@ err:
 	/*
 	 * Invalidate the read state, if this was an error. We might read from a
 	 * different source after failure.
+	 *
+	 * XXX !?!
 	 */
 	if (readOff < 0 || state->errormsg_buf[0] != '\0')
 		XLogReaderInvalReadState(state);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2e4aa1c4b6..fb3199a8ae 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -841,6 +841,20 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.skip_hit,
+            s.skip_new,
+            s.skip_fpw,
+            s.skip_seq,
+            s.distance,
+            s.queue_depth,
+            s.avg_distance,
+            s.avg_queue_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 083174f692..9434ef9ace 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "access/xlogprefetch.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -287,6 +288,7 @@ static PgStat_WalStats walStats;
 static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
 static PgStat_ReplSlotStats *replSlotStats;
 static int	nReplSlotStats;
+static PgStat_RecoveryPrefetchStats recoveryPrefetchStats;
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -364,6 +366,7 @@ static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
 static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
+static void pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1386,11 +1389,20 @@ pgstat_reset_shared_counters(const char *target)
 		msg.m_resettarget = RESET_BGWRITER;
 	else if (strcmp(target, "wal") == 0)
 		msg.m_resettarget = RESET_WAL;
+	else if (strcmp(target, "prefetch_recovery") == 0)
+	{
+		/*
+		 * We can't ask the stats collector to do this for us as it is not
+		 * attached to shared memory.
+		 */
+		XLogPrefetchRequestResetStats();
+		return;
+	}
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("unrecognized reset target: \"%s\"", target),
-				 errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
+				 errhint("Target must be \"archiver\", \"bgwriter\", \"wal\" or \"prefetch_recovery\".")));
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
 	pgstat_send(&msg, sizeof(msg));
@@ -2838,6 +2850,22 @@ pgstat_fetch_replslot(int *nslots_p)
 	return replSlotStats;
 }
 
+/*
+ * ---------
+ * pgstat_fetch_recoveryprefetch() -
+ *
+ *	Support function for restoring the counters managed by xlogprefetch.c.
+ * ---------
+ */
+PgStat_RecoveryPrefetchStats *
+pgstat_fetch_recoveryprefetch(void)
+{
+	backend_read_statsfile();
+
+	return &recoveryPrefetchStats;
+}
+
+
 /* ------------------------------------------------------------
  * Functions for management of the shared-memory PgBackendStatus array
  * ------------------------------------------------------------
@@ -4639,6 +4667,23 @@ pgstat_send_slru(void)
 }
 
 
+/* ----------
+ * pgstat_send_recoveryprefetch() -
+ *
+ *		Send recovery prefetch statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats)
+{
+	PgStat_MsgRecoveryPrefetch msg;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYPREFETCH);
+	msg.m_stats = *stats;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
 /* ----------
  * PgstatCollectorMain() -
  *
@@ -4852,6 +4897,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_slru(&msg.msg_slru, len);
 					break;
 
+				case PGSTAT_MTYPE_RECOVERYPREFETCH:
+					pgstat_recv_recoveryprefetch(&msg.msg_recoveryprefetch, len);
+					break;
+
 				case PGSTAT_MTYPE_FUNCSTAT:
 					pgstat_recv_funcstat(&msg.msg_funcstat, len);
 					break;
@@ -5134,6 +5183,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
 	rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
 	(void) rc;					/* we'll check for error with ferror */
 
+	/*
+	 * Write recovery prefetch stats struct
+	 */
+	rc = fwrite(&recoveryPrefetchStats, sizeof(recoveryPrefetchStats), 1,
+				fpout);
+	(void) rc;					/* we'll check for error with ferror */
+
 	/*
 	 * Walk through the database table.
 	 */
@@ -5408,6 +5464,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 	memset(&archiverStats, 0, sizeof(archiverStats));
 	memset(&walStats, 0, sizeof(walStats));
 	memset(&slruStats, 0, sizeof(slruStats));
+	memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
 
 	/*
 	 * Set the current timestamp (will be kept only in case we can't load an
@@ -5513,6 +5570,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 		goto done;
 	}
 
+	/*
+	 * Read recoveryPrefetchStats struct
+	 */
+	if (fread(&recoveryPrefetchStats, 1, sizeof(recoveryPrefetchStats),
+			  fpin) != sizeof(recoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
+		goto done;
+	}
+
 	/*
 	 * We found an existing collector stats file. Read it and put all the
 	 * hashtable entries into place.
@@ -5832,6 +5901,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 	PgStat_WalStats myWalStats;
 	PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
 	PgStat_ReplSlotStats myReplSlotStats;
+	PgStat_RecoveryPrefetchStats myRecoveryPrefetchStats;
 	FILE	   *fpin;
 	int32		format_id;
 	const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -5908,6 +5978,18 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 		return false;
 	}
 
+	/*
+	 * Read recovery prefetch stats struct
+	 */
+	if (fread(&myRecoveryPrefetchStats, 1, sizeof(myRecoveryPrefetchStats),
+			  fpin) != sizeof(myRecoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		FreeFile(fpin);
+		return false;
+	}
+
 	/* By default, we're going to return the timestamp of the global file. */
 	*ts = myGlobalStats.stats_timestamp;
 
@@ -6090,6 +6172,13 @@ backend_read_statsfile(void)
 		if (ok && file_ts >= min_ts)
 			break;
 
+		/*
+		 * If we're in crash recovery, the collector may not even be running,
+		 * so work with what we have.
+		 */
+		if (InRecovery)
+			break;
+
 		/* Not there or too old, so kick the collector and wait a bit */
 		if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
 			pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
@@ -6783,6 +6872,18 @@ pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
 	slruStats[msg->m_index].truncate += msg->m_truncate;
 }
 
+/* ----------
+ * pgstat_recv_recoveryprefetch() -
+ *
+ *	Process a recovery prefetch message.
+ * ----------
+ */
+static void
+pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len)
+{
+	recoveryPrefetchStats = msg->m_stats;
+}
+
 /* ----------
  * pgstat_recv_recoveryconflict() -
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..912a8cfcb6 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/xlogprefetch.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -126,6 +127,7 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, PredicateLockShmemSize());
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
+		size = add_size(size, XLogPrefetchShmemSize());
 		size = add_size(size, CLOGShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
@@ -216,6 +218,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index bb34630e8e..ffeb7b0704 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -37,6 +37,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetch.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/storage.h"
@@ -202,6 +203,7 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1248,6 +1250,32 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch referenced blocks during recovery."),
+			gettext_noop("Read ahead of the current replay position to find uncached blocks.")
+		},
+		&recovery_prefetch,
+		/* No point in enabling this on systems without a suitable API. */
+#ifdef USE_PREFETCH
+		true,
+#else
+		false,
+#endif
+		NULL, assign_recovery_prefetch, NULL
+	},
+	{
+		{"recovery_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch blocks that have full page images in the WAL."),
+			gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+						 "entirely overwritten, but if the logical page size of the filesystem is "
+						 "larger than PostgreSQL's, this can be beneficial.  This option has no "
+						 "effect unless recovery_prefetch is enabled.")
+		},
+		&recovery_prefetch_fpw,
+		false,
+		NULL, assign_recovery_prefetch_fpw, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2636,6 +2664,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
+			gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
+			GUC_UNIT_BYTE
+		},
+		&wal_decode_buffer_size,
+		512 * 1024, 64 * 1024, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the size of WAL files held for standby servers."),
@@ -2956,7 +2995,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -11636,6 +11676,20 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..16c5cc4fd7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -234,6 +234,12 @@
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
 
+# - Prefetching during recovery -
+
+#wal_decode_buffer_size = 512kB		# lookahead window used for prefetching
+#recovery_prefetch = on			# whether to prefetch blocks referenced in the WAL
+#recovery_prefetch_fpw = off		# whether to prefetch pages logged with FPW
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..4f58fa029a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern int	wal_decode_buffer_size;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/access/xlogprefetch.h b/src/include/access/xlogprefetch.h
new file mode 100644
index 0000000000..8c04ff8bce
--- /dev/null
+++ b/src/include/access/xlogprefetch.h
@@ -0,0 +1,79 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCH_H
+#define XLOGPREFETCH_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern bool recovery_prefetch;
+extern bool recovery_prefetch_fpw;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern int XLogPrefetchReconfigureCount;
+
+typedef struct XLogPrefetchState
+{
+	XLogReaderState *reader;
+	XLogPrefetcher *prefetcher;
+	int			reconfigure_count;
+} XLogPrefetchState;
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchReconfigure(void);
+extern void XLogPrefetchRequestResetStats(void);
+
+extern void XLogPrefetchBegin(XLogPrefetchState *state, XLogReaderState *reader);
+extern void XLogPrefetchEnd(XLogPrefetchState *state);
+
+/* Functions exposed only for the use of XLogPrefetch(). */
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogReaderState *reader);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch,
+									XLogRecPtr replaying_lsn);
+
+/*
+ * Tell the prefetching module that we are now replaying a given LSN, so that
+ * it can decide how far ahead to read in the WAL, if configured.
+ */
+static inline void
+XLogPrefetch(XLogPrefetchState *state, XLogRecPtr replaying_lsn)
+{
+	/*
+	 * Handle any configuration changes.  Rather than trying to deal with
+	 * various parameter changes, we just tear down and set up a new
+	 * prefetcher if anything we depend on changes.
+	 */
+	if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
+	{
+		/* If we had a prefetcher, tear it down. */
+		if (state->prefetcher)
+		{
+			XLogPrefetcherFree(state->prefetcher);
+			state->prefetcher = NULL;
+		}
+		/* If we want a prefetcher, set it up. */
+		if (recovery_prefetch)
+			state->prefetcher = XLogPrefetcherAllocate(state->reader);
+		state->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	if (state->prefetcher)
+		XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
+}
+
+#endif							/* XLOGPREFETCH_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index c01da4bf01..8e028eb35b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6185,6 +6185,14 @@
   prorettype => 'bool', proargtypes => '',
   prosrc => 'pg_is_wal_replay_paused' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,float4,float4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth,avg_distance,avg_queue_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c5f763dd44..01abc4fa2b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -64,6 +64,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_WAL,
 	PGSTAT_MTYPE_SLRU,
+	PGSTAT_MTYPE_RECOVERYPREFETCH,
 	PGSTAT_MTYPE_FUNCSTAT,
 	PGSTAT_MTYPE_FUNCPURGE,
 	PGSTAT_MTYPE_RECOVERYCONFLICT,
@@ -186,6 +187,19 @@ typedef struct PgStat_TableXactStatus
 	struct PgStat_TableXactStatus *next;	/* next of same subxact */
 } PgStat_TableXactStatus;
 
+/*
+ * Recovery prefetching statistics persisted on disk by pgstat.c, but kept in
+ * shared memory by xlogprefetch.c.
+ */
+typedef struct PgStat_RecoveryPrefetchStats
+{
+	PgStat_Counter prefetch;
+	PgStat_Counter skip_hit;
+	PgStat_Counter skip_new;
+	PgStat_Counter skip_fpw;
+	PgStat_Counter skip_seq;
+	TimestampTz stat_reset_timestamp;
+} PgStat_RecoveryPrefetchStats;
 
 /* ------------------------------------------------------------
  * Message formats follow
@@ -497,6 +511,15 @@ typedef struct PgStat_MsgReplSlot
 	PgStat_Counter m_stream_bytes;
 } PgStat_MsgReplSlot;
 
+/* ----------
+ * PgStat_MsgRecoveryPrefetch			Sent by XLogPrefetch to save statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgRecoveryPrefetch
+{
+	PgStat_MsgHdr m_hdr;
+	PgStat_RecoveryPrefetchStats m_stats;
+} PgStat_MsgRecoveryPrefetch;
 
 /* ----------
  * PgStat_MsgRecoveryConflict	Sent by the backend upon recovery conflict
@@ -644,6 +667,7 @@ typedef union PgStat_Msg
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgWal msg_wal;
 	PgStat_MsgSLRU msg_slru;
+	PgStat_MsgRecoveryPrefetch msg_recoveryprefetch;
 	PgStat_MsgFuncstat msg_funcstat;
 	PgStat_MsgFuncpurge msg_funcpurge;
 	PgStat_MsgRecoveryConflict msg_recoveryconflict;
@@ -1546,6 +1570,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
 extern void pgstat_send_wal(void);
+extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1563,6 +1588,7 @@ extern PgStat_GlobalStats *pgstat_fetch_global(void);
 extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
 extern PgStat_ReplSlotStats *pgstat_fetch_replslot(int *nslots_p);
+extern PgStat_RecoveryPrefetchStats *pgstat_fetch_recoveryprefetch(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 073c8f3e06..6007a81e95 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -441,4 +441,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetch.c */
+extern void assign_recovery_prefetch(bool new_value, void *extra);
+extern void assign_recovery_prefetch_fpw(bool new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 097ff5d111..804f4e24b5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1869,6 +1869,17 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.skip_hit,
+    s.skip_new,
+    s.skip_fpw,
+    s.skip_seq,
+    s.distance,
+    s.queue_depth,
+    s.avg_distance,
+    s.avg_queue_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth, avg_distance, avg_queue_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.20.1

v14-0005-WIP-Avoid-extra-buffer-lookup-when-prefetching-W.patchtext/x-patch; charset=US-ASCII; name=v14-0005-WIP-Avoid-extra-buffer-lookup-when-prefetching-W.patchDownload
From 2be9a6f9b30af79dc930b1b3c4d459aa664ce1d5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 14 Sep 2020 23:20:55 +1200
Subject: [PATCH v14 5/6] WIP: Avoid extra buffer lookup when prefetching WAL
 blocks.

Provide some workspace in decoded WAL records, so that we can remember
which buffer we recently found a block cached in, for later use when
replaying the record.  Provide a new way to look up a recently-known
buffer and check whether it's still valid and has the right tag.

XXX Needs review to figure out if it's safe or steamrolling over subtleties
---
 src/backend/access/transam/xlog.c         |  2 +-
 src/backend/access/transam/xlogprefetch.c |  6 ++--
 src/backend/access/transam/xlogreader.c   | 13 ++++++++
 src/backend/access/transam/xlogutils.c    | 23 ++++++++++---
 src/backend/storage/buffer/bufmgr.c       | 40 +++++++++++++++++++++++
 src/backend/storage/freespace/freespace.c |  3 +-
 src/include/access/xlogreader.h           |  7 ++++
 src/include/access/xlogutils.h            |  3 +-
 src/include/storage/bufmgr.h              |  2 ++
 9 files changed, 89 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d0ee4235fb..d8804499e0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1453,7 +1453,7 @@ checkXLogConsistency(XLogReaderState *record)
 		 * temporary page.
 		 */
 		buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									 RBM_NORMAL_NO_LOG);
+									 RBM_NORMAL_NO_LOG, InvalidBuffer);
 		if (!BufferIsValid(buf))
 			continue;
 
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
index a8149b946c..948a63f25d 100644
--- a/src/backend/access/transam/xlogprefetch.c
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -624,10 +624,10 @@ XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
 		if (BufferIsValid(prefetch.recent_buffer))
 		{
 			/*
-			 * It was already cached, so do nothing.  Perhaps in future we
-			 * could remember the buffer so that recovery doesn't have to look
-			 * it up again.
+			 * It was already cached, so do nothing.  We'll remember the
+			 * buffer, so that recovery can try to avoid looking it up again.
 			 */
+			block->recent_buffer = prefetch.recent_buffer;
 			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
 		}
 		else if (prefetch.initiated_io)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index fb0d80e7c7..9640899ea7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1651,6 +1651,8 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
 
+			blk->recent_buffer = InvalidBuffer;
+
 			COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 			/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
 			if (blk->has_data && blk->data_len == 0)
@@ -1860,6 +1862,15 @@ err:
 bool
 XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 				   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+	return XLogRecGetRecentBuffer(record, block_id, rnode, forknum, blknum,
+								  NULL);
+}
+
+bool
+XLogRecGetRecentBuffer(XLogReaderState *record, uint8 block_id,
+					   RelFileNode *rnode, ForkNumber *forknum,
+					   BlockNumber *blknum, Buffer *recent_buffer)
 {
 	DecodedBkpBlock *bkpb;
 
@@ -1874,6 +1885,8 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 		*forknum = bkpb->forknum;
 	if (blknum)
 		*blknum = bkpb->blkno;
+	if (recent_buffer)
+		*recent_buffer = bkpb->recent_buffer;
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index db0c801456..8a7eac65cf 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -336,11 +336,13 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber blkno;
+	Buffer		recent_buffer;
 	Page		page;
 	bool		zeromode;
 	bool		willinit;
 
-	if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+	if (!XLogRecGetRecentBuffer(record, block_id, &rnode, &forknum, &blkno,
+								&recent_buffer))
 	{
 		/* Caller specified a bogus block_id */
 		elog(PANIC, "failed to locate backup block with ID %d", block_id);
@@ -362,7 +364,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	{
 		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
+									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
+									  recent_buffer);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			elog(ERROR, "failed to restore block image");
@@ -391,7 +394,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	}
 	else
 	{
-		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode);
+		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode,
+									  recent_buffer);
 		if (BufferIsValid(*buf))
 		{
 			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
@@ -439,7 +443,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  */
 Buffer
 XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-					   BlockNumber blkno, ReadBufferMode mode)
+					   BlockNumber blkno, ReadBufferMode mode,
+					   Buffer recent_buffer)
 {
 	BlockNumber lastblock;
 	Buffer		buffer;
@@ -447,6 +452,15 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 
 	Assert(blkno != P_NEW);
 
+	/* Do we have a clue where the buffer might be already? */
+	if (BufferIsValid(recent_buffer) &&
+		mode == RBM_NORMAL &&
+		ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+	{
+		buffer = recent_buffer;
+		goto recent_buffer_fast_path;
+	}
+
 	/* Open the relation at smgr level */
 	smgr = smgropen(rnode, InvalidBackendId);
 
@@ -505,6 +519,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 		}
 	}
 
+recent_buffer_fast_path:
 	if (mode == RBM_NORMAL)
 	{
 		/* check that page has been initialized */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ad0d1a9abc..ece9ec35a2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -598,6 +598,46 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 	}
 }
 
+/*
+ * ReadRecentBuffer -- try to refind a buffer that we suspect holds a given
+ *		block
+ *
+ * Return true if the buffer is valid, has the correct tag, and we managed
+ * to pin it.
+ */
+bool
+ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum,
+				 Buffer recent_buffer)
+{
+	BufferDesc *bufHdr;
+	BufferTag	tag;
+
+	Assert(BufferIsValid(recent_buffer));
+
+	/* Look up the header by index, and try to pin if shared. */
+	if (BufferIsLocal(recent_buffer))
+		bufHdr = GetBufferDescriptor(-recent_buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(recent_buffer - 1);
+		ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+		if (!PinBuffer(bufHdr, NULL))
+		{
+			/* Not valid, couldn't pin it. */
+			UnpinBuffer(bufHdr, true);
+			return false;
+		}
+	}
+
+	/* Does the tag match? */
+	INIT_BUFFERTAG(tag, rnode, forkNum, blockNum);
+	if (BUFFERTAGS_EQUAL(tag, bufHdr->tag))
+		return true;
+
+	/* Nope -- this isn't the block we seek. */
+	UnpinBuffer(bufHdr, true);
+	return false;
+}
 
 /*
  * ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 6a96126b0c..c998b52c13 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -210,7 +210,8 @@ XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 	blkno = fsm_logical_to_physical(addr);
 
 	/* If the page doesn't exist already, extend */
-	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR);
+	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR,
+								 InvalidBuffer);
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(buf);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 44f8847030..616e591259 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -39,6 +39,7 @@
 #endif
 
 #include "access/xlogrecord.h"
+#include "storage/buf.h"
 
 /* WALOpenSegment represents a WAL segment being read. */
 typedef struct WALOpenSegment
@@ -126,6 +127,9 @@ typedef struct
 	ForkNumber	forknum;
 	BlockNumber blkno;
 
+	/* Workspace for remembering last known buffer holding this block. */
+	Buffer		recent_buffer;
+
 	/* copy of the fork_flags field from the XLogRecordBlockHeader */
 	uint8		flags;
 
@@ -377,5 +381,8 @@ extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *
 extern bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 							   RelFileNode *rnode, ForkNumber *forknum,
 							   BlockNumber *blknum);
+extern bool XLogRecGetRecentBuffer(XLogReaderState *record, uint8 block_id,
+								   RelFileNode *rnode, ForkNumber *forknum,
+								   BlockNumber *blknum, Buffer *recent_buffer);
 
 #endif							/* XLOGREADER_H */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 374c1b16ce..a0c2b60c57 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -42,7 +42,8 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 													Buffer *buf);
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-									 BlockNumber blkno, ReadBufferMode mode);
+									 BlockNumber blkno, ReadBufferMode mode,
+									 Buffer recent_buffer);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..c3280b754e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -176,6 +176,8 @@ extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_r
 												 BlockNumber blockNum);
 extern PrefetchBufferResult PrefetchBuffer(Relation reln, ForkNumber forkNum,
 										   BlockNumber blockNum);
+extern bool ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum,
+							 BlockNumber blockNum, Buffer recent_buffer);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
 extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BlockNumber blockNum, ReadBufferMode mode,
-- 
2.20.1

#59Stephen Frost
sfrost@snowman.net
In reply to: Thomas Munro (#58)
Re: WIP: WAL prefetch (another approach)

Greetings,

* Thomas Munro (thomas.munro@gmail.com) wrote:

On Sat, Nov 14, 2020 at 4:13 AM Stephen Frost <sfrost@snowman.net> wrote:

* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:

On 11/13/20 3:20 AM, Thomas Munro wrote:

I'm not really sure what to do about archive restore scripts that
block. That seems to be fundamentally incompatible with what I'm
doing here.

IMHO we can't do much about that, except for documenting it - if the
prefetch can't work because of blocking restore script, someone has to
fix/improve the script. No way around that, I'm afraid.

I'm a bit confused about what the issue here is- is the concern that a
restore_command is specified that isn't allowed to run concurrently but
this patch is intending to run more than one concurrently..? There's
another patch that I was looking at for doing pre-fetching of WAL
segments, so if this is also doing that we should figure out which
patch we want..

The problem is that the recovery loop tries to look further ahead in
between applying individual records, which causes the restore script
to run, and if that blocks, we won't apply records that we already
have, because we're waiting for the next WAL file to appear. This
behaviour is on by default with my patch, so pg_standby will introduce
weird replay delays. We could think of some ways to fix that, with
meaningful return codes and periodic polling or something, I suppose,
but something feels a bit weird about it.

Ah, yeah, that's clearly an issue that should be addressed. There's a
nearby thread which is talking about doing exactly that, so, perhaps
this doesn't need to be worried about here..?

I don't know that it's needed, but it feels likely that we could provide
a better result if we consider making changes to the restore_command API
(eg: have a way to say "please fetch this many segments ahead, and you
can put them in this directory with these filenames" or something). I
would think we'd be able to continue supporting the existing API and
accept that it might not be as performant.

Hmm. Every time I try to think of a protocol change for the
restore_command API that would be acceptable, I go around the same
circle of thoughts about event flow and realise that what we really
need for this is ... a WAL receiver...

A WAL receiver, or an independent process which goes out ahead and
fetches WAL..?

Still, I wonder about having a way to inform the command that's run by
the restore_command of what it is we really want, eg:

restore_command = 'somecommand --async=%a --target=%t --target-name=%n --target-xid=%x --target-lsn=%l --target-timeline=%i --dest-dir=%d'

Such that '%a' is either yes, or no, indicating if the restore command
should operate asynchronously and pre-fetch WAL, %t is either empty (or
maybe 'unset') or 'immediate', %n/%x/%l are similar to %t, %i is either
a specific timeline or 'immediate' (somecommand should be understanding
of timelines and know how to parse history files to figure out the right
timeline to fetch along, based on the destination requested), and %d is
a directory for somecommand to place WAL files into (perhaps with an
alternative naming scheme, if we feel we need one).
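
Purely for illustration (every concrete value below is made up, and
'somecommand' itself is hypothetical), a prefetching request under such a
scheme might end up being invoked as something like:

    somecommand --async=yes --target=unset --target-name=unset \
        --target-xid=unset --target-lsn=unset --target-timeline=1 \
        --dest-dir=/path/to/prefetch_dir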

The amount pre-fetching which 'somecommand' would do, and how many
processes it would use to do so, could either be configured as part of
the options passed to 'somecommand', which we would just pass through,
or through its own configuration file.

A restore_command which is set but doesn't include a %a or %d or such
would be assumed to work in the same manner as today.

For my part, at least, I don't think this is really that much of a
stretch, to expect a restore_command to be able to pre-populate a
directory with WAL files- certainly there's at least one that already
does this, even though it doesn't have all the information directly
passed to it.. Would be nice if it did. :)

Thanks,

Stephen

#60Thomas Munro
thomas.munro@gmail.com
In reply to: Stephen Frost (#59)
Re: WIP: WAL prefetch (another approach)

On Thu, Nov 19, 2020 at 10:00 AM Stephen Frost <sfrost@snowman.net> wrote:

* Thomas Munro (thomas.munro@gmail.com) wrote:

Hmm. Every time I try to think of a protocol change for the
restore_command API that would be acceptable, I go around the same
circle of thoughts about event flow and realise that what we really
need for this is ... a WAL receiver...

A WAL receiver, or an independent process which goes out ahead and
fetches WAL..?

What I really meant was: why would you want this over streaming rep?
I just noticed this thread proposing to retire pg_standby on that
basis:

/messages/by-id/20201029024412.GP5380@telsasoft.com

I'd be happy to see that land, to fix this problem with my plan. But
are there other people writing restore scripts that block that would
expect them to work on PG14?

#61Stephen Frost
sfrost@snowman.net
In reply to: Thomas Munro (#60)
Re: WIP: WAL prefetch (another approach)

Greetings,

* Thomas Munro (thomas.munro@gmail.com) wrote:

On Thu, Nov 19, 2020 at 10:00 AM Stephen Frost <sfrost@snowman.net> wrote:

* Thomas Munro (thomas.munro@gmail.com) wrote:

Hmm. Every time I try to think of a protocol change for the
restore_command API that would be acceptable, I go around the same
circle of thoughts about event flow and realise that what we really
need for this is ... a WAL receiver...

A WAL receiver, or an independent process which goes out ahead and
fetches WAL..?

What I really meant was: why would you want this over streaming rep?

I have to admit to being pretty confused as to this question and maybe
I'm just not understanding. Why wouldn't change patch be helpful for
streaming replication too..?

If I follow correctly, this patch will scan ahead in the WAL and let
the kernel know that certain blocks will be needed soon. Ideally,
though I don't think it does yet, we'd only do that for blocks that
aren't already in shared buffers, and only for non-FPIs (even better if
we can skip past pages for which we already, recently, passed an FPI).
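
To make "letting the kernel know" concrete, on filesystems where it works
the hint boils down to roughly this (a simplified sketch, not the actual
bufmgr code; the function name and arguments are made up for illustration):

    #include <fcntl.h>

    /* Hint that one block will be read soon, so the kernel can start the I/O. */
    static void
    hint_block_willneed(int fd, off_t blockno, off_t block_size)
    {
        (void) posix_fadvise(fd, blockno * block_size, block_size,
                             POSIX_FADV_WILLNEED);
    }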

The biggest caveat here, it seems to me anyway, is that for this to
actually help you need to be running with checkpoints that are larger
than shared buffers, as otherwise all the pages we need will be in
shared buffers already, thanks to FPIs bringing them in, except when
running with hot standby, right?

In the hot standby case, other random pages could be getting pulled in
to answer user queries and therefore this would be quite helpful to
minimize the amount of time required to replay WAL, I would think.
Naturally, this isn't very interesting if we're just always able to
keep up with the primary, but that's certainly not always the case.

I just noticed this thread proposing to retire pg_standby on that
basis:

/messages/by-id/20201029024412.GP5380@telsasoft.com

I'd be happy to see that land, to fix this problem with my plan. But
are there other people writing restore scripts that block that would
expect them to work on PG14?

Ok, I think I finally get the concern that you're raising here-
basically that if a restore command was written to sit around and wait
for WAL segments to arrive, instead of just returning to PG and saying
"WAL segment not found", that this would be a problem if we are running
out ahead of the applying process and asking for WAL.

The thing is- that's an outright broken restore command script in the
first place. If PG is in standby mode, we'll ask again if we get an
error result indicating that the WAL file wasn't found. The restore
command documentation is quite clear on this point:

The command will be asked for file names that are not present in the
archive; it must return nonzero when so asked.

There's no "it can wait around for the next file to show up if it wants
to" in there- it *must* return nonzero when asked for files that don't
exist.
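
For contrast, the minimal example from the documentation already behaves
correctly here, because cp exits nonzero when the requested segment is
absent and recovery simply retries later:

    restore_command = 'cp /mnt/server/archivedir/%f %p'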

So, I don't think that we really need to stress over this. The fact
that pg_standby offers options to have it wait instead of just returning
a non-zero error code and letting the loop that we already do in the core
code handle it seems like it's really just a legacy thing from before we were
doing that and probably should have been ripped out long ago... Even
more reason to get rid of pg_standby tho, imv, we haven't been properly
adjusting it when we've been making changes to the core code, it seems.

Thanks,

Stephen

#62Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#61)
Re: WIP: WAL prefetch (another approach)

Hi,

On 2020-12-04 13:27:38 -0500, Stephen Frost wrote:

If I follow correctly, this patch will scan ahead in the WAL and let
the kernel know that certain blocks will be needed soon. Ideally,
though I don't think it does yet, we'd only do that for blocks that
aren't already in shared buffers, and only for non-FPIs (even better if
we can skip past pages for which we already, recently, passed an FPI).

The patch uses PrefetchSharedBuffer(), which only initiates a prefetch
if the page isn't already in s_b.
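
For reference, the calling pattern visible in the 0005 hunk upthread looks
roughly like this (simplified sketch; the variable names are illustrative):

    PrefetchBufferResult prefetch;

    /* Consults the buffer mapping table before doing anything else. */
    prefetch = PrefetchSharedBuffer(smgr, forknum, blkno);
    if (BufferIsValid(prefetch.recent_buffer))
    {
        /* Already in shared buffers: no I/O hint issued, remember the buffer. */
    }
    else if (prefetch.initiated_io)
    {
        /* Not cached: an asynchronous read hint (posix_fadvise) was issued. */
    }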

And once we have AIO, it can actually initiate IO into s_b at that
point, rather than fetching it just into the kernel page cache.

Greetings,

Andres Freund

#63Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#62)
Re: WIP: WAL prefetch (another approach)

Greetings,

* Andres Freund (andres@anarazel.de) wrote:

On 2020-12-04 13:27:38 -0500, Stephen Frost wrote:

If I follow correctly, this patch will scan ahead in the WAL and let
the kernel know that certain blocks will be needed soon. Ideally,
though I don't think it does yet, we'd only do that for blocks that
aren't already in shared buffers, and only for non-FPIs (even better if
we can skip past pages for which we already, recently, passed an FPI).

The patch uses PrefetchSharedBuffer(), which only initiates a prefetch
if the page isn't already in s_b.

Great, glad that's already been addressed in this, that's certainly
good. I think I knew that and forgot it while composing that response
over the past rather busy week. :)

And once we have AIO, it can actually initiate IO into s_b at that
point, rather than fetching it just into the kernel page cache.

Sure.

Thanks,

Stephen

#64Jakub Wartak
Jakub.Wartak@tomtom.com
In reply to: Thomas Munro (#58)
9 attachment(s)
RE: WIP: WAL prefetch (another approach)

Thomas wrote:

Here's a rebase over the recent commit "Get rid of the dedicated latch for
signaling the startup process." just to fix cfbot; no other changes.

I wanted to contribute my findings so far - after dozens of various lengthy runs here - on WAL (asynchronous) recovery performance in the hot-standby case. TL;DR: this patch is awesome even on NVMe 😉

This email covers a somewhat larger topic than the prefetching patch itself, but I did not want to lose context. Maybe it'll help somebody in operations or just add to the general pool of knowledge amongst hackers here; maybe all of this stuff was already known to you. My plan is to leave it here like that, as I'm probably lacking the understanding, time, energy and ideas to tweak it further.

SETUP AND TEST:
---------------
There might be many different workloads, however I've only concentrated on a single one - INSERT .. SELECT of 100 rows - one that was predictable enough for me, quite generic, and allows uncovering some deterministic hotspots. The result is that in such a workload it is possible to replicate ~750Mbit/s of small-row traffic in stable conditions (catching up is a different matter).

- two i3.4xlarge AWS VMs with 14devel, see [0] for specs. 14devel already contains major optimizations reducing lseeks() and SLRU CLOG flushing [1].
- WIP WAL prefetching [2] by Thomas Munro applied, v14_000[12345] patches; especially v14_0005 is important here as it reduces dynahash calls.
- FPWs were disabled to avoid hitting >2.5Gbps traffic spikes
- hash_search_with_hash_value_memcmpopt() is my very poor man's copycat optimization of dynahash.c's hash_search_with_hash_value() to avoid the indirect function calls to match() [3].
- just-in-case VDSO clock_gettime() fix on AWS: tsc for clocksource0 instead of "xen"; OR one could use track_io_timing=off to reduce syscalls

Primary tuning:
in order to reliably measure standby WAL recovery performance, one needs to set up a *STABLE* generator over time/size on the primary. In my case it was 2 indexes and 1 table: pgbench -n -f inserts.pgb -P 1 -T 86400 -c 16 -j 2 -R 4000 --latency-limit=50 db.

VFS-CACHE-FITTING WORKLOAD @ 4k TPS:
------------------------------------

create sequence s1;
create table tid (id bigint primary key, j bigint not null, blah text not null) partition by hash (id);
create index j_tid on tid (j); -- to put some more realistic stress
create table tid_h1 partition of tid FOR VALUES WITH (MODULUS 16, REMAINDER 0);
[..]
create table tid_h16 partition of tid FOR VALUES WITH (MODULUS 16, REMAINDER 15);

The number of clients (-c 16) needs to be aligned with the hash partitioning to avoid LWLock/BufferContent. inserts.pgb looked like:
insert into tid select nextval('s1'), g, 'some garbage text' from generate_series(1,100) g.
The sequence is of key importance here. "g" is hitting more or less randomly here (the j_tid btree might grow quite a bit on the standby too).

Additionally, due to drops on the primary, I've disabled fsync as a stopgap measure, because - at least to my understanding - I was affected by global freezes of my insertion workload due to Lock/extend, as one of the sessions was always in mdextend() -> register_dirty_segment() -> RegisterSyncRequest() (fsync pg_usleep 0.01s), which caused frequent dips in performance even at the beginning (visible thanks to pgbench -P 1), and I wanted something completely linear. The fsync=off was simply a shortcut in order to measure stuff properly on the standby (I needed a deterministic "producer").

The WAL recovery is not really single threaded, thanks to prefetches with posix_fadvise() - performed by other (?) CPUs/kernel threads I suppose - CLOG flushing by the checkpointer, and the bgwriter itself. The walsender/walreceiver were not the bottlenecks, but the bgwriter and checkpointer need to be really tuned on the *standby* side too.

So, the above workload is CPU bound on the standby side for a long time. I would classify it as "standby-recovery-friendly", as the IO working set of the main redo loop does NOT degrade over time/dbsize that much, so there is no lag until a certain point. In order to classify the startup/recovery process one could use the recent pidstat(1) -d "iodelay" metric. If one gets a stable >= 10 centiseconds over more than a few seconds, then one probably has an I/O-driven bottleneck. If iodelay==0 then it is a completely VFS-cached I/O workload.

In such a setup, the primary can generate - without hiccups - 6000-6500 TPS (insert 100 rows) @ ~25% CPU util using 16 DB sessions. Of course it could push more, but we are using pgbench throttling. The standby can follow up to ~4000 TPS on the primary without lag (@ 4500 TPS it was having some lag even at the start). The startup/recovering process gets into 95% CPU utilization territory with ~300k (?) hash_search_with_hash_value_memcmpopt() executions per second (measured using perf-probe). The shorter the WAL record, the more CPU-bound the WAL recovery performance is going to be. In my case ~220k WAL records @ 16MB WAL segment, and I was running at a stable 750Mbit/s. What is important - at least on my HW - is that due to dynahash there's a hard limit of ~300..400k WAL records/s (perf probe/stat reports that I'm having 300k hash_search_with_hash_value_memcmpopt()/s, while my workload is 4k [rate] * 100 [rows] * 3 [table + 2 indexes] = 400k/s and no lag, a discrepancy that I admit I do not understand; maybe it's Thomas's recent_buffer_fastpath from the v14_0005 prefetcher). On some other OLTP production systems I've seen 10k..120k WAL records/16MB segment. The perf picture looks like the one in [4]. The "tidseq-*" graphs are about this scenario.

One could say that with a smaller number of bigger rows one could push more over the network, and that's true, however unrealistic in real-world systems (again with FPW=off, I was able to push up to a stable ~2.5Gbit/s without lag, but at half the rate and with much bigger rows - ~270 WAL records/16MB segment and the primary being the bottleneck). The top#1 CPU function was, quite unexpectedly, again BufTableLookup() -> hash_search_with_hash_value_memcmpopt() even at such a relatively low record rate, which illustrates that even with a lot of bigger memcpy()s being done by recovery, those are not the problem one would typically expect.

VFS-CACHE-MISSES WORKLOAD @ 1.5k TPS:
-------------------------------------

An interesting behavior shows up for a very similar data-loading scheme as described above, but with a uuid PK and uuid_generate_v4() *random* UUIDs (a pretty common pattern amongst developers) instead of a bigint sequence, so something very similar to the above, like:
create table trandomuuid (id uuid primary key , j bigint not null, t text not null) partition by hash (id);
... the picture radically changes if the active working I/O set doesn't fit the VFS cache and it's I/O bound on the recovery side (again, this is with prefetching already). This can be checked via iodelay: if it goes, let's say, >= 10-20 centiseconds, or BCC's cachetop(1) shows a "relatively low" READ_HIT% for recovering (poking at it, it was ~40-45% in my case when recovery started to be really I/O heavy):

DBsize@112GB , 1s sample:
13:00:16 Buffers MB: 200 / Cached MB: 88678 / Sort: HITS / Order: descending
PID UID CMD HITS MISSES DIRTIES READ_HIT% WRITE_HIT%
1849 postgres postgres 160697 67405 65794 41.6% 1.2% -- recovering
1853 postgres postgres 37011 36864 24576 16.8% 16.6% -- walreceiver
1851 postgres postgres 15961 13968 14734 4.1% 0.0% -- bgwriter

On 128GB RAM, when DB size gets near the ~80-90GB boundary (128 - 32 for huge pages - $binaries - $kernel - $etc =~ 90GB free page cache), SOMETIMES in my experiments it started getting lag, but at the same time even the primary cannot keep up at a rate of 1500 TPS (IO/DataFileRead|Write may happen, or still Lock/extend) and struggles; of course this is well-known behavior [5]. Also, at this almost-pathological INSERT rate, pgstat_bgwriter.buffers_backend was like 90% of buffers_alloc and I couldn't do much of anything about it (small s_b on primary, tuning bgwriter settings to the max, even with the bgwriter_delay=0 hack, BM_MAX_USAGE_COUNT=1). Any suggestion on how to make such a $workload deterministic after a certain DBsize under pgbench -P 1 is welcome :)

So, in order to deterministically - in multiple runs - demonstrate the impact of WAL prefetching by Thomas in such a scenario (where the primary was the bottleneck itself), see the "trandomuuid-*" graphs; one of the graphs has the same commentary as here:
- the system is running with WAL prefetching disabled (maintenance_io_concurrency=0)
- once the DBsize is >85-90GB the primary cannot keep up, so there's a drop in data produced - rxNET KB/s. At this stage I did echo 3 > drop_caches, to shock the system (there's a very small jump of Lag, but it goes to 0 again -- good, the standby can still manage)
- once the DBsize got near ~275GB the standby couldn't follow even the choked primary (lag starts rising to >3000s, IOdelay indicates that startup/recovering is wasting like 70% of its time on synchronous preads())
- at DBsize ~315GB I set maintenance_io_concurrency=10 (enabling the WAL prefetching/posix_fadvise()), lag starts dropping, IOdelay is reduced to ~53, and %CPU (not %sys) of the process jumps from 28% -> 48% (efficiency grows)
- at DBsize ~325GB I set maintenance_io_concurrency=128 (giving the kernel more time to pre-read for us), lag starts dropping even faster, IOdelay is reduced to ~30, and the %CPU part (not %sys) of the process jumps from 48% -> 70% (its efficiency grows again, 2.5x more than baseline)

Another interesting observation is that the standby's bgwriter is much more stressed and important than the recovery itself, and several times more active than the one on the primary. I've rechecked using Tomas Vondra's sequential-uuids extension [6], and of course the problem doesn't exist if the UUIDs are not that random (much more localized, so this small workload adjustment makes it behave like the "VFS-CACHE-fitting" scenario).

Also, just in case, for the patch review process: I can confirm that the data inserted on primary and standby did match on multiple occasions (sums of columns) after those tests (some of those were run up to the 3TB mark).

Random thoughts:
----------------
1) Even with all those optimizations - I/O prefetching (posix_fadvise()) or even IO_URING in the future - there's going to be the BufTableLookup()->dynahash single-threaded CPU limitation bottleneck. It may be that with IO_URING in the future and proper HW, all workloads will start to be CPU-bound on the standby ;) I do not see a simple way to optimize such a fundamental pillar - other than parallelizing it? I hope I'm wrong.

1b) With the above patches I need to disappoint Alvaro Herrera: I was unable to reproduce the top#1 smgropen() -> hash_search_with_hash_value() in any way, as I think right now v14_0005 simply kind of solves the problem.

2) I'm kind of thinking that flushing dirty pages on the standby should be much more aggressive than on the primary, in order to unlock the startup/recovering potential. What I'm trying to say is that it might even be beneficial to spot whether FlushBuffer() is happening too often from inside the main redo recovery loop, and if it is, then issue a LOG/HINT from time to time (similar to the famous "checkpoints are occurring too frequently") to tune the background writer on the standby or investigate the workload itself on the primary. Generally speaking, those bgwriter/checkpointer GUCs might be kind of artificial in the standby-processing scenario.

3) The WAL recovery could (?) have some protection from noisy neighboring backends. As the hot standby is often used in read-offload configurations, it could be important to protect its VFS cache (active, freshly replicated data needed for WAL recovery) from being polluted by other backends issuing random SQL SELECTs.

4) Even scenarios with COPY/heap_multi_insert()-based statements emit a lot of interleaved Btree/INSERT_LEAF records that are CPU heavy if the table is indexed.

6) I don't think walsender/walreceiver are in any danger right now, as at least in my case they had plenty of headroom (even @ 2.5Gbps the walreceiver was ~30-40% CPU) while issuing I/O writes of 8kB (but this was with fsync=off and on NVMe). The walsender was in even better shape, mainly due to sendto(128kB). YMMV.

7) As the uuid-ossp extension is present in contrib and T.V.'s sequential-uuids unfortunately is NOT, developers, more often than not, might run into those pathological scenarios. The same applies to any cloud-hosted database where one cannot deploy one's own extensions.

What was not tested and what are further research questions:
-----------------------------------------------------------
a) Impact of vacuum WAL records: I suspect it might have been the additional vacuum-generated workload added to the mix during the VFS-cache-fitting workload that overwhelmed the recovery loop and made it start accumulating lag.

b) Impact of the noisy-neighboring-SQL queries on hot-standby:
b1) research the impact of LWLock buffer_mapping contention between readers and the recovery itself.
b2) research/experiments maybe with cgroups2 VFS-cache memory isolation for processes.

c) The impact of WAL prefetching's "maintenance_io_concurrency" VS iodelay for startup/recovering preads() is also unknown. The key question there is how far ahead to issue those posix_fadvise() calls so that pread() is nearly free. Some I/O calibration tool to set maintenance_io_concurrency would be nice.
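
As a rough back-of-the-envelope way of thinking about that question (all numbers hypothetical): if a cold pread() costs ~1 ms and recovery consumes ~100k block references per second (one every ~10 microseconds), then each hint has to be issued at least 1 ms / 10 us = ~100 block references before replay reaches it, which also means roughly 100 prefetches in flight at any moment; wal_decode_buffer_size and maintenance_io_concurrency are what bound that lookahead in practice.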

-J.

[0]: specs: 2x AWS i3.4xlarge (1s8c16t, 128GB RAM, Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz), 2xNVMe in lvm striped VG, ext4. Tuned parameters: bgwriter_*, s_b=24GB with huge pages, checkpoint_completion_target=0.9, commit_delay=100000, commit_siblings=20, synchronous_commit=off, fsync=off, max_wal_size=40GB, recovery_prefetch=on, track_io_timing=on, wal_block_size=8192 (default), wal_decode_buffer_size=512kB (default WIP WAL prefetching), wal_buffers=256MB. Schema was always 16-way hash-partitioned to avoid LWLock/BufferContent waits.

[1]: /messages/by-id/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com

[2]: https://commitfest.postgresql.org/31/2410/

[3]: hash_search_with_hash_value() spends a lot of time near "callq *%r14" in tight-loop assembly in my case (indirect call to the hash comparison function). This hash_search_with_hash_value_memcmpopt() is just a copycat function that instead directly calls memcmp() where it matters (smgr.c, buf_table.c). A blind shot at gcc's -flto also didn't help to gain a lot there (I was thinking it could optimize it by building many per-match() instances of hash_search_with_hash_value(), but no). I did not quantify the benefit; I think it is just a failed optimization experiment, as it is still top#1 in my profiles - it could even be noise.

[4]: 10s perf image of CPU-bound 14devel with all the mentioned patches:

17.38% postgres postgres [.] hash_search_with_hash_value_memcmpopt
---hash_search_with_hash_value_memcmpopt
|--11.16%--BufTableLookup
| |--9.44%--PrefetchSharedBuffer
| | XLogPrefetcherReadAhead
| | StartupXLOG
| --1.72%--ReadBuffer_common
| ReadBufferWithoutRelcache
| XLogReadBufferExtended
| --1.29%--XLogReadBufferForRedoExtended
| --0.64%--XLogInitBufferForRedo
|--3.86%--smgropen
| |--2.79%--XLogPrefetcherReadAhead
| | StartupXLOG
| --0.64%--XLogReadBufferExtended
--2.15%--XLogPrefetcherReadAhead
StartupXLOG

10.30% postgres postgres [.] MarkBufferDirty
---MarkBufferDirty
|--5.58%--btree_xlog_insert
| btree_redo
| StartupXLOG
--4.72%--heap_xlog_insert

6.22% postgres postgres [.] ReadPageInternal
---ReadPageInternal
XLogReadRecordInternal
XLogReadAhead
XLogPrefetcherReadAhead
StartupXLOG

5.36% postgres postgres [.] hash_bytes
---hash_bytes
|--3.86%--hash_search_memcmpopt

[5]: https://www.2ndquadrant.com/en/blog/on-the-impact-of-full-page-writes/
https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/
https://www.2ndquadrant.com/en/blog/sequential-uuid-generators-ssd/

[6]: https://github.com/tvondra/sequential-uuids

Attachments:

tidseq-without-FPW-4kTPS_cpuOverSize.csv.pngimage/png; name=tidseq-without-FPW-4kTPS_cpuOverSize.csv.pngDownload
tidseq-without-FPW-4kTPS_iodelayOverSize.csv.pngimage/png; name=tidseq-without-FPW-4kTPS_iodelayOverSize.csv.pngDownload
��X�J��.e7nQ��������"����c2�d�U��rB�<�.��M����<��! ��*C�G�c��M�FF-�f6Z

��_�)��VO��&�W�#�=j?��K�GwPw�a�&�
^aC{X:���+t��|�� <8���
����T;���q���g���]�Ep�� w����`h����H�O�z"�92���iF��2�L��h6�������Ue�v J�E��L��
K��(���h@���te�V��Ai&x���y<J�^�
7�d(Ji� ������C} G���^@r���<�=�O
)`���<k���y���W�G��u��d���2V��O��45y0p����U��NW��������9<�1"S����C�H�5
�����������Q�:a�Qc!��D	�}4i�]�, ��\�K�{
�6UK��x���"VB{�_p#�2 f�4�a��������
_b����b��$i[Rn$z���A2�������!�zi@��Z�C��1},�*rwF<<&tp���v�r<�S��K���u���;�p������C�;t��{�A�	�Y$�0���2��I������u���}/�c����6�CC�-�y1'��*d@�Wl:�w�
��*����dF�����/�����7��m����uo�m�G�aY��p������������m���G��Y��,�\���P	O�7p��x������ya�{��gA����{� �o���@�o5�f��^�V:Zwv��GrvG�����	�G.��I��U�]�x�����C:v����O��:�5�0"/�����1�(	�����72�!2W	��=����Tzr\��CZ�Ga�w��	�x��*��c�$���=�V�����Ly�����H�3����-*�B���O�;��N7u���vKu[������p��A|�:�tW��
����&���c�&�8]'����?��B���$<��	���-��"��'���0��	@8�p� /��|�7YB��&PH��3p��g���0<9��G��Q��s��2t��3���-���<��uG��9K$�,������wx�M2yfDP�=���@��� �Dm��� B*��;��~�@e��hGx�t��t��8,�$�
tl=����m���W��O�������V�X�a�q�n<������z���X�Z���V}k3!JFzl�tN�xU��+�A�����e6���X��d�cO<����{������u������|�OKl��4�)8��[��������`E���^z������Fe�L����> ��6c��@o�����%��i�3�!v^BM��������E>���0fe��w8=��E����O��C�N5[���@�������������*��8T����-xn��G�������r%p��A'���~�\�:�=�D������������*����/ �!l~�@H%g7�=���5���� ��
�9����>�7��L�y<��z��#�03�������N?��EK�J��JkP(�(9���`jJ�#����
��������3=�;�������AC�K��N}���Ef
��7u=���,�3�|���y�B����M�nE��T;�U������ �[C~0��Y���e"�	!!N�~��-qf��s������+�M�#S�)��D8`��`�g#�9�� �����E�k��9b{���oE�
��}��X����
u�! b��A/j	�18����`��>�F Fb}R(�6r����'F��F�4�)���g���{����^���)���X�}%��t`��Ka��{��1'���]2��\�&B�`G��:�w	�����J�(�����G����jB�	����$v����'W6��x�������;����+�5�����<&�@HVtW��`�H���f[�u�Rx������q5��z�����3q����|<�������*���^�T� ,��b��x���o��x����^�=�z��`���E�����LpY�<��`:��M�w��y'nB`����������g��<�L`�_��B��O����r�qA�`����*�@v��gH[I�
��������P������]@���Ye������eKl8��t��@S���k���9c!�;~����H��)@s"d����~gd3�-
�������H��� ��?`f{��Y��s��	w����1#-�u� �\84[�����w��+ ��� ������`��*����O�H!�yx��e1t5�a��{������C����M���
����3@:�r���K���0��Y?���#S��-@�w�=Ko�m	�z�e@�s��D$
|iR�g��LY�<��"���;�`��!�D���"�&�1�]�]�a�6���������,��,����g��x�py- �(ff��1��M!������.���u�,�r:H`E��@%���<@�P�;�Tl��
���%a�~��5��,AO>4OR��.��[V��������PX�����N/
�`)�������'��@�w@ �{A���w�1�����\�1����g+E.8�����|E�!�����A��"�W���_5T�p�����,�
����S�eL��V�:�Y�!@�� zT�^q�j�f�G���4|8�[.�w�X�����!/��}�A���@	8wt�)8����H�9��P����0���!�m���b�dU�a�d;@���� ��v��%`�g��s�7���>�r`Q,�BM(�DFy�.���D ���~��{ B���P�������N��P.B���El�k�)d-�7i������0UyA�'bf(���J1U&1ax������a����,���A����S�:}p�e���o�6T�:T�[�6���!~J*�w�w�/���H&A��0�W����<O�,q6���I%��.�-��>����c��/@�S7��w��x�DK^
�?��) ����������&�i?�l����Y�.uFV�RH�O{$@�{j�������s+p
*���da�y��
��G�[��hNx�lS�.9�� �[����5��x����,��p���������Z��Cg-m([K��vrCx"�6��J���X�����N��@$��k\m��������|��O��;��,����+z���86bQ7���������c�u6�M����u�1f����;��8������y�a5���@/�.�!N���a�d�	e���`�\��Ci���t
;4���-6p���1��+#*�����Ex�7��n?!x�����oA�����^n���y����c2p�j�/��Z����Bd�s���H�7��y�h��wR��B�%��,��������41
�e�6�/h:_������~�����p��P�~��0�SL9R�|~-�yb
����0������W���N@~�}e�2J
�R ���38�Y=������������o!�`�)�	�����
��e��C���	]�l/�����u��2�S5��d��uC�q���
�:f�2��������w]�B�92�T�M�v�#���7�o��<k�qb:d�g3�^�^]�p>g��WxhTGi���lHM�R,�
��lc�\f��������Em�h\���hx��Xq;Q&�����Mt��\|���7&]�����N����>�=�{��L�:��aR��[	
��	xF���N������y���g����^��`@��6���4 _������<l���R�c�%}u�f IDAT\�$7��[�����v{f��o��� rE �_l�=.���s��G��\�A 6c�n�r��}�s��G-����([`�);����Qm�v����z��=\!�a�I(�P����TX�"t!5��/�����@���$�l��������s���n)R�g���a��[����a�&x�6@&��������/�(H����M}m���$��-��$��u�b[�k,p��wV���/�7�8������V�<�yv!�$y�eXP�`������s��[��np�@}t����P���/�+n.�v��L����R�a����3�a��~���=�}��pp"pwqW���s�@sv:��*!�B�y:~�=.W�p%z�������!y�:=&p��<� �a��8bNo�,!�qHV+��y���!�ci����C.�;%���W�AKp@��U�f������)�h,�K}a�"��\���2����-�;2c�.���^F�<�����G5�M����#�u�p���F{��.��`�8��\����C��f�1O �`@+.���IO���������?�s����_l����?}g�������_��u���>��:��\���w��t������I	��|?�ls�����?Ny�z����(�,�M�_,��W�e_��2��tc�?]~����>����gi�c����tl{���}!���p�������t��s/�����_�|�x��?����<����/��� �_�����qH�4��{J��<��A�[Kb�8������M��SR�7:�xj��=
��it�UG���c�g��C������T�__�pF���c�?����c.
{�~4�dk�*@�r������8�?������	��aGR�4`�|nX.k�.��e#��&UE��pCaF0�/�.@<��o�:����8��f��e�Y����uOqx���x�r�s���b/��bT0��n�8��u��3R�a��4@Q0�*H�&&���1��\������\���0���lf�-o�L�-���m�����[}�6->����P�Bl�hP�������A�1rf�04���a6�u��E	�s�`���*@H�F01p�$(-�w�,�c�'l3>T��8���j#3���}�r}+�Z$��.��{�^_ u��L�wI��t�
�������c��Sj����x18O�P����4��;�&F@���������������G����b���e����-������2(�8p�^�88����8�1r��B������B
R�O=`��=����;4gt\�3��@���n�G�*�W7���)��m�>P�k����FP���\�e8vF�����
�h�>��w��m�h����0�c4����4K.`�h��3c`L�>1�L���������D<.>P��;0�HQP	�67]ZU5<^8��4u�q����d�����@��SNWC�0-�������I�� ��8p�!7�UH��o�@��
8�_��8���2p�#���]*�j*�{���d��a��P�v�{=�$P���a���]u�G��h������ps �����q�^�g�s��Z
8<G&U@%�8'U���:�L������zC�`-m�
�
Y)���!@60��4h�/�������q��ZK+	�yGx�4�'�����!@�J\�����yr�
Qtf*p�q�@�B@�O3
�Fcn�Y�4�V��
B��AWI��Q�uX��� e0�4J�U�@ny�G����C|���a�0���3�GjB��WS:�	dm�L�}R'��B�����@�=�������w���i{�~�	�)�v)���@��z������\Hf���j��F 9�:M��k��d|��~�~	�vc!��!�����lx�&c�1 w��y��{�6�h�G����sKI�����wf�|&/��>�C���NJP�z18��a�7��	�����CH1n��7@,��P�@����6|E4tr2E3�c^n#�~6�X�b�!vVH�>r`��WX�m�w`�I�-��n4Uy�x��^�3�X�����G
��mE ��dP#H�u�������A��0���"����.@%��+�c���s��G�� ���j�����.F�>6�%��u %� ���t��B��c����Y��?t����L�q�=�\>f�.����s���UH
�� ~�"�o.�X!.�(r���.%�V���D�F�vh�l`�.c��#rn4��s�<�OX�Z7�@�����@����2�rTT����@?lZ0���� �i���2�!���'o��a��\�A��������LH��epb�N�xfD�b�l����
EM@��|��8� ���x�k)��%�cp��c��Yu����[���	P�6X�������"�m��55e ���=�4���2��--oI�
\��lk)$��4������O���<����<l����=�[
��e����P@p�)��6�Q8��-@�O���3@���^.T��z@���q����j�jW�-p*�	�������K@��@:�|bq���v����
��2`�y�B�8?i>�0����-��*N���E�����N7`�{�B�9j�<�%c�3A~��b��Gz��L�Q��i
L`%a�Ll%�l}����*�����j��.C��,Y���;.��z����tf��&=:��.��� \����C�	��kV�dP �yjnZ@�=�0&�d fV.,`�%��-����$��"��{.@'@�c+�d @�w{`l�����,$@�|���U&�l`�>��0�0����!�@�!,�hg�V��8�[�F��G�~:�wn
��6g'���=��������tW��������0b�Z;
> ��(`��b*� ��`M���N��!�<1��
Ml�� ]�p���FO,���`���@�(B�W���EEW�JUVf�!��|�kC�-���B�k�s�<I�M]��[���%4�
��RD�����7�N�pP��p9�Ol�C���@�����=q�0+�y��hI�����[�����f���C!_�nc\��gh_�p��
7`���� �YP�%����9�*2����R"�r��8JRqPJ�5���#)K��G�0�:��#@��e�E�
x��qQ�_�����E2�t���p�J�d��Q��p�������jI�RN����c�|�$�@��=�c�
��(3��3�vj����;�0V_��[����U>�`����wo��=p�,Rr{�[�1�0���T$z��n�JU[T8#�v�$3<4��.���!:[�m�Wr�peG�G#��'�&P�u�HD~h�N����8�[Q?���f����R�2��9v�:��r���l	��5M���������8�zW��R�������
�7`D*��b�e����_T�!
���k�w����{���S��"~�z�b
��k�t�������Sw�O�%p`bc5��t{�!�N�1���s��f�8P@��Q�����,��cX!��t�<�
1��=x�)��)QfKr���J8�[lg2�d�>oU�Pc���U� qcl�d��<�*�5z:_n��V��w.�
��H������+���P[�<�M	��d�I�� �������jnRHC2���(��O���������+P�b�@=n�&��=�@g��6c�����`�wQ���a}�a0���0t�4��������"@]�%Zn�@����*����$��d�����
�����.��d�����79�@/>~ �l/����:�	���cX�����B�9zk�\�1��.)��
iR�p���Q�C���(nM-ZU���V�X�`�+���^}H#. !7*D-<%L��A��t��36��
s��17�*�O=>�`�>�QW@)��A��f������U�����{V5v�����G����L�_W��*exN��;d����8f��@mO����)���@VJ@�E�����Z�MU6A�H��T8D�K�_F@]���9�Z^\���$�|�F�{
��b9��0�<��#KU0�����j��� ���B�9z�U��!@����~��P��e��pIJ4��5��I%w���+�U,��`�>j@M�� �z�b�4�*��0�����b���N���4#@]�x1 �|����������kP������h��~����y���'l�27�i���J�F}'m^�P�����R��A����o1���i3t*`�X��P��EiUn�n��	��<��Xd�6� ��stT�;z���o���V ��
`��C��=�����9��60�e[�|���2��f���v@��8!`�R\<<tit�_QT2�u�1�*�����]�{.��r�����H+d!�Mz�����`��{�@>nU��_o�����[N�
�<�U��7/�!�z/�{��.���$@]�v(���m`�W�Y�~�%��6��um��l�4hncdmx1
^���!P���)���E pn#��:P���������7 h� ��&'��z�[���W�����;�B^���@���Z*6�-���k��
��@�0��W<�AJ�(������z9Z����]w��� _d�\j�v���u����-�`�>��X�}�j���e��
����=xK�r��<+}s�U�*�g����������^������o*������4���u	�����+�(Pl��~���D�/��!�`!�B��V`t�]�bm����~
��5���c�����rb�B����������B���`��6���w�/����ql���{��\������?���R:���;�|���B�������V��@����Vn9�>@�o�f@_g24�H,�����:���u���FB���t���^`��zB��}Pk�� B�U��n����W�}�u�s���9L���l�-��.��V�"�j��E����Y�N���0�����@�����	��^-��)y�n>YV��{T-�����T��B�bIHm�N��]8���h	I�L��{5p���n�%� v�j7��vO�eR�\��(��`7�p�b��1��\�x�%���~�@� �{� �;����U��:�\���>������Y��C�P\�<�)f���9e�*Bu�~
����f6�xk_?���'��]��+ �(iR�\O�n���qo0�.�k,�B�9z��+x��0�����`��&1���w)��$��'���O�����m�02���7��=W������`�f��zM(c������E�D�a������/�9�������������X�>��B�*@p�A���I�
���?���M�t_6��*�<��
��iv`��@�[��������F����c��:���z���}�5���1�
C�yc��eW��t�x��t(�c@
��R�54�r�,��|��������]�C��x������+�m2��(9&���[�FS��
zx���������`

�]*�25h^J
��P�C�]��P���i�}B�V`�iL���� -~,���T�/�}'t���� �G��B�E���6��s�����`���  ��������^~�nL5�����-����;��Vb��E[���dX�khX�dL{��v:4<J.�����_���.>9�~KT}6J���
:4�T ����I�O�t�����0M��j���6�f��|����P�8�0
�P��?��������/��+eX
�����6^��W%���hz�����p���������Z��+ehr�`z�m��m�A�U�j�@���2�(�����
�W�`0XG@D�Z�V�Y��eY
��D�`�!��r���
�85��k	�t�n!�)�0���W��B�F����d5\ �`�l�JV�2��G1a>�����FYV��>\/��U-�����* ���p�FP����?���Oh2�y�/�
��j�@�(Z�^Z����s=�����u�R��W�j�@�8_CnM{i.������<M��4�A�z���@�n��0�U����
�M�lx�@%<���0��E}5�B��U^K�`�l�JV������� ��Q����`o����Q~0��lo��f��������G�Cr*�m4u�o`����5E�7aONu������Oc���I����8��m�mnlOi�afA(����������W��KT
��������:}�k�aC������WN�
aU&MZ�X,��_0����4d���6����Z�9����@�Ud�����[4u��1k�/�������'�����4K���|z������A����i�:������wo	Q�#?y��Z�k1nI��M��:�"�m���D������"!)���t@���� 	�{�������H�WP����3���\�$'��������m���q�IbIj�,�N���e��Y��P=��=�Y������k��wS@.�,@EM�����l�zr���h��]��� �� �� �� ����E�*�{�b&���bT���K���
S5�S���f��aP������U������A/�y1��uU�j��P����qF�|�0s���>��QE�u�X���A/��_��$����0���S�����%a�(�[��{��+�[I�������z��
/K���a��0+����q�<��A�<
�/�4�9���yR\o��B5�T���W�u�2�C����u�����V�K V#K�v���S��=�g�if[�/1�V�,��3X�a^��k>o�c����?1���)�)��'��0(�������O-jT�^�!k:s�^���V��jd�p�se��=��q&�1�G5������AG|x]�[e9�N�Z|���B�u���I@FK��*<Q�4�9�DZ����*��T�\���p�[���``��WIDAT�``�``�``�``�``�``�``�q�����T�1��?(������������;���Y��bH#��@_`/�}�O�������B�~�����`h�5�A7`/#4��olSe
�����=�����������e,-b@@"0��Cv��'/�����;����j�Y�X�����d,-�����/0���{�{M�Z^	�����[�z#0�'��y;W��>�D`/�����t�Wj���zC����e����'�"Xm�zxu`�Tl��s���G{H�j\�/s��.�}[:����a�u�c>G�8�1���V�-�=������Dx��o�HA���_1����u��`@$0�����!��A@2����g�5��Y�����\�	�e�O.����]��=`Bw >`cs�E�9���	:O� �v���B�Ho�
�� �� �� ���^�u�4�IEND�B`�
tidseq-without-FPW-4kTPS_lagOverSize.csv.pngimage/png; name=tidseq-without-FPW-4kTPS_lagOverSize.csv.pngDownload
[Attachment: tidseq-without-FPW-4kTPS_lagOverTime.csv.png (image/png)]
[Attachment: trandomuuids-without-FPW-1.5kTPS_cpuOverSize.csv.png (image/png)]
����&�>@5l�B�n��O����?#�Y�@�z:�8��K���@8v4�G�-��0�x�1g�I����pW�f�Q0Ke�S`N��r`����$(X��Y8�^e*l5!���'7E;5��w+�Z�����+t�
�*@���|LJ��$��P�[7wW�G���TK���Ct�N�����n{�����C�0��U �~��M;���!�V�SNC
��Ir������3�>-bW�$`������0�s3��{I#)�C�PR/��;� ��G.1c�x��.�hq��Z�0�c��a|RO':.5#@�(C?�xE�}�-�b�jR�'��O�`���x�^$��r�/<;��+�T��~S>�Zp���=d,�����������;�K��4�����a��G$�/��-�W��d���<5k\�X�������"���������5C�@��4�	GR
� �x��x��1K�C��o��[<qHb�n?��W��Jw��z�"S����=8I��%�`
[���=�oJ�@`=`�mgl�X��D�OI|�~��
�;��b�<��o_2�#�P14h���C�%�[#���@4��K�ly�
����*	��"�'�&;�w�p��'<�yr�;��*�f!>V�2K�@_N6�\������d��0�"�����������6�k���{|#��]Ax�������"���%;�������0����1L~b@p*��kb%IJ��n����Y�(�J 3�#d�[������*T�����P�\B�>�`�)%��iF4�	�-a�0E��Tgg��4�CO�$L��G�`����3{�<�����en� �����hZ� K	�8��{���x���N�G�csa9x�+���l�����>/L�\l�&<1��m����d.@h�}y�[p��������0!���oz��*�\$�GQ��$�L�� �8X/�&�f���E
Q����fpA���*�IHh� ��<7���6���
j�����>)��u�`i9�s����# ���=pG��s��.�Hl+m�	Aa�@ ^�[I�IA���l��5T�j�����=�wS�Q�`K���*C�^%%�78�6- �VA�i�[`���F0�zoI�n���U�{�x�������"A��_�p�0�0n�����=B��
��s�G��0���z�m��B��"A���|	�3 y�_a������{H�r���Ur�����/�g{$� 0����8�v���w�sx�P��n)��[����t�@#�
`~%t��3�z{�:D}��m�s*�/a� 8����XiQ+@B=H�S��x?�����X1��#�8W��CX�8�@
�
��q��o6p����]��`�{���v-�IB\��!L���c��<�����Z���P-m��qW��kk�y���|�~�Y��7�x9
x[
BU���+��(����`GF6-����[&

��:� 36�O+� __@��ji+n������<�Y�6��ol]�c`\�����9%@��=m��wV�+�g�n���7mA�,p<�����&H��%F��_�-\m���P����x{�=ehC����������� �c�B�y��vb��U�e�����O5�F��H������$1,�X\�vn!p�-�x�v�5�����Y�WR�Y:;����m�R���#
�N0�GO�"�.��v��������m��@�	z{��P=m����>�
�%PC�V79=�^0���<���`�x;��}�;m�����Z!�e](!���U���������7�~&5��H������`�	P0�"����-�S��>O����{��P5��Y_2�3�f�'6Om'��N������.,%�xo�����]R�����)��}rDV���7Bf�ZS��B|���+���.V'��W��:���x�����T�0%�{<!X,VZ6TK;�<���0�yb;�+�7%!c7)�Nk?A)v��M��s�p�L��g���$�HLN�;`�@�i=�������i���C�n���������F`b��v`��%��[@ �r�e�~��H�N<y�; �"��p������&'����9s\�?x�1R����&���=e`C��W��y����j
1��0#�:�J�,���+�llo��7RG]6�l��]O�R�>��p����mz���P�Wz@t�n�x����	�������=�����0�X��7mh�gX,�1�`�~X���`�~X���`�~X���`�~X����F7�0���gb���7��F������+���L?�zu4{�XG��g���U�Y��+T{ot�z���(������]���������w����ba�����\���=�������LL������]������}�c!��h9;*��zu�w���=��%��b%��4�~�����LL��;ww��k�������R��v��*}�?Q��*
��7x�/*Z��LL����6��b��0�T����uT{�\���Y���+Wr����'��?���2���uj��T���c�:����_�����\��@��8?�>��G5E�:��/j��[�R�J.��rZ���pr�=^o�[�:�=jU����7rZ�*�P���8�����/v?e��1gw���������8�`X
!\�s���`��!�T�����~�J:i���o0���%�?�c�^������c'�b�N~�����FQU;/N�!���n��,����W����(�����w������O�}'��������v^��=5��;�n�+kG{Y4��h������L�N������P�@1�b� �A�(P�@1�b� �A�(�?y�`2��}������?
0#���j����x��ROW��pj��J�@f$=����<���%s�
�@��snHbZ0#e�����OK�]��]�^j���5wk��M,�������W-����2?q��P{6g�6�~�a?71&55��qo#%����+�h�C��=�=�� �H�m�.Mj�(���NJ�m��_Wg�/L��p�;���)��EOM��N�@f�q`�*cw��@����u2������U�+M�O�@fD���%�o�k��d��/OuD�/5kOM0�0#��D�S�e��2�/s����G`@���A��}e��
����
��e����t{����@f��N@W������6�%�}����	�h�F��v����2���@�

@1�b� �A�(�_*F�0���IEND�B`�
Attachment: trandomuuids-without-FPW-1.5kTPS_iodelayOverSize.csv.png (image/png)
Attachment: trandomuuids-without-FPW-1.5kTPS_lagOverSize.csv.png (image/png)
Attachment: trandomuuids-without-FPW-1.5kTPS_lagOverSize.csv-comments.png (image/png)
��}J
���7����{�>���{��?��~�V��W��������{��2�pd����7*����9|�y`%����p�z�y��v��P�\s)%U'Ww	l���������kj9��t����[�������'tn�h�+�+��RY,O�����64M�A�/S2��M?3�:_��y��+�s�7"�o�Q�}:���k}�t�)��K�%�_���T
����� ��>�����dH����f����gat<����_�?���6��O��FU>������T�s����������1W~6�<��������9��3�~l�^��$��Y]��������Y21�7u�u����1q�v�6A�����;snRA|� �-��Y�8"���KK}����-�����z����)�r�g,��5�7�>]so��S�I�I��SuK�us����~�q�����E<�x�8�������K*?t>R�.w>}�����a�n��/&_��7��m���Y� ��������{�h���T��^�n�w���v���S�%}���� >�����k�EW�-��{:���J��R������i���X��u�{K}��}�H}�>�U�6l��O���hN�K�qy�}�cH��2�,��Aj�s�G�~������}��������$����d�6��/D��Ro�P�������yL]���+�S�0����[�x�h����Y=�*�Soi��7V?.�[�o���S_.���J���P�r�sS��;�?F����ASK1�-����XYL�������(N���^j=_f�c����>Zo�������������X���n7S\�Z�T��k�b;MF�X������F�[6V�JUXY��������~����o��V���:6?V�e�we������t���-���2T�[V��O%[���c���Dn���~G�>�^"�������b��>]�`?�����Z�}�\���tZ7�>|���zz���o�^�z��k�|�g����������}_vl��
�������eqc��I����i\fl�~}[Ol*���m�������>���uS����F)�����T�~�����Srt����9����S^��S��}��g2�����%�pt�g��w��p���z����&����_w��o����*�4u���u0{�'t�b��|��[��~j/��Dj�r�b�?u?l=��������M�U�d_T����dd�]����cl�4��RT������\����K�-���������s��s��V{�������Z���n�}�!�����i�45�2�|���&�;��?�-���3�5�@|�����?�M����M��8���N��T�*+�fj����}�S���yMS�(�����e���������?N�����[�ij���i,��o�_��"��Ul��/u����|�o$7��v?�&����D�J��/�k��d����S���f��y�*'�q����'����������Q&w��+����
��0p������S��K����/���!�xr�bmjyl��_�����|l;�[�����l��i,��}�Sk����_O�����m�P���L\����B��&�������:�|�@���>�0s����������[�����e����G^6���('�
�=��C�~��.Ju�������cR�K����h��m�GoQ������g!�:0pd�������F����M����o�L��������j��_n~��W����}������Km����y����n��z���G�m�e��7��T�}��y�n�o�S���R���#09�)�Y�o�����������!���'����������.�y,3��z���n����?�����+`w�
j��W���b��A�ue�-����9�]kK���R�>�fj~�m�����������yT�}����j��og4���:�
���i��.��}�EN���*3n�u���2Z��/�r��\������vh������s�����z�+Oj*_���/���H[�����Z�yQ�b�!��e�6����K�XeE���f�t�{�L.����XY�)/��}��`l;NL��R=�uK���R�lw��n����b�t��3�c��9{{m��-��O�{�~L�@w�*������	o��&U6�o77��������8rmn!�-Ms��}�S����/��R��Sm���xzv<6����i��c�z��u$U�1`.���_��y��w��������������<�{(��S��KX}����e{�K���
�6����I}�i���Q��`�>Q�)�?a-�r/UfT����7
����J���m��_����1N��q��)�m�|n�����s����c�nJ�S�N�s���8�}���2+�)=���/���>b!{����`���z��������9�J�|S��`r���}��S��ZW�s�
�\8��7�S���O�T�Qy�m�uRu�������Kn����q_�o��y|L/+eu����u+t(}j�p����x���e<G��t�^X������u�||4��/������k�Oo����u���>r������?^'/�~���mq��yU�o��y��p%�gu�7���Q=-��!�m��~��v�������l��xK�Fy�eC��JC�2�:'u�� �����=�nmwS��v�Y����3�2�k��Mry�/��X}��{V�E��\�Xf�n&._6�<����g5�\��n~;VG�>�������3/�}>�g�*��E�.n2v|�*n��t��F�-�w�.���XY�
��
��7�����^P=�u�����������3����9����.��m�����T^�2�=��|�b��������?������
L�G�OT��+�2QY|������2�������Cm��w�m�)��������������[�9'_��n������>?��������������w������
�[a�%�[���`��!����"��?qt���<p4E��p��/��{�vRekP��/�*����T�}������q����|Nf��m����\#M��������-���o�?����p���]����5�E�f�u���Z����]?������������
�\8}��XMK>yKm��������p-v�%7���7w�W8'o�g�5:�/��f��p��c�����D�����Q�,�1����T~��~�o)n��]��{$�xx�y��6��w�v?4mL
��������V���=1p���������m�v	���|���}���:��~���*����:~Yn~���;�����)��+�/�|�Lk����:�1Xl�?��E����;7@ogK�_�/�w��o����{!����������xKx�_�g�sa�(�'jn~�-�j�z���C�c�R�*s�D�zj�t���/����z����>���T�5����us�RW/S��!l��ZU�?s������X����� @���\�|b�W��L<3�����������v��	U/�wb����4���j����4}�Y,C�+W@��k���GS�K�;Z'��-�N�[��R����^�Jzo���������zpt`}�/����^�k����(�����_�.(3x��Dj�~������n'�~���\${��-~�b�z}|l�������x���������i�9�,u������>io��-��K�K}SL�^��}���a��V�H��3�_=�j~�NR��}J����6�}V��c�=s�n�R����h�/���������~q���s�a��1�O��#�����p��g�sa�a)��~��>�h\\qvC
���o�����ZY�O��;��\�rU'��*�y�X���\���Ok|���������d�V���==
�a��~j�um�fn�]����1��0�����^|}I�U��C_��o�����5$U���6��v�\��,�����/����N*�O
��|N�"Z�[�w47h�+?*������[W�����[�5�o��OT����g���\p|{~C������7��7�`���������
@k�>i�-^�O�J��v�Nl\n����~j���Ml�$�W��z>�=����������������R��hG�k)��3`��g�F�����L��������6�><����|��"2���:�
n����t���'����+����V�������4���R�k7����s`�h���m�(����3�1|���l�Aj��m*�3���]1��e�������O]�B.���#5�������R��,������9i�[���i3����d��j��@��Sn����gL������Tr���#�����V���3�1������Y7���P�V� �}	���g_}����4���v����X��~�g��|��u]��E���m,�������md����a����aX��D�i�E�&����N���Q���Zp?�����Ux��T^�z�=0�����2��i�>�67�JX�[��3���/�CMX��U
�4
�a�2[g3���v�&���R'];2�G��]��9_H��@js���>�|[v��{�b~���5���J�h���9��:�v7��������,�]��o�35�%�mKK����v0��3sk��=���t�����������.�W_o�}�-�#W�A����K�J����P�]����,
M	<�R�>�4myw�S�����Nu�tf����s���\��s�������sl��
6E�|z����j�o��7R�x!������Goh?b2C������5eJ����m���X����p<�����&�����c�,Zq��O������)�6�s����
���F�i�U,l`N08��:����i_:z����7��S�"w^�\F�<��F\n�rmV��{~���K��s��i�+�����}����_C~~��������9�m����Y�����'Tn��������3e���9/�C�d���;Z�JI~AwU����[)�Kj�U�$�v*�8��0w>��`H�K �b�\���l?.������������s�c���M��G�9M�h?So����F7��X���E����������9��&��C�k�%;>�^OG�>�X�S���g)�����Z>�R�C��!n�'����O)��n�V1�h^30�mAE>@��
�����g/7�tQ�����~���.�Ao��R��'�x��zI��	+���=�������Okj�����d���f�������������<:��o>x���������.�B�Vwu��u����.�m�j2:H�����x���/����t�zn��{d��~�e>m����������-�m���1�:I%��1�m�_w�T���k�"~��L�z�d�R�bU���W+�_'����|�]}�uy�����	Z^��S���#����
�t���������Mt_�R����xx/t�����K��K���x������������S��R�#�Z�}��������i�6�?�K���7�������c����x%m����km?<������Q���>������TnW��K_�O��O�T c��ES�$��E@�X��C��2��X��;,�?��K�aWZm�>�(��N�-oJ�^Rw�������_��d_/#A�������������_[��Z����������e������n�k���_c�C������(�\~�|Jj���y�������Z�-�1���OI�U����������H��J
�
��0�YW�=h[��H5+����X[������oa�������n��y�EYd���>��R�T��n���wW7����)��3�����!6��<�s���\@{F�����k��Ot�Uo��I�i��^�����{�������{-���^s���?������7��q�����L�M{Z;��T�{��I�5��P�
�K�?I���|\�����oc_2�%_P��m@�x��]k��w_gS�l�2W ��jo���.o�~j��R��J�������^C���^d�k:�_Uu�����T����kj����_��4�2������8O��r}�J��|n��V����&����nv
�����cr�����mm�osl�%u������@
��1���O�R�_e��~z�0 �m%U�1�2��o�
�2V����-��?.?����]�a{������w�����~�������H�q�*&�-���c��� ��������M���)�����>��cd��w��o;
qK��]��}����r��\�J�����cSQ[���]h�#�	`R��z]�?��z���@�Z2~�qY�o[R�����z[�����-K%�a�K}���'nG�L�l���� �4�)����W�����������O#����P@��~{�c�������=���s����\�\���kx�q����_;����������~����O�����_���A��|�/o����`�l�E���@�-%��c���pl�����nC�.�)���o�\��)��'v��N����%��dl>�L|���8����u��'�J�6l;1���zS�?���cj�0�0B���'w���6�Tm�4_���+��m��D�~��/ 
&��f�D�]�w����3_�u-���6x
������K�7�K���T����e��d�j�O��v�]�)� ��O��k���]�[J�_s��?~�D��a��V�&oJ�"�l�Y�����A���i�=�[�?f/W��Sa�����%��m����?s��5IU�������r;f�}�� ��O�.�������?���}���S���������:�9����{X�G��1`}�Dw���ol�I�m�PX����f�7��]��5>����
-��P������7�^Vh������}�Bn��r-��v����}�v����_�W�>v)u�h+��i������9��������~��yQ����U�f
/�6
Z�/��|�����6�S�f�X��V��m�'�������d�'�`����|��Y�Z�����uh����9�"�n��-)��4U�>�^^�N��b�����Os����/��n�V�<���`&~`@��7C��N�Kv��(P�&��K����a�a�/K>�m=?���G�������
�vT���D��1���\��(*:���>@]�o_}al�S��F���+��������f��R��
@�`$0��QK)��T0l����������}���b�,�3�|G��J-�s%��w9��z���C�����#�������4R��~�?w���+�R<�R����,��u�45���J��9K�1��cn�%p8���H}�^�������� ��.���+n'�-��Vm�fC?����i+s��?c���c
�����}��������}�}e����R>C<�-��������&�9�������mH�������N�R�@�`�^����k���NP�I�:�n����%�f��6��kw�?��S�
u�sn��������4���[/n���������e�;f�7c�������L��2+�m���+��MI{�~<F`+3��������n	N}���i:��y������CW%��o??���zFg�1��n�m�bwSKK��e�9�}�m;�/>�T.���/3�m�[P�����b��&���Z��m����iSl����:��`�����p��v�p��k��.�i�����,���O����\�CM��e�k]���;%����v�}��r�������O5g�h��E����������K����/��cl{>?�Mi����������
���@��HN�?t��|�||������vN��):���R�u�oG}��)����������t���
D�B��.��f6`���;h�.�.
��S]w��c��T�/��+�3j`hg]���x���L���^#��������62r@�������|�j
u������r��{�1������������7e�����uC�o�~�-��[��N9c�;6���������}����,oe~Y��kS�N���������-�����kx�Up�q&w?���g�W��z���?����+h;Y��Rm����L���L���v����?����/J��K��n�{�w�4k�mq\6�J�vU.��\���.��53~���R���y��fsu�5qQ�[�&w�����5�k����_�J�_;��r[f��������|�?M�J��E�����N����������Sk�9�<U�%kJ3]j�|��g�y+���3m�2�JV�O��of4���-K%[��:>����;'6�X7�l�����~��Rj�����yI���\3�>�����L�6Mny.��%��L|~��_��������E��&,�3MJ�����UR�/�����;����g��|vr�/m��o?{�����K����GvA�_��l��n����r�`�4�o�{�����]�����K�u�$����6Y>�7�on:��e���/���y�s��?��[{|�����X�)�o�?�����b���`�p{��&_�n���l��W�Ua�v8�����H�m���/�<]�7-Kl�v�O]�mu�N{R���}���u��j��D�Z?�Wj=����8o�y�����mG%u�~�?�?V����>VO�m�^rS��,������o?����RZI;<tP��m_����5��l��`&~`;����>|�J�����g|3�rT-����j�����W��i��;�El�<d��
����u5j�Ni���c��qZyJNS�d��]���y�c�-���i���dn�Z'��]��V|
jY����8}]{��x�X�)�vS����M��7.����M[���N6��o��'_�i;9��Q*�J�����z~~o������g���z�;7pq��^i�!���|���9�������R��vb��}�M�2��}k�m��[Y����ig�n��RK�"O���>����*�\�R��==��������jy.%i�?��X�Q����x^��O��n����=�������o�m���~������`�Rx����A�&o��a�/<��?��m^�S����U=����vrR����g�L�d��$��n��]����_8H������U{GA�+�����������;��y���D3��i'��.�A����s2����������{K����p�'���I��\{��-��?{O�:~]_��={?��/���������s`	X�LS��<����+�5�����a��v��V�[����
|Sj���������u�����������ACu�nk���mO�����q�tf���@#~��_jUgl%�����F��U������_�����7�[����hj�h�����.��M��hV����jk=���5(`�Gc'"���r�_�I���T��`hs���=���&&�[1���l4�dy�6-Y>����_/5��O������[:F����������J��*?�o������<U��6�z�q�u�%�f�����9@	�c�O��O�`I��/������F��<���x|>��9u����7w�V����Br*���M~��������������E����;)�����������[k��2vU���3���N;����>
��k7�6�c�:���+�cm������m���{d�85�8�]yJ���d�_l�����m�Q�����N�W!
I�����~`
������zq�O�;?��T���o�e��T�lnG.���o��K�S�-k�Z�k��cS�mO=�V]����f�7�����C���e1h��������}q�
�������
m*�~�q�������N<���vWE%��������a����m���5c���n_"��Sb{�~���ch���'�tU|~�����9��������m-/��hS�
P�c�yUj�4���%���v���/�]������n>���;&�w�?���A����V�}�}��^�f�z?_>��VHkj3��������:������!~=�����b{q�VS��n[]Y�]��h����������
n���b��s���_v���p�f4���b�O/Wa��
~�ia�u�^*xM��������'��!q?��g��_�-{��������X������U��4e�
`n����Y���f�G8�}�����I��c��^�On_���>��Jb~;W���������������v���m�c��w��~>��m�����-���8}=���H����}8x�#���W��v�3M=�K#���r�4�BS1Q���|y�����GO/���e������S������4��[7�F�s��u-Q���D���mg�����{%��~��m��m�����lG5������E7�L��A������sH���\P�Y`�4������o!~��O��e���j�i��
�vs���l�������f�Cv_�.6�dJv��C���
??En@�o��Q�^{^����1����>X^q��sk��5�l������MS��w��2�/8�9/�����#�S����vS������w)���~�7��k[v�6��^l����q�W��r������C��b�ge>�a�+r��������c�^!�,��v��4.���v+��m�M��������]�6����QH[������yk;ul�R8$��k?��e>��WpRe�������l#��/��S��%~�K�,Eu��u�/��(�9��>U��V?�Q��i��6����p�Z��jw���3�����Tv�����\�fl��~�������[.�5m��m��r�=��f��rhG��~�(����hc�?���n�.�o��y��}�6�3�Fa7|��i����Mq�	�)��~����o�������iw6Q���ko�Es�"��?e��������~��OK��#mz~�����G~?�vB�����6������,��R��2+�&_l��m'n�b���9��0��ol�>�|j��k���TYF�����m���
�)���?�E�T��:%B���.��o~������~�������|�,��~�mI����)��n���b>�����m�
��"�R���S��=?Xh�T��<�]�5V�������n�.��T]{������F�*��xnR�tl���UNn��!t+�r�S��Jn���n^|������^��rF<W�)�wb���V��:d;l�q�������!s����m�>V�')�_gZ����&�E����2��}�i+��n��k�M�������ij^�2`��>��>-�����+8H�];���1��A���&�_)�����2��F��}�:����uL���6��
83��C�Cm[ZU���
��ue}�5��iJ��-���\�����uul�5����k7�o������%���4��)������>G��w} q����x�J���o�y����������	�{�}���[ClO<�������[�sI�b�������5M��[zx��(�?��1�������\!#�����}��e�.�5���.h����������0�p�T����|Z�@d��W�i�i��_ �������>*_�� n�d�FUo�v�v��)+��P�l��v�����5Zrb�N*�kYA3M.�����A����2^�mt���R�N��o??F�j�]����~������������I��R%��}�F��������������y/.�l�Sq5�}peV��f7�/�����*�)�����Rlj2�-��SK���4�N���s�r�J�S/U6h��-��[<����cO���ob�����D����>�_|����j��_\��}���������*_�}]�������}W��~������/m�vm?�����#v���}���4j�mSwL~���.�sa���ML2�-������&o�%������n����&�v���$�
*��&�G��v�kj�e�+vnM��mv'Z+�A�8���������k�������T�zm{6����T#[G�ruS��m�m[5��f�v����|��Nn���S�b~*x[j�E�L��j�������=��tj��#����.J������L�@%�'mGAp����,�_�� �)�������^��9hj�)~����1*K�KSn����>�.����r��S���Ug��#h�=n/�{�8���1�J��Hl'%�N�-m��?��v�8bq���+���v;���Q��n:�/o�u[���F�y)��ow��d��T����q����I]�m����F�����
���[���-�������a������_hR��h�<5��J�l���vSS����EV'r�����Vl{`>^�����x�*�XO�����ek��g��������gr�
+�wlj���qY�j���m�������}L��Sl��"�Z���}�>���t�'������m�VO����K�*m����w��������R�on?�����������R7ol@�b��f����n���A�c������e^WO�����S�k(�<�q[��������_?�_;�fP[��������]�i7?Y���V����/���-k��.�����4��r������X���v����_7��z��6����������v�qw.�I�t���\��|���U���x�y|��:]=5���?�&
:)�}�uSl}������L��>����j���z{t���T�
������5�wS��+b ��2[�H��?k�������e����#Jl�=��C�b����H�d����h1�T�'��������~�uU������}���qX�[�O��om����eN���SO�5��@��������(m_M5�����~����|�6L��m7�������o�!�u,j�}�h�&�����v.^c��2��n?l�~_\[��?�.65M���j6�knWN�K,����?�n����G��q��	��m�>o��8Mn;�������D��M���~;/>�A����o�������h�����>�:��v�kU[���\}�ZJ��h���~�����b��
��S>U?A�4�Zj�����j�
�k���Q�������-��~X���W�����}����j���m�o�d>4� �>�5�y/���.G_/���Y��>/q�����6�T[��i����6�{�:mv ��m��+�
j���r�m_Y;���r����2_Gb����^���.�f�\^,k��A��N�X�n����`Z������v��^j�q��@���������i�/���R�C�C�����>��6���L����m���������=K����_��N��	�)�����>w���J�R%������|���Jw����5��������eS��8������R��M�";g������ci����m����T5\�.P�`&~`����V���C��k�����l2���/����n�n���-�u<-K�alJ�nh?G�����_\��u�`D��VH-������VC�l�.�Q[`��-��3q�o+,�}��������ij������D������lj��|������i��jqyn*��u1��������-k�Y���m+��rtf�5�\��_G����������������r��^[7����F�o���D���P��-�4	�������g�o��{�g�d#9Cm�sz����L�����7_rj��>3����qG-?4o���-O��)6��������/���T�0��#~:6t@���<K�i��`M|�����\����O}��a�~��}�������+$�^\���k����J��+KY�Z����h�v��E��2��w����sm�>7����[��aS[�[�[��~�~��_�����:��8wq}Q0�M�R����T�qp�g�1dly�����7��Nmoh?c���)2m_�S�[m��Ep�N���>o�y��V�?j�����
w���!X���f��zF�Z�;]6o�d�b�b@�8"�dh���l;a����s:��?����)��6���������T�x�b>���{���������B?ah��3�3��>�i�le���E��7re>Hj@�:6���m�^^	m]`�y�|R���k��O�~����:��E��c��d����.s�����$��E�cy?�����}��#+������~�+������6Ml�}�4�v�}����{��:�i��}����9����#�5��9i\���FC��j��7�/�� ��i����g�y��
}�u�6�n�}�Z`]���^�u��E�/�N�������f��rk���W�r�&.ov��g�P_�-������Y�R�P������t������X�����I7o����e�N���B�M��}����P��U�?"q~��H�g=�+�(P����_�k`�/
~;~{q��|N/���/��fSW�����������_�����aAF8�����l���#�}I��-��c|�������N��f�=���s�h��\f�����S�zT����os���ls�5b�t��:�~�_�m���y�J��<_SN�|�eq���sg�%i��������x�c��^[����m���v�vA���%��t{r�Z~;q�����}��;6�\���n����iY������X�m�9�T{*)2����:v���$[n��-��]����2[`������h�����_����6~QQ>���#��E ����NI�O�����`����m7���u��.��VS����o|^��w���0�k�z	����>�;�*��';�^���}�3���Vw�}q��!�������CO�s�?C<�l{�}��~��>�M�'�^f�������J��y���n�W;����m[�mJl�hP�G3}�g;�!����^��gm�J������R����}�b��l���y������]�N9������m��X�yl��-����9��"�����H��x-$��\��������?ls���a�k��/�����k��TM��i�S�%���Ba�t�=tZ��������-���B;�3l�V�����i_G���4�}�'����YW�������m���}�������d�n3m&�/��[R��5X���s���}�����}�m�}j������5������F�E�����T/����K<f�c�S�������o���|�G??Q�}��1���b-���_R�ce:?����]M�X�Z^�y��1��o�8�a��S]������R������$�����-�b�6��p.������/�	/�x\v^RTn�f�}�~��h�����_��z�y��^l3�v����Qb[����6b��n���fh�����c0�r��C��F�<.��I���|�q��-��`^��b
�3�^�Nn_��1��\}�-�������yNh������80$�_q�\�J,(O��Qlta����`��Y���`����p����gu�y~��0�//�}q_V^�a��� �i���|{>�u�;/��lq_��7�_l_Z���2�������h> �rMS���������l0��~�����~��?fk�����[>����K4���oI��s���m��z}�l7�Y/{,	m�n��k��;�1��Q�n���x��U7�O�-�:�3�����e���v��m���'[�TI[klg)���Wd�����~j
5������`&~)���������"���}�h��w��A��I.mY<X���M^���l�������y�y����K�-�Wh�yN�����/�v��^l�������0�:%�q���m�h>���f�����~�^�+[?5m��x�F���zm�v\.���/�W��X���e���L�U�b�����8���1D��v^���P[v��Z���J�-���X�>�U�1~��y�3U'%����-��"��o��X��
�.�R��<��=	[9�����������U��/$/}�(�B���g��P;���v;���zWp�/n�{����/�s��p�D���|�����4��hw���Z~~���^;vL��? �j����k'w,/�������v=q���)b�]��X�������>N_Sf}�b�l�Z�M;�2t<6�`������w���������xP��-MS�����m����g��N�.��56�����S�����y��������c��Q��sj�'��<���r?|�"�}�rJ���)�r[���Y�K��d����U���\h��~���������������{G�����+ �����&������(��g~\=����M_��I����_G��y��������y����:��?����'�~�||/����)����e�_�������}w����S���?�;�^�,�>�������������S�������������Z�u���>��5��6���3�"l��B���w�_�n����Ty�Lr_U>��8L�/3��Zh��~��2�`�iW�;����m����n����hu�Rmp��%�'������ A��t�M���������b���v����|[���M���e���L��f}y�]���i����S4o�hu.���)�_3������/��������q�J�%���Nc.�|��}�Xf�g��=7���sa��������{����7�����E�e���`:��|h�I}��y������7������_�wV�[�mc��O1y�2��R}(G;�[J���TR�H^��K����v����_�)������+��h�����E�]?���Fy+��S������n��0,������������Ss��Nj$I�$�H��~:[;II%��H`$�&��������z���r���'��_]��������_����J���	�|#<>>�J}��n�ap��(>`�m�*=P��?��h���M]���~�4�=:iB���D�6�}���1���V~���������7�_,�cz}*_OBw�Oc_��/�_���m�y;���c��o��u�����s�w��q���|�k���]��^�3UOL��C~�>m��������}�N�~���8�
��z���i=���vN�K�-4,��,v�mo����0��f�|S���������9_y�?%��xo+=�n�+=f;&��R���T�8����J��_��-���b=�W^��i\�cM�������uL�K<'������S/Q)/��s�������-��L��99Ob�����������<s����uF�2�{����xN�|�����,��>�ziM:��|L�I���iw���!���UV�;��s��g������k���x;��8���������Xc�X�N�g���>�������{u������4n�t}���J�o���a;g��X�(O�Q�c�Xq����3�o�WL<��%�������m��;�U��H��gSe��j�L�����d��������kMi��L�qn���
������C�?���R~�1���{[�R���������o7�?�M����k�c��by���"��S�q?`JL���+���e*�y���b�q?�U����9�;'���Jc���~Su+�bY��,��s�y��(�i[�������N9�,���-�wL�%9.�n'��S���g�u��jbY�����b�:�sqj�k���s��Kl�Il����������8���9/�CLg�'���6_�5��-�E��<��i=�]���o��9�{����o�����w	�� ���c�������to���60p��o��?��/��!��"���s�o��+-�������k�Zw�tc���>�e���Ys\��xL�y~ �i��~*S����2�~Q����<��2���tNq��~J�T�T��8�9uY����2oG�Li�n�*�R��������X.�K���Z.�����r�r��q��N?��(����O��D�N��4r^���q;? O
�]�����+������4��n�|��i>������>�(������i,+]��m����aR�D������.)3m3��9�����c���-�K{m�����m��o�,5S-�����G8vH�r����;��������[�h[���������4k�W�7��R~��Z�!�����J1��Z�TY-���Z�!���R^��[��*��O�+UY��y1?��%�Q�s�����T��b}1/���Z].��q���K���<�������2�sLY)���%�-�����-��*�D[(��t����t�</�)-�,��ei����y=�o)�qq��8�u:?�����sl����eKb�.[�v���e,,k/S?����p�W��6.�2�9./�j9������`�������_��_^^^��cl6�t{c��K�-b�E��Z����!���J�x������|q�����g*��s	�W��y�v�T���~1OZw����K�w��b:U�����8/���J�+.�zN�u�^��t��i-�V�4.��������%���-�����x�SeN��:��2�:�v}�l�N��Tr��%����)[#k�-��5�����e���~ke����Z�sp�x��WZ�m�J�/3�����}������cbO���R;��S��c��Yw����)�����9��oK��9G�S����y�C?��K���$������f�����en�Z������\��I��cY�vL�OZ]����J���%1Q)�i��J�R��2����|��E�3����������>�����4���x.�k�X\���M��j�V�o�6�/Z�
�������6��fx~~�So�����e��9}a��:�`�0@��L�&�+�������O����!�6����yh���lI�������u�w%��z���L����RZ�}���|��e�=~��1����M����Tr�����MZ���1���C�C\�;���������CZ���<<<����s�����Y3��_�~�Ei�W��v��������})K�u���<==������H��L�)K���bz3���������q�Z1����n�V>�k�����Y3���}V<�j�5cz�k���%=���ULOr[��K���M���Mn�}�=x�J�.���2�����&V�=����������Y3�J��Vm�fL���a���[��.�G�6]3�mRj�lI����><�]���~���,[��k��c``�`�������fY������K�f��+�M��^3�g�v��/eI���;��U��j�5c������G��
m�v]3�w��q)|	 >4}����H_b��yR}!_j\��mQ����G�M��o����������8�����������fvpF��K�\��y���~��[�����1�Wj�Vm�fL���C�/�U�����F�'���U����R�p^7������q��V��fL�r�pb
�`�������|�O�x�����%}�f��j�5c0/�����[��F���<pl��k��HmP������W�5x�"��u����>,�>��X�nf��E�c�|�W�E&�����a}I���y��z����/�U������ZJ���t��M�!�����Oi��k��F�?������7h���(�v��M��/V����m�/YY�7k�`_�WT��^3�g�>��/�U����I�_AE��t�����!���S�>R�~����%1���C�C\
+�M���xs���KK��c����%}�fL��6��Z���1=Q;�Yr_J��o��Am��`�6]3�'j���>��
��r��j�5cz3��������t�	:�`�0@��L�&�t�	:�`�0@��L�&X�����v~��}�<>>�[��c�X�60��4����O�6����?~�Q��b5�����u����q��tM:�80�u�
�U�8�m�%����znGp��|x|�EWO��}�}}}�
��|M��(_gV� �kC��B�&���z�Z��W�=��������ip�z�j���g������WZ�G����:fmg����]\?&��`V�
�������i ����{�D���������o^^^������\������@�D�R�N����$�����w���u>������?u~�^i-V�n�k����L��n
�� Z��'T��Mh�,O�����u*/_�!m�m��:��b���
��1Jkv��5����vT;
�5���z����s��8YQ�X�c��q�oqR ��!m1w^�:�@�4h�og�8.����(�W�3
�=����������D�h �E�oooc��T������
Nk���-�=u^�:���_�.�+�O�*F�%6���_)/��A�bJ�v���Ch��wH|�����|�;�-��;�~��-
������+����Z���g�����%����C�
�}��K�kg��m`�
X5���?�o��`}����y���Lu�-��u}V����O�1��9O���%������R[��sm.��>4��=�g�5`���Z�^�W���Y�`
�����(�����}�����U�?_���4l��5�s�������J��s������������OMV��I�<���Ys@�c�X:&�~��Fr-K��z���G#��`��:�`�0@����l�Qa�w��(IEND�B`�
trandomuuids-without-FPW-1.5kTPS_lagOverTime.csv.pngimage/png; name=trandomuuids-without-FPW-1.5kTPS_lagOverTime.csv.pngDownload
#65Thomas Munro
thomas.munro@gmail.com
In reply to: Jakub Wartak (#64)
5 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Sat, Dec 12, 2020 at 1:24 AM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:

I wanted to contribute my findings so far - after dozens of various lengthy runs here - on WAL (asynchronous) recovery performance in the hot-standby case. TL;DR: this patch is awesome even on NVMe

Thanks Jakub! Some interesting, and nice, results.

The startup/recovery process gets into 95% CPU utilization territory, with ~300k (?) hash_search_with_hash_value_memcmpopt() executions per second (measured using perf-probe).

I suppose it's possible that this is caused by memory stalls that
could be reduced by teaching the prefetching pipeline to prefetch the
relevant cachelines of memory as well (but that seems like a pretty
microscopic concern compared to the I/O).
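
To make the cacheline idea a bit more concrete, here is a rough
standalone sketch (nothing from the actual patch; the table, hash
function and helper names are all invented) of what issuing a prefetch
for the hash bucket of a block reference a few records ahead of the one
currently being replayed might look like:

#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 4096

typedef struct HashBucket
{
	uint64_t	key;
	void	   *value;
} HashBucket;

static HashBucket buckets[NBUCKETS];

/* Cheap stand-in for the real hash function. */
static inline uint64_t
hash_key(uint64_t key)
{
	key ^= key >> 33;
	key *= UINT64_C(0xff51afd7ed558ccd);
	key ^= key >> 33;
	return key;
}

/* Touch the bucket's cacheline early so the later probe doesn't stall. */
static inline void
prefetch_bucket(uint64_t key)
{
	__builtin_prefetch(&buckets[hash_key(key) % NBUCKETS], 0, 2);
}

static void *
lookup(uint64_t key)
{
	HashBucket *b = &buckets[hash_key(key) % NBUCKETS];

	return b->key == key ? b->value : NULL;
}

/*
 * Replay loop: while handling record i, prefetch the bucket that record
 * i + distance will probe, hiding the memory latency behind useful work.
 */
void
replay(const uint64_t *keys, size_t n, size_t distance)
{
	for (size_t i = 0; i < n; i++)
	{
		if (i + distance < n)
			prefetch_bucket(keys[i + distance]);
		(void) lookup(keys[i]);
	}
}

In the real thing the interesting cacheline would presumably be the
buffer mapping table's bucket, which, as far as I know, dynahash doesn't
currently give callers a way to prefetch.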

[3] - hash_search_with_hash_value() spends a lot of time near "callq *%r14" in the tight loop assembly in my case (an indirect call to the hash comparison function). This hash_search_with_hash_value_memcmpopt() is just a copycat function that instead calls memcmp() directly where it matters (smgr.c, buf_table.c). A blind shot at gcc's -flto also didn't gain much there (I was thinking it could optimize by building many per-match() instances of hash_search_with_hash_value(), but no). I did not quantify the benefit; I think it was just a failed optimization experiment, as it is still the #1 entry in my profiles - it could even be noise.

Nice. A related specialisation is size (key and object). Of course,
simplehash.h already does that, but it also makes some other choices
that make it unusable for the buffer mapping table. So I think that
we should either figure out how to fix that, or consider specialising
the dynahash lookup path with a similar template scheme.
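
To spell out what that kind of specialisation buys (again just an
illustration with made-up names, not the simplehash.h or dynahash API):
the generic lookup has to reach the key comparison through a function
pointer, while a size-specialised variant can call memcmp() with a
compile-time constant length, which compilers will usually inline:

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Generic, dynahash-style: the comparator is reached via a pointer. */
typedef int (*KeyCompareFunc) (const void *a, const void *b, size_t size);

typedef struct GenericHashHeader
{
	KeyCompareFunc match;		/* usually points at memcmp() */
	size_t		keysize;
} GenericHashHeader;

static bool
generic_key_equal(const GenericHashHeader *h, const void *a, const void *b)
{
	/* One indirect call per probed entry; hard for the compiler to inline. */
	return h->match(a, b, h->keysize) == 0;
}

/*
 * Specialised: the key size is a compile-time constant (for the buffer
 * mapping table that would be sizeof(BufferTag)), so memcmp() is a direct
 * call with a known length that the compiler can expand inline.
 */
#define MY_KEY_SIZE 20			/* illustrative only */

static inline bool
specialised_key_equal(const void *a, const void *b)
{
	return memcmp(a, b, MY_KEY_SIZE) == 0;
}

A simplehash.h-style template gets there by having the including file
#define the key type and equality expression before including the
header, so each instantiation is compiled with its own inlined
comparison; a dynahash template scheme could presumably do the same for
the lookup path.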

Rebase attached.

Attachments:

v15-0001-Add-pg_atomic_unlocked_add_fetch_XXX.patchtext/x-patch; charset=US-ASCII; name=v15-0001-Add-pg_atomic_unlocked_add_fetch_XXX.patchDownload
From 85187ee6a1dd4c68ba70cfbce002a8fa66c99925 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 28 Mar 2020 11:42:59 +1300
Subject: [PATCH v15 1/6] Add pg_atomic_unlocked_add_fetch_XXX().

Add a variant of pg_atomic_add_fetch_XXX with no barrier semantics, for
cases where you only want to avoid the possibility that a concurrent
pg_atomic_read_XXX() sees a torn/partial value.  On modern
architectures, this is simply value++, but there is a fallback to
spinlock emulation.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/include/port/atomics.h         | 24 ++++++++++++++++++++++
 src/include/port/atomics/generic.h | 33 ++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 4956ec55cb..2abb852893 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -389,6 +389,21 @@ pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
 	return pg_atomic_add_fetch_u32_impl(ptr, add_);
 }
 
+/*
+ * pg_atomic_unlocked_add_fetch_u32 - atomically add to variable
+ *
+ * Like pg_atomic_unlocked_write_u32, guarantees only that partial values
+ * cannot be observed.
+ *
+ * No barrier semantics.
+ */
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	AssertPointerAlignment(ptr, 4);
+	return pg_atomic_unlocked_add_fetch_u32_impl(ptr, add_);
+}
+
 /*
  * pg_atomic_sub_fetch_u32 - atomically subtract from variable
  *
@@ -519,6 +534,15 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
 }
 
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+	AssertPointerAlignment(ptr, 8);
+#endif
+	return pg_atomic_unlocked_add_fetch_u64_impl(ptr, add_);
+}
+
 #undef INSIDE_ATOMICS_H
 
 #endif							/* ATOMICS_H */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index d60a0d9e7f..3e1598d8ff 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -234,6 +234,16 @@ pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 }
 #endif
 
+#if !defined(PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32)
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	ptr->value += add_;
+	return ptr->value;
+}
+#endif
+
 #if !defined(PG_HAVE_ATOMIC_SUB_FETCH_U32) && defined(PG_HAVE_ATOMIC_FETCH_SUB_U32)
 #define PG_HAVE_ATOMIC_SUB_FETCH_U32
 static inline uint32
@@ -399,3 +409,26 @@ pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
 	return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
 }
 #endif
+
+#if defined(PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY) && \
+	!defined(PG_HAVE_ATOMIC_U64_SIMULATION)
+
+#ifndef PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	ptr->value += val;
+	return ptr->value;
+}
+#endif
+
+#else
+
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	return pg_atomic_add_fetch_u64_impl(ptr, val);
+}
+
+#endif
-- 
2.20.1

v15-0002-Improve-information-about-received-WAL.patchtext/x-patch; charset=US-ASCII; name=v15-0002-Improve-information-about-received-WAL.patchDownload
From 20ce8f35559df98c0ef4ccee6a0a3a5146257131 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 10 Aug 2020 17:06:21 +1200
Subject: [PATCH v15 2/6] Improve information about received WAL.

In commit d140f2f3, we cleaned up the distinction between flushed and
written LSN positions.  Go further, and expose the written location in a
way that allows for the associated timeline ID to be read consistently.
Without that, it might be difficult to know the path of the file that
has been written, without data races.

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/replication/walreceiver.c      | 10 ++++--
 src/backend/replication/walreceiverfuncs.c | 41 +++++++++++++++++-----
 src/include/replication/walreceiver.h      | 30 +++++++++-------
 3 files changed, 56 insertions(+), 25 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9621c8d0ef..cc7b4f7f11 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -870,6 +870,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 {
 	int			startoff;
 	int			byteswritten;
+	WalRcvData *walrcv = WalRcv;
 
 	while (nbytes > 0)
 	{
@@ -961,7 +962,10 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	}
 
 	/* Update shared-memory status */
-	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+	SpinLockAcquire(&walrcv->mutex);
+	pg_atomic_write_u64(&walrcv->writtenUpto, LogstreamResult.Write);
+	walrcv->writtenTLI = ThisTimeLineID;
+	SpinLockRelease(&walrcv->mutex);
 }
 
 /*
@@ -987,7 +991,7 @@ XLogWalRcvFlush(bool dying)
 		{
 			walrcv->latestChunkStart = walrcv->flushedUpto;
 			walrcv->flushedUpto = LogstreamResult.Flush;
-			walrcv->receivedTLI = ThisTimeLineID;
+			walrcv->flushedTLI = ThisTimeLineID;
 		}
 		SpinLockRelease(&walrcv->mutex);
 
@@ -1327,7 +1331,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	receive_start_tli = WalRcv->receiveStartTLI;
 	written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto);
 	flushed_lsn = WalRcv->flushedUpto;
-	received_tli = WalRcv->receivedTLI;
+	received_tli = WalRcv->flushedTLI;
 	last_send_time = WalRcv->lastMsgSendTime;
 	last_receipt_time = WalRcv->lastMsgReceiptTime;
 	latest_end_lsn = WalRcv->latestWalEnd;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index c3e317df9f..3bd1fadbd3 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -284,10 +284,12 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 	 * If this is the first startup of walreceiver (on this timeline),
 	 * initialize flushedUpto and latestChunkStart to the starting point.
 	 */
-	if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
+	if (walrcv->receiveStart == 0 || walrcv->flushedTLI != tli)
 	{
+		pg_atomic_write_u64(&walrcv->writtenUpto, recptr);
+		walrcv->writtenTLI = tli;
 		walrcv->flushedUpto = recptr;
-		walrcv->receivedTLI = tli;
+		walrcv->flushedTLI = tli;
 		walrcv->latestChunkStart = recptr;
 	}
 	walrcv->receiveStart = recptr;
@@ -309,10 +311,10 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
  * Optionally, returns the previous chunk start, that is the first byte
  * written in the most recent walreceiver flush cycle.  Callers not
  * interested in that value may pass NULL for latestChunkStart. Same for
- * receiveTLI.
+ * flushedTLI.
  */
 XLogRecPtr
-GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *flushedTLI)
 {
 	WalRcvData *walrcv = WalRcv;
 	XLogRecPtr	recptr;
@@ -321,8 +323,8 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 	recptr = walrcv->flushedUpto;
 	if (latestChunkStart)
 		*latestChunkStart = walrcv->latestChunkStart;
-	if (receiveTLI)
-		*receiveTLI = walrcv->receivedTLI;
+	if (flushedTLI)
+		*flushedTLI = walrcv->flushedTLI;
 	SpinLockRelease(&walrcv->mutex);
 
 	return recptr;
@@ -330,14 +332,35 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 
 /*
  * Returns the last+1 byte position that walreceiver has written.
- * This returns a recently written value without taking a lock.
+ *
+ * The other arguments are similar to GetWalRcvFlushRecPtr()'s.
  */
 XLogRecPtr
-GetWalRcvWriteRecPtr(void)
+GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *writtenTLI)
 {
 	WalRcvData *walrcv = WalRcv;
+	XLogRecPtr	recptr;
+
+	SpinLockAcquire(&walrcv->mutex);
+	recptr = pg_atomic_read_u64(&walrcv->writtenUpto);
+	if (latestChunkStart)
+		*latestChunkStart = walrcv->latestChunkStart;
+	if (writtenTLI)
+		*writtenTLI = walrcv->writtenTLI;
+	SpinLockRelease(&walrcv->mutex);
 
-	return pg_atomic_read_u64(&walrcv->writtenUpto);
+	return recptr;
+}
+
+/*
+ * For callers that don't need a consistent LSN, TLI pair, and that don't mind
+ * a potentially slightly out of date value in exchange for speed, this
+ * version provides an unlocked view of the latest written location.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtrUnlocked(void)
+{
+	return pg_atomic_read_u64(&WalRcv->writtenUpto);
 }
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 1b05b39df4..a72fd4fd9c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -74,14 +74,25 @@ typedef struct
 	TimeLineID	receiveStartTLI;
 
 	/*
-	 * flushedUpto-1 is the last byte position that has already been received,
-	 * and receivedTLI is the timeline it came from.  At the first startup of
+	 * flushedUpto-1 is the last byte position that has already been flushed,
+	 * and flushedTLI is the timeline it came from.  At the first startup of
 	 * walreceiver, these are set to receiveStart and receiveStartTLI. After
 	 * that, walreceiver updates these whenever it flushes the received WAL to
 	 * disk.
 	 */
 	XLogRecPtr	flushedUpto;
-	TimeLineID	receivedTLI;
+	TimeLineID	flushedTLI;
+
+	/*
+	 * writtenUpto-1 is like flushedUpto-1, except that it's updated without
+	 * waiting for the flush, after the data has been written to disk and is
+	 * available for reading.  It is an atomic type so that we can read it
+	 * without locks.  We still acquire the spinlock in cases where it is
+	 * written or read along with the TLI, so that they can be accessed
+	 * together consistently.
+	 */
+	pg_atomic_uint64 writtenUpto;
+	TimeLineID	writtenTLI;
 
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
@@ -142,14 +153,6 @@ typedef struct
 
 	slock_t		mutex;			/* locks shared variables shown above */
 
-	/*
-	 * Like flushedUpto, but advanced after writing and before flushing,
-	 * without the need to acquire the spin lock.  Data can be read by another
-	 * process up to this point, but shouldn't be used for data integrity
-	 * purposes.
-	 */
-	pg_atomic_uint64 writtenUpto;
-
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.  But we do need atomic fetch and
@@ -457,8 +460,9 @@ extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname,
 								 bool create_temp_slot);
-extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
-extern XLogRecPtr GetWalRcvWriteRecPtr(void);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *flushedTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *writtenTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtrUnlocked(void);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.20.1
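
As an aside on the walreceiver changes above, here's a minimal sketch of the
intended calling convention for the two write-pointer accessors: the unlocked
variant is for cheap checks that can tolerate a slightly stale value, and the
locked variant is for when the LSN has to be paired consistently with its
timeline.  CanReadUpTo() is an invented name for illustration only, not part
of the patch:

/* Hypothetical illustration only -- not part of the patches. */
#include "postgres.h"
#include "replication/walreceiver.h"

static bool
CanReadUpTo(XLogRecPtr target, TimeLineID expectedTLI)
{
	XLogRecPtr	writtenUpto;
	TimeLineID	writtenTLI;

	/*
	 * Fast path: an unlocked, possibly slightly stale read is enough to bail
	 * out early when the timeline doesn't matter.
	 */
	if (GetWalRcvWriteRecPtrUnlocked() < target)
		return false;

	/*
	 * Slow path: use the spinlock-protected variant when the LSN must be read
	 * consistently together with the timeline it was written on.
	 */
	writtenUpto = GetWalRcvWriteRecPtr(NULL, &writtenTLI);
	return writtenUpto >= target && writtenTLI == expectedTLI;
}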

Attachment: v15-0003-Provide-XLogReadAhead-to-decode-future-WAL-recor.patch (text/x-patch)
From c28c516c1fabd6b4e80f2f54b16cb3d9d7addca7 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 10 Aug 2020 17:06:21 +1200
Subject: [PATCH v15 3/6] Provide XLogReadAhead() to decode future WAL records.

Teach xlogreader.c to decode its output into a circular buffer, to
support a future prefetching patch.  Provides two new interfaces:

 * XLogReadRecord() works as before, except that it returns a pointer to
   a new decoded record object rather than just the header

 * XLogReadAhead() implements a second cursor that allows you to read
   further ahead, as long as there is enough space in the circular decoding
   buffer

To support existing callers of XLogReadRecord(), the most recently
returned record also becomes the "current" record, for the purpose of
calls to XLogRecGetXXX() macros and functions, so that the multi-record
nature of the WAL decoder is hidden from code paths that don't need to
care about this change.

To support opportunistic readahead, the page-read callback function
gains a "noblock" parameter.  This allows for calls to XLogReadAhead()
to return without waiting if there is currently no data available, in
particular in the case of streaming replication.  For non-blocking
XLogReadAhead() to work, a page-read callback that understands "noblock"
must be supplied.  Existing callbacks that ignore it work as before, as
long as you only use the XLogReadRecord() interface.

The main XLogPageRead() routine used by recovery is extended to respect
noblock mode when the WAL source is a walreceiver.

Very large records that don't fit in the circular buffer are marked as
"oversized" and allocated and freed piecemeal.  The decoding buffer can
be placed in shared memory, for potential future work on parallelizing
recovery.

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
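
To show the intended two-cursor usage, here is a minimal sketch (not part of
the patch), assuming XLogBeginRead() has already positioned the reader and the
default decode buffer allocated lazily on first use is acceptable.
PrefetchReferencedBlocks() and ApplyRecord() are invented stand-ins for the
prefetching and redo steps:

/* Hypothetical illustration only -- not part of the patch. */
#include "postgres.h"
#include "access/xlogreader.h"

extern void PrefetchReferencedBlocks(DecodedXLogRecord *record); /* invented */
extern void ApplyRecord(XLogReaderState *reader);                /* invented */

static void
ReplayWithReadahead(XLogReaderState *reader)
{
	char	   *errormsg;

	for (;;)
	{
		DecodedXLogRecord *ahead;
		XLogRecord *record;

		/*
		 * Read-ahead cursor: decode future records until the page-read
		 * callback would have to block ("noblock") or the decode buffer
		 * fills up, at which point XLogReadAhead() returns NULL.
		 */
		while ((ahead = XLogReadAhead(reader, &errormsg)) != NULL)
			PrefetchReferencedBlocks(ahead);

		/*
		 * Traditional cursor: consume the oldest decoded record; the
		 * XLogRecGetXXX() macros continue to treat it as "current".
		 */
		record = XLogReadRecord(reader, &errormsg);
		if (record == NULL)
			break;				/* end of WAL, or a deferred error */

		ApplyRecord(reader);
	}
}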
---
 src/backend/access/transam/generic_xlog.c |   6 +-
 src/backend/access/transam/xlog.c         | 105 +++-
 src/backend/access/transam/xlogreader.c   | 620 +++++++++++++++++-----
 src/backend/access/transam/xlogutils.c    |   5 +-
 src/backend/postmaster/pgstat.c           |   3 +
 src/backend/replication/logical/decode.c  |   2 +-
 src/backend/replication/walsender.c       |   2 +-
 src/bin/pg_rewind/parsexlog.c             |   8 +-
 src/bin/pg_waldump/pg_waldump.c           |  24 +-
 src/include/access/xlogreader.h           | 127 +++--
 src/include/access/xlogutils.h            |   3 +-
 src/include/pgstat.h                      |   1 +
 12 files changed, 699 insertions(+), 207 deletions(-)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 5164a1c2f3..5f6df896ad 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -482,10 +482,10 @@ generic_redo(XLogReaderState *record)
 	uint8		block_id;
 
 	/* Protect limited size of buffers[] array */
-	Assert(record->max_block_id < MAX_GENERIC_XLOG_PAGES);
+	Assert(XLogRecMaxBlockId(record) < MAX_GENERIC_XLOG_PAGES);
 
 	/* Iterate over blocks */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		XLogRedoAction action;
 
@@ -525,7 +525,7 @@ generic_redo(XLogReaderState *record)
 	}
 
 	/* Changes are done: unlock and release all buffers */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (BufferIsValid(buffers[block_id]))
 			UnlockReleaseBuffer(buffers[block_id]);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b1e5d2dbff..691d6a0ab9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -211,7 +211,8 @@ static XLogRecPtr LastRec;
 
 /* Local copy of WalRcv->flushedUpto */
 static XLogRecPtr flushedUpto = 0;
-static TimeLineID receiveTLI = 0;
+static XLogRecPtr writtenUpto = 0;
+static TimeLineID writtenTLI = 0;
 
 /*
  * During recovery, lastFullPageWrites keeps track of full_page_writes that
@@ -921,9 +922,11 @@ static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
+						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+						 bool nowait);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-										bool fetching_ckpt, XLogRecPtr tliRecPtr);
+										bool fetching_ckpt, XLogRecPtr tliRecPtr,
+										bool nowait);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
@@ -1427,7 +1430,7 @@ checkXLogConsistency(XLogReaderState *record)
 
 	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		Buffer		buf;
 		Page		page;
@@ -4358,6 +4361,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 		record = XLogReadRecord(xlogreader, &errormsg);
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
+
 		if (record == NULL)
 		{
 			if (readFile >= 0)
@@ -4401,6 +4405,42 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 
 		if (record)
 		{
+			if (readSource == XLOG_FROM_STREAM)
+			{
+				/*
+				 * In streaming mode, we allow ourselves to read records that
+				 * have been written but not yet flushed, for increased
+				 * concurrency.  We still have to wait until the record has
+				 * been flushed before allowing it to be replayed.
+				 *
+				 * XXX This logic preserves the traditional behaviour where we
+				 * didn't replay records until the walreceiver flushed them,
+				 * except that now we read and decode them sooner.  Could it
+				 * be relaxed even more?  Isn't the real data integrity
+				 * requirement for _writeback_ to stall until the WAL is
+				 * durable, not recovery, just as on a primary?
+				 *
+				 * XXX Are there any circumstances in which this should be
+				 * interruptible?
+				 *
+				 * XXX We don't replicate the XLogReceiptTime etc logic from
+				 * WaitForWALToBecomeAvailable() here...  probably need to
+				 * refactor/share code?
+				 */
+				if (EndRecPtr < flushedUpto)
+				{
+					while (EndRecPtr < (flushedUpto = GetWalRcvFlushRecPtr(NULL, NULL)))
+					{
+						(void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
+										 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+										 -1,
+										 WAIT_EVENT_RECOVERY_WAL_FLUSH);
+						CHECK_FOR_INTERRUPTS();
+						ResetLatch(&XLogCtl->recoveryWakeupLatch);
+					}
+				}
+			}
+
 			/* Great, got a record */
 			return record;
 		}
@@ -10208,7 +10248,7 @@ xlog_redo(XLogReaderState *record)
 		 * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info
 		 * code just to distinguish them for statistics purposes.
 		 */
-		for (uint8 block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (uint8 block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			Buffer		buffer;
 
@@ -10344,7 +10384,7 @@ xlog_block_info(StringInfo buf, XLogReaderState *record)
 	int			block_id;
 
 	/* decode block references */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
@@ -11966,7 +12006,7 @@ CancelBackup(void)
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
-			 XLogRecPtr targetRecPtr, char *readBuf)
+			 XLogRecPtr targetRecPtr, char *readBuf, bool nowait)
 {
 	XLogPageReadPrivate *private =
 	(XLogPageReadPrivate *) xlogreader->private_data;
@@ -11978,6 +12018,15 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
 
+	/*
+	 * If streaming and asked not to wait, return as quickly as possible if
+	 * the data we want isn't available immediately.  Use an unlocked read of
+	 * the latest written position.
+	 */
+	if (readSource == XLOG_FROM_STREAM && nowait &&
+		GetWalRcvWriteRecPtrUnlocked() < targetPagePtr + reqLen)
+		return -1;
+
 	/*
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
@@ -11988,6 +12037,9 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		/*
 		 * Request a restartpoint if we've replayed too much xlog since the
 		 * last one.
+		 *
+		 * XXX Why is this here?  Move it to recovery loop, since it's based
+		 * on replay position, not read position?
 		 */
 		if (bgwriterLaunched)
 		{
@@ -12010,12 +12062,13 @@ retry:
 	/* See if we need to retrieve more data */
 	if (readFile < 0 ||
 		(readSource == XLOG_FROM_STREAM &&
-		 flushedUpto < targetPagePtr + reqLen))
+		 writtenUpto < targetPagePtr + reqLen))
 	{
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
 										 private->randAccess,
 										 private->fetching_ckpt,
-										 targetRecPtr))
+										 targetRecPtr,
+										 nowait))
 		{
 			if (readFile >= 0)
 				close(readFile);
@@ -12041,10 +12094,10 @@ retry:
 	 */
 	if (readSource == XLOG_FROM_STREAM)
 	{
-		if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
+		if (((targetPagePtr) / XLOG_BLCKSZ) != (writtenUpto / XLOG_BLCKSZ))
 			readLen = XLOG_BLCKSZ;
 		else
-			readLen = XLogSegmentOffset(flushedUpto, wal_segment_size) -
+			readLen = XLogSegmentOffset(writtenUpto, wal_segment_size) -
 				targetPageOff;
 	}
 	else
@@ -12164,7 +12217,8 @@ next_record_is_invalid:
  */
 static bool
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-							bool fetching_ckpt, XLogRecPtr tliRecPtr)
+							bool fetching_ckpt, XLogRecPtr tliRecPtr,
+							bool nowait)
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
@@ -12267,6 +12321,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * hope...
 					 */
 
+					/* If we were asked not to wait, give up immediately. */
+					if (nowait)
+						return false;
+
 					/*
 					 * We should be able to move to XLOG_FROM_STREAM only in
 					 * standby mode.
@@ -12383,6 +12441,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				if (readFile >= 0)
 					return true;	/* success! */
 
+				/* If we were asked not to wait, give up immediately. */
+				if (nowait)
+					return false;
+
 				/*
 				 * Nope, not found in archive or pg_wal.
 				 */
@@ -12460,7 +12522,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
 											 PrimarySlotName,
 											 wal_receiver_create_temp_slot);
-						flushedUpto = 0;
+						writtenUpto = 0;
 					}
 
 					/*
@@ -12483,15 +12545,16 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * be updated on each cycle. When we are behind,
 					 * XLogReceiptTime will not advance, so the grace time
 					 * allotted to conflicting queries will decrease.
+					 *
 					 */
-					if (RecPtr < flushedUpto)
+					if (RecPtr < writtenUpto)
 						havedata = true;
 					else
 					{
 						XLogRecPtr	latestChunkStart;
 
-						flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
-						if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
+						writtenUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &writtenTLI);
+						if (RecPtr < writtenUpto && writtenTLI == curFileTLI)
 						{
 							havedata = true;
 							if (latestChunkStart <= RecPtr)
@@ -12517,9 +12580,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						if (readFile < 0)
 						{
 							if (!expectedTLEs)
-								expectedTLEs = readTimeLineHistory(receiveTLI);
+								expectedTLEs = readTimeLineHistory(writtenTLI);
 							readFile = XLogFileRead(readSegNo, PANIC,
-													receiveTLI,
+													writtenTLI,
 													XLOG_FROM_STREAM, false);
 							Assert(readFile >= 0);
 						}
@@ -12533,6 +12596,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						break;
 					}
 
+					/* If we were asked not to wait, give up immediately. */
+					if (nowait)
+						return false;
+
 					/*
 					 * Data not here yet. Check for trigger, then wait for
 					 * walreceiver to wake us up when new WAL arrives.
@@ -12569,6 +12636,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * Wait for more WAL to arrive. Time out after 5 seconds
 					 * to react to a trigger file promptly and to check if the
 					 * WAL receiver is still active.
+					 *
+					 * XXX This is signalled on *flush*, not on write.  Oops.
 					 */
 					(void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
 									 WL_LATCH_SET | WL_TIMEOUT |
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a63ad8cfd0..22e5d5ff64 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -37,7 +37,9 @@ static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
 			pg_attribute_printf(2, 3);
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
-							 int reqLen);
+							 int reqLen, bool nowait);
+size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+static DecodedXLogRecord *XLogReadRecordInternal(XLogReaderState *state, bool force);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
@@ -50,6 +52,8 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 /* size of the buffer allocated for error message. */
 #define MAX_ERRORMSG_LEN 1000
 
+#define DEFAULT_DECODE_BUFFER_SIZE 0x10000
+
 /*
  * Construct a string in state->errormsg_buf explaining what's wrong with
  * the current record being read.
@@ -64,6 +68,8 @@ report_invalid_record(XLogReaderState *state, const char *fmt,...)
 	va_start(args, fmt);
 	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
 	va_end(args);
+
+	state->errormsg_deferred = true;
 }
 
 /*
@@ -86,8 +92,6 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	/* initialize caller-provided support functions */
 	state->routine = *routine;
 
-	state->max_block_id = -1;
-
 	/*
 	 * Permanently allocate readBuf.  We do it this way, rather than just
 	 * making a static array, for two reasons: (1) no need to waste the
@@ -138,18 +142,11 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 void
 XLogReaderFree(XLogReaderState *state)
 {
-	int			block_id;
-
 	if (state->seg.ws_file != -1)
 		state->routine.segment_close(state);
 
-	for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
-	{
-		if (state->blocks[block_id].data)
-			pfree(state->blocks[block_id].data);
-	}
-	if (state->main_data)
-		pfree(state->main_data);
+	if (state->decode_buffer && state->free_decode_buffer)
+		pfree(state->decode_buffer);
 
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
@@ -158,6 +155,22 @@ XLogReaderFree(XLogReaderState *state)
 	pfree(state);
 }
 
+/*
+ * Set the size of the decoding buffer.  A pointer to a caller-supplied memory
+ * region may also be passed in, in which case non-oversized records will be
+ * decoded there.
+ */
+void
+XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size)
+{
+	Assert(state->decode_buffer == NULL);
+
+	state->decode_buffer = buffer;
+	state->decode_buffer_size = size;
+	state->decode_buffer_head = buffer;
+	state->decode_buffer_tail = buffer;
+}
+
 /*
  * Allocate readRecordBuf to fit a record of at least the given length.
  * Returns true if successful, false if out of memory.
@@ -245,7 +258,9 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
 
 	/* Begin at the passed-in record pointer. */
 	state->EndRecPtr = RecPtr;
+	state->NextRecPtr = RecPtr;
 	state->ReadRecPtr = InvalidXLogRecPtr;
+	state->DecodeRecPtr = InvalidXLogRecPtr;
 }
 
 /*
@@ -266,6 +281,261 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  */
 XLogRecord *
 XLogReadRecord(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *record;
+
+	/* We can release the most recently returned record. */
+	if (state->record)
+	{
+		/*
+		 * Remove it from the decoded record queue.  It must be the oldest
+		 * item decoded, decode_queue_tail.
+		 */
+		record = state->record;
+		Assert(record == state->decode_queue_tail);
+		state->record = NULL;
+		state->decode_queue_tail = record->next;
+
+		/* It might also be the newest item decoded, decode_queue_head. */
+		if (state->decode_queue_head == record)
+			state->decode_queue_head = NULL;
+
+		/* Release the space. */
+		if (unlikely(record->oversized))
+		{
+			/* It's not in the decode buffer, so free it to release space. */
+			pfree(record);
+		}
+		else
+		{
+			/* It must be the tail record in the decode buffer. */
+			Assert(state->decode_buffer_tail == (char *) record);
+
+			/*
+			 * We need to update tail to point to the next record that is in
+			 * the decode buffer, if any, being careful to skip oversized ones
+			 * (they're not in the decode buffer).
+			 */
+			record = record->next;
+			while (unlikely(record && record->oversized))
+				record = record->next;
+			if (record)
+			{
+				/* Adjust tail to release space. */
+				state->decode_buffer_tail = (char *) record;
+			}
+			else
+			{
+				/* Nothing else in the decode buffer, so just reset it. */
+				state->decode_buffer_tail = state->decode_buffer;
+				state->decode_buffer_head = state->decode_buffer;
+			}
+		}
+	}
+
+	for (;;)
+	{
+		/* We can now return the tail item in the read queue, if there is one. */
+		if (state->decode_queue_tail)
+		{
+			/*
+			 * Is this record at the LSN that the caller expects?  If it
+			 * isn't, this indicates that EndRecPtr has been moved to a new
+			 * position by the caller, so we'd better reset our read queue and
+			 * move to the new location.
+			 */
+
+
+			/*
+			 * Record this as the most recent record returned, so that we'll
+			 * release it next time.  This also exposes it to the
+			 * XLogRecXXX(decoder) macros, which pass in the decode rather
+			 * than the record for historical reasons.
+			 */
+			state->record = state->decode_queue_tail;
+
+			/*
+			 * It should be immediately after the last record returned by
+			 * XLogReadRecord(), or at the position set by XLogBeginRead() if
+			 * XLogReadRecord() hasn't been called yet.  It may be after a
+			 * page header, though.
+			 */
+			Assert(state->record->lsn == state->EndRecPtr ||
+				   (state->EndRecPtr % XLOG_BLCKSZ == 0 &&
+					(state->record->lsn == state->EndRecPtr + SizeOfXLogShortPHD ||
+					 state->record->lsn == state->EndRecPtr + SizeOfXLogLongPHD)));
+
+			/*
+			 * Likewise, set ReadRecPtr and EndRecPtr to correspond to that
+			 * record.
+			 *
+			 * XXX Calling code should perhaps access these through the
+			 * returned decoded record, but for now we'll update them directly
+			 * here, for the benefit of existing code that thinks there's only
+			 * one record in the decoder.
+			 */
+			state->ReadRecPtr = state->record->lsn;
+			state->EndRecPtr = state->record->next_lsn;
+
+			/* XXX can't return pointer to header, will be given back to XLogDecodeRecord()! */
+			*errormsg = NULL;
+			return &state->record->header;
+		}
+		else if (state->errormsg_deferred)
+		{
+			/*
+			 * If we've run out of records, but we have a deferred error, now
+			 * is the time to report it.
+			 */
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			else
+				*errormsg = NULL;
+			state->errormsg_deferred = false;
+
+			/* Report the location of the error. */
+			state->ReadRecPtr = state->DecodeRecPtr;
+			state->EndRecPtr = state->NextRecPtr;
+
+			return NULL;
+		}
+
+		/* We need to get a decoded record into our queue first. */
+		XLogReadRecordInternal(state, true /* wait */ );
+
+		/*
+		 * If that produced neither a queued record nor a queued error, then
+		 * we're at the end (for example, archive recovery with no more files
+		 * available).
+		 */
+		if (state->decode_queue_tail == NULL && !state->errormsg_deferred)
+		{
+			state->EndRecPtr = state->NextRecPtr;
+			*errormsg = NULL;
+			return NULL;
+		}
+	}
+
+	/* unreachable */
+	return NULL;
+}
+
+/*
+ * Try to decode the next available record.  The next record will also be
+ * returned to XLogReadRecord().
+ */
+DecodedXLogRecord *
+XLogReadAhead(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *record = NULL;
+
+	if (!state->errormsg_deferred)
+	{
+		record = XLogReadRecordInternal(state, false);
+		if (state->errormsg_deferred)
+		{
+			/*
+			 * Report the error once, but don't consume it, so that
+			 * XLogReadRecord() can report it too.
+			 */
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			else
+				*errormsg = NULL;
+			return NULL;
+		}
+	}
+	*errormsg = NULL;
+
+	return record;
+}
+
+/*
+ * Allocate space for a decoded record.  The only member of the returned
+ * object that is initialized is the 'oversized' flag, indicating that the
+ * decoded record wouldn't fit in the decode buffer and must eventually be
+ * freed explicitly.
+ *
+ * Return NULL if there is no space in the decode buffer and allow_oversized
+ * is false, or if memory allocation fails for an oversized buffer.
+ */
+static DecodedXLogRecord *
+XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)
+{
+	size_t		required_space = DecodeXLogRecordRequiredSpace(xl_tot_len);
+	DecodedXLogRecord *decoded = NULL;
+
+	/* Allocate a circular decode buffer if we don't have one already. */
+	if (unlikely(state->decode_buffer == NULL))
+	{
+		if (state->decode_buffer_size == 0)
+			state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+		state->decode_buffer = palloc(state->decode_buffer_size);
+		state->decode_buffer_head = state->decode_buffer;
+		state->decode_buffer_tail = state->decode_buffer;
+		state->free_decode_buffer = true;
+	}
+	if (state->decode_buffer_head >= state->decode_buffer_tail)
+	{
+		/* Empty, or head is to the right of tail. */
+		if (state->decode_buffer_head + required_space <=
+			state->decode_buffer + state->decode_buffer_size)
+		{
+			/* There is space between head and end. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+		else if (state->decode_buffer + required_space <
+				 state->decode_buffer_tail)
+		{
+			/* There is space between start and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+	else
+	{
+		/* Head is to the left of tail. */
+		if (state->decode_buffer_head + required_space <
+			state->decode_buffer_tail)
+		{
+			/* There is space between head and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+
+	/* Not enough space in the decode buffer.  Are we allowed to allocate? */
+	if (allow_oversized)
+	{
+		decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
+		if (decoded == NULL)
+			return NULL;
+		decoded->oversized = true;
+		return decoded;
+	}
+
+	return decoded;
+}
+
+/*
+ * Try to read and decode the next record and add it to the head of the
+ * decoded record queue.
+ *
+ * If "force" is true, then wait for data to become available, and read a
+ * record even if it doesn't fit in the decode buffer, using overflow storage.
+ *
+ * If "force" is false, then return immediately if we'd have to wait for more
+ * data to become available, or if there isn't enough space in the decode
+ * buffer.
+ *
+ * Return the decoded record, or NULL if there was an error or ... XXX
+ */
+static DecodedXLogRecord *
+XLogReadRecordInternal(XLogReaderState *state, bool force)
 {
 	XLogRecPtr	RecPtr;
 	XLogRecord *record;
@@ -277,6 +547,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	uint32		pageHeaderSize;
 	bool		gotheader;
 	int			readOff;
+	DecodedXLogRecord *decoded;
+	char	   *errormsg; /* not used */
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -286,19 +558,17 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	randAccess = false;
 
 	/* reset error state */
-	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	decoded = NULL;
 
-	ResetDecoder(state);
-
-	RecPtr = state->EndRecPtr;
+	RecPtr = state->NextRecPtr;
 
-	if (state->ReadRecPtr != InvalidXLogRecPtr)
+	if (state->DecodeRecPtr != InvalidXLogRecPtr)
 	{
 		/* read the record after the one we just read */
 
 		/*
-		 * EndRecPtr is pointing to end+1 of the previous WAL record.  If
+		 * NextRecPtr is pointing to end+1 of the previous WAL record.  If
 		 * we're at a page boundary, no more records can fit on the current
 		 * page. We must skip over the page header, but we can't do that until
 		 * we've read in the page, since the header size is variable.
@@ -309,7 +579,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		/*
 		 * Caller supplied a position to start at.
 		 *
-		 * In this case, EndRecPtr should already be pointing to a valid
+		 * In this case, NextRecPtr should already be pointing to a valid
 		 * record starting position.
 		 */
 		Assert(XRecOffIsValid(RecPtr));
@@ -327,7 +597,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	 * fits on the same page.
 	 */
 	readOff = ReadPageInternal(state, targetPagePtr,
-							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
+							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ),
+							   !force);
 	if (readOff < 0)
 		goto err;
 
@@ -374,6 +645,19 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
 	total_len = record->xl_tot_len;
 
+	/* Find space to decode this record. */
+	decoded = XLogReadRecordAlloc(state, total_len, force);
+	if (decoded == NULL)
+	{
+		/*
+		 * We couldn't get space.  Usually this means that the decode buffer
+		 * was full, while trying to read ahead (that is, !force).  It's also
+		 * remotely possible for palloc() to have failed to allocate memory
+		 * for an oversized record.
+		 */
+		goto err;
+	}
+
 	/*
 	 * If the whole record header is on this page, validate it immediately.
 	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
@@ -384,7 +668,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	 */
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
-		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
+		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
 								   randAccess))
 			goto err;
 		gotheader = true;
@@ -439,7 +723,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			/* Wait for the next page to become available */
 			readOff = ReadPageInternal(state, targetPagePtr,
 									   Min(total_len - gotlen + SizeOfXLogShortPHD,
-										   XLOG_BLCKSZ));
+										   XLOG_BLCKSZ),
+									   !force);
 
 			if (readOff < 0)
 				goto err;
@@ -476,7 +761,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
 			if (readOff < pageHeaderSize)
 				readOff = ReadPageInternal(state, targetPagePtr,
-										   pageHeaderSize);
+										   pageHeaderSize, !force);
 
 			Assert(pageHeaderSize <= readOff);
 
@@ -487,7 +772,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
 			if (readOff < pageHeaderSize + len)
 				readOff = ReadPageInternal(state, targetPagePtr,
-										   pageHeaderSize + len);
+										   pageHeaderSize + len,
+										   !force);
 
 			memcpy(buffer, (char *) contdata, len);
 			buffer += len;
@@ -497,7 +783,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			if (!gotheader)
 			{
 				record = (XLogRecord *) state->readRecordBuf;
-				if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
+				if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr,
 										   record, randAccess))
 					goto err;
 				gotheader = true;
@@ -511,15 +797,16 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			goto err;
 
 		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
-		state->ReadRecPtr = RecPtr;
-		state->EndRecPtr = targetPagePtr + pageHeaderSize
+		state->DecodeRecPtr = RecPtr;
+		state->NextRecPtr = targetPagePtr + pageHeaderSize
 			+ MAXALIGN(pageHeader->xlp_rem_len);
 	}
 	else
 	{
 		/* Wait for the record data to become available */
 		readOff = ReadPageInternal(state, targetPagePtr,
-								   Min(targetRecOff + total_len, XLOG_BLCKSZ));
+								   Min(targetRecOff + total_len, XLOG_BLCKSZ),
+								   !force);
 		if (readOff < 0)
 			goto err;
 
@@ -527,9 +814,9 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		if (!ValidXLogRecord(state, record, RecPtr))
 			goto err;
 
-		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+		state->NextRecPtr = RecPtr + MAXALIGN(total_len);
 
-		state->ReadRecPtr = RecPtr;
+		state->DecodeRecPtr = RecPtr;
 	}
 
 	/*
@@ -539,25 +826,55 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->segcxt.ws_segsize - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->segcxt.ws_segsize);
+		state->NextRecPtr += state->segcxt.ws_segsize - 1;
+		state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
 	}
 
-	if (DecodeXLogRecord(state, record, errormsg))
-		return record;
-	else
-		return NULL;
+	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
+	{
+		/* Record the location of the next record. */
+		decoded->next_lsn = state->NextRecPtr;
+
+		/*
+		 * If it's in the decode buffer, mark the decode buffer space as
+		 * occupied.
+		 */
+		if (!decoded->oversized)
+		{
+			/* The new decode buffer head must be MAXALIGNed. */
+			Assert(decoded->size == MAXALIGN(decoded->size));
+			if ((char *) decoded == state->decode_buffer)
+				state->decode_buffer_head = state->decode_buffer + decoded->size;
+			else
+				state->decode_buffer_head += decoded->size;
+		}
+
+		/* Insert it into the queue of decoded records. */
+		Assert(state->decode_queue_head != decoded);
+		if (state->decode_queue_head)
+			state->decode_queue_head->next = decoded;
+		state->decode_queue_head = decoded;
+		if (!state->decode_queue_tail)
+			state->decode_queue_tail = decoded;
+		return decoded;
+	}
 
 err:
+	if (decoded && decoded->oversized)
+		pfree(decoded);
 
 	/*
-	 * Invalidate the read state. We might read from a different source after
-	 * failure.
+	 * Invalidate the read state, if this was an error. We might read from a
+	 * different source after failure.
 	 */
-	XLogReaderInvalReadState(state);
+	if (readOff < 0 || state->errormsg_buf[0] != '\0')
+		XLogReaderInvalReadState(state);
 
-	if (state->errormsg_buf[0] != '\0')
-		*errormsg = state->errormsg_buf;
+	/*
+	 * If an error was written to errormsg_buf, it'll be returned to the
+	 * caller of XLogReadRecord() once all successfully decoded records in
+	 * the read queue have been consumed.
+	 */
 
 	return NULL;
 }
@@ -573,7 +890,8 @@ err:
  * data and if there hasn't been any error since caching the data.
  */
 static int
-ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
+ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen,
+				 bool nowait)
 {
 	int			readLen;
 	uint32		targetPageOff;
@@ -608,7 +926,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 		readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
 										   state->currRecPtr,
-										   state->readBuf);
+										   state->readBuf,
+										   nowait);
 		if (readLen < 0)
 			goto err;
 
@@ -626,7 +945,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 */
 	readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 									   state->currRecPtr,
-									   state->readBuf);
+									   state->readBuf,
+									   nowait);
 	if (readLen < 0)
 		goto err;
 
@@ -645,7 +965,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	{
 		readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
 										   state->currRecPtr,
-										   state->readBuf);
+										   state->readBuf,
+										   nowait);
 		if (readLen < 0)
 			goto err;
 	}
@@ -664,7 +985,11 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
-	XLogReaderInvalReadState(state);
+	if (state->errormsg_buf[0] != '\0')
+	{
+		state->errormsg_deferred = true;
+		XLogReaderInvalReadState(state);
+	}
 	return -1;
 }
 
@@ -974,7 +1299,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		targetPagePtr = tmpRecPtr - targetRecOff;
 
 		/* Read the page containing the record */
-		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff);
+		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff, false);
 		if (readLen < 0)
 			goto err;
 
@@ -983,7 +1308,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		pageHeaderSize = XLogPageHeaderSize(header);
 
 		/* make sure we have enough data for the page header */
-		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize);
+		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize, false);
 		if (readLen < 0)
 			goto err;
 
@@ -1147,34 +1472,83 @@ WALRead(XLogReaderState *state,
  * ----------------------------------------
  */
 
-/* private function to reset the state between records */
+/*
+ * Private function to reset the state, forgetting all decoded records, if we
+ * are asked to move to a new read position.
+ */
 static void
 ResetDecoder(XLogReaderState *state)
 {
-	int			block_id;
+	DecodedXLogRecord *r;
 
-	state->decoded_record = NULL;
-
-	state->main_data_len = 0;
-
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	/* Reset the decoded record queue, freeing any oversized records. */
+	while ((r = state->decode_queue_tail))
 	{
-		state->blocks[block_id].in_use = false;
-		state->blocks[block_id].has_image = false;
-		state->blocks[block_id].has_data = false;
-		state->blocks[block_id].apply_image = false;
+		state->decode_queue_tail = r->next;
+		if (r->oversized)
+			pfree(r);
 	}
-	state->max_block_id = -1;
+	state->decode_queue_head = NULL;
+	state->decode_queue_tail = NULL;
+	state->record = NULL;
+
+	/* Reset the decode buffer to empty. */
+	state->decode_buffer_head = state->decode_buffer;
+	state->decode_buffer_tail = state->decode_buffer;
+
+	/* Clear error state. */
+	state->errormsg_buf[0] = '\0';
+	state->errormsg_deferred = false;
 }
 
 /*
- * Decode the previously read record.
+ * Compute the maximum possible amount of space that could be required to
+ * decode a record, given xl_tot_len from the record's header.  This is the
+ * amount of output buffer space that we need to decode a record, though we
+ * might not end up using it all.
+ *
+ * This computation is pessimistic and assumes the maximum possible number of
+ * blocks, due to lack of better information.
+ */
+size_t
+DecodeXLogRecordRequiredSpace(size_t xl_tot_len)
+{
+	size_t size = 0;
+
+	/* Account for the fixed size part of the decoded record struct. */
+	size += offsetof(DecodedXLogRecord, blocks[0]);
+	/* Account for the flexible blocks array of maximum possible size. */
+	size += sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1);
+	/* Account for all the raw main and block data. */
+	size += xl_tot_len;
+	/* We might insert padding before main_data. */
+	size += (MAXIMUM_ALIGNOF - 1);
+	/* We might insert padding before each block's data. */
+	size += (MAXIMUM_ALIGNOF - 1) * (XLR_MAX_BLOCK_ID + 1);
+	/* We might insert padding at the end. */
+	size += (MAXIMUM_ALIGNOF - 1);
+
+	return size;
+}
+
+/*
+ * Decode a record.  "decoded" must point to a MAXALIGNed memory area that has
+ * space for at least DecodeXLogRecordRequiredSpace(record) bytes.  On
+ * success, decoded->size contains the actual space occupied by the decoded
+ * record, which may turn out to be less.
+ *
+ * Only the decoded->oversized member must be initialized in advance; it will
+ * not be modified.  Other members will be initialized as required.
  *
  * On error, a human-readable error message is returned in *errormsg, and
  * the return value is false.
  */
 bool
-DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
+DecodeXLogRecord(XLogReaderState *state,
+				 DecodedXLogRecord *decoded,
+				 XLogRecord *record,
+				 XLogRecPtr lsn,
+				 char **errormsg)
 {
 	/*
 	 * read next _size bytes from record buffer, but check for overrun first.
@@ -1189,17 +1563,20 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	} while(0)
 
 	char	   *ptr;
+	char	   *out;
 	uint32		remaining;
 	uint32		datatotal;
 	RelFileNode *rnode = NULL;
 	uint8		block_id;
 
-	ResetDecoder(state);
-
-	state->decoded_record = record;
-	state->record_origin = InvalidRepOriginId;
-	state->toplevel_xid = InvalidTransactionId;
-
+	decoded->header = *record;
+	decoded->lsn = lsn;
+	decoded->next = NULL;
+	decoded->record_origin = InvalidRepOriginId;
+	decoded->toplevel_xid = InvalidTransactionId;
+	decoded->main_data = NULL;
+	decoded->main_data_len = 0;
+	decoded->max_block_id = -1;
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
 	remaining = record->xl_tot_len - SizeOfXLogRecord;
@@ -1217,7 +1594,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint8));
 
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
@@ -1228,18 +1605,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			uint32		main_data_len;
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint32));
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
 		}
 		else if (block_id == XLR_BLOCK_ID_ORIGIN)
 		{
-			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
+			COPY_HEADER_FIELD(&decoded->record_origin, sizeof(RepOriginId));
 		}
 		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
 		{
-			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+			COPY_HEADER_FIELD(&decoded->toplevel_xid, sizeof(TransactionId));
 		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
@@ -1247,7 +1624,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			DecodedBkpBlock *blk;
 			uint8		fork_flags;
 
-			if (block_id <= state->max_block_id)
+			/* mark any intervening block IDs as not in use */
+			for (int i = decoded->max_block_id + 1; i < block_id; ++i)
+				decoded->blocks[i].in_use = false;
+
+			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
 									  "out-of-order block_id %u at %X/%X",
@@ -1256,9 +1637,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 									  (uint32) state->ReadRecPtr);
 				goto err;
 			}
-			state->max_block_id = block_id;
+			decoded->max_block_id = block_id;
 
-			blk = &state->blocks[block_id];
+			blk = &decoded->blocks[block_id];
 			blk->in_use = true;
 			blk->apply_image = false;
 
@@ -1404,17 +1785,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	/*
 	 * Ok, we've parsed the fragment headers, and verified that the total
 	 * length of the payload in the fragments is equal to the amount of data
-	 * left. Copy the data of each fragment to a separate buffer.
-	 *
-	 * We could just set up pointers into readRecordBuf, but we want to align
-	 * the data for the convenience of the callers. Backup images are not
-	 * copied, however; they don't need alignment.
+	 * left.  Copy the data of each fragment to contiguous space after the
+	 * blocks array, inserting alignment padding before the data fragments so
+	 * they can be cast to struct pointers by REDO routines.
 	 */
+	out = ((char *) decoded) +
+		offsetof(DecodedXLogRecord, blocks) +
+		sizeof(decoded->blocks[0]) * (decoded->max_block_id + 1);
 
 	/* block data first */
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	for (block_id = 0; block_id <= decoded->max_block_id; block_id++)
 	{
-		DecodedBkpBlock *blk = &state->blocks[block_id];
+		DecodedBkpBlock *blk = &decoded->blocks[block_id];
 
 		if (!blk->in_use)
 			continue;
@@ -1423,58 +1805,37 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (blk->has_image)
 		{
-			blk->bkp_image = ptr;
+			/* no need to align image */
+			blk->bkp_image = out;
+			memcpy(out, ptr, blk->bimg_len);
 			ptr += blk->bimg_len;
+			out += blk->bimg_len;
 		}
 		if (blk->has_data)
 		{
-			if (!blk->data || blk->data_len > blk->data_bufsz)
-			{
-				if (blk->data)
-					pfree(blk->data);
-
-				/*
-				 * Force the initial request to be BLCKSZ so that we don't
-				 * waste time with lots of trips through this stanza as a
-				 * result of WAL compression.
-				 */
-				blk->data_bufsz = MAXALIGN(Max(blk->data_len, BLCKSZ));
-				blk->data = palloc(blk->data_bufsz);
-			}
+			out = (char *) MAXALIGN(out);
+			blk->data = out;
 			memcpy(blk->data, ptr, blk->data_len);
 			ptr += blk->data_len;
+			out += blk->data_len;
 		}
 	}
 
 	/* and finally, the main data */
-	if (state->main_data_len > 0)
+	if (decoded->main_data_len > 0)
 	{
-		if (!state->main_data || state->main_data_len > state->main_data_bufsz)
-		{
-			if (state->main_data)
-				pfree(state->main_data);
-
-			/*
-			 * main_data_bufsz must be MAXALIGN'ed.  In many xlog record
-			 * types, we omit trailing struct padding on-disk to save a few
-			 * bytes; but compilers may generate accesses to the xlog struct
-			 * that assume that padding bytes are present.  If the palloc
-			 * request is not large enough to include such padding bytes then
-			 * we'll get valgrind complaints due to otherwise-harmless fetches
-			 * of the padding bytes.
-			 *
-			 * In addition, force the initial request to be reasonably large
-			 * so that we don't waste time with lots of trips through this
-			 * stanza.  BLCKSZ / 2 seems like a good compromise choice.
-			 */
-			state->main_data_bufsz = MAXALIGN(Max(state->main_data_len,
-												  BLCKSZ / 2));
-			state->main_data = palloc(state->main_data_bufsz);
-		}
-		memcpy(state->main_data, ptr, state->main_data_len);
-		ptr += state->main_data_len;
+		out = (char *) MAXALIGN(out);
+		decoded->main_data = out;
+		memcpy(decoded->main_data, ptr, decoded->main_data_len);
+		ptr += decoded->main_data_len;
+		out += decoded->main_data_len;
 	}
 
+	/* Report the actual size we used. */
+	decoded->size = MAXALIGN(out - (char *) decoded);
+	Assert(DecodeXLogRecordRequiredSpace(record->xl_tot_len) >=
+		   decoded->size);
+
 	return true;
 
 shortdata_err:
@@ -1500,10 +1861,11 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	if (rnode)
 		*rnode = bkpb->rnode;
 	if (forknum)
@@ -1523,10 +1885,11 @@ XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return NULL;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 
 	if (!bkpb->has_data)
 	{
@@ -1554,12 +1917,13 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	char	   *ptr;
 	PGAlignedBlock tmp;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
-	if (!record->blocks[block_id].has_image)
+	if (!record->record->blocks[block_id].has_image)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	ptr = bkpb->bkp_image;
 
 	if (bkpb->bimg_info & BKPIMAGE_IS_COMPRESSED)
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index e0ca3859a9..eb7798f488 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -350,7 +350,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * going to initialize it. And vice versa.
 	 */
 	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
-	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
+	willinit = (record->record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
 	if (!willinit && zeromode)
@@ -828,7 +828,8 @@ wal_segment_close(XLogReaderState *state)
  */
 int
 read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
+					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
+					 bool nowait)
 {
 	XLogRecPtr	read_upto,
 				loc;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d87d9d06ee..43523c464a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4024,6 +4024,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_RECOVERY_PAUSE:
 			event_name = "RecoveryPause";
 			break;
+		case WAIT_EVENT_RECOVERY_WAL_FLUSH:
+			event_name = "RecoveryWalFlush";
+			break;
 		case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
 			event_name = "ReplicationOriginDrop";
 			break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3f84ee99b8..4bc22deddb 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -111,7 +111,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 	{
 		ReorderBufferAssignChild(ctx->reorder,
 								 txid,
-								 record->decoded_record->xl_xid,
+								 XLogRecGetXid(record),
 								 buf.origptr);
 	}
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d5c9bc31d8..4df154f11f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -812,7 +812,7 @@ StartReplication(StartReplicationCmd *cmd)
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-					   XLogRecPtr targetRecPtr, char *cur_page)
+					   XLogRecPtr targetRecPtr, char *cur_page, bool nowait)
 {
 	XLogRecPtr	flushptr;
 	int			count;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 9275cba51b..eb363ceeb6 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -49,7 +49,8 @@ typedef struct XLogPageReadPrivate
 
 static int	SimpleXLogPageRead(XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
-							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
+							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+							   bool nowait);
 
 /*
  * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline
@@ -248,7 +249,8 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 /* XLogReader callback function, to read a WAL page */
 static int
 SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
+				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+				   bool nowait)
 {
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 	uint32		targetPageOff;
@@ -432,7 +434,7 @@ extractPageInfo(XLogReaderState *record)
 				 RmgrNames[rmid], info);
 	}
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 31e99c2a6d..7259559036 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -333,7 +333,7 @@ WALDumpCloseSegment(XLogReaderState *state)
 /* pg_waldump's XLogReaderRoutine->page_read callback */
 static int
 WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-				XLogRecPtr targetPtr, char *readBuff)
+				XLogRecPtr targetPtr, char *readBuff, bool nowait)
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
@@ -392,10 +392,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
 	 * add an accessor macro for this.
 	 */
 	*fpi_len = 0;
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (XLogRecHasBlockImage(record, block_id))
-			*fpi_len += record->blocks[block_id].bimg_len;
+			*fpi_len += record->record->blocks[block_id].bimg_len;
 	}
 
 	/*
@@ -484,7 +484,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	if (!config->bkp_details)
 	{
 		/* print block references (short format) */
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -515,7 +515,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	{
 		/* print block references (detailed format) */
 		putchar('\n');
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -528,26 +528,26 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				   blk);
 			if (XLogRecHasBlockImage(record, block_id))
 			{
-				if (record->blocks[block_id].bimg_info &
+				if (record->record->blocks[block_id].bimg_info &
 					BKPIMAGE_IS_COMPRESSED)
 				{
 					printf(" (FPW%s); hole: offset: %u, length: %u, "
 						   "compression saved: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length,
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length,
 						   BLCKSZ -
-						   record->blocks[block_id].hole_length -
-						   record->blocks[block_id].bimg_len);
+						   record->record->blocks[block_id].hole_length -
+						   record->record->blocks[block_id].bimg_len);
 				}
 				else
 				{
 					printf(" (FPW%s); hole: offset: %u, length: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length);
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length);
 				}
 			}
 			putchar('\n');
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 0b6d00dd7d..44f8847030 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -62,7 +62,8 @@ typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
 							   int reqLen,
 							   XLogRecPtr targetRecPtr,
-							   char *readBuf);
+							   char *readBuf,
+							   bool nowait);
 typedef void (*WALSegmentOpenCB) (XLogReaderState *xlogreader,
 								  XLogSegNo nextSegNo,
 								  TimeLineID *tli_p);
@@ -144,6 +145,30 @@ typedef struct
 	uint16		data_bufsz;
 } DecodedBkpBlock;
 
+/*
+ * The decoded contents of a record.  This occupies a contiguous region of
+ * memory, with main_data and blocks[n].data pointing to memory after the
+ * members declared here.
+ */
+typedef struct DecodedXLogRecord
+{
+	/* Private member used for resource management. */
+	size_t		size;			/* total size of decoded record */
+	bool		oversized;		/* outside the regular decode buffer? */
+	struct DecodedXLogRecord *next;	/* decoded record queue  link */
+
+	/* Public members. */
+	XLogRecPtr	lsn;			/* location */
+	XLogRecPtr	next_lsn;		/* location of next record */
+	XLogRecord	header;			/* header */
+	RepOriginId record_origin;
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+	char	   *main_data;		/* record's main data portion */
+	uint32		main_data_len;	/* main data portion's length */
+	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	DecodedBkpBlock blocks[FLEXIBLE_ARRAY_MEMBER];
+} DecodedXLogRecord;
+
 struct XLogReaderState
 {
 	/*
@@ -168,35 +193,25 @@ struct XLogReaderState
 	void	   *private_data;
 
 	/*
-	 * Start and end point of last record read.  EndRecPtr is also used as the
-	 * position to read next.  Calling XLogBeginRead() sets EndRecPtr to the
-	 * starting position and ReadRecPtr to invalid.
+	 * Start and end point of last record returned by XLogReadRecord().
+	 *
+	 * XXX These are also available as record->lsn and record->next_lsn,
+	 * but since these were part of the public interface...
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
 
-
-	/* ----------------------------------------
-	 * Decoded representation of current record
-	 *
-	 * Use XLogRecGet* functions to investigate the record; these fields
-	 * should not be accessed directly.
-	 * ----------------------------------------
+	/*
+	 * Start and end point of the last record read and decoded by
+	 * XLogReadRecordInternal().  NextRecPtr is also used as the position to
+	 * decode next.  Calling XLogBeginRead() sets NextRecPtr and EndRecPtr to
+	 * the requested starting position.
 	 */
-	XLogRecord *decoded_record; /* currently decoded record */
-
-	char	   *main_data;		/* record's main data portion */
-	uint32		main_data_len;	/* main data portion's length */
-	uint32		main_data_bufsz;	/* allocated size of the buffer */
-
-	RepOriginId record_origin;
+	XLogRecPtr	DecodeRecPtr;	/* start of last record decoded */
+	XLogRecPtr	NextRecPtr;		/* end+1 of last record decoded */
 
-	TransactionId toplevel_xid; /* XID of top-level transaction */
-
-	/* information about blocks referenced by the record. */
-	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
-
-	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	/* Last record returned by XLogReadRecord. */
+	DecodedXLogRecord *record;
 
 	/* ----------------------------------------
 	 * private/internal state
@@ -210,6 +225,26 @@ struct XLogReaderState
 	char	   *readBuf;
 	uint32		readLen;
 
+	/*
+	 * Buffer for decoded records.  This is a circular buffer, though
+	 * individual records can't be split in the middle, so some space is often
+	 * wasted at the end.  Oversized records that don't fit in this space are
+	 * allocated separately.
+	 */
+	char	   *decode_buffer;
+	size_t		decode_buffer_size;
+	bool		free_decode_buffer;		/* need to free? */
+	char	   *decode_buffer_head;		/* write head */
+	char	   *decode_buffer_tail;		/* read head */
+
+	/*
+	 * Queue of records that have been decoded.  This is a linked list that
+	 * usually consists of consecutive records in decode_buffer, but may also
+	 * contain oversized records allocated with palloc().
+	 */
+	DecodedXLogRecord *decode_queue_head;	/* newest decoded record */
+	DecodedXLogRecord *decode_queue_tail;	/* oldest decoded record */
+
 	/* last read XLOG position for data currently in readBuf */
 	WALSegmentContext segcxt;
 	WALOpenSegment seg;
@@ -252,6 +287,7 @@ struct XLogReaderState
 
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
+	bool		errormsg_deferred;
 };
 
 /* Get a new XLogReader */
@@ -264,6 +300,11 @@ extern XLogReaderRoutine *LocalXLogReaderRoutine(void);
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
+/* Optionally provide a circular decoding buffer to allow readahead. */
+extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
+									  void *buffer,
+									  size_t size);
+
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 #ifdef FRONTEND
@@ -274,6 +315,10 @@ extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
 										 char **errormsg);
 
+/* Try to read ahead, if there is space in the decoding buffer. */
+extern DecodedXLogRecord *XLogReadAhead(XLogReaderState *state,
+										char **errormsg);
+
 /* Validate a page */
 extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 										 XLogRecPtr recptr, char *phdr);
@@ -297,25 +342,31 @@ extern bool WALRead(XLogReaderState *state,
 
 /* Functions for decoding an XLogRecord */
 
-extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
+extern size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+extern bool DecodeXLogRecord(XLogReaderState *state,
+							 DecodedXLogRecord *decoded,
+							 XLogRecord *record,
+							 XLogRecPtr lsn,
 							 char **errmsg);
 
-#define XLogRecGetTotalLen(decoder) ((decoder)->decoded_record->xl_tot_len)
-#define XLogRecGetPrev(decoder) ((decoder)->decoded_record->xl_prev)
-#define XLogRecGetInfo(decoder) ((decoder)->decoded_record->xl_info)
-#define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
-#define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
-#define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
-#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
-#define XLogRecGetData(decoder) ((decoder)->main_data)
-#define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
-#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
+#define XLogRecGetTotalLen(decoder) ((decoder)->record->header.xl_tot_len)
+#define XLogRecGetPrev(decoder) ((decoder)->record->header.xl_prev)
+#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
+#define XLogRecGetRmid(decoder) ((decoder)->record->header.xl_rmid)
+#define XLogRecGetXid(decoder) ((decoder)->record->header.xl_xid)
+#define XLogRecGetOrigin(decoder) ((decoder)->record->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->record->toplevel_xid)
+#define XLogRecGetData(decoder) ((decoder)->record->main_data)
+#define XLogRecGetDataLen(decoder) ((decoder)->record->main_data_len)
+#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->record->max_block_id >= 0)
+#define XLogRecMaxBlockId(decoder) ((decoder)->record->max_block_id)
+#define XLogRecGetBlock(decoder, i) (&(decoder)->record->blocks[(i)])
 #define XLogRecHasBlockRef(decoder, block_id) \
-	((decoder)->blocks[block_id].in_use)
+	((decoder)->record->blocks[block_id].in_use)
 #define XLogRecHasBlockImage(decoder, block_id) \
-	((decoder)->blocks[block_id].has_image)
+	((decoder)->record->blocks[block_id].has_image)
 #define XLogRecBlockImageApply(decoder, block_id) \
-	((decoder)->blocks[block_id].apply_image)
+	((decoder)->record->blocks[block_id].apply_image)
 
 #ifndef FRONTEND
 extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index e59b6cf3a9..374c1b16ce 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -49,7 +49,8 @@ extern void FreeFakeRelcacheEntry(Relation fakerel);
 
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
-								 XLogRecPtr targetRecPtr, char *cur_page);
+								 XLogRecPtr targetRecPtr, char *cur_page,
+								 bool nowait);
 extern void wal_segment_open(XLogReaderState *state,
 							 XLogSegNo nextSegNo,
 							 TimeLineID *tli_p);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5954068dec..ad3a18dfc9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -960,6 +960,7 @@ typedef enum
 	WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
 	WAIT_EVENT_RECOVERY_CONFLICT_TABLESPACE,
 	WAIT_EVENT_RECOVERY_PAUSE,
+	WAIT_EVENT_RECOVERY_WAL_FLUSH,
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-- 
2.20.1

v15-0004-Prefetch-referenced-blocks-during-recovery.patchtext/x-patch; charset=US-ASCII; name=v15-0004-Prefetch-referenced-blocks-during-recovery.patchDownload
From 167e87f415898c7d0a6f847909a6e00e2f70ae40 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Apr 2020 23:04:51 +1200
Subject: [PATCH v15 4/6] Prefetch referenced blocks during recovery.

Introduce a new GUC recovery_prefetch.  If it is enabled (the default),
then read ahead in the WAL and try to initiate asynchronous reading of
referenced blocks that will soon be needed but are not yet cached in our
buffer pool.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size is used to limit the maximum distance we
are prepared to read ahead in the WAL to find uncached blocks.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  58 ++
 doc/src/sgml/monitoring.sgml                  |  86 +-
 doc/src/sgml/wal.sgml                         |  17 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             |  22 +-
 src/backend/access/transam/xlogprefetch.c     | 895 ++++++++++++++++++
 src/backend/access/transam/xlogreader.c       |   2 +
 src/backend/catalog/system_views.sql          |  14 +
 src/backend/postmaster/pgstat.c               | 103 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  56 +-
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/access/xlog.h                     |   1 +
 src/include/access/xlogprefetch.h             |  79 ++
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/pgstat.h                          |  26 +
 src/include/utils/guc.h                       |   4 +
 src/test/regress/expected/rules.out           |  11 +
 18 files changed, 1387 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetch.c
 create mode 100644 src/include/access/xlogprefetch.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4b60382778..ac27392053 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3366,6 +3366,64 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-recovery-prefetch" xreflabel="recovery_prefetch">
+      <term><varname>recovery_prefetch</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to try to prefetch blocks that are referenced in the WAL that
+        are not yet in the buffer pool, during recovery.  Prefetching blocks
+        that will soon be needed can reduce I/O wait times.
+        See also the <xref linkend="guc-wal-decode-buffer-size"/> and
+        <xref linkend="guc-maintenance-io-concurrency"/> settings, which limit
+        prefetching activity.
+        This setting is enabled
+        by default on systems that support <function>posix_fadvise</function>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+      <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks that were logged with full page images,
+        during recovery.  Often this doesn't help, since such blocks will not
+        be read the first time they are needed and might remain in the buffer
+        pool after that.  However, on file systems with a block size larger
+        than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a
+        costly read-before-write when blocks are later written.
+        The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-decode-buffer-size" xreflabel="wal_decode_buffer_size">
+      <term><varname>wal_decode_buffer_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_decode_buffer_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        A limit on how far ahead the server can look in the WAL, to find
+        blocks to prefetch.  Setting it too high might be counterproductive,
+        if it means that data falls out of the
+        kernel cache before it is needed.  If this value is specified without
+        units, it is taken as bytes.
+        The default is 512kB.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 52a69a5366..5bf0bf97c4 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -332,6 +332,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2878,6 +2885,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    copy of the subscribed tables.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+    </row>
+    <row>
+     <entry><structfield>skip_seq</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+    </row>
+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_distance</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>How far ahead of recovery the prefetcher is on average, while recovery is not idle</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_queue_depth</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>Average number of prefetches in flight while recovery is not idle</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-recovery-prefetch"/>,
+   <xref linkend="guc-recovery-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="1">
@@ -4886,8 +4965,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
         view, <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
-        to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+        the <structname>pg_stat_archiver</structname> view,
+        <literal>wal</literal> to reset all the counters shown in the
+        <structname>pg_stat_wal</structname> view or
+        <literal>prefetch_recovery</literal> to reset all the counters shown
+        in the <structname>pg_stat_prefetch_recovery</structname> view.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index d1c3893b14..c51c431398 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -720,6 +720,23 @@
    <acronym>WAL</acronym> call being logged to the server log. This
    option might be replaced by a more general mechanism in the future.
   </para>
+
+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.  The
+   prefetching mechanism is most likely to be effective on systems
+   with <varname>full_page_writes</varname> set to
+   <literal>off</literal> (where that is safe), and where the working
+   set is larger than RAM.  By default, prefetching in recovery is enabled
+   on operating systems that have <function>posix_fadvise</function>
+   support.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..39f9d4e77d 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetch.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 691d6a0ab9..594f09d7f2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetch.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -109,6 +110,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+int			wal_decode_buffer_size = 512 * 1024;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -3697,7 +3699,6 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
 					 xlogfname);
 			set_ps_display(activitymsg);
-
 			restoredFromArchive = RestoreArchivedFile(path, xlogfname,
 													  "RECOVERYXLOG",
 													  wal_segment_size,
@@ -6541,6 +6542,12 @@ StartupXLOG(void)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Set the WAL decode buffer size.  This limits how far ahead we can read
+	 * in the WAL.
+	 */
+	XLogReaderSetDecodeBuffer(xlogreader, NULL, wal_decode_buffer_size);
+
 	/*
 	 * Allocate two page buffers dedicated to WAL consistency checks.  We do
 	 * it this way, rather than just making static arrays, for two reasons:
@@ -7218,6 +7225,7 @@ StartupXLOG(void)
 		{
 			ErrorContextCallback errcallback;
 			TimestampTz xtime;
+			XLogPrefetchState prefetch;
 			PGRUsage	ru0;
 
 			pg_rusage_init(&ru0);
@@ -7228,6 +7236,9 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
 
+			/* Prepare to prefetch, if configured. */
+			XLogPrefetchBegin(&prefetch, xlogreader);
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7257,6 +7268,9 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/* Perform WAL prefetching, if enabled. */
+				XLogPrefetch(&prefetch, xlogreader->ReadRecPtr);
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -7428,6 +7442,9 @@ StartupXLOG(void)
 					 */
 					if (AllowCascadeReplication())
 						WalSndWakeup();
+
+					/* Reset the prefetcher. */
+					XLogPrefetchReconfigure();
 				}
 
 				/* Exit loop if we reached inclusive recovery target */
@@ -7444,6 +7461,7 @@ StartupXLOG(void)
 			/*
 			 * end of main redo apply loop
 			 */
+			XLogPrefetchEnd(&prefetch);
 
 			if (reachedRecoveryTarget)
 			{
@@ -12302,6 +12320,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					currentSource = XLOG_FROM_STREAM;
 					startWalReceiver = true;
+					XLogPrefetchReconfigure();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -12566,6 +12585,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						else
 							havedata = false;
 					}
+
 					if (havedata)
 					{
 						/*
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
new file mode 100644
index 0000000000..a8149b946c
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -0,0 +1,895 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetch.c
+ *
+ * The goal of this module is to read future WAL records and issue
+ * PrefetchSharedBuffer() calls for referenced blocks, so that we avoid I/O
+ * stalls in the main recovery loop.
+ *
+ * When examining a WAL record from the future, we need to consider that a
+ * referenced block or segment file might not exist on disk until this record
+ * or some earlier record has been replayed.  After a crash, a file might also
+ * be missing because it was dropped by a later WAL record; in that case, it
+ * will be recreated when this record is replayed.  These cases are handled by
+ * recognizing them and adding a "filter" that prevents all prefetching of a
+ * certain block range until the present WAL record has been replayed.  Blocks
+ * skipped for these reasons are counted as "skip_new" (that is, cases where we
+ * didn't try to prefetch "new" blocks).
+ *
+ * Blocks found in the buffer pool already are counted as "skip_hit".
+ * Repeated access to the same buffer is detected and skipped, and this is
+ * counted with "skip_seq".  Blocks that were logged with FPWs are skipped if
+ * recovery_prefetch_fpw is off, since on most systems there will be no I/O
+ * stall; this is counted with "skip_fpw".
+ *
+ * The only way we currently have to know that an I/O initiated with
+ * PrefetchSharedBuffer() has completed is that recovery will eventually call
+ * ReadBuffer() and perform a synchronous read.  Therefore, we track the number of
+ * potentially in-flight I/Os by using a circular buffer of LSNs.  When it's
+ * full, we have to wait for recovery to replay records so that the queue
+ * depth can be reduced, before we can do any more prefetching.  Ideally, this
+ * keeps us the right distance ahead to respect maintenance_io_concurrency.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetch.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log
+ * message that appears at the end of crash recovery.  It's also used to send
+ * messages periodically to the stats collector, to save the counters on disk.
+ */
+#define XLOGPREFETCHER_SAMPLE_DISTANCE 0x40000
+
+/* GUCs */
+bool		recovery_prefetch = true;
+bool		recovery_prefetch_fpw = false;
+
+int			XLogPrefetchReconfigureCount;
+
+/*
+ * A prefetcher object.  There is at most one of these in existence at a time,
+ * recreated whenever there is a configuration change.
+ */
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	DecodedXLogRecord *record;
+	int				next_block_id;
+	bool			shutdown;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation	last_reln;
+	RelFileNode		last_rnode;
+	BlockNumber		last_blkno;
+
+	/* Online averages. */
+	uint64			samples;
+	double			avg_queue_depth;
+	double			avg_distance;
+	XLogRecPtr		next_sample_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB		   *filter_table;
+	dlist_head		filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	int				prefetch_head;
+	int				prefetch_tail;
+	int				prefetch_queue_size;
+	XLogRecPtr		prefetch_queue[MAX_IO_CONCURRENCY + 1];
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode		rnode;
+	XLogRecPtr		filter_until_replayed;
+	BlockNumber		filter_from_block;
+	dlist_node		link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time; /* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 skip_hit;	/* Blocks already buffered. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+	pg_atomic_uint64 skip_seq;	/* Repeat blocks skipped. */
+	float			avg_distance;
+	float			avg_queue_depth;
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32			reset_handled;
+
+	/* Dynamic values */
+	int				distance;	/* Number of bytes ahead in the WAL. */
+	int				queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+static void XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher,
+									  XLogRecPtr replaying_lsn);
+static bool XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher);
+static void XLogPrefetchSaveStats(void);
+static void XLogPrefetchRestoreStats(void);
+
+static XLogPrefetchStats *Stats;
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&Stats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&Stats->prefetch, 0);
+	pg_atomic_write_u64(&Stats->skip_hit, 0);
+	pg_atomic_write_u64(&Stats->skip_new, 0);
+	pg_atomic_write_u64(&Stats->skip_fpw, 0);
+	pg_atomic_write_u64(&Stats->skip_seq, 0);
+	Stats->avg_distance = 0;
+	Stats->avg_queue_depth = 0;
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	Stats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+	if (!found)
+	{
+		pg_atomic_init_u32(&Stats->reset_request, 0);
+		Stats->reset_handled = 0;
+		pg_atomic_init_u64(&Stats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&Stats->prefetch, 0);
+		pg_atomic_init_u64(&Stats->skip_hit, 0);
+		pg_atomic_init_u64(&Stats->skip_new, 0);
+		pg_atomic_init_u64(&Stats->skip_fpw, 0);
+		pg_atomic_init_u64(&Stats->skip_seq, 0);
+		Stats->avg_distance = 0;
+		Stats->avg_queue_depth = 0;
+		Stats->distance = 0;
+		Stats->queue_depth = 0;
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&Stats->reset_request, 1);
+}
+
+/*
+ * Tell the stats collector to serialize the shared memory counters into the
+ * stats file.
+ */
+static void
+XLogPrefetchSaveStats(void)
+{
+	PgStat_RecoveryPrefetchStats serialized = {
+		.prefetch = pg_atomic_read_u64(&Stats->prefetch),
+		.skip_hit = pg_atomic_read_u64(&Stats->skip_hit),
+		.skip_new = pg_atomic_read_u64(&Stats->skip_new),
+		.skip_fpw = pg_atomic_read_u64(&Stats->skip_fpw),
+		.skip_seq = pg_atomic_read_u64(&Stats->skip_seq),
+		.stat_reset_timestamp = pg_atomic_read_u64(&Stats->reset_time)
+	};
+
+	pgstat_send_recoveryprefetch(&serialized);
+}
+
+/*
+ * Try to restore the shared memory counters from the stats file.
+ */
+static void
+XLogPrefetchRestoreStats(void)
+{
+	PgStat_RecoveryPrefetchStats *serialized = pgstat_fetch_recoveryprefetch();
+
+	if (serialized->stat_reset_timestamp != 0)
+	{
+		pg_atomic_write_u64(&Stats->prefetch, serialized->prefetch);
+		pg_atomic_write_u64(&Stats->skip_hit, serialized->skip_hit);
+		pg_atomic_write_u64(&Stats->skip_new, serialized->skip_new);
+		pg_atomic_write_u64(&Stats->skip_fpw, serialized->skip_fpw);
+		pg_atomic_write_u64(&Stats->skip_seq, serialized->skip_seq);
+		pg_atomic_write_u64(&Stats->reset_time, serialized->stat_reset_timestamp);
+	}
+}
+
+/*
+ * Initialize an XLogPrefetchState object and restore the last saved
+ * statistics from disk.
+ */
+void
+XLogPrefetchBegin(XLogPrefetchState *state, XLogReaderState *reader)
+{
+	XLogPrefetchRestoreStats();
+
+	/* We'll reconfigure on the first call to XLogPrefetch(). */
+	state->reader = reader;
+	state->prefetcher = NULL;
+	state->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+}
+
+/*
+ * Shut down the prefetching infrastructure, if configured.
+ */
+void
+XLogPrefetchEnd(XLogPrefetchState *state)
+{
+	XLogPrefetchSaveStats();
+
+	if (state->prefetcher)
+		XLogPrefetcherFree(state->prefetcher);
+	state->prefetcher = NULL;
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogReaderState *reader)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	/*
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.  The queue has space for up to
+	 * the highest possible value of the GUC + 1, because our circular buffer
+	 * has a gap between head and tail when full.
+	 */
+	prefetcher = palloc0(sizeof(XLogPrefetcher));
+	prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
+	prefetcher->reader = reader;
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	/* Log final statistics. */
+	ereport(LOG,
+			(errmsg("recovery finished prefetching at %X/%X; "
+					"prefetch = " UINT64_FORMAT ", "
+					"skip_hit = " UINT64_FORMAT ", "
+					"skip_new = " UINT64_FORMAT ", "
+					"skip_fpw = " UINT64_FORMAT ", "
+					"skip_seq = " UINT64_FORMAT ", "
+					"avg_distance = %f, "
+					"avg_queue_depth = %f",
+			 (uint32) (prefetcher->reader->EndRecPtr >> 32),
+			 (uint32) (prefetcher->reader->EndRecPtr),
+			 pg_atomic_read_u64(&Stats->prefetch),
+			 pg_atomic_read_u64(&Stats->skip_hit),
+			 pg_atomic_read_u64(&Stats->skip_new),
+			 pg_atomic_read_u64(&Stats->skip_fpw),
+			 pg_atomic_read_u64(&Stats->skip_seq),
+			 Stats->avg_distance,
+			 Stats->avg_queue_depth)));
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Called when recovery is replaying a new LSN, to check if we can read ahead.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	uint32		reset_request;
+
+	/* If an error has occurred or we've hit the end of the WAL, do nothing. */
+	if (prefetcher->shutdown)
+		return;
+
+	/*
+	 * Have any in-flight prefetches definitely completed, judging by the LSN
+	 * that is currently being replayed?
+	 */
+	XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+	/*
+	 * Do we already have the maximum permitted number of I/Os running
+	 * (according to the information we have)?  If so, we have to wait for at
+	 * least one to complete, so give up early and let recovery catch up.
+	 */
+	if (XLogPrefetcherSaturated(prefetcher))
+		return;
+
+	/*
+	 * Can we drop any filters yet?  This happens when the LSN that is
+	 * currently being replayed has moved past a record that prevents
+	 * prefetching of a block range, such as relation extension.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&Stats->reset_request);
+	if (reset_request != Stats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		Stats->reset_handled = reset_request;
+		prefetcher->avg_distance = 0;
+		prefetcher->avg_queue_depth = 0;
+		prefetcher->samples = 0;
+	}
+
+	/* OK, we can now try reading ahead. */
+	XLogPrefetcherScanRecords(prefetcher, replaying_lsn);
+}
+
+/*
+ * Read ahead as far as we are allowed to, considering the LSN that recovery
+ * is currently replaying.
+ */
+static void
+XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	XLogReaderState *reader = prefetcher->reader;
+	DecodedXLogRecord *record;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	for (;;)
+	{
+		char *error;
+		int64		distance;
+
+		/* If we don't already have a record, then try to read one. */
+		if (prefetcher->record == NULL)
+		{
+			record = XLogReadAhead(reader, &error);
+			if (record == NULL)
+			{
+				/* If we got an error, log it and give up. */
+				if (error)
+				{
+					ereport(LOG, (errmsg("recovery no longer prefetching: %s", error)));
+					prefetcher->shutdown = true;
+					Stats->queue_depth = 0;
+					Stats->distance = 0;
+				}
+				/* Otherwise, we'll try again later when more data is here. */
+				return;
+			}
+			prefetcher->record = record;
+			prefetcher->next_block_id = 0;
+		}
+		else
+		{
+			/*
+			 * We ran out of I/O queue while part way through a record.  We'll
+			 * carry on where we left off, according to next_block_id.
+			 */
+			record = prefetcher->record;
+		}
+
+		/* How far ahead of replay are we now? */
+		distance = record->lsn - replaying_lsn;
+
+		/* Update distance shown in shm. */
+		Stats->distance = distance;
+
+		/* Periodically recompute some statistics. */
+		if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+		{
+			/* Compute online averages. */
+			prefetcher->samples++;
+			if (prefetcher->samples == 1)
+			{
+				prefetcher->avg_distance = Stats->distance;
+				prefetcher->avg_queue_depth = Stats->queue_depth;
+			}
+			else
+			{
+				prefetcher->avg_distance +=
+					(Stats->distance - prefetcher->avg_distance) /
+					prefetcher->samples;
+				prefetcher->avg_queue_depth +=
+					(Stats->queue_depth - prefetcher->avg_queue_depth) /
+					prefetcher->samples;
+			}
+
+			/* Expose it in shared memory. */
+			Stats->avg_distance = prefetcher->avg_distance;
+			Stats->avg_queue_depth = prefetcher->avg_queue_depth;
+
+			/* Also periodically save the simple counters. */
+			XLogPrefetchSaveStats();
+
+			prefetcher->next_sample_lsn =
+				replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
+		}
+
+		/* Are we not far enough ahead? */
+		if (distance <= 0)
+		{
+			/* XXX Is this still possible? */
+			prefetcher->record = NULL;		/* skip this record */
+			continue;
+		}
+
+		/*
+		 * If this is a record that creates a new SMGR relation, we'll avoid
+		 * prefetching anything from that rnode until it has been replayed.
+		 */
+		if (replaying_lsn < record->lsn &&
+			record->header.xl_rmid == RM_SMGR_ID &&
+			(record->header.xl_info & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+		{
+			xl_smgr_create *xlrec = (xl_smgr_create *) record->main_data;
+
+			XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0, record->lsn);
+		}
+
+		/* Scan the record's block references. */
+		if (!XLogPrefetcherScanBlocks(prefetcher))
+			return;
+
+		/* Advance to the next record. */
+		prefetcher->record = NULL;
+	}
+}
+
+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+	DecodedXLogRecord *record = prefetcher->record;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	/*
+	 * We might already have been partway through processing this record when
+	 * our queue became saturated, so we need to start where we left off.
+	 */
+	for (int block_id = prefetcher->next_block_id;
+		 block_id <= record->max_block_id;
+		 ++block_id)
+	{
+		DecodedBkpBlock *block = &record->blocks[block_id];
+		PrefetchBufferResult prefetch;
+		SMgrRelation reln;
+
+		/* Ignore everything but the main fork for now. */
+		if (block->forknum != MAIN_FORKNUM)
+			continue;
+
+		/*
+		 * If there is a full page image attached, we won't be reading the
+		 * page, so you might think we should skip it.  However, if the
+		 * underlying filesystem uses larger logical blocks than us, it
+		 * might still need to perform a read-before-write some time later.
+		 * Therefore, only prefetch if configured to do so.
+		 */
+		if (block->has_image && !recovery_prefetch_fpw)
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
+			continue;
+		}
+
+		/*
+		 * If this block will initialize a new page then it's probably a
+		 * relation extension.  Since that might create a new segment, we
+		 * can't try to prefetch this block until the record has been
+		 * replayed, or we might try to open a file that doesn't exist yet.
+		 */
+		if (block->flags & BKPBLOCK_WILL_INIT)
+		{
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+									record->lsn);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Should we skip this block due to a filter? */
+		if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+			continue;
+		}
+
+		/* Fast path for repeated references to the same relation. */
+		if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+		{
+			/*
+			 * If this is a repeat access to the same block, then skip it.
+			 *
+			 * XXX We could also check for last_blkno + 1 too, and also update
+			 * last_blkno; it's not clear if the kernel would do a better job
+			 * of sequential prefetching.
+			 */
+			if (block->blkno == prefetcher->last_blkno)
+			{
+				pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
+				continue;
+			}
+
+			/* We can avoid calling smgropen(). */
+			reln = prefetcher->last_reln;
+		}
+		else
+		{
+			/* Otherwise we have to open it. */
+			reln = smgropen(block->rnode, InvalidBackendId);
+			prefetcher->last_rnode = block->rnode;
+			prefetcher->last_reln = reln;
+		}
+		prefetcher->last_blkno = block->blkno;
+
+		/* Try to prefetch this block! */
+		prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+		if (BufferIsValid(prefetch.recent_buffer))
+		{
+			/*
+			 * It was already cached, so do nothing.  Perhaps in future we
+			 * could remember the buffer so that recovery doesn't have to look
+			 * it up again.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
+		}
+		else if (prefetch.initiated_io)
+		{
+			/*
+			 * I/O has possibly been initiated (though we don't know if it
+			 * was already cached by the kernel, so we just have to assume
+			 * that it has due to lack of better information).  Record
+			 * this as an I/O in progress until eventually we replay this
+			 * LSN.
+			 */
+			pg_atomic_unlocked_add_fetch_u64(&Stats->prefetch, 1);
+			XLogPrefetcherInitiatedIO(prefetcher, record->lsn);
+			/*
+			 * If the queue is now full, we'll have to wait before processing
+			 * any more blocks from this record, or move to a new record if
+			 * that was the last block.
+			 */
+			if (XLogPrefetcherSaturated(prefetcher))
+			{
+				prefetcher->next_block_id = block_id + 1;
+				return false;
+			}
+		}
+		else
+		{
+			/*
+			 * Neither cached nor initiated.  The underlying segment file
+			 * doesn't exist.  Presumably it will be unlinked by a later WAL
+			 * record.  When recovery reads this block, it will use the
+			 * EXTENSION_CREATE_RECOVERY flag.  We certainly don't want to do
+			 * that sort of thing while merely prefetching, so let's just
+			 * ignore references to this relation until this record is
+			 * replayed, and let recovery create the dummy file or complain if
+			 * something is wrong.
+			 */
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+									record->lsn);
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mod required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&Stats->reset_request) != Stats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&Stats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&Stats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_new));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_fpw));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_seq));
+	values[6] = Int32GetDatum(Stats->distance);
+	values[7] = Int32GetDatum(Stats->queue_depth);
+	values[8] = Float4GetDatum(Stats->avg_distance);
+	values[9] = Float4GetDatum(Stats->avg_queue_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Compute (n + 1) % prefetch_queue_size, assuming n < prefetch_queue_size,
+ * without using division.
+ */
+static inline int
+XLogPrefetcherNext(XLogPrefetcher *prefetcher, int n)
+{
+	int		next = n + 1;
+
+	return next == prefetcher->prefetch_queue_size ? 0 : next;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the (presumably lower) block
+		 * number there because we don't want to have to track individual
+		 * blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of the
+	 * time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+												   HASH_FIND, NULL);
+
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Insert an LSN into the queue.  The queue must not be full already.  This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+						  XLogRecPtr prefetching_lsn)
+{
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+	prefetcher->prefetch_queue[prefetcher->prefetch_head] = prefetching_lsn;
+	prefetcher->prefetch_head =
+		XLogPrefetcherNext(prefetcher, prefetcher->prefetch_head);
+	Stats->queue_depth++;
+	Assert(Stats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet?  That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches.  For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+		   prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+	{
+		prefetcher->prefetch_tail =
+			XLogPrefetcherNext(prefetcher, prefetcher->prefetch_tail);
+		Stats->queue_depth--;
+		Assert(Stats->queue_depth >= 0);
+	}
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+	int		next = XLogPrefetcherNext(prefetcher, prefetcher->prefetch_head);
+
+	return next == prefetcher->prefetch_tail;
+}
+
+void
+assign_recovery_prefetch(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
+
+void
+assign_recovery_prefetch_fpw(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch_fpw = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 22e5d5ff64..fb0d80e7c7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -866,6 +866,8 @@ err:
 	/*
 	 * Invalidate the read state, if this was an error. We might read from a
 	 * different source after failure.
+	 *
+	 * XXX !?!
 	 */
 	if (readOff < 0 || state->errormsg_buf[0] != '\0')
 		XLogReaderInvalReadState(state);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b140c210bc..5e8e3848b6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -841,6 +841,20 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.skip_hit,
+            s.skip_new,
+            s.skip_fpw,
+            s.skip_seq,
+            s.distance,
+            s.queue_depth,
+            s.avg_distance,
+            s.avg_queue_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 43523c464a..a7010154cd 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "access/xlogprefetch.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -296,6 +297,7 @@ static PgStat_WalStats walStats;
 static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
 static PgStat_ReplSlotStats *replSlotStats;
 static int	nReplSlotStats;
+static PgStat_RecoveryPrefetchStats recoveryPrefetchStats;
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -373,6 +375,7 @@ static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
 static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
+static void pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1396,11 +1399,20 @@ pgstat_reset_shared_counters(const char *target)
 		msg.m_resettarget = RESET_BGWRITER;
 	else if (strcmp(target, "wal") == 0)
 		msg.m_resettarget = RESET_WAL;
+	else if (strcmp(target, "prefetch_recovery") == 0)
+	{
+		/*
+		 * We can't ask the stats collector to do this for us as it is not
+		 * attached to shared memory.
+		 */
+		XLogPrefetchRequestResetStats();
+		return;
+	}
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("unrecognized reset target: \"%s\"", target),
-				 errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
+				 errhint("Target must be \"archiver\", \"bgwriter\", \"wal\" or \"prefetch_recovery\".")));
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
 	pgstat_send(&msg, sizeof(msg));
@@ -2846,6 +2858,22 @@ pgstat_fetch_replslot(int *nslots_p)
 	return replSlotStats;
 }
 
+/*
+ * ---------
+ * pgstat_fetch_recoveryprefetch() -
+ *
+ *	Support function for restoring the counters managed by xlogprefetch.c.
+ * ---------
+ */
+PgStat_RecoveryPrefetchStats *
+pgstat_fetch_recoveryprefetch(void)
+{
+	backend_read_statsfile();
+
+	return &recoveryPrefetchStats;
+}
+
+
 /* ------------------------------------------------------------
  * Functions for management of the shared-memory PgBackendStatus array
  * ------------------------------------------------------------
@@ -4673,6 +4701,23 @@ pgstat_send_slru(void)
 }
 
 
+/* ----------
+ * pgstat_send_recoveryprefetch() -
+ *
+ *		Send recovery prefetch statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats)
+{
+	PgStat_MsgRecoveryPrefetch msg;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYPREFETCH);
+	msg.m_stats = *stats;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
 /* ----------
  * PgstatCollectorMain() -
  *
@@ -4886,6 +4931,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_slru(&msg.msg_slru, len);
 					break;
 
+				case PGSTAT_MTYPE_RECOVERYPREFETCH:
+					pgstat_recv_recoveryprefetch(&msg.msg_recoveryprefetch, len);
+					break;
+
 				case PGSTAT_MTYPE_FUNCSTAT:
 					pgstat_recv_funcstat(&msg.msg_funcstat, len);
 					break;
@@ -5167,6 +5216,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
 	rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
 	(void) rc;					/* we'll check for error with ferror */
 
+	/*
+	 * Write recovery prefetch stats struct
+	 */
+	rc = fwrite(&recoveryPrefetchStats, sizeof(recoveryPrefetchStats), 1,
+				fpout);
+	(void) rc;					/* we'll check for error with ferror */
+
 	/*
 	 * Walk through the database table.
 	 */
@@ -5440,6 +5496,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 	memset(&archiverStats, 0, sizeof(archiverStats));
 	memset(&walStats, 0, sizeof(walStats));
 	memset(&slruStats, 0, sizeof(slruStats));
+	memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
 
 	/*
 	 * Set the current timestamp (will be kept only in case we can't load an
@@ -5545,6 +5602,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 		goto done;
 	}
 
+	/*
+	 * Read recoveryPrefetchStats struct
+	 */
+	if (fread(&recoveryPrefetchStats, 1, sizeof(recoveryPrefetchStats),
+			  fpin) != sizeof(recoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
+		goto done;
+	}
+
 	/*
 	 * We found an existing collector stats file. Read it and put all the
 	 * hashtable entries into place.
@@ -5863,6 +5932,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 	PgStat_WalStats myWalStats;
 	PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
 	PgStat_ReplSlotStats myReplSlotStats;
+	PgStat_RecoveryPrefetchStats myRecoveryPrefetchStats;
 	FILE	   *fpin;
 	int32		format_id;
 	const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -5939,6 +6009,18 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 		return false;
 	}
 
+	/*
+	 * Read recovery prefetch stats struct
+	 */
+	if (fread(&myRecoveryPrefetchStats, 1, sizeof(myRecoveryPrefetchStats),
+			  fpin) != sizeof(myRecoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		FreeFile(fpin);
+		return false;
+	}
+
 	/* By default, we're going to return the timestamp of the global file. */
 	*ts = myGlobalStats.stats_timestamp;
 
@@ -6122,6 +6204,13 @@ backend_read_statsfile(void)
 		if (ok && file_ts >= min_ts)
 			break;
 
+		/*
+		 * If we're in crash recovery, the collector may not even be running,
+		 * so work with what we have.
+		 */
+		if (InRecovery)
+			break;
+
 		/* Not there or too old, so kick the collector and wait a bit */
 		if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
 			pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
@@ -6818,6 +6907,18 @@ pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
 	slruStats[msg->m_index].truncate += msg->m_truncate;
 }
 
+/* ----------
+ * pgstat_recv_recoveryprefetch() -
+ *
+ *	Process a recovery prefetch message.
+ * ----------
+ */
+static void
+pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len)
+{
+	recoveryPrefetchStats = msg->m_stats;
+}
+
 /* ----------
  * pgstat_recv_recoveryconflict() -
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..912a8cfcb6 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/xlogprefetch.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -126,6 +127,7 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, PredicateLockShmemSize());
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
+		size = add_size(size, XLogPrefetchShmemSize());
 		size = add_size(size, CLOGShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
@@ -216,6 +218,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 878fcc2236..511509ebd5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -37,6 +37,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetch.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/storage.h"
@@ -202,6 +203,7 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1248,6 +1250,32 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch referenced blocks during recovery"),
+			gettext_noop("Read ahead of the currenty replay position to find uncached blocks.")
+		},
+		&recovery_prefetch,
+		/* No point in enabling this on systems without a suitable API. */
+#ifdef USE_PREFETCH
+		true,
+#else
+		false,
+#endif
+		NULL, assign_recovery_prefetch, NULL
+	},
+	{
+		{"recovery_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch blocks that have full page images in the WAL"),
+			gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+						 "entirely overwritten, but if the logical page size of the filesystem is "
+						 "larger than PostgreSQL's, this can be beneficial.  This option has no "
+						 "effect unless recovery_prefetch is enabled.")
+		},
+		&recovery_prefetch_fpw,
+		false,
+		NULL, assign_recovery_prefetch_fpw, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2626,6 +2654,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
+			gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
+			GUC_UNIT_BYTE
+		},
+		&wal_decode_buffer_size,
+		512 * 1024, 64 * 1024, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the size of WAL files held for standby servers."),
@@ -2946,7 +2985,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -11692,6 +11732,20 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b7fb2ec1fe..4288f2f37f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -234,6 +234,12 @@
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
 
+# - Prefetching during recovery -
+
+#wal_decode_buffer_size = 512kB		# lookahead window used for prefetching
+#recovery_prefetch = on			# whether to prefetch pages logged with FPW
+#recovery_prefetch_fpw = off		# whether to prefetch pages logged with FPW
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..4f58fa029a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,7 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern int	wal_decode_buffer_size;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/access/xlogprefetch.h b/src/include/access/xlogprefetch.h
new file mode 100644
index 0000000000..8c04ff8bce
--- /dev/null
+++ b/src/include/access/xlogprefetch.h
@@ -0,0 +1,79 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCH_H
+#define XLOGPREFETCH_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern bool recovery_prefetch;
+extern bool recovery_prefetch_fpw;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern int XLogPrefetchReconfigureCount;
+
+typedef struct XLogPrefetchState
+{
+	XLogReaderState *reader;
+	XLogPrefetcher *prefetcher;
+	int			reconfigure_count;
+} XLogPrefetchState;
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchReconfigure(void);
+extern void XLogPrefetchRequestResetStats(void);
+
+extern void XLogPrefetchBegin(XLogPrefetchState *state, XLogReaderState *reader);
+extern void XLogPrefetchEnd(XLogPrefetchState *state);
+
+/* Functions exposed only for the use of XLogPrefetch(). */
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogReaderState *reader);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch,
+									XLogRecPtr replaying_lsn);
+
+/*
+ * Tell the prefetching module that we are now replaying a given LSN, so that
+ * it can decide how far ahead to read in the WAL, if configured.
+ */
+static inline void
+XLogPrefetch(XLogPrefetchState *state, XLogRecPtr replaying_lsn)
+{
+	/*
+	 * Handle any configuration changes.  Rather than trying to deal with
+	 * various parameter changes, we just tear down and set up a new
+	 * prefetcher if anything we depend on changes.
+	 */
+	if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
+	{
+		/* If we had a prefetcher, tear it down. */
+		if (state->prefetcher)
+		{
+			XLogPrefetcherFree(state->prefetcher);
+			state->prefetcher = NULL;
+		}
+		/* If we want a prefetcher, set it up. */
+		if (recovery_prefetch > 0)
+			state->prefetcher = XLogPrefetcherAllocate(state->reader);
+		state->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	if (state->prefetcher)
+		XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
+}
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 22970f46cd..4d3d73c30c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6183,6 +6183,14 @@
   prorettype => 'bool', proargtypes => '',
   prosrc => 'pg_is_wal_replay_paused' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,float4,float4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth,avg_distance,avg_queue_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ad3a18dfc9..329eaed0b9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -64,6 +64,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_WAL,
 	PGSTAT_MTYPE_SLRU,
+	PGSTAT_MTYPE_RECOVERYPREFETCH,
 	PGSTAT_MTYPE_FUNCSTAT,
 	PGSTAT_MTYPE_FUNCPURGE,
 	PGSTAT_MTYPE_RECOVERYCONFLICT,
@@ -186,6 +187,19 @@ typedef struct PgStat_TableXactStatus
 	struct PgStat_TableXactStatus *next;	/* next of same subxact */
 } PgStat_TableXactStatus;
 
+/*
+ * Recovery prefetching statistics persisted on disk by pgstat.c, but kept in
+ * shared memory by xlogprefetch.c.
+ */
+typedef struct PgStat_RecoveryPrefetchStats
+{
+	PgStat_Counter prefetch;
+	PgStat_Counter skip_hit;
+	PgStat_Counter skip_new;
+	PgStat_Counter skip_fpw;
+	PgStat_Counter skip_seq;
+	TimestampTz stat_reset_timestamp;
+} PgStat_RecoveryPrefetchStats;
 
 /* ------------------------------------------------------------
  * Message formats follow
@@ -500,6 +514,15 @@ typedef struct PgStat_MsgReplSlot
 	PgStat_Counter m_stream_bytes;
 } PgStat_MsgReplSlot;
 
+/* ----------
+ * PgStat_MsgRecoveryPrefetch			Sent by XLogPrefetch to save statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgRecoveryPrefetch
+{
+	PgStat_MsgHdr m_hdr;
+	PgStat_RecoveryPrefetchStats m_stats;
+} PgStat_MsgRecoveryPrefetch;
 
 /* ----------
  * PgStat_MsgRecoveryConflict	Sent by the backend upon recovery conflict
@@ -647,6 +670,7 @@ typedef union PgStat_Msg
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgWal msg_wal;
 	PgStat_MsgSLRU msg_slru;
+	PgStat_MsgRecoveryPrefetch msg_recoveryprefetch;
 	PgStat_MsgFuncstat msg_funcstat;
 	PgStat_MsgFuncpurge msg_funcpurge;
 	PgStat_MsgRecoveryConflict msg_recoveryconflict;
@@ -1552,6 +1576,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
 extern void pgstat_send_wal(void);
+extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1569,6 +1594,7 @@ extern PgStat_GlobalStats *pgstat_fetch_global(void);
 extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
 extern PgStat_ReplSlotStats *pgstat_fetch_replslot(int *nslots_p);
+extern PgStat_RecoveryPrefetchStats *pgstat_fetch_recoveryprefetch(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 6a20a3bcec..bb2eede693 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -442,4 +442,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetch.c */
+extern void assign_recovery_prefetch(bool new_value, void *extra);
+extern void assign_recovery_prefetch_fpw(bool new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6293ab57bc..69f6c69c6b 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1869,6 +1869,17 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.skip_hit,
+    s.skip_new,
+    s.skip_fpw,
+    s.skip_seq,
+    s.distance,
+    s.queue_depth,
+    s.avg_distance,
+    s.avg_queue_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth, avg_distance, avg_queue_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.20.1

v15-0005-WIP-Avoid-extra-buffer-lookup-when-prefetching-W.patch (text/x-patch)
From 2f6d690cefc0cad8cbd8b88dbed4d688399c6916 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 14 Sep 2020 23:20:55 +1200
Subject: [PATCH v15 5/6] WIP: Avoid extra buffer lookup when prefetching WAL
 blocks.

Provide a some workspace in decoded WAL records, so that we can remember
which buffer recently contained we found a block cached in, for later
use when replaying the record.  Provide a new way to look up a
recently-known buffer and check if it's still valid and has the right
tag.

XXX Needs review to figure out if it's safe or steamrolling over subtleties
---
 src/backend/access/transam/xlog.c         |  2 +-
 src/backend/access/transam/xlogprefetch.c |  6 ++--
 src/backend/access/transam/xlogreader.c   | 13 ++++++++
 src/backend/access/transam/xlogutils.c    | 23 ++++++++++---
 src/backend/storage/buffer/bufmgr.c       | 40 +++++++++++++++++++++++
 src/backend/storage/freespace/freespace.c |  3 +-
 src/include/access/xlogreader.h           |  7 ++++
 src/include/access/xlogutils.h            |  3 +-
 src/include/storage/bufmgr.h              |  2 ++
 9 files changed, 89 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 594f09d7f2..7fbd86320e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1463,7 +1463,7 @@ checkXLogConsistency(XLogReaderState *record)
 		 * temporary page.
 		 */
 		buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									 RBM_NORMAL_NO_LOG);
+									 RBM_NORMAL_NO_LOG, InvalidBuffer);
 		if (!BufferIsValid(buf))
 			continue;
 
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
index a8149b946c..948a63f25d 100644
--- a/src/backend/access/transam/xlogprefetch.c
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -624,10 +624,10 @@ XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
 		if (BufferIsValid(prefetch.recent_buffer))
 		{
 			/*
-			 * It was already cached, so do nothing.  Perhaps in future we
-			 * could remember the buffer so that recovery doesn't have to look
-			 * it up again.
+			 * It was already cached, so do nothing.  We'll remember the
+			 * buffer, so that recovery can try to avoid looking it up again.
 			 */
+			block->recent_buffer = prefetch.recent_buffer;
 			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
 		}
 		else if (prefetch.initiated_io)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index fb0d80e7c7..9640899ea7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1651,6 +1651,8 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
 
+			blk->recent_buffer = InvalidBuffer;
+
 			COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 			/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
 			if (blk->has_data && blk->data_len == 0)
@@ -1860,6 +1862,15 @@ err:
 bool
 XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 				   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+	return XLogRecGetRecentBuffer(record, block_id, rnode, forknum, blknum,
+								  NULL);
+}
+
+bool
+XLogRecGetRecentBuffer(XLogReaderState *record, uint8 block_id,
+					   RelFileNode *rnode, ForkNumber *forknum,
+					   BlockNumber *blknum, Buffer *recent_buffer)
 {
 	DecodedBkpBlock *bkpb;
 
@@ -1874,6 +1885,8 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 		*forknum = bkpb->forknum;
 	if (blknum)
 		*blknum = bkpb->blkno;
+	if (recent_buffer)
+		*recent_buffer = bkpb->recent_buffer;
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index eb7798f488..45a9f180b4 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -335,11 +335,13 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber blkno;
+	Buffer		recent_buffer;
 	Page		page;
 	bool		zeromode;
 	bool		willinit;
 
-	if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+	if (!XLogRecGetRecentBuffer(record, block_id, &rnode, &forknum, &blkno,
+								&recent_buffer))
 	{
 		/* Caller specified a bogus block_id */
 		elog(PANIC, "failed to locate backup block with ID %d", block_id);
@@ -361,7 +363,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	{
 		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
+									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
+									  recent_buffer);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			elog(ERROR, "failed to restore block image");
@@ -390,7 +393,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	}
 	else
 	{
-		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode);
+		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode,
+									  recent_buffer);
 		if (BufferIsValid(*buf))
 		{
 			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
@@ -438,7 +442,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  */
 Buffer
 XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-					   BlockNumber blkno, ReadBufferMode mode)
+					   BlockNumber blkno, ReadBufferMode mode,
+					   Buffer recent_buffer)
 {
 	BlockNumber lastblock;
 	Buffer		buffer;
@@ -446,6 +451,15 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 
 	Assert(blkno != P_NEW);
 
+	/* Do we have a clue where the buffer might be already? */
+	if (BufferIsValid(recent_buffer) &&
+		mode == RBM_NORMAL &&
+		ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+	{
+		buffer = recent_buffer;
+		goto recent_buffer_fast_path;
+	}
+
 	/* Open the relation at smgr level */
 	smgr = smgropen(rnode, InvalidBackendId);
 
@@ -504,6 +518,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 		}
 	}
 
+recent_buffer_fast_path:
 	if (mode == RBM_NORMAL)
 	{
 		/* check that page has been initialized */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c5e8707151..daf93f8302 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -598,6 +598,46 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 	}
 }
 
+/*
+ * ReadRecentBuffer -- try to refind a buffer that we suspect holds a given
+ *		block
+ *
+ * Return true if the buffer is valid, has the correct tag, and we managed
+ * to pin it.
+ */
+bool
+ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum,
+				 Buffer recent_buffer)
+{
+	BufferDesc *bufHdr;
+	BufferTag	tag;
+
+	Assert(BufferIsValid(recent_buffer));
+
+	/* Look up the header by index, and try to pin if shared. */
+	if (BufferIsLocal(recent_buffer))
+		bufHdr = GetBufferDescriptor(-recent_buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(recent_buffer - 1);
+		ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+		if (!PinBuffer(bufHdr, NULL))
+		{
+			/* Not valid, couldn't pin it. */
+			UnpinBuffer(bufHdr, true);
+			return false;
+		}
+	}
+
+	/* Does the tag match? */
+	INIT_BUFFERTAG(tag, rnode, forkNum, blockNum);
+	if (BUFFERTAGS_EQUAL(tag, bufHdr->tag))
+		return true;
+
+	/* Nope -- this isn't the block we seek. */
+	UnpinBuffer(bufHdr, true);
+	return false;
+}
 
 /*
  * ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 6a96126b0c..c998b52c13 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -210,7 +210,8 @@ XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 	blkno = fsm_logical_to_physical(addr);
 
 	/* If the page doesn't exist already, extend */
-	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR);
+	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR,
+								 InvalidBuffer);
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(buf);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 44f8847030..616e591259 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -39,6 +39,7 @@
 #endif
 
 #include "access/xlogrecord.h"
+#include "storage/buf.h"
 
 /* WALOpenSegment represents a WAL segment being read. */
 typedef struct WALOpenSegment
@@ -126,6 +127,9 @@ typedef struct
 	ForkNumber	forknum;
 	BlockNumber blkno;
 
+	/* Workspace for remembering last known buffer holding this block. */
+	Buffer		recent_buffer;
+
 	/* copy of the fork_flags field from the XLogRecordBlockHeader */
 	uint8		flags;
 
@@ -377,5 +381,8 @@ extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *
 extern bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 							   RelFileNode *rnode, ForkNumber *forknum,
 							   BlockNumber *blknum);
+extern bool XLogRecGetRecentBuffer(XLogReaderState *record, uint8 block_id,
+								   RelFileNode *rnode, ForkNumber *forknum,
+								   BlockNumber *blknum, Buffer *recent_buffer);
 
 #endif							/* XLOGREADER_H */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 374c1b16ce..a0c2b60c57 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -42,7 +42,8 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 													Buffer *buf);
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-									 BlockNumber blkno, ReadBufferMode mode);
+									 BlockNumber blkno, ReadBufferMode mode,
+									 Buffer recent_buffer);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..c3280b754e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -176,6 +176,8 @@ extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_r
 												 BlockNumber blockNum);
 extern PrefetchBufferResult PrefetchBuffer(Relation reln, ForkNumber forkNum,
 										   BlockNumber blockNum);
+extern bool ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum,
+							 BlockNumber blockNum, Buffer recent_buffer);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
 extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BlockNumber blockNum, ReadBufferMode mode,
-- 
2.20.1

#66Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#65)
Re: WIP: WAL prefetch (another approach)

Hi,

On 2020-12-24 16:06:38 +1300, Thomas Munro wrote:

From 85187ee6a1dd4c68ba70cfbce002a8fa66c99925 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 28 Mar 2020 11:42:59 +1300
Subject: [PATCH v15 1/6] Add pg_atomic_unlocked_add_fetch_XXX().

Add a variant of pg_atomic_add_fetch_XXX with no barrier semantics, for
cases where you only want to avoid the possibility that a concurrent
pg_atomic_read_XXX() sees a torn/partial value. On modern
architectures, this is simply value++, but there is a fallback to
spinlock emulation.

Wouldn't it be sufficient to implement this as one function, implemented as
pg_atomic_write_u32(val, pg_atomic_read_u32(val) + 1)?
Then we'd not need any ifdefs.

+ * pg_atomic_unlocked_add_fetch_u32 - atomically add to variable

It's really not adding "atomically"...

+ * Like pg_atomic_unlocked_write_u32, guarantees only that partial values
+ * cannot be observed.

Maybe add a note saying that this in particular means that
modifications could be lost when used concurrently?
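
For illustration only, a minimal sketch along those lines (not part of the
posted patch; it assumes the existing pg_atomic_read_u32()/pg_atomic_write_u32()
API and is only meant to show the idea):

/*
 * Non-atomic add-and-fetch with no barrier semantics.  Guarantees only that
 * concurrent readers never see a torn value; in particular, concurrent
 * modifications may be lost.
 */
static inline uint32
pg_atomic_unlocked_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
{
	uint32		newval = pg_atomic_read_u32(ptr) + add_;

	pg_atomic_write_u32(ptr, newval);
	return newval;
}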

Greetings,

Andres Freund

#67Thomas Munro
thomas.munro@gmail.com
In reply to: Stephen Frost (#61)
Re: WIP: WAL prefetch (another approach)

On Sat, Dec 5, 2020 at 7:27 AM Stephen Frost <sfrost@snowman.net> wrote:

* Thomas Munro (thomas.munro@gmail.com) wrote:

I just noticed this thread proposing to retire pg_standby on that
basis:

/messages/by-id/20201029024412.GP5380@telsasoft.com

I'd be happy to see that land, to fix this problem with my plan. But
are there other people writing restore scripts that block that would
expect them to work on PG14?

Ok, I think I finally get the concern that you're raising here-
basically that if a restore command was written to sit around and wait
for WAL segments to arrive, instead of just returning to PG and saying
"WAL segment not found", that this would be a problem if we are running
out ahead of the applying process and asking for WAL.

The thing is- that's an outright broken restore command script in the
first place. If PG is in standby mode, we'll ask again if we get an
error result indicating that the WAL file wasn't found. The restore
command documentation is quite clear on this point:

The command will be asked for file names that are not present in the
archive; it must return nonzero when so asked.

There's no "it can wait around for the next file to show up if it wants
to" in there- it *must* return nonzero when asked for files that don't
exist.

Well the manual does actually describe how to write your own version
of pg_standby, referred to as a "waiting restore script":

https://www.postgresql.org/docs/13/log-shipping-alternative.html

I've now poked that other thread threatening to commit the removal of
pg_standby, and while I was there, also to remove the section on how
to write your own (it's possible that I missed some other reference to
the concept elsewhere, I'll need to take another look).

So, I don't think that we really need to stress over this. The fact
that pg_standby offers options to have it wait instead of just returning
a non-zero error-code and letting the loop that we already do in the
core code seems like it's really just a legacy thing from before we were
doing that and probably should have been ripped out long ago... Even
more reason to get rid of pg_standby tho, imv, we haven't been properly
adjusting it when we've been making changes to the core code, it seems.

So far I haven't heard from anyone who thinks we should keep this old
facility (as useful as it was back then when it was the only way), so
I hope we can now quietly drop it. It's not strictly an obstacle to
this recovery prefetching work, but it'd interact confusingly in hard
to describe ways, and it seems strange to perpetuate something that
many were already proposing to drop due to obsolescence. Thanks for
the comments/sanity check.

#68Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Thomas Munro (#65)
1 attachment(s)
Re: WIP: WAL prefetch (another approach)

Hi,

I did a bunch of tests on v15, mostly to assess how much the prefetching
could help. The most interesting test I did was this:

1) primary instance on a box with 16/32 cores, 64GB RAM, NVMe SSD

2) replica on small box with 4 cores, 8GB RAM, SSD RAID

3) pause replication on the replica (pg_wal_replay_pause)

4) initialize pgbench scale 2000 (fits into RAM on primary, while on
replica it's about 4x RAM)

5) run 1h pgbench: pgbench -N -c 16 -j 4 -T 3600 test

6) resume replication (pg_wal_replay_resume)

7) measure how long it takes to catch up, monitor lag

This is a nicely reproducible test case; it eliminates the influence of
network speed and so on.

Attached is a chart showing the lag with and without the prefetching. In
both cases we start with ~140GB of redo lag, and the chart shows how
quickly the replica applies that. The "waves" are checkpoints, where
right after a checkpoint the redo gets much faster thanks to FPIs and
then slows down as it gets to parts without them (having to do
synchronous random reads).

With master, it'd take ~16000 seconds to catch up. I don't have the
exact number, because I got tired of waiting, but the estimate is likely
accurate (judging by other tests and how regular the progress is).

With WAL prefetching enabled (I bumped up the buffer to 2MB, and
prefetch limit to 500, but that was mostly just arbitrary choice), it
finishes in ~3200 seconds. This includes replication of the pgbench
initialization, which took ~200 seconds and where prefetching is mostly
useless. That's a damn pretty improvement, I guess!

In a way, this means the tiny replica would be able to keep up with a
much larger machine, where everything is in memory.

One comment about the patch - the postgresql.conf.sample change says:

#recovery_prefetch = on # whether to prefetch pages logged with FPW
#recovery_prefetch_fpw = off # whether to prefetch pages logged with FPW

but clearly that comment is only correct for recovery_prefetch_fpw; the first
GUC enables prefetching in general.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

wal-prefetching.png (image/png)
#69Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#68)
Re: WIP: WAL prefetch (another approach)

On Thu, Feb 4, 2021 at 1:40 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

With master, it'd take ~16000 seconds to catch up. I don't have the
exact number, because I got tired of waiting, but the estimate is likely
accurate (judging by other tests and how regular the progress is).

With WAL prefetching enabled (I bumped up the buffer to 2MB, and
prefetch limit to 500, but that was mostly just arbitrary choice), it
finishes in ~3200 seconds. This includes replication of the pgbench
initialization, which took ~200 seconds and where prefetching is mostly
useless. That's a damn pretty improvement, I guess!

Hi Tomas,

Sorry for my slow response -- I've been catching up after some
vacation time. Thanks very much for doing all this testing work!
Those results are very good, and it's nice to see such compelling
cases even with FPI enabled.

I'm hoping to commit this in the next few weeks. There are a few
little todos to tidy up, and I need to do some more review/testing of
the error handling and edge cases. Any ideas on how to battle test it
are very welcome. I'm also currently testing how it interacts with
some other patches that are floating around. More soon.

#recovery_prefetch = on # whether to prefetch pages logged with FPW
#recovery_prefetch_fpw = off # whether to prefetch pages logged with FPW

but clearly that comment is only correct for recovery_prefetch_fpw; the first
GUC enables prefetching in general.

Ack, thanks.

#70Stephen Frost
sfrost@snowman.net
In reply to: Thomas Munro (#65)
Re: WIP: WAL prefetch (another approach)

Greetings,

* Thomas Munro (thomas.munro@gmail.com) wrote:

Rebase attached.

Subject: [PATCH v15 4/6] Prefetch referenced blocks during recovery.
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4b60382778..ac27392053 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3366,6 +3366,64 @@ include_dir 'conf.d'

[...]

+     <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+      <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks that were logged with full page images,
+        during recovery.  Often this doesn't help, since such blocks will not
+        be read the first time they are needed and might remain in the buffer

The "might" above seems slightly confusing- such blocks will remain in
shared buffers until/unless they're forced out, right?

+        pool after that.  However, on file systems with a block size larger
+        than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a
+        costly read-before-write when a blocks are later written.
+        The default is off.

"when a blocks" above doesn't sound quite right, maybe reword this as:

"prefetching can avoid a costly read-before-write when WAL replay
reaches the block that needs to be written."

diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index d1c3893b14..c51c431398 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -720,6 +720,23 @@
<acronym>WAL</acronym> call being logged to the server log. This
option might be replaced by a more general mechanism in the future.
</para>
+
+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.  The
+   prefetching mechanism is most likely to be effective on systems
+   with <varname>full_page_writes</varname> set to
+   <varname>off</varname> (where that is safe), and where the working
+   set is larger than RAM.  By default, prefetching in recovery is enabled
+   on operating systems that have <function>posix_fadvise</function>
+   support.
+  </para>
</sect1>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c

@@ -3697,7 +3699,6 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
xlogfname);
set_ps_display(activitymsg);
-
restoredFromArchive = RestoreArchivedFile(path, xlogfname,
"RECOVERYXLOG",
wal_segment_size,

@@ -12566,6 +12585,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
else
havedata = false;
}
+
if (havedata)
{
/*

Random whitespace change hunks..?

diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.  The queue has space for up
+	 * the highest possible value of the GUC + 1, because our circular buffer
+	 * has a gap between head and tail when full.

Seems like "to" is missing- "The queue has space for up *to* the highest
possible value of the GUC + 1" ? Maybe also "between the head and the
tail when full".

+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+	DecodedXLogRecord *record = prefetcher->record;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	/*
+	 * We might already have been partway through processing this record when
+	 * our queue became saturated, so we need to start where we left off.
+	 */
+	for (int block_id = prefetcher->next_block_id;
+		 block_id <= record->max_block_id;
+		 ++block_id)
+	{
+		DecodedBkpBlock *block = &record->blocks[block_id];
+		PrefetchBufferResult prefetch;
+		SMgrRelation reln;
+
+		/* Ignore everything but the main fork for now. */
+		if (block->forknum != MAIN_FORKNUM)
+			continue;
+
+		/*
+		 * If there is a full page image attached, we won't be reading the
+		 * page, so you might think we should skip it.  However, if the
+		 * underlying filesystem uses larger logical blocks than us, it
+		 * might still need to perform a read-before-write some time later.
+		 * Therefore, only prefetch if configured to do so.
+		 */
+		if (block->has_image && !recovery_prefetch_fpw)
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
+			continue;
+		}

FPIs in the stream aren't going to just avoid reads when the
filesystem's block size matches PG's- they're also going to avoid
subsequent modifications to the block, provided we don't end up pushing
that block out of shared buffers, right?

That is, if you have an empty shared buffers and see:

Block 5 FPI
Block 6 FPI
Block 5 Update
Block 6 Update

it seems like, with this patch, we're going to Prefetch Block 5 & 6,
even though we almost certainly won't actually need them.

+		/* Fast path for repeated references to the same relation. */
+		if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+		{
+			/*
+			 * If this is a repeat access to the same block, then skip it.
+			 *
+			 * XXX We could also check for last_blkno + 1 too, and also update
+			 * last_blkno; it's not clear if the kernel would do a better job
+			 * of sequential prefetching.
+			 */
+			if (block->blkno == prefetcher->last_blkno)
+			{
+				pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
+				continue;
+			}

I'm sure this will help with some cases, but it wouldn't help with the
case that I mention above, as I understand it.

+		{"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch referenced blocks during recovery"),
+			gettext_noop("Read ahead of the currenty replay position to find uncached blocks.")

extra 'y' at the end of 'current', and "find uncached blocks" might be
misleading, maybe:

"Read out ahead of the current replay position and prefetch blocks."

diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b7fb2ec1fe..4288f2f37f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -234,6 +234,12 @@
#checkpoint_flush_after = 0		# measured in pages, 0 disables
#checkpoint_warning = 30s		# 0 disables
+# - Prefetching during recovery -
+
+#wal_decode_buffer_size = 512kB		# lookahead window used for prefetching
+#recovery_prefetch = on			# whether to prefetch pages logged with FPW
+#recovery_prefetch_fpw = off		# whether to prefetch pages logged with FPW

Think this was already mentioned, but the above comments shouldn't be
the same. :)

From 2f6d690cefc0cad8cbd8b88dbed4d688399c6916 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 14 Sep 2020 23:20:55 +1200
Subject: [PATCH v15 5/6] WIP: Avoid extra buffer lookup when prefetching WAL
blocks.

Provide a some workspace in decoded WAL records, so that we can remember
which buffer recently contained we found a block cached in, for later
use when replaying the record. Provide a new way to look up a
recently-known buffer and check if it's still valid and has the right
tag.

"Provide a place in decoded WAL records to remember which buffer we
found a block cached in, to hopefully avoid having to look it up again
when we replay the record. Provide a way to look up a recently-known
buffer and check if it's still valid and has the right tag."

XXX Needs review to figure out if it's safe or steamrolling over subtleties

... that's a great question. :) Not sure that I can really answer it
conclusively, but I can't think of any reason, given the buffer tag
check that's included, that it would be an issue. I'm glad to see this
though since it addresses some of the concern about this patch slowing
down replay in cases where there are FPIs and checkpoints are less than
the size of shared buffers, which seems much more common than cases
where FPIs have been disabled and/or checkpoints are larger than SB.
Further effort to avoid having likely-unnecessary prefetching done for
blocks which recently had an FPI would further reduce the risk of this
change slowing down replay for common deployments, though I'm not sure
how much of an impact that likely has or what the cost would be to avoid
the prefetching (and it's complicated by hot standby, I imagine...).

Thanks,

Stephen

#71Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Stephen Frost (#70)
Re: WIP: WAL prefetch (another approach)

On 2/10/21 10:50 PM, Stephen Frost wrote:

...

+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+	DecodedXLogRecord *record = prefetcher->record;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	/*
+	 * We might already have been partway through processing this record when
+	 * our queue became saturated, so we need to start where we left off.
+	 */
+	for (int block_id = prefetcher->next_block_id;
+		 block_id <= record->max_block_id;
+		 ++block_id)
+	{
+		DecodedBkpBlock *block = &record->blocks[block_id];
+		PrefetchBufferResult prefetch;
+		SMgrRelation reln;
+
+		/* Ignore everything but the main fork for now. */
+		if (block->forknum != MAIN_FORKNUM)
+			continue;
+
+		/*
+		 * If there is a full page image attached, we won't be reading the
+		 * page, so you might think we should skip it.  However, if the
+		 * underlying filesystem uses larger logical blocks than us, it
+		 * might still need to perform a read-before-write some time later.
+		 * Therefore, only prefetch if configured to do so.
+		 */
+		if (block->has_image && !recovery_prefetch_fpw)
+		{
+			pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
+			continue;
+		}

FPIs in the stream aren't going to just avoid reads when the
filesystem's block size matches PG's- they're also going to avoid
subsequent modifications to the block, provided we don't end up pushing
that block out of shared buffers, right?

That is, if you have an empty shared buffers and see:

Block 5 FPI
Block 6 FPI
Block 5 Update
Block 6 Update

it seems like, with this patch, we're going to Prefetch Block 5 & 6,
even though we almost certainly won't actually need them.

Yeah, that's a good point. I think it'd make sense to keep track of
recent FPIs and skip prefetching such blocks. But how exactly should we
implement that, how many blocks do we need to track? If you get an FPI,
how long should we skip prefetching of that block?

I don't think the history needs to be very long, for two reasons.
Firstly, the usual pattern is that we have FPI + several changes for
that block shortly after it. Secondly, maintenance_io_concurrency limits
this naturally - after crossing that, redo should place the FPI into
shared buffers, allowing us to skip the prefetch.

So I think using maintenance_io_concurrency is sufficient. We might
track more buffers to allow skipping prefetches of blocks that were
evicted from shared buffers, but that seems like an overkill.

However, maintenance_io_concurrency can be quite high, so just a simple
queue is not very suitable - searching it linearly for each block would
be too expensive. But I think we can use a simple hash table, tracking
(relfilenode, block, LSN), over-sized to minimize collisions.

Imagine it's a simple array with (2 * maintenance_io_concurrency)
elements, and whenever we prefetch a block or find an FPI, we simply add
the block to the array as determined by hash(relfilenode, block)

hashtable[hash(...)] = {relfilenode, block, LSN}

and then when deciding whether to prefetch a block, we look at that one
position. If the (relfilenode, block) match, we check the LSN and skip
the prefetch if it's sufficiently recent. Otherwise we prefetch.

We may issue some extra prefetches due to collisions, but that's fine I
think. There should not be very many of them, thanks to having the hash
table oversized.

The good thing is that this is a simple, fixed-size data structure;
there's no need for allocations etc.
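
To make that concrete, here is a rough sketch of the idea (hypothetical names,
plain C stand-ins for RelFileNode/BlockNumber/XLogRecPtr, and a fixed size
where the real table would use 2 * maintenance_io_concurrency):

#include <stdbool.h>
#include <stdint.h>

#define RECENT_BLOCK_SLOTS 64		/* really 2 * maintenance_io_concurrency */

typedef struct RecentBlock
{
	uint32_t	relnode;			/* stand-in for RelFileNode */
	uint32_t	blkno;				/* stand-in for BlockNumber */
	uint64_t	lsn;				/* stand-in for XLogRecPtr */
} RecentBlock;

static RecentBlock recent_blocks[RECENT_BLOCK_SLOTS];

static uint32_t
recent_block_slot(uint32_t relnode, uint32_t blkno)
{
	/* any cheap hash will do; collisions only cause extra prefetches */
	return (relnode * 2654435761u ^ blkno) % RECENT_BLOCK_SLOTS;
}

/* remember a block we just prefetched, or for which we saw an FPI */
static void
recent_block_remember(uint32_t relnode, uint32_t blkno, uint64_t lsn)
{
	RecentBlock *entry = &recent_blocks[recent_block_slot(relnode, blkno)];

	entry->relnode = relnode;
	entry->blkno = blkno;
	entry->lsn = lsn;
}

/* should we skip prefetching this block, because we saw it recently? */
static bool
recent_block_seen_since(uint32_t relnode, uint32_t blkno, uint64_t min_lsn)
{
	RecentBlock *entry = &recent_blocks[recent_block_slot(relnode, blkno)];

	return entry->relnode == relnode &&
		entry->blkno == blkno &&
		entry->lsn >= min_lsn;
}

On each decoded block reference the prefetcher would consult
recent_block_seen_since() before issuing the prefetch, and call
recent_block_remember() whenever it prefetches a block or skips one
because of an FPI.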

+		/* Fast path for repeated references to the same relation. */
+		if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+		{
+			/*
+			 * If this is a repeat access to the same block, then skip it.
+			 *
+			 * XXX We could also check for last_blkno + 1 too, and also update
+			 * last_blkno; it's not clear if the kernel would do a better job
+			 * of sequential prefetching.
+			 */
+			if (block->blkno == prefetcher->last_blkno)
+			{
+				pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
+				continue;
+			}

I'm sure this will help with some cases, but it wouldn't help with the
case that I mention above, as I understand it.

It won't, but it's still a pretty effective check. I've done some
experiments recently, and with random pgbench this eliminates ~15% of
prefetches.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#72Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#71)
Re: WIP: WAL prefetch (another approach)

Hi,

On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:

Yeah, that's a good point. I think it'd make sense to keep track of recent
FPIs and skip prefetching such blocks. But how exactly should we implement
that, how many blocks do we need to track? If you get an FPI, how long
should we skip prefetching of that block?

I don't think the history needs to be very long, for two reasons. Firstly,
the usual pattern is that we have FPI + several changes for that block
shortly after it. Secondly, maintenance_io_concurrency limits this naturally
- after crossing that, redo should place the FPI into shared buffers,
allowing us to skip the prefetch.

So I think using maintenance_io_concurrency is sufficient. We might track
more buffers to allow skipping prefetches of blocks that were evicted from
shared buffers, but that seems like an overkill.

However, maintenance_io_concurrency can be quite high, so just a simple
queue is not very suitable - searching it linearly for each block would be
too expensive. But I think we can use a simple hash table, tracking
(relfilenode, block, LSN), over-sized to minimize collisions.

Imagine it's a simple array with (2 * maintenance_io_concurrency) elements,
and whenever we prefetch a block or find an FPI, we simply add the block to
the array as determined by hash(relfilenode, block)

hashtable[hash(...)] = {relfilenode, block, LSN}

and then when deciding whether to prefetch a block, we look at that one
position. If the (relfilenode, block) match, we check the LSN and skip the
prefetch if it's sufficiently recent. Otherwise we prefetch.

I'm a bit doubtful this is really needed at this point. Yes, the
prefetching will do a buffer table lookup - but it's a lookup that
already happens today. And the patch already avoids doing a second
lookup after prefetching (by optimistically caching the last Buffer id,
and re-checking).

I think there's potential for some significant optimization going
forward, but I think it's basically optimization over what we're doing
today. As this is already a nontrivial patch, I'd argue for doing so
separately.

Regards,

Andres

#73Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andres Freund (#72)
Re: WIP: WAL prefetch (another approach)

On 2/12/21 5:46 AM, Andres Freund wrote:

Hi,

On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:

Yeah, that's a good point. I think it'd make sense to keep track of recent
FPIs and skip prefetching such blocks. But how exactly should we implement
that, how many blocks do we need to track? If you get an FPI, how long
should we skip prefetching of that block?

I don't think the history needs to be very long, for two reasons. Firstly,
the usual pattern is that we have FPI + several changes for that block
shortly after it. Secondly, maintenance_io_concurrency limits this naturally
- after crossing that, redo should place the FPI into shared buffers,
allowing us to skip the prefetch.

So I think using maintenance_io_concurrency is sufficient. We might track
more buffers to allow skipping prefetches of blocks that were evicted from
shared buffers, but that seems like an overkill.

However, maintenance_io_concurrency can be quite high, so just a simple
queue is not very suitable - searching it linearly for each block would be
too expensive. But I think we can use a simple hash table, tracking
(relfilenode, block, LSN), over-sized to minimize collisions.

Imagine it's a simple array with (2 * maintenance_io_concurrency) elements,
and whenever we prefetch a block or find an FPI, we simply add the block to
the array as determined by hash(relfilenode, block)

hashtable[hash(...)] = {relfilenode, block, LSN}

and then when deciding whether to prefetch a block, we look at that one
position. If the (relfilenode, block) match, we check the LSN and skip the
prefetch if it's sufficiently recent. Otherwise we prefetch.

I'm a bit doubtful this is really needed at this point. Yes, the
prefetching will do a buffer table lookup - but it's a lookup that
already happens today. And the patch already avoids doing a second
lookup after prefetching (by optimistically caching the last Buffer id,
and re-checking).

I think there's potential for some significant optimization going
forward, but I think it's basically optimization over what we're doing
today. As this is already a nontrivial patch, I'd argue for doing so
separately.

I agree with treating this as an improvement - it's not something that
needs to be solved in the first version. OTOH I think Stephen has a point
that just skipping FPIs like we do now has limited effect, because the
WAL usually contains additional changes to the same block.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#74Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#72)
Re: WIP: WAL prefetch (another approach)

Greetings,

* Andres Freund (andres@anarazel.de) wrote:

On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:

Yeah, that's a good point. I think it'd make sense to keep track of recent
FPIs and skip prefetching such blocks. But how exactly should we implement
that, how many blocks do we need to track? If you get an FPI, how long
should we skip prefetching of that block?

I don't think the history needs to be very long, for two reasons. Firstly,
the usual pattern is that we have FPI + several changes for that block
shortly after it. Secondly, maintenance_io_concurrency limits this naturally
- after crossing that, redo should place the FPI into shared buffers,
allowing us to skip the prefetch.

So I think using maintenance_io_concurrency is sufficient. We might track
more buffers to allow skipping prefetches of blocks that were evicted from
shared buffers, but that seems like an overkill.

However, maintenance_io_concurrency can be quite high, so just a simple
queue is not very suitable - searching it linearly for each block would be
too expensive. But I think we can use a simple hash table, tracking
(relfilenode, block, LSN), over-sized to minimize collisions.

Imagine it's a simple array with (2 * maintenance_io_concurrency) elements,
and whenever we prefetch a block or find an FPI, we simply add the block to
the array as determined by hash(relfilenode, block)

hashtable[hash(...)] = {relfilenode, block, LSN}

and then when deciding whether to prefetch a block, we look at that one
position. If the (relfilenode, block) match, we check the LSN and skip the
prefetch if it's sufficiently recent. Otherwise we prefetch.

I'm a bit doubtful this is really needed at this point. Yes, the
prefetching will do a buffer table lookup - but it's a lookup that
already happens today. And the patch already avoids doing a second
lookup after prefetching (by optimistically caching the last Buffer id,
and re-checking).

I agree that, when a page is looked up and found in the buffer table,
the subsequent caching of the buffer id in the WAL records does a good
job of avoiding having to re-do that lookup. However, that isn't the
case which was being discussed here or what Tomas's suggestion was
intended to address.

What I pointed out up-thread and what's being discussed here is what
happens when the WAL contains a few FPIs and a few regular WAL records
which are mixed up and not in ideal order. When that happens, with this
patch, the FPIs will be ignored, the regular WAL records will reference
blocks which aren't found in shared buffers (yet) and then we'll both
issue pre-fetches for those and end up having spent effort doing a
buffer lookup that we'll later re-do.

To address the unnecessary syscalls we really just need to keep track of
any FPIs that we've seen between the point where the prefetching is
happening and the point where the replay is being done- once replay has
replayed an FPI, our buffer lookup will succeed and we'll cache the
buffer that the FPI's block is in- in other words, only
wal_decode_buffer_size amount of WAL needs to be considered.

We could further leverage this tracking of FPIs to skip the prefetch
syscalls: cache, with the FPI record, which later records in the queue
address the blocks it covers, and then when replay hits the FPI and loads
it into shared_buffers, it could update those other WAL records in the
queue with the buffer id of the page, allowing us to very likely avoid
having to do another lookup later on.

I think there's potential for some significant optimization going
forward, but I think it's basically optimization over what we're doing
today. As this is already a nontrivial patch, I'd argue for doing so
separately.

This seems like a great optimization, albeit a fair bit of code, for a
relatively uncommon use-case, specifically where full page writes are
disabled or checkpoints are very large. As that's the case though, I would
think it's reasonable to ask that it go out of its way to avoid slowing
down the more common configurations, particularly since it's proposed to
have it on by default (which I agree with, provided it ends up improving
the common cases, which I think the suggestions above would certainly
make it more likely to do).

Perhaps this already improves the common cases and is worth the extra
code on that basis, but I don't recall seeing much in the way of
benchmarking in this thread for that case- that is, where FPIs are
enabled and checkpoints are smaller than shared buffers. Jakub's
testing was done with FPWs disabled and Tomas's testing used checkpoints
which were much larger than the size of shared buffers on the system
doing the replay. While it's certainly good that this patch improves
those cases, we should also be looking out for the worst case and make
sure that the patch doesn't degrade performance in that case.

Thanks,

Stephen

#75Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Stephen Frost (#74)
Re: WIP: WAL prefetch (another approach)

On 2/13/21 10:39 PM, Stephen Frost wrote:

Greetings,

* Andres Freund (andres@anarazel.de) wrote:

On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:

Yeah, that's a good point. I think it'd make sense to keep track of recent
FPIs and skip prefetching such blocks. But how exactly should we implement
that, how many blocks do we need to track? If you get an FPI, how long
should we skip prefetching of that block?

I don't think the history needs to be very long, for two reasons. Firstly,
the usual pattern is that we have FPI + several changes for that block
shortly after it. Secondly, maintenance_io_concurrency limits this naturally
- after crossing that, redo should place the FPI into shared buffers,
allowing us to skip the prefetch.

So I think using maintenance_io_concurrency is sufficient. We might track
more buffers to allow skipping prefetches of blocks that were evicted from
shared buffers, but that seems like overkill.

However, maintenance_io_concurrency can be quite high, so just a simple
queue is not very suitable - searching it linearly for each block would be
too expensive. But I think we can use a simple hash table, tracking
(relfilenode, block, LSN), over-sized to minimize collisions.

Imagine it's a simple array with (2 * maintenance_io_concurrency) elements,
and whenever we prefetch a block or find an FPI, we simply add the block to
the array as determined by hash(relfilenode, block)

hashtable[hash(...)] = {relfilenode, block, LSN}

and then when deciding whether to prefetch a block, we look at that one
position. If the (relfilenode, block) match, we check the LSN and skip the
prefetch if it's sufficiently recent. Otherwise we prefetch.

I'm a bit doubtful this is really needed at this point. Yes, the
prefetching will do a buffer table lookup - but it's a lookup that
already happens today. And the patch already avoids doing a second
lookup after prefetching (by optimistically caching the last Buffer id,
and re-checking).

I agree that when a page is looked up, and found, in the buffer table,
the subsequent caching of the buffer id in the WAL records does a
good job of avoiding having to re-do that lookup. However, that isn't
the case which was being discussed here or what Tomas's suggestion was
intended to address.

What I pointed out up-thread and what's being discussed here is what
happens when the WAL contains a few FPIs and a few regular WAL records
which are mixed up and not in ideal order. When that happens, with this
patch, the FPIs will be ignored, the regular WAL records will reference
blocks which aren't found in shared buffers (yet) and then we'll both
issue pre-fetches for those and end up having spent effort doing a
buffer lookup that we'll later re-do.

The question is how common this pattern actually is - I don't know. As
noted, the non-FPI would have to be fairly close to the FPI, i.e. within
the wal_decode_buffer_size, to actually cause measurable harm.

To address the unnecessary syscalls we really just need to keep track of
any FPIs that we've seen between the point where the prefetching
is happening and the point where the replay is being done- once replay
has replayed an FPI, our buffer lookup will succeed and we'll cache the
buffer that the FPI is at- in other words, only wal_decode_buffer_size
amount of WAL needs to be considered.

Yeah, that's essentially what I proposed.

We could further leverage this tracking of FPIs to skip the prefetch
syscalls: cache, with the FPI record, which later records in the queue
address the blocks it covers, and then when replay hits the FPI and loads
it into shared_buffers, it could update those other WAL records in the
queue with the buffer id of the page, allowing us to very likely avoid
having to do another lookup later on.

This seems like over-engineering, at least for v1.

I think there's potential for some significant optimization going
forward, but I think it's basically optimization over what we're doing
today. As this is already a nontrivial patch, I'd argue for doing so
separately.

This seems like a great optimization, albeit a fair bit of code, for a
relatively uncommon use-case, specifically where full page writes are
disabled or checkpoints are very large. As that's the case though, I would
think it's reasonable to ask that it go out of its way to avoid slowing
down the more common configurations, particularly since it's proposed to
have it on by default (which I agree with, provided it ends up improving
the common cases, which I think the suggestions above would certainly
make it more likely to do).

I'm OK to do some benchmarking, but it's not quite clear to me why it
matters if the checkpoints are smaller than shared buffers. IMO what
matters is how "localized" the updates are, i.e. how likely it is to hit
the same page repeatedly (in a short amount of time). Regular pgbench is
not very suitable for that, but some non-uniform distribution should do
the trick, I think.

Perhaps this already improves the common cases and is worth the extra
code on that basis, but I don't recall seeing much in the way of
benchmarking in this thread for that case- that is, where FPIs are
enabled and checkpoints are smaller than shared buffers. Jakub's
testing was done with FPWs disabled and Tomas's testing used checkpoints
which were much larger than the size of shared buffers on the system
doing the replay. While it's certainly good that this patch improves
those cases, we should also be looking out for the worst case and make
sure that the patch doesn't degrade performance in that case.

I'm with Andres on this. It's fine to leave some possible optimizations
on the table for the future. And even if some workloads are affected
negatively, it's still possible to disable the prefetching.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#76Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#75)
Re: WIP: WAL prefetch (another approach)

Greetings,

* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:

On 2/13/21 10:39 PM, Stephen Frost wrote:

* Andres Freund (andres@anarazel.de) wrote:

On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:

Yeah, that's a good point. I think it'd make sense to keep track of recent
FPIs and skip prefetching such blocks. But how exactly should we implement
that, how many blocks do we need to track? If you get an FPI, how long
should we skip prefetching of that block?

I don't think the history needs to be very long, for two reasons. Firstly,
the usual pattern is that we have FPI + several changes for that block
shortly after it. Secondly, maintenance_io_concurrency limits this naturally
- after crossing that, redo should place the FPI into shared buffers,
allowing us to skip the prefetch.

So I think using maintenance_io_concurrency is sufficient. We might track
more buffers to allow skipping prefetches of blocks that were evicted from
shared buffers, but that seems like overkill.

However, maintenance_io_concurrency can be quite high, so just a simple
queue is not very suitable - searching it linearly for each block would be
too expensive. But I think we can use a simple hash table, tracking
(relfilenode, block, LSN), over-sized to minimize collisions.

Imagine it's a simple array with (2 * maintenance_io_concurrency) elements,
and whenever we prefetch a block or find an FPI, we simply add the block to
the array as determined by hash(relfilenode, block)

hashtable[hash(...)] = {relfilenode, block, LSN}

and then when deciding whether to prefetch a block, we look at that one
position. If the (relfilenode, block) match, we check the LSN and skip the
prefetch if it's sufficiently recent. Otherwise we prefetch.

I'm a bit doubtful this is really needed at this point. Yes, the
prefetching will do a buffer table lookup - but it's a lookup that
already happens today. And the patch already avoids doing a second
lookup after prefetching (by optimistically caching the last Buffer id,
and re-checking).

I agree that when a page is looked up, and found, in the buffer table,
the subsequent caching of the buffer id in the WAL records does a
good job of avoiding having to re-do that lookup. However, that isn't
the case which was being discussed here or what Tomas's suggestion was
intended to address.

What I pointed out up-thread and what's being discussed here is what
happens when the WAL contains a few FPIs and a few regular WAL records
which are mixed up and not in ideal order. When that happens, with this
patch, the FPIs will be ignored, the regular WAL records will reference
blocks which aren't found in shared buffers (yet) and then we'll both
issue pre-fetches for those and end up having spent effort doing a
buffer lookup that we'll later re-do.

The question is how common this pattern actually is - I don't know. As
noted, the non-FPI would have to be fairly close to the FPI, i.e. within the
wal_decode_buffer_size, to actually cause measurable harm.

Yeah, so it'll depend on how big wal_decode_buffer_size is. Increasing
that would certainly help to show if there ends up being a degradation
with this patch due to the extra prefetching being done.

To address the unnecessary syscalls we really just need to keep track of
any FPIs that we've seen between the point where the prefetching
is happening and the point where the replay is being done- once replay
has replayed an FPI, our buffer lookup will succeed and we'll cache the
buffer that the FPI is at- in other words, only wal_decode_buffer_size
amount of WAL needs to be considered.

Yeah, that's essentially what I proposed.

Glad I captured it correctly.

We could further leverage this tracking of FPIs to skip the prefetch
syscalls: cache, with the FPI record, which later records in the queue
address the blocks it covers, and then when replay hits the FPI and loads
it into shared_buffers, it could update those other WAL records in the
queue with the buffer id of the page, allowing us to very likely avoid
having to do another lookup later on.

This seems like over-engineering, at least for v1.

Perhaps, though it didn't seem like it'd be very hard to do with the
already proposed changes to stash the buffer id in the WAL records.
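
For what it's worth, a hedged sketch of how such a stashed buffer id would be
consumed at redo time, using the ReadRecentBuffer() function from the patch
set (the recent_buffer field on DecodedBkpBlock and the helper name are
assumptions here, not code from the patch):

#include "postgres.h"
#include "access/xlogreader.h"
#include "storage/bufmgr.h"

static Buffer
redo_read_block(DecodedBkpBlock *blk)
{
	/* Fast path: the buffer seen at prefetch time may still hold the block. */
	if (BufferIsValid(blk->recent_buffer) &&
		ReadRecentBuffer(blk->rnode, blk->forknum, blk->blkno,
						 blk->recent_buffer))
		return blk->recent_buffer;	/* already pinned, no mapping lookup */

	/* Slow path: regular lookup, possibly reading the block from disk. */
	return ReadBufferWithoutRelcache(blk->rnode, blk->forknum, blk->blkno,
									 RBM_NORMAL, NULL);
}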

I think there's potential for some significant optimization going
forward, but I think it's basically optimization over what we're doing
today. As this is already a nontrivial patch, I'd argue for doing so
separately.

This seems like a great optimization, albeit a fair bit of code, for a
relatively uncommon use-case, specifically where full page writes are
disabled or checkpoints are very large. As that's the case though, I would
think it's reasonable to ask that it go out of its way to avoid slowing
down the more common configurations, particularly since it's proposed to
have it on by default (which I agree with, provided it ends up improving
the common cases, which I think the suggestions above would certainly
make it more likely to do).

I'm OK to do some benchmarking, but it's not quite clear to me why it
matters if the checkpoints are smaller than shared buffers. IMO what matters
is how "localized" the updates are, i.e. how likely it is to hit the same
page repeatedly (in a short amount of time). Regular pgbench is not very
suitable for that, but some non-uniform distribution should do the trick, I
think.

I suppose strictly speaking it'd be
Min(wal_decode_buffer_size,checkpoint_size), but yes, you're right that
it's more about the wal_decode_buffer_size than the checkpoint's size.
Apologies for the confusion. As suggested above, one way to benchmark
this to really see if there's any issue would be to increase
wal_decode_buffer_size to some pretty big size and then compare the
performance vs. unpatched. I'd think that could even be done with
pgbench, so you're not having to arrange for the same pages to get
updated over and over.

Perhaps this already improves the common cases and is worth the extra
code on that basis, but I don't recall seeing much in the way of
benchmarking in this thread for that case- that is, where FPIs are
enabled and checkpoints are smaller than shared buffers. Jakub's
testing was done with FPWs disabled and Tomas's testing used checkpoints
which were much larger than the size of shared buffers on the system
doing the replay. While it's certainly good that this patch improves
those cases, we should also be looking out for the worst case and make
sure that the patch doesn't degrade performance in that case.

I'm with Andres on this. It's fine to leave some possible optimizations on
the table for the future. And even if some workloads are affected
negatively, it's still possible to disable the prefetching.

While I'm generally in favor of this argument, that a feature is
particularly important and that it's worth slowing down the common cases
to enable it, I dislike that it's applied inconsistently. I'd certainly
feel better about it if we had actual performance numbers to consider.
I don't doubt the possibility that the extra prefetches just don't
amount to enough to matter but I have a hard time seeing them as not
having some cost and without actually measuring it, it's hard to say
what that cost is.

Without looking farther back than the last record, we could end up
repeatedly asking for the same blocks to be prefetched too-

FPI for block 1
FPI for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2

... etc.

Entirely possible my math is off, but seems like the worst case
situation right now might end up with some 4500 unnecessary prefetch
syscalls even with the proposed default wal_decode_buffer_size of
512k and 56-byte WAL records ((524,288 - 16,384) / 56 / 2 = ~4534).
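
Spelling out the assumptions behind that back-of-envelope number (this merely
restates the arithmetic above; whether two 8kB FPIs and 56-byte records are
the right model is of course part of the question):

#include <stdio.h>

int
main(void)
{
	const int	decode_buffer_size = 512 * 1024;	/* proposed default: 524,288 */
	const int	fpi_bytes = 2 * 8192;				/* the two full-page images */
	const int	record_size = 56;					/* small heap-update record */

	/* divided by two as above, the records being split between the two blocks */
	int			wasted = (decode_buffer_size - fpi_bytes) / record_size / 2;

	printf("~%d potentially unnecessary prefetches per window\n", wasted);	/* ~4534 */
	return 0;
}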

Issuing unnecessary prefetches for blocks we've already sent a prefetch
for is arguably a concern even if FPWs are off but the benefit of doing
the prefetching almost certainly will outweigh that and mean that
finding a way to address it is something we could certainly do later as
a future improvement. I wouldn't have any issue with that. Just
doesn't seem as clear-cut to me when thinking about the FPW-enabled
case. Ultimately, if you, Andres and Munro are all not concerned about
it and no one else speaks up then I'm not going to pitch a fuss over it
being committed, but, as you said above, it seemed like a good point to
raise for everyone to consider.

Thanks,

Stephen

#77Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Stephen Frost (#76)
Re: WIP: WAL prefetch (another approach)

On 2/15/21 12:18 AM, Stephen Frost wrote:

Greetings,

...

I think there's potential for some significant optimization going
forward, but I think it's basically optimization over what we're doing
today. As this is already a nontrivial patch, I'd argue for doing so
separately.

This seems like a great optimization, albeit a fair bit of code, for a
relatively uncommon use-case, specifically where full page writes are
disabled or checkpoints are very large. As that's the case though, I would
think it's reasonable to ask that it go out of its way to avoid slowing
down the more common configurations, particularly since it's proposed to
have it on by default (which I agree with, provided it ends up improving
the common cases, which I think the suggestions above would certainly
make it more likely to do).

I'm OK to do some benchmarking, but it's not quite clear to me why it
matters if the checkpoints are smaller than shared buffers. IMO what matters
is how "localized" the updates are, i.e. how likely it is to hit the same
page repeatedly (in a short amount of time). Regular pgbench is not very
suitable for that, but some non-uniform distribution should do the trick, I
think.

I suppose strictly speaking it'd be
Min(wal_decode_buffer_size,checkpoint_size), but yes, you're right that
it's more about the wal_decode_buffer_size than the checkpoint's size.
Apologies for the confusion. As suggested above, one way to benchmark
this to really see if there's any issue would be to increase
wal_decode_buffer_size to some pretty big size and then compare the
performance vs. unpatched. I'd think that could even be done with
pgbench, so you're not having to arrange for the same pages to get
updated over and over.

What exactly would be the point of such a benchmark? I don't think the
patch does prefetching based on wal_decode_buffer_size - that just says
how far ahead we decode. The prefetch distance is defined by
maintenance_io_concurrency.

But it's not clear to me what exactly the result would say about the
necessity of the optimization at hand (skipping prefetches for blocks
with a recent FPI). If the maintenance_io_concurrency is very high,
the probability that a block is evicted prematurely grows, making the
prefetch useless in general. How does this say anything about the
problem at hand? Sure, we'll do unnecessary I/O, causing issues, but
that's a bit like complaining the engine gets very hot when driving on a
highway in reverse.

AFAICS to measure the worst case, you'd need a workload with a lot of
FPIs, and very little actual I/O. That means a data set that fits into
memory (either shared buffers or RAM), and short checkpoints. But that's
exactly the case where you don't need prefetching ...

Perhaps this already improves the common cases and is worth the extra
code on that basis, but I don't recall seeing much in the way of
benchmarking in this thread for that case- that is, where FPIs are
enabled and checkpoints are smaller than shared buffers. Jakub's
testing was done with FPWs disabled and Tomas's testing used checkpoints
which were much larger than the size of shared buffers on the system
doing the replay. While it's certainly good that this patch improves
those cases, we should also be looking out for the worst case and make
sure that the patch doesn't degrade performance in that case.

I'm with Andres on this. It's fine to leave some possible optimizations on
the table for the future. And even if some workloads are affected
negatively, it's still possible to disable the prefetching.

While I'm generally in favor of this argument, that a feature is
particularly important and that it's worth slowing down the common cases
to enable it, I dislike that it's applied inconsistently. I'd certainly

If you have a workload where this happens to cause issues, you can just
disable that. IMHO that's a perfectly reasonable engineering approach,
where we get something that significantly improves 80% of the cases,
allow disabling it for cases where it might cause issues, and then
improve it in the next version.

feel better about it if we had actual performance numbers to consider.
I don't doubt the possibility that the extra prefetches just don't
amount to enough to matter but I have a hard time seeing them as not
having some cost and without actually measuring it, it's hard to say
what that cost is.

Without looking farther back than the last record, we could end up
repeatedly asking for the same blocks to be prefetched too-

FPI for block 1
FPI for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2

... etc.

Entirely possible my math is off, but seems like the worst case
situation right now might end up with some 4500 unnecessary prefetch
syscalls even with the proposed default wal_decode_buffer_size of
512k and 56-byte WAL records ((524,288 - 16,384) / 56 / 2 = ~4534).

Well, that's a bit of an extreme workload, I guess. If you really have such
long streaks of WAL records touching the same small set of blocks, you
don't need WAL prefetching at all and you can just disable it. Easy.

If you have a workload with a small active set, frequent checkpoints, etc.,
then just don't enable WAL prefetching. What's wrong with that?

Issuing unnecessary prefetches for blocks we've already sent a prefetch
for is arguably a concern even if FPWs are off but the benefit of doing
the prefetching almost certainly will outweigh that and mean that
finding a way to address it is something we could certainly do later as
a future improvement. I wouldn't have any issue with that. Just
doesn't seem as clear-cut to me when thinking about the FPW-enabled
case. Ultimately, if you, Andres and Munro are all not concerned about
it and no one else speaks up then I'm not going to pitch a fuss over it
being committed, but, as you said above, it seemed like a good point to
raise for everyone to consider.

Right, I was just going to point out the FPIs are not necessary - what
matters is the presence of long streaks of WAL records touching the same
set of blocks. But people with workloads where this is common likely
don't need the WAL prefetching at all - the replica can keep up just
fine, because it doesn't need to do much I/O anyway (and if it can't
then prefetching won't help much anyway). So just don't enable the
prefetching, and there'll be no overhead.

If it was up to me, I'd just get the patch committed as is. Delaying the
feature because of concerns that it might have some negative effect in
some cases, when that can be simply mitigated by disabling the feature,
is not really beneficial for our users.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#78Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#77)
Re: WIP: WAL prefetch (another approach)

Greetings,

* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:

Right, I was just going to point out the FPIs are not necessary - what
matters is the presence of long streaks of WAL records touching the same
set of blocks. But people with workloads where this is common likely
don't need the WAL prefetching at all - the replica can keep up just
fine, because it doesn't need to do much I/O anyway (and if it can't
then prefetching won't help much anyway). So just don't enable the
prefetching, and there'll be no overhead.

Isn't this exactly the common case though..? Checkpoints happening
every 5 minutes, the replay of the FPI happens first and then the record
is updated and everything's in SB for the later changes? You mentioned
elsewhere that this would improve 80% of cases but that doesn't seem to
be backed up by anything and certainly doesn't seem likely to be the
case if we're talking about across all PG deployments. I also disagree
that asking the kernel to go do random I/O for us, even as a prefetch,
is entirely free simply because we won't actually need those pages. At
the least, it potentially pushes out pages that we might need shortly
from the filesystem cache, no?

If it was up to me, I'd just get the patch committed as is. Delaying the
feature because of concerns that it might have some negative effect in
some cases, when that can be simply mitigated by disabling the feature,
is not really beneficial for our users.

I don't know that we actually know how many cases it might have a
negative effect on, or how large that negative effect might be- that's
really why we should probably try to actually benchmark it and get real
numbers behind it, particularly since a negative effect with the default
configuration (that is, FPWs enabled) on the more typical platforms (as
in, not ZFS) is more likely to show up in the field than in the cases
where FPWs are disabled and someone's running on ZFS.

Perhaps more to the point, it'd be nice to see how this change actually
improves the cases where PG is running with more-or-less the defaults on
the more commonly deployed filesystems. If it doesn't then maybe it
shouldn't be the default..? Surely the folks running on ZFS and running
with FPWs disabled would be able to manage to enable it if they
wished to and we could avoid entirely the question of if this has a
negative impact on the more common cases.

Guess I'm just not a fan of pushing out a change that will impact
everyone by default, in a possibly negative way (or positive, though
that doesn't seem terribly likely, but who knows), without actually
measuring what that impact will look like in those more common cases.
Showing that it's a great win when you're on ZFS or running with FPWs
disabled is good and the expected best case, but we should be
considering the worst case too when it comes to performance
improvements.

Anyhow, ultimately I don't know that there's much more to discuss on
this thread with regard to this particular topic, at least. As I said
before, if everyone else is on board and not worried about it then so be
it; I feel that at least the concern that I raised has been heard.

Thanks,

Stephen

#79Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Stephen Frost (#78)
Re: WIP: WAL prefetch (another approach)

Hi,

On 3/17/21 10:43 PM, Stephen Frost wrote:

Greetings,

* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:

Right, I was just going to point out the FPIs are not necessary - what
matters is the presence of long streaks of WAL records touching the same
set of blocks. But people with workloads where this is common likely
don't need the WAL prefetching at all - the replica can keep up just
fine, because it doesn't need to do much I/O anyway (and if it can't
then prefetching won't help much anyway). So just don't enable the
prefetching, and there'll be no overhead.

Isn't this exactly the common case though..? Checkpoints happening
every 5 minutes, the replay of the FPI happens first and then the record
is updated and everything's in SB for the later changes?

Well, as I said before, the FPIs are not very significant - you'll have
mostly the same issue with any repeated changes to the same block. It
does not matter much if you do

FPI for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1

or just

WAL record for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1

In both cases some of the prefetches are probably unnecessary. But the
frequency of checkpoints does not really matter, the important bit is
repeated changes to the same block(s).

If you have an active set much larger than RAM, this is quite unlikely. And
we know from the pgbench tests that prefetching has a huge positive
effect in this case.

On smaller active sets, with frequent updates to the same block, we may
issue unnecessary prefetches - that's true. But (a) you have not shown
any numbers suggesting this is actually an issue, and (b) those cases
don't really need prefetching because all the data is already either in
shared buffers or in page cache. So if it happens to be an issue, the
user can simply disable it.

So what exactly would a problematic workload look like?

You mentioned elsewhere that this would improve 80% of cases but that
doesn't seem to be backed up by anything and certainly doesn't seem
likely to be the case if we're talking about across all PG
deployments.

Obviously, the 80% was just a figure of speech, illustrating my belief
that the proposed patch is beneficial for most users who currently have
issues with replication lag. That is based on my experience with support
customers who have such issues - it's almost invariably an OLTP workload
with a large active set, and we know (from the benchmarks) that in these
cases it helps.

Users who don't have issues with replication lag can disable (or not
enable) the prefetching, and won't get any negative effects.

Perhaps there are users with weird workloads that have replication lag
issues but this patch won't help them - bummer, we can't solve
everything in one go. Also, no one actually demonstrated such a workload
in this thread so far.

But as you're suggesting we don't have data to support the claim that
this actually helps many users (with no risk to others), I'd point out
you have not actually provided any numbers showing that it actually is
an issue in practice.

I also disagree that asking the kernel to go do random I/O for us,
even as a prefetch, is entirely free simply because we won't
actually need those pages. At the least, it potentially pushes out
pages that we might need shortly from the filesystem cache, no?

Where exactly did I say it's free? I said that workloads where this
happens a lot most likely don't need the prefetching at all, so it can
be simply disabled, eliminating all negative effects.

Moreover, looking at a limited number of recently prefetched blocks
won't eliminate this problem anyway - imagine a random OLTP workload on
a large data set that nevertheless fits into RAM. After a while no read
I/O needs to be done, but you'd need a pretty much infinite list of
prefetched blocks
to eliminate that, and with smaller lists you'll still do 99% of the
prefetches.

Just disabling prefetching on such instances seems quite reasonable.

If it was up to me, I'd just get the patch committed as is. Delaying the
feature because of concerns that it might have some negative effect in
some cases, when that can be simply mitigated by disabling the feature,
is not really beneficial for our users.

I don't know that we actually know how many cases it might have a
negative effect on, or how large that negative effect might be- that's
really why we should probably try to actually benchmark it and get real
numbers behind it, particularly since a negative effect with the default
configuration (that is, FPWs enabled) on the more typical platforms (as
in, not ZFS) is more likely to show up in the field than in the cases
where FPWs are disabled and someone's running on ZFS.

Perhaps more to the point, it'd be nice to see how this change actually
improves the cases where PG is running with more-or-less the defaults on
the more commonly deployed filesystems. If it doesn't then maybe it
shouldn't be the default..? Surely the folks running on ZFS and running
with FPWs disabled would be able to manage to enable it if they
wished to and we could avoid entirely the question of if this has a
negative impact on the more common cases.

Guess I'm just not a fan of pushing out a change that will impact
everyone by default, in a possibly negative way (or positive, though
that doesn't seem terribly likely, but who knows), without actually
measuring what that impact will look like in those more common cases.
Showing that it's a great win when you're on ZFS or running with FPWs
disabled is good and the expected best case, but we should be
considering the worst case too when it comes to performance
improvements.

Well, maybe it'll behave differently on systems with ZFS. I don't know,
and I have no such machine to test that at the moment. My argument
however remains the same - if it happens to be a problem, just don't
enable (or disable) the prefetching, and you get the current behavior.

FWIW I'm not sure there was a discussion or argument about what should
be the default setting (enabled or disabled). I'm fine with not enabling
this by default, so that people have to enable it explicitly.

In a way that'd be consistent with effective_io_concurrency being 1 by
default, which almost disables regular prefetching.

Anyhow, ultimately I don't know that there's much more to discuss on
this thread with regard to this particular topic, at least. As I said
before, if everyone else is on board and not worried about it then so be
it; I feel that at least the concern that I raised has been heard.

OK, thanks for the discussions.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#80Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#79)
5 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Thu, Mar 18, 2021 at 12:00 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 3/17/21 10:43 PM, Stephen Frost wrote:

Guess I'm just not a fan of pushing out a change that will impact
everyone by default, in a possibly negative way (or positive, though
that doesn't seem terribly likely, but who knows), without actually
measuring what that impact will look like in those more common cases.
Showing that it's a great win when you're on ZFS or running with FPWs
disabled is good and the expected best case, but we should be
considering the worst case too when it comes to performance
improvements.

Well, maybe it'll behave differently on systems with ZFS. I don't know,
and I have no such machine to test that at the moment. My argument
however remains the same - if it happens to be a problem, just don't
enable (or disable) the prefetching, and you get the current behavior.

I see the road map for this feature being to get it working on every
OS via the AIO patchset, in later work, hopefully not very far in the
future (in the most portable mode, you get I/O worker processes doing
pread() or preadv() calls on behalf of recovery). So I'll be glad to
get this infrastructure in, even though it's maybe only useful for
some people in the first release.

FWIW I'm not sure there was a discussion or argument about what should
be the default setting (enabled or disabled). I'm fine with not enabling
this by default, so that people have to enable it explicitly.

In a way that'd be consistent with effective_io_concurrency being 1 by
default, which almost disables regular prefetching.

Yeah, I'm not sure but I'd be fine with disabling it by default in the
initial release. The current patch set has it enabled, but that's
mostly for testing, it's not an opinion on how it should ship.

I've attached a rebased patch set with a couple of small changes:

1. I abandoned the patch that proposed
pg_atomic_unlocked_add_fetch_u{32,64}() and went for a simple function
local to xlogprefetch.c that just does pg_atomic_write_u64(counter,
pg_atomic_read_u64(counter) + 1) (see the sketch after this list), in
response to complaints from
Andres[1]/messages/by-id/20201230035736.qmyrtrpeewqbidfi@alap3.anarazel.de.

2. I fixed a bug in ReadRecentBuffer(), and moved it into its own
patch for separate review.
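
The helper described in point 1 presumably looks something like this (a
sketch only - the function name and the assertion are guesses, not copied
from the patch):

#include "postgres.h"
#include "miscadmin.h"
#include "port/atomics.h"

/*
 * A plain read-modify-write of an atomic counter.  This is only safe
 * because a single process - the startup process - ever updates these
 * counters; other backends merely need a torn-free 64-bit read.
 */
static inline void
XLogPrefetchIncrement(pg_atomic_uint64 *counter)
{
	Assert(AmStartupProcess() || !IsUnderPostmaster);
	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
}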

I'm now looking at Horiguchi-san and Heikki's patch[2]/messages/by-id/20190418.210257.43726183.horiguchi.kyotaro@lab.ntt.co.jp to remove
XLogReader's callbacks, to try to understand how these two patch sets
are related. I don't really like the way those callbacks work, and
I'm afraid I had to make them more complicated. But I don't yet know
very much about that other patch set. More soon.

[1]: /messages/by-id/20201230035736.qmyrtrpeewqbidfi@alap3.anarazel.de
[2]: /messages/by-id/20190418.210257.43726183.horiguchi.kyotaro@lab.ntt.co.jp

Attachments:

v16-0001-Provide-ReadRecentBuffer-to-re-pin-buffers-by-ID.patch (text/x-patch)
From 7908fc24c5ad8ab21944c725cb4b2c2bdf1eed4b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 14 Sep 2020 23:20:55 +1200
Subject: [PATCH v16 1/5] Provide ReadRecentBuffer() to re-pin buffers by ID.

If you know the buffer ID that recently held a given block you would
like to pin, this function will check if it's still there and pin it if
the tag hasn't changed.  Otherwise, you'll need to use the regular
ReadBuffer() function.  This will be used by later patches to avoid
double lookup in some cases where it's very likely not to have moved.

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 43 +++++++++++++++++++++++++++++
 src/include/storage/bufmgr.h        |  2 ++
 2 files changed, 45 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 852138f9c9..0e5f92d92b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -610,6 +610,49 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 	}
 }
 
+/*
+ * ReadRecentBuffer -- try to refind a buffer that we suspect holds a given
+ *		block
+ *
+ * Return true if the buffer is valid, has the correct tag, and we managed
+ * to pin it.
+ */
+bool
+ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum,
+				 Buffer recent_buffer)
+{
+	BufferDesc *bufHdr;
+	BufferTag	tag;
+
+	Assert(BufferIsValid(recent_buffer));
+
+	/* Look up the header by index, and try to pin if shared. */
+	if (BufferIsLocal(recent_buffer))
+		bufHdr = GetBufferDescriptor(-recent_buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(recent_buffer - 1);
+		ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+		if (!PinBuffer(bufHdr, NULL))
+		{
+			/* Not valid, couldn't pin it. */
+			UnpinBuffer(bufHdr, true);
+
+			return false;
+		}
+	}
+
+	/* Does the tag still match? */
+	INIT_BUFFERTAG(tag, rnode, forkNum, blockNum);
+	if (BUFFERTAGS_EQUAL(tag, bufHdr->tag))
+		return true;
+
+	/* Too late!  Unpin if shared. */
+	if (!BufferIsLocal(recent_buffer))
+		UnpinBuffer(bufHdr, true);
+
+	return false;
+}
 
 /*
  * ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index fb00fda6a7..aa64fb42ec 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -176,6 +176,8 @@ extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_r
 												 BlockNumber blockNum);
 extern PrefetchBufferResult PrefetchBuffer(Relation reln, ForkNumber forkNum,
 										   BlockNumber blockNum);
+extern bool ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum,
+							 BlockNumber blockNum, Buffer recent_buffer);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
 extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BlockNumber blockNum, ReadBufferMode mode,
-- 
2.30.1

v16-0002-Improve-information-about-received-WAL.patch (text/x-patch)
From 9cfb7034319bdf77d5ac48e32e387aeb690aca52 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 10 Aug 2020 17:06:21 +1200
Subject: [PATCH v16 2/5] Improve information about received WAL.

In commit d140f2f3, we cleaned up the distinction between flushed and
written LSN positions.  Go further, and expose the written location in a
way that allows for the associated timeline ID to be read consistently,
and be consistent about "written" and "flushed" here too.  Without that,
it might be difficult to know the path of the file that has been
written, without data races.  Also provide a fast way to read just the
written LSN, for cases where you don't need the timeline.

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/replication/walreceiver.c      | 10 ++++--
 src/backend/replication/walreceiverfuncs.c | 41 +++++++++++++++++-----
 src/include/replication/walreceiver.h      | 30 +++++++++-------
 3 files changed, 56 insertions(+), 25 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8532296f26..72dd96e67b 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -870,6 +870,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 {
 	int			startoff;
 	int			byteswritten;
+	WalRcvData *walrcv = WalRcv;
 
 	while (nbytes > 0)
 	{
@@ -961,7 +962,10 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	}
 
 	/* Update shared-memory status */
-	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+	SpinLockAcquire(&walrcv->mutex);
+	pg_atomic_write_u64(&walrcv->writtenUpto, LogstreamResult.Write);
+	walrcv->writtenTLI = ThisTimeLineID;
+	SpinLockRelease(&walrcv->mutex);
 }
 
 /*
@@ -987,7 +991,7 @@ XLogWalRcvFlush(bool dying)
 		{
 			walrcv->latestChunkStart = walrcv->flushedUpto;
 			walrcv->flushedUpto = LogstreamResult.Flush;
-			walrcv->receivedTLI = ThisTimeLineID;
+			walrcv->flushedTLI = ThisTimeLineID;
 		}
 		SpinLockRelease(&walrcv->mutex);
 
@@ -1325,7 +1329,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	receive_start_lsn = WalRcv->receiveStart;
 	receive_start_tli = WalRcv->receiveStartTLI;
 	flushed_lsn = WalRcv->flushedUpto;
-	received_tli = WalRcv->receivedTLI;
+	received_tli = WalRcv->flushedTLI;
 	last_send_time = WalRcv->lastMsgSendTime;
 	last_receipt_time = WalRcv->lastMsgReceiptTime;
 	latest_end_lsn = WalRcv->latestWalEnd;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index fff6c54c45..e89f80d1c0 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -300,10 +300,12 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 	 * If this is the first startup of walreceiver (on this timeline),
 	 * initialize flushedUpto and latestChunkStart to the starting point.
 	 */
-	if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
+	if (walrcv->receiveStart == 0 || walrcv->flushedTLI != tli)
 	{
+		pg_atomic_write_u64(&walrcv->writtenUpto, recptr);
+		walrcv->writtenTLI = tli;
 		walrcv->flushedUpto = recptr;
-		walrcv->receivedTLI = tli;
+		walrcv->flushedTLI = tli;
 		walrcv->latestChunkStart = recptr;
 	}
 	walrcv->receiveStart = recptr;
@@ -325,10 +327,10 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
  * Optionally, returns the previous chunk start, that is the first byte
  * written in the most recent walreceiver flush cycle.  Callers not
  * interested in that value may pass NULL for latestChunkStart. Same for
- * receiveTLI.
+ * flushedTLI.
  */
 XLogRecPtr
-GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *flushedTLI)
 {
 	WalRcvData *walrcv = WalRcv;
 	XLogRecPtr	recptr;
@@ -337,8 +339,8 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 	recptr = walrcv->flushedUpto;
 	if (latestChunkStart)
 		*latestChunkStart = walrcv->latestChunkStart;
-	if (receiveTLI)
-		*receiveTLI = walrcv->receivedTLI;
+	if (flushedTLI)
+		*flushedTLI = walrcv->flushedTLI;
 	SpinLockRelease(&walrcv->mutex);
 
 	return recptr;
@@ -346,14 +348,35 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
 
 /*
  * Returns the last+1 byte position that walreceiver has written.
- * This returns a recently written value without taking a lock.
+ *
+ * The other arguments are similar to GetWalRcvFlushRecPtr()'s.
  */
 XLogRecPtr
-GetWalRcvWriteRecPtr(void)
+GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *writtenTLI)
 {
 	WalRcvData *walrcv = WalRcv;
+	XLogRecPtr	recptr;
+
+	SpinLockAcquire(&walrcv->mutex);
+	recptr = pg_atomic_read_u64(&walrcv->writtenUpto);
+	if (latestChunkStart)
+		*latestChunkStart = walrcv->latestChunkStart;
+	if (writtenTLI)
+		*writtenTLI = walrcv->writtenTLI;
+	SpinLockRelease(&walrcv->mutex);
 
-	return pg_atomic_read_u64(&walrcv->writtenUpto);
+	return recptr;
+}
+
+/*
+ * For callers that don't need a consistent LSN, TLI pair, and that don't mind
+ * a potentially slightly out of date value in exchange for speed, this
+ * version provides an unlocked view of the latest written location.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtrUnlocked(void)
+{
+	return pg_atomic_read_u64(&WalRcv->writtenUpto);
 }
 
 /*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 4fd7c25ea7..b06fd8165d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -76,14 +76,25 @@ typedef struct
 	TimeLineID	receiveStartTLI;
 
 	/*
-	 * flushedUpto-1 is the last byte position that has already been received,
-	 * and receivedTLI is the timeline it came from.  At the first startup of
+	 * flushedUpto-1 is the last byte position that has already been flushed,
+	 * and flushedTLI is the timeline it came from.  At the first startup of
 	 * walreceiver, these are set to receiveStart and receiveStartTLI. After
 	 * that, walreceiver updates these whenever it flushes the received WAL to
 	 * disk.
 	 */
 	XLogRecPtr	flushedUpto;
-	TimeLineID	receivedTLI;
+	TimeLineID	flushedTLI;
+
+	/*
+	 * writtenUpto-1 is like flushedUpto-1, except that it's updated without
+	 * waiting for the flush, after the data has been written to disk and
+	 * available for reading.  It is an atomic type so that we can read it
+	 * without locks.  We still acquire the spinlock in cases where it is
+	 * written or read along with the TLI, so that they can be accessed
+	 * together consistently.
+	 */
+	pg_atomic_uint64 writtenUpto;
+	TimeLineID	writtenTLI;
 
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
@@ -144,14 +155,6 @@ typedef struct
 
 	slock_t		mutex;			/* locks shared variables shown above */
 
-	/*
-	 * Like flushedUpto, but advanced after writing and before flushing,
-	 * without the need to acquire the spin lock.  Data can be read by another
-	 * process up to this point, but shouldn't be used for data integrity
-	 * purposes.
-	 */
-	pg_atomic_uint64 writtenUpto;
-
 	/*
 	 * force walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.  But we do need atomic fetch and
@@ -460,8 +463,9 @@ extern bool WalRcvRunning(void);
 extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
 								 const char *conninfo, const char *slotname,
 								 bool create_temp_slot);
-extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
-extern XLogRecPtr GetWalRcvWriteRecPtr(void);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *flushedTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *writtenTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtrUnlocked(void);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
-- 
2.30.1

v16-0003-Provide-XLogReadAhead-to-decode-future-WAL-recor.patch (text/x-patch)
From b201abfb8120b105aeb430f164a482d790b8596c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 10 Aug 2020 17:06:21 +1200
Subject: [PATCH v16 3/5] Provide XLogReadAhead() to decode future WAL records.

Teach xlogreader.c to decode its output into a circular buffer, to
support a future prefetching patch.  Provides two new interfaces:

 * XLogReadRecord() works as before, except that it returns a pointer to
   a new decoded record object rather than just the header

 * XLogReadAhead() implements a second cursor that allows you to read
   further ahead, as long as there is enough space in the circular decoding
   buffer

To support existing callers of XLogReadRecord(), the most recently
returned record also becomes the "current" record, for the purpose of
calls to XLogRecGetXXX() macros and functions, so that the multi-record
nature of the WAL decoder is hidden from code paths that don't need to
care about this change.

To support opportunistic readahead, the page-read callback function
gains a "noblock" parameter.  This allows for calls to XLogReadAhead()
to return without waiting if there is currently no data available, in
particular in the case of streaming replication.  For non-blocking
XLogReadAhead() to work, a page-read callback that understands "noblock"
must be supplied.  Existing callbacks that ignore it work as before, as
long as you only use the XLogReadRecord() interface.

The main XLogPageRead() routine used by recovery is extended to respect
noblock mode when the WAL source is a walreceiver.

Very large records that don't fit in the circular buffer are marked as
"oversized" and allocated and freed piecemeal.  The decoding buffer can
be placed in shared memory, for potential future work on parallelizing
recovery.

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/access/transam/generic_xlog.c |   6 +-
 src/backend/access/transam/xlog.c         | 105 +++-
 src/backend/access/transam/xlogreader.c   | 620 +++++++++++++++++-----
 src/backend/access/transam/xlogutils.c    |   5 +-
 src/backend/postmaster/pgstat.c           |   3 +
 src/backend/replication/logical/decode.c  |   2 +-
 src/backend/replication/walsender.c       |   2 +-
 src/bin/pg_rewind/parsexlog.c             |   8 +-
 src/bin/pg_waldump/pg_waldump.c           |  24 +-
 src/include/access/xlogreader.h           | 127 +++--
 src/include/access/xlogutils.h            |   3 +-
 src/include/pgstat.h                      |   1 +
 12 files changed, 699 insertions(+), 207 deletions(-)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 63301a1ab1..0e9bcc7159 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -482,10 +482,10 @@ generic_redo(XLogReaderState *record)
 	uint8		block_id;
 
 	/* Protect limited size of buffers[] array */
-	Assert(record->max_block_id < MAX_GENERIC_XLOG_PAGES);
+	Assert(XLogRecMaxBlockId(record) < MAX_GENERIC_XLOG_PAGES);
 
 	/* Iterate over blocks */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		XLogRedoAction action;
 
@@ -525,7 +525,7 @@ generic_redo(XLogReaderState *record)
 	}
 
 	/* Changes are done: unlock and release all buffers */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (BufferIsValid(buffers[block_id]))
 			UnlockReleaseBuffer(buffers[block_id]);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f4d1ce5dea..c33f7722c9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -213,7 +213,8 @@ static XLogRecPtr LastRec;
 
 /* Local copy of WalRcv->flushedUpto */
 static XLogRecPtr flushedUpto = 0;
-static TimeLineID receiveTLI = 0;
+static XLogRecPtr writtenUpto = 0;
+static TimeLineID writtenTLI = 0;
 
 /*
  * During recovery, lastFullPageWrites keeps track of full_page_writes that
@@ -921,9 +922,11 @@ static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
+						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+						 bool nowait);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-										bool fetching_ckpt, XLogRecPtr tliRecPtr);
+										bool fetching_ckpt, XLogRecPtr tliRecPtr,
+										bool nowait);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
@@ -1427,7 +1430,7 @@ checkXLogConsistency(XLogReaderState *record)
 
 	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		Buffer		buf;
 		Page		page;
@@ -4390,6 +4393,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 		record = XLogReadRecord(xlogreader, &errormsg);
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
+
 		if (record == NULL)
 		{
 			if (readFile >= 0)
@@ -4433,6 +4437,42 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 
 		if (record)
 		{
+			if (readSource == XLOG_FROM_STREAM)
+			{
+				/*
+				 * In streaming mode, we allow ourselves to read records that
+				 * have been written but not yet flushed, for increased
+				 * concurrency.  We still have to wait until the record has
+				 * been flushed before allowing it to be replayed.
+				 *
+				 * XXX This logic preserves the traditional behaviour where we
+				 * didn't replay records until the walreceiver flushed them,
+				 * except that now we read and decode them sooner.  Could it
+				 * be relaxed even more?  Isn't the real data integrity
+				 * requirement for _writeback_ to stall until the WAL is
+				 * durable, not recovery, just as on a primary?
+				 *
+				 * XXX Are there any circumstances in which this should be
+				 * interruptible?
+				 *
+				 * XXX We don't replicate the XLogReceiptTime etc logic from
+				 * WaitForWALToBecomeAvailable() here...  probably need to
+				 * refactor/share code?
+				 */
+				if (EndRecPtr < flushedUpto)
+				{
+					while (EndRecPtr < (flushedUpto = GetWalRcvFlushRecPtr(NULL, NULL)))
+					{
+						(void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
+										 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+										 -1,
+										 WAIT_EVENT_RECOVERY_WAL_FLUSH);
+						CHECK_FOR_INTERRUPTS();
+						ResetLatch(&XLogCtl->recoveryWakeupLatch);
+					}
+				}
+			}
+
 			/* Great, got a record */
 			return record;
 		}
@@ -10315,7 +10355,7 @@ xlog_redo(XLogReaderState *record)
 		 * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info
 		 * code just to distinguish them for statistics purposes.
 		 */
-		for (uint8 block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (uint8 block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			Buffer		buffer;
 
@@ -10450,7 +10490,7 @@ xlog_block_info(StringInfo buf, XLogReaderState *record)
 	int			block_id;
 
 	/* decode block references */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
@@ -12100,7 +12140,7 @@ CancelBackup(void)
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
-			 XLogRecPtr targetRecPtr, char *readBuf)
+			 XLogRecPtr targetRecPtr, char *readBuf, bool nowait)
 {
 	XLogPageReadPrivate *private =
 	(XLogPageReadPrivate *) xlogreader->private_data;
@@ -12112,6 +12152,15 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
 
+	/*
+	 * If streaming and asked not to wait, return as quickly as possible if
+	 * the data we want isn't available immediately.  Use an unlocked read of
+	 * the latest written position.
+	 */
+	if (readSource == XLOG_FROM_STREAM && nowait &&
+		GetWalRcvWriteRecPtrUnlocked() < targetPagePtr + reqLen)
+		return -1;
+
 	/*
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
@@ -12122,6 +12171,9 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		/*
 		 * Request a restartpoint if we've replayed too much xlog since the
 		 * last one.
+		 *
+		 * XXX Why is this here?  Move it to recovery loop, since it's based
+		 * on replay position, not read position?
 		 */
 		if (bgwriterLaunched)
 		{
@@ -12144,12 +12196,13 @@ retry:
 	/* See if we need to retrieve more data */
 	if (readFile < 0 ||
 		(readSource == XLOG_FROM_STREAM &&
-		 flushedUpto < targetPagePtr + reqLen))
+		 writtenUpto < targetPagePtr + reqLen))
 	{
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
 										 private->randAccess,
 										 private->fetching_ckpt,
-										 targetRecPtr))
+										 targetRecPtr,
+										 nowait))
 		{
 			if (readFile >= 0)
 				close(readFile);
@@ -12175,10 +12228,10 @@ retry:
 	 */
 	if (readSource == XLOG_FROM_STREAM)
 	{
-		if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
+		if (((targetPagePtr) / XLOG_BLCKSZ) != (writtenUpto / XLOG_BLCKSZ))
 			readLen = XLOG_BLCKSZ;
 		else
-			readLen = XLogSegmentOffset(flushedUpto, wal_segment_size) -
+			readLen = XLogSegmentOffset(writtenUpto, wal_segment_size) -
 				targetPageOff;
 	}
 	else
@@ -12298,7 +12351,8 @@ next_record_is_invalid:
  */
 static bool
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-							bool fetching_ckpt, XLogRecPtr tliRecPtr)
+							bool fetching_ckpt, XLogRecPtr tliRecPtr,
+							bool nowait)
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
@@ -12401,6 +12455,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * hope...
 					 */
 
+					/* If we were asked not to wait, give up immediately. */
+					if (nowait)
+						return false;
+
 					/*
 					 * We should be able to move to XLOG_FROM_STREAM only in
 					 * standby mode.
@@ -12517,6 +12575,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				if (readFile >= 0)
 					return true;	/* success! */
 
+				/* If we were asked not to wait, give up immediately. */
+				if (nowait)
+					return false;
+
 				/*
 				 * Nope, not found in archive or pg_wal.
 				 */
@@ -12593,7 +12655,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
 											 PrimarySlotName,
 											 wal_receiver_create_temp_slot);
-						flushedUpto = 0;
+						writtenUpto = 0;
 					}
 
 					/*
@@ -12616,15 +12678,16 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * be updated on each cycle. When we are behind,
 					 * XLogReceiptTime will not advance, so the grace time
 					 * allotted to conflicting queries will decrease.
+					 *
 					 */
-					if (RecPtr < flushedUpto)
+					if (RecPtr < writtenUpto)
 						havedata = true;
 					else
 					{
 						XLogRecPtr	latestChunkStart;
 
-						flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
-						if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
+						writtenUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &writtenTLI);
+						if (RecPtr < writtenUpto && writtenTLI == curFileTLI)
 						{
 							havedata = true;
 							if (latestChunkStart <= RecPtr)
@@ -12650,9 +12713,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						if (readFile < 0)
 						{
 							if (!expectedTLEs)
-								expectedTLEs = readTimeLineHistory(receiveTLI);
+								expectedTLEs = readTimeLineHistory(writtenTLI);
 							readFile = XLogFileRead(readSegNo, PANIC,
-													receiveTLI,
+													writtenTLI,
 													XLOG_FROM_STREAM, false);
 							Assert(readFile >= 0);
 						}
@@ -12666,6 +12729,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						break;
 					}
 
+					/* If we were asked not to wait, give up immediately. */
+					if (nowait)
+						return false;
+
 					/*
 					 * Data not here yet. Check for trigger, then wait for
 					 * walreceiver to wake us up when new WAL arrives.
@@ -12702,6 +12769,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * Wait for more WAL to arrive. Time out after 5 seconds
 					 * to react to a trigger file promptly and to check if the
 					 * WAL receiver is still active.
+					 *
+					 * XXX This is signalled on *flush*, not on write.  Oops.
 					 */
 					(void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
 									 WL_LATCH_SET | WL_TIMEOUT |
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 42738eb940..07c05e01a6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -37,7 +37,9 @@ static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
 			pg_attribute_printf(2, 3);
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
-							 int reqLen);
+							 int reqLen, bool nowait);
+size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+static DecodedXLogRecord *XLogReadRecordInternal(XLogReaderState *state, bool force);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
@@ -50,6 +52,8 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 /* size of the buffer allocated for error message. */
 #define MAX_ERRORMSG_LEN 1000
 
+#define DEFAULT_DECODE_BUFFER_SIZE 0x10000
+
 /*
  * Construct a string in state->errormsg_buf explaining what's wrong with
  * the current record being read.
@@ -64,6 +68,8 @@ report_invalid_record(XLogReaderState *state, const char *fmt,...)
 	va_start(args, fmt);
 	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
 	va_end(args);
+
+	state->errormsg_deferred = true;
 }
 
 /*
@@ -86,8 +92,6 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	/* initialize caller-provided support functions */
 	state->routine = *routine;
 
-	state->max_block_id = -1;
-
 	/*
 	 * Permanently allocate readBuf.  We do it this way, rather than just
 	 * making a static array, for two reasons: (1) no need to waste the
@@ -138,18 +142,11 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 void
 XLogReaderFree(XLogReaderState *state)
 {
-	int			block_id;
-
 	if (state->seg.ws_file != -1)
 		state->routine.segment_close(state);
 
-	for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
-	{
-		if (state->blocks[block_id].data)
-			pfree(state->blocks[block_id].data);
-	}
-	if (state->main_data)
-		pfree(state->main_data);
+	if (state->decode_buffer && state->free_decode_buffer)
+		pfree(state->decode_buffer);
 
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
@@ -158,6 +155,22 @@ XLogReaderFree(XLogReaderState *state)
 	pfree(state);
 }
 
+/*
+ * Set the size of the decoding buffer.  A pointer to a caller supplied memory
+ * region may also be passed in, in which case non-oversized records will be
+ * decoded there.
+ */
+void
+XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size)
+{
+	Assert(state->decode_buffer == NULL);
+
+	state->decode_buffer = buffer;
+	state->decode_buffer_size = size;
+	state->decode_buffer_head = buffer;
+	state->decode_buffer_tail = buffer;
+}
+
 /*
  * Allocate readRecordBuf to fit a record of at least the given length.
  * Returns true if successful, false if out of memory.
@@ -245,7 +258,9 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
 
 	/* Begin at the passed-in record pointer. */
 	state->EndRecPtr = RecPtr;
+	state->NextRecPtr = RecPtr;
 	state->ReadRecPtr = InvalidXLogRecPtr;
+	state->DecodeRecPtr = InvalidXLogRecPtr;
 }
 
 /*
@@ -266,6 +281,261 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  */
 XLogRecord *
 XLogReadRecord(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *record;
+
+	/* We can release the most recently returned record. */
+	if (state->record)
+	{
+		/*
+		 * Remove it from the decoded record queue.  It must be the oldest
+		 * item decoded, decode_queue_tail.
+		 */
+		record = state->record;
+		Assert(record == state->decode_queue_tail);
+		state->record = NULL;
+		state->decode_queue_tail = record->next;
+
+		/* It might also be the newest item decoded, decode_queue_head. */
+		if (state->decode_queue_head == record)
+			state->decode_queue_head = NULL;
+
+		/* Release the space. */
+		if (unlikely(record->oversized))
+		{
+			/* It's not in the decode buffer, so free it to release space. */
+			pfree(record);
+		}
+		else
+		{
+			/* It must be the tail record in the decode buffer. */
+			Assert(state->decode_buffer_tail == (char *) record);
+
+			/*
+			 * We need to update tail to point to the next record that is in
+			 * the decode buffer, if any, being careful to skip oversized ones
+			 * (they're not in the decode buffer).
+			 */
+			record = record->next;
+			while (unlikely(record && record->oversized))
+				record = record->next;
+			if (record)
+			{
+				/* Adjust tail to release space. */
+				state->decode_buffer_tail = (char *) record;
+			}
+			else
+			{
+				/* Nothing else in the decode buffer, so just reset it. */
+				state->decode_buffer_tail = state->decode_buffer;
+				state->decode_buffer_head = state->decode_buffer;
+			}
+		}
+	}
+
+	for (;;)
+	{
+		/* We can now return the tail item in the read queue, if there is one. */
+		if (state->decode_queue_tail)
+		{
+			/*
+			 * Is this record at the LSN that the caller expects?  If it
+			 * isn't, this indicates that EndRecPtr has been moved to a new
+			 * position by the caller, so we'd better reset our read queue and
+			 * move to the new location.
+			 */
+
+
+			/*
+			 * Record this as the most recent record returned, so that we'll
+			 * release it next time.  This also exposes it to the
+			 * XLogRecXXX(decoder) macros, which pass in the decode rather
+			 * than the record for historical reasons.
+			 */
+			state->record = state->decode_queue_tail;
+
+			/*
+			 * It should be immediately after the last record returned by
+			 * XLogReadRecord(), or at the position set by XLogBeginRead() if
+			 * XLogReadRecord() hasn't been called yet.  It may be after a
+			 * page header, though.
+			 */
+			Assert(state->record->lsn == state->EndRecPtr ||
+				   (state->EndRecPtr % XLOG_BLCKSZ == 0 &&
+					(state->record->lsn == state->EndRecPtr + SizeOfXLogShortPHD ||
+					 state->record->lsn == state->EndRecPtr + SizeOfXLogLongPHD)));
+
+			/*
+			 * Likewise, set ReadRecPtr and EndRecPtr to correspond to that
+			 * record.
+			 *
+			 * XXX Calling code should perhaps access these through the
+			 * returned decoded record, but for now we'll update them directly
+			 * here, for the benefit of existing code that thinks there's only
+			 * one record in the decoder.
+			 */
+			state->ReadRecPtr = state->record->lsn;
+			state->EndRecPtr = state->record->next_lsn;
+
+			/* XXX can't return pointer to header, will be given back to XLogDecodeRecord()! */
+			*errormsg = NULL;
+			return &state->record->header;
+		}
+		else if (state->errormsg_deferred)
+		{
+			/*
+			 * If we've run out of records, but we have a deferred error, now
+			 * is the time to report it.
+			 */
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			else
+				*errormsg = NULL;
+			state->errormsg_deferred = false;
+
+			/* Report the location of the error. */
+			state->ReadRecPtr = state->DecodeRecPtr;
+			state->EndRecPtr = state->NextRecPtr;
+
+			return NULL;
+		}
+
+		/* We need to get a decoded record into our queue first. */
+		XLogReadRecordInternal(state, true /* wait */ );
+
+		/*
+		 * If that produced neither a queued record nor a queued error, then
+		 * we're at the end (for example, archive recovery with no more files
+		 * available).
+		 */
+		if (state->decode_queue_tail == NULL && !state->errormsg_deferred)
+		{
+			state->EndRecPtr = state->NextRecPtr;
+			*errormsg = NULL;
+			return NULL;
+		}
+	}
+
+	/* unreachable */
+	return NULL;
+}
+
+/*
+ * Try to decode the next available record.  The next record will also be
+ * returned by XLogReadRecord().
+ */
+DecodedXLogRecord *
+XLogReadAhead(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *record = NULL;
+
+	if (!state->errormsg_deferred)
+	{
+		record = XLogReadRecordInternal(state, false);
+		if (state->errormsg_deferred)
+		{
+			/*
+			 * Report the error once, but don't consume it, so that
+			 * XLogReadRecord() can report it too.
+			 */
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			else
+				*errormsg = NULL;
+			return NULL;
+		}
+	}
+	*errormsg = NULL;
+
+	return record;
+}
+
+/*
+ * Allocate space for a decoded record.  The only member of the returned
+ * object that is initialized is the 'oversized' flag, indicating that the
+ * decoded record wouldn't fit in the decode buffer and must eventually be
+ * freed explicitly.
+ *
+ * Return NULL if there is no space in the decode buffer and allow_oversized
+ * is false, or if memory allocation fails for an oversized buffer.
+ */
+static DecodedXLogRecord *
+XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)
+{
+	size_t		required_space = DecodeXLogRecordRequiredSpace(xl_tot_len);
+	DecodedXLogRecord *decoded = NULL;
+
+	/* Allocate a circular decode buffer if we don't have one already. */
+	if (unlikely(state->decode_buffer == NULL))
+	{
+		if (state->decode_buffer_size == 0)
+			state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+		state->decode_buffer = palloc(state->decode_buffer_size);
+		state->decode_buffer_head = state->decode_buffer;
+		state->decode_buffer_tail = state->decode_buffer;
+		state->free_decode_buffer = true;
+	}
+	if (state->decode_buffer_head >= state->decode_buffer_tail)
+	{
+		/* Empty, or head is to the right of tail. */
+		if (state->decode_buffer_head + required_space <=
+			state->decode_buffer + state->decode_buffer_size)
+		{
+			/* There is space between head and end. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+		else if (state->decode_buffer + required_space <
+				 state->decode_buffer_tail)
+		{
+			/* There is space between start and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+	else
+	{
+		/* Head is to the left of tail. */
+		if (state->decode_buffer_head + required_space <
+			state->decode_buffer_tail)
+		{
+			/* There is space between head and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+
+	/* Not enough space in the decode buffer.  Are we allowed to allocate? */
+	if (allow_oversized)
+	{
+		decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
+		if (decoded == NULL)
+			return NULL;
+		decoded->oversized = true;
+		return decoded;
+	}
+
+	return decoded;
+}
+
+/*
+ * Try to read and decode the next record and add it to the head of the
+ * decoded record queue.
+ *
+ * If "force" is true, then wait for data to become available, and read a
+ * record even if it doesn't fit in the decode buffer, using overflow storage.
+ *
+ * If "force" is false, then return immediately if we'd have to wait for more
+ * data to become available, or if there isn't enough space in the decode
+ * buffer.
+ *
+ * Return the decoded record, or NULL if there was an error or ... XXX
+ */
+static DecodedXLogRecord *
+XLogReadRecordInternal(XLogReaderState *state, bool force)
 {
 	XLogRecPtr	RecPtr;
 	XLogRecord *record;
@@ -277,6 +547,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	uint32		pageHeaderSize;
 	bool		gotheader;
 	int			readOff;
+	DecodedXLogRecord *decoded;
+	char	   *errormsg; /* not used */
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -286,19 +558,17 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	randAccess = false;
 
 	/* reset error state */
-	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	decoded = NULL;
 
-	ResetDecoder(state);
-
-	RecPtr = state->EndRecPtr;
+	RecPtr = state->NextRecPtr;
 
-	if (state->ReadRecPtr != InvalidXLogRecPtr)
+	if (state->DecodeRecPtr != InvalidXLogRecPtr)
 	{
 		/* read the record after the one we just read */
 
 		/*
-		 * EndRecPtr is pointing to end+1 of the previous WAL record.  If
+		 * NextRecPtr is pointing to end+1 of the previous WAL record.  If
 		 * we're at a page boundary, no more records can fit on the current
 		 * page. We must skip over the page header, but we can't do that until
 		 * we've read in the page, since the header size is variable.
@@ -309,7 +579,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		/*
 		 * Caller supplied a position to start at.
 		 *
-		 * In this case, EndRecPtr should already be pointing to a valid
+		 * In this case, NextRecPtr should already be pointing to a valid
 		 * record starting position.
 		 */
 		Assert(XRecOffIsValid(RecPtr));
@@ -327,7 +597,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	 * fits on the same page.
 	 */
 	readOff = ReadPageInternal(state, targetPagePtr,
-							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
+							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ),
+							   !force);
 	if (readOff < 0)
 		goto err;
 
@@ -374,6 +645,19 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
 	total_len = record->xl_tot_len;
 
+	/* Find space to decode this record. */
+	decoded = XLogReadRecordAlloc(state, total_len, force);
+	if (decoded == NULL)
+	{
+		/*
+		 * We couldn't get space.  Usually this means that the decode buffer
+		 * was full, while trying to read ahead (that is, !force).  It's also
+		 * remotely possible for palloc() to have failed to allocate memory
+		 * for an oversized record.
+		 */
+		goto err;
+	}
+
 	/*
 	 * If the whole record header is on this page, validate it immediately.
 	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
@@ -384,7 +668,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	 */
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
-		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
+		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
 								   randAccess))
 			goto err;
 		gotheader = true;
@@ -438,7 +722,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			/* Wait for the next page to become available */
 			readOff = ReadPageInternal(state, targetPagePtr,
 									   Min(total_len - gotlen + SizeOfXLogShortPHD,
-										   XLOG_BLCKSZ));
+										   XLOG_BLCKSZ),
+									   !force);
 
 			if (readOff < 0)
 				goto err;
@@ -475,7 +760,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
 			if (readOff < pageHeaderSize)
 				readOff = ReadPageInternal(state, targetPagePtr,
-										   pageHeaderSize);
+										   pageHeaderSize, !force);
 
 			Assert(pageHeaderSize <= readOff);
 
@@ -486,7 +771,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
 			if (readOff < pageHeaderSize + len)
 				readOff = ReadPageInternal(state, targetPagePtr,
-										   pageHeaderSize + len);
+										   pageHeaderSize + len,
+										   !force);
 
 			memcpy(buffer, (char *) contdata, len);
 			buffer += len;
@@ -496,7 +782,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			if (!gotheader)
 			{
 				record = (XLogRecord *) state->readRecordBuf;
-				if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
+				if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr,
 										   record, randAccess))
 					goto err;
 				gotheader = true;
@@ -510,15 +796,16 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			goto err;
 
 		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
-		state->ReadRecPtr = RecPtr;
-		state->EndRecPtr = targetPagePtr + pageHeaderSize
+		state->DecodeRecPtr = RecPtr;
+		state->NextRecPtr = targetPagePtr + pageHeaderSize
 			+ MAXALIGN(pageHeader->xlp_rem_len);
 	}
 	else
 	{
 		/* Wait for the record data to become available */
 		readOff = ReadPageInternal(state, targetPagePtr,
-								   Min(targetRecOff + total_len, XLOG_BLCKSZ));
+								   Min(targetRecOff + total_len, XLOG_BLCKSZ),
+								   !force);
 		if (readOff < 0)
 			goto err;
 
@@ -526,9 +813,9 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		if (!ValidXLogRecord(state, record, RecPtr))
 			goto err;
 
-		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+		state->NextRecPtr = RecPtr + MAXALIGN(total_len);
 
-		state->ReadRecPtr = RecPtr;
+		state->DecodeRecPtr = RecPtr;
 	}
 
 	/*
@@ -538,25 +825,55 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->segcxt.ws_segsize - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->segcxt.ws_segsize);
+		state->NextRecPtr += state->segcxt.ws_segsize - 1;
+		state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
 	}
 
-	if (DecodeXLogRecord(state, record, errormsg))
-		return record;
-	else
-		return NULL;
+	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
+	{
+		/* Record the location of the next record. */
+		decoded->next_lsn = state->NextRecPtr;
+
+		/*
+		 * If it's in the decode buffer, mark the decode buffer space as
+		 * occupied.
+		 */
+		if (!decoded->oversized)
+		{
+			/* The new decode buffer head must be MAXALIGNed. */
+			Assert(decoded->size == MAXALIGN(decoded->size));
+			if ((char *) decoded == state->decode_buffer)
+				state->decode_buffer_head = state->decode_buffer + decoded->size;
+			else
+				state->decode_buffer_head += decoded->size;
+		}
+
+		/* Insert it into the queue of decoded records. */
+		Assert(state->decode_queue_head != decoded);
+		if (state->decode_queue_head)
+			state->decode_queue_head->next = decoded;
+		state->decode_queue_head = decoded;
+		if (!state->decode_queue_tail)
+			state->decode_queue_tail = decoded;
+		return decoded;
+	}
 
 err:
+	if (decoded && decoded->oversized)
+		pfree(decoded);
 
 	/*
-	 * Invalidate the read state. We might read from a different source after
-	 * failure.
+	 * Invalidate the read state, if this was an error. We might read from a
+	 * different source after failure.
 	 */
-	XLogReaderInvalReadState(state);
+	if (readOff < 0 || state->errormsg_buf[0] != '\0')
+		XLogReaderInvalReadState(state);
 
-	if (state->errormsg_buf[0] != '\0')
-		*errormsg = state->errormsg_buf;
+	/*
+	 * If an error was written to errormsg_buf, it'll be returned to the
+	 * caller of XLogReadRecord() after all successfully decoded records
+	 * from the read queue have been consumed.
+	 */
 
 	return NULL;
 }
@@ -572,7 +889,8 @@ err:
  * data and if there hasn't been any error since caching the data.
  */
 static int
-ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
+ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen,
+				 bool nowait)
 {
 	int			readLen;
 	uint32		targetPageOff;
@@ -607,7 +925,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 
 		readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
 										   state->currRecPtr,
-										   state->readBuf);
+										   state->readBuf,
+										   nowait);
 		if (readLen < 0)
 			goto err;
 
@@ -625,7 +944,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 */
 	readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 									   state->currRecPtr,
-									   state->readBuf);
+									   state->readBuf,
+									   nowait);
 	if (readLen < 0)
 		goto err;
 
@@ -644,7 +964,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	{
 		readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
 										   state->currRecPtr,
-										   state->readBuf);
+										   state->readBuf,
+										   nowait);
 		if (readLen < 0)
 			goto err;
 	}
@@ -663,7 +984,11 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
-	XLogReaderInvalReadState(state);
+	if (state->errormsg_buf[0] != '\0')
+	{
+		state->errormsg_deferred = true;
+		XLogReaderInvalReadState(state);
+	}
 	return -1;
 }
 
@@ -970,7 +1295,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		targetPagePtr = tmpRecPtr - targetRecOff;
 
 		/* Read the page containing the record */
-		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff);
+		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff, false);
 		if (readLen < 0)
 			goto err;
 
@@ -979,7 +1304,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		pageHeaderSize = XLogPageHeaderSize(header);
 
 		/* make sure we have enough data for the page header */
-		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize);
+		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize, false);
 		if (readLen < 0)
 			goto err;
 
@@ -1143,34 +1468,83 @@ WALRead(XLogReaderState *state,
  * ----------------------------------------
  */
 
-/* private function to reset the state between records */
+/*
+ * Private function to reset the state, forgetting all decoded records, if we
+ * are asked to move to a new read position.
+ */
 static void
 ResetDecoder(XLogReaderState *state)
 {
-	int			block_id;
+	DecodedXLogRecord *r;
 
-	state->decoded_record = NULL;
-
-	state->main_data_len = 0;
-
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	/* Reset the decoded record queue, freeing any oversized records. */
+	while ((r = state->decode_queue_tail))
 	{
-		state->blocks[block_id].in_use = false;
-		state->blocks[block_id].has_image = false;
-		state->blocks[block_id].has_data = false;
-		state->blocks[block_id].apply_image = false;
+		state->decode_queue_tail = r->next;
+		if (r->oversized)
+			pfree(r);
 	}
-	state->max_block_id = -1;
+	state->decode_queue_head = NULL;
+	state->decode_queue_tail = NULL;
+	state->record = NULL;
+
+	/* Reset the decode buffer to empty. */
+	state->decode_buffer_head = state->decode_buffer;
+	state->decode_buffer_tail = state->decode_buffer;
+
+	/* Clear error state. */
+	state->errormsg_buf[0] = '\0';
+	state->errormsg_deferred = false;
 }
 
 /*
- * Decode the previously read record.
+ * Compute the maximum possible amount of padding that could be required to
+ * decode a record, given xl_tot_len from the record's header.  This is the
+ * amount of output buffer space that we need to decode a record, though we
+ * might not finish up using it all.
+ *
+ * This computation is pessimistic and assumes the maximum possible number of
+ * blocks, due to lack of better information.
+ */
+size_t
+DecodeXLogRecordRequiredSpace(size_t xl_tot_len)
+{
+	size_t size = 0;
+
+	/* Account for the fixed size part of the decoded record struct. */
+	size += offsetof(DecodedXLogRecord, blocks[0]);
+	/* Account for the flexible blocks array of maximum possible size. */
+	size += sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1);
+	/* Account for all the raw main and block data. */
+	size += xl_tot_len;
+	/* We might insert padding before main_data. */
+	size += (MAXIMUM_ALIGNOF - 1);
+	/* We might insert padding before each block's data. */
+	size += (MAXIMUM_ALIGNOF - 1) * (XLR_MAX_BLOCK_ID + 1);
+	/* We might insert padding at the end. */
+	size += (MAXIMUM_ALIGNOF - 1);
+
+	return size;
+}
+
+/*
+ * Decode a record.  "decoded" must point to a MAXALIGNed memory area that has
+ * space for at least DecodeXLogRecordRequiredSpace(record) bytes.  On
+ * success, decoded->size contains the actual space occupied by the decoded
+ * record, which may turn out to be less.
+ *
+ * Only the decoded->oversized member needs to be initialized already; it will
+ * not be modified.  Other members will be initialized as required.
  *
  * On error, a human-readable error message is returned in *errormsg, and
  * the return value is false.
  */
 bool
-DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
+DecodeXLogRecord(XLogReaderState *state,
+				 DecodedXLogRecord *decoded,
+				 XLogRecord *record,
+				 XLogRecPtr lsn,
+				 char **errormsg)
 {
 	/*
 	 * read next _size bytes from record buffer, but check for overrun first.
@@ -1185,17 +1559,20 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	} while(0)
 
 	char	   *ptr;
+	char	   *out;
 	uint32		remaining;
 	uint32		datatotal;
 	RelFileNode *rnode = NULL;
 	uint8		block_id;
 
-	ResetDecoder(state);
-
-	state->decoded_record = record;
-	state->record_origin = InvalidRepOriginId;
-	state->toplevel_xid = InvalidTransactionId;
-
+	decoded->header = *record;
+	decoded->lsn = lsn;
+	decoded->next = NULL;
+	decoded->record_origin = InvalidRepOriginId;
+	decoded->toplevel_xid = InvalidTransactionId;
+	decoded->main_data = NULL;
+	decoded->main_data_len = 0;
+	decoded->max_block_id = -1;
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
 	remaining = record->xl_tot_len - SizeOfXLogRecord;
@@ -1213,7 +1590,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint8));
 
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
@@ -1224,18 +1601,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			uint32		main_data_len;
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint32));
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
 		}
 		else if (block_id == XLR_BLOCK_ID_ORIGIN)
 		{
-			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
+			COPY_HEADER_FIELD(&decoded->record_origin, sizeof(RepOriginId));
 		}
 		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
 		{
-			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+			COPY_HEADER_FIELD(&decoded->toplevel_xid, sizeof(TransactionId));
 		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
@@ -1243,7 +1620,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			DecodedBkpBlock *blk;
 			uint8		fork_flags;
 
-			if (block_id <= state->max_block_id)
+			/* mark any intervening block IDs as not in use */
+			for (int i = decoded->max_block_id + 1; i < block_id; ++i)
+				decoded->blocks[i].in_use = false;
+
+			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
 									  "out-of-order block_id %u at %X/%X",
@@ -1251,9 +1632,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
 				goto err;
 			}
-			state->max_block_id = block_id;
+			decoded->max_block_id = block_id;
 
-			blk = &state->blocks[block_id];
+			blk = &decoded->blocks[block_id];
 			blk->in_use = true;
 			blk->apply_image = false;
 
@@ -1397,17 +1778,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	/*
 	 * Ok, we've parsed the fragment headers, and verified that the total
 	 * length of the payload in the fragments is equal to the amount of data
-	 * left. Copy the data of each fragment to a separate buffer.
-	 *
-	 * We could just set up pointers into readRecordBuf, but we want to align
-	 * the data for the convenience of the callers. Backup images are not
-	 * copied, however; they don't need alignment.
+	 * left.  Copy the data of each fragment to contiguous space after the
+	 * blocks array, inserting alignment padding before the data fragments so
+	 * they can be cast to struct pointers by REDO routines.
 	 */
+	out = ((char *) decoded) +
+		offsetof(DecodedXLogRecord, blocks) +
+		sizeof(decoded->blocks[0]) * (decoded->max_block_id + 1);
 
 	/* block data first */
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	for (block_id = 0; block_id <= decoded->max_block_id; block_id++)
 	{
-		DecodedBkpBlock *blk = &state->blocks[block_id];
+		DecodedBkpBlock *blk = &decoded->blocks[block_id];
 
 		if (!blk->in_use)
 			continue;
@@ -1416,58 +1798,37 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (blk->has_image)
 		{
-			blk->bkp_image = ptr;
+			/* no need to align image */
+			blk->bkp_image = out;
+			memcpy(out, ptr, blk->bimg_len);
 			ptr += blk->bimg_len;
+			out += blk->bimg_len;
 		}
 		if (blk->has_data)
 		{
-			if (!blk->data || blk->data_len > blk->data_bufsz)
-			{
-				if (blk->data)
-					pfree(blk->data);
-
-				/*
-				 * Force the initial request to be BLCKSZ so that we don't
-				 * waste time with lots of trips through this stanza as a
-				 * result of WAL compression.
-				 */
-				blk->data_bufsz = MAXALIGN(Max(blk->data_len, BLCKSZ));
-				blk->data = palloc(blk->data_bufsz);
-			}
+			out = (char *) MAXALIGN(out);
+			blk->data = out;
 			memcpy(blk->data, ptr, blk->data_len);
 			ptr += blk->data_len;
+			out += blk->data_len;
 		}
 	}
 
 	/* and finally, the main data */
-	if (state->main_data_len > 0)
+	if (decoded->main_data_len > 0)
 	{
-		if (!state->main_data || state->main_data_len > state->main_data_bufsz)
-		{
-			if (state->main_data)
-				pfree(state->main_data);
-
-			/*
-			 * main_data_bufsz must be MAXALIGN'ed.  In many xlog record
-			 * types, we omit trailing struct padding on-disk to save a few
-			 * bytes; but compilers may generate accesses to the xlog struct
-			 * that assume that padding bytes are present.  If the palloc
-			 * request is not large enough to include such padding bytes then
-			 * we'll get valgrind complaints due to otherwise-harmless fetches
-			 * of the padding bytes.
-			 *
-			 * In addition, force the initial request to be reasonably large
-			 * so that we don't waste time with lots of trips through this
-			 * stanza.  BLCKSZ / 2 seems like a good compromise choice.
-			 */
-			state->main_data_bufsz = MAXALIGN(Max(state->main_data_len,
-												  BLCKSZ / 2));
-			state->main_data = palloc(state->main_data_bufsz);
-		}
-		memcpy(state->main_data, ptr, state->main_data_len);
-		ptr += state->main_data_len;
+		out = (char *) MAXALIGN(out);
+		decoded->main_data = out;
+		memcpy(decoded->main_data, ptr, decoded->main_data_len);
+		ptr += decoded->main_data_len;
+		out += decoded->main_data_len;
 	}
 
+	/* Report the actual size we used. */
+	decoded->size = MAXALIGN(out - (char *) decoded);
+	Assert(DecodeXLogRecordRequiredSpace(record->xl_tot_len) >=
+		   decoded->size);
+
 	return true;
 
 shortdata_err:
@@ -1493,10 +1854,11 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	if (rnode)
 		*rnode = bkpb->rnode;
 	if (forknum)
@@ -1516,10 +1878,11 @@ XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return NULL;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 
 	if (!bkpb->has_data)
 	{
@@ -1547,12 +1910,13 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	char	   *ptr;
 	PGAlignedBlock tmp;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
-	if (!record->blocks[block_id].has_image)
+	if (!record->record->blocks[block_id].has_image)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	ptr = bkpb->bkp_image;
 
 	if (bkpb->bimg_info & BKPIMAGE_IS_COMPRESSED)
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index d17d660f46..5cd1c8ab1b 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -350,7 +350,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * going to initialize it. And vice versa.
 	 */
 	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
-	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
+	willinit = (record->record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
 	if (!willinit && zeromode)
@@ -826,7 +826,8 @@ wal_segment_close(XLogReaderState *state)
  */
 int
 read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
+					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
+					 bool nowait)
 {
 	XLogRecPtr	read_upto,
 				loc;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b1e2d94951..732e75eb39 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4112,6 +4112,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_RECOVERY_PAUSE:
 			event_name = "RecoveryPause";
 			break;
+		case WAIT_EVENT_RECOVERY_WAL_FLUSH:
+			event_name = "RecoveryWalFlush";
+			break;
 		case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
 			event_name = "ReplicationOriginDrop";
 			break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f596135b1..e52fc50433 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -122,7 +122,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 	{
 		ReorderBufferAssignChild(ctx->reorder,
 								 txid,
-								 record->decoded_record->xl_xid,
+								 XLogRecGetXid(record),
 								 buf.origptr);
 	}
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 23baa4498a..19de810931 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -808,7 +808,7 @@ StartReplication(StartReplicationCmd *cmd)
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-					   XLogRecPtr targetRecPtr, char *cur_page)
+					   XLogRecPtr targetRecPtr, char *cur_page, bool nowait)
 {
 	XLogRecPtr	flushptr;
 	int			count;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 59ebac7d6a..d0a28f4571 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -49,7 +49,8 @@ typedef struct XLogPageReadPrivate
 
 static int	SimpleXLogPageRead(XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
-							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
+							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+							   bool nowait);
 
 /*
  * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline
@@ -248,7 +249,8 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 /* XLogReader callback function, to read a WAL page */
 static int
 SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
+				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+				   bool nowait)
 {
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 	uint32		targetPageOff;
@@ -432,7 +434,7 @@ extractPageInfo(XLogReaderState *record)
 				 RmgrNames[rmid], info);
 	}
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 610f65e471..869c1e3101 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -333,7 +333,7 @@ WALDumpCloseSegment(XLogReaderState *state)
 /* pg_waldump's XLogReaderRoutine->page_read callback */
 static int
 WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-				XLogRecPtr targetPtr, char *readBuff)
+				XLogRecPtr targetPtr, char *readBuff, bool nowait)
 {
 	XLogDumpPrivate *private = state->private_data;
 	int			count = XLOG_BLCKSZ;
@@ -392,10 +392,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
 	 * add an accessor macro for this.
 	 */
 	*fpi_len = 0;
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (XLogRecHasBlockImage(record, block_id))
-			*fpi_len += record->blocks[block_id].bimg_len;
+			*fpi_len += record->record->blocks[block_id].bimg_len;
 	}
 
 	/*
@@ -484,7 +484,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	if (!config->bkp_details)
 	{
 		/* print block references (short format) */
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -515,7 +515,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	{
 		/* print block references (detailed format) */
 		putchar('\n');
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -528,26 +528,26 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				   blk);
 			if (XLogRecHasBlockImage(record, block_id))
 			{
-				if (record->blocks[block_id].bimg_info &
+				if (record->record->blocks[block_id].bimg_info &
 					BKPIMAGE_IS_COMPRESSED)
 				{
 					printf(" (FPW%s); hole: offset: %u, length: %u, "
 						   "compression saved: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length,
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length,
 						   BLCKSZ -
-						   record->blocks[block_id].hole_length -
-						   record->blocks[block_id].bimg_len);
+						   record->record->blocks[block_id].hole_length -
+						   record->record->blocks[block_id].bimg_len);
 				}
 				else
 				{
 					printf(" (FPW%s); hole: offset: %u, length: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length);
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length);
 				}
 			}
 			putchar('\n');
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 21d200d3df..e213c68256 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -62,7 +62,8 @@ typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
 							   int reqLen,
 							   XLogRecPtr targetRecPtr,
-							   char *readBuf);
+							   char *readBuf,
+							   bool nowait);
 typedef void (*WALSegmentOpenCB) (XLogReaderState *xlogreader,
 								  XLogSegNo nextSegNo,
 								  TimeLineID *tli_p);
@@ -144,6 +145,30 @@ typedef struct
 	uint16		data_bufsz;
 } DecodedBkpBlock;
 
+/*
+ * The decoded contents of a record.  This occupies a contiguous region of
+ * memory, with main_data and blocks[n].data pointing to memory after the
+ * members declared here.
+ */
+typedef struct DecodedXLogRecord
+{
+	/* Private member used for resource management. */
+	size_t		size;			/* total size of decoded record */
+	bool		oversized;		/* outside the regular decode buffer? */
+	struct DecodedXLogRecord *next;	/* decoded record queue  link */
+
+	/* Public members. */
+	XLogRecPtr	lsn;			/* location */
+	XLogRecPtr	next_lsn;		/* location of next record */
+	XLogRecord	header;			/* header */
+	RepOriginId record_origin;
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+	char	   *main_data;		/* record's main data portion */
+	uint32		main_data_len;	/* main data portion's length */
+	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	DecodedBkpBlock blocks[FLEXIBLE_ARRAY_MEMBER];
+} DecodedXLogRecord;
+
 struct XLogReaderState
 {
 	/*
@@ -168,35 +193,25 @@ struct XLogReaderState
 	void	   *private_data;
 
 	/*
-	 * Start and end point of last record read.  EndRecPtr is also used as the
-	 * position to read next.  Calling XLogBeginRead() sets EndRecPtr to the
-	 * starting position and ReadRecPtr to invalid.
+	 * Start and end point of last record returned by XLogReadRecord().
+	 *
+	 * XXX These are also available as record->lsn and record->next_lsn,
+	 * but since these were part of the public interface...
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
 
-
-	/* ----------------------------------------
-	 * Decoded representation of current record
-	 *
-	 * Use XLogRecGet* functions to investigate the record; these fields
-	 * should not be accessed directly.
-	 * ----------------------------------------
+	/*
+	 * Start and end point of the last record read and decoded by
+	 * XLogReadRecordInternal().  NextRecPtr is also used as the position to
+	 * decode next.  Calling XLogBeginRead() sets NextRecPtr and EndRecPtr to
+	 * the requested starting position.
 	 */
-	XLogRecord *decoded_record; /* currently decoded record */
-
-	char	   *main_data;		/* record's main data portion */
-	uint32		main_data_len;	/* main data portion's length */
-	uint32		main_data_bufsz;	/* allocated size of the buffer */
-
-	RepOriginId record_origin;
+	XLogRecPtr	DecodeRecPtr;	/* start of last record decoded */
+	XLogRecPtr	NextRecPtr;		/* end+1 of last record decoded */
 
-	TransactionId toplevel_xid; /* XID of top-level transaction */
-
-	/* information about blocks referenced by the record. */
-	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
-
-	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	/* Last record returned by XLogReadRecord. */
+	DecodedXLogRecord *record;
 
 	/* ----------------------------------------
 	 * private/internal state
@@ -210,6 +225,26 @@ struct XLogReaderState
 	char	   *readBuf;
 	uint32		readLen;
 
+	/*
+	 * Buffer for decoded records.  This is a circular buffer, though
+	 * individual records can't be split in the middle, so some space is often
+	 * wasted at the end.  Oversized records that don't fit in this space are
+	 * allocated separately.
+	 */
+	char	   *decode_buffer;
+	size_t		decode_buffer_size;
+	bool		free_decode_buffer;		/* need to free? */
+	char	   *decode_buffer_head;		/* write head */
+	char	   *decode_buffer_tail;		/* read head */
+
+	/*
+	 * Queue of records that have been decoded.  This is a linked list that
+	 * usually consists of consecutive records in decode_buffer, but may also
+	 * contain oversized records allocated with palloc().
+	 */
+	DecodedXLogRecord *decode_queue_head;	/* newest decoded record */
+	DecodedXLogRecord *decode_queue_tail;	/* oldest decoded record */
+
 	/* last read XLOG position for data currently in readBuf */
 	WALSegmentContext segcxt;
 	WALOpenSegment seg;
@@ -252,6 +287,7 @@ struct XLogReaderState
 
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
+	bool		errormsg_deferred;
 };
 
 /* Get a new XLogReader */
@@ -264,6 +300,11 @@ extern XLogReaderRoutine *LocalXLogReaderRoutine(void);
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
+/* Optionally provide a circular decoding buffer to allow readahead. */
+extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
+									  void *buffer,
+									  size_t size);
+
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 #ifdef FRONTEND
@@ -274,6 +315,10 @@ extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
 										 char **errormsg);
 
+/* Try to read ahead, if there is space in the decoding buffer. */
+extern DecodedXLogRecord *XLogReadAhead(XLogReaderState *state,
+										char **errormsg);
+
 /* Validate a page */
 extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 										 XLogRecPtr recptr, char *phdr);
@@ -297,25 +342,31 @@ extern bool WALRead(XLogReaderState *state,
 
 /* Functions for decoding an XLogRecord */
 
-extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
+extern size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+extern bool DecodeXLogRecord(XLogReaderState *state,
+							 DecodedXLogRecord *decoded,
+							 XLogRecord *record,
+							 XLogRecPtr lsn,
 							 char **errmsg);
 
-#define XLogRecGetTotalLen(decoder) ((decoder)->decoded_record->xl_tot_len)
-#define XLogRecGetPrev(decoder) ((decoder)->decoded_record->xl_prev)
-#define XLogRecGetInfo(decoder) ((decoder)->decoded_record->xl_info)
-#define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
-#define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
-#define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
-#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
-#define XLogRecGetData(decoder) ((decoder)->main_data)
-#define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
-#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
+#define XLogRecGetTotalLen(decoder) ((decoder)->record->header.xl_tot_len)
+#define XLogRecGetPrev(decoder) ((decoder)->record->header.xl_prev)
+#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
+#define XLogRecGetRmid(decoder) ((decoder)->record->header.xl_rmid)
+#define XLogRecGetXid(decoder) ((decoder)->record->header.xl_xid)
+#define XLogRecGetOrigin(decoder) ((decoder)->record->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->record->toplevel_xid)
+#define XLogRecGetData(decoder) ((decoder)->record->main_data)
+#define XLogRecGetDataLen(decoder) ((decoder)->record->main_data_len)
+#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->record->max_block_id >= 0)
+#define XLogRecMaxBlockId(decoder) ((decoder)->record->max_block_id)
+#define XLogRecGetBlock(decoder, i) (&(decoder)->record->blocks[(i)])
 #define XLogRecHasBlockRef(decoder, block_id) \
-	((decoder)->blocks[block_id].in_use)
+	((decoder)->record->blocks[block_id].in_use)
 #define XLogRecHasBlockImage(decoder, block_id) \
-	((decoder)->blocks[block_id].has_image)
+	((decoder)->record->blocks[block_id].has_image)
 #define XLogRecBlockImageApply(decoder, block_id) \
-	((decoder)->blocks[block_id].apply_image)
+	((decoder)->record->blocks[block_id].apply_image)
 
 #ifndef FRONTEND
 extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 9ac602b674..73d3f4a129 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -49,7 +49,8 @@ extern void FreeFakeRelcacheEntry(Relation fakerel);
 
 extern int	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
-								 XLogRecPtr targetRecPtr, char *cur_page);
+								 XLogRecPtr targetRecPtr, char *cur_page,
+								 bool nowait);
 extern void wal_segment_open(XLogReaderState *state,
 							 XLogSegNo nextSegNo,
 							 TimeLineID *tli_p);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be43c04802..0ba3090cef 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1005,6 +1005,7 @@ typedef enum
 	WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
 	WAIT_EVENT_RECOVERY_CONFLICT_TABLESPACE,
 	WAIT_EVENT_RECOVERY_PAUSE,
+	WAIT_EVENT_RECOVERY_WAL_FLUSH,
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-- 
2.30.1
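
To illustrate how the reader changes above are meant to be driven, here is a
minimal sketch (not code from the patch; it assumes an already-set-up
XLogReaderState *reader and a starting LSN, and PrefetchReferencedBlocks() and
ApplyWalRecord() are hypothetical placeholders for the prefetch and replay
steps):

	/* Give the reader a 512kB circular decode buffer (it allocates it). */
	XLogReaderSetDecodeBuffer(reader, NULL, 512 * 1024);
	XLogBeginRead(reader, start_lsn);

	for (;;)
	{
		DecodedXLogRecord *ahead;
		XLogRecord *record;
		char	   *errormsg;

		/* Opportunistically decode ahead while data and buffer space allow. */
		while ((ahead = XLogReadAhead(reader, &errormsg)) != NULL)
			PrefetchReferencedBlocks(ahead);	/* hypothetical */

		/* Consume the oldest decoded record, as the recovery loop does today. */
		record = XLogReadRecord(reader, &errormsg);
		if (record == NULL)
			break;
		ApplyWalRecord(reader);				/* hypothetical */
	}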

v16-0004-Prefetch-referenced-blocks-during-recovery.patch (text/x-patch; charset=US-ASCII)
From 0e07c25a3c8815daa20985d8560158f6343819b8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Apr 2020 23:04:51 +1200
Subject: [PATCH v16 4/5] Prefetch referenced blocks during recovery.

Introduce a new GUC recovery_prefetch.  If it is enabled (the default),
then read ahead in the WAL and try to initiate asynchronous reading of
referenced blocks that will soon be needed but are not yet cached in our
buffer pool.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size is used to limit the maximum distance we
are prepared to read ahead in the WAL to find uncached blocks.
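
For example, a standby tuned for aggressive readahead might use settings
along these lines (illustrative values, not recommendations):

    recovery_prefetch = on
    recovery_prefetch_fpw = off
    wal_decode_buffer_size = 512kB
    maintenance_io_concurrency = 10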

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  58 ++
 doc/src/sgml/monitoring.sgml                  |  86 +-
 doc/src/sgml/wal.sgml                         |  17 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             |  22 +-
 src/backend/access/transam/xlogprefetch.c     | 907 ++++++++++++++++++
 src/backend/access/transam/xlogreader.c       |   2 +
 src/backend/catalog/system_views.sql          |  14 +
 src/backend/postmaster/pgstat.c               | 103 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  56 +-
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/access/xlog.h                     |   1 +
 src/include/access/xlogprefetch.h             |  79 ++
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/pgstat.h                          |  26 +
 src/include/utils/guc.h                       |   4 +
 src/test/regress/expected/rules.out           |  11 +
 18 files changed, 1399 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetch.c
 create mode 100644 src/include/access/xlogprefetch.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a218d78bef..06ecf9b426 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3408,6 +3408,64 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-recovery-prefetch" xreflabel="recovery_prefetch">
+      <term><varname>recovery_prefetch</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to try to prefetch blocks that are referenced in the WAL but
+        are not yet in the buffer pool, during recovery.  Prefetching blocks
+        that will soon be needed can reduce I/O wait times.
+        See also the <xref linkend="guc-wal-decode-buffer-size"/> and
+        <xref linkend="guc-maintenance-io-concurrency"/> settings, which limit
+        prefetching activity.
+        This setting is enabled
+        by default on systems that support <function>posix_fadvise</function>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+      <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks that were logged with full page images,
+        during recovery.  Often this doesn't help, since such blocks will not
+        be read the first time they are needed and might remain in the buffer
+        pool after that.  However, on file systems with a block size larger
+        than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a
+        costly read-before-write when blocks are later written.
+        The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-decode-buffer-size" xreflabel="wal_decode_buffer_size">
+      <term><varname>wal_decode_buffer_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_decode_buffer_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        A limit on how far ahead the server can look in the WAL, to find
+        blocks to prefetch.  Setting it too high might be counterproductive,
+        if it means that data falls out of the
+        kernel cache before it is needed.  If this value is specified without
+        units, it is taken as bytes.
+        The default is 512kB.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index db4b4e460c..a95a039865 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -337,6 +337,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2892,6 +2899,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    copy of the subscribed tables.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+    </row>
+    <row>
+     <entry><structfield>skip_seq</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+    </row>
+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_distance</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>How far ahead of recovery the prefetcher is on average, while recovery is not idle</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_queue_depth</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>Average number of prefetches in flight while recovery is not idle</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-recovery-prefetch"/>,
+   <xref linkend="guc-recovery-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="1">
@@ -5024,8 +5103,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
         view, <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
-        to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+        the <structname>pg_stat_archiver</structname> view,
+        <literal>wal</literal> to reset all the counters shown in the
+        <structname>pg_stat_wal</structname> view or
+        <literal>prefetch_recovery</literal> to reset all the counters shown
+        in the <structname>pg_stat_prefetch_recovery</structname> view.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index ae4a3c1293..eb7caaa963 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -796,6 +796,23 @@
    counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
    in <structname>pg_stat_wal</structname>, respectively.
   </para>
+
+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.  The
+   prefetching mechanism is most likely to be effective on systems
+   with <varname>full_page_writes</varname> set to
+   <literal>off</literal> (where that is safe), and where the working
+   set is larger than RAM.  By default, prefetching in recovery is enabled
+   on operating systems that have <function>posix_fadvise</function>
+   support.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..39f9d4e77d 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetch.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c33f7722c9..e24e5f0c3c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetch.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -110,6 +111,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
@@ -3729,7 +3731,6 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
 					 xlogfname);
 			set_ps_display(activitymsg);
-
 			restoredFromArchive = RestoreArchivedFile(path, xlogfname,
 													  "RECOVERYXLOG",
 													  wal_segment_size,
@@ -6674,6 +6675,12 @@ StartupXLOG(void)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Set the WAL decode buffer size.  This limits how far ahead we can read
+	 * in the WAL.
+	 */
+	XLogReaderSetDecodeBuffer(xlogreader, NULL, wal_decode_buffer_size);
+
 	/*
 	 * Allocate two page buffers dedicated to WAL consistency checks.  We do
 	 * it this way, rather than just making static arrays, for two reasons:
@@ -7354,6 +7361,7 @@ StartupXLOG(void)
 		{
 			ErrorContextCallback errcallback;
 			TimestampTz xtime;
+			XLogPrefetchState prefetch;
 			PGRUsage	ru0;
 
 			pg_rusage_init(&ru0);
@@ -7364,6 +7372,9 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							LSN_FORMAT_ARGS(ReadRecPtr))));
 
+			/* Prepare to prefetch, if configured. */
+			XLogPrefetchBegin(&prefetch, xlogreader);
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7393,6 +7404,9 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/* Perform WAL prefetching, if enabled. */
+				XLogPrefetch(&prefetch, xlogreader->ReadRecPtr);
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -7566,6 +7580,9 @@ StartupXLOG(void)
 					 */
 					if (AllowCascadeReplication())
 						WalSndWakeup();
+
+					/* Reset the prefetcher. */
+					XLogPrefetchReconfigure();
 				}
 
 				/* Exit loop if we reached inclusive recovery target */
@@ -7582,6 +7599,7 @@ StartupXLOG(void)
 			/*
 			 * end of main redo apply loop
 			 */
+			XLogPrefetchEnd(&prefetch);
 
 			if (reachedRecoveryTarget)
 			{
@@ -12436,6 +12454,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					currentSource = XLOG_FROM_STREAM;
 					startWalReceiver = true;
+					XLogPrefetchReconfigure();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -12699,6 +12718,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						else
 							havedata = false;
 					}
+
 					if (havedata)
 					{
 						/*
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
new file mode 100644
index 0000000000..7224d882cb
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -0,0 +1,907 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetch.c
+ *
+ * The goal of this module is to read future WAL records and issue
+ * PrefetchSharedBuffer() calls for referenced blocks, so that we avoid I/O
+ * stalls in the main recovery loop.
+ *
+ * When examining a WAL record from the future, we need to consider that a
+ * referenced block or segment file might not exist on disk until this record
+ * or some earlier record has been replayed.  After a crash, a file might also
+ * be missing because it was dropped by a later WAL record; in that case, it
+ * will be recreated when this record is replayed.  These cases are handled by
+ * recognizing them and adding a "filter" that prevents all prefetching of a
+ * certain block range until the present WAL record has been replayed.  Blocks
+ * skipped for these reasons are counted as "skip_new" (that is, cases where we
+ * didn't try to prefetch "new" blocks).
+ *
+ * Blocks found in the buffer pool already are counted as "skip_hit".
+ * Repeated access to the same buffer is detected and skipped, and this is
+ * counted with "skip_seq".  Blocks that were logged with FPWs are skipped if
+ * recovery_prefetch_fpw is off, since on most systems there will be no I/O
+ * stall; this is counted with "skip_fpw".
+ *
+ * The only way we currently have to know that an I/O initiated with
+ * PrefetchSharedBuffer() has completed is that recovery will eventually
+ * call ReadBuffer() and perform a synchronous read.  Therefore, we track
+ * the number of
+ * potentially in-flight I/Os by using a circular buffer of LSNs.  When it's
+ * full, we have to wait for recovery to replay records so that the queue
+ * depth can be reduced, before we can do any more prefetching.  Ideally, this
+ * keeps us the right distance ahead to respect maintenance_io_concurrency.
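+ * For example, with maintenance_io_concurrency = 10 the queue can hold up to
+ * 10 LSNs of records whose blocks we have prefetched; once it is full, we
+ * stop scanning ahead until replay has advanced past the oldest queued LSN.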
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetch.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log
+ * message that appears at the end of crash recovery.  It's also used to send
+ * messages periodically to the stats collector, to save the counters on disk.
+ */
+#define XLOGPREFETCHER_SAMPLE_DISTANCE 0x40000
+
+/* GUCs */
+bool		recovery_prefetch = true;
+bool		recovery_prefetch_fpw = false;
+
+int			XLogPrefetchReconfigureCount;
+
+/*
+ * A prefetcher object.  There is at most one of these in existence at a time,
+ * recreated whenever there is a configuration change.
+ */
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	DecodedXLogRecord *record;
+	int				next_block_id;
+	bool			shutdown;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation	last_reln;
+	RelFileNode		last_rnode;
+	BlockNumber		last_blkno;
+
+	/* Online averages. */
+	uint64			samples;
+	double			avg_queue_depth;
+	double			avg_distance;
+	XLogRecPtr		next_sample_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB		   *filter_table;
+	dlist_head		filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	int				prefetch_head;
+	int				prefetch_tail;
+	int				prefetch_queue_size;
+	XLogRecPtr		prefetch_queue[MAX_IO_CONCURRENCY + 1];
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode		rnode;
+	XLogRecPtr		filter_until_replayed;
+	BlockNumber		filter_from_block;
+	dlist_node		link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time; /* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 skip_hit;	/* Blocks already buffered. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+	pg_atomic_uint64 skip_seq;	/* Repeat blocks skipped. */
+	float			avg_distance;
+	float			avg_queue_depth;
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32			reset_handled;
+
+	/* Dynamic values */
+	int				distance;	/* Number of bytes ahead in the WAL. */
+	int				queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+static void XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher,
+									  XLogRecPtr replaying_lsn);
+static bool XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher);
+static void XLogPrefetchSaveStats(void);
+static void XLogPrefetchRestoreStats(void);
+
+static XLogPrefetchStats *Stats;
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&Stats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&Stats->prefetch, 0);
+	pg_atomic_write_u64(&Stats->skip_hit, 0);
+	pg_atomic_write_u64(&Stats->skip_new, 0);
+	pg_atomic_write_u64(&Stats->skip_fpw, 0);
+	pg_atomic_write_u64(&Stats->skip_seq, 0);
+	Stats->avg_distance = 0;
+	Stats->avg_queue_depth = 0;
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	Stats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+	if (!found)
+	{
+		pg_atomic_init_u32(&Stats->reset_request, 0);
+		Stats->reset_handled = 0;
+		pg_atomic_init_u64(&Stats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&Stats->prefetch, 0);
+		pg_atomic_init_u64(&Stats->skip_hit, 0);
+		pg_atomic_init_u64(&Stats->skip_new, 0);
+		pg_atomic_init_u64(&Stats->skip_fpw, 0);
+		pg_atomic_init_u64(&Stats->skip_seq, 0);
+		Stats->avg_distance = 0;
+		Stats->avg_queue_depth = 0;
+		Stats->distance = 0;
+		Stats->queue_depth = 0;
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&Stats->reset_request, 1);
+}
+
+/*
+ * Tell the stats collector to serialize the shared memory counters into the
+ * stats file.
+ */
+static void
+XLogPrefetchSaveStats(void)
+{
+	PgStat_RecoveryPrefetchStats serialized = {
+		.prefetch = pg_atomic_read_u64(&Stats->prefetch),
+		.skip_hit = pg_atomic_read_u64(&Stats->skip_hit),
+		.skip_new = pg_atomic_read_u64(&Stats->skip_new),
+		.skip_fpw = pg_atomic_read_u64(&Stats->skip_fpw),
+		.skip_seq = pg_atomic_read_u64(&Stats->skip_seq),
+		.stat_reset_timestamp = pg_atomic_read_u64(&Stats->reset_time)
+	};
+
+	pgstat_send_recoveryprefetch(&serialized);
+}
+
+/*
+ * Try to restore the shared memory counters from the stats file.
+ */
+static void
+XLogPrefetchRestoreStats(void)
+{
+	PgStat_RecoveryPrefetchStats *serialized = pgstat_fetch_recoveryprefetch();
+
+	if (serialized->stat_reset_timestamp != 0)
+	{
+		pg_atomic_write_u64(&Stats->prefetch, serialized->prefetch);
+		pg_atomic_write_u64(&Stats->skip_hit, serialized->skip_hit);
+		pg_atomic_write_u64(&Stats->skip_new, serialized->skip_new);
+		pg_atomic_write_u64(&Stats->skip_fpw, serialized->skip_fpw);
+		pg_atomic_write_u64(&Stats->skip_seq, serialized->skip_seq);
+		pg_atomic_write_u64(&Stats->reset_time, serialized->stat_reset_timestamp);
+	}
+}
+
+/*
+ * Increment a counter in shared memory.  This is equivalent to *counter++ on a
+ * plain uint64 without any memory barrier or locking, except on platforms
+ * where readers can't read uint64 without possibly observing a torn value.
+ */
+static inline void
+XLogPrefetchIncrement(pg_atomic_uint64 *counter)
+{
+	Assert(AmStartupProcess());
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
+/*
+ * Initialize an XLogPrefetchState object and restore the last saved
+ * statistics from disk.
+ */
+void
+XLogPrefetchBegin(XLogPrefetchState *state, XLogReaderState *reader)
+{
+	XLogPrefetchRestoreStats();
+
+	/* We'll reconfigure on the first call to XLogPrefetch(). */
+	state->reader = reader;
+	state->prefetcher = NULL;
+	state->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+}
+
+/*
+ * Shut down the prefetching infrastructure, if configured.
+ */
+void
+XLogPrefetchEnd(XLogPrefetchState *state)
+{
+	XLogPrefetchSaveStats();
+
+	if (state->prefetcher)
+		XLogPrefetcherFree(state->prefetcher);
+	state->prefetcher = NULL;
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogReaderState *reader)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	/*
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.  The queue has space for up
+	 * the highest possible value of the GUC + 1, because our circular buffer
+	 * has a gap between head and tail when full.
+	 */
+	prefetcher = palloc0(sizeof(XLogPrefetcher));
+	prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
+	prefetcher->reader = reader;
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	Stats->queue_depth = 0;
+	Stats->distance = 0;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	/* Log final statistics. */
+	ereport(LOG,
+			(errmsg("recovery finished prefetching at %X/%X; "
+					"prefetch = " UINT64_FORMAT ", "
+					"skip_hit = " UINT64_FORMAT ", "
+					"skip_new = " UINT64_FORMAT ", "
+					"skip_fpw = " UINT64_FORMAT ", "
+					"skip_seq = " UINT64_FORMAT ", "
+					"avg_distance = %f, "
+					"avg_queue_depth = %f",
+			 (uint32) (prefetcher->reader->EndRecPtr >> 32),
+			 (uint32) (prefetcher->reader->EndRecPtr),
+			 pg_atomic_read_u64(&Stats->prefetch),
+			 pg_atomic_read_u64(&Stats->skip_hit),
+			 pg_atomic_read_u64(&Stats->skip_new),
+			 pg_atomic_read_u64(&Stats->skip_fpw),
+			 pg_atomic_read_u64(&Stats->skip_seq),
+			 Stats->avg_distance,
+			 Stats->avg_queue_depth)));
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Called when recovery is replaying a new LSN, to check if we can read ahead.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	uint32		reset_request;
+
+	/* If an error has occurred or we've hit the end of the WAL, do nothing. */
+	if (prefetcher->shutdown)
+		return;
+
+	/*
+	 * Have any in-flight prefetches definitely completed, judging by the LSN
+	 * that is currently being replayed?
+	 */
+	XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+	/*
+	 * Do we already have the maximum permitted number of I/Os running
+	 * (according to the information we have)?  If so, we have to wait for at
+	 * least one to complete, so give up early and let recovery catch up.
+	 */
+	if (XLogPrefetcherSaturated(prefetcher))
+		return;
+
+	/*
+	 * Can we drop any filters yet?  This happens when the LSN that is
+	 * currently being replayed has moved past a record that prevents
+	 * prefetching of a block range, such as relation extension.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&Stats->reset_request);
+	if (reset_request != Stats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		Stats->reset_handled = reset_request;
+		prefetcher->avg_distance = 0;
+		prefetcher->avg_queue_depth = 0;
+		prefetcher->samples = 0;
+	}
+
+	/* OK, we can now try reading ahead. */
+	XLogPrefetcherScanRecords(prefetcher, replaying_lsn);
+}
+
+/*
+ * Read ahead as far as we are allowed to, considering the LSN that recovery
+ * is currently replaying.
+ */
+static void
+XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	XLogReaderState *reader = prefetcher->reader;
+	DecodedXLogRecord *record;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	for (;;)
+	{
+		char *error;
+		int64		distance;
+
+		/* If we don't already have a record, then try to read one. */
+		if (prefetcher->record == NULL)
+		{
+			record = XLogReadAhead(reader, &error);
+			if (record == NULL)
+			{
+				/* If we got an error, log it and give up. */
+				if (error)
+				{
+					ereport(LOG, (errmsg("recovery no longer prefetching: %s", error)));
+					prefetcher->shutdown = true;
+					Stats->queue_depth = 0;
+					Stats->distance = 0;
+				}
+				/* Otherwise, we'll try again later when more data is here. */
+				return;
+			}
+			prefetcher->record = record;
+			prefetcher->next_block_id = 0;
+		}
+		else
+		{
+			/*
+			 * We ran out of I/O queue space while partway through a record.  We'll
+			 * carry on where we left off, according to next_block_id.
+			 */
+			record = prefetcher->record;
+		}
+
+		/* How far ahead of replay are we now? */
+		distance = record->lsn - replaying_lsn;
+
+		/* Update distance shown in shm. */
+		Stats->distance = distance;
+
+		/* Periodically recompute some statistics. */
+		if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+		{
+			/* Compute online averages. */
+			prefetcher->samples++;
+			if (prefetcher->samples == 1)
+			{
+				prefetcher->avg_distance = Stats->distance;
+				prefetcher->avg_queue_depth = Stats->queue_depth;
+			}
+			else
+			{
+				prefetcher->avg_distance +=
+					(Stats->distance - prefetcher->avg_distance) /
+					prefetcher->samples;
+				prefetcher->avg_queue_depth +=
+					(Stats->queue_depth - prefetcher->avg_queue_depth) /
+					prefetcher->samples;
+			}
+
+			/* Expose it in shared memory. */
+			Stats->avg_distance = prefetcher->avg_distance;
+			Stats->avg_queue_depth = prefetcher->avg_queue_depth;
+
+			/* Also periodically save the simple counters. */
+			XLogPrefetchSaveStats();
+
+			prefetcher->next_sample_lsn =
+				replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
+		}
+
+		/* Are we not far enough ahead? */
+		if (distance <= 0)
+		{
+			/* XXX Is this still possible? */
+			prefetcher->record = NULL;		/* skip this record */
+			continue;
+		}
+
+		/*
+		 * If this is a record that creates a new SMGR relation, we'll avoid
+		 * prefetching anything from that rnode until it has been replayed.
+		 */
+		if (replaying_lsn < record->lsn &&
+			record->header.xl_rmid == RM_SMGR_ID &&
+			(record->header.xl_info & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+		{
+			xl_smgr_create *xlrec = (xl_smgr_create *) record->main_data;
+
+			XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0, record->lsn);
+		}
+
+		/* Scan the record's block references. */
+		if (!XLogPrefetcherScanBlocks(prefetcher))
+			return;
+
+		/* Advance to the next record. */
+		prefetcher->record = NULL;
+	}
+}
+
+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+	DecodedXLogRecord *record = prefetcher->record;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	/*
+	 * We might already have been partway through processing this record when
+	 * our queue became saturated, so we need to start where we left off.
+	 */
+	for (int block_id = prefetcher->next_block_id;
+		 block_id <= record->max_block_id;
+		 ++block_id)
+	{
+		DecodedBkpBlock *block = &record->blocks[block_id];
+		PrefetchBufferResult prefetch;
+		SMgrRelation reln;
+
+		/* Ignore everything but the main fork for now. */
+		if (block->forknum != MAIN_FORKNUM)
+			continue;
+
+		/*
+		 * If there is a full page image attached, we won't be reading the
+		 * page, so you might think we should skip it.  However, if the
+		 * underlying filesystem uses larger logical blocks than us, it
+		 * might still need to perform a read-before-write some time later.
+		 * Therefore, only prefetch if configured to do so.
+		 */
+		if (block->has_image && !recovery_prefetch_fpw)
+		{
+			XLogPrefetchIncrement(&Stats->skip_fpw);
+			continue;
+		}
+
+		/*
+		 * If this block will initialize a new page then it's probably a
+		 * relation extension.  Since that might create a new segment, we
+		 * can't try to prefetch this block until the record has been
+		 * replayed, or we might try to open a file that doesn't exist yet.
+		 */
+		if (block->flags & BKPBLOCK_WILL_INIT)
+		{
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+									record->lsn);
+			XLogPrefetchIncrement(&Stats->skip_new);
+			continue;
+		}
+
+		/* Should we skip this block due to a filter? */
+		if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+		{
+			XLogPrefetchIncrement(&Stats->skip_new);
+			continue;
+		}
+
+		/* Fast path for repeated references to the same relation. */
+		if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+		{
+			/*
+			 * If this is a repeat access to the same block, then skip it.
+			 *
+			 * XXX We could also check for last_blkno + 1 too, and also update
+			 * last_blkno; it's not clear if the kernel would do a better job
+			 * of sequential prefetching.
+			 */
+			if (block->blkno == prefetcher->last_blkno)
+			{
+				XLogPrefetchIncrement(&Stats->skip_seq);
+				continue;
+			}
+
+			/* We can avoid calling smgropen(). */
+			reln = prefetcher->last_reln;
+		}
+		else
+		{
+			/* Otherwise we have to open it. */
+			reln = smgropen(block->rnode, InvalidBackendId);
+			prefetcher->last_rnode = block->rnode;
+			prefetcher->last_reln = reln;
+		}
+		prefetcher->last_blkno = block->blkno;
+
+		/* Try to prefetch this block! */
+		prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+		if (BufferIsValid(prefetch.recent_buffer))
+		{
+			/*
+			 * It was already cached, so do nothing.  Perhaps in future we
+			 * could remember the buffer so that recovery doesn't have to look
+			 * it up again.
+			 */
+			XLogPrefetchIncrement(&Stats->skip_hit);
+		}
+		else if (prefetch.initiated_io)
+		{
+			/*
+			 * I/O has possibly been initiated (we don't know whether the
+			 * kernel already had the page cached, so for lack of better
+			 * information we just assume that it has).  Record
+			 * this as an I/O in progress until eventually we replay this
+			 * LSN.
+			 */
+			XLogPrefetchIncrement(&Stats->prefetch);
+			XLogPrefetcherInitiatedIO(prefetcher, record->lsn);
+			/*
+			 * If the queue is now full, we'll have to wait before processing
+			 * any more blocks from this record, or move to a new record if
+			 * that was the last block.
+			 */
+			if (XLogPrefetcherSaturated(prefetcher))
+			{
+				prefetcher->next_block_id = block_id + 1;
+				return false;
+			}
+		}
+		else
+		{
+			/*
+			 * Neither cached nor initiated.  The underlying segment file
+			 * doesn't exist.  Presumably it will be unlinked by a later WAL
+			 * record.  When recovery reads this block, it will use the
+			 * EXTENSION_CREATE_RECOVERY flag.  We certainly don't want to do
+			 * that sort of thing while merely prefetching, so let's just
+			 * ignore references to this relation until this record is
+			 * replayed, and let recovery create the dummy file or complain if
+			 * something is wrong.
+			 */
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+									record->lsn);
+			XLogPrefetchIncrement(&Stats->skip_new);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mod required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&Stats->reset_request) != Stats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&Stats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&Stats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_new));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_fpw));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_seq));
+	values[6] = Int32GetDatum(Stats->distance);
+	values[7] = Int32GetDatum(Stats->queue_depth);
+	values[8] = Float4GetDatum(Stats->avg_distance);
+	values[9] = Float4GetDatum(Stats->avg_queue_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Compute (n + 1) % prefetch_queue_size, assuming n < prefetch_queue_size,
+ * without using division.
+ */
+static inline int
+XLogPrefetcherNext(XLogPrefetcher *prefetcher, int n)
+{
+	int		next = n + 1;
+
+	return next == prefetcher->prefetch_queue_size ? 0 : next;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the (presumably lower) block
+		 * number there because we don't want to have to track individual
+		 * blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of the
+	 * time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+												   HASH_FIND, NULL);
+
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Insert an LSN into the queue.  The queue must not be full already.  This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+						  XLogRecPtr prefetching_lsn)
+{
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+	prefetcher->prefetch_queue[prefetcher->prefetch_head] = prefetching_lsn;
+	prefetcher->prefetch_head =
+		XLogPrefetcherNext(prefetcher, prefetcher->prefetch_head);
+	Stats->queue_depth++;
+	Assert(Stats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet?  That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches.  For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+		   prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+	{
+		prefetcher->prefetch_tail =
+			XLogPrefetcherNext(prefetcher, prefetcher->prefetch_tail);
+		Stats->queue_depth--;
+		Assert(Stats->queue_depth >= 0);
+	}
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+	int		next = XLogPrefetcherNext(prefetcher, prefetcher->prefetch_head);
+
+	return next == prefetcher->prefetch_tail;
+}
+
+void
+assign_recovery_prefetch(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
+
+void
+assign_recovery_prefetch_fpw(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch_fpw = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 07c05e01a6..a85b718d54 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -865,6 +865,8 @@ err:
 	/*
 	 * Invalidate the read state, if this was an error. We might read from a
 	 * different source after failure.
+	 *
+	 * XXX !?!
 	 */
 	if (readOff < 0 || state->errormsg_buf[0] != '\0')
 		XLogReaderInvalReadState(state);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0dca65dc7b..238b111d2b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -841,6 +841,20 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.skip_hit,
+            s.skip_new,
+            s.skip_fpw,
+            s.skip_seq,
+            s.distance,
+            s.queue_depth,
+            s.avg_distance,
+            s.avg_queue_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 732e75eb39..e47d05adec 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "access/xlogprefetch.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -299,6 +300,7 @@ static PgStat_WalStats walStats;
 static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
 static PgStat_ReplSlotStats *replSlotStats;
 static int	nReplSlotStats;
+static PgStat_RecoveryPrefetchStats recoveryPrefetchStats;
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -377,6 +379,7 @@ static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
 static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
+static void pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1452,11 +1455,20 @@ pgstat_reset_shared_counters(const char *target)
 		msg.m_resettarget = RESET_BGWRITER;
 	else if (strcmp(target, "wal") == 0)
 		msg.m_resettarget = RESET_WAL;
+	else if (strcmp(target, "prefetch_recovery") == 0)
+	{
+		/*
+		 * We can't ask the stats collector to do this for us as it is not
+		 * attached to shared memory.
+		 */
+		XLogPrefetchRequestResetStats();
+		return;
+	}
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("unrecognized reset target: \"%s\"", target),
-				 errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
+				 errhint("Target must be \"archiver\", \"bgwriter\", \"wal\" or \"prefetch_recovery\".")));
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
 	pgstat_send(&msg, sizeof(msg));
@@ -2902,6 +2914,22 @@ pgstat_fetch_replslot(int *nslots_p)
 	return replSlotStats;
 }
 
+/*
+ * ---------
+ * pgstat_fetch_recoveryprefetch() -
+ *
+ *	Support function for restoring the counters managed by xlogprefetch.c.
+ * ---------
+ */
+PgStat_RecoveryPrefetchStats *
+pgstat_fetch_recoveryprefetch(void)
+{
+	backend_read_statsfile();
+
+	return &recoveryPrefetchStats;
+}
+
+
 /* ------------------------------------------------------------
  * Functions for management of the shared-memory PgBackendStatus array
  * ------------------------------------------------------------
@@ -4804,6 +4832,23 @@ pgstat_send_slru(void)
 }
 
 
+/* ----------
+ * pgstat_send_recoveryprefetch() -
+ *
+ *		Send recovery prefetch statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats)
+{
+	PgStat_MsgRecoveryPrefetch msg;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYPREFETCH);
+	msg.m_stats = *stats;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
 /* ----------
  * PgstatCollectorMain() -
  *
@@ -5017,6 +5062,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_slru(&msg.msg_slru, len);
 					break;
 
+				case PGSTAT_MTYPE_RECOVERYPREFETCH:
+					pgstat_recv_recoveryprefetch(&msg.msg_recoveryprefetch, len);
+					break;
+
 				case PGSTAT_MTYPE_FUNCSTAT:
 					pgstat_recv_funcstat(&msg.msg_funcstat, len);
 					break;
@@ -5309,6 +5358,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
 	rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
 	(void) rc;					/* we'll check for error with ferror */
 
+	/*
+	 * Write recovery prefetch stats struct
+	 */
+	rc = fwrite(&recoveryPrefetchStats, sizeof(recoveryPrefetchStats), 1,
+				fpout);
+	(void) rc;					/* we'll check for error with ferror */
+
 	/*
 	 * Walk through the database table.
 	 */
@@ -5582,6 +5638,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 	memset(&archiverStats, 0, sizeof(archiverStats));
 	memset(&walStats, 0, sizeof(walStats));
 	memset(&slruStats, 0, sizeof(slruStats));
+	memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
 
 	/*
 	 * Set the current timestamp (will be kept only in case we can't load an
@@ -5687,6 +5744,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 		goto done;
 	}
 
+	/*
+	 * Read recoveryPrefetchStats struct
+	 */
+	if (fread(&recoveryPrefetchStats, 1, sizeof(recoveryPrefetchStats),
+			  fpin) != sizeof(recoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
+		goto done;
+	}
+
 	/*
 	 * We found an existing collector stats file. Read it and put all the
 	 * hashtable entries into place.
@@ -6005,6 +6074,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 	PgStat_WalStats myWalStats;
 	PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
 	PgStat_ReplSlotStats myReplSlotStats;
+	PgStat_RecoveryPrefetchStats myRecoveryPrefetchStats;
 	FILE	   *fpin;
 	int32		format_id;
 	const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -6081,6 +6151,18 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 		return false;
 	}
 
+	/*
+	 * Read recovery prefetch stats struct
+	 */
+	if (fread(&myRecoveryPrefetchStats, 1, sizeof(myRecoveryPrefetchStats),
+			  fpin) != sizeof(myRecoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		FreeFile(fpin);
+		return false;
+	}
+
 	/* By default, we're going to return the timestamp of the global file. */
 	*ts = myGlobalStats.stats_timestamp;
 
@@ -6264,6 +6346,13 @@ backend_read_statsfile(void)
 		if (ok && file_ts >= min_ts)
 			break;
 
+		/*
+		 * If we're in crash recovery, the collector may not even be running,
+		 * so work with what we have.
+		 */
+		if (InRecovery)
+			break;
+
 		/* Not there or too old, so kick the collector and wait a bit */
 		if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
 			pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
@@ -6964,6 +7053,18 @@ pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
 	slruStats[msg->m_index].truncate += msg->m_truncate;
 }
 
+/* ----------
+ * pgstat_recv_recoveryprefetch() -
+ *
+ *	Process a recovery prefetch message.
+ * ----------
+ */
+static void
+pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len)
+{
+	recoveryPrefetchStats = msg->m_stats;
+}
+
 /* ----------
  * pgstat_recv_recoveryconflict() -
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97..47847563ef 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/xlogprefetch.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -126,6 +127,7 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, PredicateLockShmemSize());
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
+		size = add_size(size, XLogPrefetchShmemSize());
 		size = add_size(size, CLOGShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
@@ -217,6 +219,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 855076b1fd..9270ba12f6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -37,6 +37,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetch.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/storage.h"
@@ -203,6 +204,7 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1251,6 +1253,32 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch referenced blocks during recovery"),
+			gettext_noop("Read ahead of the currenty replay position to find uncached blocks.")
+		},
+		&recovery_prefetch,
+		/* No point in enabling this on systems without a suitable API. */
+#ifdef USE_PREFETCH
+		true,
+#else
+		false,
+#endif
+		NULL, assign_recovery_prefetch, NULL
+	},
+	{
+		{"recovery_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch blocks that have full page images in the WAL"),
+			gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+						 "entirely overwritten, but if the logical page size of the filesystem is "
+						 "larger than PostgreSQL's, this can be beneficial.  This option has no "
+						 "effect unless recovery_prefetch is enabled.")
+		},
+		&recovery_prefetch_fpw,
+		false,
+		NULL, assign_recovery_prefetch_fpw, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2669,6 +2697,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
+			gettext_noop("This controls the maximum distance we can read ahead n the WAL to prefetch referenced blocks."),
+			GUC_UNIT_BYTE
+		},
+		&wal_decode_buffer_size,
+		512 * 1024, 64 * 1024, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the size of WAL files held for standby servers."),
@@ -2989,7 +3028,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -11781,6 +11821,20 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index f46c2dd7a8..31fcf32bad 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -235,6 +235,12 @@
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
 
+# - Prefetching during recovery -
+
+#wal_decode_buffer_size = 512kB		# lookahead window used for prefetching
+#recovery_prefetch = on			# prefetch referenced blocks during recovery
+#recovery_prefetch_fpw = off		# whether to prefetch pages logged with FPW
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 6d384d3ce6..f4a0a78ede 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -132,6 +132,7 @@ extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
 extern bool track_wal_io_timing;
+extern int	wal_decode_buffer_size;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/access/xlogprefetch.h b/src/include/access/xlogprefetch.h
new file mode 100644
index 0000000000..772b5205b1
--- /dev/null
+++ b/src/include/access/xlogprefetch.h
@@ -0,0 +1,79 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCH_H
+#define XLOGPREFETCH_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern bool recovery_prefetch;
+extern bool recovery_prefetch_fpw;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern int XLogPrefetchReconfigureCount;
+
+typedef struct XLogPrefetchState
+{
+	XLogReaderState *reader;
+	XLogPrefetcher *prefetcher;
+	int			reconfigure_count;
+} XLogPrefetchState;
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchReconfigure(void);
+extern void XLogPrefetchRequestResetStats(void);
+
+extern void XLogPrefetchBegin(XLogPrefetchState *state, XLogReaderState *reader);
+extern void XLogPrefetchEnd(XLogPrefetchState *state);
+
+/* Functions exposed only for the use of XLogPrefetch(). */
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogReaderState *reader);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch,
+									XLogRecPtr replaying_lsn);
+
+/*
+ * Tell the prefetching module that we are now replaying a given LSN, so that
+ * it can decide how far ahead to read in the WAL, if configured.
+ */
+static inline void
+XLogPrefetch(XLogPrefetchState *state, XLogRecPtr replaying_lsn)
+{
+	/*
+	 * Handle any configuration changes.  Rather than trying to deal with
+	 * various parameter changes, we just tear down and set up a new
+	 * prefetcher if anything we depend on changes.
+	 */
+	if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
+	{
+		/* If we had a prefetcher, tear it down. */
+		if (state->prefetcher)
+		{
+			XLogPrefetcherFree(state->prefetcher);
+			state->prefetcher = NULL;
+		}
+		/* If we want a prefetcher, set it up. */
+		if (recovery_prefetch)
+			state->prefetcher = XLogPrefetcherAllocate(state->reader);
+		state->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	if (state->prefetcher)
+		XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
+}
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 93393fcfd4..53b6a99e76 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6239,6 +6239,14 @@
   prorettype => 'text', proargtypes => '',
   prosrc => 'pg_get_wal_replay_pause_state' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,float4,float4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth,avg_distance,avg_queue_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0ba3090cef..75e8b63780 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -74,6 +74,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_WAL,
 	PGSTAT_MTYPE_SLRU,
+	PGSTAT_MTYPE_RECOVERYPREFETCH,
 	PGSTAT_MTYPE_FUNCSTAT,
 	PGSTAT_MTYPE_FUNCPURGE,
 	PGSTAT_MTYPE_RECOVERYCONFLICT,
@@ -197,6 +198,19 @@ typedef struct PgStat_TableXactStatus
 	struct PgStat_TableXactStatus *next;	/* next of same subxact */
 } PgStat_TableXactStatus;
 
+/*
+ * Recovery prefetching statistics persisted on disk by pgstat.c, but kept in
+ * shared memory by xlogprefetch.c.
+ */
+typedef struct PgStat_RecoveryPrefetchStats
+{
+	PgStat_Counter prefetch;
+	PgStat_Counter skip_hit;
+	PgStat_Counter skip_new;
+	PgStat_Counter skip_fpw;
+	PgStat_Counter skip_seq;
+	TimestampTz stat_reset_timestamp;
+} PgStat_RecoveryPrefetchStats;
 
 /* ------------------------------------------------------------
  * Message formats follow
@@ -517,6 +531,15 @@ typedef struct PgStat_MsgReplSlot
 	PgStat_Counter m_stream_bytes;
 } PgStat_MsgReplSlot;
 
+/* ----------
+ * PgStat_MsgRecoveryPrefetch			Sent by XLogPrefetch to save statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgRecoveryPrefetch
+{
+	PgStat_MsgHdr m_hdr;
+	PgStat_RecoveryPrefetchStats m_stats;
+} PgStat_MsgRecoveryPrefetch;
 
 /* ----------
  * PgStat_MsgRecoveryConflict	Sent by the backend upon recovery conflict
@@ -679,6 +702,7 @@ typedef union PgStat_Msg
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgWal msg_wal;
 	PgStat_MsgSLRU msg_slru;
+	PgStat_MsgRecoveryPrefetch msg_recoveryprefetch;
 	PgStat_MsgFuncstat msg_funcstat;
 	PgStat_MsgFuncpurge msg_funcpurge;
 	PgStat_MsgRecoveryConflict msg_recoveryconflict;
@@ -1602,6 +1626,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
 extern void pgstat_report_wal(void);
 extern bool pgstat_send_wal(bool force);
 
@@ -1621,6 +1646,7 @@ extern PgStat_GlobalStats *pgstat_fetch_global(void);
 extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
 extern PgStat_ReplSlotStats *pgstat_fetch_replslot(int *nslots_p);
+extern PgStat_RecoveryPrefetchStats *pgstat_fetch_recoveryprefetch(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 5004ee4177..d0078779c8 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -442,4 +442,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetch.c */
+extern void assign_recovery_prefetch(bool new_value, void *extra);
+extern void assign_recovery_prefetch_fpw(bool new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 9b12cc122a..dd76a3c0aa 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1878,6 +1878,17 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.skip_hit,
+    s.skip_new,
+    s.skip_fpw,
+    s.skip_seq,
+    s.distance,
+    s.queue_depth,
+    s.avg_distance,
+    s.avg_queue_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth, avg_distance, avg_queue_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.30.1

v16-0005-Avoid-extra-buffer-lookup-when-prefetching-WAL-b.patch
From 4af3b9b7c0aa31c180ed2ab81498a833d9fcd4f1 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 18 Mar 2021 13:01:25 +1300
Subject: [PATCH v16 5/5] Avoid extra buffer lookup when prefetching WAL
 blocks.

Provide some workspace in decoded WAL records to remember which buffer
we recently found a block cached in, so that we can try to use
ReadRecentBuffer() while replaying.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/access/transam/xlog.c         |  2 +-
 src/backend/access/transam/xlogprefetch.c |  6 +++---
 src/backend/access/transam/xlogreader.c   | 13 +++++++++++++
 src/backend/access/transam/xlogutils.c    | 23 +++++++++++++++++++----
 src/backend/storage/buffer/bufmgr.c       |  1 +
 src/backend/storage/freespace/freespace.c |  3 ++-
 src/include/access/xlogreader.h           |  7 +++++++
 src/include/access/xlogutils.h            |  3 ++-
 8 files changed, 48 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e24e5f0c3c..8798185cae 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1463,7 +1463,7 @@ checkXLogConsistency(XLogReaderState *record)
 		 * temporary page.
 		 */
 		buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									 RBM_NORMAL_NO_LOG);
+									 RBM_NORMAL_NO_LOG, InvalidBuffer);
 		if (!BufferIsValid(buf))
 			continue;
 
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
index 7224d882cb..f14636de3e 100644
--- a/src/backend/access/transam/xlogprefetch.c
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -636,10 +636,10 @@ XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
 		if (BufferIsValid(prefetch.recent_buffer))
 		{
 			/*
-			 * It was already cached, so do nothing.  Perhaps in future we
-			 * could remember the buffer so that recovery doesn't have to look
-			 * it up again.
+			 * It was already cached, so do nothing.  We'll remember the
+			 * buffer, so that recovery can try to avoid looking it up again.
 			 */
+			block->recent_buffer = prefetch.recent_buffer;
 			XLogPrefetchIncrement(&Stats->skip_hit);
 		}
 		else if (prefetch.initiated_io)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a85b718d54..4f0ae8b71c 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1646,6 +1646,8 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
 
+			blk->recent_buffer = InvalidBuffer;
+
 			COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 			/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
 			if (blk->has_data && blk->data_len == 0)
@@ -1853,6 +1855,15 @@ err:
 bool
 XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 				   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+	return XLogRecGetRecentBuffer(record, block_id, rnode, forknum, blknum,
+								  NULL);
+}
+
+bool
+XLogRecGetRecentBuffer(XLogReaderState *record, uint8 block_id,
+					   RelFileNode *rnode, ForkNumber *forknum,
+					   BlockNumber *blknum, Buffer *recent_buffer)
 {
 	DecodedBkpBlock *bkpb;
 
@@ -1867,6 +1878,8 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 		*forknum = bkpb->forknum;
 	if (blknum)
 		*blknum = bkpb->blkno;
+	if (recent_buffer)
+		*recent_buffer = bkpb->recent_buffer;
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5cd1c8ab1b..41f56ff856 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -335,11 +335,13 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber blkno;
+	Buffer		recent_buffer;
 	Page		page;
 	bool		zeromode;
 	bool		willinit;
 
-	if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+	if (!XLogRecGetRecentBuffer(record, block_id, &rnode, &forknum, &blkno,
+								&recent_buffer))
 	{
 		/* Caller specified a bogus block_id */
 		elog(PANIC, "failed to locate backup block with ID %d", block_id);
@@ -361,7 +363,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	{
 		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
+									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
+									  recent_buffer);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			elog(ERROR, "failed to restore block image");
@@ -390,7 +393,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	}
 	else
 	{
-		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode);
+		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode,
+									  recent_buffer);
 		if (BufferIsValid(*buf))
 		{
 			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
@@ -437,7 +441,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  */
 Buffer
 XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-					   BlockNumber blkno, ReadBufferMode mode)
+					   BlockNumber blkno, ReadBufferMode mode,
+					   Buffer recent_buffer)
 {
 	BlockNumber lastblock;
 	Buffer		buffer;
@@ -445,6 +450,15 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 
 	Assert(blkno != P_NEW);
 
+	/* Do we have a clue where the buffer might be already? */
+	if (BufferIsValid(recent_buffer) &&
+		mode == RBM_NORMAL &&
+		ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+	{
+		buffer = recent_buffer;
+		goto recent_buffer_fast_path;
+	}
+
 	/* Open the relation at smgr level */
 	smgr = smgropen(rnode, InvalidBackendId);
 
@@ -503,6 +517,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 		}
 	}
 
+recent_buffer_fast_path:
 	if (mode == RBM_NORMAL)
 	{
 		/* check that page has been initialized */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0e5f92d92b..0cc1eb7971 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -651,6 +651,7 @@ ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum,
 	if (!BufferIsLocal(recent_buffer))
 		UnpinBuffer(bufHdr, true);
 
+
 	return false;
 }
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda238..cfa0414e5a 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -210,7 +210,8 @@ XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 	blkno = fsm_logical_to_physical(addr);
 
 	/* If the page doesn't exist already, extend */
-	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR);
+	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR,
+								 InvalidBuffer);
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(buf);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index e213c68256..be370e9015 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -39,6 +39,7 @@
 #endif
 
 #include "access/xlogrecord.h"
+#include "storage/buf.h"
 
 /* WALOpenSegment represents a WAL segment being read. */
 typedef struct WALOpenSegment
@@ -126,6 +127,9 @@ typedef struct
 	ForkNumber	forknum;
 	BlockNumber blkno;
 
+	/* Workspace for remembering last known buffer holding this block. */
+	Buffer		recent_buffer;
+
 	/* copy of the fork_flags field from the XLogRecordBlockHeader */
 	uint8		flags;
 
@@ -377,5 +381,8 @@ extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *
 extern bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 							   RelFileNode *rnode, ForkNumber *forknum,
 							   BlockNumber *blknum);
+extern bool XLogRecGetRecentBuffer(XLogReaderState *record, uint8 block_id,
+								   RelFileNode *rnode, ForkNumber *forknum,
+								   BlockNumber *blknum, Buffer *recent_buffer);
 
 #endif							/* XLOGREADER_H */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 73d3f4a129..0ee4a7c52c 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -42,7 +42,8 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 													Buffer *buf);
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-									 BlockNumber blkno, ReadBufferMode mode);
+									 BlockNumber blkno, ReadBufferMode mode,
+									 Buffer recent_buffer);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
-- 
2.30.1

#81Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Thomas Munro (#80)
Re: WIP: WAL prefetch (another approach)

On 3/18/21 1:54 AM, Thomas Munro wrote:

On Thu, Mar 18, 2021 at 12:00 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 3/17/21 10:43 PM, Stephen Frost wrote:

Guess I'm just not a fan of pushing out a change that will impact
everyone by default, in a possibly negative way (or positive, though
that doesn't seem terribly likely, but who knows), without actually
measuring what that impact will look like in those more common cases.
Showing that it's a great win when you're on ZFS or running with FPWs
disabled is good and the expected best case, but we should be
considering the worst case too when it comes to performance
improvements.

Well, maybe it'll behave differently on systems with ZFS. I don't know,
and I have no such machine to test that at the moment. My argument
however remains the same - if it happens to be a problem, just don't
enable (or disable) the prefetching, and you get the current behavior.

I see the road map for this feature being to get it working on every
OS via the AIO patchset, in later work, hopefully not very far in the
future (in the most portable mode, you get I/O worker processes doing
pread() or preadv() calls on behalf of recovery). So I'll be glad to
get this infrastructure in, even though it's maybe only useful for
some people in the first release.

+1 to that

FWIW I'm not sure there was a discussion or argument about what should
be the default setting (enabled or disabled). I'm fine with not enabling
this by default, so that people have to enable it explicitly.

In a way that'd be consistent with effective_io_concurrency being 1 by
default, which almost disables regular prefetching.

Yeah, I'm not sure but I'd be fine with disabling it by default in the
initial release. The current patch set has it enabled, but that's
mostly for testing, it's not an opinion on how it should ship.

+1 to that too. Better to have it disabled by default than not at all.

I've attached a rebased patch set with a couple of small changes:

1. I abandoned the patch that proposed
pg_atomic_unlocked_add_fetch_u{32,64}() and went for a simple function
local to xlogprefetch.c that just does pg_atomic_write_u64(counter,
pg_atomic_read_u64(counter) + 1), in response to complaints from
Andres[1] (see the sketch below).

2. I fixed a bug in ReadRecentBuffer(), and moved it into its own
patch for separate review.
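
To illustrate point 1, here is a minimal sketch (not the patch text itself) of
what such a local helper could look like, assuming the statistics counters are
pg_atomic_uint64 fields in shared memory and using only the
pg_atomic_read_u64()/pg_atomic_write_u64() primitives from port/atomics.h:

#include "postgres.h"
#include "port/atomics.h"

/*
 * Bump a recovery-prefetch statistics counter without a locked
 * read-modify-write cycle.  The counters are only updated by the recovery
 * process, so a plain read-then-write is enough; concurrent readers may
 * simply see a slightly stale value.
 */
static inline void
XLogPrefetchIncrement(pg_atomic_uint64 *counter)
{
	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
}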

I'm now looking at Horiguchi-san and Heikki's patch[2] to remove
XLogReader's callbacks, to try to understand how these two patch sets
are related. I don't really like the way those callbacks work, and
I'm afraid I had to make them more complicated. But I don't yet know
very much about that other patch set. More soon.

OK. Do you think we should get both of those patches in, or do we need
to commit them in a particular order? Or what is your concern?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#82Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#81)
Re: WIP: WAL prefetch (another approach)

On Fri, Mar 19, 2021 at 2:29 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 3/18/21 1:54 AM, Thomas Munro wrote:

I'm now looking at Horiguchi-san and Heikki's patch[2] to remove
XLogReader's callbacks, to try to understand how these two patch sets
are related. I don't really like the way those callbacks work, and
I'm afraid I had to make them more complicated. But I don't yet know
very much about that other patch set. More soon.

OK. Do you think we should get both of those patches in, or do we need
to commit them in a particular order? Or what is your concern?

I would like to commit the callback-removal patch first, and then the
WAL decoder and prefetcher patches become simpler and cleaner on top
of that. I will post the rebase and explanation shortly.

#83Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#82)
10 attachment(s)
Re: WIP: WAL prefetch (another approach)

Here's a rebase, on top of Horiguchi-san's v19 patch set. My patches
start at 0007. Previously, there was a "nowait" flag that was passed
into all the callbacks so that XLogReader could wait for new WAL in
some cases but not others. This new version uses the proposed
XLREAD_NEED_DATA protocol, and the caller deals with waiting for data
to arrive when appropriate. This seems tidier to me.
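
To make the shape of that protocol concrete, here is a minimal sketch of how
a caller now drives XLogReadRecord(), based on the twophase.c hunk in the
0002 patch below (the wrapper name read_one_record is invented for
illustration; any page source can stand in for read_local_xlog_page()):

#include "postgres.h"
#include "access/xlogreader.h"
#include "access/xlogutils.h"

/*
 * Read one record starting at lsn.  XLogReadRecord() no longer calls a
 * page_read callback itself; it returns XLREAD_NEED_DATA when it wants more
 * WAL, describing the request in xlogreader->readPagePtr and readLen, and
 * the caller supplies those bytes before retrying.
 */
static XLogRecord *
read_one_record(XLogReaderState *xlogreader, XLogRecPtr lsn, char **errormsg)
{
	XLogRecord *record;

	XLogBeginRead(xlogreader, lsn);
	while (XLogReadRecord(xlogreader, &record, errormsg) == XLREAD_NEED_DATA)
	{
		if (!read_local_xlog_page(xlogreader))
			break;				/* no more data available */
	}

	return record;				/* NULL on failure; *errormsg may be set */
}

Recovery itself takes the same shape, except that instead of
read_local_xlog_page() it calls XLogPageRead(), which may wait for new WAL to
become available.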

I made one other simplifying change: previously, the prefetch module
would read the WAL up to the "written" LSN (so, allowing itself to
read data that had been written but not yet flushed to disk by the
walreceiver), though it still waited until a record's LSN was
"flushed" before replaying. That allowed prefetching to happen
concurrently with the WAL flush, which was nice, but it felt a little
too "special". I decided to remove that part for now, and I plan to
look into making standbys work more like primary servers, using WAL
buffers, the WAL writer and optionally the standard log-before-data
rule.

Attachments:

v17-0001-Move-callback-call-from-ReadPageInternal-to-XLog.patch
From f2df4aaf61c1700bddb6673cb42a0d8ae5037557 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 5 Sep 2019 20:21:55 +0900
Subject: [PATCH v17 01/10] Move callback-call from ReadPageInternal to
 XLogReadRecord.

The current WAL record reader reads page data using a callback
function.  Redesign the interface so that it asks the caller for more
data when required.  This model works better for proposed projects such
as encryption, prefetching and other new features that would require
extending the callback interface for each case.

As the first step of that change, this patch moves the page-reader call
out of ReadPageInternal() and into XLogReadRecord(); the remaining tasks
of that function are taken over by the new function XLogNeedData().

Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Takashi Menjo <takashi.menjo@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/20190418.210257.43726183.horiguchi.kyotaro%40lab.ntt.co.jp
---
 src/backend/access/transam/xlog.c       |  16 +-
 src/backend/access/transam/xlogreader.c | 317 +++++++++++++++---------
 src/backend/access/transam/xlogutils.c  |  12 +-
 src/backend/replication/walsender.c     |  10 +-
 src/bin/pg_rewind/parsexlog.c           |  21 +-
 src/bin/pg_waldump/pg_waldump.c         |   8 +-
 src/include/access/xlogreader.h         |  32 ++-
 src/include/access/xlogutils.h          |   2 +-
 8 files changed, 263 insertions(+), 155 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c1d4415a43..8085ca1117 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -920,7 +920,7 @@ static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
-static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
+static bool	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
@@ -4375,7 +4375,6 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	XLogRecord *record;
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 
-	/* Pass through parameters to XLogPageRead */
 	private->fetching_ckpt = fetching_ckpt;
 	private->emode = emode;
 	private->randAccess = (xlogreader->ReadRecPtr == InvalidXLogRecPtr);
@@ -12107,7 +12106,7 @@ CancelBackup(void)
  * XLogPageRead() to try fetching the record from another source, or to
  * sleep and retry.
  */
-static int
+static bool
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 			 XLogRecPtr targetRecPtr, char *readBuf)
 {
@@ -12166,7 +12165,8 @@ retry:
 			readLen = 0;
 			readSource = XLOG_FROM_ANY;
 
-			return -1;
+			xlogreader->readLen = -1;
+			return false;
 		}
 	}
 
@@ -12261,7 +12261,8 @@ retry:
 		goto next_record_is_invalid;
 	}
 
-	return readLen;
+	xlogreader->readLen = readLen;
+	return true;
 
 next_record_is_invalid:
 	lastSourceFailed = true;
@@ -12275,8 +12276,9 @@ next_record_is_invalid:
 	/* In standby-mode, keep trying */
 	if (StandbyMode)
 		goto retry;
-	else
-		return -1;
+
+	xlogreader->readLen = -1;
+	return false;
 }
 
 /*
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 42738eb940..f2345ab09e 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -36,8 +36,8 @@
 static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
 			pg_attribute_printf(2, 3);
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
-static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
-							 int reqLen);
+static bool XLogNeedData(XLogReaderState *state, XLogRecPtr pageptr,
+						 int reqLen, bool header_inclusive);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
@@ -261,8 +261,48 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  * If the reading fails for some other reason, NULL is also returned, and
  * *errormsg is set to a string with details of the failure.
  *
+ * Returns XLREAD_NEED_DATA if more data is needed to finish reading the
+ * current record.  In that case, state->readPagePtr and state->readLen inform
+ * the desired position and minimum length of data needed. The caller shall
+ * read in the requested data and set state->readBuf to point to a buffer
+ * containing it. The caller must also set state->seg->ws_tli and
+ * state->readLen to indicate the timeline that it was read from, and the
+ * length of data that is now available (which must be >= given readLen),
+ * respectively.
+ *
+ * If invalid data is encountered, returns XLREAD_FAIL with *record being set to
+ * NULL. *errormsg is set to a string with details of the failure.
  * The returned pointer (or *errormsg) points to an internal buffer that's
  * valid until the next call to XLogReadRecord.
+ *
+ *
+ * This function runs a state machine consisting of the following states.
+ *
+ * XLREAD_NEXT_RECORD :
+ *    The initial state, if called with valid RecPtr, try to read a record at
+ *    that position.  If invalid RecPtr is given try to read a record just after
+ *    the last one previously read.
+ *    This state ends after setting ReadRecPtr. Then goes to XLREAD_TOT_LEN.
+ *
+ * XLREAD_TOT_LEN:
+ *    Examining record header. Ends after reading record total
+ *    length. recordRemainLen and recordGotLen are initialized.
+ *
+ * XLREAD_FIRST_FRAGMENT:
+ *    Reading the first fragment. Ends with finishing reading a single
+ *    record. Goes to XLREAD_NEXT_RECORD if that's all or
+ *    XLREAD_CONTINUATION if we have continuation.
+
+ * XLREAD_CONTINUATION:
+ *    Reading continuation of record. Ends with finishing the whole record then
+ *    goes to XLREAD_NEXT_RECORD. During this state, recordRemainLen indicates
+ *    how much is left and readRecordBuf holds the partially read
+ *    record.  recordContRecPtr points to the beginning of the next page where to
+ *    continue.
+ *
+ * If wrong data is found in any state, the state machine stays at the current
+ * state. This behavior allows us to continue reading a record while switching
+ * among different sources during streaming replication.
  */
 XLogRecord *
 XLogReadRecord(XLogReaderState *state, char **errormsg)
@@ -276,7 +316,6 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	uint32		targetRecOff;
 	uint32		pageHeaderSize;
 	bool		gotheader;
-	int			readOff;
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -326,14 +365,20 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	 * byte to cover the whole record header, or at least the part of it that
 	 * fits on the same page.
 	 */
-	readOff = ReadPageInternal(state, targetPagePtr,
-							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
-	if (readOff < 0)
+	while (XLogNeedData(state, targetPagePtr,
+						Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ),
+						targetRecOff != 0))
+	{
+		if (!state->routine.page_read(state, state->readPagePtr, state->readLen,
+									  RecPtr, state->readBuf))
+			break;
+	}
+
+	if (!state->page_verified)
 		goto err;
 
 	/*
-	 * ReadPageInternal always returns at least the page header, so we can
-	 * examine it now.
+	 * We have at least the page header, so we can examine it now.
 	 */
 	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
 	if (targetRecOff == 0)
@@ -359,8 +404,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		goto err;
 	}
 
-	/* ReadPageInternal has verified the page header */
-	Assert(pageHeaderSize <= readOff);
+	/* XLogNeedData has verified the page header */
+	Assert(pageHeaderSize <= state->readLen);
 
 	/*
 	 * Read the record length.
@@ -432,18 +477,27 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
 		do
 		{
+			int			rest_len = total_len - gotlen;
+
 			/* Calculate pointer to beginning of next page */
 			targetPagePtr += XLOG_BLCKSZ;
 
 			/* Wait for the next page to become available */
-			readOff = ReadPageInternal(state, targetPagePtr,
-									   Min(total_len - gotlen + SizeOfXLogShortPHD,
-										   XLOG_BLCKSZ));
+			while (XLogNeedData(state, targetPagePtr,
+								Min(rest_len, XLOG_BLCKSZ),
+								false))
+			{
+				if (!state->routine.page_read(state, state->readPagePtr,
+											  state->readLen,
+											  state->ReadRecPtr,
+											  state->readBuf))
+					break;
+			}
 
-			if (readOff < 0)
+			if (!state->page_verified)
 				goto err;
 
-			Assert(SizeOfXLogShortPHD <= readOff);
+			Assert(SizeOfXLogShortPHD <= state->readLen);
 
 			/* Check that the continuation on next page looks valid */
 			pageHeader = (XLogPageHeader) state->readBuf;
@@ -473,21 +527,14 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 			/* Append the continuation from this page to the buffer */
 			pageHeaderSize = XLogPageHeaderSize(pageHeader);
 
-			if (readOff < pageHeaderSize)
-				readOff = ReadPageInternal(state, targetPagePtr,
-										   pageHeaderSize);
-
-			Assert(pageHeaderSize <= readOff);
+			Assert(pageHeaderSize <= state->readLen);
 
 			contdata = (char *) state->readBuf + pageHeaderSize;
 			len = XLOG_BLCKSZ - pageHeaderSize;
 			if (pageHeader->xlp_rem_len < len)
 				len = pageHeader->xlp_rem_len;
 
-			if (readOff < pageHeaderSize + len)
-				readOff = ReadPageInternal(state, targetPagePtr,
-										   pageHeaderSize + len);
-
+			Assert(pageHeaderSize + len <= state->readLen);
 			memcpy(buffer, (char *) contdata, len);
 			buffer += len;
 			gotlen += len;
@@ -517,9 +564,16 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	else
 	{
 		/* Wait for the record data to become available */
-		readOff = ReadPageInternal(state, targetPagePtr,
-								   Min(targetRecOff + total_len, XLOG_BLCKSZ));
-		if (readOff < 0)
+		while (XLogNeedData(state, targetPagePtr,
+							Min(targetRecOff + total_len, XLOG_BLCKSZ), true))
+		{
+			if (!state->routine.page_read(state, state->readPagePtr,
+										  state->readLen,
+										  state->ReadRecPtr, state->readBuf))
+				break;
+		}
+
+		if (!state->page_verified)
 			goto err;
 
 		/* Record does not cross a page boundary */
@@ -562,109 +616,138 @@ err:
 }
 
 /*
- * Read a single xlog page including at least [pageptr, reqLen] of valid data
- * via the page_read() callback.
+ * Checks that the xlog page loaded in state->readBuf includes at least
+ * [pageptr, reqLen] and that the page is valid. header_inclusive indicates
+ * that reqLen is calculated including the page header length.
  *
- * Returns -1 if the required page cannot be read for some reason; errormsg_buf
- * is set in that case (unless the error occurs in the page_read callback).
+ * Returns false if the buffer already contains the requested data, or if an
+ * error was found. state->page_verified is set to true for the former and
+ * false for the latter.
  *
- * We fetch the page from a reader-local cache if we know we have the required
- * data and if there hasn't been any error since caching the data.
+ * Otherwise returns true and requests data loaded onto state->readBuf by
+ * state->readPagePtr and state->readLen. The caller shall call this function
+ * again after filling the buffer at least with that portion of data and set
+ * state->readLen to the length of actually loaded data.
+ *
+ * If header_inclusive is false, corrects reqLen internally by adding the
+ * actual page header length and may request caller for new data.
  */
-static int
-ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
+static bool
+XLogNeedData(XLogReaderState *state, XLogRecPtr pageptr, int reqLen,
+			 bool header_inclusive)
 {
-	int			readLen;
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo;
-	XLogPageHeader hdr;
+	uint32		addLen = 0;
 
-	Assert((pageptr % XLOG_BLCKSZ) == 0);
+	/* Some data is loaded, but page header is not verified yet. */
+	if (!state->page_verified &&
+		!XLogRecPtrIsInvalid(state->readPagePtr) && state->readLen >= 0)
+	{
+		uint32		pageHeaderSize;
 
-	XLByteToSeg(pageptr, targetSegNo, state->segcxt.ws_segsize);
-	targetPageOff = XLogSegmentOffset(pageptr, state->segcxt.ws_segsize);
+		/* just loaded new data so needs to verify page header */
 
-	/* check whether we have all the requested data already */
-	if (targetSegNo == state->seg.ws_segno &&
-		targetPageOff == state->segoff && reqLen <= state->readLen)
-		return state->readLen;
+		/* The caller must have loaded at least page header */
+		Assert(state->readLen >= SizeOfXLogShortPHD);
 
-	/*
-	 * Data is not in our buffer.
-	 *
-	 * Every time we actually read the segment, even if we looked at parts of
-	 * it before, we need to do verification as the page_read callback might
-	 * now be rereading data from a different source.
-	 *
-	 * Whenever switching to a new WAL segment, we read the first page of the
-	 * file and validate its header, even if that's not where the target
-	 * record is.  This is so that we can check the additional identification
-	 * info that is present in the first page's "long" header.
-	 */
-	if (targetSegNo != state->seg.ws_segno && targetPageOff != 0)
-	{
-		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
+		/*
+		 * We have enough data to check the header length. Recheck the loaded
+		 * length against the actual header length.
+		 */
+		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
 
-		readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
-										   state->currRecPtr,
-										   state->readBuf);
-		if (readLen < 0)
-			goto err;
+		/* Request more data if we don't have the full header. */
+		if (state->readLen < pageHeaderSize)
+		{
+			state->readLen = pageHeaderSize;
+			return true;
+		}
 
-		/* we can be sure to have enough WAL available, we scrolled back */
-		Assert(readLen == XLOG_BLCKSZ);
+		/* Now that we know we have the full header, validate it. */
+		if (!XLogReaderValidatePageHeader(state, state->readPagePtr,
+										  (char *) state->readBuf))
+		{
+			/* That's bad. Force reading the page again. */
+			XLogReaderInvalReadState(state);
 
-		if (!XLogReaderValidatePageHeader(state, targetSegmentPtr,
-										  state->readBuf))
-			goto err;
+			return false;
+		}
+
+		state->page_verified = true;
+
+		XLByteToSeg(state->readPagePtr, state->seg.ws_segno,
+					state->segcxt.ws_segsize);
 	}
 
 	/*
-	 * First, read the requested data length, but at least a short page header
-	 * so that we can validate it.
+	 * The loaded page may not be the one the caller wants to read when we
+	 * are verifying the first page of a new segment. In that case, skip further
+	 * verification and immediately load the target page.
 	 */
-	readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
-									   state->currRecPtr,
-									   state->readBuf);
-	if (readLen < 0)
-		goto err;
+	if (state->page_verified && pageptr == state->readPagePtr)
+	{
+		/*
+		 * calculate additional length for page header keeping the total
+		 * length within the block size.
+		 */
+		if (!header_inclusive)
+		{
+			uint32		pageHeaderSize =
+			XLogPageHeaderSize((XLogPageHeader) state->readBuf);
 
-	Assert(readLen <= XLOG_BLCKSZ);
+			addLen = pageHeaderSize;
+			if (reqLen + pageHeaderSize <= XLOG_BLCKSZ)
+				addLen = pageHeaderSize;
+			else
+				addLen = XLOG_BLCKSZ - reqLen;
 
-	/* Do we have enough data to check the header length? */
-	if (readLen <= SizeOfXLogShortPHD)
-		goto err;
+			Assert(addLen >= 0);
+		}
+
+		/* Return if we already have it. */
+		if (reqLen + addLen <= state->readLen)
+			return false;
+	}
 
-	Assert(readLen >= reqLen);
+	/* Data is not in our buffer, request the caller for it. */
+	XLByteToSeg(pageptr, targetSegNo, state->segcxt.ws_segsize);
+	targetPageOff = XLogSegmentOffset(pageptr, state->segcxt.ws_segsize);
+	Assert((pageptr % XLOG_BLCKSZ) == 0);
 
-	hdr = (XLogPageHeader) state->readBuf;
+	/*
+	 * Every time we request to load new data of a page to the caller, even if
+	 * we looked at a part of it before, we need to do verification on the
+	 * next invocation as the caller might now be rereading data from a
+	 * different source.
+	 */
+	state->page_verified = false;
 
-	/* still not enough */
-	if (readLen < XLogPageHeaderSize(hdr))
+	/*
+	 * Whenever switching to a new WAL segment, we read the first page of the
+	 * file and validate its header, even if that's not where the target
+	 * record is.  This is so that we can check the additional identification
+	 * info that is present in the first page's "long" header. Don't do this
+	 * if the caller requested the first page in the segment.
+	 */
+	if (targetSegNo != state->seg.ws_segno && targetPageOff != 0)
 	{
-		readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
-										   state->currRecPtr,
-										   state->readBuf);
-		if (readLen < 0)
-			goto err;
+		/*
+		 * Then we'll see that the targetSegNo now matches the ws_segno, and
+		 * will not come back here, but will request the actual target page.
+		 */
+		state->readPagePtr = pageptr - targetPageOff;
+		state->readLen = XLOG_BLCKSZ;
+		return true;
 	}
 
 	/*
-	 * Now that we know we have the full header, validate it.
+	 * Request the caller to load the page. We need at least a short page
+	 * header so that we can validate it.
 	 */
-	if (!XLogReaderValidatePageHeader(state, pageptr, (char *) hdr))
-		goto err;
-
-	/* update read state information */
-	state->seg.ws_segno = targetSegNo;
-	state->segoff = targetPageOff;
-	state->readLen = readLen;
-
-	return readLen;
-
-err:
-	XLogReaderInvalReadState(state);
-	return -1;
+	state->readPagePtr = pageptr;
+	state->readLen = Max(reqLen + addLen, SizeOfXLogShortPHD);
+	return true;
 }
 
 /*
@@ -673,9 +756,7 @@ err:
 static void
 XLogReaderInvalReadState(XLogReaderState *state)
 {
-	state->seg.ws_segno = 0;
-	state->segoff = 0;
-	state->readLen = 0;
+	state->readPagePtr = InvalidXLogRecPtr;
 }
 
 /*
@@ -953,7 +1034,6 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		XLogRecPtr	targetPagePtr;
 		int			targetRecOff;
 		uint32		pageHeaderSize;
-		int			readLen;
 
 		/*
 		 * Compute targetRecOff. It should typically be equal or greater than
@@ -961,27 +1041,32 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		 * that, except when caller has explicitly specified the offset that
 		 * falls somewhere there or when we are skipping multi-page
 		 * continuation record. It doesn't matter though because
-		 * ReadPageInternal() is prepared to handle that and will read at
-		 * least short page-header worth of data
+		 * XLogNeedData() is prepared to handle that and will read at least
+		 * short page-header worth of data
 		 */
 		targetRecOff = tmpRecPtr % XLOG_BLCKSZ;
 
 		/* scroll back to page boundary */
 		targetPagePtr = tmpRecPtr - targetRecOff;
 
-		/* Read the page containing the record */
-		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff);
-		if (readLen < 0)
+		while (XLogNeedData(state, targetPagePtr, targetRecOff,
+							targetRecOff != 0))
+		{
+			if (!state->routine.page_read(state, state->readPagePtr,
+										  state->readLen,
+										  state->ReadRecPtr, state->readBuf))
+				break;
+		}
+
+		if (!state->page_verified)
 			goto err;
 
 		header = (XLogPageHeader) state->readBuf;
 
 		pageHeaderSize = XLogPageHeaderSize(header);
 
-		/* make sure we have enough data for the page header */
-		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize);
-		if (readLen < 0)
-			goto err;
+		/* we should have read the page header */
+		Assert(state->readLen >= pageHeaderSize);
 
 		/* skip over potential continuation data */
 		if (header->xlp_info & XLP_FIRST_IS_CONTRECORD)
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index d17d660f46..46eda33f25 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -686,8 +686,8 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 void
 XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
 {
-	const XLogRecPtr lastReadPage = (state->seg.ws_segno *
-									 state->segcxt.ws_segsize + state->segoff);
+	const XLogRecPtr lastReadPage = state->seg.ws_segno *
+	state->segcxt.ws_segsize + state->readLen;
 
 	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
 	Assert(wantLength <= XLOG_BLCKSZ);
@@ -824,7 +824,7 @@ wal_segment_close(XLogReaderState *state)
  * exists for normal backends, so we have to do a check/sleep/repeat style of
  * loop for now.
  */
-int
+bool
 read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
 {
@@ -926,7 +926,8 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	else if (targetPagePtr + reqLen > read_upto)
 	{
 		/* not enough data there */
-		return -1;
+		state->readLen = -1;
+		return false;
 	}
 	else
 	{
@@ -944,7 +945,8 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		WALReadRaiseError(&errinfo);
 
 	/* number of valid bytes in the buffer */
-	return count;
+	state->readLen = count;
+	return true;
 }
 
 /*
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 4bf8a18e01..a4d6f30957 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -806,7 +806,7 @@ StartReplication(StartReplicationCmd *cmd)
  * which has to do a plain sleep/busy loop, because the walsender's latch gets
  * set every time WAL is flushed.
  */
-static int
+static bool
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 					   XLogRecPtr targetRecPtr, char *cur_page)
 {
@@ -826,7 +826,10 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 
 	/* fail if not (implies we are going to shut down) */
 	if (flushptr < targetPagePtr + reqLen)
-		return -1;
+	{
+		state->readLen = -1;
+		return false;
+	}
 
 	if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
 		count = XLOG_BLCKSZ;	/* more than one block available */
@@ -854,7 +857,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	XLByteToSeg(targetPagePtr, segno, state->segcxt.ws_segsize);
 	CheckXLogRemoved(segno, state->seg.ws_tli);
 
-	return count;
+	state->readLen = count;
+	return true;
 }
 
 /*
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 59ebac7d6a..cf119848b0 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -47,7 +47,7 @@ typedef struct XLogPageReadPrivate
 	int			tliIndex;
 } XLogPageReadPrivate;
 
-static int	SimpleXLogPageRead(XLogReaderState *xlogreader,
+static bool	SimpleXLogPageRead(XLogReaderState *xlogreader,
 							   XLogRecPtr targetPagePtr,
 							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 
@@ -246,7 +246,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 }
 
 /* XLogReader callback function, to read a WAL page */
-static int
+static bool
 SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
 {
@@ -306,7 +306,8 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 			if (private->restoreCommand == NULL)
 			{
 				pg_log_error("could not open file \"%s\": %m", xlogfpath);
-				return -1;
+				xlogreader->readLen = -1;
+				return false;
 			}
 
 			/*
@@ -319,7 +320,10 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 											 private->restoreCommand);
 
 			if (xlogreadfd < 0)
-				return -1;
+			{
+				xlogreader->readLen = -1;
+				return false;
+			}
 			else
 				pg_log_debug("using file \"%s\" restored from archive",
 							 xlogfpath);
@@ -335,7 +339,8 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 	if (lseek(xlogreadfd, (off_t) targetPageOff, SEEK_SET) < 0)
 	{
 		pg_log_error("could not seek in file \"%s\": %m", xlogfpath);
-		return -1;
+		xlogreader->readLen = -1;
+		return false;
 	}
 
 
@@ -348,13 +353,15 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 			pg_log_error("could not read file \"%s\": read %d of %zu",
 						 xlogfpath, r, (Size) XLOG_BLCKSZ);
 
-		return -1;
+		xlogreader->readLen = -1;
+		return false;
 	}
 
 	Assert(targetSegNo == xlogreadsegno);
 
 	xlogreader->seg.ws_tli = targetHistory[private->tliIndex].tli;
-	return XLOG_BLCKSZ;
+	xlogreader->readLen = XLOG_BLCKSZ;
+	return true;
 }
 
 /*
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index f8b8afe4a7..75ece5c658 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -331,7 +331,7 @@ WALDumpCloseSegment(XLogReaderState *state)
 }
 
 /* pg_waldump's XLogReaderRoutine->page_read callback */
-static int
+static bool
 WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 				XLogRecPtr targetPtr, char *readBuff)
 {
@@ -348,7 +348,8 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 		else
 		{
 			private->endptr_reached = true;
-			return -1;
+			state->readLen = -1;
+			return false;
 		}
 	}
 
@@ -373,7 +374,8 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 						(Size) errinfo.wre_req);
 	}
 
-	return count;
+	state->readLen = count;
+	return true;
 }
 
 /*
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 21d200d3df..5d9e0d3292 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -57,12 +57,12 @@ typedef struct WALSegmentContext
 
 typedef struct XLogReaderState XLogReaderState;
 
-/* Function type definitions for various xlogreader interactions */
-typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
-							   XLogRecPtr targetPagePtr,
-							   int reqLen,
-							   XLogRecPtr targetRecPtr,
-							   char *readBuf);
+/* Function type definition for the read_page callback */
+typedef bool (*XLogPageReadCB) (XLogReaderState *xlogreader,
+								XLogRecPtr targetPagePtr,
+								int reqLen,
+								XLogRecPtr targetRecPtr,
+								char *readBuf);
 typedef void (*WALSegmentOpenCB) (XLogReaderState *xlogreader,
 								  XLogSegNo nextSegNo,
 								  TimeLineID *tli_p);
@@ -175,6 +175,19 @@ struct XLogReaderState
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
 
+	/* ----------------------------------------
+	 * Communication with page reader
+	 * readBuf is XLOG_BLCKSZ bytes, valid up to at least readLen bytes.
+	 *  ----------------------------------------
+	 */
+	/* variables to communicate with page reader */
+	XLogRecPtr	readPagePtr;	/* page pointer to read */
+	int32		readLen;		/* bytes requested to reader, or actual bytes
+								 * read by reader, which must be larger than
+								 * the request, or -1 on error */
+	char	   *readBuf;		/* buffer to store data */
+	bool		page_verified;	/* is the page on the buffer verified? */
+
 
 	/* ----------------------------------------
 	 * Decoded representation of current record
@@ -203,13 +216,6 @@ struct XLogReaderState
 	 * ----------------------------------------
 	 */
 
-	/*
-	 * Buffer for currently read page (XLOG_BLCKSZ bytes, valid up to at least
-	 * readLen bytes)
-	 */
-	char	   *readBuf;
-	uint32		readLen;
-
 	/* last read XLOG position for data currently in readBuf */
 	WALSegmentContext segcxt;
 	WALOpenSegment seg;
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 9ac602b674..364a21c4ea 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,7 +47,7 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
 
-extern int	read_local_xlog_page(XLogReaderState *state,
+extern bool	read_local_xlog_page(XLogReaderState *state,
 								 XLogRecPtr targetPagePtr, int reqLen,
 								 XLogRecPtr targetRecPtr, char *cur_page);
 extern void wal_segment_open(XLogReaderState *state,
-- 
2.30.1

v17-0002-Move-page-reader-out-of-XLogReadRecord.patch
From f6ddd723fd67db5045c454f70786fcda98f36fab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 10 Sep 2019 12:58:27 +0900
Subject: [PATCH v17 02/10] Move page-reader out of XLogReadRecord().

This is the second step of removing callbacks from the WAL decoder.
XLogReadRecord() returns XLREAD_NEED_DATA to indicate that the caller
should supply new data, and the decoder works as a state machine.

Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Takashi Menjo <takashi.menjo@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/20190418.210257.43726183.horiguchi.kyotaro%40lab.ntt.co.jp
---
 src/backend/access/transam/twophase.c         |  14 +-
 src/backend/access/transam/xlog.c             |  58 +-
 src/backend/access/transam/xlogreader.c       | 654 ++++++++++--------
 src/backend/access/transam/xlogutils.c        |  17 +-
 src/backend/replication/logical/logical.c     |  26 +-
 .../replication/logical/logicalfuncs.c        |  13 +-
 src/backend/replication/slotfuncs.c           |  18 +-
 src/backend/replication/walsender.c           |  32 +-
 src/bin/pg_rewind/parsexlog.c                 |  86 ++-
 src/bin/pg_waldump/pg_waldump.c               |  36 +-
 src/include/access/xlogreader.h               | 122 ++--
 src/include/access/xlogutils.h                |   4 +-
 src/include/pg_config_manual.h                |   2 +-
 src/include/replication/logical.h             |  11 +-
 14 files changed, 611 insertions(+), 482 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 89335b64a2..3137cb3ecc 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1330,11 +1330,8 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	char	   *errormsg;
 	TimeLineID	save_currtli = ThisTimeLineID;
 
-	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
-									XL_ROUTINE(.page_read = &read_local_xlog_page,
-											   .segment_open = &wal_segment_open,
-											   .segment_close = &wal_segment_close),
-									NULL);
+	xlogreader = XLogReaderAllocate(wal_segment_size, NULL, wal_segment_close);
+
 	if (!xlogreader)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -1342,7 +1339,12 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 
 	XLogBeginRead(xlogreader, lsn);
-	record = XLogReadRecord(xlogreader, &errormsg);
+	while (XLogReadRecord(xlogreader, &record, &errormsg) ==
+		   XLREAD_NEED_DATA)
+	{
+		if (!read_local_xlog_page(xlogreader))
+			break;
+	}
 
 	/*
 	 * Restore immediately the timeline where it was previously, as
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8085ca1117..b7d7e6d31b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -838,13 +838,6 @@ static XLogSource currentSource = XLOG_FROM_ANY;
 static bool lastSourceFailed = false;
 static bool pendingWalRcvRestart = false;
 
-typedef struct XLogPageReadPrivate
-{
-	int			emode;
-	bool		fetching_ckpt;	/* are we fetching a checkpoint record? */
-	bool		randAccess;
-} XLogPageReadPrivate;
-
 /*
  * These variables track when we last obtained some WAL data to process,
  * and where we got it from.  (XLogReceiptSource is initially the same as
@@ -920,8 +913,8 @@ static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
-static bool	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
+static bool XLogPageRead(XLogReaderState *xlogreader,
+						 bool fetching_ckpt, int emode, bool randAccess);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
@@ -1234,8 +1227,7 @@ XLogInsertRecord(XLogRecData *rdata,
 			appendBinaryStringInfo(&recordBuf, rdata->data, rdata->len);
 
 		if (!debug_reader)
-			debug_reader = XLogReaderAllocate(wal_segment_size, NULL,
-											  XL_ROUTINE(), NULL);
+			debug_reader = XLogReaderAllocate(wal_segment_size, NULL, NULL);
 
 		if (!debug_reader)
 		{
@@ -4369,15 +4361,10 @@ CleanupBackupHistory(void)
  * record is available.
  */
 static XLogRecord *
-ReadRecord(XLogReaderState *xlogreader, int emode,
-		   bool fetching_ckpt)
+ReadRecord(XLogReaderState *xlogreader, int emode, bool fetching_ckpt)
 {
 	XLogRecord *record;
-	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
-
-	private->fetching_ckpt = fetching_ckpt;
-	private->emode = emode;
-	private->randAccess = (xlogreader->ReadRecPtr == InvalidXLogRecPtr);
+	bool		randAccess = (xlogreader->ReadRecPtr == InvalidXLogRecPtr);
 
 	/* This is the first attempt to read this page. */
 	lastSourceFailed = false;
@@ -4385,8 +4372,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogReadRecordResult result;
+
+		while ((result = XLogReadRecord(xlogreader, &record, &errormsg))
+			   == XLREAD_NEED_DATA)
+		{
+			if (!XLogPageRead(xlogreader, fetching_ckpt, emode, randAccess))
+				break;
+
+		}
 
-		record = XLogReadRecord(xlogreader, &errormsg);
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
@@ -6456,7 +6451,6 @@ StartupXLOG(void)
 	bool		backupFromStandby = false;
 	DBState		dbstate_at_startup;
 	XLogReaderState *xlogreader;
-	XLogPageReadPrivate private;
 	bool		promoted = false;
 	struct stat st;
 
@@ -6615,13 +6609,9 @@ StartupXLOG(void)
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
 	/* Set up XLOG reader facility */
-	MemSet(&private, 0, sizeof(XLogPageReadPrivate));
 	xlogreader =
-		XLogReaderAllocate(wal_segment_size, NULL,
-						   XL_ROUTINE(.page_read = &XLogPageRead,
-									  .segment_open = NULL,
-									  .segment_close = wal_segment_close),
-						   &private);
+		XLogReaderAllocate(wal_segment_size, NULL, wal_segment_close);
+
 	if (!xlogreader)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -12107,12 +12097,13 @@ CancelBackup(void)
  * sleep and retry.
  */
 static bool
-XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
-			 XLogRecPtr targetRecPtr, char *readBuf)
+XLogPageRead(XLogReaderState *xlogreader,
+			 bool fetching_ckpt, int emode, bool randAccess)
 {
-	XLogPageReadPrivate *private =
-	(XLogPageReadPrivate *) xlogreader->private_data;
-	int			emode = private->emode;
+	char *readBuf				= xlogreader->readBuf;
+	XLogRecPtr targetPagePtr	= xlogreader->readPagePtr;
+	int reqLen					= xlogreader->readLen;
+	XLogRecPtr targetRecPtr		= xlogreader->ReadRecPtr;
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
@@ -12155,8 +12146,8 @@ retry:
 		 flushedUpto < targetPagePtr + reqLen))
 	{
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
-										 private->randAccess,
-										 private->fetching_ckpt,
+										 randAccess,
+										 fetching_ckpt,
 										 targetRecPtr))
 		{
 			if (readFile >= 0)
@@ -12261,6 +12252,7 @@ retry:
 		goto next_record_is_invalid;
 	}
 
+	Assert(xlogreader->readPagePtr == targetPagePtr);
 	xlogreader->readLen = readLen;
 	return true;
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f2345ab09e..661863e94b 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -40,7 +40,7 @@ static bool XLogNeedData(XLogReaderState *state, XLogRecPtr pageptr,
 						 int reqLen, bool header_inclusive);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
-								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
+								  XLogRecPtr PrevRecPtr, XLogRecord *record);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
 							XLogRecPtr recptr);
 static void ResetDecoder(XLogReaderState *state);
@@ -73,7 +73,7 @@ report_invalid_record(XLogReaderState *state, const char *fmt,...)
  */
 XLogReaderState *
 XLogReaderAllocate(int wal_segment_size, const char *waldir,
-				   XLogReaderRoutine *routine, void *private_data)
+				   WALSegmentCleanupCB cleanup_cb)
 {
 	XLogReaderState *state;
 
@@ -84,7 +84,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		return NULL;
 
 	/* initialize caller-provided support functions */
-	state->routine = *routine;
+	state->cleanup_cb = cleanup_cb;
 
 	state->max_block_id = -1;
 
@@ -107,8 +107,6 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	WALOpenSegmentInit(&state->seg, &state->segcxt, wal_segment_size,
 					   waldir);
 
-	/* system_identifier initialized to zeroes above */
-	state->private_data = private_data;
 	/* ReadRecPtr, EndRecPtr and readLen initialized to zeroes above */
 	state->errormsg_buf = palloc_extended(MAX_ERRORMSG_LEN + 1,
 										  MCXT_ALLOC_NO_OOM);
@@ -140,8 +138,8 @@ XLogReaderFree(XLogReaderState *state)
 {
 	int			block_id;
 
-	if (state->seg.ws_file != -1)
-		state->routine.segment_close(state);
+	if (state->seg.ws_file >= 0)
+		state->cleanup_cb(state);
 
 	for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
 	{
@@ -246,6 +244,7 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
 	/* Begin at the passed-in record pointer. */
 	state->EndRecPtr = RecPtr;
 	state->ReadRecPtr = InvalidXLogRecPtr;
+	state->readRecordState = XLREAD_NEXT_RECORD;
 }
 
 /*
@@ -254,12 +253,12 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  * XLogBeginRead() or XLogFindNextRecord() must be called before the first call
  * to XLogReadRecord().
  *
- * If the page_read callback fails to read the requested data, NULL is
- * returned.  The callback is expected to have reported the error; errormsg
- * is set to NULL.
+ * This function may return XLREAD_NEED_DATA several times before returning a
+ * result record. The caller shall read in some new data, then call this
+ * function again with the same parameters.
  *
- * If the reading fails for some other reason, NULL is also returned, and
- * *errormsg is set to a string with details of the failure.
+ * When a record is successfully read, returns XLREAD_SUCCESS with the
+ * result record stored in *record. Otherwise *record is NULL.
  *
  * Returns XLREAD_NEED_DATA if more data is needed to finish reading the
  * current record.  In that case, state->readPagePtr and state->readLen inform
@@ -304,307 +303,410 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  * state. This behavior allows continuing to read a record while switching
  * among different sources, as in streaming replication.
  */
-XLogRecord *
-XLogReadRecord(XLogReaderState *state, char **errormsg)
+XLogReadRecordResult
+XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 {
-	XLogRecPtr	RecPtr;
-	XLogRecord *record;
-	XLogRecPtr	targetPagePtr;
-	bool		randAccess;
-	uint32		len,
-				total_len;
-	uint32		targetRecOff;
-	uint32		pageHeaderSize;
-	bool		gotheader;
+	XLogRecord *prec;
 
-	/*
-	 * randAccess indicates whether to verify the previous-record pointer of
-	 * the record we're reading.  We only do this if we're reading
-	 * sequentially, which is what we initially assume.
-	 */
-	randAccess = false;
+	*record = NULL;
 
 	/* reset error state */
 	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
 
-	ResetDecoder(state);
-
-	RecPtr = state->EndRecPtr;
-
-	if (state->ReadRecPtr != InvalidXLogRecPtr)
+	switch (state->readRecordState)
 	{
-		/* read the record after the one we just read */
+		case XLREAD_NEXT_RECORD:
+			ResetDecoder(state);
 
-		/*
-		 * EndRecPtr is pointing to end+1 of the previous WAL record.  If
-		 * we're at a page boundary, no more records can fit on the current
-		 * page. We must skip over the page header, but we can't do that until
-		 * we've read in the page, since the header size is variable.
-		 */
-	}
-	else
-	{
-		/*
-		 * Caller supplied a position to start at.
-		 *
-		 * In this case, EndRecPtr should already be pointing to a valid
-		 * record starting position.
-		 */
-		Assert(XRecOffIsValid(RecPtr));
-		randAccess = true;
-	}
+			if (state->ReadRecPtr != InvalidXLogRecPtr)
+			{
+				/* read the record after the one we just read */
 
-	state->currRecPtr = RecPtr;
+				/*
+				 * EndRecPtr is pointing to end+1 of the previous WAL record.
+				 * If we're at a page boundary, no more records can fit on the
+				 * current page. We must skip over the page header, but we
+				 * can't do that until we've read in the page, since the
+				 * header size is variable.
+				 */
+				state->PrevRecPtr = state->ReadRecPtr;
+				state->ReadRecPtr = state->EndRecPtr;
+			}
+			else
+			{
+				/*
+				 * Caller supplied a position to start at.
+				 *
+				 * In this case, EndRecPtr should already be pointing to a
+				 * valid record starting position.
+				 */
+				Assert(XRecOffIsValid(state->EndRecPtr));
+				state->ReadRecPtr = state->EndRecPtr;
 
-	targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
-	targetRecOff = RecPtr % XLOG_BLCKSZ;
+				/*
+				 * We cannot verify the previous-record pointer when we're
+				 * seeking to a particular record. Reset PrevRecPtr so that we
+				 * won't try doing that.
+				 */
+				state->PrevRecPtr = InvalidXLogRecPtr;
+				state->EndRecPtr = InvalidXLogRecPtr;	/* to be tidy */
+			}
 
-	/*
-	 * Read the page containing the record into state->readBuf. Request enough
-	 * byte to cover the whole record header, or at least the part of it that
-	 * fits on the same page.
-	 */
-	while (XLogNeedData(state, targetPagePtr,
-						Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ),
-						targetRecOff != 0))
-	{
-		if (!state->routine.page_read(state, state->readPagePtr, state->readLen,
-									  RecPtr, state->readBuf))
-			break;
-	}
+			state->record_verified = false;
+			state->readRecordState = XLREAD_TOT_LEN;
+			/* fall through */
 
-	if (!state->page_verified)
-		goto err;
+		case XLREAD_TOT_LEN:
+			{
+				uint32		total_len;
+				uint32		pageHeaderSize;
+				XLogRecPtr	targetPagePtr;
+				uint32		targetRecOff;
+				XLogPageHeader pageHeader;
 
-	/*
-	 * We have at least the page header, so we can examine it now.
-	 */
-	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
-	if (targetRecOff == 0)
-	{
-		/*
-		 * At page start, so skip over page header.
-		 */
-		RecPtr += pageHeaderSize;
-		targetRecOff = pageHeaderSize;
-	}
-	else if (targetRecOff < pageHeaderSize)
-	{
-		report_invalid_record(state, "invalid record offset at %X/%X",
-							  LSN_FORMAT_ARGS(RecPtr));
-		goto err;
-	}
+				targetPagePtr =
+					state->ReadRecPtr - (state->ReadRecPtr % XLOG_BLCKSZ);
+				targetRecOff = state->ReadRecPtr % XLOG_BLCKSZ;
 
-	if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
-		targetRecOff == pageHeaderSize)
-	{
-		report_invalid_record(state, "contrecord is requested by %X/%X",
-							  LSN_FORMAT_ARGS(RecPtr));
-		goto err;
-	}
+				/*
+				 * Check if we have enough data. For the first record in the
+				 * Check if we have enough data. For the first record in the
+				 * page, the requested length doesn't include the page header.
+				if (XLogNeedData(state, targetPagePtr,
+								 Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ),
+								 targetRecOff != 0))
+					return XLREAD_NEED_DATA;
 
-	/* XLogNeedData has verified the page header */
-	Assert(pageHeaderSize <= state->readLen);
+				/* error out if caller supplied bogus page */
+				if (!state->page_verified)
+					goto err;
 
-	/*
-	 * Read the record length.
-	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
-	 */
-	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
+				/* examine page header now. */
+				pageHeaderSize =
+					XLogPageHeaderSize((XLogPageHeader) state->readBuf);
+				if (targetRecOff == 0)
+				{
+					/* At page start, so skip over page header. */
+					state->ReadRecPtr += pageHeaderSize;
+					targetRecOff = pageHeaderSize;
+				}
+				else if (targetRecOff < pageHeaderSize)
+				{
+					report_invalid_record(state, "invalid record offset at %X/%X",
+										  LSN_FORMAT_ARGS(state->ReadRecPtr));
+					goto err;
+				}
 
-	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
-	 * check is necessary here to ensure that we enter the "Need to reassemble
-	 * record" code path below; otherwise we might fail to apply
-	 * ValidXLogRecordHeader at all.
-	 */
-	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
-	{
-		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
-								   randAccess))
-			goto err;
-		gotheader = true;
-	}
-	else
-	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: wanted %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
-			goto err;
-		}
-		gotheader = false;
-	}
+				pageHeader = (XLogPageHeader) state->readBuf;
+				if ((pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
+					targetRecOff == pageHeaderSize)
+				{
+					report_invalid_record(state, "contrecord is requested by %X/%X",
+										  (uint32) (state->ReadRecPtr >> 32),
+										  (uint32) state->ReadRecPtr);
+					goto err;
+				}
 
-	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
-	if (total_len > len)
-	{
-		/* Need to reassemble record */
-		char	   *contdata;
-		XLogPageHeader pageHeader;
-		char	   *buffer;
-		uint32		gotlen;
+				/* XLogNeedData has verified the page header */
+				Assert(pageHeaderSize <= state->readLen);
 
-		/*
-		 * Enlarge readRecordBuf as needed.
-		 */
-		if (total_len > state->readRecordBufSize &&
-			!allocate_recordbuf(state, total_len))
-		{
-			/* We treat this as a "bogus data" condition */
-			report_invalid_record(state, "record length %u at %X/%X too long",
-								  total_len, LSN_FORMAT_ARGS(RecPtr));
-			goto err;
-		}
+				/*
+				 * Read the record length.
+				 *
+				 * NB: Even though we use an XLogRecord pointer here, the
+				 * whole record header might not fit on this page. xl_tot_len
+				 * is the first field of the struct, so it must be on this
+				 * page (the records are MAXALIGNed), but we cannot access any
+				 * other fields until we've verified that we got the whole
+				 * header.
+				 */
+				prec = (XLogRecord *) (state->readBuf +
+									   state->ReadRecPtr % XLOG_BLCKSZ);
+				total_len = prec->xl_tot_len;
 
-		/* Copy the first fragment of the record from the first page. */
-		memcpy(state->readRecordBuf,
-			   state->readBuf + RecPtr % XLOG_BLCKSZ, len);
-		buffer = state->readRecordBuf + len;
-		gotlen = len;
+				/*
+				 * If the whole record header is on this page, validate it
+				 * immediately.  Otherwise do just a basic sanity check on
+				 * xl_tot_len, and validate the rest of the header after
+				 * reading it from the next page.  The xl_tot_len check is
+				 * necessary here to ensure that we enter the
+				 * XLREAD_CONTINUATION state below; otherwise we might fail to
+				 * apply ValidXLogRecordHeader at all.
+				 */
+				if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
+				{
+					if (!ValidXLogRecordHeader(state, state->ReadRecPtr,
+											   state->PrevRecPtr, prec))
+						goto err;
 
-		do
-		{
-			int			rest_len = total_len - gotlen;
+					state->record_verified = true;
+				}
+				else
+				{
+					/* XXX: more validation should be done here */
+					if (total_len < SizeOfXLogRecord)
+					{
+						report_invalid_record(state,
+											  "invalid record length at %X/%X: wanted %u, got %u",
+											  LSN_FORMAT_ARGS(state->ReadRecPtr),
+											  (uint32) SizeOfXLogRecord, total_len);
+						goto err;
+					}
+				}
 
-			/* Calculate pointer to beginning of next page */
-			targetPagePtr += XLOG_BLCKSZ;
+				/*
+				 * Wait for the rest of the record, or the part of it that
+				 * fits on the first page if it crosses a page boundary, to
+				 * become available.
+				 */
+				state->recordGotLen = 0;
+				state->recordRemainLen = total_len;
+				state->readRecordState = XLREAD_FIRST_FRAGMENT;
+			}
+			/* fall through */
 
-			/* Wait for the next page to become available */
-			while (XLogNeedData(state, targetPagePtr,
-								Min(rest_len, XLOG_BLCKSZ),
-								false))
+		case XLREAD_FIRST_FRAGMENT:
 			{
-				if (!state->routine.page_read(state, state->readPagePtr,
-											  state->readLen,
-											  state->ReadRecPtr,
-											  state->readBuf))
-					break;
-			}
+				uint32		total_len = state->recordRemainLen;
+				uint32		request_len;
+				uint32		record_len;
+				XLogRecPtr	targetPagePtr;
+				uint32		targetRecOff;
 
-			if (!state->page_verified)
-				goto err;
+				/*
+				 * Wait for the rest of the record on the first page to become
+				 * available
+				 */
+				targetPagePtr =
+					state->ReadRecPtr - (state->ReadRecPtr % XLOG_BLCKSZ);
+				targetRecOff = state->ReadRecPtr % XLOG_BLCKSZ;
 
-			Assert(SizeOfXLogShortPHD <= state->readLen);
+				request_len = Min(targetRecOff + total_len, XLOG_BLCKSZ);
+				record_len = request_len - targetRecOff;
 
-			/* Check that the continuation on next page looks valid */
-			pageHeader = (XLogPageHeader) state->readBuf;
-			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
-			{
-				report_invalid_record(state,
-									  "there is no contrecord flag at %X/%X",
-									  LSN_FORMAT_ARGS(RecPtr));
-				goto err;
-			}
+				/* ReadRecPtr contains page header */
+				Assert(targetRecOff != 0);
+				if (XLogNeedData(state, targetPagePtr, request_len, true))
+					return XLREAD_NEED_DATA;
 
-			/*
-			 * Cross-check that xlp_rem_len agrees with how much of the record
-			 * we expect there to be left.
-			 */
-			if (pageHeader->xlp_rem_len == 0 ||
-				total_len != (pageHeader->xlp_rem_len + gotlen))
-			{
-				report_invalid_record(state,
-									  "invalid contrecord length %u (expected %lld) at %X/%X",
-									  pageHeader->xlp_rem_len,
-									  ((long long) total_len) - gotlen,
-									  LSN_FORMAT_ARGS(RecPtr));
-				goto err;
-			}
+				/* error out if caller supplied bogus page */
+				if (!state->page_verified)
+					goto err;
 
-			/* Append the continuation from this page to the buffer */
-			pageHeaderSize = XLogPageHeaderSize(pageHeader);
+				prec = (XLogRecord *) (state->readBuf + targetRecOff);
 
-			Assert(pageHeaderSize <= state->readLen);
+				/* validate record header if not yet */
+				if (!state->record_verified && record_len >= SizeOfXLogRecord)
+				{
+					if (!ValidXLogRecordHeader(state, state->ReadRecPtr,
+											   state->PrevRecPtr, prec))
+						goto err;
 
-			contdata = (char *) state->readBuf + pageHeaderSize;
-			len = XLOG_BLCKSZ - pageHeaderSize;
-			if (pageHeader->xlp_rem_len < len)
-				len = pageHeader->xlp_rem_len;
+					state->record_verified = true;
+				}
 
-			Assert(pageHeaderSize + len <= state->readLen);
-			memcpy(buffer, (char *) contdata, len);
-			buffer += len;
-			gotlen += len;
 
-			/* If we just reassembled the record header, validate it. */
-			if (!gotheader)
-			{
-				record = (XLogRecord *) state->readRecordBuf;
-				if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
-										   record, randAccess))
+				if (total_len == record_len)
+				{
+					/* Record does not cross a page boundary */
+					Assert(state->record_verified);
+
+					if (!ValidXLogRecord(state, prec, state->ReadRecPtr))
+						goto err;
+
+					state->record_verified = true;	/* to be tidy */
+
+					/* We already checked the header earlier */
+					state->EndRecPtr = state->ReadRecPtr + MAXALIGN(record_len);
+
+					*record = prec;
+					state->readRecordState = XLREAD_NEXT_RECORD;
+					break;
+				}
+
+				/*
+				 * The record continues on the next page. Need to reassemble
+				 * record
+				 */
+				Assert(total_len > record_len);
+
+				/* Enlarge readRecordBuf as needed. */
+				if (total_len > state->readRecordBufSize &&
+					!allocate_recordbuf(state, total_len))
+				{
+					/* We treat this as a "bogus data" condition */
+					report_invalid_record(state,
+										  "record length %u at %X/%X too long",
+										  total_len,
+										  LSN_FORMAT_ARGS(state->ReadRecPtr));
 					goto err;
-				gotheader = true;
-			}
-		} while (gotlen < total_len);
+				}
 
-		Assert(gotheader);
+				/* Copy the first fragment of the record from the first page. */
+				memcpy(state->readRecordBuf, state->readBuf + targetRecOff,
+					   record_len);
+				state->recordGotLen += record_len;
+				state->recordRemainLen -= record_len;
 
-		record = (XLogRecord *) state->readRecordBuf;
-		if (!ValidXLogRecord(state, record, RecPtr))
-			goto err;
+				/* Calculate pointer to beginning of next page */
+				state->recordContRecPtr = state->ReadRecPtr + record_len;
+				Assert(state->recordContRecPtr % XLOG_BLCKSZ == 0);
 
-		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
-		state->ReadRecPtr = RecPtr;
-		state->EndRecPtr = targetPagePtr + pageHeaderSize
-			+ MAXALIGN(pageHeader->xlp_rem_len);
-	}
-	else
-	{
-		/* Wait for the record data to become available */
-		while (XLogNeedData(state, targetPagePtr,
-							Min(targetRecOff + total_len, XLOG_BLCKSZ), true))
-		{
-			if (!state->routine.page_read(state, state->readPagePtr,
-										  state->readLen,
-										  state->ReadRecPtr, state->readBuf))
-				break;
-		}
+				state->readRecordState = XLREAD_CONTINUATION;
+			}
+			/* fall through */
 
-		if (!state->page_verified)
-			goto err;
+		case XLREAD_CONTINUATION:
+			{
+				XLogPageHeader pageHeader;
+				uint32		pageHeaderSize;
+				XLogRecPtr	targetPagePtr;
 
-		/* Record does not cross a page boundary */
-		if (!ValidXLogRecord(state, record, RecPtr))
-			goto err;
+				/*
+				 * we enter this state only if we haven't read the whole
+				 * record.
+				 */
+				Assert(state->recordRemainLen > 0);
+
+				while (state->recordRemainLen > 0)
+				{
+					char	   *contdata;
+					uint32		request_len;
+					uint32		record_len;
+
+					/* Wait for the next page to become available */
+					targetPagePtr = state->recordContRecPtr;
+
+					/* this request contains page header */
+					Assert(targetPagePtr != 0);
+					if (XLogNeedData(state, targetPagePtr,
+									 Min(state->recordRemainLen, XLOG_BLCKSZ),
+									 false))
+						return XLREAD_NEED_DATA;
+
+					if (!state->page_verified)
+						goto err;
+
+					Assert(SizeOfXLogShortPHD <= state->readLen);
+
+					/* Check that the continuation on next page looks valid */
+					pageHeader = (XLogPageHeader) state->readBuf;
+					if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
+					{
+						report_invalid_record(
+											  state,
+											  "there is no contrecord flag at %X/%X reading %X/%X",
+											  (uint32) (state->recordContRecPtr >> 32),
+											  (uint32) state->recordContRecPtr,
+											  (uint32) (state->ReadRecPtr >> 32),
+											  (uint32) state->ReadRecPtr);
+						goto err;
+					}
+
+					/*
+					 * Cross-check that xlp_rem_len agrees with how much of
+					 * the record we expect there to be left.
+					 */
+					if (pageHeader->xlp_rem_len == 0 ||
+						pageHeader->xlp_rem_len != state->recordRemainLen)
+					{
+						report_invalid_record(
+											  state,
+											  "invalid contrecord length %u at %X/%X reading %X/%X, expected %u",
+											  pageHeader->xlp_rem_len,
+											  (uint32) (state->recordContRecPtr >> 32),
+											  (uint32) state->recordContRecPtr,
+											  (uint32) (state->ReadRecPtr >> 32),
+											  (uint32) state->ReadRecPtr,
+											  state->recordRemainLen);
+						goto err;
+					}
+
+					/* Append the continuation from this page to the buffer */
+					pageHeaderSize = XLogPageHeaderSize(pageHeader);
+
+					/*
+					 * XLogNeedData should have ensured that the whole page
+					 * header was read
+					 */
+					Assert(state->readLen >= pageHeaderSize);
+
+					contdata = (char *) state->readBuf + pageHeaderSize;
+					record_len = XLOG_BLCKSZ - pageHeaderSize;
+					if (pageHeader->xlp_rem_len < record_len)
+						record_len = pageHeader->xlp_rem_len;
+
+					request_len = record_len + pageHeaderSize;
+
+					/*
+					 * XLogNeedData should have ensured all needed data was
+					 * read
+					 */
+					Assert(state->readLen >= request_len);
+
+					memcpy(state->readRecordBuf + state->recordGotLen,
+						   (char *) contdata, record_len);
+					state->recordGotLen += record_len;
+					state->recordRemainLen -= record_len;
+
+					/* If we just reassembled the record header, validate it. */
+					if (!state->record_verified)
+					{
+						Assert(state->recordGotLen >= SizeOfXLogRecord);
+						if (!ValidXLogRecordHeader(state, state->ReadRecPtr,
+												   state->PrevRecPtr,
+												   (XLogRecord *) state->readRecordBuf))
+							goto err;
+
+						state->record_verified = true;
+					}
+
+					/*
+					 * Calculate pointer to beginning of next page, and
+					 * continue
+					 */
+					state->recordContRecPtr += XLOG_BLCKSZ;
+				}
 
-		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+				/* targetPagePtr points to the last-read page here */
+				prec = (XLogRecord *) state->readRecordBuf;
+				if (!ValidXLogRecord(state, prec, state->ReadRecPtr))
+					goto err;
+
+				pageHeaderSize =
+					XLogPageHeaderSize((XLogPageHeader) state->readBuf);
+				state->EndRecPtr = targetPagePtr + pageHeaderSize
+					+ MAXALIGN(pageHeader->xlp_rem_len);
 
-		state->ReadRecPtr = RecPtr;
+				*record = prec;
+				state->readRecordState = XLREAD_NEXT_RECORD;
+				break;
+			}
 	}
 
 	/*
 	 * Special processing if it's an XLOG SWITCH record
 	 */
-	if (record->xl_rmid == RM_XLOG_ID &&
-		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
+	if ((*record)->xl_rmid == RM_XLOG_ID &&
+		((*record)->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
 		state->EndRecPtr += state->segcxt.ws_segsize - 1;
 		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->segcxt.ws_segsize);
 	}
 
-	if (DecodeXLogRecord(state, record, errormsg))
-		return record;
-	else
-		return NULL;
+	Assert(!*record || state->readLen >= 0);
+	if (DecodeXLogRecord(state, *record, errormsg))
+		return XLREAD_SUCCESS;
+
+	*record = NULL;
+	return XLREAD_FAIL;
 
 err:
 
 	/*
-	 * Invalidate the read state. We might read from a different source after
+	 * Invalidate the read page. We might read from a different source after
 	 * failure.
 	 */
 	XLogReaderInvalReadState(state);
@@ -612,7 +714,8 @@ err:
 	if (state->errormsg_buf[0] != '\0')
 		*errormsg = state->errormsg_buf;
 
-	return NULL;
+	*record = NULL;
+	return XLREAD_FAIL;
 }
 
 /*
@@ -764,11 +867,12 @@ XLogReaderInvalReadState(XLogReaderState *state)
  *
  * This is just a convenience subroutine to avoid duplicated code in
  * XLogReadRecord.  It's not intended for use from anywhere else.
+ *
+ * If PrevRecPtr is valid, the xl_prev is cross-checked with it.
  */
 static bool
 ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
-					  XLogRecPtr PrevRecPtr, XLogRecord *record,
-					  bool randAccess)
+					  XLogRecPtr PrevRecPtr, XLogRecord *record)
 {
 	if (record->xl_tot_len < SizeOfXLogRecord)
 	{
@@ -785,7 +889,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 							  record->xl_rmid, LSN_FORMAT_ARGS(RecPtr));
 		return false;
 	}
-	if (randAccess)
+	if (PrevRecPtr == InvalidXLogRecPtr)
 	{
 		/*
 		 * We can't exactly verify the prev-link, but surely it should be less
@@ -1015,11 +1119,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
  * XLogReadRecord() will read the next valid record.
  */
 XLogRecPtr
-XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
+XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr,
+				   XLogFindNextRecordCB read_page, void *private)
 {
 	XLogRecPtr	tmpRecPtr;
 	XLogRecPtr	found = InvalidXLogRecPtr;
 	XLogPageHeader header;
+	XLogRecord *record;
+	XLogReadRecordResult result;
 	char	   *errormsg;
 
 	Assert(!XLogRecPtrIsInvalid(RecPtr));
@@ -1052,9 +1159,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 		while (XLogNeedData(state, targetPagePtr, targetRecOff,
 							targetRecOff != 0))
 		{
-			if (!state->routine.page_read(state, state->readPagePtr,
-										  state->readLen,
-										  state->ReadRecPtr, state->readBuf))
+			if (!read_page(state, private))
 				break;
 		}
 
@@ -1106,8 +1211,16 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	 * or we just jumped over the remaining data of a continuation.
 	 */
 	XLogBeginRead(state, tmpRecPtr);
-	while (XLogReadRecord(state, &errormsg) != NULL)
+	while ((result = XLogReadRecord(state, &record, &errormsg)) !=
+		   XLREAD_FAIL)
 	{
+		if (result == XLREAD_NEED_DATA)
+		{
+			if (!read_page(state, private))
+				goto err;
+			continue;
+		}
+
 		/* past the record we've found, break out */
 		if (RecPtr <= state->ReadRecPtr)
 		{
@@ -1127,9 +1240,9 @@ err:
 #endif							/* FRONTEND */
 
 /*
- * Helper function to ease writing of XLogRoutine->page_read callbacks.
- * If this function is used, caller must supply a segment_open callback in
- * 'state', as that is used here.
+ * Helper function to ease writing of a page_read callback.
+ * If this function is used, the caller must supply segment_open and
+ * segment_close callbacks, as they are used here.
  *
  * Read 'count' bytes into 'buf', starting at location 'startptr', from WAL
  * fetched from timeline 'tli'.
@@ -1142,6 +1255,7 @@ err:
  */
 bool
 WALRead(XLogReaderState *state,
+		WALSegmentOpenCB segopenfn, WALSegmentCloseCB segclosefn,
 		char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
 		WALReadError *errinfo)
 {
@@ -1173,10 +1287,10 @@ WALRead(XLogReaderState *state,
 			XLogSegNo	nextSegNo;
 
 			if (state->seg.ws_file >= 0)
-				state->routine.segment_close(state);
+				segclosefn(state);
 
 			XLByteToSeg(recptr, nextSegNo, state->segcxt.ws_segsize);
-			state->routine.segment_open(state, nextSegNo, &tli);
+			segopenfn(state, nextSegNo, &tli);
 
 			/* This shouldn't happen -- indicates a bug in segment_open */
 			Assert(state->seg.ws_file >= 0);
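
With the reader no longer carrying segment open/close routines, WALRead()
takes them explicitly, and a page-reading function now receives only the
XLogReaderState, picking the request up from it. A sketch of what such a
callback could look like under this convention; my_read_xlog_page is a
hypothetical name, the waiting step is elided, and wal_segment_open,
wal_segment_close and WALReadRaiseError are the backend helpers already
used by the callers changed in this patch:

	static bool
	my_read_xlog_page(XLogReaderState *state)
	{
		XLogRecPtr	targetPagePtr = state->readPagePtr; /* page to read */
		char	   *cur_page	  = state->readBuf;		/* destination */
		WALReadError errinfo;

		/* ... wait until state->readLen bytes past targetPagePtr exist ... */

		if (!WALRead(state, wal_segment_open, wal_segment_close,
					 cur_page, targetPagePtr, XLOG_BLCKSZ,
					 state->seg.ws_tli, &errinfo))
			WALReadRaiseError(&errinfo);

		/* report back how much of the page is valid */
		state->readPagePtr = targetPagePtr;
		state->readLen = XLOG_BLCKSZ;
		return true;
	}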
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 46eda33f25..b003990745 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -686,8 +686,7 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
 void
 XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
 {
-	const XLogRecPtr lastReadPage = state->seg.ws_segno *
-	state->segcxt.ws_segsize + state->readLen;
+	const XLogRecPtr lastReadPage = state->readPagePtr;
 
 	Assert(wantPage != InvalidXLogRecPtr && wantPage % XLOG_BLCKSZ == 0);
 	Assert(wantLength <= XLOG_BLCKSZ);
@@ -702,7 +701,7 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	 * current TLI has since become historical.
 	 */
 	if (lastReadPage == wantPage &&
-		state->readLen != 0 &&
+		state->page_verified &&
 		lastReadPage + state->readLen >= wantPage + Min(wantLength, XLOG_BLCKSZ - 1))
 		return;
 
@@ -788,6 +787,7 @@ wal_segment_open(XLogReaderState *state, XLogSegNo nextSegNo,
 	char		path[MAXPGPATH];
 
 	XLogFilePath(path, tli, nextSegNo, state->segcxt.ws_segsize);
+	elog(LOG, "HOGE: %lu, %d => %s", nextSegNo, tli, path);
 	state->seg.ws_file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
 	if (state->seg.ws_file >= 0)
 		return;
@@ -825,9 +825,11 @@ wal_segment_close(XLogReaderState *state)
  * loop for now.
  */
 bool
-read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
-					 int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
+read_local_xlog_page(XLogReaderState *state)
 {
+	XLogRecPtr	targetPagePtr = state->readPagePtr;
+	int			reqLen		  = state->readLen;
+	char	   *cur_page	  = state->readBuf;
 	XLogRecPtr	read_upto,
 				loc;
 	TimeLineID	tli;
@@ -940,11 +942,12 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
-				 &errinfo))
+	if (!WALRead(state, wal_segment_open, wal_segment_close,
+				 cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &errinfo))
 		WALReadRaiseError(&errinfo);
 
 	/* number of valid bytes in the buffer */
+	state->readPagePtr = targetPagePtr;
 	state->readLen = count;
 	return true;
 }
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2f6803637b..4f6e87f18d 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -148,7 +148,8 @@ StartupDecodingContext(List *output_plugin_options,
 					   TransactionId xmin_horizon,
 					   bool need_full_snapshot,
 					   bool fast_forward,
-					   XLogReaderRoutine *xl_routine,
+					   LogicalDecodingXLogPageReadCB page_read,
+					   WALSegmentCleanupCB cleanup_cb,
 					   LogicalOutputPluginWriterPrepareWrite prepare_write,
 					   LogicalOutputPluginWriterWrite do_write,
 					   LogicalOutputPluginWriterUpdateProgress update_progress)
@@ -198,11 +199,12 @@ StartupDecodingContext(List *output_plugin_options,
 
 	ctx->slot = slot;
 
-	ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, xl_routine, ctx);
+	ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, cleanup_cb);
 	if (!ctx->reader)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
 				 errmsg("out of memory")));
+	ctx->page_read = page_read;
 
 	ctx->reorder = ReorderBufferAllocate();
 	ctx->snapshot_builder =
@@ -319,7 +321,8 @@ CreateInitDecodingContext(const char *plugin,
 						  List *output_plugin_options,
 						  bool need_full_snapshot,
 						  XLogRecPtr restart_lsn,
-						  XLogReaderRoutine *xl_routine,
+						  LogicalDecodingXLogPageReadCB page_read,
+						  WALSegmentCleanupCB cleanup_cb,
 						  LogicalOutputPluginWriterPrepareWrite prepare_write,
 						  LogicalOutputPluginWriterWrite do_write,
 						  LogicalOutputPluginWriterUpdateProgress update_progress)
@@ -422,7 +425,7 @@ CreateInitDecodingContext(const char *plugin,
 
 	ctx = StartupDecodingContext(NIL, restart_lsn, xmin_horizon,
 								 need_full_snapshot, false,
-								 xl_routine, prepare_write, do_write,
+								 page_read, cleanup_cb, prepare_write, do_write,
 								 update_progress);
 
 	/* call output plugin initialization callback */
@@ -476,7 +479,8 @@ LogicalDecodingContext *
 CreateDecodingContext(XLogRecPtr start_lsn,
 					  List *output_plugin_options,
 					  bool fast_forward,
-					  XLogReaderRoutine *xl_routine,
+					  LogicalDecodingXLogPageReadCB page_read,
+					  WALSegmentCleanupCB cleanup_cb,
 					  LogicalOutputPluginWriterPrepareWrite prepare_write,
 					  LogicalOutputPluginWriterWrite do_write,
 					  LogicalOutputPluginWriterUpdateProgress update_progress)
@@ -528,8 +532,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 
 	ctx = StartupDecodingContext(output_plugin_options,
 								 start_lsn, InvalidTransactionId, false,
-								 fast_forward, xl_routine, prepare_write,
-								 do_write, update_progress);
+								 fast_forward, page_read, cleanup_cb,
+								 prepare_write, do_write, update_progress);
 
 	/* call output plugin initialization callback */
 	old_context = MemoryContextSwitchTo(ctx->context);
@@ -585,7 +589,13 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 		char	   *err = NULL;
 
 		/* the read_page callback waits for new WAL */
-		record = XLogReadRecord(ctx->reader, &err);
+		while (XLogReadRecord(ctx->reader, &record, &err) ==
+			   XLREAD_NEED_DATA)
+		{
+			if (!ctx->page_read(ctx->reader))
+				break;
+		}
+
 		if (err)
 			elog(ERROR, "%s", err);
 		if (!record)
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 01d354829b..8f8c129620 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -233,9 +233,8 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		ctx = CreateDecodingContext(InvalidXLogRecPtr,
 									options,
 									false,
-									XL_ROUTINE(.page_read = read_local_xlog_page,
-											   .segment_open = wal_segment_open,
-											   .segment_close = wal_segment_close),
+									read_local_xlog_page,
+									wal_segment_close,
 									LogicalOutputPrepareWrite,
 									LogicalOutputWrite, NULL);
 
@@ -284,7 +283,13 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 			XLogRecord *record;
 			char	   *errm = NULL;
 
-			record = XLogReadRecord(ctx->reader, &errm);
+			while (XLogReadRecord(ctx->reader, &record, &errm) ==
+				   XLREAD_NEED_DATA)
+			{
+				if (!ctx->page_read(ctx->reader))
+					break;
+			}
+
 			if (errm)
 				elog(ERROR, "%s", errm);
 
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index d9d36879ed..7ab0b804e4 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -153,9 +153,8 @@ create_logical_replication_slot(char *name, char *plugin,
 	ctx = CreateInitDecodingContext(plugin, NIL,
 									false,	/* just catalogs is OK */
 									restart_lsn,
-									XL_ROUTINE(.page_read = read_local_xlog_page,
-											   .segment_open = wal_segment_open,
-											   .segment_close = wal_segment_close),
+									read_local_xlog_page,
+									wal_segment_close,
 									NULL, NULL, NULL);
 
 	/*
@@ -512,9 +511,8 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 		ctx = CreateDecodingContext(InvalidXLogRecPtr,
 									NIL,
 									true,	/* fast_forward */
-									XL_ROUTINE(.page_read = read_local_xlog_page,
-											   .segment_open = wal_segment_open,
-											   .segment_close = wal_segment_close),
+									read_local_xlog_page,
+									wal_segment_close,
 									NULL, NULL, NULL);
 
 		/*
@@ -536,7 +534,13 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 			 * Read records.  No changes are generated in fast_forward mode,
 			 * but snapbuilder/slot statuses are updated properly.
 			 */
-			record = XLogReadRecord(ctx->reader, &errm);
+			while (XLogReadRecord(ctx->reader, &record, &errm) ==
+				   XLREAD_NEED_DATA)
+			{
+				if (!ctx->page_read(ctx->reader))
+					break;
+			}
+
 			if (errm)
 				elog(ERROR, "%s", errm);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index a4d6f30957..b024bbc3cd 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -580,10 +580,7 @@ StartReplication(StartReplicationCmd *cmd)
 
 	/* create xlogreader for physical replication */
 	xlogreader =
-		XLogReaderAllocate(wal_segment_size, NULL,
-						   XL_ROUTINE(.segment_open = WalSndSegmentOpen,
-									  .segment_close = wal_segment_close),
-						   NULL);
+		XLogReaderAllocate(wal_segment_size, NULL, wal_segment_close);
 
 	if (!xlogreader)
 		ereport(ERROR,
@@ -807,9 +804,11 @@ StartReplication(StartReplicationCmd *cmd)
  * set every time WAL is flushed.
  */
 static bool
-logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-					   XLogRecPtr targetRecPtr, char *cur_page)
+logical_read_xlog_page(XLogReaderState *state)
 {
+	XLogRecPtr		targetPagePtr = state->readPagePtr;
+	int				reqLen		  = state->readLen;
+	char		   *cur_page	  = state->readBuf;
 	XLogRecPtr	flushptr;
 	int			count;
 	WALReadError errinfo;
@@ -837,7 +836,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* now actually read the data, we know it's there */
-	if (!WALRead(state,
+	if (!WALRead(state, WalSndSegmentOpen, wal_segment_close,
 				 cur_page,
 				 targetPagePtr,
 				 XLOG_BLCKSZ,
@@ -1011,9 +1010,8 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
 		ctx = CreateInitDecodingContext(cmd->plugin, NIL, need_full_snapshot,
 										InvalidXLogRecPtr,
-										XL_ROUTINE(.page_read = logical_read_xlog_page,
-												   .segment_open = WalSndSegmentOpen,
-												   .segment_close = wal_segment_close),
+										logical_read_xlog_page,
+										wal_segment_close,
 										WalSndPrepareWrite, WalSndWriteData,
 										WalSndUpdateProgress);
 
@@ -1171,9 +1169,8 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	 */
 	logical_decoding_ctx =
 		CreateDecodingContext(cmd->startpoint, cmd->options, false,
-							  XL_ROUTINE(.page_read = logical_read_xlog_page,
-										 .segment_open = WalSndSegmentOpen,
-										 .segment_close = wal_segment_close),
+							  logical_read_xlog_page,
+							  wal_segment_close,
 							  WalSndPrepareWrite, WalSndWriteData,
 							  WalSndUpdateProgress);
 	xlogreader = logical_decoding_ctx->reader;
@@ -2749,7 +2746,7 @@ XLogSendPhysical(void)
 	enlargeStringInfo(&output_message, nbytes);
 
 retry:
-	if (!WALRead(xlogreader,
+	if (!WALRead(xlogreader, WalSndSegmentOpen, wal_segment_close,
 				 &output_message.data[output_message.len],
 				 startptr,
 				 nbytes,
@@ -2847,7 +2844,12 @@ XLogSendLogical(void)
 	 */
 	WalSndCaughtUp = false;
 
-	record = XLogReadRecord(logical_decoding_ctx->reader, &errm);
+	while (XLogReadRecord(logical_decoding_ctx->reader, &record, &errm) ==
+		   XLREAD_NEED_DATA)
+	{
+		if (!logical_decoding_ctx->page_read(logical_decoding_ctx->reader))
+			break;
+	}
 
 	/* xlog record was invalid */
 	if (errm != NULL)
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index cf119848b0..712c85281c 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -41,15 +41,9 @@ static int	xlogreadfd = -1;
 static XLogSegNo xlogreadsegno = -1;
 static char xlogfpath[MAXPGPATH];
 
-typedef struct XLogPageReadPrivate
-{
-	const char *restoreCommand;
-	int			tliIndex;
-} XLogPageReadPrivate;
-
-static bool	SimpleXLogPageRead(XLogReaderState *xlogreader,
-							   XLogRecPtr targetPagePtr,
-							   int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
+static bool SimpleXLogPageRead(XLogReaderState *xlogreader,
+							   const char *datadir, int *tliIndex,
+							   const char *restoreCommand);
 
 /*
  * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline
@@ -66,20 +60,22 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
-	XLogPageReadPrivate private;
 
-	private.tliIndex = tliIndex;
-	private.restoreCommand = restoreCommand;
-	xlogreader = XLogReaderAllocate(WalSegSz, datadir,
-									XL_ROUTINE(.page_read = &SimpleXLogPageRead),
-									&private);
+	xlogreader = XLogReaderAllocate(WalSegSz, datadir, NULL);
+
 	if (xlogreader == NULL)
 		pg_fatal("out of memory");
 
 	XLogBeginRead(xlogreader, startpoint);
 	do
 	{
-		record = XLogReadRecord(xlogreader, &errormsg);
+		while (XLogReadRecord(xlogreader, &record, &errormsg) ==
+			   XLREAD_NEED_DATA)
+		{
+			if (!SimpleXLogPageRead(xlogreader, datadir,
+									&tliIndex, restoreCommand))
+				break;
+		}
 
 		if (record == NULL)
 		{
@@ -123,19 +119,19 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex,
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
-	XLogPageReadPrivate private;
 	XLogRecPtr	endptr;
 
-	private.tliIndex = tliIndex;
-	private.restoreCommand = restoreCommand;
-	xlogreader = XLogReaderAllocate(WalSegSz, datadir,
-									XL_ROUTINE(.page_read = &SimpleXLogPageRead),
-									&private);
+	xlogreader = XLogReaderAllocate(WalSegSz, datadir, NULL);
 	if (xlogreader == NULL)
 		pg_fatal("out of memory");
 
 	XLogBeginRead(xlogreader, ptr);
-	record = XLogReadRecord(xlogreader, &errormsg);
+	while (XLogReadRecord(xlogreader, &record, &errormsg) ==
+		   XLREAD_NEED_DATA)
+	{
+		if (!SimpleXLogPageRead(xlogreader, datadir, &tliIndex, restoreCommand))
+			break;
+	}
 	if (record == NULL)
 	{
 		if (errormsg)
@@ -170,7 +166,6 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 	XLogRecPtr	searchptr;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
-	XLogPageReadPrivate private;
 
 	/*
 	 * The given fork pointer points to the end of the last common record,
@@ -186,11 +181,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 			forkptr += SizeOfXLogShortPHD;
 	}
 
-	private.tliIndex = tliIndex;
-	private.restoreCommand = restoreCommand;
-	xlogreader = XLogReaderAllocate(WalSegSz, datadir,
-									XL_ROUTINE(.page_read = &SimpleXLogPageRead),
-									&private);
+	xlogreader = XLogReaderAllocate(WalSegSz, datadir, NULL);
 	if (xlogreader == NULL)
 		pg_fatal("out of memory");
 
@@ -200,7 +191,13 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 		uint8		info;
 
 		XLogBeginRead(xlogreader, searchptr);
-		record = XLogReadRecord(xlogreader, &errormsg);
+		while (XLogReadRecord(xlogreader, &record, &errormsg) ==
+			   XLREAD_NEED_DATA)
+		{
+			if (!SimpleXLogPageRead(xlogreader, datadir,
+									&tliIndex, restoreCommand))
+				break;
+		}
 
 		if (record == NULL)
 		{
@@ -247,10 +244,11 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 
 /* XLogReader callback function, to read a WAL page */
 static bool
-SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
+SimpleXLogPageRead(XLogReaderState *xlogreader, const char *datadir,
+				   int *tliIndex, const char *restoreCommand)
 {
-	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
+	XLogRecPtr	targetPagePtr = xlogreader->readPagePtr;
+	char	   *readBuf		  = xlogreader->readBuf;
 	uint32		targetPageOff;
 	XLogRecPtr	targetSegEnd;
 	XLogSegNo	targetSegNo;
@@ -283,14 +281,14 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 		 * be done both forward and backward, consider also switching timeline
 		 * accordingly.
 		 */
-		while (private->tliIndex < targetNentries - 1 &&
-			   targetHistory[private->tliIndex].end < targetSegEnd)
-			private->tliIndex++;
-		while (private->tliIndex > 0 &&
-			   targetHistory[private->tliIndex].begin >= targetSegEnd)
-			private->tliIndex--;
-
-		XLogFileName(xlogfname, targetHistory[private->tliIndex].tli,
+		while (*tliIndex < targetNentries - 1 &&
+			   targetHistory[*tliIndex].end < targetSegEnd)
+			(*tliIndex)++;
+		while (*tliIndex > 0 &&
+			   targetHistory[*tliIndex].begin >= targetSegEnd)
+			(*tliIndex)--;
+
+		XLogFileName(xlogfname, targetHistory[*tliIndex].tli,
 					 xlogreadsegno, WalSegSz);
 
 		snprintf(xlogfpath, MAXPGPATH, "%s/" XLOGDIR "/%s",
@@ -303,7 +301,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 			/*
 			 * If we have no restore_command to execute, then exit.
 			 */
-			if (private->restoreCommand == NULL)
+			if (restoreCommand == NULL)
 			{
 				pg_log_error("could not open file \"%s\": %m", xlogfpath);
 				xlogreader->readLen = -1;
@@ -317,7 +315,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 			xlogreadfd = RestoreArchivedFile(xlogreader->segcxt.ws_dir,
 											 xlogfname,
 											 WalSegSz,
-											 private->restoreCommand);
+											 restoreCommand);
 
 			if (xlogreadfd < 0)
 			{
@@ -359,7 +357,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 
 	Assert(targetSegNo == xlogreadsegno);
 
-	xlogreader->seg.ws_tli = targetHistory[private->tliIndex].tli;
+	xlogreader->seg.ws_tli = targetHistory[*tliIndex].tli;
 	xlogreader->readLen = XLOG_BLCKSZ;
 	return true;
 }
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 75ece5c658..c4047b92b5 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -330,12 +330,17 @@ WALDumpCloseSegment(XLogReaderState *state)
 	state->seg.ws_file = -1;
 }
 
-/* pg_waldump's XLogReaderRoutine->page_read callback */
+/*
+ * pg_waldump's WAL page reader, also used as the page_read callback for
+ * XLogFindNextRecord
+ */
 static bool
-WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-				XLogRecPtr targetPtr, char *readBuff)
+WALDumpReadPage(XLogReaderState *state, void *priv)
 {
-	XLogDumpPrivate *private = state->private_data;
+	XLogRecPtr	targetPagePtr = state->readPagePtr;
+	int			reqLen		  = state->readLen;
+	char	   *readBuff	  = state->readBuf;
+	XLogDumpPrivate *private  = (XLogDumpPrivate *) priv;
 	int			count = XLOG_BLCKSZ;
 	WALReadError errinfo;
 
@@ -353,8 +358,8 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 		}
 	}
 
-	if (!WALRead(state, readBuff, targetPagePtr, count, private->timeline,
-				 &errinfo))
+	if (!WALRead(state, WALDumpOpenSegment, WALDumpCloseSegment,
+				 readBuff, targetPagePtr, count, private->timeline, &errinfo))
 	{
 		WALOpenSegment *seg = &errinfo.wre_seg;
 		char		fname[MAXPGPATH];
@@ -374,6 +379,7 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 						(Size) errinfo.wre_req);
 	}
 
+	Assert(count >= state->readLen);
 	state->readLen = count;
 	return true;
 }
@@ -1044,16 +1050,14 @@ main(int argc, char **argv)
 
 	/* we have everything we need, start reading */
 	xlogreader_state =
-		XLogReaderAllocate(WalSegSz, waldir,
-						   XL_ROUTINE(.page_read = WALDumpReadPage,
-									  .segment_open = WALDumpOpenSegment,
-									  .segment_close = WALDumpCloseSegment),
-						   &private);
+		XLogReaderAllocate(WalSegSz, waldir, WALDumpCloseSegment);
+
 	if (!xlogreader_state)
 		fatal_error("out of memory");
 
 	/* first find a valid recptr to start from */
-	first_record = XLogFindNextRecord(xlogreader_state, private.startptr);
+	first_record = XLogFindNextRecord(xlogreader_state, private.startptr,
+									  &WALDumpReadPage, (void*) &private);
 
 	if (first_record == InvalidXLogRecPtr)
 		fatal_error("could not find a valid record after %X/%X",
@@ -1076,7 +1080,13 @@ main(int argc, char **argv)
 	for (;;)
 	{
 		/* try to read the next record */
-		record = XLogReadRecord(xlogreader_state, &errormsg);
+		while (XLogReadRecord(xlogreader_state, &record, &errormsg) ==
+			   XLREAD_NEED_DATA)
+		{
+			if (!WALDumpReadPage(xlogreader_state, (void *) &private))
+				break;
+		}
+
 		if (!record)
 		{
 			if (!config.follow || private.endptr_reached)
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 5d9e0d3292..1492f1992d 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -57,64 +57,15 @@ typedef struct WALSegmentContext
 
 typedef struct XLogReaderState XLogReaderState;
 
-/* Function type definition for the read_page callback */
-typedef bool (*XLogPageReadCB) (XLogReaderState *xlogreader,
-								XLogRecPtr targetPagePtr,
-								int reqLen,
-								XLogRecPtr targetRecPtr,
-								char *readBuf);
+/* Function type definition for the segment cleanup callback */
+typedef void (*WALSegmentCleanupCB) (XLogReaderState *xlogreader);
+
+/* Function type definition for the open/close callbacks for WALRead() */
 typedef void (*WALSegmentOpenCB) (XLogReaderState *xlogreader,
 								  XLogSegNo nextSegNo,
 								  TimeLineID *tli_p);
 typedef void (*WALSegmentCloseCB) (XLogReaderState *xlogreader);
 
-typedef struct XLogReaderRoutine
-{
-	/*
-	 * Data input callback
-	 *
-	 * This callback shall read at least reqLen valid bytes of the xlog page
-	 * starting at targetPagePtr, and store them in readBuf.  The callback
-	 * shall return the number of bytes read (never more than XLOG_BLCKSZ), or
-	 * -1 on failure.  The callback shall sleep, if necessary, to wait for the
-	 * requested bytes to become available.  The callback will not be invoked
-	 * again for the same page unless more than the returned number of bytes
-	 * are needed.
-	 *
-	 * targetRecPtr is the position of the WAL record we're reading.  Usually
-	 * it is equal to targetPagePtr + reqLen, but sometimes xlogreader needs
-	 * to read and verify the page or segment header, before it reads the
-	 * actual WAL record it's interested in.  In that case, targetRecPtr can
-	 * be used to determine which timeline to read the page from.
-	 *
-	 * The callback shall set ->seg.ws_tli to the TLI of the file the page was
-	 * read from.
-	 */
-	XLogPageReadCB page_read;
-
-	/*
-	 * Callback to open the specified WAL segment for reading.  ->seg.ws_file
-	 * shall be set to the file descriptor of the opened segment.  In case of
-	 * failure, an error shall be raised by the callback and it shall not
-	 * return.
-	 *
-	 * "nextSegNo" is the number of the segment to be opened.
-	 *
-	 * "tli_p" is an input/output argument. WALRead() uses it to pass the
-	 * timeline in which the new segment should be found, but the callback can
-	 * use it to return the TLI that it actually opened.
-	 */
-	WALSegmentOpenCB segment_open;
-
-	/*
-	 * WAL segment close callback.  ->seg.ws_file shall be set to a negative
-	 * number.
-	 */
-	WALSegmentCloseCB segment_close;
-} XLogReaderRoutine;
-
-#define XL_ROUTINE(...) &(XLogReaderRoutine){__VA_ARGS__}
-
 typedef struct
 {
 	/* Is this block ref in use? */
@@ -144,12 +95,36 @@ typedef struct
 	uint16		data_bufsz;
 } DecodedBkpBlock;
 
+/* Return code from XLogReadRecord */
+typedef enum XLogReadRecordResult
+{
+	XLREAD_SUCCESS,				/* record is successfully read */
+	XLREAD_NEED_DATA,			/* need more data. see XLogReadRecord. */
+	XLREAD_FAIL					/* failed during reading a record */
+}			XLogReadRecordResult;
+
+/*
+ * internal state of XLogReadRecord
+ *
+ * XLogReadRecord runs a state machine while reading a record. These states
+ * are not seen outside the function. Each state may repeat several times,
+ * returning to the caller to request new data. See the comment on
+ * XLogReadRecord for details.
+ */
+typedef enum XLogReadRecordState
+{
+	XLREAD_NEXT_RECORD,
+	XLREAD_TOT_LEN,
+	XLREAD_FIRST_FRAGMENT,
+	XLREAD_CONTINUATION
+}			XLogReadRecordState;
+
 struct XLogReaderState
 {
 	/*
 	 * Operational callbacks
 	 */
-	XLogReaderRoutine routine;
+	WALSegmentCleanupCB cleanup_cb;
 
 	/* ----------------------------------------
 	 * Public parameters
@@ -162,18 +137,14 @@ struct XLogReaderState
 	 */
 	uint64		system_identifier;
 
-	/*
-	 * Opaque data for callbacks to use.  Not used by XLogReader.
-	 */
-	void	   *private_data;
-
 	/*
 	 * Start and end point of last record read.  EndRecPtr is also used as the
 	 * position to read next.  Calling XLogBeginRead() sets EndRecPtr to the
 	 * starting position and ReadRecPtr to invalid.
 	 */
-	XLogRecPtr	ReadRecPtr;		/* start of last record read */
+	XLogRecPtr	ReadRecPtr;		/* start of last record read or being read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	XLogRecPtr	PrevRecPtr;		/* start of previous record read */
 
 	/* ----------------------------------------
 	 * Communication with page reader
@@ -186,7 +157,9 @@ struct XLogReaderState
 								 * read by reader, which must be larger than
 								 * the request, or -1 on error */
 	char	   *readBuf;		/* buffer to store data */
-	bool		page_verified;	/* is the page on the buffer verified? */
+	bool		page_verified;	/* is the page header on the buffer verified? */
+	bool		record_verified;	/* is the current record header verified? */
+
 
 
 	/* ----------------------------------------
@@ -228,8 +201,6 @@ struct XLogReaderState
 	XLogRecPtr	latestPagePtr;
 	TimeLineID	latestPageTLI;
 
-	/* beginning of the WAL record being read. */
-	XLogRecPtr	currRecPtr;
 	/* timeline to read it from, 0 if a lookup is required */
 	TimeLineID	currTLI;
 
@@ -256,6 +227,15 @@ struct XLogReaderState
 	char	   *readRecordBuf;
 	uint32		readRecordBufSize;
 
+	/*
+	 * XLogReadRecord() state
+	 */
+	XLogReadRecordState readRecordState;	/* state machine state */
+	int			recordGotLen;	/* amount of current record that has already
+								 * been read */
+	int			recordRemainLen;	/* length of current record that remains */
+	XLogRecPtr	recordContRecPtr;	/* where the current record continues */
+
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
 };
@@ -263,9 +243,7 @@ struct XLogReaderState
 /* Get a new XLogReader */
 extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
 										   const char *waldir,
-										   XLogReaderRoutine *routine,
-										   void *private_data);
-extern XLogReaderRoutine *LocalXLogReaderRoutine(void);
+										   WALSegmentCleanupCB cleanup_cb);
 
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
@@ -273,12 +251,17 @@ extern void XLogReaderFree(XLogReaderState *state);
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 #ifdef FRONTEND
-extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
+/* Function type definition for the read_page callback */
+typedef bool (*XLogFindNextRecordCB) (XLogReaderState *xlogreader,
+									  void *private);
+extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr,
+									 XLogFindNextRecordCB read_page, void *private);
 #endif							/* FRONTEND */
 
 /* Read the next XLog record. Returns XLREAD_FAIL on end-of-WAL or failure */
-extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
-										 char **errormsg);
+extern XLogReadRecordResult XLogReadRecord(XLogReaderState *state,
+										   XLogRecord **record,
+										   char **errormsg);
 
 /* Validate a page */
 extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
@@ -298,6 +281,7 @@ typedef struct WALReadError
 } WALReadError;
 
 extern bool WALRead(XLogReaderState *state,
+					WALSegmentOpenCB segopenfn, WALSegmentCloseCB segclosefn,
 					char *buf, XLogRecPtr startptr, Size count,
 					TimeLineID tli, WALReadError *errinfo);
 
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 364a21c4ea..397fb27fc2 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,9 +47,7 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
 
-extern bool	read_local_xlog_page(XLogReaderState *state,
-								 XLogRecPtr targetPagePtr, int reqLen,
-								 XLogRecPtr targetRecPtr, char *cur_page);
+extern bool read_local_xlog_page(XLogReaderState *state);
 extern void wal_segment_open(XLogReaderState *state,
 							 XLogSegNo nextSegNo,
 							 TimeLineID *tli_p);
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index e28c990382..bd9a5c6c2b 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -384,7 +384,7 @@
  * Enable debugging print statements for WAL-related operations; see
  * also the wal_debug GUC var.
  */
-/* #define WAL_DEBUG */
+#define WAL_DEBUG
 
 /*
  * Enable tracing of resource consumption during sort operations;
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index af551d6f4e..94e278ef81 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -29,6 +29,10 @@ typedef void (*LogicalOutputPluginWriterUpdateProgress) (struct LogicalDecodingC
 														 TransactionId xid
 );
 
+typedef struct LogicalDecodingContext LogicalDecodingContext;
+
+typedef bool (*LogicalDecodingXLogPageReadCB)(XLogReaderState *ctx);
+
 typedef struct LogicalDecodingContext
 {
 	/* memory context this is all allocated in */
@@ -39,6 +43,7 @@ typedef struct LogicalDecodingContext
 
 	/* infrastructure pieces for decoding */
 	XLogReaderState *reader;
+	LogicalDecodingXLogPageReadCB page_read;
 	struct ReorderBuffer *reorder;
 	struct SnapBuild *snapshot_builder;
 
@@ -105,14 +110,16 @@ extern LogicalDecodingContext *CreateInitDecodingContext(const char *plugin,
 														 List *output_plugin_options,
 														 bool need_full_snapshot,
 														 XLogRecPtr restart_lsn,
-														 XLogReaderRoutine *xl_routine,
+														 LogicalDecodingXLogPageReadCB page_read,
+														 WALSegmentCleanupCB cleanup_cb,
 														 LogicalOutputPluginWriterPrepareWrite prepare_write,
 														 LogicalOutputPluginWriterWrite do_write,
 														 LogicalOutputPluginWriterUpdateProgress update_progress);
 extern LogicalDecodingContext *CreateDecodingContext(XLogRecPtr start_lsn,
 													 List *output_plugin_options,
 													 bool fast_forward,
-													 XLogReaderRoutine *xl_routine,
+													 LogicalDecodingXLogPageReadCB page_read,
+													 WALSegmentCleanupCB cleanup_cb,
 													 LogicalOutputPluginWriterPrepareWrite prepare_write,
 													 LogicalOutputPluginWriterWrite do_write,
 													 LogicalOutputPluginWriterUpdateProgress update_progress);
-- 
2.30.1
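
To make the new return-code based interface above concrete, here's a rough
sketch (not part of the patches) of how a caller now drives the reader: the
read_page callback is gone, so XLogReadRecord() returns XLREAD_NEED_DATA
whenever it wants more bytes, and the caller supplies them with its own
page-read routine before retrying.  XLogReadRecord(), XLogBeginRead() and
read_local_xlog_page() are real names from the patches; xlogreader is assumed
to have been set up with XLogReaderAllocate(), and start_lsn and the
surrounding loop are placeholders.

    XLogRecord *record;
    char       *errormsg;
    XLogReadRecordResult result;

    XLogBeginRead(xlogreader, start_lsn);
    for (;;)
    {
        /* Ask for the next record; feed the reader whenever it wants data. */
        while ((result = XLogReadRecord(xlogreader, &record, &errormsg)) ==
               XLREAD_NEED_DATA)
        {
            if (!read_local_xlog_page(xlogreader))
                break;
        }

        if (result != XLREAD_SUCCESS)
            break;              /* end of WAL, or a bad record */

        /* examine or replay 'record' here */
    }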

Attachment: v17-0003-Remove-globals-readOff-readLen-and-readSegNo.patch (text/x-patch)
From e212d9c87f417127061a051c69eaf25443c54e31 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 10 Sep 2019 17:28:48 +0900
Subject: [PATCH v17 03/10] Remove globals readOff, readLen and readSegNo.

The first two global variables are duplicated in XLogReaderState.
Remove them, and also readSegNo, which should move into that struct too.

Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Discussion: https://postgr.es/m/20190418.210257.43726183.horiguchi.kyotaro%40lab.ntt.co.jp
---
 src/backend/access/transam/xlog.c | 77 ++++++++++++++-----------------
 1 file changed, 35 insertions(+), 42 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b7d7e6d31b..9a9835f05d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -811,17 +811,13 @@ static XLogSegNo openLogSegNo = 0;
  * These variables are used similarly to the ones above, but for reading
  * the XLOG.  Note, however, that readOff generally represents the offset
  * of the page just read, not the seek position of the FD itself, which
- * will be just past that page. readLen indicates how much of the current
- * page has been read into readBuf, and readSource indicates where we got
- * the currently open file from.
+ * will be just past that page. readSource indicates where we got the
+ * currently open file from.
  * Note: we could use Reserve/ReleaseExternalFD to track consumption of
  * this FD too; but it doesn't currently seem worthwhile, since the XLOG is
  * not read by general-purpose sessions.
  */
 static int	readFile = -1;
-static XLogSegNo readSegNo = 0;
-static uint32 readOff = 0;
-static uint32 readLen = 0;
 static XLogSource readSource = XLOG_FROM_ANY;
 
 /*
@@ -913,10 +909,12 @@ static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
-static bool XLogPageRead(XLogReaderState *xlogreader,
+static bool XLogPageRead(XLogReaderState *state,
 						 bool fetching_ckpt, int emode, bool randAccess);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-										bool fetching_ckpt, XLogRecPtr tliRecPtr);
+										bool fetching_ckpt,
+										XLogRecPtr tliRecPtr,
+										XLogSegNo readSegNo);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
@@ -7808,7 +7806,8 @@ StartupXLOG(void)
 		XLogRecPtr	pageBeginPtr;
 
 		pageBeginPtr = EndOfLog - (EndOfLog % XLOG_BLCKSZ);
-		Assert(readOff == XLogSegmentOffset(pageBeginPtr, wal_segment_size));
+		Assert(XLogSegmentOffset(xlogreader->readPagePtr, wal_segment_size) ==
+			   XLogSegmentOffset(pageBeginPtr, wal_segment_size));
 
 		firstIdx = XLogRecPtrToBufIdx(EndOfLog);
 
@@ -12097,13 +12096,14 @@ CancelBackup(void)
  * sleep and retry.
  */
 static bool
-XLogPageRead(XLogReaderState *xlogreader,
+XLogPageRead(XLogReaderState *state,
 			 bool fetching_ckpt, int emode, bool randAccess)
 {
-	char *readBuf				= xlogreader->readBuf;
-	XLogRecPtr targetPagePtr	= xlogreader->readPagePtr;
-	int reqLen					= xlogreader->readLen;
-	XLogRecPtr targetRecPtr		= xlogreader->ReadRecPtr;
+	char *readBuf				= state->readBuf;
+	XLogRecPtr	targetPagePtr	= state->readPagePtr;
+	int			reqLen			= state->readLen;
+	int			readLen			= 0;
+	XLogRecPtr	targetRecPtr	= state->ReadRecPtr;
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
@@ -12116,7 +12116,7 @@ XLogPageRead(XLogReaderState *xlogreader,
 	 * is not in the currently open one.
 	 */
 	if (readFile >= 0 &&
-		!XLByteInSeg(targetPagePtr, readSegNo, wal_segment_size))
+		!XLByteInSeg(targetPagePtr, state->seg.ws_segno, wal_segment_size))
 	{
 		/*
 		 * Request a restartpoint if we've replayed too much xlog since the
@@ -12124,10 +12124,10 @@ XLogPageRead(XLogReaderState *xlogreader,
 		 */
 		if (bgwriterLaunched)
 		{
-			if (XLogCheckpointNeeded(readSegNo))
+			if (XLogCheckpointNeeded(state->seg.ws_segno))
 			{
 				(void) GetRedoRecPtr();
-				if (XLogCheckpointNeeded(readSegNo))
+				if (XLogCheckpointNeeded(state->seg.ws_segno))
 					RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
 			}
 		}
@@ -12137,7 +12137,7 @@ XLogPageRead(XLogReaderState *xlogreader,
 		readSource = XLOG_FROM_ANY;
 	}
 
-	XLByteToSeg(targetPagePtr, readSegNo, wal_segment_size);
+	XLByteToSeg(targetPagePtr, state->seg.ws_segno, wal_segment_size);
 
 retry:
 	/* See if we need to retrieve more data */
@@ -12146,17 +12146,14 @@ retry:
 		 flushedUpto < targetPagePtr + reqLen))
 	{
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
-										 randAccess,
-										 fetching_ckpt,
-										 targetRecPtr))
+										 randAccess, fetching_ckpt,
+										 targetRecPtr, state->seg.ws_segno))
 		{
 			if (readFile >= 0)
 				close(readFile);
 			readFile = -1;
-			readLen = 0;
 			readSource = XLOG_FROM_ANY;
-
-			xlogreader->readLen = -1;
+			state->readLen = -1;
 			return false;
 		}
 	}
@@ -12184,40 +12181,36 @@ retry:
 	else
 		readLen = XLOG_BLCKSZ;
 
-	/* Read the requested page */
-	readOff = targetPageOff;
-
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) targetPageOff);
 	if (r != XLOG_BLCKSZ)
 	{
 		char		fname[MAXFNAMELEN];
 		int			save_errno = errno;
 
 		pgstat_report_wait_end();
-		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+		XLogFileName(fname, curFileTLI, state->seg.ws_segno, wal_segment_size);
 		if (r < 0)
 		{
 			errno = save_errno;
 			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
 					(errcode_for_file_access(),
 					 errmsg("could not read from log segment %s, offset %u: %m",
-							fname, readOff)));
+							fname, targetPageOff)));
 		}
 		else
 			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							fname, readOff, r, (Size) XLOG_BLCKSZ)));
+							fname, targetPageOff, r, (Size) XLOG_BLCKSZ)));
 		goto next_record_is_invalid;
 	}
 	pgstat_report_wait_end();
 
-	Assert(targetSegNo == readSegNo);
-	Assert(targetPageOff == readOff);
+	Assert(targetSegNo == state->seg.ws_segno);
 	Assert(reqLen <= readLen);
 
-	xlogreader->seg.ws_tli = curFileTLI;
+	state->seg.ws_tli = curFileTLI;
 
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
@@ -12245,15 +12238,15 @@ retry:
 	 * Validating the page header is cheap enough that doing it twice
 	 * shouldn't be a big deal from a performance point of view.
 	 */
-	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
+	if (!XLogReaderValidatePageHeader(state, targetPagePtr, readBuf))
 	{
-		/* reset any error XLogReaderValidatePageHeader() might have set */
-		xlogreader->errormsg_buf[0] = '\0';
+		/* reset any error XLogReaderValidatePageHeader() might have set */
+		state->errormsg_buf[0] = '\0';
 		goto next_record_is_invalid;
 	}
 
-	Assert(xlogreader->readPagePtr == targetPagePtr);
-	xlogreader->readLen = readLen;
+	Assert(state->readPagePtr == targetPagePtr);
+	state->readLen = readLen;
 	return true;
 
 next_record_is_invalid:
@@ -12262,14 +12255,13 @@ next_record_is_invalid:
 	if (readFile >= 0)
 		close(readFile);
 	readFile = -1;
-	readLen = 0;
 	readSource = XLOG_FROM_ANY;
 
 	/* In standby-mode, keep trying */
 	if (StandbyMode)
 		goto retry;
 
-	xlogreader->readLen = -1;
+	state->readLen = -1;
 	return false;
 }
 
@@ -12301,7 +12293,8 @@ next_record_is_invalid:
  */
 static bool
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-							bool fetching_ckpt, XLogRecPtr tliRecPtr)
+							bool fetching_ckpt, XLogRecPtr tliRecPtr,
+							XLogSegNo readSegNo)
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
-- 
2.30.1
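
For reference, after this patch a page-read routine talks to xlogreader
entirely through XLogReaderState: the request arrives in state->readPagePtr
and state->readLen, and the routine answers by filling state->readBuf and
writing the byte count (or -1 on failure) back into state->readLen.  Below is
a minimal sketch, not taken from the patch; fetch_wal_bytes() and my_timeline
are placeholders for whatever the real routine does to locate and read the
data.

    static bool
    my_read_page(XLogReaderState *state)
    {
        XLogRecPtr  targetPagePtr = state->readPagePtr; /* in: page to read */
        int         reqLen = state->readLen;            /* in: minimum bytes needed */
        bool        ok;

        Assert(reqLen <= XLOG_BLCKSZ);

        /* fetch XLOG_BLCKSZ bytes at targetPagePtr into state->readBuf */
        ok = fetch_wal_bytes(targetPagePtr, state->readBuf);    /* placeholder */
        if (!ok)
        {
            state->readLen = -1;            /* out: report failure */
            return false;
        }

        state->seg.ws_tli = my_timeline;    /* out: timeline the data came from */
        state->readLen = XLOG_BLCKSZ;       /* out: bytes supplied (>= reqLen) */
        return true;
    }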

Attachment: v17-0004-Make-XLogFindNextRecord-not-use-callback-functio.patch (text/x-patch)
From 14fee3e4a2256b5d47ca7b1b9ff562bc5f487e53 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 7 Apr 2021 15:32:10 +0900
Subject: [PATCH v17 04/10] Make XLogFindNextRecord not use callback function

The last function that uses a page-read callback is
XLogFindNextRecord.  Let's make it callback-free.  This also
simplifies the interface of WALDumpReadPage.
---
 src/backend/access/transam/xlogreader.c |  73 ++++++++-------
 src/bin/pg_waldump/pg_waldump.c         | 115 ++++++++++++------------
 src/include/access/xlogreader.h         |  12 ++-
 3 files changed, 110 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 661863e94b..89c59843b9 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1107,6 +1107,22 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
  * here.
  */
 
+XLogFindNextRecordState *
+InitXLogFindNextRecord(XLogReaderState *reader_state, XLogRecPtr start_ptr)
+{
+	XLogFindNextRecordState *state = (XLogFindNextRecordState *)
+		palloc_extended(sizeof(XLogFindNextRecordState),
+						MCXT_ALLOC_NO_OOM | MCXT_ALLOC_ZERO);
+	if (!state)
+		return NULL;
+
+	state->reader_state = reader_state;
+	state->targetRecPtr = start_ptr;
+	state->currRecPtr = start_ptr;
+
+	return state;
+}
+
 /*
  * Find the first record with an lsn >= RecPtr.
  *
@@ -1118,24 +1134,21 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
  * This positions the reader, like XLogBeginRead(), so that the next call to
  * XLogReadRecord() will read the next valid record.
  */
-XLogRecPtr
-XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr,
-				   XLogFindNextRecordCB read_page, void *private)
+bool
+XLogFindNextRecord(XLogFindNextRecordState *state)
 {
-	XLogRecPtr	tmpRecPtr;
 	XLogRecPtr	found = InvalidXLogRecPtr;
 	XLogPageHeader header;
 	XLogRecord *record;
 	XLogReadRecordResult result;
 	char	   *errormsg;
 
-	Assert(!XLogRecPtrIsInvalid(RecPtr));
+	Assert(!XLogRecPtrIsInvalid(state->currRecPtr));
 
 	/*
 	 * skip over potential continuation data, keeping in mind that it may span
 	 * multiple pages
 	 */
-	tmpRecPtr = RecPtr;
 	while (true)
 	{
 		XLogRecPtr	targetPagePtr;
@@ -1151,27 +1164,24 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr,
 		 * XLogNeedData() is prepared to handle that and will read at least
 		 * short page-header worth of data
 		 */
-		targetRecOff = tmpRecPtr % XLOG_BLCKSZ;
+		targetRecOff = state->currRecPtr % XLOG_BLCKSZ;
 
 		/* scroll back to page boundary */
-		targetPagePtr = tmpRecPtr - targetRecOff;
+		targetPagePtr = state->currRecPtr - targetRecOff;
 
-		while (XLogNeedData(state, targetPagePtr, targetRecOff,
+		if (XLogNeedData(state->reader_state, targetPagePtr, targetRecOff,
 							targetRecOff != 0))
-		{
-			if (!read_page(state, private))
-				break;
-		}
+			return true;
 
-		if (!state->page_verified)
+		if (!state->reader_state->page_verified)
 			goto err;
 
-		header = (XLogPageHeader) state->readBuf;
+		header = (XLogPageHeader) state->reader_state->readBuf;
 
 		pageHeaderSize = XLogPageHeaderSize(header);
 
 		/* we should have read the page header */
-		Assert(state->readLen >= pageHeaderSize);
+		Assert(state->reader_state->readLen >= pageHeaderSize);
 
 		/* skip over potential continuation data */
 		if (header->xlp_info & XLP_FIRST_IS_CONTRECORD)
@@ -1186,21 +1196,21 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr,
 			 * Note that record headers are MAXALIGN'ed
 			 */
 			if (MAXALIGN(header->xlp_rem_len) >= (XLOG_BLCKSZ - pageHeaderSize))
-				tmpRecPtr = targetPagePtr + XLOG_BLCKSZ;
+				state->currRecPtr = targetPagePtr + XLOG_BLCKSZ;
 			else
 			{
 				/*
 				 * The previous continuation record ends in this page. Set
-				 * tmpRecPtr to point to the first valid record
+				 * state->currRecPtr to point to the first valid record
 				 */
-				tmpRecPtr = targetPagePtr + pageHeaderSize
+				state->currRecPtr = targetPagePtr + pageHeaderSize
 					+ MAXALIGN(header->xlp_rem_len);
 				break;
 			}
 		}
 		else
 		{
-			tmpRecPtr = targetPagePtr + pageHeaderSize;
+			state->currRecPtr = targetPagePtr + pageHeaderSize;
 			break;
 		}
 	}
@@ -1210,31 +1220,28 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr,
 	 * because either we're at the first record after the beginning of a page
 	 * or we just jumped over the remaining data of a continuation.
 	 */
-	XLogBeginRead(state, tmpRecPtr);
-	while ((result = XLogReadRecord(state, &record, &errormsg)) !=
+	XLogBeginRead(state->reader_state, state->currRecPtr);
+	while ((result = XLogReadRecord(state->reader_state, &record, &errormsg)) !=
 		   XLREAD_FAIL)
 	{
 		if (result == XLREAD_NEED_DATA)
-		{
-			if (!read_page(state, private))
-				goto err;
-			continue;
-		}
+			return true;
 
 		/* past the record we've found, break out */
-		if (RecPtr <= state->ReadRecPtr)
+		if (state->targetRecPtr <= state->reader_state->ReadRecPtr)
 		{
 			/* Rewind the reader to the beginning of the last record. */
-			found = state->ReadRecPtr;
-			XLogBeginRead(state, found);
-			return found;
+			state->currRecPtr = state->reader_state->ReadRecPtr;
+			XLogBeginRead(state->reader_state, found);
+			return false;
 		}
 	}
 
 err:
-	XLogReaderInvalReadState(state);
+	XLogReaderInvalReadState(state->reader_state);
 
-	return InvalidXLogRecPtr;
+	state->currRecPtr = InvalidXLogRecPtr;
+	return false;
 }
 
 #endif							/* FRONTEND */
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index c4047b92b5..ab2d079bdb 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -29,14 +29,6 @@ static const char *progname;
 
 static int	WalSegSz;
 
-typedef struct XLogDumpPrivate
-{
-	TimeLineID	timeline;
-	XLogRecPtr	startptr;
-	XLogRecPtr	endptr;
-	bool		endptr_reached;
-} XLogDumpPrivate;
-
 typedef struct XLogDumpConfig
 {
 	/* display options */
@@ -331,35 +323,40 @@ WALDumpCloseSegment(XLogReaderState *state)
 }
 
 /*
- * pg_waldump's WAL page rader, also used as page_read callback for
- * XLogFindNextRecord
+ * pg_waldump's WAL page reader
+ *
+ * timeline and startptr specify where to read from; reading stops at endptr.
  */
 static bool
-WALDumpReadPage(XLogReaderState *state, void *priv)
+WALDumpReadPage(XLogReaderState *state, TimeLineID timeline,
+				XLogRecPtr startptr, XLogRecPtr endptr)
 {
 	XLogRecPtr	targetPagePtr = state->readPagePtr;
 	int			reqLen		  = state->readLen;
 	char	   *readBuff	  = state->readBuf;
-	XLogDumpPrivate *private  = (XLogDumpPrivate *) priv;
 	int			count = XLOG_BLCKSZ;
 	WALReadError errinfo;
 
-	if (private->endptr != InvalidXLogRecPtr)
+	/* determine the number of bytes to read on the page */
+	if (endptr != InvalidXLogRecPtr)
 	{
-		if (targetPagePtr + XLOG_BLCKSZ <= private->endptr)
+		if (targetPagePtr + XLOG_BLCKSZ <= endptr)
 			count = XLOG_BLCKSZ;
-		else if (targetPagePtr + reqLen <= private->endptr)
-			count = private->endptr - targetPagePtr;
+		else if (targetPagePtr + reqLen <= endptr)
+			count = endptr - targetPagePtr;
 		else
 		{
-			private->endptr_reached = true;
+			/* Notify xlogreader that we didn't read at all */
 			state->readLen = -1;
 			return false;
 		}
 	}
 
+	/* We should read at least as much as requested by xlogreader */
+	Assert(count >= state->readLen);
+
 	if (!WALRead(state, WALDumpOpenSegment, WALDumpCloseSegment,
-				 readBuff, targetPagePtr, count, private->timeline, &errinfo))
+				 readBuff, targetPagePtr, count, timeline, &errinfo))
 	{
 		WALOpenSegment *seg = &errinfo.wre_seg;
 		char		fname[MAXPGPATH];
@@ -379,7 +376,7 @@ WALDumpReadPage(XLogReaderState *state, void *priv)
 						(Size) errinfo.wre_req);
 	}
 
-	Assert(count >= state->readLen);
+	/* Notify xlogreader of how many bytes we have read */
 	state->readLen = count;
 	return true;
 }
@@ -762,7 +759,10 @@ main(int argc, char **argv)
 	uint32		xlogid;
 	uint32		xrecoff;
 	XLogReaderState *xlogreader_state;
-	XLogDumpPrivate private;
+	XLogFindNextRecordState *findnext_state;
+	TimeLineID	timeline;
+	XLogRecPtr	startptr;
+	XLogRecPtr	endptr;
 	XLogDumpConfig config;
 	XLogDumpStats stats;
 	XLogRecord *record;
@@ -808,14 +808,9 @@ main(int argc, char **argv)
 		}
 	}
 
-	memset(&private, 0, sizeof(XLogDumpPrivate));
-	memset(&config, 0, sizeof(XLogDumpConfig));
-	memset(&stats, 0, sizeof(XLogDumpStats));
-
-	private.timeline = 1;
-	private.startptr = InvalidXLogRecPtr;
-	private.endptr = InvalidXLogRecPtr;
-	private.endptr_reached = false;
+	timeline = 1;
+	startptr = InvalidXLogRecPtr;
+	endptr = InvalidXLogRecPtr;
 
 	config.quiet = false;
 	config.bkp_details = false;
@@ -849,7 +844,7 @@ main(int argc, char **argv)
 								 optarg);
 					goto bad_argument;
 				}
-				private.endptr = (uint64) xlogid << 32 | xrecoff;
+				endptr = (uint64) xlogid << 32 | xrecoff;
 				break;
 			case 'f':
 				config.follow = true;
@@ -902,10 +897,10 @@ main(int argc, char **argv)
 					goto bad_argument;
 				}
 				else
-					private.startptr = (uint64) xlogid << 32 | xrecoff;
+					startptr = (uint64) xlogid << 32 | xrecoff;
 				break;
 			case 't':
-				if (sscanf(optarg, "%d", &private.timeline) != 1)
+				if (sscanf(optarg, "%d", &timeline) != 1)
 				{
 					pg_log_error("could not parse timeline \"%s\"", optarg);
 					goto bad_argument;
@@ -982,21 +977,21 @@ main(int argc, char **argv)
 		close(fd);
 
 		/* parse position from file */
-		XLogFromFileName(fname, &private.timeline, &segno, WalSegSz);
+		XLogFromFileName(fname, &timeline, &segno, WalSegSz);
 
-		if (XLogRecPtrIsInvalid(private.startptr))
-			XLogSegNoOffsetToRecPtr(segno, 0, WalSegSz, private.startptr);
-		else if (!XLByteInSeg(private.startptr, segno, WalSegSz))
+		if (XLogRecPtrIsInvalid(startptr))
+			XLogSegNoOffsetToRecPtr(segno, 0, WalSegSz, startptr);
+		else if (!XLByteInSeg(startptr, segno, WalSegSz))
 		{
 			pg_log_error("start WAL location %X/%X is not inside file \"%s\"",
-						 LSN_FORMAT_ARGS(private.startptr),
+						 LSN_FORMAT_ARGS(startptr),
 						 fname);
 			goto bad_argument;
 		}
 
 		/* no second file specified, set end position */
-		if (!(optind + 1 < argc) && XLogRecPtrIsInvalid(private.endptr))
-			XLogSegNoOffsetToRecPtr(segno + 1, 0, WalSegSz, private.endptr);
+		if (!(optind + 1 < argc) && XLogRecPtrIsInvalid(endptr))
+			XLogSegNoOffsetToRecPtr(segno + 1, 0, WalSegSz, endptr);
 
 		/* parse ENDSEG if passed */
 		if (optind + 1 < argc)
@@ -1012,26 +1007,26 @@ main(int argc, char **argv)
 			close(fd);
 
 			/* parse position from file */
-			XLogFromFileName(fname, &private.timeline, &endsegno, WalSegSz);
+			XLogFromFileName(fname, &timeline, &endsegno, WalSegSz);
 
 			if (endsegno < segno)
 				fatal_error("ENDSEG %s is before STARTSEG %s",
 							argv[optind + 1], argv[optind]);
 
-			if (XLogRecPtrIsInvalid(private.endptr))
+			if (XLogRecPtrIsInvalid(endptr))
 				XLogSegNoOffsetToRecPtr(endsegno + 1, 0, WalSegSz,
-										private.endptr);
+										endptr);
 
 			/* set segno to endsegno for check of --end */
 			segno = endsegno;
 		}
 
 
-		if (!XLByteInSeg(private.endptr, segno, WalSegSz) &&
-			private.endptr != (segno + 1) * WalSegSz)
+		if (!XLByteInSeg(endptr, segno, WalSegSz) &&
+			endptr != (segno + 1) * WalSegSz)
 		{
 			pg_log_error("end WAL location %X/%X is not inside file \"%s\"",
-						 LSN_FORMAT_ARGS(private.endptr),
+						 LSN_FORMAT_ARGS(endptr),
 						 argv[argc - 1]);
 			goto bad_argument;
 		}
@@ -1040,7 +1035,7 @@ main(int argc, char **argv)
 		waldir = identify_target_directory(waldir, NULL);
 
 	/* we don't know what to print */
-	if (XLogRecPtrIsInvalid(private.startptr))
+	if (XLogRecPtrIsInvalid(startptr))
 	{
 		pg_log_error("no start WAL location given");
 		goto bad_argument;
@@ -1055,27 +1050,37 @@ main(int argc, char **argv)
 	if (!xlogreader_state)
 		fatal_error("out of memory");
 
+	findnext_state =
+		InitXLogFindNextRecord(xlogreader_state, startptr);
+
+	if (!findnext_state)
+		fatal_error("out of memory");
+
 	/* first find a valid recptr to start from */
-	first_record = XLogFindNextRecord(xlogreader_state, private.startptr,
-									  &WALDumpReadPage, (void*) &private);
+	while (XLogFindNextRecord(findnext_state))
+	{
+		if (!WALDumpReadPage(xlogreader_state, timeline, startptr, endptr))
+			break;
+	}
 
+	first_record = findnext_state->currRecPtr;
 	if (first_record == InvalidXLogRecPtr)
 		fatal_error("could not find a valid record after %X/%X",
-					LSN_FORMAT_ARGS(private.startptr));
+					LSN_FORMAT_ARGS(startptr));
 
 	/*
 	 * Display a message that we're skipping data if `from` wasn't a pointer
 	 * to the start of a record and also wasn't a pointer to the beginning of
 	 * a segment (e.g. we were used in file mode).
 	 */
-	if (first_record != private.startptr &&
-		XLogSegmentOffset(private.startptr, WalSegSz) != 0)
+	if (first_record != startptr &&
+		XLogSegmentOffset(startptr, WalSegSz) != 0)
 		printf(ngettext("first record is after %X/%X, at %X/%X, skipping over %u byte\n",
 						"first record is after %X/%X, at %X/%X, skipping over %u bytes\n",
-						(first_record - private.startptr)),
-			   LSN_FORMAT_ARGS(private.startptr),
+						(first_record - startptr)),
+			   LSN_FORMAT_ARGS(startptr),
 			   LSN_FORMAT_ARGS(first_record),
-			   (uint32) (first_record - private.startptr));
+			   (uint32) (first_record - startptr));
 
 	for (;;)
 	{
@@ -1083,13 +1088,13 @@ main(int argc, char **argv)
 		while (XLogReadRecord(xlogreader_state, &record, &errormsg) ==
 			   XLREAD_NEED_DATA)
 		{
-			if (!WALDumpReadPage(xlogreader_state, (void *) &private))
+			if (!WALDumpReadPage(xlogreader_state, timeline, startptr, endptr))
 				break;
 		}
 
 		if (!record)
 		{
-			if (!config.follow || private.endptr_reached)
+			if (!config.follow)
 				break;
 			else
 			{
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 1492f1992d..d8cb488820 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -56,6 +56,7 @@ typedef struct WALSegmentContext
 } WALSegmentContext;
 
 typedef struct XLogReaderState XLogReaderState;
+typedef struct XLogFindNextRecordState XLogFindNextRecordState;
 
 /* Function type definition for the segment cleanup callback */
 typedef void (*WALSegmentCleanupCB) (XLogReaderState *xlogreader);
@@ -240,6 +241,13 @@ struct XLogReaderState
 	char	   *errormsg_buf;
 };
 
+struct XLogFindNextRecordState
+{
+	XLogReaderState *reader_state;
+	XLogRecPtr		targetRecPtr;
+	XLogRecPtr		currRecPtr;
+};
+
 /* Get a new XLogReader */
 extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
 										   const char *waldir,
@@ -254,8 +262,8 @@ extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 /* Function type definition for the read_page callback */
 typedef bool (*XLogFindNextRecordCB) (XLogReaderState *xlogreader,
 									  void *private);
-extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr,
-									 XLogFindNextRecordCB read_page, void *private);
+extern XLogFindNextRecordState *InitXLogFindNextRecord(XLogReaderState *reader_state, XLogRecPtr start_ptr);
+extern bool XLogFindNextRecord(XLogFindNextRecordState *state);
 #endif							/* FRONTEND */
 
 /* Read the next XLog record. Returns NULL on end-of-WAL or failure */
-- 
2.30.1
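
For readability, here is the new XLogFindNextRecord() usage from the
pg_waldump hunk above, pulled out of the diff: the search state is
initialised once, and the caller keeps supplying pages until the search no
longer asks for data.

    XLogFindNextRecordState *findnext_state;
    XLogRecPtr  first_record;

    findnext_state = InitXLogFindNextRecord(xlogreader_state, startptr);
    if (!findnext_state)
        fatal_error("out of memory");

    /* XLogFindNextRecord() returns true while it still needs more data. */
    while (XLogFindNextRecord(findnext_state))
    {
        if (!WALDumpReadPage(xlogreader_state, timeline, startptr, endptr))
            break;
    }

    first_record = findnext_state->currRecPtr;  /* InvalidXLogRecPtr if not found */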

Attachment: v17-0005-Split-readLen-and-reqLen-of-XLogReaderState.patch (text/x-patch)
From 860c2b19b783c39b5d68d523e4e7b542da5db978 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 7 Apr 2021 16:38:21 +0900
Subject: [PATCH v17 05/10] Split readLen and reqLen of XLogReaderState.

The variable was used as both an in and an out parameter of page-read
functions.  Separate it into two variables according to those roles.  To
avoid confusion between the two, provide a setter function that page-read
functions use to set readLen.
---
 src/backend/access/transam/xlog.c       | 10 +++++-----
 src/backend/access/transam/xlogreader.c | 17 ++++++++---------
 src/backend/access/transam/xlogutils.c  |  6 +++---
 src/backend/replication/walsender.c     |  6 +++---
 src/bin/pg_rewind/parsexlog.c           | 12 +++++++-----
 src/bin/pg_waldump/pg_waldump.c         |  6 +++---
 src/include/access/xlogreader.h         | 23 ++++++++++++++++-------
 7 files changed, 45 insertions(+), 35 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9a9835f05d..d3d6fb4643 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12101,7 +12101,7 @@ XLogPageRead(XLogReaderState *state,
 {
 	char *readBuf				= state->readBuf;
 	XLogRecPtr	targetPagePtr	= state->readPagePtr;
-	int			reqLen			= state->readLen;
+	int			reqLen			= state->reqLen;
 	int			readLen			= 0;
 	XLogRecPtr	targetRecPtr	= state->ReadRecPtr;
 	uint32		targetPageOff;
@@ -12153,7 +12153,7 @@ retry:
 				close(readFile);
 			readFile = -1;
 			readSource = XLOG_FROM_ANY;
-			state->readLen = -1;
+			XLogReaderNotifySize(state, -1);
 			return false;
 		}
 	}
@@ -12208,7 +12208,7 @@ retry:
 	pgstat_report_wait_end();
 
 	Assert(targetSegNo == state->seg.ws_segno);
-	Assert(reqLen <= readLen);
+	Assert(readLen >= reqLen);
 
 	state->seg.ws_tli = curFileTLI;
 
@@ -12246,7 +12246,7 @@ retry:
 	}
 
 	Assert(state->readPagePtr == targetPagePtr);
-	state->readLen = readLen;
+	XLogReaderNotifySize(state, readLen);
 	return true;
 
 next_record_is_invalid:
@@ -12261,7 +12261,7 @@ next_record_is_invalid:
 	if (StandbyMode)
 		goto retry;
 
-	state->readLen = -1;
+	XLogReaderNotifySize(state, -1);
 	return false;
 }
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 89c59843b9..7d1b5b50e6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -107,7 +107,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	WALOpenSegmentInit(&state->seg, &state->segcxt, wal_segment_size,
 					   waldir);
 
-	/* ReadRecPtr, EndRecPtr and readLen initialized to zeroes above */
+	/* ReadRecPtr, EndRecPtr, reqLen and readLen initialized to zeroes above */
 	state->errormsg_buf = palloc_extended(MAX_ERRORMSG_LEN + 1,
 										  MCXT_ALLOC_NO_OOM);
 	if (!state->errormsg_buf)
@@ -261,12 +261,12 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  * record being stored in *record. Otherwise *record is NULL.
  *
  * Returns XLREAD_NEED_DATA if more data is needed to finish reading the
- * current record.  In that case, state->readPagePtr and state->readLen inform
+ * current record.  In that case, state->readPagePtr and state->reqLen inform
  * the desired position and minimum length of data needed. The caller shall
  * read in the requested data and set state->readBuf to point to a buffer
  * containing it. The caller must also set state->seg->ws_tli and
  * state->readLen to indicate the timeline that it was read from, and the
- * length of data that is now available (which must be >= given readLen),
+ * length of data that is now available (which must be >= given reqLen),
  * respectively.
  *
  * If invalid data is encountered, returns XLREAD_FAIL with *record being set to
@@ -630,7 +630,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 					 * XLogNeedData should have ensured that the whole page
 					 * header was read
 					 */
-					Assert(state->readLen >= pageHeaderSize);
+					Assert(pageHeaderSize <= state->readLen);
 
 					contdata = (char *) state->readBuf + pageHeaderSize;
 					record_len = XLOG_BLCKSZ - pageHeaderSize;
@@ -643,7 +643,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 					 * XLogNeedData should have ensured all needed data was
 					 * read
 					 */
-					Assert(state->readLen >= request_len);
+					Assert(request_len <= state->readLen);
 
 					memcpy(state->readRecordBuf + state->recordGotLen,
 						   (char *) contdata, record_len);
@@ -696,7 +696,6 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->segcxt.ws_segsize);
 	}
 
-	Assert(!*record || state->readLen >= 0);
 	if (DecodeXLogRecord(state, *record, errormsg))
 		return XLREAD_SUCCESS;
 
@@ -763,7 +762,7 @@ XLogNeedData(XLogReaderState *state, XLogRecPtr pageptr, int reqLen,
 		/* Request more data if we don't have the full header. */
 		if (state->readLen < pageHeaderSize)
 		{
-			state->readLen = pageHeaderSize;
+			state->reqLen = pageHeaderSize;
 			return true;
 		}
 
@@ -840,7 +839,7 @@ XLogNeedData(XLogReaderState *state, XLogRecPtr pageptr, int reqLen,
 		 * will not come back here, but will request the actual target page.
 		 */
 		state->readPagePtr = pageptr - targetPageOff;
-		state->readLen = XLOG_BLCKSZ;
+		state->reqLen = XLOG_BLCKSZ;
 		return true;
 	}
 
@@ -849,7 +848,7 @@ XLogNeedData(XLogReaderState *state, XLogRecPtr pageptr, int reqLen,
 	 * header so that we can validate it.
 	 */
 	state->readPagePtr = pageptr;
-	state->readLen = Max(reqLen + addLen, SizeOfXLogShortPHD);
+	state->reqLen = Max(reqLen + addLen, SizeOfXLogShortPHD);
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b003990745..61361192e7 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -828,7 +828,7 @@ bool
 read_local_xlog_page(XLogReaderState *state)
 {
 	XLogRecPtr	targetPagePtr = state->readPagePtr;
-	int			reqLen		  = state->readLen;
+	int			reqLen		  = state->reqLen;
 	char	   *cur_page	  = state->readBuf;
 	XLogRecPtr	read_upto,
 				loc;
@@ -928,7 +928,7 @@ read_local_xlog_page(XLogReaderState *state)
 	else if (targetPagePtr + reqLen > read_upto)
 	{
 		/* not enough data there */
-		state->readLen = -1;
+		XLogReaderNotifySize(state,  -1);
 		return false;
 	}
 	else
@@ -948,7 +948,7 @@ read_local_xlog_page(XLogReaderState *state)
 
 	/* number of valid bytes in the buffer */
 	state->readPagePtr = targetPagePtr;
-	state->readLen = count;
+	XLogReaderNotifySize(state, count);
 	return true;
 }
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index b024bbc3cd..31522c8cc2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -807,7 +807,7 @@ static bool
 logical_read_xlog_page(XLogReaderState *state)
 {
 	XLogRecPtr		targetPagePtr = state->readPagePtr;
-	int				reqLen		  = state->readLen;
+	int				reqLen		  = state->reqLen;
 	char		   *cur_page	  = state->readBuf;
 	XLogRecPtr	flushptr;
 	int			count;
@@ -826,7 +826,7 @@ logical_read_xlog_page(XLogReaderState *state)
 	/* fail if not (implies we are going to shut down) */
 	if (flushptr < targetPagePtr + reqLen)
 	{
-		state->readLen = -1;
+		XLogReaderNotifySize(state, -1);
 		return false;
 	}
 
@@ -856,7 +856,7 @@ logical_read_xlog_page(XLogReaderState *state)
 	XLByteToSeg(targetPagePtr, segno, state->segcxt.ws_segsize);
 	CheckXLogRemoved(segno, state->seg.ws_tli);
 
-	state->readLen = count;
+	XLogReaderNotifySize(state, count);
 	return true;
 }
 
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 712c85281c..707493dddf 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -254,6 +254,8 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, const char *datadir,
 	XLogSegNo	targetSegNo;
 	int			r;
 
+	Assert(xlogreader->reqLen <= XLOG_BLCKSZ);
+
 	XLByteToSeg(targetPagePtr, targetSegNo, WalSegSz);
 	XLogSegNoOffsetToRecPtr(targetSegNo + 1, 0, WalSegSz, targetSegEnd);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, WalSegSz);
@@ -304,7 +306,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, const char *datadir,
 			if (restoreCommand == NULL)
 			{
 				pg_log_error("could not open file \"%s\": %m", xlogfpath);
-				xlogreader->readLen = -1;
+				XLogReaderNotifySize(xlogreader, -1);
 				return false;
 			}
 
@@ -319,7 +321,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, const char *datadir,
 
 			if (xlogreadfd < 0)
 			{
-				xlogreader->readLen = -1;
+				XLogReaderNotifySize(xlogreader, -1);
 				return false;
 			}
 			else
@@ -337,7 +339,7 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, const char *datadir,
 	if (lseek(xlogreadfd, (off_t) targetPageOff, SEEK_SET) < 0)
 	{
 		pg_log_error("could not seek in file \"%s\": %m", xlogfpath);
-		xlogreader->readLen = -1;
+		XLogReaderNotifySize(xlogreader, -1);
 		return false;
 	}
 
@@ -351,14 +353,14 @@ SimpleXLogPageRead(XLogReaderState *xlogreader, const char *datadir,
 			pg_log_error("could not read file \"%s\": read %d of %zu",
 						 xlogfpath, r, (Size) XLOG_BLCKSZ);
 
-		xlogreader->readLen = -1;
+		XLogReaderNotifySize(xlogreader, -1);
 		return false;
 	}
 
 	Assert(targetSegNo == xlogreadsegno);
 
 	xlogreader->seg.ws_tli = targetHistory[*tliIndex].tli;
-	xlogreader->readLen = XLOG_BLCKSZ;
+	XLogReaderNotifySize(xlogreader, XLOG_BLCKSZ);
 	return true;
 }
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index ab2d079bdb..82af398d20 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -332,7 +332,7 @@ WALDumpReadPage(XLogReaderState *state, TimeLineID timeline,
 				XLogRecPtr startptr, XLogRecPtr endptr)
 {
 	XLogRecPtr	targetPagePtr = state->readPagePtr;
-	int			reqLen		  = state->readLen;
+	int			reqLen		  = state->reqLen;
 	char	   *readBuff	  = state->readBuf;
 	int			count = XLOG_BLCKSZ;
 	WALReadError errinfo;
@@ -347,7 +347,7 @@ WALDumpReadPage(XLogReaderState *state, TimeLineID timeline,
 		else
 		{
 			/* Notify xlogreader that we didn't read at all */
-			state->readLen = -1;
+			XLogReaderNotifySize(state,  -1);
 			return false;
 		}
 	}
@@ -377,7 +377,7 @@ WALDumpReadPage(XLogReaderState *state, TimeLineID timeline,
 	}
 
 	/* Notify xlogreader of how many bytes we have read */
-	state->readLen = count;
+	XLogReaderNotifySize(state, count);
 	return true;
 }
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index d8cb488820..836ca7fce8 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -149,19 +149,21 @@ struct XLogReaderState
 
 	/* ----------------------------------------
 	 * Communication with page reader
-	 * readBuf is XLOG_BLCKSZ bytes, valid up to at least readLen bytes.
+	 * readBuf is XLOG_BLCKSZ bytes, valid up to at least reqLen bytes.
 	 *  ----------------------------------------
 	 */
-	/* variables to communicate with page reader */
+	/* variables xlogreader uses to pass information to its callers */
 	XLogRecPtr	readPagePtr;	/* page pointer to read */
-	int32		readLen;		/* bytes requested to reader, or actual bytes
-								 * read by reader, which must be larger than
-								 * the request, or -1 on error */
+	int32		reqLen;			/* bytes requested to the caller */
 	char	   *readBuf;		/* buffer to store data */
 	bool		page_verified;	/* is the page header on the buffer verified? */
-	bool		record_verified;	/* is the current record header verified? */
-
+	bool		record_verified;/* is the current record header verified? */
 
+	/* variables the callers use to respond to xlogreader */
+	int32		readLen;		/* actual bytes read by the reader, which must
+								 * be at least as large as the request, or -1
+								 * on error.  Use XLogReaderNotifySize() to
+								 * set it. */
 
 	/* ----------------------------------------
 	 * Decoded representation of current record
@@ -248,6 +250,13 @@ struct XLogFindNextRecordState
 	XLogRecPtr		currRecPtr;
 };
 
+/* setter function of XLogReaderState used by other modules */
+static inline void
+XLogReaderNotifySize(XLogReaderState *state, int32 len)
+{
+	state->readLen = len;
+}
+
 /* Get a new XLogReader */
 extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
 										   const char *waldir,
-- 
2.30.1
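
With this split, the page-read sketch shown after the 0003 patch changes in
just two places: the request is now read from state->reqLen, and the answer
goes back through XLogReaderNotifySize() rather than by assigning readLen
directly.  Again a sketch with the same placeholder helpers, not patch code:

    static bool
    my_read_page(XLogReaderState *state)
    {
        XLogRecPtr  targetPagePtr = state->readPagePtr; /* in: page to read */
        int         reqLen = state->reqLen;             /* in: minimum bytes needed */

        Assert(reqLen <= XLOG_BLCKSZ);

        if (!fetch_wal_bytes(targetPagePtr, state->readBuf))    /* placeholder */
        {
            XLogReaderNotifySize(state, -1);        /* out: report failure */
            return false;
        }

        state->seg.ws_tli = my_timeline;            /* placeholder */
        XLogReaderNotifySize(state, XLOG_BLCKSZ);   /* out: bytes supplied */
        return true;
    }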

Attachment: v17-0006-fixup.patch (text/x-patch)
From e312dab6a4c5843abff40dcd57b659653c07e590 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 7 Apr 2021 22:56:28 +1200
Subject: [PATCH v17 06/10] fixup

---
 src/backend/access/transam/xlogreader.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7d1b5b50e6..af849723ec 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1136,7 +1136,7 @@ InitXLogFindNextRecord(XLogReaderState *reader_state, XLogRecPtr start_ptr)
 bool
 XLogFindNextRecord(XLogFindNextRecordState *state)
 {
-	XLogRecPtr	found = InvalidXLogRecPtr;
+	//XLogRecPtr	found = InvalidXLogRecPtr;
 	XLogPageHeader header;
 	XLogRecord *record;
 	XLogReadRecordResult result;
@@ -1231,7 +1231,7 @@ XLogFindNextRecord(XLogFindNextRecordState *state)
 		{
 			/* Rewind the reader to the beginning of the last record. */
 			state->currRecPtr = state->reader_state->ReadRecPtr;
-			XLogBeginRead(state->reader_state, found);
+			XLogBeginRead(state->reader_state, state->currRecPtr);
 			return false;
 		}
 	}
-- 
2.30.1

Attachment: v17-0007-Add-circular-WAL-decoding-buffer.patch (text/x-patch)
From 8891fb62b054e9a64717a1da5349a5726026c80f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 7 Apr 2021 03:26:30 +1200
Subject: [PATCH v17 07/10] Add circular WAL decoding buffer.

Teach xlogreader.c to decode its output into a circular buffer, to
support a future prefetching patch.

 * XLogReadRecord() works almost as before, except that it returns a
   pointer to DecodedXLogRecord instead of XLogRecord

 * XLogReadAhead() implements a second cursor that allows you to read
   further ahead, as long as there is enough space in the decoding
   buffer

To support existing callers of XLogReadRecord(), the most recently
returned record also becomes the "current" record, for the purpose of
calls to XLogRecGetXXX() macros and functions, so that the multi-record
nature of the WAL decoder is hidden from all the redo routines that
don't need to care about this change.

The buffer's size is controlled with wal_decode_buffer_size.  Large
records that don't fit in the circular buffer are marked as "oversized"
and allocated separately.

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/access/transam/generic_xlog.c |   6 +-
 src/backend/access/transam/xlog.c         |  28 +-
 src/backend/access/transam/xlogreader.c   | 785 +++++++++++++++++-----
 src/backend/access/transam/xlogutils.c    |   2 +-
 src/backend/replication/logical/decode.c  |   2 +-
 src/bin/pg_rewind/parsexlog.c             |   2 +-
 src/bin/pg_waldump/pg_waldump.c           |  22 +-
 src/include/access/xlogreader.h           | 127 +++-
 8 files changed, 753 insertions(+), 221 deletions(-)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 63301a1ab1..0e9bcc7159 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -482,10 +482,10 @@ generic_redo(XLogReaderState *record)
 	uint8		block_id;
 
 	/* Protect limited size of buffers[] array */
-	Assert(record->max_block_id < MAX_GENERIC_XLOG_PAGES);
+	Assert(XLogRecMaxBlockId(record) < MAX_GENERIC_XLOG_PAGES);
 
 	/* Iterate over blocks */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		XLogRedoAction action;
 
@@ -525,7 +525,7 @@ generic_redo(XLogReaderState *record)
 	}
 
 	/* Changes are done: unlock and release all buffers */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (BufferIsValid(buffers[block_id]))
 			UnlockReleaseBuffer(buffers[block_id]);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d3d6fb4643..e7c789b6b9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1209,6 +1209,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		StringInfoData recordBuf;
 		char	   *errormsg = NULL;
 		MemoryContext oldCxt;
+		DecodedXLogRecord *decoded;
 
 		oldCxt = MemoryContextSwitchTo(walDebugCxt);
 
@@ -1224,6 +1225,9 @@ XLogInsertRecord(XLogRecData *rdata,
 		for (; rdata != NULL; rdata = rdata->next)
 			appendBinaryStringInfo(&recordBuf, rdata->data, rdata->len);
 
+		/* How much space would it take to decode this record? */
+		decoded = palloc(DecodeXLogRecordRequiredSpace(recordBuf.len));
+
 		if (!debug_reader)
 			debug_reader = XLogReaderAllocate(wal_segment_size, NULL, NULL);
 
@@ -1231,7 +1235,9 @@ XLogInsertRecord(XLogRecData *rdata,
 		{
 			appendStringInfoString(&buf, "error decoding record: out of memory");
 		}
-		else if (!DecodeXLogRecord(debug_reader, (XLogRecord *) recordBuf.data,
+		else if (!DecodeXLogRecord(debug_reader, decoded,
+								   (XLogRecord *) recordBuf.data,
+								   EndPos,
 								   &errormsg))
 		{
 			appendStringInfo(&buf, "error decoding record: %s",
@@ -1240,10 +1246,17 @@ XLogInsertRecord(XLogRecData *rdata,
 		else
 		{
 			appendStringInfoString(&buf, " - ");
+			/*
+			 * Temporarily make this decoded record the current record for
+			 * XLogRecGetXXX() macros.
+			 */
+			debug_reader->record = decoded;
 			xlog_outdesc(&buf, debug_reader);
+			debug_reader->record = NULL;
 		}
 		elog(LOG, "%s", buf.data);
 
+		pfree(decoded);
 		pfree(buf.data);
 		pfree(recordBuf.data);
 		MemoryContextSwitchTo(oldCxt);
@@ -1417,7 +1430,7 @@ checkXLogConsistency(XLogReaderState *record)
 
 	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		Buffer		buf;
 		Page		page;
@@ -4382,6 +4395,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, bool fetching_ckpt)
 
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
+
 		if (record == NULL)
 		{
 			if (readFile >= 0)
@@ -10299,7 +10313,7 @@ xlog_redo(XLogReaderState *record)
 		 * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info
 		 * code just to distinguish them for statistics purposes.
 		 */
-		for (uint8 block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (uint8 block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			Buffer		buffer;
 
@@ -10434,7 +10448,7 @@ xlog_block_info(StringInfo buf, XLogReaderState *record)
 	int			block_id;
 
 	/* decode block references */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
@@ -12103,7 +12117,7 @@ XLogPageRead(XLogReaderState *state,
 	XLogRecPtr	targetPagePtr	= state->readPagePtr;
 	int			reqLen			= state->reqLen;
 	int			readLen			= 0;
-	XLogRecPtr	targetRecPtr	= state->ReadRecPtr;
+	XLogRecPtr	targetRecPtr	= state->DecodeRecPtr;
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
@@ -12121,6 +12135,9 @@ XLogPageRead(XLogReaderState *state,
 		/*
 		 * Request a restartpoint if we've replayed too much xlog since the
 		 * last one.
+		 *
+		 * XXX Why is this here?  Move it to recovery loop, since it's based
+		 * on replay position, not read position?
 		 */
 		if (bgwriterLaunched)
 		{
@@ -12612,6 +12629,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * be updated on each cycle. When we are behind,
 					 * XLogReceiptTime will not advance, so the grace time
 					 * allotted to conflicting queries will decrease.
+					 *
 					 */
 					if (RecPtr < flushedUpto)
 						havedata = true;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index af849723ec..653255b9c9 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -38,6 +38,9 @@ static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static bool XLogNeedData(XLogReaderState *state, XLogRecPtr pageptr,
 						 int reqLen, bool header_inclusive);
+size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+static XLogReadRecordResult XLogDecodeOneRecord(XLogReaderState *state,
+												bool allow_oversized);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record);
@@ -50,6 +53,8 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 /* size of the buffer allocated for error message. */
 #define MAX_ERRORMSG_LEN 1000
 
+#define DEFAULT_DECODE_BUFFER_SIZE 0x10000
+
 /*
  * Construct a string in state->errormsg_buf explaining what's wrong with
  * the current record being read.
@@ -64,6 +69,8 @@ report_invalid_record(XLogReaderState *state, const char *fmt,...)
 	va_start(args, fmt);
 	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
 	va_end(args);
+
+	state->errormsg_deferred = true;
 }
 
 /*
@@ -86,8 +93,6 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	/* initialize caller-provided support functions */
 	state->cleanup_cb = cleanup_cb;
 
-	state->max_block_id = -1;
-
 	/*
 	 * Permanently allocate readBuf.  We do it this way, rather than just
 	 * making a static array, for two reasons: (1) no need to waste the
@@ -136,18 +141,11 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 void
 XLogReaderFree(XLogReaderState *state)
 {
-	int			block_id;
-
 	if (state->seg.ws_file >= 0)
 		state->cleanup_cb(state);
 
-	for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
-	{
-		if (state->blocks[block_id].data)
-			pfree(state->blocks[block_id].data);
-	}
-	if (state->main_data)
-		pfree(state->main_data);
+	if (state->decode_buffer && state->free_decode_buffer)
+		pfree(state->decode_buffer);
 
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
@@ -156,6 +154,22 @@ XLogReaderFree(XLogReaderState *state)
 	pfree(state);
 }
 
+/*
+ * Set the size of the decoding buffer.  A pointer to a caller supplied memory
+ * region may also be passed in, in which case non-oversized records will be
+ * decoded there.
+ */
+void
+XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size)
+{
+	Assert(state->decode_buffer == NULL);
+
+	state->decode_buffer = buffer;
+	state->decode_buffer_size = size;
+	state->decode_buffer_head = buffer;
+	state->decode_buffer_tail = buffer;
+}
+
 /*
  * Allocate readRecordBuf to fit a record of at least the given length.
  * Returns true if successful, false if out of memory.
@@ -243,22 +257,123 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
 
 	/* Begin at the passed-in record pointer. */
 	state->EndRecPtr = RecPtr;
+	state->NextRecPtr = RecPtr;
 	state->ReadRecPtr = InvalidXLogRecPtr;
+	state->DecodeRecPtr = InvalidXLogRecPtr;
 	state->readRecordState = XLREAD_NEXT_RECORD;
 }
 
 /*
- * Attempt to read an XLOG record.
- *
- * XLogBeginRead() or XLogFindNextRecord() must be called before the first call
- * to XLogReadRecord().
+ * See if we can release the last record that was returned by
+ * XLogReadRecord(), to free up space.
+ */
+static void
+XLogReleasePreviousRecord(XLogReaderState *state)
+{
+	DecodedXLogRecord *record;
+
+	/*
+	 * Remove it from the decoded record queue.  It must be the oldest
+	 * item decoded, decode_queue_tail.
+	 */
+	record = state->record;
+	Assert(record == state->decode_queue_tail);
+	state->record = NULL;
+	state->decode_queue_tail = record->next;
+
+	/* It might also be the newest item decoded, decode_queue_head. */
+	if (state->decode_queue_head == record)
+		state->decode_queue_head = NULL;
+
+	/* Release the space. */
+	if (unlikely(record->oversized))
+	{
+		/* It's not in the decode buffer, so free it to release space. */
+		pfree(record);
+	}
+	else
+	{
+		/* It must be the tail record in the decode buffer. */
+		Assert(state->decode_buffer_tail == (char *) record);
+
+		/*
+		 * We need to update tail to point to the next record that is in the
+		 * decode buffer, if any, being careful to skip oversized ones
+		 * (they're not in the decode buffer).
+		 */
+		record = record->next;
+		while (unlikely(record && record->oversized))
+			record = record->next;
+
+		if (record)
+		{
+			/* Adjust tail to release space up to the next record. */
+			state->decode_buffer_tail = (char *) record;
+		}
+		else if (state->decoding && !state->decoding->oversized)
+		{
+			/*
+			 * We're releasing the last fully decoded record in
+			 * XLogReadRecord(), but some time earlier we partially decoded a
+			 * record in XLogReadAhead() and were unable to complete the job.
+			 * We'll set the buffer head and tail to point to the record we
+			 * started working on, so that we can continue (perhaps from a
+			 * different source).
+			 */
+			state->decode_buffer_tail = (char *) state->decoding;
+			state->decode_buffer_head = (char *) state->decoding;
+		}
+		else
+		{
+			/*
+			 * Otherwise we might as well just reset head and tail to the
+			 * start of the buffer space, because we're empty.  This means
+			 * we'll keep overwriting the same piece of memory if we're not
+			 * doing any prefetching.
+			 */
+			state->decode_buffer_tail = state->decode_buffer;
+			state->decode_buffer_head = state->decode_buffer;
+		}
+	}
+}
+
+/*
+ * Similar to XLogNextRecord(), but this traditional interface is for code
+ * that just wants the header, not the decoded record.  Callers can access the
+ * decoded record through the XLogRecGetXXX() macros.
+ */
+XLogReadRecordResult
+XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
+{
+	XLogReadRecordResult result;
+	DecodedXLogRecord *decoded;
+
+	/* Consume the next decoded record. */
+	result = XLogNextRecord(state, &decoded, errormsg);
+	if (result == XLREAD_SUCCESS)
+	{
+		/*
+		 * The traditional interface just returns the header, not the decoded
+		 * record.  The caller will access the decoded record through the
+		 * XLogRecGetXXX() macros.
+		 */
+		*record = &decoded->header;
+	}
+	else
+		*record = NULL;
+	return result;
+}
+
+/*
+ * Consume the next record.  XLogBeginRead() or XLogFindNextRecord() must be
+ * called before the first call to XLogNextRecord().
  *
  * This function may return XLREAD_NEED_DATA several times before returning a
  * result record. The caller shall read in some new data then call this
  * function again with the same parameters.
  *
  * When a record is successfully read, returns XLREAD_SUCCESS with result
- * record being stored in *record. Otherwise *record is NULL.
+ * record being stored in *record.  Otherwise *record is set to NULL.
  *
  * Returns XLREAD_NEED_DATA if more data is needed to finish reading the
  * current record.  In that case, state->readPagePtr and state->reqLen inform
@@ -269,69 +384,306 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  * length of data that is now available (which must be >= given reqLen),
  * respectively.
  *
- * If invalid data is encountered, returns XLREAD_FAIL with *record being set to
- * NULL. *errormsg is set to a string with details of the failure.
- * The returned pointer (or *errormsg) points to an internal buffer that's
- * valid until the next call to XLogReadRecord.
+ * Returns XLREAD_FULL if allow_oversized is false and no space is available.
+ * This is intended for readahead.
+ *
+ * If invalid data is encountered, returns XLREAD_FAIL with *record being set
+ * to NULL.  *errormsg is set to a string with details of the failure.  The
+ * returned pointer (or *errormsg) points to an internal buffer that's valid
+ * until the next call to XLogReadRecord.
+ *
+ */
+XLogReadRecordResult
+XLogNextRecord(XLogReaderState *state,
+			   DecodedXLogRecord **record,
+			   char **errormsg)
+{
+	/* Release the space occupied by the last record we returned. */
+	if (state->record)
+		XLogReleasePreviousRecord(state);
+
+	for (;;)
+	{
+		XLogReadRecordResult result;
+
+		/* We can now return the oldest item in the queue, if there is one. */
+		if (state->decode_queue_tail)
+		{
+			/*
+			 * Record this as the most recent record returned, so that we'll
+			 * release it next time.  This also exposes it to the
+			 * XLogRecXXX(decoder) macros, which pass in the decoder rather
+			 * than the record for historical reasons.
+			 */
+			state->record = state->decode_queue_tail;
+
+			/*
+			 * It should be immediately after the last record returned by
+			 * XLogReadRecord(), or at the position set by XLogBeginRead() if
+			 * XLogReadRecord() hasn't been called yet.  It may be after a
+			 * page header, though.
+			 */
+			Assert(state->record->lsn == state->EndRecPtr ||
+				   (state->EndRecPtr % XLOG_BLCKSZ == 0 &&
+					(state->record->lsn == state->EndRecPtr + SizeOfXLogShortPHD ||
+					 state->record->lsn == state->EndRecPtr + SizeOfXLogLongPHD)));
+
+			/*
+			 * Set ReadRecPtr and EndRecPtr to correspond to that
+			 * record.
+			 *
+			 * Calling code could access these through the returned decoded
+			 * record, but for now we'll update them directly here, for the
+			 * benefit of all the existing code that accesses these variables
+			 * directly.
+			 */
+			state->ReadRecPtr = state->record->lsn;
+			state->EndRecPtr = state->record->next_lsn;
+
+			*errormsg = NULL;
+			*record = state->record;
+
+			return XLREAD_SUCCESS;
+		}
+		else if (state->errormsg_deferred)
+		{
+			/*
+			 * If we've run out of records, but we have a deferred error, now
+			 * is the time to report it.
+			 */
+			state->errormsg_deferred = false;
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			else
+				*errormsg = NULL;
+			*record = NULL;
+			state->EndRecPtr = state->DecodeRecPtr;
+
+			return XLREAD_FAIL;
+		}
+
+		/* We need to get a decoded record into our queue first. */
+		result = XLogDecodeOneRecord(state, true /* allow_oversized */ );
+		switch(result)
+		{
+		case XLREAD_NEED_DATA:
+			*errormsg = NULL;
+			*record = NULL;
+			return result;
+		case XLREAD_SUCCESS:
+			Assert(state->decode_queue_tail != NULL);
+			break;
+		case XLREAD_FULL:
+			/* Not expected because we passed allow_oversized = true */
+			Assert(false);
+			break;
+		case XLREAD_FAIL:
+			/*
+			 * If that produced neither a queued record nor a queued error,
+			 * then we're at the end (for example, archive recovery with no
+			 * more files available).
+			 */
+			Assert(state->decode_queue_tail == NULL);
+			if (!state->errormsg_deferred)
+			{
+				state->EndRecPtr = state->DecodeRecPtr;
+				*errormsg = NULL;
+				*record = NULL;
+				return result;
+			}
+			break;
+		}
+	}
+
+	/* unreachable */
+	return XLREAD_FAIL;
+}
+
+/*
+ * Try to decode the next available record.  The decoded record will also be
+ * returned by a later call to XLogReadRecord().
+ *
+ * In addition to the values that XLogReadRecord() can return, XLogReadAhead()
+ * can also return XLREAD_FULL to indicate that further readahead is not
+ * possible yet due to lack of space.
+ */
+XLogReadRecordResult
+XLogReadAhead(XLogReaderState *state, DecodedXLogRecord **record, char **errormsg)
+{
+	XLogReadRecordResult result;
+
+	/* We stop trying after encountering an error. */
+	if (unlikely(state->errormsg_deferred))
+	{
+		/* We only report the error message the first time, see below. */
+		*errormsg = NULL;
+		return XLREAD_FAIL;
+	}
+
+	/*
+	 * Try to decode one more record, if we have space.  Pass allow_oversized
+	 * = false, so that this call returns fast if the decode buffer is full.
+	 */
+	result = XLogDecodeOneRecord(state, false);
+	switch (result)
+	{
+	case XLREAD_SUCCESS:
+		/* New record at head of decode record queue. */
+		Assert(state->decode_queue_head != NULL);
+		*record = state->decode_queue_head;
+		return result;
+	case XLREAD_FULL:
+		/* No space in circular decode buffer. */
+		return result;
+	case XLREAD_NEED_DATA:
+		/* The caller needs to insert more data. */
+		return result;
+	case XLREAD_FAIL:
+		/* Report the error.  XLogReadRecord() will also report it. */
+		Assert(state->errormsg_deferred);
+		if (state->errormsg_buf[0] != '\0')
+			*errormsg = state->errormsg_buf;
+		return result;
+	}
+
+	/* Unreachable. */
+	return XLREAD_FAIL;
+}
+
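Here is a minimal sketch, not part of the patch, of how a caller might drive
these two entry points: decode ahead until the buffer is full, then consume the
oldest decoded record.  The function name is made up, the usual includes are
assumed, and the page-feeding side of the XLREAD_NEED_DATA protocol is elided.

#include "access/xlogreader.h"

static void
consume_with_readahead(XLogReaderState *state)
{
	DecodedXLogRecord *record;
	char	   *errormsg;

	for (;;)
	{
		/*
		 * Fill the decode queue opportunistically.  XLREAD_FULL just means
		 * "no space for more right now"; a decoding failure is deferred and
		 * reported later, once the queue has drained.
		 */
		while (XLogReadAhead(state, &record, &errormsg) == XLREAD_SUCCESS)
		{
			/* e.g. look at record->blocks[] and issue prefetches here */
		}

		/* Consume the oldest decoded record, if there is one. */
		switch (XLogNextRecord(state, &record, &errormsg))
		{
			case XLREAD_SUCCESS:
				/* replay "record" here */
				break;
			case XLREAD_NEED_DATA:
				/* feed more WAL pages into the reader, then loop */
				break;
			default:
				/* XLREAD_FAIL: errormsg, if set, describes the problem */
				return;
		}
	}
}
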
+/*
+ * Allocate space for a decoded record.  The only member of the returned
+ * object that is initialized is the 'oversized' flag, which indicates whether
+ * the record didn't fit in the decode buffer and must eventually be freed
+ * explicitly.
  *
+ * Return NULL if there is no space in the decode buffer and allow_oversized
+ * is false, or if memory allocation fails for an oversized buffer.
+ */
+static DecodedXLogRecord *
+XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)
+{
+	size_t		required_space = DecodeXLogRecordRequiredSpace(xl_tot_len);
+	DecodedXLogRecord *decoded = NULL;
+
+	/* Allocate a circular decode buffer if we don't have one already. */
+	if (unlikely(state->decode_buffer == NULL))
+	{
+		if (state->decode_buffer_size == 0)
+			state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+		state->decode_buffer = palloc(state->decode_buffer_size);
+		state->decode_buffer_head = state->decode_buffer;
+		state->decode_buffer_tail = state->decode_buffer;
+		state->free_decode_buffer = true;
+	}
+	if (state->decode_buffer_head >= state->decode_buffer_tail)
+	{
+		/* Empty, or head is to the right of tail. */
+		if (state->decode_buffer_head + required_space <=
+			state->decode_buffer + state->decode_buffer_size)
+		{
+			/* There is space between head and end. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+		else if (state->decode_buffer + required_space <
+				 state->decode_buffer_tail)
+		{
+			/* There is space between start and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+	else
+	{
+		/* Head is to the left of tail. */
+		if (state->decode_buffer_head + required_space <
+			state->decode_buffer_tail)
+		{
+			/* There is space between head and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+
+	/* Not enough space in the decode buffer.  Are we allowed to allocate? */
+	if (allow_oversized)
+	{
+		decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
+		if (decoded == NULL)
+			return NULL;
+		decoded->oversized = true;
+		return decoded;
+	}
+
+	return decoded;
+}
+
+/*
+ * Try to read and decode the next record and add it to the head of the
+ * decoded record queue.
  *
- * This function runs a state machine consists of the following states.
+ * If 'allow_oversized' is false, then XLREAD_FULL can be returned to indicate
+ * the decoding buffer is full.
  *
- * XLREAD_NEXT_RECORD :
- *    The initial state, if called with valid RecPtr, try to read a record at
- *    that position.  If invalid RecPtr is given try to read a record just after
- *    the last one previously read.
- *    This state ens after setting ReadRecPtr. Then goes to XLREAD_TOT_LEN.
+ * XLogBeginRead() or XLogFindNextRecord() must be called before the first call
+ * to XLogReadRecord().
+ *
+ * This function runs a state machine consisting of the following states.
+ *
+ * XLREAD_NEXT_RECORD: The initial state.  If called with a valid RecPtr, try
+ *    to read a record at that position.  If an invalid RecPtr is given, try
+ *    to read a record just after the last one previously read.  This state
+ *    ends after setting ReadRecPtr, then goes to XLREAD_TOT_LEN.
  *
- * XLREAD_TOT_LEN:
- *    Examining record header. Ends after reading record total
+ * XLREAD_TOT_LEN: Examining record header.  Ends after reading record total
  *    length. recordRemainLen and recordGotLen are initialized.
  *
- * XLREAD_FIRST_FRAGMENT:
- *    Reading the first fragment. Ends with finishing reading a single
- *    record. Goes to XLREAD_NEXT_RECORD if that's all or
+ * XLREAD_FIRST_FRAGMENT: Reading the first fragment.  Ends with finishing
+ *    reading a single record. Goes to XLREAD_NEXT_RECORD if that's all or
  *    XLREAD_CONTINUATION if we have continuation.
 
- * XLREAD_CONTINUATION:
- *    Reading continuation of record. Ends with finishing the whole record then
- *    goes to XLREAD_NEXT_RECORD. During this state, recordRemainLen indicates
- *    how much is left and readRecordBuf holds the partially assert
- *    record.recordContRecPtr points to the beginning of the next page where to
- *    continue.
+ * XLREAD_CONTINUATION: Reading continuation of record. Ends with finishing
+ *    the whole record then goes to XLREAD_NEXT_RECORD.  During this state,
+ *    recordRemainLen indicates how much is left and readRecordBuf holds the
+ *    partially read record.  recordContRecPtr points to the beginning of the
+ *    next page from which to continue.
  *
  * If wrong data found in any state, the state machine stays at the current
- * state. This behavior allows to continue reading a reacord switching among
- * different souces, while streaming replication.
+ * state.  This behavior allows us to continue reading a record while
+ * switching among different sources during streaming replication.
  */
-XLogReadRecordResult
-XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
+static XLogReadRecordResult
+XLogDecodeOneRecord(XLogReaderState *state, bool allow_oversized)
 {
+	XLogRecord *record;
+	char	   *errormsg; /* not used */
 	XLogRecord *prec;
 
-	*record = NULL;
-
 	/* reset error state */
-	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	record = NULL;
 
 	switch (state->readRecordState)
 	{
 		case XLREAD_NEXT_RECORD:
-			ResetDecoder(state);
+			Assert(!state->decoding);
 
-			if (state->ReadRecPtr != InvalidXLogRecPtr)
+			if (state->DecodeRecPtr != InvalidXLogRecPtr)
 			{
 				/* read the record after the one we just read */
 
 				/*
-				 * EndRecPtr is pointing to end+1 of the previous WAL record.
+				 * NextRecPtr is pointing to end+1 of the previous WAL record.
 				 * If we're at a page boundary, no more records can fit on the
 				 * current page. We must skip over the page header, but we
 				 * can't do that until we've read in the page, since the
 				 * header size is variable.
 				 */
-				state->PrevRecPtr = state->ReadRecPtr;
-				state->ReadRecPtr = state->EndRecPtr;
+				state->PrevRecPtr = state->DecodeRecPtr;
+				state->DecodeRecPtr = state->NextRecPtr;
 			}
 			else
 			{
@@ -341,8 +693,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 				 * In this case, EndRecPtr should already be pointing to a
 				 * valid record starting position.
 				 */
-				Assert(XRecOffIsValid(state->EndRecPtr));
-				state->ReadRecPtr = state->EndRecPtr;
+				Assert(XRecOffIsValid(state->NextRecPtr));
+				state->DecodeRecPtr = state->NextRecPtr;
 
 				/*
 				 * We cannot verify the previous-record pointer when we're
@@ -350,7 +702,6 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 				 * won't try doing that.
 				 */
 				state->PrevRecPtr = InvalidXLogRecPtr;
-				state->EndRecPtr = InvalidXLogRecPtr;	/* to be tidy */
 			}
 
 			state->record_verified = false;
@@ -365,9 +716,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 				uint32		targetRecOff;
 				XLogPageHeader pageHeader;
 
+				Assert(!state->decoding);
+
 				targetPagePtr =
-					state->ReadRecPtr - (state->ReadRecPtr % XLOG_BLCKSZ);
-				targetRecOff = state->ReadRecPtr % XLOG_BLCKSZ;
+					state->DecodeRecPtr - (state->DecodeRecPtr % XLOG_BLCKSZ);
+				targetRecOff = state->DecodeRecPtr % XLOG_BLCKSZ;
 
 				/*
 				 * Check if we have enough data. For the first record in the
@@ -388,13 +741,13 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 				if (targetRecOff == 0)
 				{
 					/* At page start, so skip over page header. */
-					state->ReadRecPtr += pageHeaderSize;
+					state->DecodeRecPtr += pageHeaderSize;
 					targetRecOff = pageHeaderSize;
 				}
 				else if (targetRecOff < pageHeaderSize)
 				{
 					report_invalid_record(state, "invalid record offset at %X/%X",
-										  LSN_FORMAT_ARGS(state->ReadRecPtr));
+									  LSN_FORMAT_ARGS(state->DecodeRecPtr));
 					goto err;
 				}
 
@@ -403,8 +756,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 					targetRecOff == pageHeaderSize)
 				{
 					report_invalid_record(state, "contrecord is requested by %X/%X",
-										  (uint32) (state->ReadRecPtr >> 32),
-										  (uint32) state->ReadRecPtr);
+										  (uint32) (state->DecodeRecPtr >> 32),
+										  (uint32) state->DecodeRecPtr);
 					goto err;
 				}
 
@@ -422,9 +775,26 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 				 * header.
 				 */
 				prec = (XLogRecord *) (state->readBuf +
-									   state->ReadRecPtr % XLOG_BLCKSZ);
+									   state->DecodeRecPtr % XLOG_BLCKSZ);
 				total_len = prec->xl_tot_len;
 
+				/* Find space to decode this record. */
+				Assert(state->decoding == NULL);
+				state->decoding = XLogReadRecordAlloc(state, total_len,
+												  allow_oversized);
+				if (state->decoding == NULL)
+				{
+					/*
+					 * We couldn't get space.  If allow_oversized was true,
+					 * then palloc() must have failed.  Otherwise, report that
+					 * our decoding buffer is full.  This means that we are
+					 * trying to read too far ahead.
+					 */
+					if (allow_oversized)
+						goto err;
+					return XLREAD_FULL;
+				}
+
 				/*
 				 * If the whole record header is on this page, validate it
 				 * immediately.  Otherwise do just a basic sanity check on
@@ -436,7 +806,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 				 */
 				if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 				{
-					if (!ValidXLogRecordHeader(state, state->ReadRecPtr,
+					if (!ValidXLogRecordHeader(state, state->DecodeRecPtr,
 											   state->PrevRecPtr, prec))
 						goto err;
 
@@ -449,7 +819,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 					{
 						report_invalid_record(state,
 											  "invalid record length at %X/%X: wanted %u, got %u",
-											  LSN_FORMAT_ARGS(state->ReadRecPtr),
+										  LSN_FORMAT_ARGS(state->DecodeRecPtr),
 											  (uint32) SizeOfXLogRecord, total_len);
 						goto err;
 					}
@@ -474,13 +844,15 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 				XLogRecPtr	targetPagePtr;
 				uint32		targetRecOff;
 
+				Assert(state->decoding);
+
 				/*
 				 * Wait for the rest of the record on the first page to become
 				 * available
 				 */
 				targetPagePtr =
-					state->ReadRecPtr - (state->ReadRecPtr % XLOG_BLCKSZ);
-				targetRecOff = state->ReadRecPtr % XLOG_BLCKSZ;
+					state->DecodeRecPtr - (state->DecodeRecPtr % XLOG_BLCKSZ);
+				targetRecOff = state->DecodeRecPtr % XLOG_BLCKSZ;
 
 				request_len = Min(targetRecOff + total_len, XLOG_BLCKSZ);
 				record_len = request_len - targetRecOff;
@@ -499,7 +871,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 				/* validate record header if not yet */
 				if (!state->record_verified && record_len >= SizeOfXLogRecord)
 				{
-					if (!ValidXLogRecordHeader(state, state->ReadRecPtr,
+				if (!ValidXLogRecordHeader(state, state->DecodeRecPtr,
 											   state->PrevRecPtr, prec))
 						goto err;
 
@@ -512,15 +884,15 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 					/* Record does not cross a page boundary */
 					Assert(state->record_verified);
 
-					if (!ValidXLogRecord(state, prec, state->ReadRecPtr))
+					if (!ValidXLogRecord(state, prec, state->DecodeRecPtr))
 						goto err;
 
 					state->record_verified = true;	/* to be tidy */
 
 					/* We already checked the header earlier */
-					state->EndRecPtr = state->ReadRecPtr + MAXALIGN(record_len);
+					state->NextRecPtr = state->DecodeRecPtr + MAXALIGN(record_len);
 
-					*record = prec;
+					record = prec;
 					state->readRecordState = XLREAD_NEXT_RECORD;
 					break;
 				}
@@ -539,7 +911,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 					report_invalid_record(state,
 										  "record length %u at %X/%X too long",
 										  total_len,
-										  LSN_FORMAT_ARGS(state->ReadRecPtr));
+										  LSN_FORMAT_ARGS(state->DecodeRecPtr));
 					goto err;
 				}
 
@@ -550,7 +922,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 				state->recordRemainLen -= record_len;
 
 				/* Calculate pointer to beginning of next page */
-				state->recordContRecPtr = state->ReadRecPtr + record_len;
+				state->recordContRecPtr = state->DecodeRecPtr + record_len;
 				Assert(state->recordContRecPtr % XLOG_BLCKSZ == 0);
 
 				state->readRecordState = XLREAD_CONTINUATION;
@@ -567,6 +939,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 				 * we enter this state only if we haven't read the whole
 				 * record.
 				 */
+				Assert(state->decoding);
 				Assert(state->recordRemainLen > 0);
 
 				while (state->recordRemainLen > 0)
@@ -586,7 +959,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 						return XLREAD_NEED_DATA;
 
 					if (!state->page_verified)
-						goto err;
+					goto err_continue;
 
 					Assert(SizeOfXLogShortPHD <= state->readLen);
 
@@ -599,8 +972,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 											  "there is no contrecord flag at %X/%X reading %X/%X",
 											  (uint32) (state->recordContRecPtr >> 32),
 											  (uint32) state->recordContRecPtr,
-											  (uint32) (state->ReadRecPtr >> 32),
-											  (uint32) state->ReadRecPtr);
+											  (uint32) (state->DecodeRecPtr >> 32),
+											  (uint32) state->DecodeRecPtr);
 						goto err;
 					}
 
@@ -617,8 +990,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 											  pageHeader->xlp_rem_len,
 											  (uint32) (state->recordContRecPtr >> 32),
 											  (uint32) state->recordContRecPtr,
-											  (uint32) (state->ReadRecPtr >> 32),
-											  (uint32) state->ReadRecPtr,
+											  (uint32) (state->DecodeRecPtr >> 32),
+											  (uint32) state->DecodeRecPtr,
 											  state->recordRemainLen);
 						goto err;
 					}
@@ -654,7 +1027,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 					if (!state->record_verified)
 					{
 						Assert(state->recordGotLen >= SizeOfXLogRecord);
-						if (!ValidXLogRecordHeader(state, state->ReadRecPtr,
+						if (!ValidXLogRecordHeader(state, state->DecodeRecPtr,
 												   state->PrevRecPtr,
 												   (XLogRecord *) state->readRecordBuf))
 							goto err;
@@ -671,16 +1044,17 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 
 				/* targetPagePtr is pointing the last-read page here */
 				prec = (XLogRecord *) state->readRecordBuf;
-				if (!ValidXLogRecord(state, prec, state->ReadRecPtr))
+				if (!ValidXLogRecord(state, prec, state->DecodeRecPtr))
 					goto err;
 
 				pageHeaderSize =
 					XLogPageHeaderSize((XLogPageHeader) state->readBuf);
-				state->EndRecPtr = targetPagePtr + pageHeaderSize
+				state->NextRecPtr = targetPagePtr + pageHeaderSize
 					+ MAXALIGN(pageHeader->xlp_rem_len);
 
-				*record = prec;
+				record = prec;
 				state->readRecordState = XLREAD_NEXT_RECORD;
+
 				break;
 			}
 	}
@@ -688,32 +1062,65 @@ XLogReadRecord(XLogReaderState *state, XLogRecord **record, char **errormsg)
 	/*
 	 * Special processing if it's an XLOG SWITCH record
 	 */
-	if ((*record)->xl_rmid == RM_XLOG_ID &&
-		((*record)->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
+	if (record->xl_rmid == RM_XLOG_ID &&
+		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->segcxt.ws_segsize - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->segcxt.ws_segsize);
+		state->NextRecPtr += state->segcxt.ws_segsize - 1;
+		state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
 	}
 
-	if (DecodeXLogRecord(state, *record, errormsg))
-		return XLREAD_SUCCESS;
+	Assert(!record || state->readLen >= 0);
+	if (DecodeXLogRecord(state, state->decoding, record, state->DecodeRecPtr, &errormsg))
+	{
+		/* Record the location of the next record. */
+		state->decoding->next_lsn = state->NextRecPtr;
 
-	*record = NULL;
-	return XLREAD_FAIL;
+		/*
+		 * If it's in the decode buffer (not an "oversized" record allocated
+		 * with palloc()), mark the decode buffer space as occupied.
+		 */
+		if (!state->decoding->oversized)
+		{
+			/* The new decode buffer head must be MAXALIGNed. */
+			Assert(state->decoding->size == MAXALIGN(state->decoding->size));
+			if ((char *) state->decoding == state->decode_buffer)
+				state->decode_buffer_head = state->decode_buffer +
+					state->decoding->size;
+			else
+				state->decode_buffer_head += state->decoding->size;
+		}
+
+		/* Insert it into the queue of decoded records. */
+		Assert(state->decode_queue_head != state->decoding);
+		if (state->decode_queue_head)
+			state->decode_queue_head->next = state->decoding;
+		state->decode_queue_head = state->decoding;
+		if (!state->decode_queue_tail)
+			state->decode_queue_tail = state->decoding;
+		state->decoding = NULL;
+
+		return XLREAD_SUCCESS;
+	}
 
 err:
+	if (state->decoding && state->decoding->oversized)
+		pfree(state->decoding);
+	state->decoding = NULL;
 
+err_continue:
 	/*
 	 * Invalidate the read page. We might read from a different source after
 	 * failure.
 	 */
 	XLogReaderInvalReadState(state);
 
-	if (state->errormsg_buf[0] != '\0')
-		*errormsg = state->errormsg_buf;
+	/*
+	 * If an error was written to errormsg_buf, it'll be returned to the
+	 * caller of XLogReadRecord() after all successfully decoded records from
+	 * the decode queue have been returned.
+	 */
 
-	*record = NULL;
 	return XLREAD_FAIL;
 }
 
@@ -1348,34 +1755,84 @@ WALRead(XLogReaderState *state,
  * ----------------------------------------
  */
 
-/* private function to reset the state between records */
+/*
+ * Private function to reset the state, forgetting all decoded records, if we
+ * are asked to move to a new read position.
+ */
 static void
 ResetDecoder(XLogReaderState *state)
 {
-	int			block_id;
+	DecodedXLogRecord *r;
 
-	state->decoded_record = NULL;
-
-	state->main_data_len = 0;
-
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	/* Reset the decoded record queue, freeing any oversized records. */
+	while ((r = state->decode_queue_tail))
 	{
-		state->blocks[block_id].in_use = false;
-		state->blocks[block_id].has_image = false;
-		state->blocks[block_id].has_data = false;
-		state->blocks[block_id].apply_image = false;
+		state->decode_queue_tail = r->next;
+		if (r->oversized)
+			pfree(r);
 	}
-	state->max_block_id = -1;
+	state->decode_queue_head = NULL;
+	state->decode_queue_tail = NULL;
+	state->record = NULL;
+	state->decoding = NULL;
+
+	/* Reset the decode buffer to empty. */
+	state->decode_buffer_head = state->decode_buffer;
+	state->decode_buffer_tail = state->decode_buffer;
+
+	/* Clear error state. */
+	state->errormsg_buf[0] = '\0';
+	state->errormsg_deferred = false;
 }
 
 /*
- * Decode the previously read record.
+ * Compute the maximum possible amount of space that could be required to
+ * decode a record, given xl_tot_len from the record's header.  This is the
+ * amount of output buffer space that we need to decode a record, though we
+ * might not finish up using it all.
+ *
+ * This computation is pessimistic and assumes the maximum possible number of
+ * blocks, due to lack of better information.
+ */
+size_t
+DecodeXLogRecordRequiredSpace(size_t xl_tot_len)
+{
+	size_t size = 0;
+
+	/* Account for the fixed size part of the decoded record struct. */
+	size += offsetof(DecodedXLogRecord, blocks[0]);
+	/* Account for the flexible blocks array of maximum possible size. */
+	size += sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1);
+	/* Account for all the raw main and block data. */
+	size += xl_tot_len;
+	/* We might insert padding before main_data. */
+	size += (MAXIMUM_ALIGNOF - 1);
+	/* We might insert padding before each block's data. */
+	size += (MAXIMUM_ALIGNOF - 1) * (XLR_MAX_BLOCK_ID + 1);
+	/* We might insert padding at the end. */
+	size += (MAXIMUM_ALIGNOF - 1);
+
+	return size;
+}
+
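To put a rough number on the pessimism here (an illustration, not part of the
patch): with XLR_MAX_BLOCK_ID = 32 and MAXIMUM_ALIGNOF = 8, the three padding
terms above add up to (8 - 1) * (1 + 33 + 1) = 245 bytes, on top of
offsetof(DecodedXLogRecord, blocks[0]) + 33 * sizeof(DecodedBkpBlock) and
xl_tot_len itself.  The decoded->size reported by DecodeXLogRecord() is usually
much smaller, since few records reference anywhere near 33 blocks.
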
+/*
+ * Decode a record.  "decoded" must point to a MAXALIGNed memory area that has
+ * space for at least DecodeXLogRecordRequiredSpace(record) bytes.  On
+ * success, decoded->size contains the actual space occupied by the decoded
+ * record, which may turn out to be less.
+ *
+ * Only the decoded->oversized member needs to be initialized beforehand, and
+ * it will not be modified.  All other members will be initialized as required.
  *
  * On error, a human-readable error message is returned in *errormsg, and
  * the return value is false.
  */
 bool
-DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
+DecodeXLogRecord(XLogReaderState *state,
+				 DecodedXLogRecord *decoded,
+				 XLogRecord *record,
+				 XLogRecPtr lsn,
+				 char **errormsg)
 {
 	/*
 	 * read next _size bytes from record buffer, but check for overrun first.
@@ -1390,17 +1847,20 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	} while(0)
 
 	char	   *ptr;
+	char	   *out;
 	uint32		remaining;
 	uint32		datatotal;
 	RelFileNode *rnode = NULL;
 	uint8		block_id;
 
-	ResetDecoder(state);
-
-	state->decoded_record = record;
-	state->record_origin = InvalidRepOriginId;
-	state->toplevel_xid = InvalidTransactionId;
-
+	decoded->header = *record;
+	decoded->lsn = lsn;
+	decoded->next = NULL;
+	decoded->record_origin = InvalidRepOriginId;
+	decoded->toplevel_xid = InvalidTransactionId;
+	decoded->main_data = NULL;
+	decoded->main_data_len = 0;
+	decoded->max_block_id = -1;
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
 	remaining = record->xl_tot_len - SizeOfXLogRecord;
@@ -1418,7 +1878,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint8));
 
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
@@ -1429,18 +1889,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			uint32		main_data_len;
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint32));
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
 		}
 		else if (block_id == XLR_BLOCK_ID_ORIGIN)
 		{
-			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
+			COPY_HEADER_FIELD(&decoded->record_origin, sizeof(RepOriginId));
 		}
 		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
 		{
-			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+			COPY_HEADER_FIELD(&decoded->toplevel_xid, sizeof(TransactionId));
 		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
@@ -1448,7 +1908,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			DecodedBkpBlock *blk;
 			uint8		fork_flags;
 
-			if (block_id <= state->max_block_id)
+			/* mark any intervening block IDs as not in use */
+			for (int i = decoded->max_block_id + 1; i < block_id; ++i)
+				decoded->blocks[i].in_use = false;
+
+			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
 									  "out-of-order block_id %u at %X/%X",
@@ -1456,9 +1920,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
 				goto err;
 			}
-			state->max_block_id = block_id;
+			decoded->max_block_id = block_id;
 
-			blk = &state->blocks[block_id];
+			blk = &decoded->blocks[block_id];
 			blk->in_use = true;
 			blk->apply_image = false;
 
@@ -1602,17 +2066,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	/*
 	 * Ok, we've parsed the fragment headers, and verified that the total
 	 * length of the payload in the fragments is equal to the amount of data
-	 * left. Copy the data of each fragment to a separate buffer.
-	 *
-	 * We could just set up pointers into readRecordBuf, but we want to align
-	 * the data for the convenience of the callers. Backup images are not
-	 * copied, however; they don't need alignment.
+	 * left.  Copy the data of each fragment to contiguous space after the
+	 * blocks array, inserting alignment padding before the data fragments so
+	 * they can be cast to struct pointers by REDO routines.
 	 */
+	out = ((char *) decoded) +
+		offsetof(DecodedXLogRecord, blocks) +
+		sizeof(decoded->blocks[0]) * (decoded->max_block_id + 1);
 
 	/* block data first */
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	for (block_id = 0; block_id <= decoded->max_block_id; block_id++)
 	{
-		DecodedBkpBlock *blk = &state->blocks[block_id];
+		DecodedBkpBlock *blk = &decoded->blocks[block_id];
 
 		if (!blk->in_use)
 			continue;
@@ -1621,58 +2086,37 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (blk->has_image)
 		{
-			blk->bkp_image = ptr;
+			/* no need to align image */
+			blk->bkp_image = out;
+			memcpy(out, ptr, blk->bimg_len);
 			ptr += blk->bimg_len;
+			out += blk->bimg_len;
 		}
 		if (blk->has_data)
 		{
-			if (!blk->data || blk->data_len > blk->data_bufsz)
-			{
-				if (blk->data)
-					pfree(blk->data);
-
-				/*
-				 * Force the initial request to be BLCKSZ so that we don't
-				 * waste time with lots of trips through this stanza as a
-				 * result of WAL compression.
-				 */
-				blk->data_bufsz = MAXALIGN(Max(blk->data_len, BLCKSZ));
-				blk->data = palloc(blk->data_bufsz);
-			}
+			out = (char *) MAXALIGN(out);
+			blk->data = out;
 			memcpy(blk->data, ptr, blk->data_len);
 			ptr += blk->data_len;
+			out += blk->data_len;
 		}
 	}
 
 	/* and finally, the main data */
-	if (state->main_data_len > 0)
+	if (decoded->main_data_len > 0)
 	{
-		if (!state->main_data || state->main_data_len > state->main_data_bufsz)
-		{
-			if (state->main_data)
-				pfree(state->main_data);
-
-			/*
-			 * main_data_bufsz must be MAXALIGN'ed.  In many xlog record
-			 * types, we omit trailing struct padding on-disk to save a few
-			 * bytes; but compilers may generate accesses to the xlog struct
-			 * that assume that padding bytes are present.  If the palloc
-			 * request is not large enough to include such padding bytes then
-			 * we'll get valgrind complaints due to otherwise-harmless fetches
-			 * of the padding bytes.
-			 *
-			 * In addition, force the initial request to be reasonably large
-			 * so that we don't waste time with lots of trips through this
-			 * stanza.  BLCKSZ / 2 seems like a good compromise choice.
-			 */
-			state->main_data_bufsz = MAXALIGN(Max(state->main_data_len,
-												  BLCKSZ / 2));
-			state->main_data = palloc(state->main_data_bufsz);
-		}
-		memcpy(state->main_data, ptr, state->main_data_len);
-		ptr += state->main_data_len;
+		out = (char *) MAXALIGN(out);
+		decoded->main_data = out;
+		memcpy(decoded->main_data, ptr, decoded->main_data_len);
+		ptr += decoded->main_data_len;
+		out += decoded->main_data_len;
 	}
 
+	/* Report the actual size we used. */
+	decoded->size = MAXALIGN(out - (char *) decoded);
+	Assert(DecodeXLogRecordRequiredSpace(record->xl_tot_len) >=
+		   decoded->size);
+
 	return true;
 
 shortdata_err:
@@ -1698,10 +2142,11 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	if (rnode)
 		*rnode = bkpb->rnode;
 	if (forknum)
@@ -1721,10 +2166,11 @@ XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return NULL;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 
 	if (!bkpb->has_data)
 	{
@@ -1752,12 +2198,13 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	char	   *ptr;
 	PGAlignedBlock tmp;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
-	if (!record->blocks[block_id].has_image)
+	if (!record->record->blocks[block_id].has_image)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	ptr = bkpb->bkp_image;
 
 	if (bkpb->bimg_info & BKPIMAGE_IS_COMPRESSED)
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 61361192e7..0426e9e0ef 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -350,7 +350,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * going to initialize it. And vice versa.
 	 */
 	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
-	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
+	willinit = (record->record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
 	if (!willinit && zeromode)
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 9aab713684..7924581cdc 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -123,7 +123,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 	{
 		ReorderBufferAssignChild(ctx->reorder,
 								 txid,
-								 record->decoded_record->xl_xid,
+								 XLogRecGetXid(record),
 								 buf.origptr);
 	}
 
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 707493dddf..c94860707a 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -439,7 +439,7 @@ extractPageInfo(XLogReaderState *record)
 				 RmgrNames[rmid], info);
 	}
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 82af398d20..10d3420b7a 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -397,10 +397,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
 	 * add an accessor macro for this.
 	 */
 	*fpi_len = 0;
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (XLogRecHasBlockImage(record, block_id))
-			*fpi_len += record->blocks[block_id].bimg_len;
+			*fpi_len += record->record->blocks[block_id].bimg_len;
 	}
 
 	/*
@@ -498,7 +498,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	if (!config->bkp_details)
 	{
 		/* print block references (short format) */
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -529,7 +529,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	{
 		/* print block references (detailed format) */
 		putchar('\n');
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -542,26 +542,26 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				   blk);
 			if (XLogRecHasBlockImage(record, block_id))
 			{
-				if (record->blocks[block_id].bimg_info &
+				if (record->record->blocks[block_id].bimg_info &
 					BKPIMAGE_IS_COMPRESSED)
 				{
 					printf(" (FPW%s); hole: offset: %u, length: %u, "
 						   "compression saved: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length,
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length,
 						   BLCKSZ -
-						   record->blocks[block_id].hole_length -
-						   record->blocks[block_id].bimg_len);
+						   record->record->blocks[block_id].hole_length -
+						   record->record->blocks[block_id].bimg_len);
 				}
 				else
 				{
 					printf(" (FPW%s); hole: offset: %u, length: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length);
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length);
 				}
 			}
 			putchar('\n');
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 836ca7fce8..e22f7a0da2 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -101,6 +101,7 @@ typedef enum XLogReadRecordResult
 {
 	XLREAD_SUCCESS,				/* record is successfully read */
 	XLREAD_NEED_DATA,			/* need more data. see XLogReadRecord. */
+	XLREAD_FULL,		/* cannot hold more data while reading ahead */
 	XLREAD_FAIL					/* failed during reading a record */
 }			XLogReadRecordResult;
 
@@ -120,6 +121,30 @@ typedef enum XLogReadRecordState
 	XLREAD_CONTINUATION
 }			XLogReadRecordState;
 
+/*
+ * The decoded contents of a record.  This occupies a contiguous region of
+ * memory, with main_data and blocks[n].data pointing to memory after the
+ * members declared here.
+ */
+typedef struct DecodedXLogRecord
+{
+	/* Private member used for resource management. */
+	size_t		size;			/* total size of decoded record */
+	bool		oversized;		/* outside the regular decode buffer? */
+	struct DecodedXLogRecord *next;	/* decoded record queue  link */
+
+	/* Public members. */
+	XLogRecPtr	lsn;			/* location */
+	XLogRecPtr	next_lsn;		/* location of next record */
+	XLogRecord	header;			/* header */
+	RepOriginId record_origin;
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+	char	   *main_data;		/* record's main data portion */
+	uint32		main_data_len;	/* main data portion's length */
+	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	DecodedBkpBlock blocks[FLEXIBLE_ARRAY_MEMBER];
+} DecodedXLogRecord;
+
 struct XLogReaderState
 {
 	/*
@@ -142,10 +167,12 @@ struct XLogReaderState
 	 * Start and end point of last record read.  EndRecPtr is also used as the
 	 * position to read next.  Calling XLogBeginRead() sets EndRecPtr to the
 	 * starting position and ReadRecPtr to invalid.
+	 *
+	 * Start and end point of last record returned by XLogReadRecord().  These
+	 * are also available as record->lsn and record->next_lsn.
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read or being read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
-	XLogRecPtr	PrevRecPtr;		/* start of previous record read */
 
 	/* ----------------------------------------
 	 * Communication with page reader
@@ -171,27 +198,43 @@ struct XLogReaderState
 	 * Use XLogRecGet* functions to investigate the record; these fields
 	 * should not be accessed directly.
 	 * ----------------------------------------
+	 * Start and end point of the last record read and decoded by
+	 * XLogDecodeOneRecord().  NextRecPtr is also used as the position to
+	 * decode next.  Calling XLogBeginRead() sets NextRecPtr and EndRecPtr to
+	 * the requested starting position.
 	 */
-	XLogRecord *decoded_record; /* currently decoded record */
-
-	char	   *main_data;		/* record's main data portion */
-	uint32		main_data_len;	/* main data portion's length */
-	uint32		main_data_bufsz;	/* allocated size of the buffer */
-
-	RepOriginId record_origin;
+	XLogRecPtr	DecodeRecPtr;	/* start of last record decoded */
+	XLogRecPtr	NextRecPtr;		/* end+1 of last record decoded */
+	XLogRecPtr	PrevRecPtr;		/* start of previous record decoded */
 
-	TransactionId toplevel_xid; /* XID of top-level transaction */
-
-	/* information about blocks referenced by the record. */
-	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
-
-	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	/* Last record returned by XLogReadRecord(). */
+	DecodedXLogRecord *record;
 
 	/* ----------------------------------------
 	 * private/internal state
 	 * ----------------------------------------
 	 */
 
+	/*
+	 * Buffer for decoded records.  This is a circular buffer, though
+	 * individual records can't be split in the middle, so some space is often
+	 * wasted at the end.  Oversized records that don't fit in this space are
+	 * allocated separately.
+	 */
+	char	   *decode_buffer;
+	size_t		decode_buffer_size;
+	bool		free_decode_buffer;		/* need to free? */
+	char	   *decode_buffer_head;		/* write head */
+	char	   *decode_buffer_tail;		/* read head */
+
+	/*
+	 * Queue of records that have been decoded.  This is a linked list that
+	 * usually consists of consecutive records in decode_buffer, but may also
+	 * contain oversized records allocated with palloc().
+	 */
+	DecodedXLogRecord *decode_queue_head;	/* newest decoded record */
+	DecodedXLogRecord *decode_queue_tail;	/* oldest decoded record */
+
 	/* last read XLOG position for data currently in readBuf */
 	WALSegmentContext segcxt;
 	WALOpenSegment seg;
@@ -231,7 +274,7 @@ struct XLogReaderState
 	uint32		readRecordBufSize;
 
 	/*
-	 * XLogReadRecord() state
+	 * XLogDecodeOneRecord() state
 	 */
 	XLogReadRecordState readRecordState;	/* state machine state */
 	int			recordGotLen;	/* amount of current record that has already
@@ -239,8 +282,11 @@ struct XLogReaderState
 	int			recordRemainLen;	/* length of current record that remains */
 	XLogRecPtr	recordContRecPtr;	/* where the current record continues */
 
+	DecodedXLogRecord *decoding;	/* record currently being decoded */
+
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
+	bool		errormsg_deferred;
 };
 
 struct XLogFindNextRecordState
@@ -265,6 +311,11 @@ extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
+/* Optionally provide a circular decoding buffer to allow readahead. */
+extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
+									  void *buffer,
+									  size_t size);
+
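The startup process in the prefetch patch further down passes a NULL buffer
here, i.e. XLogReaderSetDecodeBuffer(xlogreader, NULL, wal_decode_buffer_size),
so that (judging from XLogReadRecordAlloc() above) a buffer of that size is
allocated lazily on first use.
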
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 #ifdef FRONTEND
@@ -275,11 +326,21 @@ extern XLogFindNextRecordState *InitXLogFindNextRecord(XLogReaderState *reader_s
 extern bool XLogFindNextRecord(XLogFindNextRecordState *state);
 #endif							/* FRONTEND */
 
-/* Read the next XLog record. Returns NULL on end-of-WAL or failure */
+/* Read the next record's header.  *record is NULL on end-of-WAL or failure. */
 extern XLogReadRecordResult XLogReadRecord(XLogReaderState *state,
 										   XLogRecord **record,
 										   char **errormsg);
 
+/* Read the next decoded record.  *record is NULL on end-of-WAL or failure. */
+extern XLogReadRecordResult XLogNextRecord(XLogReaderState *state,
+										   DecodedXLogRecord **record,
+										   char **errormsg);
+
+/* Try to read ahead, if there is space in the decoding buffer. */
+extern XLogReadRecordResult XLogReadAhead(XLogReaderState *state,
+										  DecodedXLogRecord **record,
+										  char **errormsg);
+
 /* Validate a page */
 extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 										 XLogRecPtr recptr, char *phdr);
@@ -304,25 +365,31 @@ extern bool WALRead(XLogReaderState *state,
 
 /* Functions for decoding an XLogRecord */
 
-extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
+extern size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+extern bool DecodeXLogRecord(XLogReaderState *state,
+							 DecodedXLogRecord *decoded,
+							 XLogRecord *record,
+							 XLogRecPtr lsn,
 							 char **errmsg);
 
-#define XLogRecGetTotalLen(decoder) ((decoder)->decoded_record->xl_tot_len)
-#define XLogRecGetPrev(decoder) ((decoder)->decoded_record->xl_prev)
-#define XLogRecGetInfo(decoder) ((decoder)->decoded_record->xl_info)
-#define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
-#define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
-#define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
-#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
-#define XLogRecGetData(decoder) ((decoder)->main_data)
-#define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
-#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
+#define XLogRecGetTotalLen(decoder) ((decoder)->record->header.xl_tot_len)
+#define XLogRecGetPrev(decoder) ((decoder)->record->header.xl_prev)
+#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
+#define XLogRecGetRmid(decoder) ((decoder)->record->header.xl_rmid)
+#define XLogRecGetXid(decoder) ((decoder)->record->header.xl_xid)
+#define XLogRecGetOrigin(decoder) ((decoder)->record->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->record->toplevel_xid)
+#define XLogRecGetData(decoder) ((decoder)->record->main_data)
+#define XLogRecGetDataLen(decoder) ((decoder)->record->main_data_len)
+#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->record->max_block_id >= 0)
+#define XLogRecMaxBlockId(decoder) ((decoder)->record->max_block_id)
+#define XLogRecGetBlock(decoder, i) (&(decoder)->record->blocks[(i)])
 #define XLogRecHasBlockRef(decoder, block_id) \
-	((decoder)->blocks[block_id].in_use)
+	((decoder)->record->blocks[block_id].in_use)
 #define XLogRecHasBlockImage(decoder, block_id) \
-	((decoder)->blocks[block_id].has_image)
+	((decoder)->record->blocks[block_id].has_image)
 #define XLogRecBlockImageApply(decoder, block_id) \
-	((decoder)->blocks[block_id].apply_image)
+	((decoder)->record->blocks[block_id].apply_image)
 
 #ifndef FRONTEND
 extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
-- 
2.30.1

Attachment: v17-0008-Prefetch-referenced-blocks-during-recovery.patch (text/x-patch)
From 1c38fdaa44a6ab64cc0bf5f80cb1221ac932a6bc Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Apr 2020 23:04:51 +1200
Subject: [PATCH v17 08/10] Prefetch referenced blocks during recovery.

Introduce a new GUC recovery_prefetch.  If it is enabled, then read
ahead in the WAL and try to initiate asynchronous reading of referenced
blocks that will soon be needed but are not yet cached in our buffer
pool.  For now, this is done with posix_fadvise().

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size is used to limit the maximum distance we
are prepared to read ahead in the WAL to find uncached blocks.
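
To make the knobs concrete (not part of the patch; maintenance_io_concurrency is
shown at its usual default), on a build with posix_fadvise() support the
defaults would correspond roughly to these postgresql.conf settings:

    recovery_prefetch = on              # default where posix_fadvise() exists
    recovery_prefetch_fpw = off         # skip blocks covered by FPWs
    wal_decode_buffer_size = 512kB      # how far ahead to decode the WAL
    maintenance_io_concurrency = 10     # caps concurrent prefetch requests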

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  58 ++
 doc/src/sgml/monitoring.sgml                  |  86 +-
 doc/src/sgml/wal.sgml                         |  17 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             |  45 +-
 src/backend/access/transam/xlogprefetch.c     | 922 ++++++++++++++++++
 src/backend/catalog/system_views.sql          |  14 +
 src/backend/postmaster/pgstat.c               | 103 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  56 +-
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/access/xlog.h                     |   1 +
 src/include/access/xlogprefetch.h             |  82 ++
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/pgstat.h                          |  26 +
 src/include/utils/guc.h                       |   4 +
 src/test/regress/expected/rules.out           |  11 +
 src/tools/pgindent/typedefs.list              |   4 +
 18 files changed, 1438 insertions(+), 9 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetch.c
 create mode 100644 src/include/access/xlogprefetch.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e51639d56c..cf7f5811e3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3456,6 +3456,64 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-recovery-prefetch" xreflabel="recovery_prefetch">
+      <term><varname>recovery_prefetch</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to try to prefetch blocks that are referenced in the WAL but
+        are not yet in the buffer pool, during recovery.  Prefetching blocks
+        that will soon be needed can reduce I/O wait times.
+        See also the <xref linkend="guc-wal-decode-buffer-size"/> and
+        <xref linkend="guc-maintenance-io-concurrency"/> settings, which limit
+        prefetching activity.
+        This setting is enabled
+        by default on systems that support <function>posix_fadvise</function>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+      <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks that were logged with full page images,
+        during recovery.  Often this doesn't help, since such blocks will not
+        be read the first time they are needed and might remain in the buffer
+        pool after that.  However, on file systems with a block size larger
+        than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a
+        costly read-before-write when blocks are later written.
+        The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-decode-buffer-size" xreflabel="wal_decode_buffer_size">
+      <term><varname>wal_decode_buffer_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_decode_buffer_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        A limit on how far ahead the server can look in the WAL, to find
+        blocks to prefetch.  Setting it too high might be counterproductive,
+        if it means that data falls out of the
+        kernel cache before it is needed.  If this value is specified without
+        units, it is taken as bytes.
+        The default is 512kB.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 56018745c8..5c567cc7cf 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -337,6 +337,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2897,6 +2904,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    copy of the subscribed tables.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+    </row>
+    <row>
+     <entry><structfield>skip_seq</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+    </row>
+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_distance</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>How far ahead of recovery the prefetcher is on average, while recovery is not idle</entry>
+    </row>
+    <row>
+     <entry><structfield>avg_queue_depth</structfield></entry>
+     <entry><type>float4</type></entry>
+     <entry>Average number of prefetches in flight while recovery is not idle</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-recovery-prefetch"/>,
+   <xref linkend="guc-recovery-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="1">
@@ -5029,8 +5108,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
         view, <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
-        to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+        the <structname>pg_stat_archiver</structname> view,
+        <literal>wal</literal> to reset all the counters shown in the
+        <structname>pg_stat_wal</structname> view or
+        <literal>prefetch_recovery</literal> to reset all the counters shown
+        in the <structname>pg_stat_prefetch_recovery</structname> view.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 7d48f42710..0f13c43095 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -803,6 +803,23 @@
    counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
    in <structname>pg_stat_wal</structname>, respectively.
   </para>
+
+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.  The
+   prefetching mechanism is most likely to be effective on systems
+   with <varname>full_page_writes</varname> set to
+   <literal>off</literal> (where that is safe), and where the working
+   set is larger than RAM.  By default, prefetching in recovery is enabled
+   on operating systems that have <function>posix_fadvise</function>
+   support.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..39f9d4e77d 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetch.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e7c789b6b9..fe5978925c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetch.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -110,6 +111,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
@@ -910,7 +912,8 @@ static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
 static bool XLogPageRead(XLogReaderState *state,
-						 bool fetching_ckpt, int emode, bool randAccess);
+						 bool fetching_ckpt, int emode, bool randAccess,
+						 bool nowait);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt,
 										XLogRecPtr tliRecPtr,
@@ -3729,7 +3732,6 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
 					 xlogfname);
 			set_ps_display(activitymsg);
-
 			restoredFromArchive = RestoreArchivedFile(path, xlogfname,
 													  "RECOVERYXLOG",
 													  wal_segment_size,
@@ -4388,9 +4390,9 @@ ReadRecord(XLogReaderState *xlogreader, int emode, bool fetching_ckpt)
 		while ((result = XLogReadRecord(xlogreader, &record, &errormsg))
 			   == XLREAD_NEED_DATA)
 		{
-			if (!XLogPageRead(xlogreader, fetching_ckpt, emode, randAccess))
+			if (!XLogPageRead(xlogreader, fetching_ckpt, emode, randAccess,
+							  false /* wait for data if streaming */))
 				break;
-
 		}
 
 		ReadRecPtr = xlogreader->ReadRecPtr;
@@ -6631,6 +6633,12 @@ StartupXLOG(void)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Set the WAL decode buffer size.  This limits how far ahead we can read
+	 * in the WAL.
+	 */
+	XLogReaderSetDecodeBuffer(xlogreader, NULL, wal_decode_buffer_size);
+
 	/*
 	 * Allocate two page buffers dedicated to WAL consistency checks.  We do
 	 * it this way, rather than just making static arrays, for two reasons:
@@ -7311,6 +7319,7 @@ StartupXLOG(void)
 		{
 			ErrorContextCallback errcallback;
 			TimestampTz xtime;
+			XLogPrefetchState prefetch;
 			PGRUsage	ru0;
 
 			pg_rusage_init(&ru0);
@@ -7321,6 +7330,9 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							LSN_FORMAT_ARGS(ReadRecPtr))));
 
+			/* Prepare to prefetch, if configured. */
+			XLogPrefetchBegin(&prefetch, xlogreader);
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7350,6 +7362,14 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/* Perform WAL prefetching, if enabled. */
+				while (XLogPrefetch(&prefetch, xlogreader->ReadRecPtr) == XLREAD_NEED_DATA)
+				{
+					if (!XLogPageRead(xlogreader, false, LOG, false,
+									  true /* don't wait for streaming data */))
+						break;
+				}
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
@@ -7523,6 +7543,9 @@ StartupXLOG(void)
 					 */
 					if (AllowCascadeReplication())
 						WalSndWakeup();
+
+					/* Reset the prefetcher. */
+					XLogPrefetchReconfigure();
 				}
 
 				/* Exit loop if we reached inclusive recovery target */
@@ -7539,6 +7562,7 @@ StartupXLOG(void)
 			/*
 			 * end of main redo apply loop
 			 */
+			XLogPrefetchEnd(&prefetch);
 
 			if (reachedRecoveryTarget)
 			{
@@ -12108,10 +12132,13 @@ CancelBackup(void)
  * and call XLogPageRead() again with the same arguments. This lets
  * XLogPageRead() to try fetching the record from another source, or to
  * sleep and retry.
+ *
+ * If nowait is true, then return false immediately if the requested data isn't
+ * available yet.
  */
 static bool
 XLogPageRead(XLogReaderState *state,
-			 bool fetching_ckpt, int emode, bool randAccess)
+			 bool fetching_ckpt, int emode, bool randAccess, bool nowait)
 {
 	char *readBuf				= state->readBuf;
 	XLogRecPtr	targetPagePtr	= state->readPagePtr;
@@ -12162,6 +12189,12 @@ retry:
 		(readSource == XLOG_FROM_STREAM &&
 		 flushedUpto < targetPagePtr + reqLen))
 	{
+		if (nowait)
+		{
+			XLogReaderNotifySize(state, -1);
+			return false;
+		}
+
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
 										 randAccess, fetching_ckpt,
 										 targetRecPtr, state->seg.ws_segno))
@@ -12395,6 +12428,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					currentSource = XLOG_FROM_STREAM;
 					startWalReceiver = true;
+					XLogPrefetchReconfigure();
 					break;
 
 				case XLOG_FROM_STREAM:
@@ -12650,6 +12684,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						else
 							havedata = false;
 					}
+
 					if (havedata)
 					{
 						/*
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
new file mode 100644
index 0000000000..56676f0021
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -0,0 +1,922 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetch.c
+ *
+ * The goal of this module is to read future WAL records and issue
+ * PrefetchSharedBuffer() calls for referenced blocks, so that we avoid I/O
+ * stalls in the main recovery loop.
+ *
+ * When examining a WAL record from the future, we need to consider that a
+ * referenced block or segment file might not exist on disk until this record
+ * or some earlier record has been replayed.  After a crash, a file might also
+ * be missing because it was dropped by a later WAL record; in that case, it
+ * will be recreated when this record is replayed.  These cases are handled by
+ * recognizing them and adding a "filter" that prevents all prefetching of a
+ * certain block range until the present WAL record has been replayed.  Blocks
+ * skipped for these reasons are counted as "skip_new" (that is, cases where we
+ * didn't try to prefetch "new" blocks).
+ *
+ * Blocks found in the buffer pool already are counted as "skip_hit".
+ * Repeated access to the same buffer is detected and skipped, and this is
+ * counted with "skip_seq".  Blocks that were logged with FPWs are skipped if
+ * recovery_prefetch_fpw is off, since on most systems there will be no I/O
+ * stall; this is counted with "skip_fpw".
+ *
+ * The only way we currently have to know that an I/O initiated with
+ * PrefetchSharedBuffer() has completed is that recovery will eventually call
+ * ReadBuffer() and perform a synchronous read.  Therefore, we track the number
+ * of potentially in-flight I/Os by using a circular buffer of LSNs.  When it's
+ * full, we have to wait for recovery to replay records so that the queue
+ * depth can be reduced, before we can do any more prefetching.  Ideally, this
+ * keeps us the right distance ahead to respect maintenance_io_concurrency.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetch.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log
+ * message that appears at the end of crash recovery.  It's also used to send
+ * messages periodically to the stats collector, to save the counters on disk.
+ */
+#define XLOGPREFETCHER_SAMPLE_DISTANCE 0x40000
+
+/* GUCs */
+bool		recovery_prefetch = true;
+bool		recovery_prefetch_fpw = false;
+
+int			XLogPrefetchReconfigureCount;
+
+/*
+ * A prefetcher object.  There is at most one of these in existence at a time,
+ * recreated whenever there is a configuration change.
+ */
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	DecodedXLogRecord *record;
+	int			next_block_id;
+	bool		shutdown;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation last_reln;
+	RelFileNode last_rnode;
+	BlockNumber last_blkno;
+
+	/* Online averages. */
+	uint64		samples;
+	double		avg_queue_depth;
+	double		avg_distance;
+	XLogRecPtr	next_sample_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB	   *filter_table;
+	dlist_head	filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	int			prefetch_head;
+	int			prefetch_tail;
+	int			prefetch_queue_size;
+	XLogRecPtr	prefetch_queue[MAX_IO_CONCURRENCY + 1];
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode rnode;
+	XLogRecPtr	filter_until_replayed;
+	BlockNumber filter_from_block;
+	dlist_node	link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time;	/* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 skip_hit;	/* Blocks already buffered. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+	pg_atomic_uint64 skip_seq;	/* Repeat blocks skipped. */
+	float		avg_distance;
+	float		avg_queue_depth;
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32		reset_handled;
+
+	/* Dynamic values */
+	int			distance;		/* Number of bytes ahead in the WAL. */
+	int			queue_depth;	/* Number of I/Os possibly in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+											 XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+static bool XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher,
+									  XLogRecPtr replaying_lsn);
+static bool XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher);
+static void XLogPrefetchSaveStats(void);
+static void XLogPrefetchRestoreStats(void);
+
+static XLogPrefetchStats *SharedStats;
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&SharedStats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&SharedStats->prefetch, 0);
+	pg_atomic_write_u64(&SharedStats->skip_hit, 0);
+	pg_atomic_write_u64(&SharedStats->skip_new, 0);
+	pg_atomic_write_u64(&SharedStats->skip_fpw, 0);
+	pg_atomic_write_u64(&SharedStats->skip_seq, 0);
+	SharedStats->avg_distance = 0;
+	SharedStats->avg_queue_depth = 0;
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	SharedStats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+
+	if (!found)
+	{
+		pg_atomic_init_u32(&SharedStats->reset_request, 0);
+		SharedStats->reset_handled = 0;
+
+		pg_atomic_init_u64(&SharedStats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&SharedStats->prefetch, 0);
+		pg_atomic_init_u64(&SharedStats->skip_hit, 0);
+		pg_atomic_init_u64(&SharedStats->skip_new, 0);
+		pg_atomic_init_u64(&SharedStats->skip_fpw, 0);
+		pg_atomic_init_u64(&SharedStats->skip_seq, 0);
+		SharedStats->avg_distance = 0;
+		SharedStats->avg_queue_depth = 0;
+		SharedStats->distance = 0;
+		SharedStats->queue_depth = 0;
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&SharedStats->reset_request, 1);
+}
+
+/*
+ * Tell the stats collector to serialize the shared memory counters into the
+ * stats file.
+ */
+static void
+XLogPrefetchSaveStats(void)
+{
+	PgStat_RecoveryPrefetchStats serialized = {
+		.prefetch = pg_atomic_read_u64(&SharedStats->prefetch),
+		.skip_hit = pg_atomic_read_u64(&SharedStats->skip_hit),
+		.skip_new = pg_atomic_read_u64(&SharedStats->skip_new),
+		.skip_fpw = pg_atomic_read_u64(&SharedStats->skip_fpw),
+		.skip_seq = pg_atomic_read_u64(&SharedStats->skip_seq),
+		.stat_reset_timestamp = pg_atomic_read_u64(&SharedStats->reset_time)
+	};
+
+	pgstat_send_recoveryprefetch(&serialized);
+}
+
+/*
+ * Try to restore the shared memory counters from the stats file.
+ */
+static void
+XLogPrefetchRestoreStats(void)
+{
+	PgStat_RecoveryPrefetchStats *serialized = pgstat_fetch_recoveryprefetch();
+
+	if (serialized->stat_reset_timestamp != 0)
+	{
+		pg_atomic_write_u64(&SharedStats->prefetch, serialized->prefetch);
+		pg_atomic_write_u64(&SharedStats->skip_hit, serialized->skip_hit);
+		pg_atomic_write_u64(&SharedStats->skip_new, serialized->skip_new);
+		pg_atomic_write_u64(&SharedStats->skip_fpw, serialized->skip_fpw);
+		pg_atomic_write_u64(&SharedStats->skip_seq, serialized->skip_seq);
+		pg_atomic_write_u64(&SharedStats->reset_time, serialized->stat_reset_timestamp);
+	}
+}
+
+/*
+ * Increment a counter in shared memory.  This is equivalent to (*counter)++ on a
+ * plain uint64 without any memory barrier or locking, except on platforms
+ * where readers can't read uint64 without possibly observing a torn value.
+ */
+static inline void
+XLogPrefetchIncrement(pg_atomic_uint64 *counter)
+{
+	Assert(AmStartupProcess() || !IsUnderPostmaster);
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
+/*
+ * Initialize an XLogPrefetchState object and restore the last saved
+ * statistics from disk.
+ */
+void
+XLogPrefetchBegin(XLogPrefetchState *state, XLogReaderState *reader)
+{
+	XLogPrefetchRestoreStats();
+
+	/* We'll reconfigure on the first call to XLogPrefetch(). */
+	state->reader = reader;
+	state->prefetcher = NULL;
+	state->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+}
+
+/*
+ * Shut down the prefetching infrastructure, if configured.
+ */
+void
+XLogPrefetchEnd(XLogPrefetchState *state)
+{
+	XLogPrefetchSaveStats();
+
+	if (state->prefetcher)
+		XLogPrefetcherFree(state->prefetcher);
+	state->prefetcher = NULL;
+
+	SharedStats->queue_depth = 0;
+	SharedStats->distance = 0;
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL records.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogReaderState *reader)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	/*
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.  The queue has space for up
+	 * to the highest possible value of the GUC + 1, because our circular buffer
+	 * has a gap between head and tail when full.
+	 */
+	prefetcher = palloc0(sizeof(XLogPrefetcher));
+	prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
+	prefetcher->reader = reader;
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	SharedStats->queue_depth = 0;
+	SharedStats->distance = 0;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	/* Log final statistics. */
+	ereport(LOG,
+			(errmsg("recovery finished prefetching at %X/%X; "
+					"prefetch = " UINT64_FORMAT ", "
+					"skip_hit = " UINT64_FORMAT ", "
+					"skip_new = " UINT64_FORMAT ", "
+					"skip_fpw = " UINT64_FORMAT ", "
+					"skip_seq = " UINT64_FORMAT ", "
+					"avg_distance = %f, "
+					"avg_queue_depth = %f",
+					(uint32) (prefetcher->reader->EndRecPtr >> 32),
+					(uint32) (prefetcher->reader->EndRecPtr),
+					pg_atomic_read_u64(&SharedStats->prefetch),
+					pg_atomic_read_u64(&SharedStats->skip_hit),
+					pg_atomic_read_u64(&SharedStats->skip_new),
+					pg_atomic_read_u64(&SharedStats->skip_fpw),
+					pg_atomic_read_u64(&SharedStats->skip_seq),
+					SharedStats->avg_distance,
+					SharedStats->avg_queue_depth)));
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Called when recovery is replaying a new LSN, to check if we can read ahead.
+ */
+bool
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	uint32		reset_request;
+
+	/* If an error has occurred or we've hit the end of the WAL, do nothing. */
+	if (prefetcher->shutdown)
+		return false;
+
+	/*
+	 * Have any in-flight prefetches definitely completed, judging by the LSN
+	 * that is currently being replayed?
+	 */
+	XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+	/*
+	 * Do we already have the maximum permitted number of I/Os running
+	 * (according to the information we have)?  If so, we have to wait for at
+	 * least one to complete, so give up early and let recovery catch up.
+	 */
+	if (XLogPrefetcherSaturated(prefetcher))
+		return false;
+
+	/*
+	 * Can we drop any filters yet?  This happens when the LSN that is
+	 * currently being replayed has moved past a record that prevents
+	 * prefetching of a block range, such as relation extension.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&SharedStats->reset_request);
+	if (reset_request != SharedStats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		SharedStats->reset_handled = reset_request;
+
+		prefetcher->avg_distance = 0;
+		prefetcher->avg_queue_depth = 0;
+		prefetcher->samples = 0;
+	}
+
+	/* OK, we can now try reading ahead. */
+	return XLogPrefetcherScanRecords(prefetcher, replaying_lsn);
+}
+
+/*
+ * Read ahead as far as we are allowed to, considering the LSN that recovery
+ * is currently replaying.
+ *
+ * Return true if the xlogreader would like more data.
+ */
+static bool
+XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	XLogReaderState *reader = prefetcher->reader;
+	DecodedXLogRecord *record;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	for (;;)
+	{
+		char	   *error;
+		int64		distance;
+
+		/* If we don't already have a record, then try to read one. */
+		if (prefetcher->record == NULL)
+		{
+			switch (XLogReadAhead(reader, &record, &error))
+			{
+				case XLREAD_NEED_DATA:
+					return true;
+				case XLREAD_FAIL:
+					if (error)
+						ereport(LOG,
+								(errmsg("recovery no longer prefetching: %s",
+										error)));
+					else
+						ereport(LOG,
+								(errmsg("recovery no longer prefetching")));
+					prefetcher->shutdown = true;
+					SharedStats->queue_depth = 0;
+					SharedStats->distance = 0;
+
+					return false;
+				case XLREAD_FULL:
+					return false;
+				case XLREAD_SUCCESS:
+					prefetcher->record = record;
+					prefetcher->next_block_id = 0;
+					break;
+			}
+		}
+		else
+		{
+			/*
+			 * We ran out of I/O queue space partway through a record.  We'll
+			 * carry on where we left off, according to next_block_id.
+			 */
+			record = prefetcher->record;
+		}
+
+		/* How far ahead of replay are we now? */
+		distance = record->lsn - replaying_lsn;
+
+		/* Update distance shown in shm. */
+		SharedStats->distance = distance;
+
+		/* Periodically recompute some statistics. */
+		if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+		{
+			/* Compute online averages. */
+			prefetcher->samples++;
+			if (prefetcher->samples == 1)
+			{
+				prefetcher->avg_distance = SharedStats->distance;
+				prefetcher->avg_queue_depth = SharedStats->queue_depth;
+			}
+			else
+			{
+				prefetcher->avg_distance +=
+					(SharedStats->distance - prefetcher->avg_distance) /
+					prefetcher->samples;
+				prefetcher->avg_queue_depth +=
+					(SharedStats->queue_depth - prefetcher->avg_queue_depth) /
+					prefetcher->samples;
+			}
+
+			/* Expose it in shared memory. */
+			SharedStats->avg_distance = prefetcher->avg_distance;
+			SharedStats->avg_queue_depth = prefetcher->avg_queue_depth;
+
+			/* Also periodically save the simple counters. */
+			XLogPrefetchSaveStats();
+
+			prefetcher->next_sample_lsn =
+				replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
+		}
+
+		/* Are we not far enough ahead? */
+		if (distance <= 0)
+		{
+			/* XXX Is this still possible? */
+			prefetcher->record = NULL;	/* skip this record */
+			continue;
+		}
+
+		/*
+		 * If this is a record that creates a new SMGR relation, we'll avoid
+		 * prefetching anything from that rnode until it has been replayed.
+		 */
+		if (replaying_lsn < record->lsn &&
+			record->header.xl_rmid == RM_SMGR_ID &&
+			(record->header.xl_info & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+		{
+			xl_smgr_create *xlrec = (xl_smgr_create *) record->main_data;
+
+			XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0, record->lsn);
+		}
+
+		/* Scan the record's block references. */
+		if (!XLogPrefetcherScanBlocks(prefetcher))
+			return false;
+
+		/* Advance to the next record. */
+		prefetcher->record = NULL;
+	}
+}
+
+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+	DecodedXLogRecord *record = prefetcher->record;
+
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+
+	/*
+	 * We might already have been partway through processing this record when
+	 * our queue became saturated, so we need to start where we left off.
+	 */
+	for (int block_id = prefetcher->next_block_id;
+		 block_id <= record->max_block_id;
+		 ++block_id)
+	{
+		DecodedBkpBlock *block = &record->blocks[block_id];
+		PrefetchBufferResult prefetch;
+		SMgrRelation reln;
+
+		/* Ignore everything but the main fork for now. */
+		if (block->forknum != MAIN_FORKNUM)
+			continue;
+
+		/*
+		 * If there is a full page image attached, we won't be reading the
+		 * page, so you might think we should skip it.  However, if the
+		 * underlying filesystem uses larger logical blocks than us, it might
+		 * still need to perform a read-before-write some time later.
+		 * Therefore, only prefetch if configured to do so.
+		 */
+		if (block->has_image && !recovery_prefetch_fpw)
+		{
+			XLogPrefetchIncrement(&SharedStats->skip_fpw);
+			continue;
+		}
+
+		/*
+		 * If this block will initialize a new page then it's probably a
+		 * relation extension.  Since that might create a new segment, we
+		 * can't try to prefetch this block until the record has been
+		 * replayed, or we might try to open a file that doesn't exist yet.
+		 */
+		if (block->flags & BKPBLOCK_WILL_INIT)
+		{
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+									record->lsn);
+			XLogPrefetchIncrement(&SharedStats->skip_new);
+			continue;
+		}
+
+		/* Should we skip this block due to a filter? */
+		if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+		{
+			XLogPrefetchIncrement(&SharedStats->skip_new);
+			continue;
+		}
+
+		/* Fast path for repeated references to the same relation. */
+		if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+		{
+			/*
+			 * If this is a repeat access to the same block, then skip it.
+			 *
+			 * XXX We could also check for last_blkno + 1 too, and also update
+			 * last_blkno; it's not clear if the kernel would do a better job
+			 * of sequential prefetching.
+			 */
+			if (block->blkno == prefetcher->last_blkno)
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_seq);
+				continue;
+			}
+
+			/* We can avoid calling smgropen(). */
+			reln = prefetcher->last_reln;
+		}
+		else
+		{
+			/* Otherwise we have to open it. */
+			reln = smgropen(block->rnode, InvalidBackendId);
+			prefetcher->last_rnode = block->rnode;
+			prefetcher->last_reln = reln;
+		}
+		prefetcher->last_blkno = block->blkno;
+
+		/* Try to prefetch this block! */
+		prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+		if (BufferIsValid(prefetch.recent_buffer))
+		{
+			/*
+			 * It was already cached, so do nothing.  Perhaps in future we
+			 * could remember the buffer so that recovery doesn't have to look
+			 * it up again.
+			 */
+			XLogPrefetchIncrement(&SharedStats->skip_hit);
+		}
+		else if (prefetch.initiated_io)
+		{
+			/*
+			 * I/O has possibly been initiated (though we don't know if it was
+			 * already cached by the kernel, so we just have to assume that it
+			 * was, for lack of better information).  Record this as an I/O
+			 * in progress until eventually we replay this LSN.
+			 */
+			XLogPrefetchIncrement(&SharedStats->prefetch);
+			XLogPrefetcherInitiatedIO(prefetcher, record->lsn);
+
+			/*
+			 * If the queue is now full, we'll have to wait before processing
+			 * any more blocks from this record, or move to a new record if
+			 * that was the last block.
+			 */
+			if (XLogPrefetcherSaturated(prefetcher))
+			{
+				prefetcher->next_block_id = block_id + 1;
+				return false;
+			}
+		}
+		else
+		{
+			/*
+			 * Neither cached nor initiated.  The underlying segment file
+			 * doesn't exist.  Presumably it will be unlinked by a later WAL
+			 * record.  When recovery reads this block, it will use the
+			 * EXTENSION_CREATE_RECOVERY flag.  We certainly don't want to do
+			 * that sort of thing while merely prefetching, so let's just
+			 * ignore references to this relation until this record is
+			 * replayed, and let recovery create the dummy file or complain if
+			 * something is wrong.
+			 */
+			XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+									record->lsn);
+			XLogPrefetchIncrement(&SharedStats->skip_new);
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&SharedStats->reset_request) != SharedStats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&SharedStats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_new));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_fpw));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_seq));
+	values[6] = Int32GetDatum(SharedStats->distance);
+	values[7] = Int32GetDatum(SharedStats->queue_depth);
+	values[8] = Float4GetDatum(SharedStats->avg_distance);
+	values[9] = Float4GetDatum(SharedStats->avg_queue_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Compute (n + 1) % prefetch_queue_size, assuming n < prefetch_queue_size,
+ * without using division.
+ */
+static inline int
+XLogPrefetcherNext(XLogPrefetcher *prefetcher, int n)
+{
+	int			next = n + 1;
+
+	return next == prefetcher->prefetch_queue_size ? 0 : next;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the (presumably lower) block
+		 * number there because we don't want to have to track individual
+		 * blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of
+	 * the time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+												   HASH_FIND, NULL);
+
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Insert an LSN into the queue.  The queue must not be full already.  This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+						  XLogRecPtr prefetching_lsn)
+{
+	Assert(!XLogPrefetcherSaturated(prefetcher));
+	prefetcher->prefetch_queue[prefetcher->prefetch_head] = prefetching_lsn;
+	prefetcher->prefetch_head =
+		XLogPrefetcherNext(prefetcher, prefetcher->prefetch_head);
+	SharedStats->queue_depth++;
+
+	Assert(SharedStats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet?  That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches.  For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+		   prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+	{
+		prefetcher->prefetch_tail =
+			XLogPrefetcherNext(prefetcher, prefetcher->prefetch_tail);
+		SharedStats->queue_depth--;
+
+		Assert(SharedStats->queue_depth >= 0);
+	}
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+	int			next = XLogPrefetcherNext(prefetcher, prefetcher->prefetch_head);
+
+	return next == prefetcher->prefetch_tail;
+}
+
+void
+assign_recovery_prefetch(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
+
+void
+assign_recovery_prefetch_fpw(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch_fpw = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5f2541d316..095c646610 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -910,6 +910,20 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.skip_hit,
+            s.skip_new,
+            s.skip_fpw,
+            s.skip_seq,
+            s.distance,
+            s.queue_depth,
+            s.avg_distance,
+            s.avg_queue_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
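
For illustration only (not part of the patch): with the pg_stat_prefetch_recovery
view above and the new reset target added in the pgstat.c hunk below, one could
watch and reset the counters like so:

-- Sample the prefetcher's progress while recovery is running.
SELECT prefetch, skip_hit, skip_new, skip_fpw, skip_seq,
       distance, queue_depth, avg_distance, avg_queue_depth
FROM pg_stat_prefetch_recovery;

-- Clear the counters.
SELECT pg_stat_reset_shared('prefetch_recovery');
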
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5ba776e789..2b10e04579 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "access/xlogprefetch.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
 #include "common/ip.h"
@@ -277,6 +278,7 @@ static PgStat_WalStats walStats;
 static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
 static PgStat_ReplSlotStats *replSlotStats;
 static int	nReplSlotStats;
+static PgStat_RecoveryPrefetchStats recoveryPrefetchStats;
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -347,6 +349,7 @@ static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
 static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
 static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
+static void pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1422,11 +1425,20 @@ pgstat_reset_shared_counters(const char *target)
 		msg.m_resettarget = RESET_BGWRITER;
 	else if (strcmp(target, "wal") == 0)
 		msg.m_resettarget = RESET_WAL;
+	else if (strcmp(target, "prefetch_recovery") == 0)
+	{
+		/*
+		 * We can't ask the stats collector to do this for us as it is not
+		 * attached to shared memory.
+		 */
+		XLogPrefetchRequestResetStats();
+		return;
+	}
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("unrecognized reset target: \"%s\"", target),
-				 errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
+				 errhint("Target must be \"archiver\", \"bgwriter\", \"wal\" or \"prefetch_recovery\".")));
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
 	pgstat_send(&msg, sizeof(msg));
@@ -2815,6 +2827,22 @@ pgstat_fetch_replslot(int *nslots_p)
 	return replSlotStats;
 }
 
+/*
+ * ---------
+ * pgstat_fetch_recoveryprefetch() -
+ *
+ *	Support function for restoring the counters managed by xlogprefetch.c.
+ * ---------
+ */
+PgStat_RecoveryPrefetchStats *
+pgstat_fetch_recoveryprefetch(void)
+{
+	backend_read_statsfile();
+
+	return &recoveryPrefetchStats;
+}
+
+
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
@@ -3090,6 +3118,23 @@ pgstat_send_slru(void)
 }
 
 
+/* ----------
+ * pgstat_send_recoveryprefetch() -
+ *
+ *		Send recovery prefetch statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats)
+{
+	PgStat_MsgRecoveryPrefetch msg;
+
+	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYPREFETCH);
+	msg.m_stats = *stats;
+	pgstat_send(&msg, sizeof(msg));
+}
+
+
 /* ----------
  * PgstatCollectorMain() -
  *
@@ -3303,6 +3348,10 @@ PgstatCollectorMain(int argc, char *argv[])
 					pgstat_recv_slru(&msg.msg_slru, len);
 					break;
 
+				case PGSTAT_MTYPE_RECOVERYPREFETCH:
+					pgstat_recv_recoveryprefetch(&msg.msg_recoveryprefetch, len);
+					break;
+
 				case PGSTAT_MTYPE_FUNCSTAT:
 					pgstat_recv_funcstat(&msg.msg_funcstat, len);
 					break;
@@ -3595,6 +3644,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
 	rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
 	(void) rc;					/* we'll check for error with ferror */
 
+	/*
+	 * Write recovery prefetch stats struct
+	 */
+	rc = fwrite(&recoveryPrefetchStats, sizeof(recoveryPrefetchStats), 1,
+				fpout);
+	(void) rc;					/* we'll check for error with ferror */
+
 	/*
 	 * Walk through the database table.
 	 */
@@ -3870,6 +3926,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 	memset(&archiverStats, 0, sizeof(archiverStats));
 	memset(&walStats, 0, sizeof(walStats));
 	memset(&slruStats, 0, sizeof(slruStats));
+	memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
 
 	/*
 	 * Set the current timestamp (will be kept only in case we can't load an
@@ -3975,6 +4032,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 		goto done;
 	}
 
+	/*
+	 * Read recoveryPrefetchStats struct
+	 */
+	if (fread(&recoveryPrefetchStats, 1, sizeof(recoveryPrefetchStats),
+			  fpin) != sizeof(recoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
+		goto done;
+	}
+
 	/*
 	 * We found an existing collector stats file. Read it and put all the
 	 * hashtable entries into place.
@@ -4293,6 +4362,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 	PgStat_WalStats myWalStats;
 	PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
 	PgStat_ReplSlotStats myReplSlotStats;
+	PgStat_RecoveryPrefetchStats myRecoveryPrefetchStats;
 	FILE	   *fpin;
 	int32		format_id;
 	const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -4369,6 +4439,18 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
 		return false;
 	}
 
+	/*
+	 * Read recovery prefetch stats struct
+	 */
+	if (fread(&myRecoveryPrefetchStats, 1, sizeof(myRecoveryPrefetchStats),
+			  fpin) != sizeof(myRecoveryPrefetchStats))
+	{
+		ereport(pgStatRunningInCollector ? LOG : WARNING,
+				(errmsg("corrupted statistics file \"%s\"", statfile)));
+		FreeFile(fpin);
+		return false;
+	}
+
 	/* By default, we're going to return the timestamp of the global file. */
 	*ts = myGlobalStats.stats_timestamp;
 
@@ -4552,6 +4634,13 @@ backend_read_statsfile(void)
 		if (ok && file_ts >= min_ts)
 			break;
 
+		/*
+		 * If we're in crash recovery, the collector may not even be running,
+		 * so work with what we have.
+		 */
+		if (InRecovery)
+			break;
+
 		/* Not there or too old, so kick the collector and wait a bit */
 		if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
 			pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
@@ -5259,6 +5348,18 @@ pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
 	slruStats[msg->m_index].truncate += msg->m_truncate;
 }
 
+/* ----------
+ * pgstat_recv_recoveryprefetch() -
+ *
+ *	Process a recovery prefetch message.
+ * ----------
+ */
+static void
+pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len)
+{
+	recoveryPrefetchStats = msg->m_stats;
+}
+
 /* ----------
  * pgstat_recv_recoveryconflict() -
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97..47847563ef 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/xlogprefetch.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -126,6 +127,7 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, PredicateLockShmemSize());
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
+		size = add_size(size, XLogPrefetchShmemSize());
 		size = add_size(size, CLOGShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
@@ -217,6 +219,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c9c9da85f3..f0fa2883cf 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -41,6 +41,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetch.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/storage.h"
@@ -209,6 +210,7 @@ static bool check_effective_io_concurrency(int *newval, void **extra, GucSource
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
 static bool check_client_connection_check_interval(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1293,6 +1295,32 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch referenced blocks during recovery."),
+			gettext_noop("Read ahead of the current replay position to find uncached blocks.")
+		},
+		&recovery_prefetch,
+		/* No point in enabling this on systems without a suitable API. */
+#ifdef USE_PREFETCH
+		true,
+#else
+		false,
+#endif
+		NULL, assign_recovery_prefetch, NULL
+	},
+	{
+		{"recovery_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch blocks that have full page images in the WAL."),
+			gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+						 "entirely overwritten, but if the logical page size of the filesystem is "
+						 "larger than PostgreSQL's, this can be beneficial.  This option has no "
+						 "effect unless recovery_prefetch is enabled.")
+		},
+		&recovery_prefetch_fpw,
+		false,
+		NULL, assign_recovery_prefetch_fpw, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2720,6 +2748,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
+			gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
+			GUC_UNIT_BYTE
+		},
+		&wal_decode_buffer_size,
+		512 * 1024, 64 * 1024, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the size of WAL files held for standby servers."),
@@ -3040,7 +3079,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -12010,6 +12050,20 @@ check_client_connection_check_interval(int *newval, void **extra, GucSource sour
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 39da7cc942..9477cb9eaf 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -235,6 +235,12 @@
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
 
+# - Prefetching during recovery -
+
+#wal_decode_buffer_size = 512kB		# lookahead window used for prefetching
+#recovery_prefetch = on			# whether to prefetch blocks referenced in the WAL
+#recovery_prefetch_fpw = off		# whether to prefetch pages logged with FPW
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 77187c12be..f542af0a26 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -132,6 +132,7 @@ extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
 extern bool track_wal_io_timing;
+extern int	wal_decode_buffer_size;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/access/xlogprefetch.h b/src/include/access/xlogprefetch.h
new file mode 100644
index 0000000000..59ce9c6473
--- /dev/null
+++ b/src/include/access/xlogprefetch.h
@@ -0,0 +1,82 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCH_H
+#define XLOGPREFETCH_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern bool recovery_prefetch;
+extern bool recovery_prefetch_fpw;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern int	XLogPrefetchReconfigureCount;
+
+typedef struct XLogPrefetchState
+{
+	XLogReaderState *reader;
+	XLogPrefetcher *prefetcher;
+	int			reconfigure_count;
+} XLogPrefetchState;
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchReconfigure(void);
+extern void XLogPrefetchRequestResetStats(void);
+
+extern void XLogPrefetchBegin(XLogPrefetchState *state, XLogReaderState *reader);
+extern void XLogPrefetchEnd(XLogPrefetchState *state);
+
+/* Functions exposed only for the use of XLogPrefetch(). */
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogReaderState *reader);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern bool XLogPrefetcherReadAhead(XLogPrefetcher *prefetch,
+									XLogRecPtr replaying_lsn);
+
+/*
+ * Tell the prefetching module that we are now replaying a given LSN, so that
+ * it can decide how far ahead to read in the WAL, if configured.  Return
+ * true if more data is needed by the reader.
+ */
+static inline bool
+XLogPrefetch(XLogPrefetchState *state, XLogRecPtr replaying_lsn)
+{
+	/*
+	 * Handle any configuration changes.  Rather than trying to deal with
+	 * various parameter changes, we just tear down and set up a new
+	 * prefetcher if anything we depend on changes.
+	 */
+	if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
+	{
+		/* If we had a prefetcher, tear it down. */
+		if (state->prefetcher)
+		{
+			XLogPrefetcherFree(state->prefetcher);
+			state->prefetcher = NULL;
+		}
+		/* If we want a prefetcher, set it up. */
+		if (recovery_prefetch)
+			state->prefetcher = XLogPrefetcherAllocate(state->reader);
+		state->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	if (state->prefetcher)
+		return XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
+
+	return false;
+}
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4309fa40dd..f640911cbf 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6282,6 +6282,14 @@
   prorettype => 'text', proargtypes => '',
   prosrc => 'pg_get_wal_replay_pause_state' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,float4,float4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth,avg_distance,avg_queue_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 7cd137506e..7c2caefd9b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -73,6 +73,7 @@ typedef enum StatMsgType
 	PGSTAT_MTYPE_BGWRITER,
 	PGSTAT_MTYPE_WAL,
 	PGSTAT_MTYPE_SLRU,
+	PGSTAT_MTYPE_RECOVERYPREFETCH,
 	PGSTAT_MTYPE_FUNCSTAT,
 	PGSTAT_MTYPE_FUNCPURGE,
 	PGSTAT_MTYPE_RECOVERYCONFLICT,
@@ -196,6 +197,19 @@ typedef struct PgStat_TableXactStatus
 	struct PgStat_TableXactStatus *next;	/* next of same subxact */
 } PgStat_TableXactStatus;
 
+/*
+ * Recovery prefetching statistics persisted on disk by pgstat.c, but kept in
+ * shared memory by xlogprefetch.c.
+ */
+typedef struct PgStat_RecoveryPrefetchStats
+{
+	PgStat_Counter prefetch;
+	PgStat_Counter skip_hit;
+	PgStat_Counter skip_new;
+	PgStat_Counter skip_fpw;
+	PgStat_Counter skip_seq;
+	TimestampTz stat_reset_timestamp;
+} PgStat_RecoveryPrefetchStats;
 
 /* ------------------------------------------------------------
  * Message formats follow
@@ -516,6 +530,15 @@ typedef struct PgStat_MsgReplSlot
 	PgStat_Counter m_stream_bytes;
 } PgStat_MsgReplSlot;
 
+/* ----------
+ * PgStat_MsgRecoveryPrefetch			Sent by XLogPrefetch to save statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgRecoveryPrefetch
+{
+	PgStat_MsgHdr m_hdr;
+	PgStat_RecoveryPrefetchStats m_stats;
+} PgStat_MsgRecoveryPrefetch;
 
 /* ----------
  * PgStat_MsgRecoveryConflict	Sent by the backend upon recovery conflict
@@ -678,6 +701,7 @@ typedef union PgStat_Msg
 	PgStat_MsgBgWriter msg_bgwriter;
 	PgStat_MsgWal msg_wal;
 	PgStat_MsgSLRU msg_slru;
+	PgStat_MsgRecoveryPrefetch msg_recoveryprefetch;
 	PgStat_MsgFuncstat msg_funcstat;
 	PgStat_MsgFuncpurge msg_funcpurge;
 	PgStat_MsgRecoveryConflict msg_recoveryconflict;
@@ -1065,6 +1089,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
 extern void pgstat_report_wal(void);
 extern bool pgstat_send_wal(bool force);
 
@@ -1081,6 +1106,7 @@ extern PgStat_GlobalStats *pgstat_fetch_global(void);
 extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
 extern PgStat_ReplSlotStats *pgstat_fetch_replslot(int *nslots_p);
+extern PgStat_RecoveryPrefetchStats *pgstat_fetch_recoveryprefetch(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 5004ee4177..d0078779c8 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -442,4 +442,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetch.c */
+extern void assign_recovery_prefetch(bool new_value, void *extra);
+extern void assign_recovery_prefetch_fpw(bool new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 9b59a7b4a5..cd7b63f634 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1878,6 +1878,17 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.skip_hit,
+    s.skip_new,
+    s.skip_fpw,
+    s.skip_seq,
+    s.distance,
+    s.queue_depth,
+    s.avg_distance,
+    s.avg_queue_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth, avg_distance, avg_queue_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b26e81dcbf..b1220b07b3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2802,6 +2802,10 @@ XLogPageHeader
 XLogPageHeaderData
 XLogPageReadCB
 XLogPageReadPrivate
+XLogPrefetcher
+XLogPrefetcherFilter
+XLogPrefetchState
+XLogPrefetchStats
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.30.1

Attachment: v17-0009-Provide-ReadRecentBuffer-to-re-pin-buffers-by-ID.patch (text/x-patch)
From 6f2bd62a520dbf9249879afa174faf4ae063818d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 14 Sep 2020 23:20:55 +1200
Subject: [PATCH v17 09/10] Provide ReadRecentBuffer() to re-pin buffers by ID.

If you know the buffer ID that recently held a given block you would
like to pin, this function will check if it's still there and pin it if
the tag hasn't changed.  Otherwise, you'll need to use the regular
ReadBuffer() function.  This will be used by later patches to avoid
double lookup in some cases where it's very likely not to have moved.

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 43 +++++++++++++++++++++++++++++
 src/include/storage/bufmgr.h        |  2 ++
 2 files changed, 45 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 852138f9c9..0e5f92d92b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -610,6 +610,49 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 	}
 }
 
+/*
+ * ReadRecentBuffer -- try to refind a buffer that we suspect holds a given
+ *		block
+ *
+ * Return true if the buffer is valid, has the correct tag, and we managed
+ * to pin it.
+ */
+bool
+ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum,
+				 Buffer recent_buffer)
+{
+	BufferDesc *bufHdr;
+	BufferTag	tag;
+
+	Assert(BufferIsValid(recent_buffer));
+
+	/* Look up the header by index, and try to pin if shared. */
+	if (BufferIsLocal(recent_buffer))
+		bufHdr = GetBufferDescriptor(-recent_buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(recent_buffer - 1);
+		ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+		if (!PinBuffer(bufHdr, NULL))
+		{
+			/* Not valid, couldn't pin it. */
+			UnpinBuffer(bufHdr, true);
+
+			return false;
+		}
+	}
+
+	/* Does the tag still match? */
+	INIT_BUFFERTAG(tag, rnode, forkNum, blockNum);
+	if (BUFFERTAGS_EQUAL(tag, bufHdr->tag))
+		return true;
+
+	/* Too late!  Unpin if shared. */
+	if (!BufferIsLocal(recent_buffer))
+		UnpinBuffer(bufHdr, true);
+
+	return false;
+}
 
 /*
  * ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index fb00fda6a7..aa64fb42ec 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -176,6 +176,8 @@ extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_r
 												 BlockNumber blockNum);
 extern PrefetchBufferResult PrefetchBuffer(Relation reln, ForkNumber forkNum,
 										   BlockNumber blockNum);
+extern bool ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum,
+							 BlockNumber blockNum, Buffer recent_buffer);
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
 extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BlockNumber blockNum, ReadBufferMode mode,
-- 
2.30.1
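
For illustration, here is a minimal caller-side sketch of the fast path that
ReadRecentBuffer() enables, assuming the caller has stashed the Buffer value
from an earlier lookup.  Everything except ReadRecentBuffer() and
ReadBufferExtended() is hypothetical; the real integration for recovery is in
the next patch's XLogReadBufferExtended() change.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Hypothetical sketch (not part of the patch): re-find a block via a
 * remembered buffer ID, falling back to a normal lookup if the tag changed
 * or the buffer was recycled.
 */
static Buffer
reread_block(Relation rel, ForkNumber forknum, BlockNumber blkno,
			 Buffer recent_buffer)
{
	/* Fast path: the buffer may still hold the same block; pin it if so. */
	if (BufferIsValid(recent_buffer) &&
		ReadRecentBuffer(rel->rd_node, forknum, blkno, recent_buffer))
		return recent_buffer;

	/* Slow path: ordinary buffer mapping lookup (and possibly a read). */
	return ReadBufferExtended(rel, forknum, blkno, RBM_NORMAL, NULL);
}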

v17-0010-Avoid-extra-buffer-lookup-when-prefetching-WAL-b.patchtext/x-patch; charset=US-ASCII; name=v17-0010-Avoid-extra-buffer-lookup-when-prefetching-WAL-b.patchDownload
From 761ac75ef0ea2637accbadd38115b680aed5115c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 7 Apr 2021 15:03:49 +1200
Subject: [PATCH v17 10/10] Avoid extra buffer lookup when prefetching WAL
 blocks.

Provide some workspace in decoded WAL records to remember which buffer
we recently found a block in while prefetching, so that we can try to
use ReadRecentBuffer() when recovery gets to processing this record.

Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 src/backend/access/transam/xlog.c         |  2 +-
 src/backend/access/transam/xlogprefetch.c |  6 +++---
 src/backend/access/transam/xlogreader.c   | 13 +++++++++++++
 src/backend/access/transam/xlogutils.c    | 23 +++++++++++++++++++----
 src/backend/storage/buffer/bufmgr.c       |  1 +
 src/backend/storage/freespace/freespace.c |  3 ++-
 src/include/access/xlogreader.h           |  7 +++++++
 src/include/access/xlogutils.h            |  3 ++-
 8 files changed, 48 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fe5978925c..ec1986aff7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1464,7 +1464,7 @@ checkXLogConsistency(XLogReaderState *record)
 		 * temporary page.
 		 */
 		buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									 RBM_NORMAL_NO_LOG);
+									 RBM_NORMAL_NO_LOG, InvalidBuffer);
 		if (!BufferIsValid(buf))
 			continue;
 
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
index 56676f0021..bec7b62844 100644
--- a/src/backend/access/transam/xlogprefetch.c
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -649,10 +649,10 @@ XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
 		if (BufferIsValid(prefetch.recent_buffer))
 		{
 			/*
-			 * It was already cached, so do nothing.  Perhaps in future we
-			 * could remember the buffer so that recovery doesn't have to look
-			 * it up again.
+			 * It was already cached, so do nothing.  We'll remember the
+			 * buffer, so that recovery can try to avoid looking it up again.
 			 */
+			block->recent_buffer = prefetch.recent_buffer;
 			XLogPrefetchIncrement(&SharedStats->skip_hit);
 		}
 		else if (prefetch.initiated_io)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 653255b9c9..ce10e28fe6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1932,6 +1932,8 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
 
+			blk->recent_buffer = InvalidBuffer;
+
 			COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 			/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
 			if (blk->has_data && blk->data_len == 0)
@@ -2139,6 +2141,15 @@ err:
 bool
 XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 				   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+	return XLogRecGetRecentBuffer(record, block_id, rnode, forknum, blknum,
+								  NULL);
+}
+
+bool
+XLogRecGetRecentBuffer(XLogReaderState *record, uint8 block_id,
+					   RelFileNode *rnode, ForkNumber *forknum,
+					   BlockNumber *blknum, Buffer *recent_buffer)
 {
 	DecodedBkpBlock *bkpb;
 
@@ -2153,6 +2164,8 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 		*forknum = bkpb->forknum;
 	if (blknum)
 		*blknum = bkpb->blkno;
+	if (recent_buffer)
+		*recent_buffer = bkpb->recent_buffer;
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 0426e9e0ef..f6ddc888f3 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -335,11 +335,13 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber blkno;
+	Buffer		recent_buffer;
 	Page		page;
 	bool		zeromode;
 	bool		willinit;
 
-	if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+	if (!XLogRecGetRecentBuffer(record, block_id, &rnode, &forknum, &blkno,
+								&recent_buffer))
 	{
 		/* Caller specified a bogus block_id */
 		elog(PANIC, "failed to locate backup block with ID %d", block_id);
@@ -361,7 +363,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	{
 		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
+									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
+									  recent_buffer);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			elog(ERROR, "failed to restore block image");
@@ -390,7 +393,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	}
 	else
 	{
-		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode);
+		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode,
+									  recent_buffer);
 		if (BufferIsValid(*buf))
 		{
 			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
@@ -437,7 +441,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  */
 Buffer
 XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-					   BlockNumber blkno, ReadBufferMode mode)
+					   BlockNumber blkno, ReadBufferMode mode,
+					   Buffer recent_buffer)
 {
 	BlockNumber lastblock;
 	Buffer		buffer;
@@ -445,6 +450,15 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 
 	Assert(blkno != P_NEW);
 
+	/* Do we have a clue where the buffer might be already? */
+	if (BufferIsValid(recent_buffer) &&
+		mode == RBM_NORMAL &&
+		ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+	{
+		buffer = recent_buffer;
+		goto recent_buffer_fast_path;
+	}
+
 	/* Open the relation at smgr level */
 	smgr = smgropen(rnode, InvalidBackendId);
 
@@ -503,6 +517,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 		}
 	}
 
+recent_buffer_fast_path:
 	if (mode == RBM_NORMAL)
 	{
 		/* check that page has been initialized */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0e5f92d92b..0cc1eb7971 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -651,6 +651,7 @@ ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum,
 	if (!BufferIsLocal(recent_buffer))
 		UnpinBuffer(bufHdr, true);
 
+
 	return false;
 }
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 8c12dda238..cfa0414e5a 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -210,7 +210,8 @@ XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 	blkno = fsm_logical_to_physical(addr);
 
 	/* If the page doesn't exist already, extend */
-	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR);
+	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR,
+								 InvalidBuffer);
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(buf);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index e22f7a0da2..dfca5a9a86 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -39,6 +39,7 @@
 #endif
 
 #include "access/xlogrecord.h"
+#include "storage/buf.h"
 
 /* WALOpenSegment represents a WAL segment being read. */
 typedef struct WALOpenSegment
@@ -77,6 +78,9 @@ typedef struct
 	ForkNumber	forknum;
 	BlockNumber blkno;
 
+	/* Workspace for remembering last known buffer holding this block. */
+	Buffer		recent_buffer;
+
 	/* copy of the fork_flags field from the XLogRecordBlockHeader */
 	uint8		flags;
 
@@ -400,5 +404,8 @@ extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *
 extern bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 							   RelFileNode *rnode, ForkNumber *forknum,
 							   BlockNumber *blknum);
+extern bool XLogRecGetRecentBuffer(XLogReaderState *record, uint8 block_id,
+								   RelFileNode *rnode, ForkNumber *forknum,
+								   BlockNumber *blknum, Buffer *recent_buffer);
 
 #endif							/* XLOGREADER_H */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 397fb27fc2..bbc6085130 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -42,7 +42,8 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 													Buffer *buf);
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-									 BlockNumber blkno, ReadBufferMode mode);
+									 BlockNumber blkno, ReadBufferMode mode,
+									 Buffer recent_buffer);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
-- 
2.30.1

#84Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Thomas Munro (#83)
Re: WIP: WAL prefetch (another approach)

On 4/7/21 1:24 PM, Thomas Munro wrote:

Here's a rebase, on top of Horiguchi-san's v19 patch set. My patches
start at 0007. Previously, there was a "nowait" flag that was passed
into all the callbacks so that XLogReader could wait for new WAL in
some cases but not others. This new version uses the proposed
XLREAD_NEED_DATA protocol, and the caller deals with waiting for data
to arrive when appropriate. This seems tidier to me.

OK, seems reasonable.

I made one other simplifying change: previously, the prefetch module
would read the WAL up to the "written" LSN (so, allowing itself to
read data that had been written but not yet flushed to disk by the
walreceiver), though it still waited until a record's LSN was
"flushed" before replaying. That allowed prefetching to happen
concurrently with the WAL flush, which was nice, but it felt a little
too "special". I decided to remove that part for now, and I plan to
look into making standbys work more like primary servers, using WAL
buffers, the WAL writer and optionally the standard log-before-data
rule.

Not sure, but the removal seems unnecessary. I'm worried that this will
significantly reduce the amount of data that we'll be able to prefetch.
How likely is it that we have data that is written but not flushed?
Let's assume the replica is lagging and network bandwidth is not the
bottleneck - how likely is it that this "has to be flushed" rule limits
the prefetching?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#85Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#84)
Re: WIP: WAL prefetch (another approach)

On Thu, Apr 8, 2021 at 3:27 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 4/7/21 1:24 PM, Thomas Munro wrote:

I made one other simplifying change: previously, the prefetch module
would read the WAL up to the "written" LSN (so, allowing itself to
read data that had been written but not yet flushed to disk by the
walreceiver), though it still waited until a record's LSN was
"flushed" before replaying. That allowed prefetching to happen
concurrently with the WAL flush, which was nice, but it felt a little
too "special". I decided to remove that part for now, and I plan to
look into making standbys work more like primary servers, using WAL
buffers, the WAL writer and optionally the standard log-before-data
rule.

Not sure, but the removal seems unnecessary. I'm worried that this will
significantly reduce the amount of data that we'll be able to prefetch.
How likely is it that we have data that is written but not flushed?
Let's assume the replica is lagging and network bandwidth is not the
bottleneck - how likely is it that this "has to be flushed" rule limits
the prefetching?

Yeah, it would have been nice to include that but it'll have to be for
v15 due to lack of time to convince myself that it was correct. I do
intend to look into more concurrency of that kind for v15. I have
pushed these patches, updated to be disabled by default. I will look
into how I can run a BF animal that has it enabled during the recovery
tests for coverage. Thanks very much to everyone on this thread for
all the discussion and testing so far.

#86Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Thomas Munro (#85)
Re: WIP: WAL prefetch (another approach)

On 4/8/21 1:46 PM, Thomas Munro wrote:

On Thu, Apr 8, 2021 at 3:27 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 4/7/21 1:24 PM, Thomas Munro wrote:

I made one other simplifying change: previously, the prefetch module
would read the WAL up to the "written" LSN (so, allowing itself to
read data that had been written but not yet flushed to disk by the
walreceiver), though it still waited until a record's LSN was
"flushed" before replaying. That allowed prefetching to happen
concurrently with the WAL flush, which was nice, but it felt a little
too "special". I decided to remove that part for now, and I plan to
look into making standbys work more like primary servers, using WAL
buffers, the WAL writer and optionally the standard log-before-data
rule.

Not sure, but the removal seems unnecessary. I'm worried that this will
significantly reduce the amount of data that we'll be able to prefetch.
How likely is it that we have data that is written but not flushed?
Let's assume the replica is lagging and network bandwidth is not the
bottleneck - how likely is it that this "has to be flushed" rule limits
the prefetching?

Yeah, it would have been nice to include that but it'll have to be for
v15 due to lack of time to convince myself that it was correct. I do
intend to look into more concurrency of that kind for v15. I have
pushed these patches, updated to be disabled by default. I will look
into how I can run a BF animal that has it enabled during the recovery
tests for coverage. Thanks very much to everyone on this thread for
all the discussion and testing so far.

OK, understood. I'll rerun the benchmarks on this version, and if
there's a significant negative impact we can look into that during the
stabilization phase.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#87Justin Pryzby
pryzby@telsasoft.com
In reply to: Thomas Munro (#83)
Re: WIP: WAL prefetch (another approach)

Here are some little language fixes.

BTW, before beginning "recovery", PG syncs all the data dirs.
This can be slow, and it seems like the slowness is frequently due to file
metadata. For example, that's an obvious consequence of an OS crash, after
which the page cache is empty. I've made a habit of running find /zfs -ls |wc
to pre-warm it, which can take a little bit, but then the recovery process
starts moments later. I don't have any timing measurements, but I expect that
starting to stat() all data files as soon as possible would be a win.
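
For what it's worth, the metadata pre-warming pass described above could look
roughly like the following; this is only a sketch under the assumption that a
plain recursive stat() walk of the data directory is enough, and all names in
it are made up rather than anything PostgreSQL provides.

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical sketch: recursively stat() everything under a directory to
 * pull file metadata into the OS cache before recovery starts replaying.
 */
static void
prewarm_metadata(const char *path)
{
	DIR		   *dir = opendir(path);
	struct dirent *de;

	if (dir == NULL)
		return;

	while ((de = readdir(dir)) != NULL)
	{
		char		child[4096];
		struct stat st;

		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;

		snprintf(child, sizeof(child), "%s/%s", path, de->d_name);
		if (stat(child, &st) != 0)
			continue;			/* ignore files unlinked concurrently */
		if (S_ISDIR(st.st_mode))
			prewarm_metadata(child);
	}
	closedir(dir);
}

int
main(int argc, char **argv)
{
	if (argc != 2)
	{
		fprintf(stderr, "usage: %s <data-directory>\n", argv[0]);
		return 1;
	}
	prewarm_metadata(argv[1]);
	return 0;
}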

commit cc9707de333fe8242607cde9f777beadc68dbf04
Author: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Thu Apr 8 10:43:14 2021 -0500

WIP: doc review: Optionally prefetch referenced data in recovery.

1d257577e08d3e598011d6850fd1025858de8c8c

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bc4a8b2279..139dee7aa2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3621,7 +3621,7 @@ include_dir 'conf.d'
         pool after that.  However, on file systems with a block size larger
         than
         <productname>PostgreSQL</productname>'s, prefetching can avoid a
-        costly read-before-write when a blocks are later written.
+        costly read-before-write when blocks are later written.
         The default is off.
        </para>
       </listitem>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 24cf567ee2..36e00c92c2 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -816,9 +816,7 @@
    prefetching mechanism is most likely to be effective on systems
    with <varname>full_page_writes</varname> set to
    <varname>off</varname> (where that is safe), and where the working
-   set is larger than RAM.  By default, prefetching in recovery is enabled
-   on operating systems that have <function>posix_fadvise</function>
-   support.
+   set is larger than RAM.  By default, prefetching in recovery is disabled.
   </para>
  </sect1>
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
index 28764326bc..363c079964 100644
--- a/src/backend/access/transam/xlogprefetch.c
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -31,7 +31,7 @@
  * stall; this is counted with "skip_fpw".
  *
  * The only way we currently have to know that an I/O initiated with
- * PrefetchSharedBuffer() has that recovery will eventually call ReadBuffer(),
+ * PrefetchSharedBuffer() has that recovery will eventually call ReadBuffer(), XXX: what ??
  * and perform a synchronous read.  Therefore, we track the number of
  * potentially in-flight I/Os by using a circular buffer of LSNs.  When it's
  * full, we have to wait for recovery to replay records so that the queue
@@ -660,7 +660,7 @@ XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
 			/*
 			 * I/O has possibly been initiated (though we don't know if it was
 			 * already cached by the kernel, so we just have to assume that it
-			 * has due to lack of better information).  Record this as an I/O
+			 * was due to lack of better information).  Record this as an I/O
 			 * in progress until eventually we replay this LSN.
 			 */
 			XLogPrefetchIncrement(&SharedStats->prefetch);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 090abdad8b..8c72ba1f1a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2774,7 +2774,7 @@ static struct config_int ConfigureNamesInt[] =
 	{
 		{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
 			gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
-			gettext_noop("This controls the maximum distance we can read ahead n the WAL to prefetch referenced blocks."),
+			gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
 			GUC_UNIT_BYTE
 		},
 		&wal_decode_buffer_size,
#88Thomas Munro
thomas.munro@gmail.com
In reply to: Justin Pryzby (#87)
Re: WIP: WAL prefetch (another approach)

On Fri, Apr 9, 2021 at 3:37 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Here's some little language fixes.

Thanks! Done. I rewrote the gibberish comment that made you say
"XXX: what?". Pushed.

BTW, before beginning "recovery", PG syncs all the data dirs.
This can be slow, and it seems like the slowness is frequently due to file
metadata. For example, that's an obvious consequence of an OS crash, after
which the page cache is empty. I've made a habit of running find /zfs -ls |wc
to pre-warm it, which can take a little bit, but then the recovery process
starts moments later. I don't have any timing measurements, but I expect that
starting to stat() all data files as soon as possible would be a win.

Did you see commit 61752afb, "Provide
recovery_init_sync_method=syncfs"? Actually I believe it's safe to
skip that phase completely and do a tiny bit more work during
recovery, which I'd like to work on for v15[1]/messages/by-id/CA+hUKG+8Wm8TSfMWPteMEHfh194RytVTBNoOkggTQT1p5NTY7Q@mail.gmail.com.

[1]: /messages/by-id/CA+hUKG+8Wm8TSfMWPteMEHfh194RytVTBNoOkggTQT1p5NTY7Q@mail.gmail.com

#89Justin Pryzby
pryzby@telsasoft.com
In reply to: Thomas Munro (#88)
Re: WIP: WAL prefetch (another approach)

On Sat, Apr 10, 2021 at 08:27:42AM +1200, Thomas Munro wrote:

On Fri, Apr 9, 2021 at 3:37 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Here's some little language fixes.

Thanks! Done. I rewrote the gibberish comment that made you say
"XXX: what?". Pushed.

BTW, before beginning "recovery", PG syncs all the data dirs.
This can be slow, and it seems like the slowness is frequently due to file
metadata. For example, that's an obvious consequence of an OS crash, after
which the page cache is empty. I've made a habit of running find /zfs -ls |wc
to pre-warm it, which can take a little bit, but then the recovery process
starts moments later. I don't have any timing measurements, but I expect that
starting to stat() all data files as soon as possible would be a win.

Did you see commit 61752afb, "Provide
recovery_init_sync_method=syncfs"? Actually I believe it's safe to
skip that phase completely and do a tiny bit more work during
recovery, which I'd like to work on for v15[1].

Yes, I have it in my list for v14 deployment. Thanks for that.

Did you see this?
/messages/by-id/GV0P278MB0483490FEAC879DCA5ED583DD2739@GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM

I meant to mail you so you could include it in the same commit, but forgot
until now.

--
Justin

#90Thomas Munro
thomas.munro@gmail.com
In reply to: Justin Pryzby (#89)
Re: WIP: WAL prefetch (another approach)

On Sat, Apr 10, 2021 at 8:37 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

Did you see this?
/messages/by-id/GV0P278MB0483490FEAC879DCA5ED583DD2739@GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM

I meant to mail you so you could include it in the same commit, but forgot
until now.

Done, thanks.

#91Shinoda, Noriyoshi (PN Japan FSIP)
noriyoshi.shinoda@hpe.com
In reply to: Thomas Munro (#90)
1 attachment(s)
RE: WIP: WAL prefetch (another approach)

Hi,

Thank you for developing a great feature. I tested this feature and checked the documentation.
Currently, the documentation for the pg_stat_prefetch_recovery view is included in the description for the pg_stat_subscription view.

https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION

It is also not displayed in the list of "28.2. The Statistics Collector".
https://www.postgresql.org/docs/devel/monitoring.html

The attached patch modifies the pg_stat_prefetch_recovery view to appear as a separate view.

Regards,
Noriyoshi Shinoda

-----Original Message-----
From: Thomas Munro [mailto:thomas.munro@gmail.com]
Sent: Saturday, April 10, 2021 5:46 AM
To: Justin Pryzby <pryzby@telsasoft.com>
Cc: Tomas Vondra <tomas.vondra@enterprisedb.com>; Stephen Frost <sfrost@snowman.net>; Andres Freund <andres@anarazel.de>; Jakub Wartak <Jakub.Wartak@tomtom.com>; Alvaro Herrera <alvherre@2ndquadrant.com>; Tomas Vondra <tomas.vondra@2ndquadrant.com>; Dmitry Dolgov <9erthalion6@gmail.com>; David Steele <david@pgmasters.net>; pgsql-hackers <pgsql-hackers@postgresql.org>
Subject: Re: WIP: WAL prefetch (another approach)

On Sat, Apr 10, 2021 at 8:37 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

Did you see this?
/messages/by-id/GV0P278MB0483490FEAC879DCA5ED583DD2739@GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM

I meant to mail you so you could include it in the same commit, but
forgot until now.

Done, thanks.

Attachments:

pg_stat_prefetch_recovery_doc_v1.diffapplication/octet-stream; name=pg_stat_prefetch_recovery_doc_v1.diffDownload
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8287587..fe07f6f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -340,7 +340,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
      <row>
       <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
       <entry>Only one row, showing statistics about blocks prefetched during recovery.
-       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+       See <xref linkend="monitoring-pg-stat-prefetch-recovery-view"/> for details.
       </entry>
      </row>
 
@@ -2910,21 +2910,25 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 
  </sect2>
 
- <sect2 id="monitoring-pg-stat-subscription">
-  <title><structname>pg_stat_subscription</structname></title>
+ <sect2 id="monitoring-pg-stat-prefetch-recovery-view">
+  <title><structname>pg_stat_prefetch_recovery</structname></title>
 
   <indexterm>
-   <primary>pg_stat_subscription</primary>
+   <primary>pg_stat_prefetch_recovery</primary>
   </indexterm>
 
   <para>
-   The <structname>pg_stat_subscription</structname> view will contain one
-   row per subscription for main worker (with null PID if the worker is
-   not running), and additional rows for workers handling the initial data
-   copy of the subscribed tables.
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-recovery-prefetch"/>,
+   <xref linkend="guc-recovery-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
   </para>
 
-  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+  <table id="pg-stat-prefetch-recovery" xreflabel="pg_stat_prefetch_recovery">
    <title><structname>pg_stat_prefetch_recovery</structname> View</title>
    <tgroup cols="3">
     <thead>
@@ -2984,16 +2988,20 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
     </tbody>
    </tgroup>
   </table>
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-subscription">
+  <title><structname>pg_stat_subscription</structname></title>
+
+  <indexterm>
+   <primary>pg_stat_subscription</primary>
+  </indexterm>
 
   <para>
-   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
-   one row.  It is filled with nulls if recovery is not running or WAL
-   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
-   for more information.  The counters in this view are reset whenever the
-   <xref linkend="guc-recovery-prefetch"/>,
-   <xref linkend="guc-recovery-prefetch-fpw"/> or
-   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
-   the server configuration is reloaded.
+   The <structname>pg_stat_subscription</structname> view will contain one
+   row per subscription for main worker (with null PID if the worker is
+   not running), and additional rows for workers handling the initial data
+   copy of the subscribed tables.
   </para>
 
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
#92Amit Kapila
amit.kapila16@gmail.com
In reply to: Thomas Munro (#90)
Re: WIP: WAL prefetch (another approach)

On Sat, Apr 10, 2021 at 2:16 AM Thomas Munro <thomas.munro@gmail.com> wrote:

In commit 1d257577e08d3e598011d6850fd1025858de8c8c, there is a change
in the file format for stats; won't it require bumping
PGSTAT_FILE_FORMAT_ID?

Actually, I came across this while working on today's commit of mine,
f5fc2f5b23, where I forgot to bump PGSTAT_FILE_FORMAT_ID. So I thought
maybe we can bump it just once, if required?

--
With Regards,
Amit Kapila.

#93Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#85)
Re: WIP: WAL prefetch (another approach)

Thomas Munro <thomas.munro@gmail.com> writes:

Yeah, it would have been nice to include that but it'll have to be for
v15 due to lack of time to convince myself that it was correct. I do
intend to look into more concurrency of that kind for v15. I have
pushed these patches, updated to be disabled by default.

I have a fairly bad feeling about these patches. I've already fixed
one critical bug (see 9e4114822), but I am still seeing random, hard
to reproduce failures in WAL replay testing. It looks like sometimes
the "decoded" version of a WAL record doesn't match what I see in
the on-disk data, which I'm having no luck tracing down.

Another interesting failure I just came across is

2021-04-21 11:32:14.280 EDT [14606] LOG: incorrect resource manager data checksum in record at F/438000A4
TRAP: FailedAssertion("state->decoding", File: "xlogreader.c", Line: 845, PID: 14606)
2021-04-21 11:38:23.066 EDT [14603] LOG: startup process (PID 14606) was terminated by signal 6: Abort trap

with stack trace

#0 0x90b669f0 in kill ()
#1 0x90c01bfc in abort ()
#2 0x0057a6a0 in ExceptionalCondition (conditionName=<value temporarily unavailable, due to optimizations>, errorType=<value temporarily unavailable, due to optimizations>, fileName=<value temporarily unavailable, due to optimizations>, lineNumber=<value temporarily unavailable, due to optimizations>) at assert.c:69
#3 0x000f5cf4 in XLogDecodeOneRecord (state=0x1000640, allow_oversized=1 '\001') at xlogreader.c:845
#4 0x000f682c in XLogNextRecord (state=0x1000640, record=0xbfffba38, errormsg=0xbfffba9c) at xlogreader.c:466
#5 0x000f695c in XLogReadRecord (state=<value temporarily unavailable, due to optimizations>, record=0xbfffba98, errormsg=<value temporarily unavailable, due to optimizations>) at xlogreader.c:352
#6 0x000e61a0 in ReadRecord (xlogreader=0x1000640, emode=15, fetching_ckpt=0 '\0') at xlog.c:4398
#7 0x000ea320 in StartupXLOG () at xlog.c:7567
#8 0x00362218 in StartupProcessMain () at startup.c:244
#9 0x000fc170 in AuxiliaryProcessMain (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at bootstrap.c:447
#10 0x0035c740 in StartChildProcess (type=StartupProcess) at postmaster.c:5439
#11 0x00360f4c in PostmasterMain (argc=5, argv=0xa006a0) at postmaster.c:1406
#12 0x0029737c in main (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at main.c:209

I am not sure whether the checksum failure itself is real or a variant
of the seeming bad-reconstruction problem, but what I'm on about right
at this moment is that the error handling logic for this case seems
quite broken. Why is a checksum failure only worthy of a LOG message?
Why is ValidXLogRecord() issuing a log message for itself, rather than
being tied into the report_invalid_record() mechanism? Why are we
evidently still trying to decode records afterwards?

In general, I'm not too pleased with the apparent attitude in this
thread that it's okay to push a patch that only mostly works on the
last day of the dev cycle and plan to stabilize it later.

regards, tom lane

#94Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tom Lane (#93)
Re: WIP: WAL prefetch (another approach)

On 4/21/21 6:30 PM, Tom Lane wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

Yeah, it would have been nice to include that but it'll have to be for
v15 due to lack of time to convince myself that it was correct. I do
intend to look into more concurrency of that kind for v15. I have
pushed these patches, updated to be disabled by default.

I have a fairly bad feeling about these patches. I've already fixed
one critical bug (see 9e4114822), but I am still seeing random, hard
to reproduce failures in WAL replay testing. It looks like sometimes
the "decoded" version of a WAL record doesn't match what I see in
the on-disk data, which I'm having no luck tracing down.

Another interesting failure I just came across is

2021-04-21 11:32:14.280 EDT [14606] LOG: incorrect resource manager data checksum in record at F/438000A4
TRAP: FailedAssertion("state->decoding", File: "xlogreader.c", Line: 845, PID: 14606)
2021-04-21 11:38:23.066 EDT [14603] LOG: startup process (PID 14606) was terminated by signal 6: Abort trap

with stack trace

#0 0x90b669f0 in kill ()
#1 0x90c01bfc in abort ()
#2 0x0057a6a0 in ExceptionalCondition (conditionName=<value temporarily unavailable, due to optimizations>, errorType=<value temporarily unavailable, due to optimizations>, fileName=<value temporarily unavailable, due to optimizations>, lineNumber=<value temporarily unavailable, due to optimizations>) at assert.c:69
#3 0x000f5cf4 in XLogDecodeOneRecord (state=0x1000640, allow_oversized=1 '\001') at xlogreader.c:845
#4 0x000f682c in XLogNextRecord (state=0x1000640, record=0xbfffba38, errormsg=0xbfffba9c) at xlogreader.c:466
#5 0x000f695c in XLogReadRecord (state=<value temporarily unavailable, due to optimizations>, record=0xbfffba98, errormsg=<value temporarily unavailable, due to optimizations>) at xlogreader.c:352
#6 0x000e61a0 in ReadRecord (xlogreader=0x1000640, emode=15, fetching_ckpt=0 '\0') at xlog.c:4398
#7 0x000ea320 in StartupXLOG () at xlog.c:7567
#8 0x00362218 in StartupProcessMain () at startup.c:244
#9 0x000fc170 in AuxiliaryProcessMain (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at bootstrap.c:447
#10 0x0035c740 in StartChildProcess (type=StartupProcess) at postmaster.c:5439
#11 0x00360f4c in PostmasterMain (argc=5, argv=0xa006a0) at postmaster.c:1406
#12 0x0029737c in main (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at main.c:209

I am not sure whether the checksum failure itself is real or a variant
of the seeming bad-reconstruction problem, but what I'm on about right
at this moment is that the error handling logic for this case seems
quite broken. Why is a checksum failure only worthy of a LOG message?
Why is ValidXLogRecord() issuing a log message for itself, rather than
being tied into the report_invalid_record() mechanism? Why are we
evidently still trying to decode records afterwards?

Yeah, that seems suspicious.

In general, I'm not too pleased with the apparent attitude in this
thread that it's okay to push a patch that only mostly works on the
last day of the dev cycle and plan to stabilize it later.

Was there such an attitude? I don't think people were arguing for pushing
a patch that's not working correctly. The discussion was mostly about
getting it committed and leaving some optimizations for v15.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#95Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#94)
Re: WIP: WAL prefetch (another approach)

On Thu, Apr 22, 2021 at 8:07 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 4/21/21 6:30 PM, Tom Lane wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

Yeah, it would have been nice to include that but it'll have to be for
v15 due to lack of time to convince myself that it was correct. I do
intend to look into more concurrency of that kind for v15. I have
pushed these patches, updated to be disabled by default.

I have a fairly bad feeling about these patches. I've already fixed
one critical bug (see 9e4114822), but I am still seeing random, hard
to reproduce failures in WAL replay testing. It looks like sometimes
the "decoded" version of a WAL record doesn't match what I see in
the on-disk data, which I'm having no luck tracing down.

Ugh. Looking into this now. Also, this week I have been researching
a possible problem with eg ALTER TABLE SET TABLESPACE in the higher
level patch, which I'll write about soon.

I am not sure whether the checksum failure itself is real or a variant
of the seeming bad-reconstruction problem, but what I'm on about right
at this moment is that the error handling logic for this case seems
quite broken. Why is a checksum failure only worthy of a LOG message?
Why is ValidXLogRecord() issuing a log message for itself, rather than
being tied into the report_invalid_record() mechanism? Why are we
evidently still trying to decode records afterwards?

Yeah, that seems suspicious.

I may have invited trouble by deciding to rebase on the other proposal
late in the cycle. That work interfaces with the code around there.

In general, I'm not too pleased with the apparent attitude in this
thread that it's okay to push a patch that only mostly works on the
last day of the dev cycle and plan to stabilize it later.

Was there such attitude? I don't think people were arguing for pushing a
patch's not working correctly. The discussion was mostly about getting
it committed even and leaving some optimizations for v15.

That wasn't my plan, but I admit that the timing was non-ideal. In
any case, I'll dig into these failures and then consider options.
More soon.

#96Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#95)
Re: WIP: WAL prefetch (another approach)

On Thu, Apr 22, 2021 at 8:16 AM Thomas Munro <thomas.munro@gmail.com> wrote:

That wasn't my plan, but I admit that the timing was non-ideal. In
any case, I'll dig into these failures and then consider options.
More soon.

Yeah, this clearly needs more work. xlogreader.c is difficult to work
with and I think we need to keep trying to improve it, but I made a
bad call here trying to combine this with other refactoring work up
against a deadline and I made some dumb mistakes. I could of course
debug it in-tree, and I know that this has been an anticipated
feature. Personally I think the right thing to do now is to revert it
and re-propose for 15 early in the cycle, supported with some better
testing infrastructure.

#97Stephen Frost
sfrost@snowman.net
In reply to: Thomas Munro (#96)
Re: WIP: WAL prefetch (another approach)

Greetings,

On Wed, Apr 21, 2021 at 19:17 Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Apr 22, 2021 at 8:16 AM Thomas Munro <thomas.munro@gmail.com>
wrote:

That wasn't my plan, but I admit that the timing was non-ideal. In
any case, I'll dig into these failures and then consider options.
More soon.

Yeah, this clearly needs more work. xlogreader.c is difficult to work
with and I think we need to keep trying to improve it, but I made a
bad call here trying to combine this with other refactoring work up
against a deadline and I made some dumb mistakes. I could of course
debug it in-tree, and I know that this has been an anticipated
feature. Personally I think the right thing to do now is to revert it
and re-propose for 15 early in the cycle, supported with some better
testing infrastructure.

I tend to agree with the idea to revert it, perhaps a +0 on that, but if
others argue it should be fixed in-place, I wouldn’t complain about it.

I very much encourage the idea of improving testing in this area and would
be happy to try and help do so in the 15 cycle.

Thanks,

Stephen

#98Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#97)
Re: WIP: WAL prefetch (another approach)

Stephen Frost <sfrost@snowman.net> writes:

On Wed, Apr 21, 2021 at 19:17 Thomas Munro <thomas.munro@gmail.com> wrote:

... Personally I think the right thing to do now is to revert it
and re-propose for 15 early in the cycle, supported with some better
testing infrastructure.

I tend to agree with the idea to revert it, perhaps a +0 on that, but if
others argue it should be fixed in-place, I wouldn’t complain about it.

FWIW, I've so far only been able to see problems on two old PPC Macs,
one of which has been known to be a bit flaky in the past. So it's
possible that what I'm looking at is a hardware glitch. But it's
consistent enough that I rather doubt that.

What I'm doing is running the core regression tests with a single
standby (on the same machine) and wal_consistency_checking = all.
Fairly reproducibly (more than one run in ten), what I get on the
slightly-flaky machine is consistency check failures like

2021-04-21 17:42:56.324 EDT [42286] PANIC: inconsistent page found, rel 1663/354383/357033, forknum 0, blkno 9, byte offset 2069: replay 0x00 primary 0x03
2021-04-21 17:42:56.324 EDT [42286] CONTEXT: WAL redo at 24/121C97B0 for Heap/INSERT: off 107 flags 0x00; blkref #0: rel 1663/354383/357033, blk 9 FPW
2021-04-21 17:45:11.662 EDT [42284] LOG: startup process (PID 42286) was terminated by signal 6: Abort trap

2021-04-21 11:25:30.091 EDT [38891] PANIC: inconsistent page found, rel 1663/229880/237980, forknum 0, blkno 108, byte offset 3845: replay 0x00 primary 0x99
2021-04-21 11:25:30.091 EDT [38891] CONTEXT: WAL redo at 17/A99897FC for SPGist/ADD_LEAF: add leaf to page; off 241; headoff 171; parentoff 0; blkref #0: rel 1663/229880/237980, blk 108 FPW
2021-04-21 11:26:59.371 EDT [38889] LOG: startup process (PID 38891) was terminated by signal 6: Abort trap

2021-04-20 19:20:16.114 EDT [34405] PANIC: inconsistent page found, rel 1663/189216/197311, forknum 0, blkno 115, byte offset 6149: replay 0x37 primary 0x03
2021-04-20 19:20:16.114 EDT [34405] CONTEXT: WAL redo at 13/3CBFED00 for SPGist/ADD_LEAF: add leaf to page; off 241; headoff 171; parentoff 0; blkref #0: rel 1663/189216/197311, blk 115 FPW
2021-04-20 19:21:54.421 EDT [34403] LOG: startup process (PID 34405) was terminated by signal 6: Abort trap

2021-04-20 17:44:09.356 EDT [24106] FATAL: inconsistent page found, rel 1663/135419/143843, forknum 0, blkno 101, byte offset 6152: replay 0x40 primary 0x00
2021-04-20 17:44:09.356 EDT [24106] CONTEXT: WAL redo at D/5107D8A8 for Gist/PAGE_UPDATE: ; blkref #0: rel 1663/135419/143843, blk 101 FPW

(Note I modified checkXLogConsistency to PANIC on failure, so I could get
a core dump to analyze; and it's also printing the first-mismatch location.)

I have not analyzed each one of these failures exhaustively, but on the
ones I have looked at closely, the replay_image_masked version of the page
appears correct while the primary_image_masked version is *not*.
Moreover, the primary_image_masked version does not match the full-page
image that I see in the on-disk WAL file. It did however seem to match
the in-memory WAL record contents that the decoder is operating on.
So unless you want to believe the buggy-hardware theory, something's
occasionally messing up while loading WAL records from disk. All of the
trouble cases involve records that span across WAL pages (unsurprising
since they contain FPIs), so maybe there's something not quite right
in there.

In the cases that I looked at closely, it appeared that there was a
block of 32 wrong bytes somewhere within the page image, with the data
before and after that being correct. I'm not sure if that pattern
holds in all cases though.

BTW, if I restart the failed standby, it plows through the same data
just fine, confirming that the on-disk WAL is not corrupt.

The other PPC machine (with no known history of trouble) is the one
that had the CRC failure I showed earlier. That one does seem to be
actual bad data in the stored WAL, because the problem was also seen
by pg_waldump, and trying to restart the standby got the same failure
again. I've not been able to duplicate the consistency-check failures
there. But because that machine is a laptop with a much inferior disk
drive, the speeds are enough different that it's not real surprising
if it doesn't hit the same problem.

I've also tried to reproduce on 32-bit and 64-bit Intel, without
success. So if this is real, maybe it's related to being big-endian
hardware? But it's also quite sensitive to $dunno-what, maybe the
history of WAL records that have already been replayed.

regards, tom lane

#99Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#98)
Re: WIP: WAL prefetch (another approach)

Hi,

On 2021-04-21 21:21:05 -0400, Tom Lane wrote:

What I'm doing is running the core regression tests with a single
standby (on the same machine) and wal_consistency_checking = all.

Do you run them over replication, or sequentially by storing data into
an archive? Just curious, because it's so painful to run that scenario in
the replication case due to the tablespaces conflicting between
primary/standby, unless one disables the tablespace tests.

The other PPC machine (with no known history of trouble) is the one
that had the CRC failure I showed earlier. That one does seem to be
actual bad data in the stored WAL, because the problem was also seen
by pg_waldump, and trying to restart the standby got the same failure
again.

It seems like that could also indicate an xlogreader bug that is
reliably hit? Once it gets confused about record lengths or such I'd
expect CRC failures...

If it were actually wrong WAL contents I don't think any of the
xlogreader / prefetching changes could be responsible...

Have you tried reproducing it on commits before the recent xlogreader
changes?

commit 1d257577e08d3e598011d6850fd1025858de8c8c
Author: Thomas Munro <tmunro@postgresql.org>
Date: 2021-04-08 23:03:43 +1200

Optionally prefetch referenced data in recovery.

commit f003d9f8721b3249e4aec8a1946034579d40d42c
Author: Thomas Munro <tmunro@postgresql.org>
Date: 2021-04-08 23:03:34 +1200

Add circular WAL decoding buffer.

Discussion: /messages/by-id/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com

commit 323cbe7c7ddcf18aaf24b7f6d682a45a61d4e31b
Author: Thomas Munro <tmunro@postgresql.org>
Date: 2021-04-08 23:03:23 +1200

Remove read_page callback from XLogReader.

Trying 323cbe7c7ddcf18aaf24b7f6d682a45a61d4e31b^ is probably the most
interesting bit.

I've not been able to duplicate the consistency-check failures
there. But because that machine is a laptop with a much inferior disk
drive, the speeds are enough different that it's not real surprising
if it doesn't hit the same problem.

I've also tried to reproduce on 32-bit and 64-bit Intel, without
success. So if this is real, maybe it's related to being big-endian
hardware? But it's also quite sensitive to $dunno-what, maybe the
history of WAL records that have already been replayed.

It might just be disk speed influencing how long the tests take, which
in turn increases the number of checkpoints during the test,
increasing the number of FPIs?

Greetings,

Andres Freund

#100Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#98)
Re: WIP: WAL prefetch (another approach)

On Thu, Apr 22, 2021 at 1:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I've also tried to reproduce on 32-bit and 64-bit Intel, without
success. So if this is real, maybe it's related to being big-endian
hardware? But it's also quite sensitive to $dunno-what, maybe the
history of WAL records that have already been replayed.

Ah, that's interesting. There are a couple of sparc64 failures and a
ppc64 failure in the build farm, but I couldn't immediately spot what
was wrong with them or whether it might be related to this stuff.

Thanks for the clues. I'll see what unusual systems I can find to try
this on....

#101Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#99)
Re: WIP: WAL prefetch (another approach)

Andres Freund <andres@anarazel.de> writes:

On 2021-04-21 21:21:05 -0400, Tom Lane wrote:

What I'm doing is running the core regression tests with a single
standby (on the same machine) and wal_consistency_checking = all.

Do you run them over replication, or sequentially by storing data into
an archive? Just curious, because its so painful to run that scenario in
the replication case due to the tablespace conflicting between
primary/standby, unless one disables the tablespace tests.

No, live over replication. I've been skipping the tablespace test.

Have you tried reproducing it on commits before the recent xlogreader
changes?

Nope.

regards, tom lane

#102Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#100)
Re: WIP: WAL prefetch (another approach)

Hi,

On 2021-04-22 13:59:58 +1200, Thomas Munro wrote:

On Thu, Apr 22, 2021 at 1:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I've also tried to reproduce on 32-bit and 64-bit Intel, without
success. So if this is real, maybe it's related to being big-endian
hardware? But it's also quite sensitive to $dunno-what, maybe the
history of WAL records that have already been replayed.

Ah, that's interesting. There are a couple of sparc64 failures and a
ppc64 failure in the build farm, but I couldn't immediately spot what
was wrong with them or whether it might be related to this stuff.

Thanks for the clues. I'll see what unusual systems I can find to try
this on....

FWIW, I've run 32 and 64 bit x86 through several hundred regression
cycles, without hitting an issue. For a lot of them I set
checkpoint_timeout to a lower value as I thought that might make it more
likely to reproduce an issue.

Tom, any chance you could check if your machine repros the issue before
these commits?

Greetings,

Andres Freund

#103Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#102)
Re: WIP: WAL prefetch (another approach)

Andres Freund <andres@anarazel.de> writes:

Tom, any chance you could check if your machine repros the issue before
these commits?

Wilco, but it'll likely take a little while to get results ...

regards, tom lane

#104Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#103)
Re: WIP: WAL prefetch (another approach)

On Thu, Apr 29, 2021 at 4:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

Tom, any chance you could check if your machine repros the issue before
these commits?

Wilco, but it'll likely take a little while to get results ...

FWIW I also chewed through many megawatts trying to reproduce this on
a PowerPC system in 64 bit big endian mode, with an emulator. No
cigar. However, it's so slow that I didn't make it to 10 runs...

#105Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#104)
Re: WIP: WAL prefetch (another approach)

Thomas Munro <thomas.munro@gmail.com> writes:

FWIW I also chewed through many megawatts trying to reproduce this on
a PowerPC system in 64 bit big endian mode, with an emulator. No
cigar. However, it's so slow that I didn't make it to 10 runs...

Speaking of megawatts ... my G4 has now finished about ten cycles of
installcheck-parallel without a failure, which isn't really enough
to draw any conclusions yet. But I happened to notice the
accumulated CPU time for the background processes:

USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
tgl 19048 0.0 4.4 229952 92196 ?? Ss 3:19PM 19:59.19 postgres: startup recovering 000000010000001400000022
tgl 19051 0.0 0.1 229656 1696 ?? Ss 3:19PM 27:09.14 postgres: walreceiver streaming 14/227D8F14
tgl 19052 0.0 0.1 229904 2516 ?? Ss 3:19PM 17:38.17 postgres: walsender tgl [local] streaming 14/227D8F14

IOW, we've spent over twice as many CPU cycles shipping data to the
standby as we did in applying the WAL on the standby. Is this
expected? I've got wal_consistency_checking = all, which is bloating
the WAL volume quite a bit, but still it seems like the walsender and
walreceiver have little excuse for spending more cycles per byte
than the startup process.

(This is testing b3ee4c503, so if Thomas' WAL changes improved
efficiency of the replay process at all, the discrepancy could be
even worse in HEAD.)

regards, tom lane

#106Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#105)
Re: WIP: WAL prefetch (another approach)

Hi,

On 2021-04-28 19:24:53 -0400, Tom Lane wrote:

But I happened to notice the accumulated CPU time for the background
processes:

USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
tgl 19048 0.0 4.4 229952 92196 ?? Ss 3:19PM 19:59.19 postgres: startup recovering 000000010000001400000022
tgl 19051 0.0 0.1 229656 1696 ?? Ss 3:19PM 27:09.14 postgres: walreceiver streaming 14/227D8F14
tgl 19052 0.0 0.1 229904 2516 ?? Ss 3:19PM 17:38.17 postgres: walsender tgl [local] streaming 14/227D8F14

IOW, we've spent over twice as many CPU cycles shipping data to the
standby as we did in applying the WAL on the standby. Is this
expected? I've got wal_consistency_checking = all, which is bloating
the WAL volume quite a bit, but still it seems like the walsender and
walreceiver have little excuse for spending more cycles per byte
than the startup process.

I don't really know how the time calculation works on mac. Is there a
chance it includes time spent doing IO? On the primary the WAL IO is
done by a lot of backends, but on the standby it's all going to be the
walreceiver. And the walreceiver does fsyncs in a not particularly
efficient manner.

FWIW, on my linux workstation no such difference is visible:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
andres 2910540 9.4 0.0 2237252 126680 ? Ss 16:55 0:20 postgres: dev assert standby: startup recovering 00000001000000020000003F
andres 2910544 5.2 0.0 2236724 9260 ? Ss 16:55 0:11 postgres: dev assert standby: walreceiver streaming 2/3FDCF118
andres 2910545 2.1 0.0 2237036 10672 ? Ss 16:55 0:04 postgres: dev assert: walsender andres [local] streaming 2/3FDCF118

(This is testing b3ee4c503, so if Thomas' WAL changes improved
efficiency of the replay process at all, the discrepancy could be
even worse in HEAD.)

The prefetching isn't enabled by default, so I'd not expect meaningful
differences... And even with the prefetching enabled, our normal
regression tests are largely resident in s_b, so there shouldn't be much
prefetching.

Oh! I was about to ask how much shared buffers your primary / standby
have. And I think I may actually have reproduced a variant of the issue!

I previously had played around with different settings that I thought
might increase the likelihood of reproducing the problem. But this time
I set shared_buffers lower than before, and got:

2021-04-28 17:03:22.174 PDT [2913840][] LOG: database system was shut down in recovery at 2021-04-28 17:03:11 PDT
2021-04-28 17:03:22.174 PDT [2913840][] LOG: entering standby mode
2021-04-28 17:03:22.178 PDT [2913840][1/0] LOG: redo starts at 2/416C6278
2021-04-28 17:03:37.628 PDT [2913840][1/0] LOG: consistent recovery state reached at 4/7F5C3200
2021-04-28 17:03:37.628 PDT [2913840][1/0] FATAL: invalid memory alloc request size 3053455757
2021-04-28 17:03:37.628 PDT [2913839][] LOG: database system is ready to accept read only connections
2021-04-28 17:03:37.636 PDT [2913839][] LOG: startup process (PID 2913840) exited with exit code 1

This reproduces across restarts. Yay, I guess.

Isn't it odd that we get a "database system is ready to accept read only
connections"?

Greetings,

Andres Freund

#107Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#106)
Re: WIP: WAL prefetch (another approach)

Andres Freund <andres@anarazel.de> writes:

On 2021-04-28 19:24:53 -0400, Tom Lane wrote:

IOW, we've spent over twice as many CPU cycles shipping data to the
standby as we did in applying the WAL on the standby.

I don't really know how the time calculation works on mac. Is there a
chance it includes time spent doing IO?

I'd be pretty astonished if it did. This is basically a NetBSD system
remember (in fact, this ancient macOS release is a good deal closer
to those roots than modern versions). BSDen have never accounted for
time that way AFAIK. Also, the "ps" man page says specifically that
that column is CPU time.

Oh! I was about to ask how much shared buffers your primary / standby
have. And I think I may actually have reproduced a variant of the issue!

Default configurations, so 128MB each.

regards, tom lane

#108Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#107)
Re: WIP: WAL prefetch (another approach)

Hi,

On 2021-04-28 20:24:43 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Oh! I was about to ask how much shared buffers your primary / standby
have.

Default configurations, so 128MB each.

I thought that possibly initdb would detect less or something...

I assume this is 32bit? I did notice that a 32bit test took a lot longer
than a 64bit test. But didn't investigate so far.

And I think I may actually have reproduced a variant of the issue!

Unfortunately I had not set up things in a way that the primary retains
the WAL, making it harder to compare whether it's the WAL that got
corrupted or whether it's a decoding bug.

I can however say that pg_waldump on the standby's pg_wal does also
fail. The failure as part of the backend is "invalid memory alloc
request size", whereas in pg_waldump I get the much more helpful:
pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect prev-link 416200FF/FF000000 at 4/7F5C3200

In frontend code that allocation actually succeeds, because there is no
size check. But in backend code we run into the size check, and thus
don't even display a useful error.

In 13 the header is validated before allocating space for the
record (except if the header is spread across pages) - it seems inadvisable
to turn that around?
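To make that size-check asymmetry concrete, here is a minimal standalone sketch (not PostgreSQL code; only the MaxAllocSize limit and the bogus length from the log above are taken from this thread):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MaxAllocSize ((size_t) 0x3fffffff)	/* just under 1GB, as in the backend */

int
main(void)
{
	uint32_t	xl_tot_len = 3053455757U;	/* garbage length read past end of WAL */

	/* Backend-style palloc(): implausible sizes are rejected outright. */
	if (xl_tot_len > MaxAllocSize)
		fprintf(stderr, "invalid memory alloc request size %u\n", xl_tot_len);

	/* Frontend-style pg_malloc(): no such guard, so we go straight to malloc(). */
	void	   *buf = malloc(xl_tot_len);

	if (buf == NULL)
		fprintf(stderr, "out of memory\n");
	free(buf);
	return 0;
}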

Greetings,

Andres Freund

#109Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#108)
1 attachment(s)
Re: WIP: WAL prefetch (another approach)

Hi,

On 2021-04-28 17:59:22 -0700, Andres Freund wrote:

I can however say that pg_waldump on the standby's pg_wal does also
fail. The failure as part of the backend is "invalid memory alloc
request size", whereas in pg_waldump I get the much more helpful:
pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect prev-link 416200FF/FF000000 at 4/7F5C3200

There's definitely something broken around continuation records, in
XLogFindNextRecord(). Which means that it's not the cause for the server
side issue, but obviously still not good.

The conversion of XLogFindNextRecord() to be state machine based
basically only works in a narrow set of circumstances. Whenever the end
of the first record read is on a different page than the start of the
record, we'll endlessly loop.

We'll go into XLogFindNextRecord() and keep returning until we've
successfully read the page header. Then we'll enter the second loop,
which will try to read until the end of the first record. But after
returning, the first loop will again ask for the page header.

Even if that's fixed, the second loop alone has the same problem: As
XLogBeginRead() is called unconditionally, we'll start reading at the
start of the record, discover that it needs data on a second page,
return, and do the same thing again.

I think it needs something roughly like the attached.

Greetings,

Andres Freund

Attachments:

fix-xlogfindnext.diff
diff --git i/src/include/access/xlogreader.h w/src/include/access/xlogreader.h
index 3b8af31a8fe..82a80cf2bf5 100644
--- i/src/include/access/xlogreader.h
+++ w/src/include/access/xlogreader.h
@@ -297,6 +297,7 @@ struct XLogFindNextRecordState
 	XLogReaderState *reader_state;
 	XLogRecPtr		targetRecPtr;
 	XLogRecPtr		currRecPtr;
+	bool			found_start;
 };
 
 /* Report that data is available for decoding. */
diff --git i/src/backend/access/transam/xlogreader.c w/src/backend/access/transam/xlogreader.c
index 4277e92d7c9..935c841347f 100644
--- i/src/backend/access/transam/xlogreader.c
+++ w/src/backend/access/transam/xlogreader.c
@@ -868,7 +868,7 @@ XLogDecodeOneRecord(XLogReaderState *state, bool allow_oversized)
 				/* validate record header if not yet */
 				if (!state->record_verified && record_len >= SizeOfXLogRecord)
 				{
-				if (!ValidXLogRecordHeader(state, state->DecodeRecPtr,
+					if (!ValidXLogRecordHeader(state, state->DecodeRecPtr,
 											   state->PrevRecPtr, prec))
 						goto err;
 
@@ -1516,6 +1516,7 @@ InitXLogFindNextRecord(XLogReaderState *reader_state, XLogRecPtr start_ptr)
 	state->reader_state = reader_state;
 	state->targetRecPtr = start_ptr;
 	state->currRecPtr = start_ptr;
+	state->found_start = false;
 
 	return state;
 }
@@ -1545,7 +1546,7 @@ XLogFindNextRecord(XLogFindNextRecordState *state)
 	 * skip over potential continuation data, keeping in mind that it may span
 	 * multiple pages
 	 */
-	while (true)
+	while (!state->found_start)
 	{
 		XLogRecPtr	targetPagePtr;
 		int			targetRecOff;
@@ -1616,7 +1617,12 @@ XLogFindNextRecord(XLogFindNextRecordState *state)
 	 * because either we're at the first record after the beginning of a page
 	 * or we just jumped over the remaining data of a continuation.
 	 */
-	XLogBeginRead(state->reader_state, state->currRecPtr);
+	if (!state->found_start)
+	{
+		XLogBeginRead(state->reader_state, state->currRecPtr);
+		state->found_start = true;
+	}
+
 	while ((result = XLogReadRecord(state->reader_state, &record, &errormsg)) !=
 		   XLREAD_FAIL)
 	{
#110Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#108)
Re: WIP: WAL prefetch (another approach)

Hi,

On 2021-04-28 17:59:22 -0700, Andres Freund wrote:

I can however say that pg_waldump on the standby's pg_wal does also
fail. The failure as part of the backend is "invalid memory alloc
request size", whereas in pg_waldump I get the much more helpful:
pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect prev-link 416200FF/FF000000 at 4/7F5C3200

In frontend code that allocation actually succeeds, because there is no
size check. But in backend code we run into the size check, and thus
don't even display a useful error.

In 13 the header is validated before allocating space for the
record (except if the header is spread across pages) - it seems inadvisable
to turn that around?

I was now able to reproduce the problem again, and I'm afraid that the
bug I hit is likely separate from Tom's. The allocation thing above is
the issue in my case:

The walsender connection ended (I restarted the primary), thus the
startup process switches to replaying locally. For some reason the end of
the WAL contains non-zero data (I think it's because walreceiver doesn't
zero out pages - that's bad!). Because the allocation happens before the
header is validated, we reproducibly end up in the mcxt.c ERROR path,
failing recovery.

To me it looks like a smaller version of the problem is present in < 14,
albeit only when the page header is at a record boundary. In that case
we don't validate the page header immediately, only once it's completely
read. But we do believe the total size, and try to allocate
that.

There's a really crufty escape hatch (from 70b4f82a4b) to that:

/*
* Note that in much unlucky circumstances, the random data read from a
* recycled segment can cause this routine to be called with a size
* causing a hard failure at allocation. For a standby, this would cause
* the instance to stop suddenly with a hard failure, preventing it to
* retry fetching WAL from one of its sources which could allow it to move
* on with replay without a manual restart. If the data comes from a past
* recycled segment and is still valid, then the allocation may succeed
* but record checks are going to fail so this would be short-lived. If
* the allocation fails because of a memory shortage, then this is not a
* hard failure either per the guarantee given by MCXT_ALLOC_NO_OOM.
*/
if (!AllocSizeIsValid(newSize))
return false;

but it looks to me like that's pretty much the wrong fix, at least in
the case where we've not yet validated the rest of the header. We don't
need to allocate all that data before we've read the rest of the
*fixed-size* header.
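A rough sketch of the reordering being suggested, with hypothetical types and names (the real XLogRecord and xlogreader code look different): read the whole fixed-size header first, even if the caller has to assemble it across a page boundary, sanity-check it, and only then size the record buffer from xl_tot_len.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MaxAllocSize ((size_t) 0x3fffffff)

/* Hypothetical stand-in for the fixed-size part of a WAL record header. */
typedef struct
{
	uint32_t	xl_tot_len;		/* total record length, header included */
	uint64_t	xl_prev;		/* start LSN of the previous record */
	uint8_t		xl_info;
	uint8_t		xl_rmid;
	uint32_t	xl_crc;
} RecordHeader;

/*
 * Only allocate once the complete header has been read and has passed basic
 * checks; a bogus length from a recycled page never reaches the allocator.
 */
static char *
allocate_record_buffer(const RecordHeader *hdr, uint64_t expected_prev)
{
	if (hdr->xl_tot_len < sizeof(RecordHeader) ||
		hdr->xl_tot_len > MaxAllocSize ||
		hdr->xl_prev != expected_prev)
		return NULL;			/* report "incorrect prev-link" etc. instead */

	return malloc(hdr->xl_tot_len);	/* length has been vetted */
}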

It also seems to me that 70b4f82a4b should also have changed walsender
to pad out the received data to an 8KB boundary?

Greetings,

Andres Freund

#111Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#110)
Re: WIP: WAL prefetch (another approach)

Andres Freund <andres@anarazel.de> writes:

I was now able to reproduce the problem again, and I'm afraid that the
bug I hit is likely separate from Tom's.

Yeah, I think so --- the symptoms seem quite distinct.

My score so far today on the G4 is:

12 error-free regression test cycles on b3ee4c503

(plus one more with shared_buffers set to 16MB, on the strength
of your previous hunch --- didn't fail for me though)

HEAD failed on the second run with the same symptom as before:

2021-04-28 22:57:17.048 EDT [50479] FATAL: inconsistent page found, rel 1663/58183/69545, forknum 0, blkno 696
2021-04-28 22:57:17.048 EDT [50479] CONTEXT: WAL redo at 4/B72D408 for Heap/INSERT: off 77 flags 0x00; blkref #0: rel 1663/58183/69545, blk 696 FPW

This seems to me to be pretty strong evidence that I'm seeing *something*
real. I'm currently trying to isolate a specific commit to pin it on.
A straight "git bisect" isn't going to work because so many people had
broken so many different things right around that date :-(, so it may
take a while to get a good answer. (But, if anyone has ideas of specific

regards, tom lane

#112Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#110)
Re: WIP: WAL prefetch (another approach)

On Thu, Apr 29, 2021 at 3:14 PM Andres Freund <andres@anarazel.de> wrote:

To me it looks like a smaller version of the problem is present in < 14,
albeit only when the page header is at a record boundary. In that case
we don't validate the page header immediately, only once it's completely
read. But we do believe the total size, and try to allocate
that.

There's a really crufty escape hatch (from 70b4f82a4b) to that:

Right, I made that problem worse, and that could probably be changed
to be no worse than 13 by reordering those operations.

PS Sorry for my intermittent/slow responses on this thread this week,
as I'm mostly away from the keyboard due to personal commitments.
I'll be back in the saddle next week to tidy this up, most likely by
reverting. The main thought I've been having about this whole area is
that, aside from the lack of general testing of recovery, which we
should definitely address[1]/messages/by-id/CA+hUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk+Heskxg@mail.gmail.com, what it really needs is a decent test
harness to drive it through all interesting scenarios and states at a
lower level, independently.

[1]: /messages/by-id/CA+hUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk+Heskxg@mail.gmail.com

#113Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#104)
Re: WIP: WAL prefetch (another approach)

Thomas Munro <thomas.munro@gmail.com> writes:

On Thu, Apr 29, 2021 at 4:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

Tom, any chance you could check if your machine repros the issue before
these commits?

Wilco, but it'll likely take a little while to get results ...

FWIW I also chewed through many megawatts trying to reproduce this on
a PowerPC system in 64 bit big endian mode, with an emulator. No
cigar. However, it's so slow that I didn't make it to 10 runs...

So I've expended a lot of kilowatt-hours over the past several days,
and I've got results that are interesting but don't really get us
any closer to a resolution.

To recap, the test lashup is:
* 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive)
* Standard debug build (--enable-debug --enable-cassert)
* Out-of-the-box configuration, except add wal_consistency_checking = all
and configure a wal-streaming standby on the same machine
* Repeatedly run "make installcheck-parallel", but skip the tablespace
test to avoid issues with the standby trying to use the same directory
* Delay long enough after each installcheck-parallel to let the
standby catch up (the run proper is ~24 min, plus 2 min for catchup)

The failures I'm seeing generally look like

2021-05-01 15:33:10.968 EDT [8281] FATAL: inconsistent page found, rel 1663/58186/66338, forknum 0, blkno 19
2021-05-01 15:33:10.968 EDT [8281] CONTEXT: WAL redo at 3/4CE905B8 for Gist/PAGE_UPDATE: ; blkref #0: rel 1663/58186/66338, blk 19 FPW

with a variety of WAL record types being named, so it doesn't seem
to be specific to any particular record type. I've twice gotten the
bogus-checksum-and-then-assertion-failure I reported before:

2021-05-01 17:07:52.992 EDT [17464] LOG: incorrect resource manager data checksum in record at 3/E0073EA4
TRAP: FailedAssertion("state->recordRemainLen > 0", File: "xlogreader.c", Line: 567, PID: 17464)

In both of those cases, the WAL on disk was perfectly fine, and the same
is true of most of the "inconsistent page" complaints. So the issue
definitely seems to be about the startup process mis-reading data that
was correctly shipped over.

Anyway, the new and interesting data concerns the relative failure rates
of different builds:

* Recent HEAD (from 4-28 and 5-1): 4 failures in 8 test cycles

* Reverting 1d257577e: 1 failure in 8 test cycles

* Reverting 1d257577e and f003d9f87: 3 failures in 28 cycles

* Reverting 1d257577e, f003d9f87, and 323cbe7c7: 2 failures in 93 cycles

That last point means that there was some hard-to-hit problem even
before any of the recent WAL-related changes. However, 323cbe7c7
(Remove read_page callback from XLogReader) increased the failure
rate by at least a factor of 5, and 1d257577e (Optionally prefetch
referenced data) seems to have increased it by another factor of 4.
But it looks like f003d9f87 (Add circular WAL decoding buffer)
didn't materially change the failure rate.

Considering that 323cbe7c7 was supposed to be just refactoring,
and 1d257577e is allegedly disabled-by-default, these are surely
not the results I was expecting to get.

It seems like it's still an open question whether all this is
a real bug, or flaky hardware. I have seen occasional kernel
freezeups (or so I think -- machine stops responding to keyboard
or network input) over the past year or two, so I cannot in good
conscience rule out the flaky-hardware theory. But it doesn't
smell like that kind of problem to me. I think what we're looking
at is a timing-sensitive bug that was there before (maybe long
before?) and these commits happened to make it occur more often
on this particular hardware. This hardware is enough unlike
anything made in the past decade that it's not hard to credit
that it'd show a timing problem that nobody else can reproduce.

(I did try the time-honored ritual of reseating all the machine's
RAM, partway through this. Doesn't seem to have changed anything.)

Anyway, I'm not sure where to go from here. I'm for sure nowhere
near being able to identify the bug --- and if there really is
a bug that formerly had a one-in-fifty reproduction rate, I have
zero interest in trying to identify where it started by bisecting.
It'd take at least a day per bisection step, and even that might
not be accurate enough. (But, if anyone has ideas of specific
commits to test, I'd be willing to try a few.)

regards, tom lane

#114Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#107)
Re: WIP: WAL prefetch (another approach)

On Thu, Apr 29, 2021 at 12:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

On 2021-04-28 19:24:53 -0400, Tom Lane wrote:

IOW, we've spent over twice as many CPU cycles shipping data to the
standby as we did in applying the WAL on the standby.

I don't really know how the time calculation works on mac. Is there a
chance it includes time spent doing IO?

For comparison, on a modern Linux system I see numbers like this,
while running that 025_stream_rep_regress.pl test I posted in a nearby
thread:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
tmunro 2150863 22.5 0.0 55348 6752 ? Ss 12:59 0:07
postgres: standby_1: startup recovering 00000001000000020000003C
tmunro 2150867 17.5 0.0 55024 6364 ? Ss 12:59 0:05
postgres: standby_1: walreceiver streaming 2/3C675D80
tmunro 2150868 11.7 0.0 55296 7192 ? Ss 12:59 0:04
postgres: primary: walsender tmunro [local] streaming 2/3C675D80

Those ratios are better but it's still hard work, and perf shows the
CPU time is all in page cache schlep:

22.44% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
20.12% postgres [kernel.kallsyms] [k] __add_to_page_cache_locked
7.30% postgres [kernel.kallsyms] [k] iomap_set_page_dirty

That was with all three patches reverted, so it's nothing new.
Definitely room for improvement... there have been a few discussions
about not using a buffered file for high-frequency data exchange and
relaxing various timing rules, which we should definitely look into,
but I wouldn't be at all surprised if HFS+ was just much worse at
this.

Thinking more about good old HFS+... I guess it's remotely possible
that there might have been coherency bugs in it that could be exposed by
our usage pattern, but then that doesn't fit too well with the clues I
have from light reading: this is a non-SMP system, and it's said that
HFS+ used to serialise pretty much everything on big filesystem locks
anyway.

#115Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#113)
Re: WIP: WAL prefetch (another approach)

On Sun, May 2, 2021 at 3:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That last point means that there was some hard-to-hit problem even
before any of the recent WAL-related changes. However, 323cbe7c7
(Remove read_page callback from XLogReader) increased the failure
rate by at least a factor of 5, and 1d257577e (Optionally prefetch
referenced data) seems to have increased it by another factor of 4.
But it looks like f003d9f87 (Add circular WAL decoding buffer)
didn't materially change the failure rate.

Oh, wow. There are several surprising results there. Thanks for
running those tests for so long so that we could see the rarest
failures.

Even if there are somehow *two* causes of corruption, one preexisting
and one added by the refactoring or decoding patches, I'm struggling
to understand how the chance increases with 1d2575, since that only
adds code that isn't reached when not enabled (though I'm going to
re-review that).

Considering that 323cbe7c7 was supposed to be just refactoring,
and 1d257577e is allegedly disabled-by-default, these are surely
not the results I was expecting to get.

+1

It seems like it's still an open question whether all this is
a real bug, or flaky hardware. I have seen occasional kernel
freezeups (or so I think -- machine stops responding to keyboard
or network input) over the past year or two, so I cannot in good
conscience rule out the flaky-hardware theory. But it doesn't
smell like that kind of problem to me. I think what we're looking
at is a timing-sensitive bug that was there before (maybe long
before?) and these commits happened to make it occur more often
on this particular hardware. This hardware is enough unlike
anything made in the past decade that it's not hard to credit
that it'd show a timing problem that nobody else can reproduce.

Hmm, yeah that does seem plausible. It would be nice to see a report
from any other system though. I'm still trying, and reviewing...

#116Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Thomas Munro (#115)
Re: WIP: WAL prefetch (another approach)

On 5/3/21 7:42 AM, Thomas Munro wrote:

On Sun, May 2, 2021 at 3:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That last point means that there was some hard-to-hit problem even
before any of the recent WAL-related changes. However, 323cbe7c7
(Remove read_page callback from XLogReader) increased the failure
rate by at least a factor of 5, and 1d257577e (Optionally prefetch
referenced data) seems to have increased it by another factor of 4.
But it looks like f003d9f87 (Add circular WAL decoding buffer)
didn't materially change the failure rate.

Oh, wow. There are several surprising results there. Thanks for
running those tests for so long so that we could see the rarest
failures.

Even if there are somehow *two* causes of corruption, one preexisting
and one added by the refactoring or decoding patches, I'm struggling
to understand how the chance increases with 1d2575, since that only
adds code that isn't reached when not enabled (though I'm going to
re-review that).

Considering that 323cbe7c7 was supposed to be just refactoring,
and 1d257577e is allegedly disabled-by-default, these are surely
not the results I was expecting to get.

+1

It seems like it's still an open question whether all this is
a real bug, or flaky hardware. I have seen occasional kernel
freezeups (or so I think -- machine stops responding to keyboard
or network input) over the past year or two, so I cannot in good
conscience rule out the flaky-hardware theory. But it doesn't
smell like that kind of problem to me. I think what we're looking
at is a timing-sensitive bug that was there before (maybe long
before?) and these commits happened to make it occur more often
on this particular hardware. This hardware is enough unlike
anything made in the past decade that it's not hard to credit
that it'd show a timing problem that nobody else can reproduce.

Hmm, yeah that does seem plausible. It would be nice to see a report
from any other system though. I'm still trying, and reviewing...

FWIW I've run the test (make installcheck-parallel in a loop) on four
different machines - two x86_64 ones, and two rpi4. The x86 boxes did
~1000 rounds each (and one of them had 5 local replicas) without any
issue. The rpi4 machines did ~50 rounds each, also without failures.

Obviously, it's possible there's something that neither of those (very
different systems) triggers, but I'd say it might also be a hint that
this really is a hw issue on the old ppc macs. Or maybe something very
specific to that arch.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#117Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#116)
Re: WIP: WAL prefetch (another approach)

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

On 5/3/21 7:42 AM, Thomas Munro wrote:

Hmm, yeah that does seem plausible. It would be nice to see a report
from any other system though. I'm still trying, and reviewing...

FWIW I've run the test (make installcheck-parallel in a loop) on four
different machines - two x86_64 ones, and two rpi4. The x86 boxes did
~1000 rounds each (and one of them had 5 local replicas) without any
issue. The rpi4 machines did ~50 rounds each, also without failures.

Yeah, I have also spent a fair amount of time trying to reproduce it
elsewhere, without success so far. Notably, I've been trying on a
PPC Mac laptop that has a fairly similar CPU to what's in the G4,
though a far slower disk drive. So that seems to exclude theories
based on it being PPC-specific.

I suppose that if we're unable to reproduce it on at least one other box,
we have to write it off as hardware flakiness. I'm not entirely
comfortable with that answer, but I won't push for reversion of the WAL
patches without more evidence that there's a real issue.

regards, tom lane

#118Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#117)
Re: WIP: WAL prefetch (another approach)

I wrote:

I suppose that if we're unable to reproduce it on at least one other box,
we have to write it off as hardware flakiness.

BTW, that conclusion shouldn't distract us from the very real bug
that Andres identified. I was just scraping the buildfarm logs
concerning recent failures, and I found several recent cases
that match the symptom he reported:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2021-04-23%2022%3A27%3A41
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2021-04-21%2005%3A15%3A24
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-04-20%2002%3A03%3A08
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-05-04%2004%3A07%3A41
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-04-20%2021%3A08%3A59

They all show the standby in recovery/019_replslot_limit.pl failing
with symptoms like

2021-05-04 07:42:00.968 UTC [24707406:1] LOG: database system was shut down in recovery at 2021-05-04 07:41:39 UTC
2021-05-04 07:42:00.968 UTC [24707406:2] LOG: entering standby mode
2021-05-04 07:42:01.050 UTC [24707406:3] LOG: redo starts at 0/1C000D8
2021-05-04 07:42:01.079 UTC [24707406:4] LOG: consistent recovery state reached at 0/1D00000
2021-05-04 07:42:01.079 UTC [24707406:5] FATAL: invalid memory alloc request size 1476397045
2021-05-04 07:42:01.080 UTC [13238274:3] LOG: database system is ready to accept read only connections
2021-05-04 07:42:01.082 UTC [13238274:4] LOG: startup process (PID 24707406) exited with exit code 1

(BTW, the behavior seen here where the failure occurs *immediately*
after reporting "consistent recovery state reached" is seen in the
other reports as well, including Andres' version. I wonder if that
means anything.)

regards, tom lane

#119Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#118)
Re: WIP: WAL prefetch (another approach)

Hi,

On 2021-05-04 15:47:41 -0400, Tom Lane wrote:

BTW, that conclusion shouldn't distract us from the very real bug
that Andres identified. I was just scraping the buildfarm logs
concerning recent failures, and I found several recent cases
that match the symptom he reported:
[...]
They all show the standby in recovery/019_replslot_limit.pl failing
with symptoms like

2021-05-04 07:42:00.968 UTC [24707406:1] LOG: database system was shut down in recovery at 2021-05-04 07:41:39 UTC
2021-05-04 07:42:00.968 UTC [24707406:2] LOG: entering standby mode
2021-05-04 07:42:01.050 UTC [24707406:3] LOG: redo starts at 0/1C000D8
2021-05-04 07:42:01.079 UTC [24707406:4] LOG: consistent recovery state reached at 0/1D00000
2021-05-04 07:42:01.079 UTC [24707406:5] FATAL: invalid memory alloc request size 1476397045
2021-05-04 07:42:01.080 UTC [13238274:3] LOG: database system is ready to accept read only connections
2021-05-04 07:42:01.082 UTC [13238274:4] LOG: startup process (PID 24707406) exited with exit code 1

Yea, that's the pre-existing end-of-log-issue that got more likely as
well as more consequential (by accident) in Thomas' patch. It's easy to
reach parity with the state in 13, it's just changing the order in one
place.

But I think we need to do something for all branches here. The bandaid
that was added to allocate_recordbuf() doesn't really seem
sufficient to me. This is

commit 70b4f82a4b5cab5fc12ff876235835053e407155
Author: Michael Paquier <michael@paquier.xyz>
Date: 2018-06-18 10:43:27 +0900

Prevent hard failures of standbys caused by recycled WAL segments

In <= 13 the current state is that we'll allocate effectively random
bytes as long as the random number is below 1GB whenever we reach the
end of the WAL with the record on a page boundary (because there we don't
validate the header before allocating). That allocation is then not freed
for the lifetime of the
xlogreader. And for FRONTEND uses of xlogreader we'll just happily
allocate 4GB. The specific problem here is that we don't validate the
record header before allocating when the record header is split across a
page boundary - without much need as far as I can tell? Until we've read
the entire header, we actually don't need to allocate the record buffer?

This seems like an issue that needs to be fixed to be more robust in
crash recovery scenarios where obviously we could just have failed with
half written records.

But the issue that 70b4f82a4b is trying to address seems bigger to
me. The reason it's so easy to hit the issue is that walreceiver does <
8KB writes into recycled WAL segments *without* zero-filling the tail
end of the page - which will commonly be filled with random older
contents, because we'll use recycled segments. I think that
*drastically* increases the likelihood of finding something that looks
like a valid record header compared to the situation on a primary where
zeroing pages before use makes that pretty unlikely.

(BTW, the behavior seen here where the failure occurs *immediately*
after reporting "consistent recovery state reached" is seen in the
other reports as well, including Andres' version. I wonder if that
means anything.)

That's to be expected, I think. There's not a lot of data that needs to
be replayed, and we'll always reach consistency before the end of the
WAL unless you're dealing with starting from an in-progress base-backup
that hasn't yet finished or such. The test causes replication to fail
shortly after that, so we'll always switch to doing recovery from
pg_wal, which then will hit the end of the WAL, hitting this issue with,
I think, ~25% likelihood (data from recycled WAL segments is probably
*roughly* evenly distributed, and any 4-byte value above 1GB will hit
this error in 14).

Greetings,

Andres Freund

#120Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#117)
Re: WIP: WAL prefetch (another approach)

Hi,

On 2021-05-04 09:46:12 -0400, Tom Lane wrote:

Yeah, I have also spent a fair amount of time trying to reproduce it
elsewhere, without success so far. Notably, I've been trying on a
PPC Mac laptop that has a fairly similar CPU to what's in the G4,
though a far slower disk drive. So that seems to exclude theories
based on it being PPC-specific.

I suppose that if we're unable to reproduce it on at least one other box,
we have to write it off as hardware flakiness.

I wonder if there's a chance what we're seeing is an OS memory ordering
bug, or a race between walreceiver writing data and the startup process
reading it.

When the startup process is able to keep up, there often will be a very
small time delta between the startup process reading a page that the
walreceiver just wrote. And if the currently read page was the tail page
written to by a 'w' message, it'll often be written to again in short
order - potentially while the startup process is reading it.

It'd not terribly surprise me if an old OS version on an old processor
had some issues around that.

Were there any cases of walsender terminating and reconnecting around
the failures?

It looks suspicious that XLogPageRead() does not invalidate the
xlogreader state when retrying. Normally that's xlogreader's
responsibility, but there is that whole XLogReaderValidatePageHeader()
business. But I don't quite see how it'd actually cause problems.

Greetings,

Andres Freund

#121Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#119)
Re: WIP: WAL prefetch (another approach)

Hi,

On 2021-05-04 18:08:35 -0700, Andres Freund wrote:

But the issue that 70b4f82a4b is trying to address seems bigger to
me. The reason it's so easy to hit the issue is that walreceiver does <
8KB writes into recycled WAL segments *without* zero-filling the tail
end of the page - which will commonly be filled with random older
contents, because we'll use recycled segments. I think that
*drastically* increases the likelihood of finding something that looks
like a valid record header compared to the situation on a primary where
zeroing pages before use makes that pretty unlikely.

I've written an experimental patch to deal with this and, as expected,
it does make the end-of-WAL detection a lot more predictable and
reliable. There are only two types of possible errors outside of crashes:
A record length of 0 (the end of WAL is within a page), and the page
header LSN mismatching (the end of WAL is at a page boundary).

This seems like a significant improvement.

However: It's nontrivial to do this nicely and in a backpatchable way in
XLogWalRcvWrite(). Or at least I haven't found a good way:
- We can't extend the input buffer to XLogWalRcvWrite(), it's from
libpq.
- We don't want to copy the entire buffer (commonly 128KiB) to a new
buffer that we then can extend by 0-BLCKSZ of zeroes to cover the
trailing part of the last page.
- In PG13+ we can do this utilizing pg_writev(), adding another IOV
entry covering the trailing space to be padded.
- It's nicer to avoid increasing the number of write() calls, but it's
not as crucial as the earlier points.

I'm also a bit uncomfortable with another aspect, although I can't
really see a problem: When switching to receiving WAL via walreceiver, we
always start at a segment boundary, even if we had received most of that
segment before. Currently that won't end up with any trailing space that
needs to be zeroed, because the server always will send 128KB chunks,
but there's no formal guarantee for that. It seems a bit odd that we
could end up zeroing trailing space that already contains valid data,
just to overwrite it with valid data again. But it ought to always be
fine.

The least offensive way I could come up with is for XLogWalRcvWrite() to
always write partial pages in a separate pg_pwrite(). When writing a
partial page, and the previous write position was not already on that
same page, copy the buffer into a local XLOG_BLCKSZ sized buffer
(although we'll never use more than XLOG_BLCKSZ-1 I think), and (re)zero
out the trailing part. One thing that approach does not yet handle is a
partial write - we'd not notice again that we need to pad the end of the
page.
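For illustration, a deliberately simple sketch of the invariant being aimed for. It does not match the copy-into-a-local-buffer scheme above (it spends an extra pwrite() on the zeroes instead), and it ignores partial writes, but it shows what zero-filling the tail of the last page amounts to:

#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define XLOG_BLCKSZ 8192

/*
 * Write 'len' bytes at 'offset'; if the write ends partway into a WAL page,
 * overwrite the remainder of that page with zeroes so that stale bytes from
 * a recycled segment can never be mistaken for a record header.
 */
static int
write_zero_padded(int fd, const char *buf, size_t len, off_t offset)
{
	static const char zeroes[XLOG_BLCKSZ];	/* zero-initialized */
	off_t		end = offset + (off_t) len;
	size_t		used = (size_t) (end % XLOG_BLCKSZ);

	if (pwrite(fd, buf, len, offset) != (ssize_t) len)
		return -1;

	if (used != 0 &&
		pwrite(fd, zeroes, XLOG_BLCKSZ - used, end) != (ssize_t) (XLOG_BLCKSZ - used))
		return -1;

	return 0;
}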

Does anybody have a better idea?

I really wish we had a version of pg_p{read,write}[v] that internally
handled partial IOs, retrying as long as they see > 0 bytes written.
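Something along those lines would presumably look like the following hedged sketch of a wrapper (names invented) that retries until the full length has been written or a real error occurs:

#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Retry pwrite() until everything is written, giving up only on a real error. */
static int
pwrite_all(int fd, const char *buf, size_t len, off_t offset)
{
	while (len > 0)
	{
		ssize_t		n = pwrite(fd, buf, len, offset);

		if (n <= 0)
		{
			if (n < 0 && errno == EINTR)
				continue;
			return -1;			/* error, or no forward progress */
		}
		buf += n;
		len -= (size_t) n;
		offset += n;
	}
	return 0;
}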

Greetings,

Andres Freund

#122Thomas Munro
thomas.munro@gmail.com
In reply to: Stephen Frost (#97)
Re: WIP: WAL prefetch (another approach)

On Thu, Apr 22, 2021 at 11:22 AM Stephen Frost <sfrost@snowman.net> wrote:

On Wed, Apr 21, 2021 at 19:17 Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Apr 22, 2021 at 8:16 AM Thomas Munro <thomas.munro@gmail.com> wrote:
... Personally I think the right thing to do now is to revert it
and re-propose for 15 early in the cycle, supported with some better
testing infrastructure.

I tend to agree with the idea to revert it, perhaps a +0 on that, but if others argue it should be fixed in-place, I wouldn’t complain about it.

Reverted.

Note: eelpout may return a couple of failures because it's set up to
run with recovery_prefetch=on (now an unknown GUC), and it'll be a few
hours before I can access that machine to adjust that...

I very much encourage the idea of improving testing in this area and would be happy to try and help do so in the 15 cycle.

Cool. I'm going to try out some ideas.

#123Daniel Gustafsson
daniel@yesql.se
In reply to: Thomas Munro (#122)
Re: WIP: WAL prefetch (another approach)

On 10 May 2021, at 06:11, Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Apr 22, 2021 at 11:22 AM Stephen Frost <sfrost@snowman.net> wrote:

I tend to agree with the idea to revert it, perhaps a +0 on that, but if others argue it should be fixed in-place, I wouldn’t complain about it.

Reverted.

Note: eelpout may return a couple of failures because it's set up to
run with recovery_prefetch=on (now an unknown GUC), and it'll be a few
hours before I can access that machine to adjust that...

I very much encourage the idea of improving testing in this area and would be happy to try and help do so in the 15 cycle.

Cool. I'm going to try out some ideas.

Skimming this thread without all the context, it's not entirely clear which
patch the CF entry relates to (I assume it's the one from April 7 based on the
attached mail-id, but there is a revert from May?), and the CF app and CF bot
are also in disagreement about which is the latest one.

Could you post an updated version of the patch which is for review?

--
Daniel Gustafsson https://vmware.com/

#124Thomas Munro
thomas.munro@gmail.com
In reply to: Daniel Gustafsson (#123)
2 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Mon, Nov 15, 2021 at 11:31 PM Daniel Gustafsson <daniel@yesql.se> wrote:

Could you post an updated version of the patch which is for review?

Sorry for taking so long to come back; I learned some new things that
made me want to restructure this code a bit (see below). Here is an
updated pair of patches that I'm currently testing.

Old problems:

1. Last time around, an infinite loop was reported in pg_waldump. I
believe Horiguchi-san has fixed that[1]https://commitfest.postgresql.org/34/2113/, but I'm no longer depending
on that patch. I thought his patch set was a good idea, but it's
complicated and there's enough going on here already... let's consider
that independently.

This version goes back to what I had earlier, though (I hope) it is
better about how "nonblocking" states are communicated. In this
version, XLogPageRead() has a way to give up part way through a record
if it doesn't have enough data and there are queued up records that
could be replayed right now. In that case, we'll go back to the
beginning of the record (and occasionally, back a WAL page) next time
we try. That's the cost of not maintaining intra-record decoding
state.

2. Last time around, we could try to allocate a crazy amount of
memory when reading garbage past the end of the WAL. Fixed, by
validating first, like in master.

New work:

Since last time, I went away and worked on a "real" AIO version of
this feature. That's ongoing experimental work for a future proposal,
but I have a working prototype and I aim to share that soon, when that
branch is rebased to catch up with recent changes. In that version,
the prefetcher starts actual reads into the buffer pool, and recovery
receives already pinned buffers attached to the stream of records it's
replaying.

That inspired a couple of refactoring changes to this non-AIO version,
to minimise the difference and anticipate the future work better:

1. The logic for deciding which block to start prefetching next is
moved into a new callback function in a sort of standard form (this is
approximately how all/most prefetching code looks in the AIO project,
ie sequential scans, bitmap heap scan, etc).

2. The logic for controlling how many IOs are running and deciding
when to call the above is in a separate component. In this non-AIO
version, it works using a simple ring buffer of LSNs to estimate the
number of in-flight I/Os, just like before (see the sketch after this
list). This part would be thrown
away and replaced with the AIO branch's centralised "streaming read"
mechanism which tracks I/O completions based on a stream of completion
events from the kernel (or I/O worker processes).

3. In this version, the prefetcher still doesn't pin buffers, for
simplicity. That work did force me to study places where WAL streams
need prefetching "barriers", though, so in this patch you can
see that it's now a little more careful than it probably needs to be.
(It doesn't really matter much if you call posix_fadvise() on a
non-existent file region, or the wrong file after OID wraparound and
reuse, but it would matter if you actually read it into a buffer, and
if an intervening record might be trying to drop something you have
pinned).
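To make point 2 a bit more concrete, here is a hedged sketch of the kind of LSN ring buffer described there; the names and the capacity are invented rather than taken from the patch. An entry is added when posix_fadvise() is issued for a block referenced at some LSN, entries are retired once replay passes that LSN, and the difference approximates the number of I/Os still in flight.

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define PREFETCH_QUEUE_SIZE 64			/* invented capacity */

typedef struct
{
	XLogRecPtr	lsns[PREFETCH_QUEUE_SIZE];
	int			head;					/* next slot to fill */
	int			tail;					/* oldest outstanding prefetch */
	int			inflight;				/* estimated I/O depth */
} PrefetchQueue;

/* Record that a prefetch was issued for a block referenced at 'lsn'. */
static bool
prefetch_enqueue(PrefetchQueue *q, XLogRecPtr lsn)
{
	if (q->inflight == PREFETCH_QUEUE_SIZE)
		return false;					/* full: let replay catch up first */
	q->lsns[q->head] = lsn;
	q->head = (q->head + 1) % PREFETCH_QUEUE_SIZE;
	q->inflight++;
	return true;
}

/* Retire prefetches once replay has passed the records that triggered them. */
static void
prefetch_retire(PrefetchQueue *q, XLogRecPtr replayed_upto)
{
	while (q->inflight > 0 && q->lsns[q->tail] <= replayed_upto)
	{
		q->tail = (q->tail + 1) % PREFETCH_QUEUE_SIZE;
		q->inflight--;
	}
}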

Some other changes:

1. I dropped the GUC recovery_prefetch_fpw. I think it was a
possibly useful idea but it's a niche concern and not worth worrying
about for now.

2. I simplified the stats. Coming up with a good running average
system seemed like a problem for another day (the numbers before were
hard to interpret). The new stats are super simple counters and
instantaneous values:

postgres=# select * from pg_stat_prefetch_recovery ;
-[ RECORD 1 ]--+------------------------------
stats_reset | 2021-11-10 09:02:08.590217+13
prefetch | 13605674 <- times we called posix_fadvise()
hit | 24185289 <- times we found pages already cached
skip_init | 217215 <- times we did nothing because init, not read
skip_new | 192347 <- times we skipped because relation too small
skip_fpw | 27429 <- times we skipped because fpw, not read
wal_distance | 10648 <- how far ahead in WAL bytes
block_distance | 134 <- how far ahead in block references
io_depth | 50 <- fadvise() calls not yet followed by pread()

I also removed the code to save and restore the stats via the stats
collector, for now. I figured that persistent stats could be a later
feature, perhaps after the shared memory stats stuff?

3. I dropped the code that was caching an SMgrRelation pointer to
avoid smgropen() calls that showed up in some profiles. That probably
lacked invalidation that could be done with some more WAL analysis,
but I decided to leave it out completely for now for simplicity.

4. I dropped the verbose logging. I think it might make sense to
integrate with the new "recovery progress" system, but I think that
should be a separate discussion. If you want to see the counters
after crash recovery finishes, you can look at the stats view.

[1]: https://commitfest.postgresql.org/34/2113/

Attachments:

v19-0001-Add-circular-WAL-decoding-buffer-take-II.patch
From 92478e33be11841d3ea4333d049f2ba193109a64 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 9 Nov 2021 16:33:10 +1300
Subject: [PATCH v19 1/2] Add circular WAL decoding buffer, take II.

Teach xlogreader.c to decode its output into a circular buffer, to
support upcoming optimizations based on looking ahead.

 * XLogReadRecord() works as before, consuming records one by one, and
   allowing them to be examined via the traditional XLogRecGetXXX()
   macros, and the traditional members like xlogreader->ReadRecPtr.

 * An alternative new interface XLogReadAhead()/XLogNextRecord() is
   added that returns pointers to DecodedXLogRecord
   objects so that it's possible to look ahead in the WAL stream.

 * In order to be able to use the new interface effectively, client
   code should provide a page_read() callback that responds to
   a new nonblocking mode by returning XLREAD_WOULDBLOCK to avoid
   waiting.  No such implementation is included in this commit,
   and other code that is unaware of the new mechanism doesn't need
   to change.

The buffer's size can be set by the client of xlogreader.c.  Large
records that don't fit in the circular buffer are called "oversized" and
allocated separately with palloc().

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/access/transam/generic_xlog.c |   6 +-
 src/backend/access/transam/xlog.c         |   6 +-
 src/backend/access/transam/xlogreader.c   | 632 +++++++++++++++++-----
 src/backend/access/transam/xlogutils.c    |   2 +-
 src/backend/replication/logical/decode.c  |   2 +-
 src/bin/pg_rewind/parsexlog.c             |   2 +-
 src/bin/pg_waldump/pg_waldump.c           |  22 +-
 src/include/access/xlogreader.h           | 153 ++++--
 src/tools/pgindent/typedefs.list          |   2 +
 9 files changed, 652 insertions(+), 175 deletions(-)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 63301a1ab1..0e9bcc7159 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -482,10 +482,10 @@ generic_redo(XLogReaderState *record)
 	uint8		block_id;
 
 	/* Protect limited size of buffers[] array */
-	Assert(record->max_block_id < MAX_GENERIC_XLOG_PAGES);
+	Assert(XLogRecMaxBlockId(record) < MAX_GENERIC_XLOG_PAGES);
 
 	/* Iterate over blocks */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		XLogRedoAction action;
 
@@ -525,7 +525,7 @@ generic_redo(XLogReaderState *record)
 	}
 
 	/* Changes are done: unlock and release all buffers */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (BufferIsValid(buffers[block_id]))
 			UnlockReleaseBuffer(buffers[block_id]);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 221e4cb34f..8b138ac680 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1453,7 +1453,7 @@ checkXLogConsistency(XLogReaderState *record)
 
 	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		Buffer		buf;
 		Page		page;
@@ -10582,7 +10582,7 @@ xlog_redo(XLogReaderState *record)
 		 * resource manager needs to generate conflicts, it has to define a
 		 * separate WAL record type and redo routine.
 		 */
-		for (uint8 block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (uint8 block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			Buffer		buffer;
 
@@ -10744,7 +10744,7 @@ xlog_block_info(StringInfo buf, XLogReaderState *record)
 	int			block_id;
 
 	/* decode block references */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f39f8044a9..df942d27dd 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -42,6 +42,7 @@ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
+static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool non_blocking);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -53,6 +54,8 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 /* size of the buffer allocated for error message. */
 #define MAX_ERRORMSG_LEN 1000
 
+#define DEFAULT_DECODE_BUFFER_SIZE 0x10000
+
 /*
  * Construct a string in state->errormsg_buf explaining what's wrong with
  * the current record being read.
@@ -67,6 +70,24 @@ report_invalid_record(XLogReaderState *state, const char *fmt,...)
 	va_start(args, fmt);
 	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
 	va_end(args);
+
+	state->errormsg_deferred = true;
+}
+
+/*
+ * Set the size of the decoding buffer.  A pointer to a caller supplied memory
+ * region may also be passed in, in which case non-oversized records will be
+ * decoded there.
+ */
+void
+XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size)
+{
+	Assert(state->decode_buffer == NULL);
+
+	state->decode_buffer = buffer;
+	state->decode_buffer_size = size;
+	state->decode_buffer_head = buffer;
+	state->decode_buffer_tail = buffer;
 }
 
 /*
@@ -89,8 +110,6 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	/* initialize caller-provided support functions */
 	state->routine = *routine;
 
-	state->max_block_id = -1;
-
 	/*
 	 * Permanently allocate readBuf.  We do it this way, rather than just
 	 * making a static array, for two reasons: (1) no need to waste the
@@ -141,18 +160,11 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 void
 XLogReaderFree(XLogReaderState *state)
 {
-	int			block_id;
-
 	if (state->seg.ws_file != -1)
 		state->routine.segment_close(state);
 
-	for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
-	{
-		if (state->blocks[block_id].data)
-			pfree(state->blocks[block_id].data);
-	}
-	if (state->main_data)
-		pfree(state->main_data);
+	if (state->decode_buffer && state->free_decode_buffer)
+		pfree(state->decode_buffer);
 
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
@@ -248,7 +260,132 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
 
 	/* Begin at the passed-in record pointer. */
 	state->EndRecPtr = RecPtr;
+	state->NextRecPtr = RecPtr;
 	state->ReadRecPtr = InvalidXLogRecPtr;
+	state->DecodeRecPtr = InvalidXLogRecPtr;
+}
+
+/*
+ * See if we can release the last record that was returned by
+ * XLogNextRecord(), to free up space.
+ */
+void
+XLogReleasePreviousRecord(XLogReaderState *state)
+{
+	DecodedXLogRecord *record;
+
+	if (!state->record)
+		return;
+
+	/*
+	 * Remove it from the decoded record queue.  It must be the oldest item
+	 * decoded, decode_queue_tail.
+	 */
+	record = state->record;
+	Assert(record == state->decode_queue_tail);
+	state->record = NULL;
+	state->decode_queue_tail = record->next;
+
+	/* It might also be the newest item decoded, decode_queue_head. */
+	if (state->decode_queue_head == record)
+		state->decode_queue_head = NULL;
+
+	/* Release the space. */
+	if (unlikely(record->oversized))
+	{
+		/* It's not in the the decode buffer, so free it to release space. */
+		pfree(record);
+	}
+	else
+	{
+		/* It must be the tail record in the decode buffer. */
+		Assert(state->decode_buffer_tail == (char *) record);
+
+		/*
+		 * We need to update tail to point to the next record that is in the
+		 * decode buffer, if any, being careful to skip oversized ones
+		 * (they're not in the decode buffer).
+		 */
+		record = record->next;
+		while (unlikely(record && record->oversized))
+			record = record->next;
+
+		if (record)
+		{
+			/* Adjust tail to release space up to the next record. */
+			state->decode_buffer_tail = (char *) record;
+		}
+		else
+		{
+			/*
+			 * Otherwise we might as well just reset head and tail to the
+			 * start of the buffer space, because we're empty.  This means
+			 * we'll keep overwriting the same piece of memory if we're not
+			 * doing any prefetching.
+			 */
+			state->decode_buffer_tail = state->decode_buffer;
+			state->decode_buffer_head = state->decode_buffer;
+		}
+	}
+}
+
+/*
+ * Attempt to read an XLOG record.
+ *
+ * XLogBeginRead() or XLogFindNextRecord() and then XLogReadAhead() must be
+ * called before the first call to XLogNextRecord().  This functions returns
+ * records and errors that were put into an internal queue by XLogReadAhead().
+ *
+ * On success, a record is returned.
+ *
+ * The returned record (or *errormsg) points to an internal buffer that's
+ * valid until the next call to XLogNextRecord.
+ */
+DecodedXLogRecord *
+XLogNextRecord(XLogReaderState *state, char **errormsg)
+{
+	/* Release the last record returned by XLogNextRecord(). */
+	XLogReleasePreviousRecord(state);
+
+	if (state->decode_queue_tail == NULL)
+	{
+		*errormsg = NULL;
+		if (state->errormsg_deferred)
+		{
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			state->errormsg_deferred = false;
+		}
+
+		/*
+		 * state->EndRecPtr is expected to have been set by the last call to
+		 * XLogBeginRead() or XLogNextRecord(), and is the location of the
+		 * error.
+		 */
+
+		return NULL;
+	}
+
+	/*
+	 * Record this as the most recent record returned, so that we'll release
+	 * it next time.  This also exposes it to the traditional
+	 * XLogRecXXX(xlogreader) macros, which work with the decoder rather than
+	 * the record for historical reasons.
+	 */
+	state->record = state->decode_queue_tail;
+
+	/*
+	 * Update the pointers to the beginning and one-past-the-end of this
+	 * record, again for the benefit of historical code that expected the
+	 * decoder to track this rather than accessing these fields of the record
+	 * itself.
+	 */
+	state->ReadRecPtr = state->record->lsn;
+	state->EndRecPtr = state->record->next_lsn;
+
+	*errormsg = NULL;
+
+	return state->record;
 }
 
 /*
@@ -258,17 +395,125 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  * to XLogReadRecord().
  *
  * If the page_read callback fails to read the requested data, NULL is
- * returned.  The callback is expected to have reported the error; errormsg
- * is set to NULL.
+ * returned.  The callback is expected to have reported the error; errormsg is
+ * set to NULL.
  *
  * If the reading fails for some other reason, NULL is also returned, and
  * *errormsg is set to a string with details of the failure.
  *
- * The returned pointer (or *errormsg) points to an internal buffer that's
- * valid until the next call to XLogReadRecord.
+ * On success, a record is returned.
+ *
+ * The returned record (or *errormsg) points to an internal buffer that's
+ * valid until the next call to XLogReadlRecord.
  */
 XLogRecord *
 XLogReadRecord(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *decoded;
+
+	/*
+	 * Release last returned record, if there is one.  We need to do this so
+	 * that we can check for empty decode queue accurately.
+	 */
+	XLogReleasePreviousRecord(state);
+
+	/*
+	 * Call XLogReadAhead() in blocking mode to make sure there is something
+	 * in the queue, though we don't use the result.
+	 */
+	if (!XLogReaderHasQueuedRecordOrError(state))
+		XLogReadAhead(state, false /* nonblocking */ );
+
+	/* Consume the tail record or error. */
+	decoded = XLogNextRecord(state, errormsg);
+	if (decoded)
+	{
+		/*
+		 * XLogReadRecord() returns a pointer to the record's header, not the
+		 * actual decoded record.  The caller will access the decoded record
+		 * through the XLogRecGetXXX() macros, which reach the decoded
+		 * recorded as xlogreader->record.
+		 */
+		Assert(state->record == decoded);
+		return &decoded->header;
+	}
+
+	return NULL;
+}
+
+/*
+ * Allocate space for a decoded record.  The only member of the returned
+ * object that is initialized is the 'oversized' flag, indicating that the
+ * decoded record wouldn't fit in the decode buffer and must eventually be
+ * freed explicitly.
+ *
+ * Return NULL if there is no space in the decode buffer and allow_oversized
+ * is false, or if memory allocation fails for an oversized buffer.
+ */
+static DecodedXLogRecord *
+XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)
+{
+	size_t		required_space = DecodeXLogRecordRequiredSpace(xl_tot_len);
+	DecodedXLogRecord *decoded = NULL;
+
+	/* Allocate a circular decode buffer if we don't have one already. */
+	if (unlikely(state->decode_buffer == NULL))
+	{
+		if (state->decode_buffer_size == 0)
+			state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+		state->decode_buffer = palloc(state->decode_buffer_size);
+		state->decode_buffer_head = state->decode_buffer;
+		state->decode_buffer_tail = state->decode_buffer;
+		state->free_decode_buffer = true;
+	}
+	if (state->decode_buffer_head >= state->decode_buffer_tail)
+	{
+		/* Empty, or head is to the right of tail. */
+		if (state->decode_buffer_head + required_space <=
+			state->decode_buffer + state->decode_buffer_size)
+		{
+			/* There is space between head and end. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+		else if (state->decode_buffer + required_space <
+				 state->decode_buffer_tail)
+		{
+			/* There is space between start and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+	else
+	{
+		/* Head is to the left of tail. */
+		if (state->decode_buffer_head + required_space <
+			state->decode_buffer_tail)
+		{
+			/* There is space between head and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+
+	/* Not enough space in the decode buffer.  Are we allowed to allocate? */
+	if (allow_oversized)
+	{
+		decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
+		if (decoded == NULL)
+			return NULL;
+		decoded->oversized = true;
+		return decoded;
+	}
+
+	return decoded;
+}
+
+static XLogPageReadResult
+XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 {
 	XLogRecPtr	RecPtr;
 	XLogRecord *record;
@@ -281,6 +526,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	bool		assembled;
 	bool		gotheader;
 	int			readOff;
+	DecodedXLogRecord *decoded;
+	char	   *errormsg;		/* not used */
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -290,21 +537,20 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	randAccess = false;
 
 	/* reset error state */
-	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	decoded = NULL;
 
-	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
 
-	RecPtr = state->EndRecPtr;
+	RecPtr = state->NextRecPtr;
 
-	if (state->ReadRecPtr != InvalidXLogRecPtr)
+	if (state->DecodeRecPtr != InvalidXLogRecPtr)
 	{
 		/* read the record after the one we just read */
 
 		/*
-		 * EndRecPtr is pointing to end+1 of the previous WAL record.  If
+		 * NextRecPtr is pointing to end+1 of the previous WAL record.  If
 		 * we're at a page boundary, no more records can fit on the current
 		 * page. We must skip over the page header, but we can't do that until
 		 * we've read in the page, since the header size is variable.
@@ -315,7 +561,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		/*
 		 * Caller supplied a position to start at.
 		 *
-		 * In this case, EndRecPtr should already be pointing to a valid
+		 * In this case, NextRecPtr should already be pointing to a valid
 		 * record starting position.
 		 */
 		Assert(XRecOffIsValid(RecPtr));
@@ -323,6 +569,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	}
 
 restart:
+	state->nonblocking = nonblocking;
 	state->currRecPtr = RecPtr;
 	assembled = false;
 
@@ -336,7 +583,9 @@ restart:
 	 */
 	readOff = ReadPageInternal(state, targetPagePtr,
 							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
-	if (readOff < 0)
+	if (readOff == XLREAD_WOULDBLOCK)
+		return XLREAD_WOULDBLOCK;
+	else if (readOff < 0)
 		goto err;
 
 	/*
@@ -392,7 +641,7 @@ restart:
 	 */
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
-		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
+		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
 								   randAccess))
 			goto err;
 		gotheader = true;
@@ -411,6 +660,31 @@ restart:
 		gotheader = false;
 	}
 
+	/*
+	 * Find space to decode this record.  Don't allow oversized allocation if
+	 * the caller requested nonblocking.  Otherwise, we *have* to try to
+	 * decode the record now because the caller has nothing else to do, so
+	 * allow an oversized record to be palloc'd if that turns out to be
+	 * necessary.
+	 */
+	decoded = XLogReadRecordAlloc(state,
+								  total_len,
+								  !nonblocking /* allow_oversized */ );
+	if (decoded == NULL)
+	{
+		/*
+		 * There is no space in the decode buffer.  The caller should help
+		 * with that problem by consuming some records.
+		 */
+		if (nonblocking)
+			return XLREAD_WOULDBLOCK;
+
+		/* We failed to allocate memory for an oversized record. */
+		report_invalid_record(state,
+							  "out of memory while trying to decode a record of length %u", total_len);
+		goto err;
+	}
+
 	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
 	if (total_len > len)
 	{
@@ -450,7 +724,9 @@ restart:
 									   Min(total_len - gotlen + SizeOfXLogShortPHD,
 										   XLOG_BLCKSZ));
 
-			if (readOff < 0)
+			if (readOff == XLREAD_WOULDBLOCK)
+				return XLREAD_WOULDBLOCK;
+			else if (readOff < 0)
 				goto err;
 
 			Assert(SizeOfXLogShortPHD <= readOff);
@@ -468,7 +744,6 @@ restart:
 			if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD)
 			{
 				state->overwrittenRecPtr = state->currRecPtr;
-				ResetDecoder(state);
 				RecPtr = targetPagePtr;
 				goto restart;
 			}
@@ -523,7 +798,7 @@ restart:
 			if (!gotheader)
 			{
 				record = (XLogRecord *) state->readRecordBuf;
-				if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
+				if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr,
 										   record, randAccess))
 					goto err;
 				gotheader = true;
@@ -537,8 +812,8 @@ restart:
 			goto err;
 
 		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
-		state->ReadRecPtr = RecPtr;
-		state->EndRecPtr = targetPagePtr + pageHeaderSize
+		state->DecodeRecPtr = RecPtr;
+		state->NextRecPtr = targetPagePtr + pageHeaderSize
 			+ MAXALIGN(pageHeader->xlp_rem_len);
 	}
 	else
@@ -546,16 +821,18 @@ restart:
 		/* Wait for the record data to become available */
 		readOff = ReadPageInternal(state, targetPagePtr,
 								   Min(targetRecOff + total_len, XLOG_BLCKSZ));
-		if (readOff < 0)
+		if (readOff == XLREAD_WOULDBLOCK)
+			return XLREAD_WOULDBLOCK;
+		else if (readOff < 0)
 			goto err;
 
 		/* Record does not cross a page boundary */
 		if (!ValidXLogRecord(state, record, RecPtr))
 			goto err;
 
-		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+		state->NextRecPtr = RecPtr + MAXALIGN(total_len);
 
-		state->ReadRecPtr = RecPtr;
+		state->DecodeRecPtr = RecPtr;
 	}
 
 	/*
@@ -565,14 +842,40 @@ restart:
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->segcxt.ws_segsize - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->segcxt.ws_segsize);
+		state->NextRecPtr += state->segcxt.ws_segsize - 1;
+		state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
 	}
 
-	if (DecodeXLogRecord(state, record, errormsg))
-		return record;
+	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
+	{
+		/* Record the location of the next record. */
+		decoded->next_lsn = state->NextRecPtr;
+
+		/*
+		 * If it's in the decode buffer, mark the decode buffer space as
+		 * occupied.
+		 */
+		if (!decoded->oversized)
+		{
+			/* The new decode buffer head must be MAXALIGNed. */
+			Assert(decoded->size == MAXALIGN(decoded->size));
+			if ((char *) decoded == state->decode_buffer)
+				state->decode_buffer_head = state->decode_buffer + decoded->size;
+			else
+				state->decode_buffer_head += decoded->size;
+		}
+
+		/* Insert it into the queue of decoded records. */
+		Assert(state->decode_queue_head != decoded);
+		if (state->decode_queue_head)
+			state->decode_queue_head->next = decoded;
+		state->decode_queue_head = decoded;
+		if (!state->decode_queue_tail)
+			state->decode_queue_tail = decoded;
+		return XLREAD_SUCCESS;
+	}
 	else
-		return NULL;
+		return XLREAD_FAIL;
 
 err:
 	if (assembled)
@@ -590,14 +893,46 @@ err:
 		state->missingContrecPtr = targetPagePtr;
 	}
 
+	if (decoded && decoded->oversized)
+		pfree(decoded);
+
 	/*
 	 * Invalidate the read state. We might read from a different source after
 	 * failure.
 	 */
 	XLogReaderInvalReadState(state);
 
-	if (state->errormsg_buf[0] != '\0')
-		*errormsg = state->errormsg_buf;
+	/*
+	 * If an error was written to errormsg_buf, it'll be returned to the caller
+	 * of XLogReadRecord() after all successfully decoded records from the
+	 * read queue.
+	 */
+
+	return XLREAD_FAIL;
+}
+
+/*
+ * Try to decode the next available record, and return it.  The record will
+ * also be returned by XLogNextRecord(), which must be called to 'consume'
+ * each record.
+ *
+ * If nonblocking is true, may return NULL due to lack of data or WAL decoding
+ * space.
+ */
+DecodedXLogRecord *
+XLogReadAhead(XLogReaderState *state, bool nonblocking)
+{
+	XLogPageReadResult result;
+
+	if (state->errormsg_deferred)
+		return NULL;
+
+	result = XLogDecodeNextRecord(state, nonblocking);
+	if (result == XLREAD_SUCCESS)
+	{
+		Assert(state->decode_queue_head != NULL);
+		return state->decode_queue_head;
+	}
 
 	return NULL;
 }
@@ -649,7 +984,9 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
 										   state->currRecPtr,
 										   state->readBuf);
-		if (readLen < 0)
+		if (readLen == XLREAD_WOULDBLOCK)
+			return XLREAD_WOULDBLOCK;
+		else if (readLen < 0)
 			goto err;
 
 		/* we can be sure to have enough WAL available, we scrolled back */
@@ -667,7 +1004,9 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 									   state->currRecPtr,
 									   state->readBuf);
-	if (readLen < 0)
+	if (readLen == XLREAD_WOULDBLOCK)
+		return XLREAD_WOULDBLOCK;
+	else if (readLen < 0)
 		goto err;
 
 	Assert(readLen <= XLOG_BLCKSZ);
@@ -686,7 +1025,9 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
 										   state->currRecPtr,
 										   state->readBuf);
-		if (readLen < 0)
+		if (readLen == XLREAD_WOULDBLOCK)
+			return XLREAD_WOULDBLOCK;
+		else if (readLen < 0)
 			goto err;
 	}
 
@@ -704,8 +1045,12 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
-	XLogReaderInvalReadState(state);
-	return -1;
+	if (state->errormsg_buf[0] != '\0')
+	{
+		state->errormsg_deferred = true;
+		XLogReaderInvalReadState(state);
+	}
+	return XLREAD_FAIL;
 }
 
 /*
@@ -1062,7 +1407,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	 * or we just jumped over the remaining data of a continuation.
 	 */
 	XLogBeginRead(state, tmpRecPtr);
-	while (XLogReadRecord(state, &errormsg) != NULL)
+	while (XLogReadRecord(state, &errormsg))
 	{
 		/* past the record we've found, break out */
 		if (RecPtr <= state->ReadRecPtr)
@@ -1184,34 +1529,83 @@ WALRead(XLogReaderState *state,
  * ----------------------------------------
  */
 
-/* private function to reset the state between records */
+/*
+ * Private function to reset the state, forgetting all decoded records, if we
+ * are asked to move to a new read position.
+ */
 static void
 ResetDecoder(XLogReaderState *state)
 {
-	int			block_id;
-
-	state->decoded_record = NULL;
+	DecodedXLogRecord *r;
 
-	state->main_data_len = 0;
-
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	/* Reset the decoded record queue, freeing any oversized records. */
+	while ((r = state->decode_queue_tail))
 	{
-		state->blocks[block_id].in_use = false;
-		state->blocks[block_id].has_image = false;
-		state->blocks[block_id].has_data = false;
-		state->blocks[block_id].apply_image = false;
+		state->decode_queue_tail = r->next;
+		if (r->oversized)
+			pfree(r);
 	}
-	state->max_block_id = -1;
+	state->decode_queue_head = NULL;
+	state->decode_queue_tail = NULL;
+	state->record = NULL;
+
+	/* Reset the decode buffer to empty. */
+	state->decode_buffer_head = state->decode_buffer;
+	state->decode_buffer_tail = state->decode_buffer;
+
+	/* Clear error state. */
+	state->errormsg_buf[0] = '\0';
+	state->errormsg_deferred = false;
+}
+
+/*
+ * Compute the maximum possible amount of padding that could be required to
+ * decode a record, given xl_tot_len from the record's header.  This is the
+ * amount of output buffer space that we need to decode a record, though we
+ * might not finish up using it all.
+ *
+ * This computation is pessimistic and assumes the maximum possible number of
+ * blocks, due to lack of better information.
+ */
+size_t
+DecodeXLogRecordRequiredSpace(size_t xl_tot_len)
+{
+	size_t		size = 0;
+
+	/* Account for the fixed size part of the decoded record struct. */
+	size += offsetof(DecodedXLogRecord, blocks[0]);
+	/* Account for the flexible blocks array of maximum possible size. */
+	size += sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1);
+	/* Account for all the raw main and block data. */
+	size += xl_tot_len;
+	/* We might insert padding before main_data. */
+	size += (MAXIMUM_ALIGNOF - 1);
+	/* We might insert padding before each block's data. */
+	size += (MAXIMUM_ALIGNOF - 1) * (XLR_MAX_BLOCK_ID + 1);
+	/* We might insert padding at the end. */
+	size += (MAXIMUM_ALIGNOF - 1);
+
+	return size;
 }
 
 /*
- * Decode the previously read record.
+ * Decode a record.  "decoded" must point to a MAXALIGNed memory area that has
+ * space for at least DecodeXLogRecordRequiredSpace(record) bytes.  On
+ * success, decoded->size contains the actual space occupied by the decoded
+ * record, which may turn out to be less.
+ *
+ * Only the 'oversized' member of *decoded needs to be initialized by the
+ * caller; it will not be modified.  All other members are initialized here.
  *
  * On error, a human-readable error message is returned in *errormsg, and
  * the return value is false.
  */
 bool
-DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
+DecodeXLogRecord(XLogReaderState *state,
+				 DecodedXLogRecord *decoded,
+				 XLogRecord *record,
+				 XLogRecPtr lsn,
+				 char **errormsg)
 {
 	/*
 	 * read next _size bytes from record buffer, but check for overrun first.
@@ -1226,17 +1620,20 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	} while(0)
 
 	char	   *ptr;
+	char	   *out;
 	uint32		remaining;
 	uint32		datatotal;
 	RelFileNode *rnode = NULL;
 	uint8		block_id;
 
-	ResetDecoder(state);
-
-	state->decoded_record = record;
-	state->record_origin = InvalidRepOriginId;
-	state->toplevel_xid = InvalidTransactionId;
-
+	decoded->header = *record;
+	decoded->lsn = lsn;
+	decoded->next = NULL;
+	decoded->record_origin = InvalidRepOriginId;
+	decoded->toplevel_xid = InvalidTransactionId;
+	decoded->main_data = NULL;
+	decoded->main_data_len = 0;
+	decoded->max_block_id = -1;
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
 	remaining = record->xl_tot_len - SizeOfXLogRecord;
@@ -1254,7 +1651,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint8));
 
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
@@ -1265,18 +1662,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			uint32		main_data_len;
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint32));
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
 		}
 		else if (block_id == XLR_BLOCK_ID_ORIGIN)
 		{
-			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
+			COPY_HEADER_FIELD(&decoded->record_origin, sizeof(RepOriginId));
 		}
 		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
 		{
-			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+			COPY_HEADER_FIELD(&decoded->toplevel_xid, sizeof(TransactionId));
 		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
@@ -1284,7 +1681,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			DecodedBkpBlock *blk;
 			uint8		fork_flags;
 
-			if (block_id <= state->max_block_id)
+			/* mark any intervening block IDs as not in use */
+			for (int i = decoded->max_block_id + 1; i < block_id; ++i)
+				decoded->blocks[i].in_use = false;
+
+			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
 									  "out-of-order block_id %u at %X/%X",
@@ -1292,9 +1693,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
 				goto err;
 			}
-			state->max_block_id = block_id;
+			decoded->max_block_id = block_id;
 
-			blk = &state->blocks[block_id];
+			blk = &decoded->blocks[block_id];
 			blk->in_use = true;
 			blk->apply_image = false;
 
@@ -1437,17 +1838,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	/*
 	 * Ok, we've parsed the fragment headers, and verified that the total
 	 * length of the payload in the fragments is equal to the amount of data
-	 * left. Copy the data of each fragment to a separate buffer.
-	 *
-	 * We could just set up pointers into readRecordBuf, but we want to align
-	 * the data for the convenience of the callers. Backup images are not
-	 * copied, however; they don't need alignment.
+	 * left.  Copy the data of each fragment to contiguous space after the
+	 * blocks array, inserting alignment padding before the data fragments so
+	 * they can be cast to struct pointers by REDO routines.
 	 */
+	out = ((char *) decoded) +
+		offsetof(DecodedXLogRecord, blocks) +
+		sizeof(decoded->blocks[0]) * (decoded->max_block_id + 1);
 
 	/* block data first */
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	for (block_id = 0; block_id <= decoded->max_block_id; block_id++)
 	{
-		DecodedBkpBlock *blk = &state->blocks[block_id];
+		DecodedBkpBlock *blk = &decoded->blocks[block_id];
 
 		if (!blk->in_use)
 			continue;
@@ -1456,58 +1858,37 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (blk->has_image)
 		{
-			blk->bkp_image = ptr;
+			/* no need to align image */
+			blk->bkp_image = out;
+			memcpy(out, ptr, blk->bimg_len);
 			ptr += blk->bimg_len;
+			out += blk->bimg_len;
 		}
 		if (blk->has_data)
 		{
-			if (!blk->data || blk->data_len > blk->data_bufsz)
-			{
-				if (blk->data)
-					pfree(blk->data);
-
-				/*
-				 * Force the initial request to be BLCKSZ so that we don't
-				 * waste time with lots of trips through this stanza as a
-				 * result of WAL compression.
-				 */
-				blk->data_bufsz = MAXALIGN(Max(blk->data_len, BLCKSZ));
-				blk->data = palloc(blk->data_bufsz);
-			}
+			out = (char *) MAXALIGN(out);
+			blk->data = out;
 			memcpy(blk->data, ptr, blk->data_len);
 			ptr += blk->data_len;
+			out += blk->data_len;
 		}
 	}
 
 	/* and finally, the main data */
-	if (state->main_data_len > 0)
+	if (decoded->main_data_len > 0)
 	{
-		if (!state->main_data || state->main_data_len > state->main_data_bufsz)
-		{
-			if (state->main_data)
-				pfree(state->main_data);
-
-			/*
-			 * main_data_bufsz must be MAXALIGN'ed.  In many xlog record
-			 * types, we omit trailing struct padding on-disk to save a few
-			 * bytes; but compilers may generate accesses to the xlog struct
-			 * that assume that padding bytes are present.  If the palloc
-			 * request is not large enough to include such padding bytes then
-			 * we'll get valgrind complaints due to otherwise-harmless fetches
-			 * of the padding bytes.
-			 *
-			 * In addition, force the initial request to be reasonably large
-			 * so that we don't waste time with lots of trips through this
-			 * stanza.  BLCKSZ / 2 seems like a good compromise choice.
-			 */
-			state->main_data_bufsz = MAXALIGN(Max(state->main_data_len,
-												  BLCKSZ / 2));
-			state->main_data = palloc(state->main_data_bufsz);
-		}
-		memcpy(state->main_data, ptr, state->main_data_len);
-		ptr += state->main_data_len;
+		out = (char *) MAXALIGN(out);
+		decoded->main_data = out;
+		memcpy(decoded->main_data, ptr, decoded->main_data_len);
+		ptr += decoded->main_data_len;
+		out += decoded->main_data_len;
 	}
 
+	/* Report the actual size we used. */
+	decoded->size = MAXALIGN(out - (char *) decoded);
+	Assert(DecodeXLogRecordRequiredSpace(record->xl_tot_len) >=
+		   decoded->size);
+
 	return true;
 
 shortdata_err:
@@ -1533,10 +1914,11 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	if (rnode)
 		*rnode = bkpb->rnode;
 	if (forknum)
@@ -1556,10 +1938,11 @@ XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return NULL;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 
 	if (!bkpb->has_data)
 	{
@@ -1587,12 +1970,13 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	char	   *ptr;
 	PGAlignedBlock tmp;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
-	if (!record->blocks[block_id].has_image)
+	if (!record->record->blocks[block_id].has_image)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	ptr = bkpb->bkp_image;
 
 	if (BKPIMAGE_COMPRESSED(bkpb->bimg_info))
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b33e0531ed..84109f1e48 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -370,7 +370,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * going to initialize it. And vice versa.
 	 */
 	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
-	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
+	willinit = (record->record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
 	if (!willinit && zeromode)
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index a2b69511b4..8783c88eff 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -123,7 +123,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 	{
 		ReorderBufferAssignChild(ctx->reorder,
 								 txid,
-								 record->decoded_record->xl_xid,
+								 XLogRecGetXid(record),
 								 buf.origptr);
 	}
 
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 436df54120..eb147cfdcc 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -432,7 +432,7 @@ extractPageInfo(XLogReaderState *record)
 				 RmgrNames[rmid], info);
 	}
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 4690e0f515..3fc2acc4c8 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -393,10 +393,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
 	 * add an accessor macro for this.
 	 */
 	*fpi_len = 0;
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (XLogRecHasBlockImage(record, block_id))
-			*fpi_len += record->blocks[block_id].bimg_len;
+			*fpi_len += record->record->blocks[block_id].bimg_len;
 	}
 
 	/*
@@ -494,7 +494,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	if (!config->bkp_details)
 	{
 		/* print block references (short format) */
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -525,7 +525,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	{
 		/* print block references (detailed format) */
 		putchar('\n');
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -538,7 +538,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				   blk);
 			if (XLogRecHasBlockImage(record, block_id))
 			{
-				uint8		bimg_info = record->blocks[block_id].bimg_info;
+				uint8		bimg_info = record->record->blocks[block_id].bimg_info;
 
 				if (BKPIMAGE_COMPRESSED(bimg_info))
 				{
@@ -555,11 +555,11 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 						   "compression saved: %u, method: %s",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length,
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length,
 						   BLCKSZ -
-						   record->blocks[block_id].hole_length -
-						   record->blocks[block_id].bimg_len,
+						   record->record->blocks[block_id].hole_length -
+						   record->record->blocks[block_id].bimg_len,
 						   method);
 				}
 				else
@@ -567,8 +567,8 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 					printf(" (FPW%s); hole: offset: %u, length: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length);
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length);
 				}
 			}
 			putchar('\n');
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index de6fd791fe..372ba1cc45 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -144,6 +144,30 @@ typedef struct
 	uint16		data_bufsz;
 } DecodedBkpBlock;
 
+/*
+ * The decoded contents of a record.  This occupies a contiguous region of
+ * memory, with main_data and blocks[n].data pointing to memory after the
+ * members declared here.
+ */
+typedef struct DecodedXLogRecord
+{
+	/* Private member used for resource management. */
+	size_t		size;			/* total size of decoded record */
+	bool		oversized;		/* outside the regular decode buffer? */
+	struct DecodedXLogRecord *next; /* decoded record queue link */
+
+	/* Public members. */
+	XLogRecPtr	lsn;			/* location */
+	XLogRecPtr	next_lsn;		/* location of next record */
+	XLogRecord	header;			/* header */
+	RepOriginId record_origin;
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+	char	   *main_data;		/* record's main data portion */
+	uint32		main_data_len;	/* main data portion's length */
+	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	DecodedBkpBlock blocks[FLEXIBLE_ARRAY_MEMBER];
+} DecodedXLogRecord;
+
 struct XLogReaderState
 {
 	/*
@@ -171,6 +195,9 @@ struct XLogReaderState
 	 * Start and end point of last record read.  EndRecPtr is also used as the
 	 * position to read next.  Calling XLogBeginRead() sets EndRecPtr to the
 	 * starting position and ReadRecPtr to invalid.
+	 *
+	 * Start and end point of last record returned by XLogReadRecord().  These
+	 * are also available as record->lsn and record->next_lsn.
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
@@ -192,27 +219,43 @@ struct XLogReaderState
 	 * Use XLogRecGet* functions to investigate the record; these fields
 	 * should not be accessed directly.
 	 * ----------------------------------------
+	 * Start and end point of the last record read and decoded by
+	 * XLogDecodeNextRecord().  NextRecPtr is also used as the position to
+	 * decode next.  Calling XLogBeginRead() sets NextRecPtr and EndRecPtr to
+	 * the requested starting position.
 	 */
-	XLogRecord *decoded_record; /* currently decoded record */
+	XLogRecPtr	DecodeRecPtr;	/* start of last record decoded */
+	XLogRecPtr	NextRecPtr;		/* end+1 of last record decoded */
+	XLogRecPtr	PrevRecPtr;		/* start of previous record decoded */
 
-	char	   *main_data;		/* record's main data portion */
-	uint32		main_data_len;	/* main data portion's length */
-	uint32		main_data_bufsz;	/* allocated size of the buffer */
-
-	RepOriginId record_origin;
-
-	TransactionId toplevel_xid; /* XID of top-level transaction */
-
-	/* information about blocks referenced by the record. */
-	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
-
-	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	/* Last record returned by XLogReadRecord(). */
+	DecodedXLogRecord *record;
 
 	/* ----------------------------------------
 	 * private/internal state
 	 * ----------------------------------------
 	 */
 
+	/*
+	 * Buffer for decoded records.  This is a circular buffer, though
+	 * individual records can't be split in the middle, so some space is often
+	 * wasted at the end.  Oversized records that don't fit in this space are
+	 * allocated separately.
+	 */
+	char	   *decode_buffer;
+	size_t		decode_buffer_size;
+	bool		free_decode_buffer; /* need to free? */
+	char	   *decode_buffer_head; /* write head */
+	char	   *decode_buffer_tail; /* read head */
+
+	/*
+	 * Queue of records that have been decoded.  This is a linked list that
+	 * usually consists of consecutive records in decode_buffer, but may also
+	 * contain oversized records allocated with palloc().
+	 */
+	DecodedXLogRecord *decode_queue_head;	/* newest decoded record */
+	DecodedXLogRecord *decode_queue_tail;	/* oldest decoded record */
+
 	/*
 	 * Buffer for currently read page (XLOG_BLCKSZ bytes, valid up to at least
 	 * readLen bytes)
@@ -262,8 +305,25 @@ struct XLogReaderState
 
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
+	bool		errormsg_deferred;
+
+	/*
+	 * Flag to indicate to XLogPageReadCB that it should not block, during
+	 * read ahead.
+	 */
+	bool		nonblocking;
 };
 
+/*
+ * Check whether XLogNextRecord() has more queued records or errors to return.  This
+ * can be used by a read_page callback to decide whether it should block.
+ */
+static inline bool
+XLogReaderHasQueuedRecordOrError(XLogReaderState *state)
+{
+	return (state->decode_queue_head != NULL) || state->errormsg_deferred;
+}
+
 /* Get a new XLogReader */
 extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
 										   const char *waldir,
@@ -274,16 +334,40 @@ extern XLogReaderRoutine *LocalXLogReaderRoutine(void);
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
+/* Optionally provide a circular decoding buffer to allow readahead. */
+extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
+									  void *buffer,
+									  size_t size);
+
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
 
+/* Return values from XLogPageReadCB. */
+typedef enum XLogPageReadResult
+{
+	XLREAD_SUCCESS = 0,			/* record is successfully read */
+	XLREAD_FAIL = -1,			/* failed during reading a record */
+	XLREAD_WOULDBLOCK = -2		/* nonblocking mode only, no data */
+} XLogPageReadResult;
+
 /* Read the next XLog record. Returns NULL on end-of-WAL or failure */
-extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
+extern XLogRecord *XLogReadRecord(XLogReaderState *state,
+								  char **errormsg);
+
+/* Consume the next record or error. */
+extern DecodedXLogRecord *XLogNextRecord(XLogReaderState *state,
 										 char **errormsg);
 
+/* Release the previously returned record, if necessary. */
+extern void XLogReleasePreviousRecord(XLogReaderState *state);
+
+/* Try to read ahead, if there is data and space. */
+extern DecodedXLogRecord *XLogReadAhead(XLogReaderState *state,
+										bool nonblocking);
+
 /* Validate a page */
 extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 										 XLogRecPtr recptr, char *phdr);
@@ -307,25 +391,32 @@ extern bool WALRead(XLogReaderState *state,
 
 /* Functions for decoding an XLogRecord */
 
-extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
+extern size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+extern bool DecodeXLogRecord(XLogReaderState *state,
+							 DecodedXLogRecord *decoded,
+							 XLogRecord *record,
+							 XLogRecPtr lsn,
 							 char **errmsg);
 
-#define XLogRecGetTotalLen(decoder) ((decoder)->decoded_record->xl_tot_len)
-#define XLogRecGetPrev(decoder) ((decoder)->decoded_record->xl_prev)
-#define XLogRecGetInfo(decoder) ((decoder)->decoded_record->xl_info)
-#define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
-#define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
-#define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
-#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
-#define XLogRecGetData(decoder) ((decoder)->main_data)
-#define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
-#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
-#define XLogRecHasBlockRef(decoder, block_id) \
-	((decoder)->blocks[block_id].in_use)
-#define XLogRecHasBlockImage(decoder, block_id) \
-	((decoder)->blocks[block_id].has_image)
-#define XLogRecBlockImageApply(decoder, block_id) \
-	((decoder)->blocks[block_id].apply_image)
+#define XLogRecGetTotalLen(decoder) ((decoder)->record->header.xl_tot_len)
+#define XLogRecGetPrev(decoder) ((decoder)->record->header.xl_prev)
+#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
+#define XLogRecGetRmid(decoder) ((decoder)->record->header.xl_rmid)
+#define XLogRecGetXid(decoder) ((decoder)->record->header.xl_xid)
+#define XLogRecGetOrigin(decoder) ((decoder)->record->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->record->toplevel_xid)
+#define XLogRecGetData(decoder) ((decoder)->record->main_data)
+#define XLogRecGetDataLen(decoder) ((decoder)->record->main_data_len)
+#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->record->max_block_id >= 0)
+#define XLogRecMaxBlockId(decoder) ((decoder)->record->max_block_id)
+#define XLogRecGetBlock(decoder, i) (&(decoder)->record->blocks[(i)])
+#define XLogRecHasBlockRef(decoder, block_id)			\
+	(((decoder)->record->max_block_id >= (block_id)) &&	\
+	 ((decoder)->record->blocks[block_id].in_use))
+#define XLogRecHasBlockImage(decoder, block_id)		\
+	((decoder)->record->blocks[block_id].has_image)
+#define XLogRecBlockImageApply(decoder, block_id)		\
+	((decoder)->record->blocks[block_id].apply_image)
 
 #ifndef FRONTEND
 extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index da6ac8ed83..6d62dafdc2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -533,6 +533,7 @@ DeadLockState
 DeallocateStmt
 DeclareCursorStmt
 DecodedBkpBlock
+DecodedXLogRecord
 DecodingOutputState
 DefElem
 DefElemAction
@@ -2931,6 +2932,7 @@ XLogPageHeader
 XLogPageHeaderData
 XLogPageReadCB
 XLogPageReadPrivate
+XLogPageReadResult
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.33.1
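
[Editorial aside, not part of the patch: to make the new two-step xlogreader
interface in 0001 easier to follow, here is a rough usage sketch.  The
function name readahead_then_replay is made up, the prefetch decision is left
as a comment because the real logic lives in xlogprefetcher.c (0002 below),
and error handling is omitted.  It only illustrates the intended call
pattern: decode ahead without blocking via XLogReadAhead(), then consume
queued records with XLogReadRecord() as before.]

/*
 * Illustrative sketch only -- not part of the patch.  Assumes "reader" was
 * created with XLogReaderAllocate() and positioned with XLogBeginRead().
 */
static void
readahead_then_replay(XLogReaderState *reader)
{
	DecodedXLogRecord *ahead;
	XLogRecord *record;
	char	   *errormsg;

	/* Give the reader a decode buffer large enough for useful readahead. */
	XLogReaderSetDecodeBuffer(reader, NULL, 512 * 1024);

	for (;;)
	{
		/*
		 * Decode ahead without blocking.  Records are queued inside the
		 * reader; the loop stops when data or decode-buffer space runs out.
		 */
		while ((ahead = XLogReadAhead(reader, true)) != NULL)
		{
			for (int block_id = 0; block_id <= ahead->max_block_id; block_id++)
			{
				if (ahead->blocks[block_id].in_use &&
					!ahead->blocks[block_id].has_image)
				{
					/* initiate prefetch of the referenced block here */
				}
			}
		}

		/* Consume the oldest queued record; blocks if the queue is empty. */
		record = XLogReadRecord(reader, &errormsg);
		if (record == NULL)
			break;				/* end of WAL, or an error in errormsg */

		/* ... replay the record ... */
	}
}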

Attachment: v19-0002-Prefetch-referenced-data-in-recovery-take-II.patch (text/x-patch, charset=US-ASCII)
From badd0d6d9e64968df5ead2bf2e0638cb1259ed6d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 9 Nov 2021 16:43:45 +1300
Subject: [PATCH v19 2/2] Prefetch referenced data in recovery, take II.

Introduce a new GUC recovery_prefetch, disabled by default.  When
enabled, look ahead in the WAL and try to initiate asynchronous reading
of referenced data blocks that are not yet cached in our buffer pool.
For now, this is done with posix_fadvise(), which has several caveats.
Better mechanisms will follow in later work on the I/O subsystem.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size limits the maximum distance we are
prepared to read ahead in the WAL to find uncached blocks.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (earlier version)
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (earlier version)
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> (earlier version)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (earlier version)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> (earlier version)
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> (earlier version)
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  61 ++
 doc/src/sgml/monitoring.sgml                  |  77 +-
 doc/src/sgml/wal.sgml                         |  12 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             | 164 ++-
 src/backend/access/transam/xlogprefetcher.c   | 945 ++++++++++++++++++
 src/backend/access/transam/xlogreader.c       |  13 +
 src/backend/access/transam/xlogutils.c        |  27 +-
 src/backend/catalog/system_views.sql          |  13 +
 src/backend/storage/freespace/freespace.c     |   3 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  39 +-
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/access/xlog.h                     |   1 +
 src/include/access/xlogprefetcher.h           |  43 +
 src/include/access/xlogreader.h               |   8 +
 src/include/access/xlogutils.h                |   3 +-
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/utils/guc.h                       |   4 +
 src/test/regress/expected/rules.out           |  10 +
 src/tools/pgindent/typedefs.list              |   7 +
 21 files changed, 1389 insertions(+), 58 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetcher.c
 create mode 100644 src/include/access/xlogprefetcher.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3f806740d5..8829dab03f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3621,6 +3621,67 @@ include_dir 'conf.d'
      </variablelist>
     </sect2>
 
+   <sect2 id="runtime-config-wal-recovery">
+
+    <title>Recovery</title>
+
+     <indexterm>
+      <primary>configuration</primary>
+      <secondary>of recovery</secondary>
+      <tertiary>general settings</tertiary>
+     </indexterm>
+
+    <para>
+     This section describes the settings that apply to recovery in general,
+     affecting crash recovery, streaming replication and archive-based
+     replication.
+    </para>
+
+
+    <variablelist>
+     <varlistentry id="guc-recovery-prefetch" xreflabel="recovery_prefetch">
+      <term><varname>recovery_prefetch</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to try to prefetch blocks that are referenced in the WAL that
+        are not yet in the buffer pool, during recovery.  Prefetching blocks
+        that will soon be needed can reduce I/O wait times in some workloads.
+        See also the <xref linkend="guc-wal-decode-buffer-size"/> and
+        <xref linkend="guc-maintenance-io-concurrency"/> settings, which limit
+        prefetching activity.
+        This setting is disabled by default.
+       </para>
+       <para>
+        This feature currently depends on an effective
+        <function>posix_fadvise</function> function, which some
+        operating systems lack.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-decode-buffer-size" xreflabel="wal_decode_buffer_size">
+      <term><varname>wal_decode_buffer_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_decode_buffer_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        A limit on how far ahead the server can look in the WAL, to find
+        blocks to prefetch.  If this value is specified without units, it is
+        taken as bytes.
+        The default is 512kB.
+       </para>
+      </listitem>
+     </varlistentry>
+
+    </variablelist>
+   </sect2>
+
   <sect2 id="runtime-config-wal-archive-recovery">
 
     <title>Archive Recovery</title>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index af6914872b..e5e7ad46b6 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -328,6 +328,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2950,6 +2957,69 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    copy of the subscribed tables.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_init</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they would be zero-initialized</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they didn't exist yet</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL</entry>
+    </row>
+    <row>
+     <entry><structfield>wal_distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many bytes ahead the prefetcher is looking</entry>
+    </row>
+    <row>
+     <entry><structfield>block_distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many blocks ahead the prefetcher is looking</entry>
+    </row>
+    <row>
+     <entry><structfield>io_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
+   for more information.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="1">
@@ -5082,8 +5152,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
         view, <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
-        to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+        the <structname>pg_stat_archiver</structname> view,
+        <literal>wal</literal> to reset all the counters shown in the
+        <structname>pg_stat_wal</structname> view or
+        <literal>prefetch_recovery</literal> to reset all the counters shown
+        in the <structname>pg_stat_prefetch_recovery</structname> view.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 24e1c89503..f5de473acd 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -803,6 +803,18 @@
    counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
    in <structname>pg_stat_wal</structname>, respectively.
   </para>
+
+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.
+   By default, prefetching in recovery is disabled.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..20e044c7c8 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetcher.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8b138ac680..b9b6f9bac7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -114,6 +115,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
@@ -922,9 +924,12 @@ static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
-static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-										bool fetching_ckpt, XLogRecPtr tliRecPtr,
-										TimeLineID replayTLI);
+static XLogPageReadResult WaitForWALToBecomeAvailable(XLogRecPtr RecPtr,
+													  bool randAccess,
+													  bool fetching_ckpt,
+													  XLogRecPtr tliRecPtr,
+													  TimeLineID replayTLI,
+													  bool nonblocking);
 static void XLogShutdownWalRcv(void);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
@@ -938,12 +943,12 @@ static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
-static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
+static XLogRecord *ReadRecord(XLogPrefetcher *xlogprefetcher,
 							  int emode, bool fetching_ckpt,
 							  TimeLineID replayTLI);
 static void CheckRecoveryConsistency(void);
 static bool PerformRecoveryXLogAction(void);
-static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
+static XLogRecord *ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher,
 										XLogRecPtr RecPtr, int whichChkpt, bool report,
 										TimeLineID replayTLI);
 static bool rescanLatestTimeLine(TimeLineID replayTLI);
@@ -1484,7 +1489,7 @@ checkXLogConsistency(XLogReaderState *record)
 		 * temporary page.
 		 */
 		buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									 RBM_NORMAL_NO_LOG);
+									 RBM_NORMAL_NO_LOG, InvalidBuffer);
 		if (!BufferIsValid(buf))
 			continue;
 
@@ -3788,7 +3793,6 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
 					 xlogfname);
 			set_ps_display(activitymsg);
-
 			if (!RestoreArchivedFile(path, xlogfname,
 									 "RECOVERYXLOG",
 									 wal_segment_size,
@@ -4446,17 +4450,19 @@ CleanupBackupHistory(void)
  * Attempt to read the next XLOG record.
  *
  * Before first call, the reader needs to be positioned to the first record
- * by calling XLogBeginRead().
+ * by calling XLogPrefetcherBeginRead().
  *
  * If no valid record is available, returns NULL, or fails if emode is PANIC.
  * (emode must be either PANIC, LOG). In standby mode, retries until a valid
  * record is available.
  */
 static XLogRecord *
-ReadRecord(XLogReaderState *xlogreader, int emode,
+ReadRecord(XLogPrefetcher *xlogprefetcher,
+		   int emode,
 		   bool fetching_ckpt, TimeLineID replayTLI)
 {
 	XLogRecord *record;
+	XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher);
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 
 	/* Pass through parameters to XLogPageRead */
@@ -4472,7 +4478,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	{
 		char	   *errormsg;
 
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
@@ -6689,6 +6695,7 @@ StartupXLOG(void)
 	bool		backupEndRequired = false;
 	bool		backupFromStandby = false;
 	DBState		dbstate_at_startup;
+	XLogPrefetcher *xlogprefetcher;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
 	bool		promoted = false;
@@ -6868,6 +6875,15 @@ StartupXLOG(void)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Set the WAL decode buffer size.  This limits how far ahead we can read
+	 * in the WAL.
+	 */
+	XLogReaderSetDecodeBuffer(xlogreader, NULL, wal_decode_buffer_size);
+
+	/* Create a WAL prefetcher. */
+	xlogprefetcher = XLogPrefetcherAllocate(xlogreader);
+
 	/*
 	 * Allocate two page buffers dedicated to WAL consistency checks.  We do
 	 * it this way, rather than just making static arrays, for two reasons:
@@ -6896,7 +6912,8 @@ StartupXLOG(void)
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
 		 */
-		record = ReadCheckpointRecord(xlogreader, checkPointLoc, 0, true,
+		record = ReadCheckpointRecord(xlogprefetcher,
+									  checkPointLoc, 0, true,
 									  replayTLI);
 		if (record != NULL)
 		{
@@ -6915,8 +6932,9 @@ StartupXLOG(void)
 			 */
 			if (checkPoint.redo < checkPointLoc)
 			{
-				XLogBeginRead(xlogreader, checkPoint.redo);
-				if (!ReadRecord(xlogreader, LOG, false,
+				XLogPrefetcherBeginRead(xlogprefetcher, checkPoint.redo);
+				if (!ReadRecord(xlogprefetcher,
+								LOG, false,
 								checkPoint.ThisTimeLineID))
 					ereport(FATAL,
 							(errmsg("could not find redo location referenced by checkpoint record"),
@@ -7034,7 +7052,8 @@ StartupXLOG(void)
 		checkPointLoc = ControlFile->checkPoint;
 		RedoStartLSN = ControlFile->checkPointCopy.redo;
 		replayTLI = ControlFile->checkPointCopy.ThisTimeLineID;
-		record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, true,
+		record = ReadCheckpointRecord(xlogprefetcher,
+									  checkPointLoc, 1, true,
 									  replayTLI);
 		if (record != NULL)
 		{
@@ -7534,13 +7553,17 @@ StartupXLOG(void)
 		if (checkPoint.redo < RecPtr)
 		{
 			/* back up to find the record */
-			XLogBeginRead(xlogreader, checkPoint.redo);
-			record = ReadRecord(xlogreader, PANIC, false, replayTLI);
+			XLogPrefetcherBeginRead(xlogprefetcher, checkPoint.redo);
+			record = ReadRecord(xlogprefetcher,
+								PANIC, false,
+								replayTLI);
 		}
 		else
 		{
 			/* just have to read next record after CheckPoint */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher,
+								LOG, false,
+								replayTLI);
 		}
 
 		if (record != NULL)
@@ -7767,6 +7790,9 @@ StartupXLOG(void)
 					 */
 					if (AllowCascadeReplication())
 						WalSndWakeup();
+
+					/* Reset the prefetcher. */
+					XLogPrefetchReconfigure();
 				}
 
 				/* Exit loop if we reached inclusive recovery target */
@@ -7777,7 +7803,9 @@ StartupXLOG(void)
 				}
 
 				/* Else, try to fetch the next WAL record */
-				record = ReadRecord(xlogreader, LOG, false, replayTLI);
+				record = ReadRecord(xlogprefetcher,
+									LOG, false,
+									replayTLI);
 			} while (record != NULL);
 
 			/*
@@ -7901,7 +7929,8 @@ StartupXLOG(void)
 	 * what we consider the valid portion of WAL.
 	 */
 	XLogBeginRead(xlogreader, LastRec);
-	record = ReadRecord(xlogreader, PANIC, false, replayTLI);
+	record = ReadRecord(xlogprefetcher,
+						PANIC, false, replayTLI);
 	EndOfLog = EndRecPtr;
 
 	/*
@@ -8137,6 +8166,8 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	XLogPrefetcherFree(xlogprefetcher);
+
 	/* Enable WAL writes for this backend only. */
 	LocalSetXLogInsertAllowed();
 
@@ -8564,7 +8595,8 @@ LocalSetXLogInsertAllowed(void)
  * 1 for "primary", 0 for "other" (backup_label)
  */
 static XLogRecord *
-ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher,
+					 XLogRecPtr RecPtr,
 					 int whichChkpt, bool report, TimeLineID replayTLI)
 {
 	XLogRecord *record;
@@ -8589,8 +8621,9 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 		return NULL;
 	}
 
-	XLogBeginRead(xlogreader, RecPtr);
-	record = ReadRecord(xlogreader, LOG, true, replayTLI);
+	XLogPrefetcherBeginRead(xlogprefetcher, RecPtr);
+	record = ReadRecord(xlogprefetcher,
+						LOG, true, replayTLI);
 
 	if (record == NULL)
 	{
@@ -12420,6 +12453,9 @@ CancelBackup(void)
  * and call XLogPageRead() again with the same arguments. This lets
  * XLogPageRead() to try fetching the record from another source, or to
  * sleep and retry.
+ *
+ * While prefetching, xlogreader->nonblocking may be set.  In that case,
+ * return XLREAD_WOULDBLOCK if we'd otherwise have to wait.
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
@@ -12469,19 +12505,30 @@ retry:
 		(readSource == XLOG_FROM_STREAM &&
 		 flushedUpto < targetPagePtr + reqLen))
 	{
-		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
-										 private->randAccess,
-										 private->fetching_ckpt,
-										 targetRecPtr,
-										 private->replayTLI))
+		if (readFile >= 0 &&
+			xlogreader->nonblocking &&
+			readSource == XLOG_FROM_STREAM &&
+			flushedUpto < targetPagePtr + reqLen)
+			return XLREAD_WOULDBLOCK;
+
+		switch (WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
+											private->randAccess,
+											private->fetching_ckpt,
+											targetRecPtr,
+											private->replayTLI,
+											xlogreader->nonblocking))
 		{
-			if (readFile >= 0)
-				close(readFile);
-			readFile = -1;
-			readLen = 0;
-			readSource = XLOG_FROM_ANY;
-
-			return -1;
+			case XLREAD_WOULDBLOCK:
+				return XLREAD_WOULDBLOCK;
+			case XLREAD_FAIL:
+				if (readFile >= 0)
+					close(readFile);
+				readFile = -1;
+				readLen = 0;
+				readSource = XLOG_FROM_ANY;
+				return XLREAD_FAIL;
+			case XLREAD_SUCCESS:
+				break;
 		}
 	}
 
@@ -12606,7 +12653,7 @@ next_record_is_invalid:
 	if (StandbyMode)
 		goto retry;
 	else
-		return -1;
+		return XLREAD_FAIL;
 }
 
 /*
@@ -12634,11 +12681,15 @@ next_record_is_invalid:
  * containing it (if not open already), and returns true. When end of standby
  * mode is triggered by the user, and there is no more WAL available, returns
  * false.
+ *
+ * If nonblocking is true, then give up immediately if we can't satisfy the
+ * request, returning XLREAD_WOULDBLOCK instead of waiting.
  */
-static bool
+static XLogPageReadResult
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							bool fetching_ckpt, XLogRecPtr tliRecPtr,
-							TimeLineID replayTLI)
+							TimeLineID replayTLI,
+							bool nonblocking)
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
@@ -12692,6 +12743,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 */
 		if (lastSourceFailed)
 		{
+			/*
+			 * Don't allow any retry loops to occur during nonblocking
+			 * readahead.  Let the caller process everything that has been
+			 * decoded already first.
+			 */
+			if (nonblocking)
+				return XLREAD_WOULDBLOCK;
+
 			switch (currentSource)
 			{
 				case XLOG_FROM_ARCHIVE:
@@ -12706,7 +12765,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (StandbyMode && CheckForStandbyTrigger())
 					{
 						XLogShutdownWalRcv();
-						return false;
+						return XLREAD_FAIL;
 					}
 
 					/*
@@ -12714,7 +12773,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * and pg_wal.
 					 */
 					if (!StandbyMode)
-						return false;
+						return XLREAD_FAIL;
 
 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
@@ -12855,7 +12914,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 											  currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
 											  currentSource);
 				if (readFile >= 0)
-					return true;	/* success! */
+					return XLREAD_SUCCESS;	/* success! */
 
 				/*
 				 * Nope, not found in archive or pg_wal.
@@ -12979,6 +13038,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						else
 							havedata = false;
 					}
+
 					if (havedata)
 					{
 						/*
@@ -13012,11 +13072,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							/* just make sure source info is correct... */
 							readSource = XLOG_FROM_STREAM;
 							XLogReceiptSource = XLOG_FROM_STREAM;
-							return true;
+							return XLREAD_SUCCESS;
 						}
 						break;
 					}
 
+					/* In nonblocking mode, return rather than sleeping. */
+					if (nonblocking)
+						return XLREAD_WOULDBLOCK;
+
 					/*
 					 * Data not here yet. Check for trigger, then wait for
 					 * walreceiver to wake us up when new WAL arrives.
@@ -13024,13 +13088,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (CheckForStandbyTrigger())
 					{
 						/*
-						 * Note that we don't "return false" immediately here.
-						 * After being triggered, we still want to replay all
-						 * the WAL that was already streamed. It's in pg_wal
-						 * now, so we just treat this as a failure, and the
-						 * state machine will move on to replay the streamed
-						 * WAL from pg_wal, and then recheck the trigger and
-						 * exit replay.
+						 * Note that we don't return XLREAD_FAIL immediately
+						 * here. After being triggered, we still want to
+						 * replay all the WAL that was already streamed. It's
+						 * in pg_wal now, so we just treat this as a failure,
+						 * and the state machine will move on to replay the
+						 * streamed WAL from pg_wal, and then recheck the
+						 * trigger and exit replay.
 						 */
 						lastSourceFailed = true;
 						break;
@@ -13072,7 +13136,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 */
 		if (((volatile XLogCtlData *) XLogCtl)->recoveryPauseState !=
 			RECOVERY_NOT_PAUSED)
+		{
 			recoveryPausesHere(false);
+		}
 
 		/*
 		 * This possibly-long loop needs to handle interrupts of startup
@@ -13081,7 +13147,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		HandleStartupProcInterrupts();
 	}
 
-	return false;				/* not reached */
+	return XLREAD_FAIL;			/* not reached */
 }
 
 /*
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
new file mode 100644
index 0000000000..61b50fe400
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -0,0 +1,945 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetcher.c
+ *
+ * This module provides a drop-in replacement for an XLogReader that tries to
+ * minimize I/O stalls by looking up future blocks in the buffer cache, and
+ * initiating I/Os that might complete before the caller eventually needs the
+ * data.  XLogReadBufferForRedo() cooperates by using information stored in
+ * the decoded record to find buffers efficiently.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetcher.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_class.h"
+#include "catalog/storage_xlog.h"
+#include "commands/dbcommands_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/* Every time we process this much WAL, we update dynamic values in shm. */
+#define XLOGPREFETCHER_STATS_SHM_DISTANCE BLCKSZ
+
+/* GUCs */
+bool		recovery_prefetch = false;
+
+static int	XLogPrefetchReconfigureCount = 0;
+
+/*
+ * Enum used to report whether an IO should be started.
+ */
+typedef enum
+{
+	LRQ_NEXT_NO_IO,
+	LRQ_NEXT_IO,
+	LRQ_NEXT_AGAIN
+} LsnReadQueueNextStatus;
+
+/*
+ * Type of callback that can decide which block to prefetch next.  For now
+ * there is only one.
+ */
+typedef LsnReadQueueNextStatus (*LsnReadQueueNextFun) (uintptr_t lrq_private,
+													   XLogRecPtr *lsn);
+
+/*
+ * A simple circular queue of LSNs, used to control the number of
+ * (potentially) inflight IOs.  This stands in for a later more general IO
+ * control mechanism, which is why it has the apparently unnecessary
+ * indirection through a function pointer.
+ */
+typedef struct LsnReadQueue
+{
+	LsnReadQueueNextFun next;
+	uintptr_t	lrq_private;
+	uint32		max_inflight;
+	uint32		inflight;
+	uint32		completed;
+	uint32		head;
+	uint32		tail;
+	uint32		size;
+	struct
+	{
+		bool		io;
+		XLogRecPtr	lsn;
+	}			queue[FLEXIBLE_ARRAY_MEMBER];
+} LsnReadQueue;
+
+/*
+ * A prefetcher.  This is a mechanism that wraps an XLogReader, prefetching
+ * blocks that will soon be referenced, to try to avoid IO stalls.
+ */
+struct XLogPrefetcher
+{
+	/* WAL reader and current reading state. */
+	XLogReaderState *reader;
+	DecodedXLogRecord *record;
+	int			next_block_id;
+
+	/* When to publish stats. */
+	XLogRecPtr	next_stats_shm_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB	   *filter_table;
+	dlist_head	filter_queue;
+
+	/* IO depth manager. */
+	LsnReadQueue *streaming_read;
+
+	XLogRecPtr	begin_ptr;
+
+	int			reconfigure_count;
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that (we assume) have already been dropped, or will be created by bulk WAL
+ * operations.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode rnode;
+	XLogRecPtr	filter_until_replayed;
+	BlockNumber filter_from_block;
+	dlist_node	link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time;	/* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 hit;		/* Blocks already in cache. */
+	pg_atomic_uint64 skip_init; /* Zero-inited blocks skipped. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32		reset_handled;
+
+	/* Dynamic values */
+	int			wal_distance;	/* Number of WAL bytes ahead. */
+	int			block_distance; /* Number of block references ahead. */
+	int			io_depth;		/* Number of I/Os in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static LsnReadQueueNextStatus XLogPrefetcherNextBlock(uintptr_t pgsr_private,
+													  XLogRecPtr *lsn);
+
+static XLogPrefetchStats *SharedStats;
+
+static inline LsnReadQueue *
+lrq_alloc(uint32 max_distance,
+		  uint32 max_inflight,
+		  uintptr_t lrq_private,
+		  LsnReadQueueNextFun next)
+{
+	LsnReadQueue *lrq;
+	uint32		size;
+
+	Assert(max_distance >= max_inflight);
+
+	size = max_distance + 1;	/* full ring buffer has a gap */
+	lrq = palloc(offsetof(LsnReadQueue, queue) + sizeof(lrq->queue[0]) * size);
+	lrq->lrq_private = lrq_private;
+	lrq->max_inflight = max_inflight;
+	lrq->size = size;
+	lrq->next = next;
+	lrq->head = 0;
+	lrq->tail = 0;
+	lrq->inflight = 0;
+	lrq->completed = 0;
+
+	return lrq;
+}
+
+static inline void
+lrq_free(LsnReadQueue *lrq)
+{
+	pfree(lrq);
+}
+
+static inline uint32
+lrq_inflight(LsnReadQueue *lrq)
+{
+	return lrq->inflight;
+}
+
+static inline uint32
+lrq_completed(LsnReadQueue *lrq)
+{
+	return lrq->completed;
+}
+
+static inline void
+lrq_prefetch(LsnReadQueue *lrq)
+{
+	/* Try to start as many IOs as we can within our limits. */
+	while (lrq->inflight < lrq->max_inflight &&
+		   lrq->inflight + lrq->completed < lrq->size - 1)
+	{
+		Assert(((lrq->head + 1) % lrq->size) != lrq->tail);
+		switch (lrq->next(lrq->lrq_private, &lrq->queue[lrq->head].lsn))
+		{
+			case LRQ_NEXT_AGAIN:
+				return;
+			case LRQ_NEXT_IO:
+				lrq->queue[lrq->head].io = true;
+				lrq->inflight++;
+				break;
+			case LRQ_NEXT_NO_IO:
+				lrq->queue[lrq->head].io = false;
+				lrq->completed++;
+				break;
+		}
+		lrq->head++;
+		if (lrq->head == lrq->size)
+			lrq->head = 0;
+	}
+}
+
+static inline void
+lrq_complete_lsn(LsnReadQueue *lrq, XLogRecPtr lsn)
+{
+	/*
+	 * We know that LSNs before 'lsn' have been replayed, so we can now assume
+	 * that any IOs that were started before then have finished.
+	 */
+	while (lrq->tail != lrq->head &&
+		   lrq->queue[lrq->tail].lsn < lsn)
+	{
+		if (lrq->queue[lrq->tail].io)
+			lrq->inflight--;
+		else
+			lrq->completed--;
+		lrq->tail++;
+		if (lrq->tail == lrq->size)
+			lrq->tail = 0;
+	}
+	if (recovery_prefetch)
+		lrq_prefetch(lrq);
+}
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&SharedStats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&SharedStats->prefetch, 0);
+	pg_atomic_write_u64(&SharedStats->hit, 0);
+	pg_atomic_write_u64(&SharedStats->skip_init, 0);
+	pg_atomic_write_u64(&SharedStats->skip_new, 0);
+	pg_atomic_write_u64(&SharedStats->skip_fpw, 0);
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	SharedStats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+
+	if (!found)
+	{
+		pg_atomic_init_u32(&SharedStats->reset_request, 0);
+		SharedStats->reset_handled = 0;
+
+		pg_atomic_init_u64(&SharedStats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&SharedStats->prefetch, 0);
+		pg_atomic_init_u64(&SharedStats->hit, 0);
+		pg_atomic_init_u64(&SharedStats->skip_init, 0);
+		pg_atomic_init_u64(&SharedStats->skip_new, 0);
+		pg_atomic_init_u64(&SharedStats->skip_fpw, 0);
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&SharedStats->reset_request, 1);
+}
+
+/*
+ * Increment a counter in shared memory.  This is equivalent to (*counter)++ on
+ * a plain uint64 without any memory barrier or locking, except on platforms
+ * where readers can't read a uint64 without possibly observing a torn value.
+ */
+static inline void
+XLogPrefetchIncrement(pg_atomic_uint64 *counter)
+{
+	Assert(AmStartupProcess() || !IsUnderPostmaster);
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL records.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogReaderState *reader)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	prefetcher = palloc0(sizeof(XLogPrefetcher));
+
+	prefetcher->reader = reader;
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	SharedStats->wal_distance = 0;
+	SharedStats->block_distance = 0;
+	SharedStats->io_depth = 0;
+
+	/* First usage will cause streaming_read to be allocated. */
+	prefetcher->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	lrq_free(prefetcher->streaming_read);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Provide access to the reader.
+ */
+XLogReaderState *
+XLogPrefetcherReader(XLogPrefetcher *prefetcher)
+{
+	return prefetcher->reader;
+}
+
+static void
+XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher, XLogRecPtr lsn)
+{
+	uint32		io_depth;
+	uint32		completed;
+	uint32		reset_request;
+	int64		wal_distance;
+
+
+	/* How far ahead of replay are we now? */
+	if (prefetcher->record)
+		wal_distance = prefetcher->record->lsn - prefetcher->reader->record->lsn;
+	else
+		wal_distance = 0;
+
+	/* How many IOs are currently in flight and completed? */
+	io_depth = lrq_inflight(prefetcher->streaming_read);
+	completed = lrq_completed(prefetcher->streaming_read);
+
+	/* Update the instantaneous stats visible in pg_stat_prefetch_recovery. */
+	SharedStats->io_depth = io_depth;
+	SharedStats->block_distance = io_depth + completed;
+	SharedStats->wal_distance = wal_distance;
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&SharedStats->reset_request);
+	if (reset_request != SharedStats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		SharedStats->reset_handled = reset_request;
+	}
+
+	prefetcher->next_stats_shm_lsn = lsn + XLOGPREFETCHER_STATS_SHM_DISTANCE;
+}
+
+/*
+ * A callback that reads ahead in the WAL and tries to initiate one IO.
+ */
+static LsnReadQueueNextStatus
+XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
+{
+	XLogPrefetcher *prefetcher = (XLogPrefetcher *) pgsr_private;
+	XLogReaderState *reader = prefetcher->reader;
+	XLogRecPtr	replaying_lsn = reader->ReadRecPtr;
+
+	/*
+	 * We keep track of the record and block we're up to between calls with
+	 * prefetcher->record and prefetcher->next_block_id.
+	 */
+	for (;;)
+	{
+		DecodedXLogRecord *record;
+
+		/* Try to read a new future record, if we don't already have one. */
+		if (prefetcher->record == NULL)
+		{
+			bool		nonblocking;
+
+			/*
+			 * If there are already records or an error queued up that could
+			 * be replayed, we don't want to block here.  Otherwise, it's OK
+			 * to block waiting for more data: presumably the caller has
+			 * nothing else to do.
+			 */
+			nonblocking = XLogReaderHasQueuedRecordOrError(reader);
+
+			record = XLogReadAhead(prefetcher->reader, nonblocking);
+			if (record == NULL)
+			{
+				/*
+				 * We can't read any more, due to an error or lack of data in
+				 * nonblocking mode.
+				 */
+				return LRQ_NEXT_AGAIN;
+			}
+
+			/*
+			 * If prefetching is disabled, we don't need to analyze the record
+			 * or issue any prefetches.  We just need to cause one record to
+			 * be decoded.
+			 */
+			if (!recovery_prefetch)
+			{
+				*lsn = InvalidXLogRecPtr;
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* We have a new record to process. */
+			prefetcher->record = record;
+			prefetcher->next_block_id = 0;
+		}
+		else
+		{
+			/* Continue to process from last call, or last loop. */
+			record = prefetcher->record;
+		}
+
+		/*
+		 * Check for operations that change the identity of buffer tags. These
+		 * must be treated as barriers that prevent prefetching for certain
+		 * ranges of buffer tags, so that we can't be confused by OID
+		 * wraparound (and later we might pin buffers).
+		 *
+		 * XXX Perhaps this information could be derived automatically if we
+		 * had some standardized header flags and fields for these operations,
+		 * instead of special logic.
+		 *
+		 * XXX Are there other operations that need this treatment?
+		 */
+		if (replaying_lsn < record->lsn)
+		{
+			uint8		rmid = record->header.xl_rmid;
+			uint8		record_type = record->header.xl_info & ~XLR_INFO_MASK;
+
+			if (rmid == RM_DBASE_ID)
+			{
+				if (record_type == XLOG_DBASE_CREATE)
+				{
+					xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *)
+					record->main_data;
+					RelFileNode rnode = {InvalidOid, xlrec->db_id, InvalidOid};
+
+					/*
+					 * Don't try to prefetch anything in this database until
+					 * it has been created, or we might confuse blocks on OID
+					 * wraparound.  (We could use XLOG_DBASE_DROP instead, but
+					 * there shouldn't be any reference to blocks in a
+					 * database between DROP and CREATE for the same OID, and
+					 * doing it on CREATE avoids the more expensive ENOENT
+					 * handling that would be needed if we didn't treat CREATE
+					 * as a barrier).
+					 */
+					XLogPrefetcherAddFilter(prefetcher, rnode, 0, record->lsn);
+				}
+			}
+			else if (rmid == RM_SMGR_ID)
+			{
+				if (record_type == XLOG_SMGR_CREATE)
+				{
+					xl_smgr_create *xlrec = (xl_smgr_create *)
+					record->main_data;
+
+					/*
+					 * Don't prefetch anything for this whole relation until
+					 * it has been created, or we might confuse blocks on OID
+					 * wraparound.
+					 */
+					XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+											record->lsn);
+				}
+				else if (record_type == XLOG_SMGR_TRUNCATE)
+				{
+					xl_smgr_truncate *xlrec = (xl_smgr_truncate *)
+					record->main_data;
+
+					/*
+					 * Don't prefetch anything in the truncated range until
+					 * the truncation has been performed.
+					 */
+					XLogPrefetcherAddFilter(prefetcher, xlrec->rnode,
+											xlrec->blkno,
+											record->lsn);
+				}
+			}
+		}
+
+		/* Scan the block references, starting where we left off last time. */
+		while (prefetcher->next_block_id <= record->max_block_id)
+		{
+			int			block_id = prefetcher->next_block_id++;
+			DecodedBkpBlock *block = &record->blocks[block_id];
+			SMgrRelation reln;
+			PrefetchBufferResult result;
+
+			if (!block->in_use)
+				continue;
+
+			Assert(!BufferIsValid(block->prefetch_buffer));
+
+			/*
+			 * Record the LSN of this record.  When it's replayed,
+			 * LsnReadQueue will consider any IOs submitted for earlier LSNs
+			 * to be finished.
+			 */
+			*lsn = record->lsn;
+
+			/* We don't try to prefetch anything but the main fork for now. */
+			if (block->forknum != MAIN_FORKNUM)
+			{
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/*
+			 * If there is a full page image attached, we won't be reading the
+			 * page, so don't bother trying to prefetch it.
+			 */
+			if (block->has_image)
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_fpw);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* There is no point in reading a page that will be zeroed. */
+			if (block->flags & BKPBLOCK_WILL_INIT)
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_init);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* Should we skip prefetching this block due to a filter? */
+			if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/*
+			 * We could try to have a fast path for repeated references to the
+			 * same relation (with some scheme to handle invalidations
+			 * safely), but for now we'll call smgropen() every time.
+			 */
+			reln = smgropen(block->rnode, InvalidBackendId);
+
+			/*
+			 * If the block is past the end of the relation, filter out
+			 * further accesses until this record is replayed.
+			 */
+			if (block->blkno >= smgrnblocks(reln, block->forknum))
+			{
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+										record->lsn);
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* Try to initiate prefetching. */
+			result = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+			if (BufferIsValid(result.recent_buffer))
+			{
+				/* Cache hit, nothing to do. */
+				XLogPrefetchIncrement(&SharedStats->hit);
+				block->prefetch_buffer = result.recent_buffer;
+				return LRQ_NEXT_NO_IO;
+			}
+			else if (result.initiated_io)
+			{
+				/* Cache miss, I/O (presumably) started. */
+				XLogPrefetchIncrement(&SharedStats->prefetch);
+				block->prefetch_buffer = InvalidBuffer;
+				return LRQ_NEXT_IO;
+			}
+			else
+			{
+				/*
+				 * Neither cached nor initiated.  The underlying segment file
+				 * doesn't exist (ENOENT).
+				 *
+				 * It might be missing because it was unlinked, we crashed,
+				 * and now we're replaying WAL.  Recovery will either correct
+				 * this problem or complain if something is wrong.
+				 */
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+										record->lsn);
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+		}
+
+		/*
+		 * Several callsites need to be able to read exactly one record
+		 * without any internal readahead.  Examples: xlog.c reading
+		 * checkpoint records with emode set to PANIC, which might otherwise
+		 * cause XLogPageRead() to panic on some future page, and xlog.c
+		 * determining where to start writing WAL next, which depends on the
+		 * contents of the reader's internal buffer after reading one record.
+		 * Therefore, don't even think about prefetching until the first
+		 * record after XLogPrefetcherBeginRead() has been consumed.
+		 */
+#if 1
+		if (prefetcher->reader->decode_queue_tail &&
+			prefetcher->reader->decode_queue_tail->lsn == prefetcher->begin_ptr)
+			return LRQ_NEXT_AGAIN;
+#endif
+
+		/* Advance to the next record. */
+		prefetcher->record = NULL;
+	}
+	pg_unreachable();
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&SharedStats->reset_request) != SharedStats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&SharedStats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_init));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_new));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_fpw));
+	values[6] = Int32GetDatum(SharedStats->wal_distance);
+	values[7] = Int32GetDatum(SharedStats->block_distance);
+	values[8] = Int32GetDatum(SharedStats->io_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the lower of the block numbers
+		 * there because we don't want to have to track individual blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+		filter->filter_from_block = Min(filter->filter_from_block, blockno);
+	}
+}
+
+/*
+ * Have we replayed any records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can stop filtering out accesses to a given
+ * relfilenode.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of
+	 * the time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter;
+
+		/* See if the block range is filtered. */
+		filter = hash_search(prefetcher->filter_table, &rnode, HASH_FIND, NULL);
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+
+		/* See if the whole database is filtered. */
+		rnode.relNode = InvalidOid;
+		filter = hash_search(prefetcher->filter_table, &rnode, HASH_FIND, NULL);
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * A wrapper for XLogBeginRead() that also resets the prefetcher.
+ */
+void
+XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
+						XLogRecPtr recPtr)
+{
+	/* This will forget about any in-flight IO. */
+	prefetcher->reconfigure_count--;
+
+	/* Book-keeping to avoid readahead on first read. */
+	prefetcher->begin_ptr = recPtr;
+
+	/* This will forget about any queued up records in the decoder. */
+	XLogBeginRead(prefetcher->reader, recPtr);
+}
+
+/*
+ * A wrapper for XLogReadRecord() that provides the same interface, but also
+ * tries to initiate IO ahead of time unless asked not to.
+ */
+XLogRecord *
+XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
+						 char **errmsg)
+{
+	DecodedXLogRecord *record;
+
+	/*
+	 * See if it's time to reset the prefetching machinery, because a relevant
+	 * GUC was changed.
+	 */
+	if (unlikely(XLogPrefetchReconfigureCount != prefetcher->reconfigure_count))
+	{
+		if (prefetcher->streaming_read)
+			lrq_free(prefetcher->streaming_read);
+
+		/*
+		 * Arbitrarily look up to 4 times further ahead than the number of IOs
+		 * we're allowed to run concurrently.
+		 */
+		prefetcher->streaming_read =
+			lrq_alloc(recovery_prefetch ? maintenance_io_concurrency * 4 : 1,
+					  recovery_prefetch ? maintenance_io_concurrency : 1,
+					  (uintptr_t) prefetcher,
+					  XLogPrefetcherNextBlock);
+
+		prefetcher->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	/*
+	 * Release the last returned record, if there is one.  We need to do this
+	 * so that we can check for an empty decode queue accurately.
+	 */
+	XLogReleasePreviousRecord(prefetcher->reader);
+
+	/* If there's nothing queued yet, then start prefetching. */
+	if (!XLogReaderHasQueuedRecordOrError(prefetcher->reader))
+		lrq_prefetch(prefetcher->streaming_read);
+
+	/* Read the next record. */
+	record = XLogNextRecord(prefetcher->reader, errmsg);
+	if (!record)
+		return NULL;
+
+	/*
+	 * The record we just got is the "current" one, for the benefit of the
+	 * XLogRecXXX() macros.
+	 */
+	Assert(record == prefetcher->reader->record);
+
+	/*
+	 * Can we drop any prefetch filters yet, given the record we're about to
+	 * return?  This assumes that any records with earlier LSNs have been
+	 * replayed, so if we were waiting for a relation to be created or
+	 * extended, it is now OK to access blocks in the covered range.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, record->lsn);
+
+	/*
+	 * See if it's time to compute some statistics, because enough WAL has
+	 * been processed.
+	 */
+	if (unlikely(record->lsn >= prefetcher->next_stats_shm_lsn))
+		XLogPrefetcherComputeStats(prefetcher, record->lsn);
+
+	/*
+	 * The caller is about to replay this record, so we can now report that
+	 * all IO initiated for earlier WAL must be finished.  This may
+	 * trigger more readahead.
+	 */
+	lrq_complete_lsn(prefetcher->streaming_read, record->lsn);
+
+	Assert(record == prefetcher->reader->record);
+
+	return &record->header;
+}
+
+bool
+check_recovery_prefetch(bool *new_value, void **extra, GucSource source)
+{
+#ifndef USE_PREFETCH
+	if (*new_value)
+	{
+		GUC_check_errdetail("recovery_prefetch must be set to off on platforms that lack posix_fadvise().");
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_recovery_prefetch(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index df942d27dd..010a8d3545 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1705,6 +1705,8 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
 
+			blk->prefetch_buffer = InvalidBuffer;
+
 			COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 			/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
 			if (blk->has_data && blk->data_len == 0)
@@ -1911,6 +1913,15 @@ err:
 bool
 XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 				   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+	return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL);
+}
+
+bool
+XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+					RelFileNode *rnode, ForkNumber *forknum,
+					BlockNumber *blknum,
+					Buffer *prefetch_buffer)
 {
 	DecodedBkpBlock *bkpb;
 
@@ -1925,6 +1936,8 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 		*forknum = bkpb->forknum;
 	if (blknum)
 		*blknum = bkpb->blkno;
+	if (prefetch_buffer)
+		*prefetch_buffer = bkpb->prefetch_buffer;
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 84109f1e48..156957db88 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -22,6 +22,7 @@
 #include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -355,11 +356,13 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber blkno;
+	Buffer		prefetch_buffer;
 	Page		page;
 	bool		zeromode;
 	bool		willinit;
 
-	if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+	if (!XLogRecGetBlockInfo(record, block_id, &rnode, &forknum, &blkno,
+							 &prefetch_buffer))
 	{
 		/* Caller specified a bogus block_id */
 		elog(PANIC, "failed to locate backup block with ID %d", block_id);
@@ -381,7 +384,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	{
 		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
+									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
+									  prefetch_buffer);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			elog(ERROR, "failed to restore block image");
@@ -410,7 +414,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	}
 	else
 	{
-		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode);
+		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode, prefetch_buffer);
 		if (BufferIsValid(*buf))
 		{
 			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
@@ -450,6 +454,10 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  * exist, and we don't check for all-zeroes.  Thus, no log entry is made
  * to imply that the page should be dropped or truncated later.
  *
+ * Optionally, recent_buffer can be used to provide a hint about the location
+ * of the page in the buffer pool; it does not have to be correct, but avoids
+ * a buffer mapping table probe if it is.
+ *
  * NB: A redo function should normally not call this directly. To get a page
  * to modify, use XLogReadBufferForRedoExtended instead. It is important that
  * all pages modified by a WAL record are registered in the WAL records, or
@@ -457,7 +465,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  */
 Buffer
 XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-					   BlockNumber blkno, ReadBufferMode mode)
+					   BlockNumber blkno, ReadBufferMode mode,
+					   Buffer recent_buffer)
 {
 	BlockNumber lastblock;
 	Buffer		buffer;
@@ -465,6 +474,15 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 
 	Assert(blkno != P_NEW);
 
+	/* Do we have a clue where the buffer might be already? */
+	if (BufferIsValid(recent_buffer) &&
+		mode == RBM_NORMAL &&
+		ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+	{
+		buffer = recent_buffer;
+		goto recent_buffer_fast_path;
+	}
+
 	/* Open the relation at smgr level */
 	smgr = smgropen(rnode, InvalidBackendId);
 
@@ -523,6 +541,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 		}
 	}
 
+recent_buffer_fast_path:
 	if (mode == RBM_NORMAL)
 	{
 		/* check that page has been initialized */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index eb560955cd..12287e4876 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -903,6 +903,19 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.hit,
+            s.skip_init,
+            s.skip_new,
+            s.skip_fpw,
+            s.wal_distance,
+            s.block_distance,
+            s.io_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 09d4b16067..d9b862c131 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -210,7 +210,8 @@ XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 	blkno = fsm_logical_to_physical(addr);
 
 	/* If the page doesn't exist already, extend */
-	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR);
+	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR,
+								 InvalidBuffer);
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(buf);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 9fa3e0631e..2a6c07cea3 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/xlogprefetcher.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -118,6 +119,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, LockShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
+	size = add_size(size, XLogPrefetchShmemSize());
 	size = add_size(size, XLOGShmemSize());
 	size = add_size(size, CLOGShmemSize());
 	size = add_size(size, CommitTsShmemSize());
@@ -241,6 +243,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e91d5a3cfd..17477a2a00 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -41,6 +41,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetcher.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/storage.h"
@@ -211,6 +212,7 @@ static bool check_effective_io_concurrency(int *newval, void **extra, GucSource
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
 static bool check_client_connection_check_interval(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1312,6 +1314,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch referenced blocks during recovery."),
+			gettext_noop("Read ahead of the current replay position to find uncached blocks.")
+		},
+		&recovery_prefetch,
+		false,
+		check_recovery_prefetch, assign_recovery_prefetch, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2770,6 +2781,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
+			gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
+			GUC_UNIT_BYTE
+		},
+		&wal_decode_buffer_size,
+		512 * 1024, 64 * 1024, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the size of WAL files held for standby servers."),
@@ -3090,7 +3112,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -12127,6 +12150,20 @@ check_client_connection_check_interval(int *newval, void **extra, GucSource sour
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 1cbc9feeb6..32c36271cd 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -241,6 +241,11 @@
 #max_wal_size = 1GB
 #min_wal_size = 80MB
 
+# - Prefetching during recovery -
+
+#wal_decode_buffer_size = 512kB		# lookahead window used for prefetching
+#recovery_prefetch = off		# prefetch pages referenced in the WAL?
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 898df2ee03..62614769e1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -88,6 +88,7 @@ extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
 extern bool track_wal_io_timing;
+extern int	wal_decode_buffer_size;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
new file mode 100644
index 0000000000..f5bdb920d5
--- /dev/null
+++ b/src/include/access/xlogprefetcher.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetcher.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCHER_H
+#define XLOGPREFETCHER_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern bool recovery_prefetch;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+
+extern void XLogPrefetchReconfigure(void);
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchRequestResetStats(void);
+
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogReaderState *reader);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+
+extern XLogReaderState *XLogPrefetcherReader(XLogPrefetcher *prefetcher);
+
+extern void XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
+									XLogRecPtr recPtr);
+
+extern XLogRecord *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
+											char **errmsg);
+
+#endif
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 372ba1cc45..1e31b9987d 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -39,6 +39,7 @@
 #endif
 
 #include "access/xlogrecord.h"
+#include "storage/buf.h"
 
 /* WALOpenSegment represents a WAL segment being read. */
 typedef struct WALOpenSegment
@@ -125,6 +126,9 @@ typedef struct
 	ForkNumber	forknum;
 	BlockNumber blkno;
 
+	/* Prefetching workspace. */
+	Buffer		prefetch_buffer;
+
 	/* copy of the fork_flags field from the XLogRecordBlockHeader */
 	uint8		flags;
 
@@ -427,5 +431,9 @@ extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *
 extern bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 							   RelFileNode *rnode, ForkNumber *forknum,
 							   BlockNumber *blknum);
+extern bool XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+								RelFileNode *rnode, ForkNumber *forknum,
+								BlockNumber *blknum,
+								Buffer *prefetch_buffer);
 
 #endif							/* XLOGREADER_H */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index eebc91f3a5..c0eafdc517 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -84,7 +84,8 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 													Buffer *buf);
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-									 BlockNumber blkno, ReadBufferMode mode);
+									 BlockNumber blkno, ReadBufferMode mode,
+									 Buffer recent_buffer);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6412f369f1..3d03fa7c91 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6355,6 +6355,14 @@
   prorettype => 'text', proargtypes => '',
   prosrc => 'pg_get_wal_replay_pause_state' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,int4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,hit,skip_init,skip_new,skip_fpw,wal_distance,block_distance,io_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index aa18d304ac..97af4dd97c 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -447,4 +447,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetcher.c */
+extern bool check_recovery_prefetch(bool *new_value, void **extra, GucSource source);
+extern void assign_recovery_prefetch(bool new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2fa00a3c29..4543e758cd 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1879,6 +1879,16 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.hit,
+    s.skip_init,
+    s.skip_new,
+    s.skip_fpw,
+    s.wal_distance,
+    s.block_distance,
+    s.io_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, hit, skip_init, skip_new, skip_fpw, wal_distance, block_distance, io_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6d62dafdc2..40f0f9edc1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1411,6 +1411,9 @@ LogicalRepWorker
 LogicalRewriteMappingData
 LogicalTape
 LogicalTapeSet
+LsnReadQueue
+LsnReadQueueNextFun
+LsnReadQueueNextStatus
 LtreeGistOptions
 LtreeSignature
 MAGIC
@@ -2933,6 +2936,10 @@ XLogPageHeaderData
 XLogPageReadCB
 XLogPageReadPrivate
 XLogPageReadResult
+XLogPrefetcher
+XLogPrefetcherFilter
+XLogPrefetchState
+XLogPrefetchStats
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.33.1

#125Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Thomas Munro (#124)
1 attachment(s)
Re: WIP: WAL prefetch (another approach)

Hi,

It's great you posted a new version of this patch, so I took a brief
look at it. The code seems in pretty good shape; I haven't found any
real issues - just two minor comments:

This seems a bit strange:

#define DEFAULT_DECODE_BUFFER_SIZE 0x10000

Why not define this as a simple decimal value? Is there something
special about this particular value, or is it arbitrary? I guess it's
simply the minimum for the wal_decode_buffer_size GUC (0x10000 is 64kB,
which is the GUC's lower bound), but why not use the GUC in all places
that decode WAL?

FWIW I don't think we include updates to typedefs.list in patches.

I also repeated the benchmarks I did at the beginning of the year [1]/messages/by-id/c5d52837-6256-0556-ac8c-d6d3d558820a@enterprisedb.com.
Attached is a chart with four different configurations:

1) master (f79962d826)

2) patched (with prefetching disabled)

3) patched (with default configuration)

4) patched (with I/O concurrency 256 and 2MB decode buffer)

For all configs the shared buffers were set to 64GB, checkpoints every
20 minutes, etc.
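
For reference, configuration (4) corresponds roughly to the settings
below (just a sketch - I'm assuming "I/O concurrency" here means
maintenance_io_concurrency, since that's the GUC the prefetcher
consults, and anything not mentioned above is left at its default):

    shared_buffers = 64GB
    checkpoint_timeout = 20min
    recovery_prefetch = on
    maintenance_io_concurrency = 256
    wal_decode_buffer_size = 2MB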

The results are pretty good / similar to previous results. Replaying one
hour's worth of work on a smaller machine takes ~5:30h without prefetching
(master or with prefetching disabled). With prefetching enabled this
drops to ~2h (default config) and ~1h (with tuning).
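
As an aside, while replay is running on a hot standby the prefetcher can
be observed through the view added by the patch (just an illustration of
the view definition above; crash recovery doesn't accept queries):

    SELECT prefetch, hit, skip_fpw,
           wal_distance, block_distance, io_depth
    FROM pg_stat_prefetch_recovery;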

regards

[1]: /messages/by-id/c5d52837-6256-0556-ac8c-d6d3d558820a@enterprisedb.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

prefetching.png (image/png): chart comparing replay duration for the four configurations described above.
����G�<�����"�X&�-L [8���e�*�
?\go��S�j������e��!*���fe����^�����-�������7j����7�:a���	d'Q��LU�[Q��.�>�?P=_���#����e��Uz��:�7.P�=��.������9X&�-L [8���e�*�$��[VB�z���-����L��}7������q�w�fow�mv�b���	d'Q��Lu����}������hNZ����������C���+���m���U�,�&�-�D��2�o9#������?����(U��R�8<����7���8�/����*e`X&�-L [8���e�+�r�����������U��SNjdr�=�/?��iE�X�e�]��(��e��#��[��co�>��U�8�^��`�����(�\���-�t�x��}��-��:*wc�)�+�}Z�W���BIk����g���*ePk,`�@�0�l�$�7���x+�p�]�}8x�F$��G���J�����]��1v����/57�H�,9S��X��_����(�SX�0�l`��Io.S]�V�������I�>��#���M�=V~Y�9i��xq�i��n���FK�k�O���k����K��+5g�K�d�=:��U�+��=w��+�1�WD�L [�@�p���TW������[VB��y>QVmq7eU�_q7�_��-�����5*��*�&�������5+��y�j}�l�V�?I�oX��iY�uw�Cf�H��	d�N�xs���7I~�[Y^��3����w�n\`o�;_s�����.[�2���Xmx��f���?P��������j�u77��\w�(��X�0�l`��Io.s��-gd�]��{�Y�x�n����5c��O�o�m�YR���F$�r�������z�
��X�0�l`��Io.s������v�V�r��3���o��w��]�����W�w� ]su���`���qGq�`��`�����(�\�\�[���~��������Q��#w��:9�
{|�\E����x�N��+3��]��]>����NY�z�}7G�;G����Fe����uh���������;zn����u��8��_q7~�i���weY�m�X�0�l`��Io.s�����s|��F����_����;�xa��;_��%%��/�9�[jU��x{|����;����V����v�O�9����`���	d'Q�����7I��2��\o�4�w9���+��o��."Q�UU��UM����O������vSW�m�b���	d'Q��L0�[UJg�(�k�����?g�=r��S��${���Rn��������~S�����7��q��.w�J�z�|��M��R���X�����{���J������X&�-L [8)������K����iii����R������r������w�o��F��;��x3��;�]�o�5D�}���e����v�f��+���vs��yA
�L [�@�pR�o\p�~����Y�f5�5Rrrr�w!���x�����t�_q����v�6c��5.�B}�l�s�5���.|W����v�K�p����.H���b��.� ��tpX�0�l`��IQ���<��}�H/��3=#�.��g�T�y�o�U,����;�`��v�v���ktU�����-D����[�eq�|��|iW��v�<D�,�&�-���[�=�b�
���P5��m���v�6��YF����.�t���>.PQ0��v��3�r��mWz��Hu*~�.R�$[������!�,`�@�0�l���/�P�r��=��.����
�t��9�N�����x{�m��y��t�_i�7uRx^�"���}'�����iKwk�Yw�,����c&�-�1�[^^�v�����"����W�������z�!8p �3�n)��X9:����{�����/������;�^�mW_��~7V{�l���)�j�&�����!��"[��v\E���@�pR�oeee3f�~������Y�fz����b�
5n�X�GQQQ�x<��/~�]�v�z�a�!o����KJb��R/�o�{���c���v��
ya�Y�kW759D6��k�Ud���1�N
�����_���Q��]5f����K^x���o���x}��7*--��3t�y���{�	����������PO����������g��Q�{g�����R*�N��(����!�/H����������l����"KQ�7>0�l���/��w���m����X��o�]s�5�x<Z�n���7�|�Z�h�YF��^�%�ko��[C=�:+�^�[]���+�N-Z`���5*����QT�\�p=DvNZ�������1�N
����������n���z�)y<eee����O����wrz��o�6�ho���
�t��`�]���8��q\Y^n��k�}�B^�����x1���;C�2��X�W�����
g|8`��Ia_���'?��>�w����wy<�����>f�y<'�qz�6=#�.��g��z:��.�R�B=��S�y�r����������l�^�"\���"����v�6"�T�����-"��Ia_�]p�z����n��?�A�VK
�x��s�]�=�ir��S/�7u>[��n	�t\#/Fq�+��<�]$-`�|SB�D�H���l�$�7�i�����V�xKZ36���%��v�V��>��A��E��vY	5�E���E��|�6l��")[D���"�xk����u�f���y<�����M�v
�x��?jo}��t6�� IDAT�E���v�V���PO�U,���E8]A��]���C�#��v��� |���	d'�}�����g5�ZC/�$��[���PO�^��}�l��sd���0V��Q�����s���|iW_���x"��@�pR�o�_n(��eo;r��z:uV���.�J�=C=4`�x1�����Y�7 2���	d'Q������5c��m�����N��n9{e���z6@@[�^��k>�����{5�_���m�7%�~�T��L [8)b�����������<c�%&&����9�����x���bo�3RB=�zao�Q��
P]�N���(���o@$��1�N���m��y��O��M����P�������������s�Pu�P�����.���49�����:seS��P-`�u^;�!���-J������m�����@p�p������x��q�7n��/�\K�.UI��o&<��������f��-���W^y%TS�n(��G���[�����N�(I��l����PO�$�������[�����Vhos�
C==U�l��N
�����o�%�\�c������#��o������U\\��S�(n(�r����)��p_����no�s�S�"��@d"[8)�����R���?����+���G�+�$i��j���


��b�qC�&I}�����l���+�"����b�����x�|*��P�p���l���.����{y<��1���^y��x��~�M�<Y�Gtj��-�[���i���XXq�gT�����a3(��p�� 2�-�����S���x4u��������.�����<�3�\n)���M�]t����C=�O�/`?\go9#�$�.�z��Ey�p�� 2�-����$�����}����}�����^jxF��-���}k��m��C=�zQ��y�YKB=�O�/`��������O����#�!�!�@�=[D&�����xKJJR��Mu���j�KOOWTT�}�Q�f��R����go}��t����>��~�}�x{���x[�eq�g �p���l���/��o���M��]�v���On��g��u������o_��H�$�o���l�[���q���p_��������������M�����@8
�l��N
��M��y�5i�D�5R�=4b�=��sz���t�5��Q�Fj�������z�a�M�X�	X_���/I\����-"��IQ�I��;��?�YM�4������~�#���_[�n
�#�����_�jxX�0��r��z:�-����������+�@���ly�N��������JOO��U�������~��"����u�6��[�����N�(I���y�����HX���L����t�$i��N�����%!�!��"![D�����x{����ys���}*7o��G�����x��y��
	�t[$,`O�mo+�K����<o@8��ly�N
����.Prr���m��T�IR�e������p���-��
a)��s���[������;���m���!�!��"![D���"�x���Wbbb��y��G�V����g6M����;��z:��x�E��
a'��������I�����v�DYg��H���l���/�bcc��Y�Z�E��z�����e��6���mr�<o{�
�tI���-=�io����������m����@E��-"��Ia_��~��xk��y+�Zr�p�Mq�� )r������������b7�T�d��B�p�����x���+G����C
�b�9g���(�8'�3"f[������o�;��������n��DJ��,d'Q������_����g6M�t�E���]��}7'��"f[�yS&������]����tg��H���l�$�7�qc��#g�]�����PO�^�x�.�J�o
�t��Y��{�?{�Wy�}mlS)&4.eBH3I��i�����IC�"'��fdX�}���M�!A`g ���qbQ"C9 �u���S�XX��'6Y[�AaY��aG�������>�=:���vW{�u~�_3g�fwe}�����O���j��H����m�S�MK�B+,��.����[����e��������Y�M��8n
K��6h����;$h�=������TX%�$���Bo`&�7tvv��%K����}�����#���g�#��|�����^ziN���[����:-x�xl��r�"z�6���2������nj�^�Y(�
)�@��z���0�����;'7�|�x<y��������O������R����{��G<�����������[�t�����������;-`����o���R�+�9��oMQEH�So`�f�U����oKiii���������w��]y��$U�GS}}�,Y�D<O��mxxX/^,����S^/))����N�����\.����l<��YC8n
��6>��oW-O9n���������[����l������>�)-����?,���'������#�������Z���/�G���/����{z{�1�x<�r�J�����1x���.�G�?���k��&�G�����\.��y�N�YD��L?�4��Cu9p1�-`/��d<n*���������[����l�%	Y�|�,[�LN�>����e���r�UW����������?����j��g�}V/�1�s�����?��?��y�dbb"����a�x<���}/��������Hys���t��.g�j�v��~\u9p1�-`��M������7�:��[����l�<xP���'���n��y�����k�����~Y�b�Y�e�-x+--�?��?��5�����k����\�=x��zs�]o�#�1d��m�>�4>�Z�~�����,�
�����z3�*x��O*7�pC����_��E�����u���_����([���/Y��O�4��\u�U��o~3������L�^������m��+��3���c����������������Z��N�������[Z������;���zyxxxxxxxxxxr���!�	l����g�*"���M7�����u�d���f��U�����D>��g���������r��/Y~��_K�k�c�?�����������<��-���������k�q�����N�������3�|P����S.t�>��G~������I@��
(����m�{��z�;Uw���U�6>>.�'"���'��W_�P($�x\����_��_������>*""��o�&W^y�TUU)�:{�v�����E�$�H�����HEEE^������&�t���o]��%�I��Z-x��T6�}�^{��Y��b�{��HI�����W��
0�]{k��0���7�������K���,����w'288(�G>����������oO?��x<��I
}�^�x<��/~���rA�6��ic��S[�o�sSC/�����e���
75j���5����K9r��UP%�^v�-����L��DD�����[��_/eee��#����G��C��tvvN�!�J����7���#��oOy��G�H___^���������Y�c����-L8�2v^����MDdk�DJ��=WP%�Nv�-����L��fb�n��o""��s�,^�X~��_�����g?��,\�P��__��fC�����3-|;r����
��L8�����"�����]oW-�D0��`X����������
0��{���0����.����}��7�)���7���TJKK�k_��|��_�k��Fu������~)))�����?_��];��\?7������eu����n�e$��cj�[���9�%�%�����z��L{`4�r�����%��o����[X���l�MLL��7�,�G/^,������������������i%���&###E�\6o�u�{R�|[VW*k�6���i��mZ�������z^y����,��q�7�dv�-����L�
�^}�U�x<RSsy7�����hSL�=*K�.��^zIe��G��] ������o��o�HHuy���Z��)������\��3�O�@��	����[��V����;��k���&TVV��>�9��}����>�1U�����^�5N;zzW�}���>���,e�����P��$8���n_���L�DD�:�����Qk���	����[��V�[uu��x�����W��k��V��K�.����w�yGEy�@�������7WN��V���4�;������+�`�,`C�5)�[����s��c8���z3�*x;x��\q�r��y9q��x<���|��s������o���LK#x����'������������-K$1���*8���#��h��H�����&5uDS�|+�
ISG��jgsRo`�f�U��e�����O~R:$�pX�,Y"w�}����k���~U-Z$�`Pu��E�V�����?�E����
q�6>�/W���o�m��IL�����<s(�i��sRo`�f�U�&"r��1����d���""����������������\q��F�V�@$$5�uY�]��YsC��D[oI�����*8����SNC�5Y?�=����v�GO��sZo`
�f�]�&"��dddD�������}�v9|�������m�f
��C�CC��L�a�[���"Q����N\�����o�����
�<�r8%|[�TH���X1�<N�-���0�-�7���x����k�x���R�|b����_b��S��h�IP]�����M9�o""�OD�=]������L�p��j�[�������{��g���Avo�h�y}��;tZu����"5|;�@b'W����.
6��l"H��K�60��
{���~�|%,�	�*�����Z�f�|�v��7j���� ;�7c9B��+3p�_�2\b�e��o��$v�8���B�7����V<�m�����o@.��[�Co`&�o]]]����������:�5���$��?��7-����6��l��ml��Y�n`4![&��o+�
IY��4uDM��/��j�[��������\��!y��Nkp�6�yWd�N����=�+�
�a�)|l�"�``��m��e<~�����T]���C6����|�f"xs�75�CCRuj���P���#-Y�h�v�aFnY�&��W���o#��J��������������&��2������\�f"xs�7����t�e
�����|jb�Ebg�f�/�h��%��(	�)����l`��������2^�R�__��l�X�c�{�F��s[o`z3����5"!y���q��c[U�wY�/��[2aH�	�����Q��jn\�������R8���������zu�S!Y����������x�f"xs�7�i�y=%�[��IuI�h���\�=�;�@��o��n��"Q���a"�.`���i��]\�<�������a"�1�d���I�d**�������-�d����w�������H$���"x�.��7+K�<
u��p� .v���;�������
����o�+��KkV�d�����c��w�e��C*_	KSG��pp,�����0�������x<��X����I���C>_o�������Q]Nnm��!�����p���'WJ�����cW�c�����-y�4��-���/E�-�K���q� �Bo`z3�*x���r��a����%W_}�x<����+���,CCj/���7�*o���7��i��$1�"��[�OG�2�!v����v���)��_�v�[2��I�@���1_L�6Ld��iG���i��K0\�0	���-�d��M/
��={d���2�|Y�`��Z�J���errRuy�E�f]�m������U�SZ��z���tO����w�1��6X��J��&���-yaH�=��'"������K�#n[�al����f�m��788(O<��\y���Q�����f@�f]5�uZ�V�Y��c��I�B��}���KQMn`g�����,>�/�m��p��?"��-E�^m�����������w���@�Bo`z3�:x������*Y�l���7O���*)++��|�;r�u��?�A9z���2-����\�e�i�|O����
l��3n����S�������Ey����U|f��w��u���#*��&����s��9�9���l�F��0����@  ���/�#��G>�����~�3�>w��EY�t��y����� �7�j�y]��8�Su9�D�QM���8S���]��+������)�W��_*j���W}hR6��?���5uD��/�P���-�d������r�UW�����K��������o����w�}�|�3�1�B�#x�.��i-x+o�T]����Ib�����|v���q��3��%w�]*+Q�%
�&��/&�G'ek���9��U}0�<�����9�����f�U������RZZ*���o$����7�xCz{{M��>����-OQj��������3�s3�zd�]p�����n6�	m�\���;�
9�:�����am����i��8���F��0���7���u"!-x���>���S��8�8���-�D0 ���YC���w��z��K"�'M�3=�{���l�;��P�|w�U�
�3�&��.�������-�@o`&�o{���g�}6���Y[2x[VW��� �3����pW�5k7R~��#�������K[oL�:�)GX��cn����w���������-�d����o��������M�"!��8�>�k_]�4U�8
X�E����.#����
��C.�Z����B�.=�Kg��w\�v����dP���.90��N-z#�[����[WW�tvv�� ;�7k+o���7��i���K�4�����?����T�4�ck�D0 ��G�
��a\���2^��D�OJ"P���(������.9���]t��u��G`��Z�F��0���7����Y�����-��7X���D����EyNGS�r�{���C.����zc��V+u�v����KEo`z3�2x{��7���O}�Sr�
7��?�iY�~�����~��Y��wj���F�� �����Bu��bk=�n���%X���0Nw\��j2��v�T�xE���w��F�}�5����|�������������Eo`z3�.x��o�\q��h�"���[�����[o�U���JY�h�����.�������N�j:�T��\������:z�]��*S�����@��3.�m��+�
��w��>������_U1i�L��.9�5}g�QS^��e��g�]v�F��0����@  ���d�����{������/_����������*������_��U�����+�6�������1X{�r�r��J�����l���k���Y�B���J�K�>�|�bU�v�F��0������f�x<r�����_�tI���/uu�������C������Ru9(�������u�H���PN��7���M���b�l�&�w�����
��e����������������0���l�<xP.\(�O$r����s�=gre�A�fmo����M��T�c����)njL	����Z��V���=�$|���;��I��.}����
{�����}���~�YC;&�(�f�U��������G���o�,^�X���L��>��-	i����R���1�
-xK���.�4,`�Mr�\z87���<��������F�OJ"P�����B;}`g��.�A�v�{�n`&[o���@���'_��Wd��]����F~��_��>(.���y��g��a����������%�-�����jk$T[�tF���KX������ig������3�&S�;}h�=h�ar����l����{��x�zn��v�e[
������^���=��A���
-x���������Y��[����SZ���.�{����K���>��FK��v���Ov�z�%*	 IDAT�w��� ���u3�*x�F��������;��.�R�����R��C�U��<��n!x,D�M���:��WE���.�$'�&}`�����Gb��-�Am���(���-�9�<�[�TH���,1Ap�-�d��
sG�f}O��o
=��.y�����������1
X8I���1��U��.��Md�e��
k�]vf���B��{�z������K��zc����a��L���|�My���S����p�
��OZ��_/o�e���lo�W�Y�o5�u��A�#-S��w��rL��$�~�j�]�=v���%��vK��KoM��v�OD,���;�����p�o����+��B-Z$��z��}��r�����W^)�-b��,���_��=q|��r��@��E����I�@�f�a���N��X��z���]SGT��X�z��U��C��,_�\�{��������K_��|�����QEZ���y�Nk�[ys��rP-x;�@u)�a�O?
v�]vfO�-4���������"[h���a-x{�e����-�d�����Y<�\�p!���.]����K]���!x�������4>�� z����-�W]�)X���:i��������������ii?��W�T[oL�6�W]`9�[��V����e���211���D"!�^{�<��s&Wfo�������.�yWh�[b�Eu9�`���hl������	��]Q�u�S��H����l��A�����G?�Q�����'�/���>�+��7{��a��u�{T��<��WOojU�c
�r1�N;;���2�(���y��u3�*x������2o�<��W�"�v�����7��_�R|�AY�p�������>�=[HE�f���Z��:���)~v����nQ]�)X�0���8-�3�x,���f�U���{��������v�L���=T���o5��Yh7��Z�k_��S��`�b�;�vs����
��u3�*x�F��������;��.�R�����N��N�V]��i�
��+T�c
��`��b�A�m����v�i�[�`\�o�+��a��-cc\ ;�7{�������J�����-zx��RL����[��i��(�s���(������Z���S�[�������.x�p��|����o~����o|CJKK���T�������E���kT�hio����������U��D�\7����.�p,`���%��v�����n�r����#��j_O�d����|�
�&&&���o��#�/��������e�����xd���r����.����#�-�+U]

����l:������`zKn����{9�oMQ���Co`&[o����x<���������G}TDD�=*K�.��^zIe��G�feM�ljcq_��&���`zKn2o�G'�����������0�����;w�5�\#�DBDD*++�s������}��c����l���>6��o
=��.yr�dS��@o����i��+""MQ-x��0��B�Z�-�d�����Zn��F����_���^����K����0�to���lZ�Y���I?�4z�v���,#�[r�����zcZ��a���
k��0������W\!����'N������9w��x<y��7U�iio�q��	&����&���`zKnF+7i�[��QDDFZ�VRR\!`-�f�U��e�����O~R:$�pX�,Y"w�}����k���~U-Z$�`Pu��E�f��!-x���>�������f�XF���&T[�o����d������&VX���l���;vLn��&��y���<���2�|�x<��x���W\��������i�[hHu9�S����};T�c(��@o�M��m��q-xk��)��z3�.x��b222���o�[��}�>|XaU�@�f/���Z�v��	�� O��S��U]��X�0�%7��_�����������l
dBo`&[o(���T����K���q���,#�[ri�j����\{]?�����
+����Lo.C�f/X���Q��r���-���u��d@o`&�7�!x�,�_��bj������,#�[r���������WR55`�{0��B�Z�-�D��2o������Q]��UL
X�U�.�0,`����d�6�����{9�o�ODUX����\���~6��o
=��.yJ�:nz�v���,#�[r��(���H�W{}����i�-�D��2o����Q�6����+�w�=o,`������MZ�nj�^Mh����B${�-LE��2o������7�K���B��r���-���h�[�z{�{=7�oMQE�Ao`&�7�!x��{�s�����n��������1XF���.�dS����-�D��2o�����Z�V�Y���+�6u������1XF���.d��qS ����\������?�ok�6�.��n���@��r��,#�[�s��$������L7���[����e��)	i����R�
�.	y��Y�o����):��@o�O`�f-x�����������1E�@o`&�7�!x�����j���������P;u������):��@o�O��1�=o""%US�M��b
*����Lo.C�f_
=�k�[I����A���������9�XF���'>����7����Z�V�����^�f"xs�7�
DBrg�}Z��:��$�)��z����-��)*��@o����5Z�6�z$���!m��z�;�[����e��M?����;U��<%�8v�)XF���/X�]��6O{k����z�f"xs�7{���0d����M#-��)��@o�_���oW-��>��zs������7W������Mcg��.�hX�0��0��J�7I�����q���0������w�t���.�����@�#�,��`zKaf;n:0�H�p��UP%�����\�����������r���w���,��`zKa���&��i��O8-�
I����"�f"xs�7g�
��z{����$�!e�B���_uIs���-��W�k�[��f����H��1-|�|��
�Ao`&�7�!xs���:���X��G�zc�����������U?��K���S�����\���Y��6j�[I����T��%.�:j�XF�����Uwi�[�)����C�GN��&W	����Lo.C��,��!���>-|[�����FRv��*T�3',`��27���Yw��"=7�2�����t�f"xs�7�I�r��i����T��$FZR'��T�T0��@o��D0������x��S�{���[����e���������������&�Nc'W�.�`,`��2w�]oW-�8�TD��#�r����	�+�Co`&�7�!xs�L�[C�����lm)���};TWT��@o��D0 ��J��-X�=�g����x*$��&M�0����\������=)w�-�+�M��q�����n���S��@o)���#Z�6���v��~vk��N�x�f"xs�7�����L;M�~;r����0����������n�)XF�����\�F��������N�4�f"xs�7wDBRujwJ����T���8��*|N�G���}o,`��R<����A�����
�E6�'|�c�[����e���;tZ�iX?-��uz�O-(1| ��i���K�XF�����S��F��3~>}�����%��S8����\���}���t�M�J������;��-����`zK�����4�4)=|+�
I[o��jc�[����e�����#�����v�{T�����)�[��������-��R�����;k���D$%|[�TH���T1P|�f"xs�74��>m����R�~j7�O-dZ���D�T��XF��#��M9r��e��i��IIUh������	�Eo`&�7�!x�H��w��'����.�K\�M�=�@����<z���-�	75��e���7�����\��
z��!��Z]�M�����~k��$FZTW��,#�[���9��M������oe�cr���o�z3���29r�D���?���_b��S������9���X�������{0�q�[��0�`y�f"xs�7d��S�����v�4zx��{~���)XF���#=|�e�BR��op�:z3���f3���������e���N��S��@o1O��->���>�>49-|#��U�[����e���L�O���Ov����4�Hb�Eb����+%qa����X�0��\�.�Z.�vo�_?0���
�u��KSG������[����e���l�O�4mb��E��E[�H������u��`z���M�rq�])���]y�3p%U!��8!��q�~`v�f"xs�7���#eM�p���H����������`z��n���m��^�v���w�\�;����������A?	�����\��
sQ�Y�r��;tZuIH�_��zK����u|��q�"MEe��u���+�S��Bv��\�n���������l�f"xs�7��~�����$���5���~��n�9����-���qZ�6R~o^w�����f����94)�}G�1�-�D��2o�+��o5�u��A�r
�/����%��x�Xj�AXF��XC��'#��L����<�4]0,����^g
��w�U���oD�ECo`&�7�!x�\�t�i�[�����A!��$��Cb��g�r�X�0��Z���8������D0P���k���#���0����a��C������Ru9(��H���n�>5� �,#�[�'>�/�����oW-��3�����]>�:��p��boX�94)MQv�!'�f"xs�7����%�\>�z�����G��K]�(����c+��{3_HA-��t�	m7�L��e���L�-�D��2o��.���4>��-��?��k �0�ql}3p��������t5]2���0!=7�s�����U�I����9�������%.Zi�z3����!�-�+U]
���4� @����>f
����&[[��}����Rk�N��+No\���lk��=G#���qd���-�D��2o(�7�%FZ��c�e��?�_���T����������0$��3*�K��q����)�K��MV?�Go`&�7�!xC1�Y�����T��/`�4��qt��������ed������!\1��f�=�c���q���_Ms�;��1gm�f"xs�7Cys��y�N�.0��SS�;C��'���r��$k7���o\/c{vI�����
�&��7&�OD����l�;>�]r�=$�=��9�>�Co`&�7�!xC1�!]>�|��������H������������]\�\FDI���%w�r�\.;�����ZQ|�f"xs�7C���Z����Qu9���,`�r�Z�H����'K\�?��-���Pm���Q�w���.������._D._�s[&d��q)�9V�`.��9����@������Lo���O��M��=uuu)�{��w�����|�c��~��������`^���
�P�Y�o5�u�����H��}=~[N��c��?��#y'\`��Yw��w�%��Z)�K�������\w�e
�������Lo�E�Q��?�#Y�h�|�Hy��_�}nxxXn��fY�t�<������O�u�]'���'%����P�����n���[�F���P+�3k%�z��A�Qg��c���kA\�;��;���h�T����C��s�i�ft�g�������st�>:'�IGo`&�7�:::�����={f��#�<".�o�������N���'?���r�~o(��i-x+o�T],��l����vH�}u��-1�b|
L��HJi�J��FF+7��Uw��%���q��-J��+T0<����#��t����C!r���N?8���^�-�D��Xmm�x<�����s7�t��}���^���>'�����#xC1�!��lb�%�n�x��k`�8�L��2�zDB�5��(���j�#��Pn��%����Ir0����~@�Y;�����|E}PGo`&�7��������^+===����T{�1����b1�3CCC��x��G�������(����p8��P�HH����Ou9�����-S���-Jj`�8F��;����(�����(/hw����fU�����F������r�J�t�����tfu����OzM���[���M����r�UW��E�d���r��W�����o�]���ED���S<���'?���[�l��#}}}9}?�7K2x[VW��X�#��1�%����r��7������QV��sI�#��|SA]�!5�^��O�$�Go`&�7����z��������wLD�Q��(�GV�Z%""���o�����]��}��m������o��������,���������gN��z�^-x;u�M����}~���IWW���}�k����E������x�������B��e��������U2�����?~[���%�K>������'�U���`�vh>$���������`�����'����!�������K���K�x�����~������;�u���������^�T���~����%i>5,�����7��w���3�grr�x���)��elll��_���d��y244$'O���#�<����=�����x���7��WVV&^���g�O��
Z��������>]�wi�����������q��f�9�����O�����~�Ay��2�P���s�����Z�]�F�]�F���E���E���)�_~IN���t�vP�������7e���d���dG}����������?�����}H����1%����������}X������_{[^~�mi8rZ�����'������E=�����x������+�G�|��i���i�x<�������]���X��+���;tZu9P��Uwd#1��o1�
%50����*y�59u59yu.����~�59��G]�u�3�0�~:�	��t���p�7������?���������w�+�G���%�H��W_-�����}��{���~��9O�7KMg���t��.�)��8������jj`�78I2�K�3��k���N?$B?(B]����_�)�a�}����'�R����0��t%U�y�4�U?@�3�)4<<,������W��>>>.��G������������
7� ����k������������'����
z��8&x�IuoT�v�$�����#Z@��Y��:�u.������B:;u���n:UC$���Kt���t���)�n�:�x<���}O�y�9~��,_�\���/�����N�8!�����������zK:::�_��|������������Z�V�\��(����h�-S�[�MY�Kuo�.��.[Hg�N���n��G,����20��x������y�OxM�EG@�A��X8��7����������?��?����i��������^���7�,���y}?�7�w�4�4��8�yWh�[b�EY�Kuo�$>��u��_�v��_QnZHwq���A]�`������}+��`X����#jJH�a����(:�7�����3g�����D"��s�HD:;;���[b�X����
��������S]S������
�.�*�@q��-�[�F�w���k!���5�u����n�%�N�d@W�s��
�0�7�!xC1%��eu��K�b����[��-~v��:��� 7V�f�M���A�����t����
���\��
�tO�z-x����.
���8��C�bg�*�@q��-��?�����b��~7]�����HX�n�B�j�������R��C�U��T�q�i�
��+����T������~?�h�&S����MgE�P IDATW�#�f�o���j�!xs�7S���Z����q�/�c)��8|N��-K��������?��m7����/K�t�`����~PC[o�����7�!xC1�t�i�[Mg��r���8�������x��[8����w�������to�Zo.C��b:r����7W�.
Y�������&���(�@qX��p�B{��n:UG^g��.>0����0�oM�b�
����e�PL���Z���i��r���8����
��(�@qX��p3zK�f�t���Z-����G'
��LG��2o(�d����Tu)P�
��n�����-JkPV�-��j�%��t�����������}�7@-�7�!xC��Y�����T�E���M\���l��Zi-��
�������w�e��������-y���/�o��+�	�!xs�7[ys��y�N�.�Xa�i��p�v��(+������?�i���H[/���������N����|����"VY�2�p�����������Pm���t���m��o���\��
�V�Y�oU�v�.�Xem�e*x��.�Y��p'��L���h����B
�����e�Pl���������"VY��L6�P��sd���Y��[�oe;����{0��B�}�\��
��������S]�������X��p'����/j�[`�f��
{�����7��B�}�\��
FHo��J%a��Ye�>05���Bu9��*���8��D��Z���(�^�>4�o�G'V������L6�e��sSZ����Y��p'��l���-x��0��B�}�\��
F�O6���S]��6z����s��0V�-����%[���c�)������
=�k���c[U�����yWL
X>��s`���9��[���]\�<�=&�j�������������U����e��V�-����%�
��#����c�������(Xp7+-`�8��z�pzoY�F���>�u,j������w��6���px��j���z�pzo�W�k�[���������
{��
0�����(Xp7�-`���L��6����Zo�N�-���Z�6��E������
������P�\����j����S����P]�Y��p���Pm������������{�3�����������S]Lf�l�B��=o��U��@V�-����%����7Ey�{�{��6L(�p�7�!x���iX��o]����D�[���M����Du5
d�����[�@�����q-x+���)`�7�!x��6��o/�U�Yq�=o��Y���?7��Ke%'�������)`"�7�!x�����m<�Uu90��)�������bo`n�-���2X��)`6�7�!x�����-�+U]Ld�lb���q����.@��[��z�����m�rS�{��M�aEE.A��2o0����#�O�.&��6��
�/	�S]�<Y���=7��h�O�.�Z>�}�q�����
� xs�7���n-x�:�[u90�U����S�M�v�.@���[��[z�L���?���u��+�p�7�!x���C������a���$V]���vh�[�}��r����������m���`�����a���O����!�Q�\��
f���>-|����.&��6|.��i���"y�lo`kn�-��G��m���i�om�����^��7�(o.C�3<q|'�M]��������-q�Vu9�`������[�����;$>��3�&Rv�u�U	8�����G������Xy��P�tS����[���z�h�&-x��L{_���b/��#�����tSw������mY���-7��pS��]Z�j�������
(>�7�!x�Yj:���m����.��6e���Bu9rd������[.��+�tS��]oL8����e�`�����-�+������` �/`��v��,Q]�Y���'���t������MHI��������*�"xs�7�I?d���;U��am��!���������H�W�.�Z.�``�gj�Nj�[IUH�\�
�����L���)����=�K�A�����������q����z�qco�TV��o���i��"e;������7�X�\��
f+o���7��6���������-z�qco�m���H[o,e�GN�� xs�7��������	��d�l��v�6b���^��[�C2�z�>�z��{0nr������T�O8���>	DB�KB��f���-��CuEf`���V��[B�5��z�Ezn<e�)��sC��2oP!	�=
���m��m�KB��i������@���da���>��[�@�����/f�\�`<e�)��sC��2oP%}�BC���KB�j�K��mS����UW [�����2���Y'���4uDS�{��8ar��s���T�:�;��)SN��v��9��6`�����[�N����~N�������T
DBR���������#�-KD��T� �{�s{o�l=�o�+��h�/�g�6L�sD��2oP���#w����ok�6�9�]��#���+U� �]{k����+���m���?��@�����e�`]����������@[�����-�+�c�����-"����A�����M�tJ�����e�`
=��9����SN���"Q�����{���\��0���L�������$=|+i|��6e�l��"%|��y@uI����5�[����^Z�*��S���[�+a	�M,�!�7�!x����ow��'�����B�l����%�z�`1��-,��2%�����?O����q���T1`?o.C�+:r�D���eu����7��B��
�I�}5�`!��-,���*}�i�z��_S}h2%|+�
�1_��j�!xs�7XU���eu����6�}�	'-`cg������:�-��jkR��pS��_�������{�8z
�"xs�7XY ������}����9m�1|c�`:���@o�,�ms��[�`\�v���oe�c�~t�\��
vPujwJ����Tv����,������-z�v�@���Wqbo��%�D0 #����e������7�!x�]d��mM�&v�Y�S�����%��P��,�5��[�Eo����MD��/6��iIUH��\5`mo.C�;�
Iys���o�O���7�q�6q�V�G�K=z��u��&pro��efs	���~+��������D��2o��|�w��:��
�I��m�v�������Gs|o��ev����3�O;Mj��M���nE��2o�����l<�5%|�u�b�K�W����[b�Euu�#���0�%7������D0�����Dd���d���w���\��
v����R]���lb�E���L�b'W�E����<���%��i�N/�Y%�vo��F0,R{t2cWR�m��=7���"xs�7����7-X����[���=�@bg	�S]�n�-�Go�_�z{J�6����+�c�n�S!�����(���8o.C��{��N-x{���%�0�k��s�'��(�����f���\\uWJ�6R~�D�}y�[MQy�������4o.C����������:���}�_���E��.�{TW���{C�[
��_Q�q�[>w�%���2NA�����IB8�������^�5j�[�������X��o��e��}Em��l�����������o����_��ij�Y����9!��������;tZ���+U�����M��2����'"Q��JK��0��8��w��7�/��iR�`\�6Ld�N���+�O���e�`wo��6��_�};2NA�����:GQ�,�-�@o)�pS���o�+����-�����/6k����T���oDd`4Q��
(�7�!x��u�{�����a���},`g�i�y\���y@��V]*`�F��_"�������bp"��pe�c��qB�:�q��7�!x�$��eu��K��X��!����Z��-�.�h��u|��q��9���[��b�l�O�\�w��k��I������OI�|%,{�F�����`>�7�!x��Y����zU���f���[��b�H�7kwi�*	l,h
j&�	�""���u7�~P�3�&��#�����e���4��������r ,`�"�&q_��!\��9�
���0��<3p�]p��-E��������O��bo�0� xs�78Ays��y�N�.�����$q�Vb��g�����T8���-���K`���C�����l��4�l������
��s�9q��q�'�c���B�������zX�(������3d�
����3z#�[�Inj��uk���32�	�/�W{tR{9��=q�w�%w���r R*���e��U�vk�[Mg��r ,`M�<�:�p�������[��b
�n��������!��U������.������8�q���V_�#��7�!x��t��YXE�GR3�����P]!0'�F��X�d����&���2^��a������d���T��
{�;������]r�c��r�A��2op�|�Z�Vuj��r ,`�"1�r�Hj��7�
�%sBo`z��%C��v�%w��>���6J|���y�w�mm��S �)�k��8��0o.C�'�������J��@X�ZJ�\�qS���-�@o���q����K������;n������C.�a���V�
���m�\{������e��o���Z���F���Sr0C.GR�GS���2����t��]r�c���#�i�C�+sVF��2op�@$�o��JU�ak5���$|Nu9@��-�@oq�h�O���(���r�.�Y%��?"c{vI����0.)��w����j�s�GYF�t� xs�78��������w��%FZT�����gJq���-��lm1����$������+���]s����s�E��2op�;���������r\�����)�-�@oq��@�L��`�v�W��������.��/Yfw\:3����s�'�k%�������)��+���;tZu9���Z��MjU�������v�$��Xpwq�r-�KW���L�zc��r�G'����E�c.����� �#xs�78�>x;r���r\�����n�����-��
Fo`z��a\��F��9�������
sH���z('"20�H�5��g��pn�t���A��2op���:-x���S]������78���-�M"�H�WB�5�����q��r�n��3g�#��!�c���KI�1����B����iw�5uD��.V��������)�����$�h�[��Bu9@��-�@oA���Z �<�Z������m[dl�.	l��n�t*��^�~�F��2op
��i-x+o�T]�������H�����!�����#)����%��:%�K�tM�9q��w\��3+�7�!x�S�YX�����-��Du5@��-�@o����\r�����9�r�`n��Gdl�.m�C�����(�����{��p�QT�#xs�78E ���eu���q=���o��.(���-�������/J��FF+7m�\�s�i��]sv�g.�~����������:X�ZO�����-���� �F���N��>�����w����p.9��;��`Yop����Z��:��Wck=1�
-xK���.(���-p����pSc�Q��uk��e;���=gv@���{0n��������I��+	�,����}Z�?�Eu9@A�-�@o��D���4V���b�3�mRk�t��a��������[2�����I�N������:���X����2��*T����� U"�������dVE�!�\�W�.����Yw������$5�uZ�Vuj��r\���$FZ��-�]��� �F���I�7�����=gt@7R~��jk�Z�Mj�[��I���������;tZ���+U��j,`-(�65������
Bo`z`���s�������2�_Q�}���o�(�78I��G�JV]�����&-x;�@u)@A�-�@o�A�59B��.}@�l��
N������.��X�ZS��mL6���[���G���1xk��i�����
+������i�iX�o]����Xk����
�.��.����-�}D�})��%�����4���Z�v��	���XkJ�lzv��r���[������iR�`\����V7;�7�!x���'��t��.��X�ZSb��Mak�F����)x-x[�THQe�!xs�78��F-x�xl��r\��E1�6Go`z`/�����cZ�60�PT���\��
N�:�o�����q-���2�4�W]�z#�[{���.-xK����k�[[oLa�3#xs�78M b�����.&����-�@o��_Q�o�v�����	-x�"�������������������S�v�.����-��d����h����	�����e��DL6U��u��vL
X8�Vu9@^�-�@o�E�M��^o��i������8�7�!x�1�T=���ia�l�����^B�5Z���Iy��M	�\��
Nt��	,(�������0������������7�!x�u�{�����a���Xk�yWL
X>�� g�F���i�j����<���C�Z�V{tRQ�3#xs�78�~�i b�m�N�������g��.����-�������{S�k��j����������e��TXP���%�L
X��P]�3z#�[�Io�+�Hy}`4a�{��\��
NUuj7bkqQ�=o�M�[������-x�v�R����v�g�{��\��
N���X�Z_����{�FZT�������~��Z�6�z$�=�=o[&�8�7�!x�S���R�y��X�Z_��Z�y���[����~�i�z{�{��q-x+�STavo.C�'��a��u�{T��*,`�/q��{�`;�F���3�z$�dS����{���
*����f���#���g�#��|�����^z)��'x��=q|����kT]�������{��~����0���D0�u������	�7%x����~��o������=�s�@�@0adF���Xq���J�)����@���Q:TP���"��A��"h�SA�R#�4E�V�D	9�s��������f���������ww���7{>����i��Q�����#dfz������Z��?oO�=�vw�n`#���~|�[�{nwhs'0�����1���v���u�>n�����nJ�!***�����c�-OOOWrr�N��9f�7�f���������6�����������w�;@��[8���LU9/����Y�����o��������E��_]f������~���L�W�k?oh��}����/��N��
l��,��q���r�7@��[8���L�����#��_���_7}"������-B<�����������*WTT��4}����C����������q�����VY�vw�F1�ps������2h�����������]����"��Q���S�z�y<��#Coh����.��!����fp9����7_�����Q�-���D�s��V�>�-���B]��)�[�����t�W��.!!Aa�g���z���5o�<�U�?,�
o��>�z�J�9s�^z�%��Ak�}�"#�mY��zh��s�Fs�1��h��>�2Y�n�7�o����Y��+����_�����-B����K�.�������q�����e�4p�@�F��h4�F��h�V��n�����E#x����S\\���������dfz��'\��C�!^|�E��<�|��m23-^�����LEAIDAT��^ob���23����A��M���������3���-��1B���Z�l����k��7O����4i��]�y�"������t���L�G���WMM��]�y�"���~���B;v����o��� x@�8��
p�[�i�&�u�]�������Z=������u�[\���/j��)!m��Au����c��wo���j��	:z�h����@�8y������E���>�{���L��-UUU���L�2E[�n
�en���6 //O�����A���?�Q�'OVtt����v�����3���W\\��^oP�4iR����Biii����233������d]}�����r�@�8u���.3�����>�{���L��-f�����{�e���1�p�[0h� ]s�5:u�T`��/�,3����]�7���t��F�~���)66V���,��u��������X��PTT����q���[uZ�p��9s���TVV����[�����;p���L�?�|���'O*..N�'Ov�g��d�������h]�^�4|����C�Q�~����m��A�GW]u�.\X���p�E�����3�H��q���[�F����mo����+ef��aC���}���[nq�WZ��\:t�����?�I�<���y����j���ef�6mZ��������xT]]��u"�����UWWk��]��8�^��:-O8s�$���_��w��������Y�f���>�an�6��V��W^�����(d���C��woz�%���[������8u��]�����i������������9sB��5kV�����y�q���[uZ������O+66V:tPLL����
y<����'��'�����Z���,�������uw�}�.��rz�%��������}�vI������{Nf����O���������k��l��/���g��f�y�q���[uZ������"y<�y���/����5j��L�g�����}o����|�����;����B��>�O'O�Y~��w***J�����}��L���C�233ef*--m�:��������U�ek�Q������-���R�N���o_I�-�G����}y���:�������w�WZ��3g���e���������R7e���N�8��u"OC?���q�@��X����n�M			��[�����+((��i��EA�}>�:t���c���1����o4{�l��{���ef*))���WRR�z����#F(55U���@�i��q��"n�h��[�n��^xA���!����_=z�����}o�\]]��v�������6l������r�g�TQQ���(��g?Z~��)����O�>�e#G�T��]u��������NIIIz��G�Y�q���[uZ����U�V������"EGGk��	��[����
�����i���*++�|�=z������3g���L�8107���_���:t�<���[���e�<����^���[EEE������zURR�X�����`����U��jhn���V��������K����Tk��UZZ�RRRt���@-s7��~�_O=�����df23�z��*++s�k\T]]�'�|R������g��Z�zuH��+��s�@]ZZ�>��#��D����p�E���r56�:tH������f��n�I�v�
�cn�&��6���R���������
�����*..��}����������]�TRR����KV���^��:������c�>�hs7� x@�8��
p���7�o��� x@�8��
p���7�o��� x@�8��
p���7�o��� x@�8��
p���7�o��� x@����
�3f�r��[����&�����^�=��%����c��z��zu�
7\�c_jz��N�87����u�o�(%%E<��%?������xTUUX�����c�^�~ddd())I���Z�v�%=��v��z��m�����7����{Z
�7�f�V����O����77ddd��P�b�u[:G�uo���i��������
���C���k�*???d����/*((�+..����5s�L�Y�F>�/����@,��9s�r�J�����+**�< 3��u����_6z�����;W�����m���l_�;�y��i���Z�p�*++�<'
�J���U�4w�\�]�������D�/������������%K�h���*..���9���w�o�o��o��O~���RRR��x��_�B:t
cRRR����}Cu����}�������<x��L]�tQ�.]dfJLL����wI����+3�����	�=��c�4l�0��������$3�=���'N�e��1����:v��^�z��t��W�_��W����Pi��E������U��=���d}�������:=��c���R||�:v�(3��#������@�������dEEE�W���c��Z�3���h��.����������_�T^^�[n�Efv��������������K������xm��1P[PP����k��!�e�=jz������E�������i����x<������^z)�Ww�����)++���r~���w�)66V���/U[[+I*++SZZ���������������jkk������o+>>>�=uO=���L���������l�d�G���w���#�z�\���2��~���-?p�����.:xKMM�����3g�������w���/�<����������4q���}�7Nf��������������:�������\��Jg���������P�6m���W]]����5h����O�<Y���*//��3g��S'
6,����k���z��7�}��^�p���9���z��wefz���C����^t����^���~�v���5k���������u�e��s��������+W���j�������[A�RRR4r�������=��>�U_�4t���#��<��V�Z��������._�<���7����������+3��Y�<~s��B�uS�l�����
�-]�Tfx����������|������kdfj�����'j��A�MS����ef���OB�������o4�������P���S�����7����h��������y��z����u999
���y!���q6v�@�D�p���_/3���KC�u��%(����������������&��~�A�;wVjj�6m�������;��������{Of�����>����23�_����4��\M�J'N�����5h� ��6m�������y��F���W_5X�q�Fm�����y!���q�E�@�A�p�*++�{��'h����C>��������m��1��0l��=���s?~\�:uR���f��)3Seee�����Pll��2��o�]qqq:~�x�}il����6l��n���|�s��u23-_�\~�_={��u�]�0�Y�f��M7��������[�n����;�N�>�n������o�q�{��gC��^oa���������W���C��m��I23=���*,,��o��^�z�}��Mo'N�Pbb����j��������O�S���*666@��=[f��S��<�S�L�����Vaa�������M��h_[~��C�c��)%%E}������7���*//O7�x��^��=*IZ�h��L��{�6o�����+;;[���A!ZNN��L>��
UTT��#G*&&Fyyy��3�k�8�;G��"x�>�O��MS||��L����2e��9rDC�������}{-X�@��ok��Q�.]�_v�e�9s�,X 3���~*I:x�����J��n���z�YWW�3f������������� k�GM7o��~���if��������-X�@]�v
�������
z|V������C�@]�����;�86�p�u��$x�� x�TVVj�����h����T;v����'/�555��w�v���(��|>��9����F�w��ii���������4��P���#���/TZZ���>�O{�����;C�s���h���*))	z����{��^���I�@�A��fC��4�m��MFF��L^�W7�p���iQ�N�*������7��74���<���*77Wk��u�;-��m��f��Unw\o��� x@�8��
p���7�o��� x@�8��
p���7�o��� x@�8��
p���7�o��� x@�8��
p���7�o���O������IEND�B`�
#126Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#125)
Re: WIP: WAL prefetch (another approach)

On Fri, Nov 26, 2021 at 11:32 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

The results are pretty good / similar to previous results. Replaying the
1h worth of work on a smaller machine takes ~5:30h without prefetching
(master or with prefetching disabled). With prefetching enabled this
drops to ~2h (default config) and ~1h (with tuning).

Thanks for testing! Wow, that's a nice graph.

This has bit-rotted already due to Robert's work on ripping out
globals, so I'll post a rebase early next week, and incorporate your
code feedback.

#127Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Thomas Munro (#126)
Re: WIP: WAL prefetch (another approach)

On 11/26/21 22:16, Thomas Munro wrote:

On Fri, Nov 26, 2021 at 11:32 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

The results are pretty good / similar to previous results. Replaying the
1h worth of work on a smaller machine takes ~5:30h without prefetching
(master or with prefetching disabled). With prefetching enabled this
drops to ~2h (default config) and ~1h (with tuning).

Thanks for testing! Wow, that's a nice graph.

This has bit-rotted already due to Robert's work on ripping out
globals, so I'll post a rebase early next week, and incorporate your
code feedback.

One thing that's not clear to me is what happened to the reasons why
this feature was reverted in the PG14 cycle?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#128Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#127)
Re: WIP: WAL prefetch (another approach)

On Sat, Nov 27, 2021 at 12:34 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

One thing that's not clear to me is what happened to the reasons why
this feature was reverted in the PG14 cycle?

Reasons for reverting:

1. A bug in commit 323cbe7c, "Remove read_page callback from
XLogReader.". I couldn't easily revert just that piece. This new
version doesn't depend on that change anymore, to try to keep things
simple. (That particular bug has been fixed in a newer version of
that patch[1], which I still think was a good idea incidentally.)
2. A bug where allocation for large records happened before
validation. Concretely, you can see that this patch does
XLogReadRecordAlloc() after validating the header (usually, same as
master), but commit f003d9f8 did it first. (Though Andres pointed
out[2] that more work is needed on that to make that logic more
robust, and I'm keen to look into that, but that's independent of this
work).
3. A wild goose chase for bugs on Tom Lane's antique 32 bit PPC
machine. Tom eventually reproduced it with the patches reverted,
which seemed to exonerate them but didn't leave a good feeling: what
was happening, and why did the patches hugely increase the likelihood
of the failure mode? I have no new information on that, but I know
that several people spent a huge amount of time and effort trying to
reproduce it on various types of systems, as did I, so despite never
reaching a firm conclusion that there was a bug, this certainly
contributed to a feeling that the patch had run out of steam for the 14
cycle.

This week I'll have another crack at getting that TAP test I proposed
that runs the regression tests with a streaming replica to work on
Windows. That does approximately what Tom was doing when he saw
problem #3, which I'd like to have as standard across the build farm.

[1]: /messages/by-id/20211007.172820.1874635561738958207.horikyota.ntt@gmail.com
[2]: /messages/by-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de

#129Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#128)
Re: WIP: WAL prefetch (another approach)

Thomas Munro <thomas.munro@gmail.com> writes:

On Sat, Nov 27, 2021 at 12:34 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

One thing that's not clear to me is what happened to the reasons why
this feature was reverted in the PG14 cycle?

3. A wild goose chase for bugs on Tom Lane's antique 32 bit PPC
machine. Tom eventually reproduced it with the patches reverted,
which seemed to exonerate them but didn't leave a good feeling: what
was happening, and why did the patches hugely increase the likelihood
of the failure mode? I have no new information on that, but I know
that several people spent a huge amount of time and effort trying to
reproduce it on various types of systems, as did I, so despite not
reaching a conclusion of a bug, this certainly contributed to a
feeling that the patch had run out of steam for the 14 cycle.

Yeah ... on the one hand, that machine has shown signs of
hard-to-reproduce flakiness, so it's easy to write off the failures
I saw as hardware issues. On the other hand, the flakiness I've
seen has otherwise manifested as kernel crashes, which is nothing
like the consistent test failures I was seeing with the patch.

Andres speculated that maybe we were seeing a kernel bug that
affects consistency of concurrent reads and writes. That could
be an explanation; but it's just evidence-free speculation so far,
so I don't feel real convinced by that idea either.

Anyway, I hope to find time to see if the issue still reproduces
with Thomas' new patch set.

regards, tom lane

#130Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Thomas Munro (#124)
Re: WIP: WAL prefetch (another approach)

Hi Thomas,

I am unable to apply this new set of patches on HEAD. Can you please share
the rebased patch, or if you have a work branch, point it out? I will refer
to it for the changes.

--
With Regards,
Ashutosh sharma.

On Tue, Nov 23, 2021 at 3:44 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Mon, Nov 15, 2021 at 11:31 PM Daniel Gustafsson <daniel@yesql.se>
wrote:

Could you post an updated version of the patch which is for review?

Sorry for taking so long to come back; I learned some new things that
made me want to restructure this code a bit (see below). Here is an
updated pair of patches that I'm currently testing.

Old problems:

1. Last time around, an infinite loop was reported in pg_waldump. I
believe Horiguchi-san has fixed that[1], but I'm no longer depending
on that patch. I thought his patch set was a good idea, but it's
complicated and there's enough going on here already... let's consider
that independently.

This version goes back to what I had earlier, though (I hope) it is
better about how "nonblocking" states are communicated. In this
version, XLogPageRead() has a way to give up part way through a record
if it doesn't have enough data and there are queued up records that
could be replayed right now. In that case, we'll go back to the
beginning of the record (and occasionally, back a WAL page) next time
we try. That's the cost of not maintaining intra-record decoding
state. (A rough sketch of how the callback can give up follows, after
point 2.)

2. Last time around, we could try to allocate a crazy amount of
memory when reading garbage past the end of the WAL. Fixed, by
validating first, like in master.
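
Here is roughly what such a page_read() callback looks like when it is
allowed to give up. This is an illustration only: the helper names and the
nonblocking test are assumptions rather than the patch's code, while
XLREAD_WOULDBLOCK is the constant named in the attached patch's commit
message.

static int
my_page_read(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
             XLogRecPtr targetRecPtr, char *readBuf)
{
    /* In nonblocking mode, report "no data yet" rather than sleeping. */
    if (reading_nonblocking(state) &&                  /* hypothetical */
        !wal_bytes_available(targetPagePtr, reqLen))   /* hypothetical */
        return XLREAD_WOULDBLOCK;

    /* Otherwise behave as before: wait, copy the page, report its size. */
    if (!copy_wal_page(targetPagePtr, readBuf))        /* hypothetical */
        return -1;                                     /* hard failure */
    return XLOG_BLCKSZ;
}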

New work:

Since last time, I went away and worked on a "real" AIO version of
this feature. That's ongoing experimental work for a future proposal,
but I have a working prototype and I aim to share that soon, when that
branch is rebased to catch up with recent changes. In that version,
the prefetcher starts actual reads into the buffer pool, and recovery
receives already pinned buffers attached to the stream of records it's
replaying.

That inspired a couple of refactoring changes to this non-AIO version,
to minimise the difference and anticipate the future work better:

1. The logic for deciding which block to start prefetching next is
moved into a new callback function in a sort of standard form (this is
approximately how all/most prefetching code looks in the AIO project,
ie sequential scans, bitmap heap scan, etc).

2. The logic for controlling how many IOs are running and deciding
when to call the above is in a separate component. In this non-AIO
version, it works using a simple ring buffer of LSNs to estimate the
number of in flight I/Os, just like before. This part would be thrown
away and replaced with the AIO branch's centralised "streaming read"
mechanism which tracks I/O completions based on a stream of completion
events from the kernel (or I/O worker processes). A sketch of that
ring-buffer bookkeeping follows this list.

3. In this version, the prefetcher still doesn't pin buffers, for
simplicity. That work did force me to study places where WAL streams
need prefetching "barriers", though, so in this patch you can
see that it's now a little more careful than it probably needs to be.
(It doesn't really matter much if you call posix_fadvise() on a
non-existent file region, or the wrong file after OID wraparound and
reuse, but it would matter if you actually read it into a buffer, and
if an intervening record might be trying to drop something you have
pinned).
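
As for the ring buffer mentioned in point 2, it can be pictured roughly
like this. It is a sketch only: the struct and function names are made up
and the real code differs in detail. The count of live entries is what the
io_depth column of the stats view shown below reports.

typedef struct PrefetchRing
{
    XLogRecPtr  lsns[64];       /* LSN of each prefetch still in flight */
    int         head;           /* oldest live entry */
    int         count;          /* current estimated I/O depth */
} PrefetchRing;

/* Remember that posix_fadvise() was issued for a block referenced at lsn. */
static void
ring_remember(PrefetchRing *ring, XLogRecPtr lsn)
{
    Assert(ring->count < lengthof(ring->lsns));
    ring->lsns[(ring->head + ring->count) % lengthof(ring->lsns)] = lsn;
    ring->count++;
}

/* Entries that replay has passed are assumed to have been pread() by now. */
static void
ring_retire(PrefetchRing *ring, XLogRecPtr replayed_up_to)
{
    while (ring->count > 0 && ring->lsns[ring->head] <= replayed_up_to)
    {
        ring->head = (ring->head + 1) % lengthof(ring->lsns);
        ring->count--;
    }
}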

Some other changes:

1. I dropped the GUC recovery_prefetch_fpw. I think it was a
possibly useful idea but it's a niche concern and not worth worrying
about for now.

2. I simplified the stats. Coming up with a good running average
system seemed like a problem for another day (the numbers before were
hard to interpret). The new stats are super simple counters and
instantaneous values:

postgres=# select * from pg_stat_prefetch_recovery ;
-[ RECORD 1 ]--+------------------------------
stats_reset | 2021-11-10 09:02:08.590217+13
prefetch | 13605674 <- times we called posix_fadvise()
hit | 24185289 <- times we found pages already cached
skip_init | 217215 <- times we did nothing because init, not read
skip_new | 192347 <- times we skipped because relation too small
skip_fpw | 27429 <- times we skipped because fpw, not read
wal_distance | 10648 <- how far ahead in WAL bytes
block_distance | 134 <- how far ahead in block references
io_depth | 50 <- fadvise() calls not yet followed by pread()

I also removed the code to save and restore the stats via the stats
collector, for now. I figured that persistent stats could be a later
feature, perhaps after the shared memory stats stuff?

3. I dropped the code that was caching an SMgrRelation pointer to
avoid smgropen() calls that showed up in some profiles. That probably
lacked invalidation that could be done with some more WAL analysis,
but I decided to leave it out completely for now for simplicity.

4. I dropped the verbose logging. I think it might make sense to
integrate with the new "recovery progress" system, but I think that
should be a separate discussion. If you want to see the counters
after crash recovery finishes, you can look at the stats view.

[1] https://commitfest.postgresql.org/34/2113/

#131Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#129)
Re: WIP: WAL prefetch (another approach)

On Fri, Nov 26, 2021 at 9:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah ... on the one hand, that machine has shown signs of
hard-to-reproduce flakiness, so it's easy to write off the failures
I saw as hardware issues. On the other hand, the flakiness I've
seen has otherwise manifested as kernel crashes, which is nothing
like the consistent test failures I was seeing with the patch.

Andres speculated that maybe we were seeing a kernel bug that
affects consistency of concurrent reads and writes. That could
be an explanation; but it's just evidence-free speculation so far,
so I don't feel real convinced by that idea either.

Anyway, I hope to find time to see if the issue still reproduces
with Thomas' new patch set.

Honestly, all the reasons that Thomas articulated for the revert seem
relatively unimpressive from my point of view. Perhaps they are
sufficient justification for a revert so near to the end of the
development cycle, but that's just an argument for committing things a
little sooner so we have time to work out the kinks. This kind of work
is too valuable to get hung up for a year or three because of a couple
of minor preexisting bugs and/or preexisting maybe-bugs.

--
Robert Haas
EDB: http://www.enterprisedb.com

#132Greg Stark
stark@mit.edu
In reply to: Tom Lane (#129)
Re: WIP: WAL prefetch (another approach)

On Fri, 26 Nov 2021 at 21:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah ... on the one hand, that machine has shown signs of
hard-to-reproduce flakiness, so it's easy to write off the failures
I saw as hardware issues. On the other hand, the flakiness I've
seen has otherwise manifested as kernel crashes, which is nothing
like the consistent test failures I was seeing with the patch.

Hm. I asked around and found a machine I can use that can run PPC
binaries, but it's actually, well, confusing. I think this is an x86
machine running Leopard which uses JIT to transparently run PPC
binaries. I'm not sure this is really a good test.

But if you're interested and can explain the tests to run I can try to
get the tests running on this machine:

IBUILD:~ gsstark$ uname -a
Darwin IBUILD.MIT.EDU 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15
16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386

IBUILD:~ gsstark$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.5.8
BuildVersion: 9L31a

#133Greg Stark
stark@mit.edu
In reply to: Greg Stark (#132)
Re: WIP: WAL prefetch (another approach)

The actual hardware of this machine is a Mac Mini Core 2 Duo. I'm not
really clear how the emulation is done and whether it makes a
reasonable test environment or not.

Hardware Overview:

Model Name: Mac mini
Model Identifier: Macmini2,1
Processor Name: Intel Core 2 Duo
Processor Speed: 2 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 4 MB
Memory: 2 GB
Bus Speed: 667 MHz
Boot ROM Version: MM21.009A.B00

#134Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#132)
Re: WIP: WAL prefetch (another approach)

Greg Stark <stark@mit.edu> writes:

But if you're interested and can explain the tests to run I can try to
get the tests running on this machine:

I'm not sure that machine is close enough to prove much, but by all
means give it a go if you wish. My test setup was explained in [1]:

To recap, the test lashup is:
* 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive)
* Standard debug build (--enable-debug --enable-cassert)
* Out-of-the-box configuration, except add wal_consistency_checking = all
and configure a wal-streaming standby on the same machine
* Repeatedly run "make installcheck-parallel", but skip the tablespace
test to avoid issues with the standby trying to use the same directory
* Delay long enough after each installcheck-parallel to let the
standby catch up (the run proper is ~24 min, plus 2 min for catchup)

Remember also that the code in question is not in HEAD; you'd
need to apply Munro's patches, or check out some commit from
around 2021-04-22.

regards, tom lane

[1]: /messages/by-id/3502526.1619925367@sss.pgh.pa.us

#135Greg Stark
stark@mit.edu
In reply to: Tom Lane (#134)
Re: WIP: WAL prefetch (another approach)

What tools and tool versions are you using to build? Is it just GCC for PPC?

There aren't any special build processes to make a fat binary involved?

--
greg

#136Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#135)
Re: WIP: WAL prefetch (another approach)

Greg Stark <stark@mit.edu> writes:

What tools and tool versions are you using to build? Is it just GCC for PPC?
There aren't any special build processes to make a fat binary involved?

Nope, just "configure; make" using that macOS version's regular gcc.

regards, tom lane

#137Greg Stark
stark@mit.edu
In reply to: Tom Lane (#136)
Re: WIP: WAL prefetch (another approach)

I have

IBUILD:postgresql gsstark$ ls /usr/bin/*gcc*
/usr/bin/gcc
/usr/bin/gcc-4.0
/usr/bin/gcc-4.2
/usr/bin/i686-apple-darwin9-gcc-4.0.1
/usr/bin/i686-apple-darwin9-gcc-4.2.1
/usr/bin/powerpc-apple-darwin9-gcc-4.0.1
/usr/bin/powerpc-apple-darwin9-gcc-4.2.1

I'm guessing I should do CC=/usr/bin/powerpc-apple-darwin9-gcc-4.2.1
or maybe 4.0.1. What version is on your G4?

#138Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#137)
Re: WIP: WAL prefetch (another approach)

Greg Stark <stark@mit.edu> writes:

I'm guessing I should do CC=/usr/bin/powerpc-apple-darwin9-gcc-4.2.1
or maybe 4.0.1. What version is on your G4?

$ gcc -v
Using built-in specs.
Target: powerpc-apple-darwin9
Configured with: /var/tmp/gcc/gcc-5493~1/src/configure --disable-checking -enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.0/ --with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib --build=i686-apple-darwin9 --program-prefix= --host=powerpc-apple-darwin9 --target=powerpc-apple-darwin9
Thread model: posix
gcc version 4.0.1 (Apple Inc. build 5493)

I see that gcc 4.2.1 is also present on this machine, but I've
never used it.

regards, tom lane

#139Greg Stark
stark@mit.edu
In reply to: Tom Lane (#138)
Re: WIP: WAL prefetch (another approach)

Hm. I seem to have picked a bad checkout. I took the last one before
the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). Or there's some
incompatibility with the emulation and the IPC stuff parallel workers
use.

2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel
worker" (PID 54073) was terminated by signal 10: Bus error
2021-12-17 17:51:51.688 EST [50955] DETAIL: Failed process was
running: SELECT variance(unique1::int4), sum(unique1::int8),
regr_count(unique1::float8, unique1::float8)
FROM (SELECT * FROM tenk1
UNION ALL SELECT * FROM tenk1
UNION ALL SELECT * FROM tenk1
UNION ALL SELECT * FROM tenk1) u;
2021-12-17 17:51:51.690 EST [50955] LOG: terminating any other active
server processes
2021-12-17 17:51:51.748 EST [54078] FATAL: the database system is in
recovery mode
2021-12-17 17:51:51.761 EST [50955] LOG: all server processes
terminated; reinitializing

#140Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Greg Stark (#139)
Re: WIP: WAL prefetch (another approach)

On 12/17/21 23:56, Greg Stark wrote:

Hm. I seem to have picked a bad checkout. I took the last one before
the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). Or there's some
incompatibility with the emulation and the IPC stuff parallel workers
use.

2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel
worker" (PID 54073) was terminated by signal 10: Bus error
2021-12-17 17:51:51.688 EST [50955] DETAIL: Failed process was
running: SELECT variance(unique1::int4), sum(unique1::int8),
regr_count(unique1::float8, unique1::float8)
FROM (SELECT * FROM tenk1
UNION ALL SELECT * FROM tenk1
UNION ALL SELECT * FROM tenk1
UNION ALL SELECT * FROM tenk1) u;
2021-12-17 17:51:51.690 EST [50955] LOG: terminating any other active
server processes
2021-12-17 17:51:51.748 EST [54078] FATAL: the database system is in
recovery mode
2021-12-17 17:51:51.761 EST [50955] LOG: all server processes
terminated; reinitializing

Interesting. In my experience SIGBUS on PPC tends to be due to incorrect
alignment, but I'm not sure how that works with the emulation. Can you
get a backtrace?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#141Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#139)
Re: WIP: WAL prefetch (another approach)

Greg Stark <stark@mit.edu> writes:

Hm. I seem to have picked a bad checkout. I took the last one before
the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc).

FWIW, I think that's the first one *after* the revert.

2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel
worker" (PID 54073) was terminated by signal 10: Bus error

I'm betting on weird emulation issue. None of my real PPC machines
showed such things.

regards, tom lane

#142Greg Stark
stark@mit.edu
In reply to: Tom Lane (#141)
Re: WIP: WAL prefetch (another approach)

On Fri, 17 Dec 2021 at 18:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Greg Stark <stark@mit.edu> writes:

Hm. I seem to have picked a bad checkout. I took the last one before
the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc).

FWIW, I think that's the first one *after* the revert.

Doh

But the bigger question is: are we really concerned about this flaky
problem? Is it worth investing time and money on? I can get money to
go buy a G4 or G5 and spend some time on it. It just seems a bit...
niche. But if it's a real bug that represents something broken on
other architectures that just happens to be easier to trigger here it
might be worthwhile.

--
greg

#143Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#142)
Re: WIP: WAL prefetch (another approach)

Greg Stark <stark@mit.edu> writes:

But the bigger question is: are we really concerned about this flaky
problem? Is it worth investing time and money on? I can get money to
go buy a G4 or G5 and spend some time on it. It just seems a bit...
niche. But if it's a real bug that represents something broken on
other architectures that just happens to be easier to trigger here it
might be worthwhile.

TBH, I don't know. There seem to be three plausible explanations:

1. Flaky hardware in my unit.
2. Ancient macOS bug, as Andres suggested upthread.
3. Actual PG bug.

If it's #1 or #2 then we're just wasting our time here. I'm not
sure how to estimate the relative probabilities, but I suspect
#3 is the least likely of the lot.

FWIW, I did just reproduce the problem on that machine with current HEAD:

2021-12-17 18:40:40.293 EST [21369] FATAL: inconsistent page found, rel 1663/167772/2673, forknum 0, blkno 26
2021-12-17 18:40:40.293 EST [21369] CONTEXT: WAL redo at C/3DE3F658 for Btree/INSERT_LEAF: off 208; blkref #0: rel 1663/167772/2673, blk 26 FPW
2021-12-17 18:40:40.522 EST [21365] LOG: startup process (PID 21369) exited with exit code 1

That was after only five loops of the regression tests, so either
I got lucky or the failure probability has increased again.

In any case, it seems clear that the problem exists independently of
Munro's patches, so I don't really think this question should be
considered a blocker for those.

regards, tom lane

#144Thomas Munro
thomas.munro@gmail.com
In reply to: Ashutosh Sharma (#130)
2 attachment(s)
Re: WIP: WAL prefetch (another approach)

[Replies to two emails]

On Fri, Dec 10, 2021 at 9:40 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

I am unable to apply this new set of patches on HEAD. Can you please share the rebased patch, or if you have a work branch, point it out? I will refer to it for the changes.

Hi Ashutosh,

Sorry I missed this. Rebase attached, and I also have a public
working branch at
https://github.com/macdice/postgres/tree/recovery-prefetch-ii .

On Fri, Nov 26, 2021 at 11:32 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

It's great you posted a new version of this patch, so I took a look a
brief look at it. The code seems in pretty good shape, I haven't found
any real issues - just two minor comments:

This seems a bit strange:

#define DEFAULT_DECODE_BUFFER_SIZE 0x10000

Why not to define this as a simple decimal value?

Changed to (64 * 1024).

Is there something
special about this particular value, or is it arbitrary?

It should be large enough for most records, without being ridiculously
large. This means that typical users of XLogReader (pg_waldump, ...)
are unlikely to fall back to the "oversized" code path for records
that don't fit in the decoding buffer. Comment added.
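
In other words the definition now reads roughly like this (paraphrased,
not copied verbatim from the attachment):

/*
 * Default size of the decoding buffer: large enough that most records
 * fit, so typical xlogreader users rarely take the "oversized"
 * allocation path.
 */
#define DEFAULT_DECODE_BUFFER_SIZE  (64 * 1024)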

I guess it's
simply the minimum for wal_decode_buffer_size GUC, but why not to use
the GUC for all places decoding WAL?

The GUC is used only by xlog.c for replay (and has a larger default
since it can usefully see into the future), but frontend tools and
other kinds of backend WAL decoding things (2PC, logical decoding)
don't or can't respect the GUC and it didn't seem worth choosing a
number for each user, so I needed to pick a default.

FWIW I don't think we include updates to typedefs.list in patches.

Seems pretty harmless? And useful to keep around in development
branches because I like to pgindent stuff...

Attachments:

v20-0001-Add-circular-WAL-decoding-buffer-take-II.patch (text/x-patch; charset=US-ASCII)
From ee9e9adc2ec2087fac7f0d83ced59f1ddbbc216c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 9 Nov 2021 16:33:10 +1300
Subject: [PATCH v20 1/2] Add circular WAL decoding buffer, take II.

Teach xlogreader.c to decode its output into a circular buffer, to
support upcoming optimizations based on looking ahead.

 * XLogReadRecord() works as before, consuming records one by one, and
   allowing them to be examined via the traditional XLogRecGetXXX()
   macros, and the traditional members like xlogreader->ReadRecPtr.

 * An alternative new interface XLogReadAhead()/XLogNextRecord() is
   added that returns pointers to DecodedXLogRecord
   objects so that it's possible to look ahead in the WAL stream.

 * In order to be able to use the new interface effectively, client
   code should provide a page_read() callback that responds to
   a new nonblocking mode by returning XLREAD_WOULDBLOCK to avoid
   waiting.  No such implementation is included in this commit,
   and other code that is unaware of the new mechanism doesn't need
   to change.

The buffer's size can be set by the client of xlogreader.c.  Large
records that don't fit in the circular buffer are called "oversized" and
allocated separately with palloc().

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/access/transam/generic_xlog.c |   6 +-
 src/backend/access/transam/xlog.c         |   6 +-
 src/backend/access/transam/xlogreader.c   | 637 +++++++++++++++++-----
 src/backend/access/transam/xlogutils.c    |   2 +-
 src/backend/replication/logical/decode.c  |   2 +-
 src/bin/pg_rewind/parsexlog.c             |   2 +-
 src/bin/pg_waldump/pg_waldump.c           |  22 +-
 src/include/access/xlogreader.h           | 153 ++++--
 src/tools/pgindent/typedefs.list          |   2 +
 9 files changed, 657 insertions(+), 175 deletions(-)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 63301a1ab1..0e9bcc7159 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -482,10 +482,10 @@ generic_redo(XLogReaderState *record)
 	uint8		block_id;
 
 	/* Protect limited size of buffers[] array */
-	Assert(record->max_block_id < MAX_GENERIC_XLOG_PAGES);
+	Assert(XLogRecMaxBlockId(record) < MAX_GENERIC_XLOG_PAGES);
 
 	/* Iterate over blocks */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		XLogRedoAction action;
 
@@ -525,7 +525,7 @@ generic_redo(XLogReaderState *record)
 	}
 
 	/* Changes are done: unlock and release all buffers */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (BufferIsValid(buffers[block_id]))
 			UnlockReleaseBuffer(buffers[block_id]);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1e1fbe957f..150b6803a8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1463,7 +1463,7 @@ checkXLogConsistency(XLogReaderState *record)
 
 	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		Buffer		buf;
 		Page		page;
@@ -10541,7 +10541,7 @@ xlog_redo(XLogReaderState *record)
 		 * resource manager needs to generate conflicts, it has to define a
 		 * separate WAL record type and redo routine.
 		 */
-		for (uint8 block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (uint8 block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			Buffer		buffer;
 
@@ -10703,7 +10703,7 @@ xlog_block_info(StringInfo buf, XLogReaderState *record)
 	int			block_id;
 
 	/* decode block references */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3a7de02565..e89fe7d928 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -42,6 +42,7 @@ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
+static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool non_blocking);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -53,6 +54,12 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 /* size of the buffer allocated for error message. */
 #define MAX_ERRORMSG_LEN 1000
 
+/*
+ * Default size; large enough that typical users of XLogReader won't often need
+ * to use the 'oversized' memory allocation code path.
+ */
+#define DEFAULT_DECODE_BUFFER_SIZE (64 * 1024)
+
 /*
  * Construct a string in state->errormsg_buf explaining what's wrong with
  * the current record being read.
@@ -67,6 +74,24 @@ report_invalid_record(XLogReaderState *state, const char *fmt,...)
 	va_start(args, fmt);
 	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
 	va_end(args);
+
+	state->errormsg_deferred = true;
+}
+
+/*
+ * Set the size of the decoding buffer.  A pointer to a caller supplied memory
+ * region may also be passed in, in which case non-oversized records will be
+ * decoded there.
+ */
+void
+XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size)
+{
+	Assert(state->decode_buffer == NULL);
+
+	state->decode_buffer = buffer;
+	state->decode_buffer_size = size;
+	state->decode_buffer_head = buffer;
+	state->decode_buffer_tail = buffer;
 }
 
 /*
@@ -89,8 +114,6 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	/* initialize caller-provided support functions */
 	state->routine = *routine;
 
-	state->max_block_id = -1;
-
 	/*
 	 * Permanently allocate readBuf.  We do it this way, rather than just
 	 * making a static array, for two reasons: (1) no need to waste the
@@ -141,18 +164,11 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 void
 XLogReaderFree(XLogReaderState *state)
 {
-	int			block_id;
-
 	if (state->seg.ws_file != -1)
 		state->routine.segment_close(state);
 
-	for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
-	{
-		if (state->blocks[block_id].data)
-			pfree(state->blocks[block_id].data);
-	}
-	if (state->main_data)
-		pfree(state->main_data);
+	if (state->decode_buffer && state->free_decode_buffer)
+		pfree(state->decode_buffer);
 
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
@@ -248,7 +264,132 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
 
 	/* Begin at the passed-in record pointer. */
 	state->EndRecPtr = RecPtr;
+	state->NextRecPtr = RecPtr;
 	state->ReadRecPtr = InvalidXLogRecPtr;
+	state->DecodeRecPtr = InvalidXLogRecPtr;
+}
+
+/*
+ * See if we can release the last record that was returned by
+ * XLogNextRecord(), to free up space.
+ */
+void
+XLogReleasePreviousRecord(XLogReaderState *state)
+{
+	DecodedXLogRecord *record;
+
+	if (!state->record)
+		return;
+
+	/*
+	 * Remove it from the decoded record queue.  It must be the oldest item
+	 * decoded, decode_queue_tail.
+	 */
+	record = state->record;
+	Assert(record == state->decode_queue_tail);
+	state->record = NULL;
+	state->decode_queue_tail = record->next;
+
+	/* It might also be the newest item decoded, decode_queue_head. */
+	if (state->decode_queue_head == record)
+		state->decode_queue_head = NULL;
+
+	/* Release the space. */
+	if (unlikely(record->oversized))
+	{
+		/* It's not in the decode buffer, so free it to release space. */
+		pfree(record);
+	}
+	else
+	{
+		/* It must be the tail record in the decode buffer. */
+		Assert(state->decode_buffer_tail == (char *) record);
+
+		/*
+		 * We need to update tail to point to the next record that is in the
+		 * decode buffer, if any, being careful to skip oversized ones
+		 * (they're not in the decode buffer).
+		 */
+		record = record->next;
+		while (unlikely(record && record->oversized))
+			record = record->next;
+
+		if (record)
+		{
+			/* Adjust tail to release space up to the next record. */
+			state->decode_buffer_tail = (char *) record;
+		}
+		else
+		{
+			/*
+			 * Otherwise we might as well just reset head and tail to the
+			 * start of the buffer space, because we're empty.  This means
+			 * we'll keep overwriting the same piece of memory if we're not
+			 * doing any prefetching.
+			 */
+			state->decode_buffer_tail = state->decode_buffer;
+			state->decode_buffer_head = state->decode_buffer;
+		}
+	}
+}
+
+/*
+ * Attempt to read an XLOG record.
+ *
+ * XLogBeginRead() or XLogFindNextRecord() and then XLogReadAhead() must be
+ * called before the first call to XLogNextRecord().  This function returns
+ * records and errors that were put into an internal queue by XLogReadAhead().
+ *
+ * On success, a record is returned.
+ *
+ * The returned record (or *errormsg) points to an internal buffer that's
+ * valid until the next call to XLogNextRecord.
+ */
+DecodedXLogRecord *
+XLogNextRecord(XLogReaderState *state, char **errormsg)
+{
+	/* Release the last record returned by XLogNextRecord(). */
+	XLogReleasePreviousRecord(state);
+
+	if (state->decode_queue_tail == NULL)
+	{
+		*errormsg = NULL;
+		if (state->errormsg_deferred)
+		{
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			state->errormsg_deferred = false;
+		}
+
+		/*
+		 * state->EndRecPtr is expected to have been set by the last call to
+		 * XLogBeginRead() or XLogNextRecord(), and is the location of the
+		 * error.
+		 */
+
+		return NULL;
+	}
+
+	/*
+	 * Record this as the most recent record returned, so that we'll release
+	 * it next time.  This also exposes it to the traditional
+	 * XLogRecXXX(xlogreader) macros, which work with the decoder rather than
+	 * the record for historical reasons.
+	 */
+	state->record = state->decode_queue_tail;
+
+	/*
+	 * Update the pointers to the beginning and one-past-the-end of this
+	 * record, again for the benefit of historical code that expected the
+	 * decoder to track this rather than accessing these fields of the record
+	 * itself.
+	 */
+	state->ReadRecPtr = state->record->lsn;
+	state->EndRecPtr = state->record->next_lsn;
+
+	*errormsg = NULL;
+
+	return state->record;
 }
 
 /*
@@ -258,17 +399,125 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  * to XLogReadRecord().
  *
  * If the page_read callback fails to read the requested data, NULL is
- * returned.  The callback is expected to have reported the error; errormsg
- * is set to NULL.
+ * returned.  The callback is expected to have reported the error; errormsg is
+ * set to NULL.
  *
  * If the reading fails for some other reason, NULL is also returned, and
  * *errormsg is set to a string with details of the failure.
  *
- * The returned pointer (or *errormsg) points to an internal buffer that's
- * valid until the next call to XLogReadRecord.
+ * On success, a record is returned.
+ *
+ * The returned record (or *errormsg) points to an internal buffer that's
+ * valid until the next call to XLogReadRecord.
  */
 XLogRecord *
 XLogReadRecord(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *decoded;
+
+	/*
+	 * Release last returned record, if there is one.  We need to do this so
+	 * that we can check for empty decode queue accurately.
+	 */
+	XLogReleasePreviousRecord(state);
+
+	/*
+	 * Call XLogReadAhead() in blocking mode to make sure there is something
+	 * in the queue, though we don't use the result.
+	 */
+	if (!XLogReaderHasQueuedRecordOrError(state))
+		XLogReadAhead(state, false /* nonblocking */ );
+
+	/* Consume the tail record or error. */
+	decoded = XLogNextRecord(state, errormsg);
+	if (decoded)
+	{
+		/*
+		 * XLogReadRecord() returns a pointer to the record's header, not the
+		 * actual decoded record.  The caller will access the decoded record
+		 * through the XLogRecGetXXX() macros, which reach the decoded
+		 * record via xlogreader->record.
+		 */
+		Assert(state->record == decoded);
+		return &decoded->header;
+	}
+
+	return NULL;
+}
+
+/*
+ * Allocate space for a decoded record.  The only member of the returned
+ * object that is initialized is the 'oversized' flag, indicating that the
+ * decoded record wouldn't fit in the decode buffer and must eventually be
+ * freed explicitly.
+ *
+ * Return NULL if there is no space in the decode buffer and allow_oversized
+ * is false, or if memory allocation fails for an oversized buffer.
+ */
+static DecodedXLogRecord *
+XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)
+{
+	size_t		required_space = DecodeXLogRecordRequiredSpace(xl_tot_len);
+	DecodedXLogRecord *decoded = NULL;
+
+	/* Allocate a circular decode buffer if we don't have one already. */
+	if (unlikely(state->decode_buffer == NULL))
+	{
+		if (state->decode_buffer_size == 0)
+			state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+		state->decode_buffer = palloc(state->decode_buffer_size);
+		state->decode_buffer_head = state->decode_buffer;
+		state->decode_buffer_tail = state->decode_buffer;
+		state->free_decode_buffer = true;
+	}
+	if (state->decode_buffer_head >= state->decode_buffer_tail)
+	{
+		/* Empty, or head is to the right of tail. */
+		if (state->decode_buffer_head + required_space <=
+			state->decode_buffer + state->decode_buffer_size)
+		{
+			/* There is space between head and end. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+		else if (state->decode_buffer + required_space <
+				 state->decode_buffer_tail)
+		{
+			/* There is space between start and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+	else
+	{
+		/* Head is to the left of tail. */
+		if (state->decode_buffer_head + required_space <
+			state->decode_buffer_tail)
+		{
+			/* There is space between head and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+
+	/* Not enough space in the decode buffer.  Are we allowed to allocate? */
+	if (allow_oversized)
+	{
+		decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
+		if (decoded == NULL)
+			return NULL;
+		decoded->oversized = true;
+		return decoded;
+	}
+
+	return decoded;
+}
+
+static XLogPageReadResult
+XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 {
 	XLogRecPtr	RecPtr;
 	XLogRecord *record;
@@ -281,6 +530,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	bool		assembled;
 	bool		gotheader;
 	int			readOff;
+	DecodedXLogRecord *decoded;
+	char	   *errormsg;		/* not used */
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -290,21 +541,20 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	randAccess = false;
 
 	/* reset error state */
-	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	decoded = NULL;
 
-	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
 
-	RecPtr = state->EndRecPtr;
+	RecPtr = state->NextRecPtr;
 
-	if (state->ReadRecPtr != InvalidXLogRecPtr)
+	if (state->DecodeRecPtr != InvalidXLogRecPtr)
 	{
 		/* read the record after the one we just read */
 
 		/*
-		 * EndRecPtr is pointing to end+1 of the previous WAL record.  If
+		 * NextRecPtr is pointing to end+1 of the previous WAL record.  If
 		 * we're at a page boundary, no more records can fit on the current
 		 * page. We must skip over the page header, but we can't do that until
 		 * we've read in the page, since the header size is variable.
@@ -315,7 +565,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		/*
 		 * Caller supplied a position to start at.
 		 *
-		 * In this case, EndRecPtr should already be pointing to a valid
+		 * In this case, NextRecPtr should already be pointing to a valid
 		 * record starting position.
 		 */
 		Assert(XRecOffIsValid(RecPtr));
@@ -323,6 +573,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	}
 
 restart:
+	state->nonblocking = nonblocking;
 	state->currRecPtr = RecPtr;
 	assembled = false;
 
@@ -336,7 +587,9 @@ restart:
 	 */
 	readOff = ReadPageInternal(state, targetPagePtr,
 							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
-	if (readOff < 0)
+	if (readOff == XLREAD_WOULDBLOCK)
+		return XLREAD_WOULDBLOCK;
+	else if (readOff < 0)
 		goto err;
 
 	/*
@@ -392,7 +645,7 @@ restart:
 	 */
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
-		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
+		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
 								   randAccess))
 			goto err;
 		gotheader = true;
@@ -411,6 +664,31 @@ restart:
 		gotheader = false;
 	}
 
+	/*
+	 * Find space to decode this record.  Don't allow oversized allocation if
+	 * the caller requested nonblocking.  Otherwise, we *have* to try to
+	 * decode the record now because the caller has nothing else to do, so
+	 * allow an oversized record to be palloc'd if that turns out to be
+	 * necessary.
+	 */
+	decoded = XLogReadRecordAlloc(state,
+								  total_len,
+								  !nonblocking /* allow_oversized */ );
+	if (decoded == NULL)
+	{
+		/*
+		 * There is no space in the decode buffer.  The caller should help
+		 * with that problem by consuming some records.
+		 */
+		if (nonblocking)
+			return XLREAD_WOULDBLOCK;
+
+		/* We failed to allocate memory for an oversized record. */
+		report_invalid_record(state,
+							  "out of memory while trying to decode a record of length %u", total_len);
+		goto err;
+	}
+
 	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
 	if (total_len > len)
 	{
@@ -450,7 +728,9 @@ restart:
 									   Min(total_len - gotlen + SizeOfXLogShortPHD,
 										   XLOG_BLCKSZ));
 
-			if (readOff < 0)
+			if (readOff == XLREAD_WOULDBLOCK)
+				return XLREAD_WOULDBLOCK;
+			else if (readOff < 0)
 				goto err;
 
 			Assert(SizeOfXLogShortPHD <= readOff);
@@ -468,7 +748,7 @@ restart:
 			if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD)
 			{
 				state->overwrittenRecPtr = RecPtr;
-				ResetDecoder(state);
+				//ResetDecoder(state);
 				RecPtr = targetPagePtr;
 				goto restart;
 			}
@@ -523,7 +803,7 @@ restart:
 			if (!gotheader)
 			{
 				record = (XLogRecord *) state->readRecordBuf;
-				if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
+				if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr,
 										   record, randAccess))
 					goto err;
 				gotheader = true;
@@ -537,8 +817,8 @@ restart:
 			goto err;
 
 		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
-		state->ReadRecPtr = RecPtr;
-		state->EndRecPtr = targetPagePtr + pageHeaderSize
+		state->DecodeRecPtr = RecPtr;
+		state->NextRecPtr = targetPagePtr + pageHeaderSize
 			+ MAXALIGN(pageHeader->xlp_rem_len);
 	}
 	else
@@ -546,16 +826,18 @@ restart:
 		/* Wait for the record data to become available */
 		readOff = ReadPageInternal(state, targetPagePtr,
 								   Min(targetRecOff + total_len, XLOG_BLCKSZ));
-		if (readOff < 0)
+		if (readOff == XLREAD_WOULDBLOCK)
+			return XLREAD_WOULDBLOCK;
+		else if (readOff < 0)
 			goto err;
 
 		/* Record does not cross a page boundary */
 		if (!ValidXLogRecord(state, record, RecPtr))
 			goto err;
 
-		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+		state->NextRecPtr = RecPtr + MAXALIGN(total_len);
 
-		state->ReadRecPtr = RecPtr;
+		state->DecodeRecPtr = RecPtr;
 	}
 
 	/*
@@ -565,14 +847,40 @@ restart:
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->segcxt.ws_segsize - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->segcxt.ws_segsize);
+		state->NextRecPtr += state->segcxt.ws_segsize - 1;
+		state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
 	}
 
-	if (DecodeXLogRecord(state, record, errormsg))
-		return record;
+	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
+	{
+		/* Record the location of the next record. */
+		decoded->next_lsn = state->NextRecPtr;
+
+		/*
+		 * If it's in the decode buffer, mark the decode buffer space as
+		 * occupied.
+		 */
+		if (!decoded->oversized)
+		{
+			/* The new decode buffer head must be MAXALIGNed. */
+			Assert(decoded->size == MAXALIGN(decoded->size));
+			if ((char *) decoded == state->decode_buffer)
+				state->decode_buffer_head = state->decode_buffer + decoded->size;
+			else
+				state->decode_buffer_head += decoded->size;
+		}
+
+		/* Insert it into the queue of decoded records. */
+		Assert(state->decode_queue_head != decoded);
+		if (state->decode_queue_head)
+			state->decode_queue_head->next = decoded;
+		state->decode_queue_head = decoded;
+		if (!state->decode_queue_tail)
+			state->decode_queue_tail = decoded;
+		return XLREAD_SUCCESS;
+	}
 	else
-		return NULL;
+		return XLREAD_FAIL;
 
 err:
 	if (assembled)
@@ -590,14 +898,46 @@ err:
 		state->missingContrecPtr = targetPagePtr;
 	}
 
+	if (decoded && decoded->oversized)
+		pfree(decoded);
+
 	/*
 	 * Invalidate the read state. We might read from a different source after
 	 * failure.
 	 */
 	XLogReaderInvalReadState(state);
 
-	if (state->errormsg_buf[0] != '\0')
-		*errormsg = state->errormsg_buf;
+	/*
+	 * If an error was written to errormsg_buf, it'll be returned to the caller
+	 * of XLogReadRecord() after all successfully decoded records from the
+	 * read queue.
+	 */
+
+	return XLREAD_FAIL;
+}
+
+/*
+ * Try to decode the next available record, and return it.  The record will
+ * also be returned by XLogNextRecord(), which must be called to 'consume'
+ * each record.
+ *
+ * If nonblocking is true, may return NULL due to lack of data or WAL decoding
+ * space.
+ */
+DecodedXLogRecord *
+XLogReadAhead(XLogReaderState *state, bool nonblocking)
+{
+	XLogPageReadResult result;
+
+	if (state->errormsg_deferred)
+		return NULL;
+
+	result = XLogDecodeNextRecord(state, nonblocking);
+	if (result == XLREAD_SUCCESS)
+	{
+		Assert(state->decode_queue_head != NULL);
+		return state->decode_queue_head;
+	}
 
 	return NULL;
 }
@@ -649,7 +989,9 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
 										   state->currRecPtr,
 										   state->readBuf);
-		if (readLen < 0)
+		if (readLen == XLREAD_WOULDBLOCK)
+			return XLREAD_WOULDBLOCK;
+		else if (readLen < 0)
 			goto err;
 
 		/* we can be sure to have enough WAL available, we scrolled back */
@@ -667,7 +1009,9 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 									   state->currRecPtr,
 									   state->readBuf);
-	if (readLen < 0)
+	if (readLen == XLREAD_WOULDBLOCK)
+		return XLREAD_WOULDBLOCK;
+	else if (readLen < 0)
 		goto err;
 
 	Assert(readLen <= XLOG_BLCKSZ);
@@ -686,7 +1030,9 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
 										   state->currRecPtr,
 										   state->readBuf);
-		if (readLen < 0)
+		if (readLen == XLREAD_WOULDBLOCK)
+			return XLREAD_WOULDBLOCK;
+		else if (readLen < 0)
 			goto err;
 	}
 
@@ -704,8 +1050,12 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
-	XLogReaderInvalReadState(state);
-	return -1;
+	if (state->errormsg_buf[0] != '\0')
+	{
+		state->errormsg_deferred = true;
+		XLogReaderInvalReadState(state);
+	}
+	return XLREAD_FAIL;
 }
 
 /*
@@ -1062,7 +1412,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	 * or we just jumped over the remaining data of a continuation.
 	 */
 	XLogBeginRead(state, tmpRecPtr);
-	while (XLogReadRecord(state, &errormsg) != NULL)
+	while (XLogReadRecord(state, &errormsg))
 	{
 		/* past the record we've found, break out */
 		if (RecPtr <= state->ReadRecPtr)
@@ -1184,34 +1534,83 @@ WALRead(XLogReaderState *state,
  * ----------------------------------------
  */
 
-/* private function to reset the state between records */
+/*
+ * Private function to reset the state, forgetting all decoded records, if we
+ * are asked to move to a new read position.
+ */
 static void
 ResetDecoder(XLogReaderState *state)
 {
-	int			block_id;
-
-	state->decoded_record = NULL;
+	DecodedXLogRecord *r;
 
-	state->main_data_len = 0;
-
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	/* Reset the decoded record queue, freeing any oversized records. */
+	while ((r = state->decode_queue_tail))
 	{
-		state->blocks[block_id].in_use = false;
-		state->blocks[block_id].has_image = false;
-		state->blocks[block_id].has_data = false;
-		state->blocks[block_id].apply_image = false;
+		state->decode_queue_tail = r->next;
+		if (r->oversized)
+			pfree(r);
 	}
-	state->max_block_id = -1;
+	state->decode_queue_head = NULL;
+	state->decode_queue_tail = NULL;
+	state->record = NULL;
+
+	/* Reset the decode buffer to empty. */
+	state->decode_buffer_head = state->decode_buffer;
+	state->decode_buffer_tail = state->decode_buffer;
+
+	/* Clear error state. */
+	state->errormsg_buf[0] = '\0';
+	state->errormsg_deferred = false;
+}
+
+/*
+ * Compute the maximum possible amount of padding that could be required to
+ * decode a record, given xl_tot_len from the record's header.  This is the
+ * amount of output buffer space that we need to decode a record, though we
+ * might not finish up using it all.
+ *
+ * This computation is pessimistic and assumes the maximum possible number of
+ * blocks, due to lack of better information.
+ */
+size_t
+DecodeXLogRecordRequiredSpace(size_t xl_tot_len)
+{
+	size_t		size = 0;
+
+	/* Account for the fixed size part of the decoded record struct. */
+	size += offsetof(DecodedXLogRecord, blocks[0]);
+	/* Account for the flexible blocks array of maximum possible size. */
+	size += sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1);
+	/* Account for all the raw main and block data. */
+	size += xl_tot_len;
+	/* We might insert padding before main_data. */
+	size += (MAXIMUM_ALIGNOF - 1);
+	/* We might insert padding before each block's data. */
+	size += (MAXIMUM_ALIGNOF - 1) * (XLR_MAX_BLOCK_ID + 1);
+	/* We might insert padding at the end. */
+	size += (MAXIMUM_ALIGNOF - 1);
+
+	return size;
 }
 
 /*
- * Decode the previously read record.
+ * Decode a record.  "decoded" must point to a MAXALIGNed memory area that has
+ * space for at least DecodeXLogRecordRequiredSpace(record) bytes.  On
+ * success, decoded->size contains the actual space occupied by the decoded
+ * record, which may turn out to be less.
+ *
+ * Only the decoded->oversized member must be initialized on entry; it will
+ * not be modified.  Other members will be initialized as required.
  *
  * On error, a human-readable error message is returned in *errormsg, and
  * the return value is false.
  */
 bool
-DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
+DecodeXLogRecord(XLogReaderState *state,
+				 DecodedXLogRecord *decoded,
+				 XLogRecord *record,
+				 XLogRecPtr lsn,
+				 char **errormsg)
 {
 	/*
 	 * read next _size bytes from record buffer, but check for overrun first.
@@ -1226,17 +1625,20 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	} while(0)
 
 	char	   *ptr;
+	char	   *out;
 	uint32		remaining;
 	uint32		datatotal;
 	RelFileNode *rnode = NULL;
 	uint8		block_id;
 
-	ResetDecoder(state);
-
-	state->decoded_record = record;
-	state->record_origin = InvalidRepOriginId;
-	state->toplevel_xid = InvalidTransactionId;
-
+	decoded->header = *record;
+	decoded->lsn = lsn;
+	decoded->next = NULL;
+	decoded->record_origin = InvalidRepOriginId;
+	decoded->toplevel_xid = InvalidTransactionId;
+	decoded->main_data = NULL;
+	decoded->main_data_len = 0;
+	decoded->max_block_id = -1;
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
 	remaining = record->xl_tot_len - SizeOfXLogRecord;
@@ -1254,7 +1656,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint8));
 
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
@@ -1265,18 +1667,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			uint32		main_data_len;
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint32));
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
 		}
 		else if (block_id == XLR_BLOCK_ID_ORIGIN)
 		{
-			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
+			COPY_HEADER_FIELD(&decoded->record_origin, sizeof(RepOriginId));
 		}
 		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
 		{
-			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+			COPY_HEADER_FIELD(&decoded->toplevel_xid, sizeof(TransactionId));
 		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
@@ -1284,7 +1686,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			DecodedBkpBlock *blk;
 			uint8		fork_flags;
 
-			if (block_id <= state->max_block_id)
+			/* mark any intervening block IDs as not in use */
+			for (int i = decoded->max_block_id + 1; i < block_id; ++i)
+				decoded->blocks[i].in_use = false;
+
+			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
 									  "out-of-order block_id %u at %X/%X",
@@ -1292,9 +1698,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
 				goto err;
 			}
-			state->max_block_id = block_id;
+			decoded->max_block_id = block_id;
 
-			blk = &state->blocks[block_id];
+			blk = &decoded->blocks[block_id];
 			blk->in_use = true;
 			blk->apply_image = false;
 
@@ -1437,17 +1843,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	/*
 	 * Ok, we've parsed the fragment headers, and verified that the total
 	 * length of the payload in the fragments is equal to the amount of data
-	 * left. Copy the data of each fragment to a separate buffer.
-	 *
-	 * We could just set up pointers into readRecordBuf, but we want to align
-	 * the data for the convenience of the callers. Backup images are not
-	 * copied, however; they don't need alignment.
+	 * left.  Copy the data of each fragment to contiguous space after the
+	 * blocks array, inserting alignment padding before the data fragments so
+	 * they can be cast to struct pointers by REDO routines.
 	 */
+	out = ((char *) decoded) +
+		offsetof(DecodedXLogRecord, blocks) +
+		sizeof(decoded->blocks[0]) * (decoded->max_block_id + 1);
 
 	/* block data first */
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	for (block_id = 0; block_id <= decoded->max_block_id; block_id++)
 	{
-		DecodedBkpBlock *blk = &state->blocks[block_id];
+		DecodedBkpBlock *blk = &decoded->blocks[block_id];
 
 		if (!blk->in_use)
 			continue;
@@ -1456,58 +1863,37 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (blk->has_image)
 		{
-			blk->bkp_image = ptr;
+			/* no need to align image */
+			blk->bkp_image = out;
+			memcpy(out, ptr, blk->bimg_len);
 			ptr += blk->bimg_len;
+			out += blk->bimg_len;
 		}
 		if (blk->has_data)
 		{
-			if (!blk->data || blk->data_len > blk->data_bufsz)
-			{
-				if (blk->data)
-					pfree(blk->data);
-
-				/*
-				 * Force the initial request to be BLCKSZ so that we don't
-				 * waste time with lots of trips through this stanza as a
-				 * result of WAL compression.
-				 */
-				blk->data_bufsz = MAXALIGN(Max(blk->data_len, BLCKSZ));
-				blk->data = palloc(blk->data_bufsz);
-			}
+			out = (char *) MAXALIGN(out);
+			blk->data = out;
 			memcpy(blk->data, ptr, blk->data_len);
 			ptr += blk->data_len;
+			out += blk->data_len;
 		}
 	}
 
 	/* and finally, the main data */
-	if (state->main_data_len > 0)
+	if (decoded->main_data_len > 0)
 	{
-		if (!state->main_data || state->main_data_len > state->main_data_bufsz)
-		{
-			if (state->main_data)
-				pfree(state->main_data);
-
-			/*
-			 * main_data_bufsz must be MAXALIGN'ed.  In many xlog record
-			 * types, we omit trailing struct padding on-disk to save a few
-			 * bytes; but compilers may generate accesses to the xlog struct
-			 * that assume that padding bytes are present.  If the palloc
-			 * request is not large enough to include such padding bytes then
-			 * we'll get valgrind complaints due to otherwise-harmless fetches
-			 * of the padding bytes.
-			 *
-			 * In addition, force the initial request to be reasonably large
-			 * so that we don't waste time with lots of trips through this
-			 * stanza.  BLCKSZ / 2 seems like a good compromise choice.
-			 */
-			state->main_data_bufsz = MAXALIGN(Max(state->main_data_len,
-												  BLCKSZ / 2));
-			state->main_data = palloc(state->main_data_bufsz);
-		}
-		memcpy(state->main_data, ptr, state->main_data_len);
-		ptr += state->main_data_len;
+		out = (char *) MAXALIGN(out);
+		decoded->main_data = out;
+		memcpy(decoded->main_data, ptr, decoded->main_data_len);
+		ptr += decoded->main_data_len;
+		out += decoded->main_data_len;
 	}
 
+	/* Report the actual size we used. */
+	decoded->size = MAXALIGN(out - (char *) decoded);
+	Assert(DecodeXLogRecordRequiredSpace(record->xl_tot_len) >=
+		   decoded->size);
+
 	return true;
 
 shortdata_err:
@@ -1533,10 +1919,11 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	if (rnode)
 		*rnode = bkpb->rnode;
 	if (forknum)
@@ -1556,10 +1943,11 @@ XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return NULL;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 
 	if (!bkpb->has_data)
 	{
@@ -1587,12 +1975,13 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	char	   *ptr;
 	PGAlignedBlock tmp;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
-	if (!record->blocks[block_id].has_image)
+	if (!record->record->blocks[block_id].has_image)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	ptr = bkpb->bkp_image;
 
 	if (BKPIMAGE_COMPRESSED(bkpb->bimg_info))
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b33e0531ed..84109f1e48 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -370,7 +370,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * going to initialize it. And vice versa.
 	 */
 	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
-	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
+	willinit = (record->record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
 	if (!willinit && zeromode)
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 59aed6cee6..0321e6a883 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -123,7 +123,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 	{
 		ReorderBufferAssignChild(ctx->reorder,
 								 txid,
-								 record->decoded_record->xl_xid,
+								 XLogRecGetXid(record),
 								 buf.origptr);
 	}
 
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 436df54120..eb147cfdcc 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -432,7 +432,7 @@ extractPageInfo(XLogReaderState *record)
 				 RmgrNames[rmid], info);
 	}
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index f66b5a8dba..7ac2d199a2 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -410,10 +410,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
 	 * add an accessor macro for this.
 	 */
 	*fpi_len = 0;
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (XLogRecHasBlockImage(record, block_id))
-			*fpi_len += record->blocks[block_id].bimg_len;
+			*fpi_len += record->record->blocks[block_id].bimg_len;
 	}
 
 	/*
@@ -511,7 +511,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	if (!config->bkp_details)
 	{
 		/* print block references (short format) */
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -542,7 +542,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	{
 		/* print block references (detailed format) */
 		putchar('\n');
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -555,7 +555,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				   blk);
 			if (XLogRecHasBlockImage(record, block_id))
 			{
-				uint8		bimg_info = record->blocks[block_id].bimg_info;
+				uint8		bimg_info = record->record->blocks[block_id].bimg_info;
 
 				if (BKPIMAGE_COMPRESSED(bimg_info))
 				{
@@ -572,11 +572,11 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 						   "compression saved: %u, method: %s",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length,
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length,
 						   BLCKSZ -
-						   record->blocks[block_id].hole_length -
-						   record->blocks[block_id].bimg_len,
+						   record->record->blocks[block_id].hole_length -
+						   record->record->blocks[block_id].bimg_len,
 						   method);
 				}
 				else
@@ -584,8 +584,8 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 					printf(" (FPW%s); hole: offset: %u, length: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length);
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length);
 				}
 			}
 			putchar('\n');
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index de6fd791fe..372ba1cc45 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -144,6 +144,30 @@ typedef struct
 	uint16		data_bufsz;
 } DecodedBkpBlock;
 
+/*
+ * The decoded contents of a record.  This occupies a contiguous region of
+ * memory, with main_data and blocks[n].data pointing to memory after the
+ * members declared here.
+ */
+typedef struct DecodedXLogRecord
+{
+	/* Private member used for resource management. */
+	size_t		size;			/* total size of decoded record */
+	bool		oversized;		/* outside the regular decode buffer? */
+	struct DecodedXLogRecord *next; /* decoded record queue link */
+
+	/* Public members. */
+	XLogRecPtr	lsn;			/* location */
+	XLogRecPtr	next_lsn;		/* location of next record */
+	XLogRecord	header;			/* header */
+	RepOriginId record_origin;
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+	char	   *main_data;		/* record's main data portion */
+	uint32		main_data_len;	/* main data portion's length */
+	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	DecodedBkpBlock blocks[FLEXIBLE_ARRAY_MEMBER];
+} DecodedXLogRecord;
+
 struct XLogReaderState
 {
 	/*
@@ -171,6 +195,9 @@ struct XLogReaderState
 	 * Start and end point of last record read.  EndRecPtr is also used as the
 	 * position to read next.  Calling XLogBeginRead() sets EndRecPtr to the
 	 * starting position and ReadRecPtr to invalid.
+	 *
+	 * Start and end point of last record returned by XLogReadRecord().  These
+	 * are also available as record->lsn and record->next_lsn.
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
@@ -192,27 +219,43 @@ struct XLogReaderState
 	 * Use XLogRecGet* functions to investigate the record; these fields
 	 * should not be accessed directly.
 	 * ----------------------------------------
+	 * Start and end point of the last record read and decoded by
+	 * XLogDecodeNextRecord().  NextRecPtr is also used as the position to
+	 * decode next.  Calling XLogBeginRead() sets NextRecPtr and EndRecPtr to
+	 * the requested starting position.
 	 */
-	XLogRecord *decoded_record; /* currently decoded record */
+	XLogRecPtr	DecodeRecPtr;	/* start of last record decoded */
+	XLogRecPtr	NextRecPtr;		/* end+1 of last record decoded */
+	XLogRecPtr	PrevRecPtr;		/* start of previous record decoded */
 
-	char	   *main_data;		/* record's main data portion */
-	uint32		main_data_len;	/* main data portion's length */
-	uint32		main_data_bufsz;	/* allocated size of the buffer */
-
-	RepOriginId record_origin;
-
-	TransactionId toplevel_xid; /* XID of top-level transaction */
-
-	/* information about blocks referenced by the record. */
-	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
-
-	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	/* Last record returned by XLogReadRecord(). */
+	DecodedXLogRecord *record;
 
 	/* ----------------------------------------
 	 * private/internal state
 	 * ----------------------------------------
 	 */
 
+	/*
+	 * Buffer for decoded records.  This is a circular buffer, though
+	 * individual records can't be split in the middle, so some space is often
+	 * wasted at the end.  Oversized records that don't fit in this space are
+	 * allocated separately.
+	 */
+	char	   *decode_buffer;
+	size_t		decode_buffer_size;
+	bool		free_decode_buffer; /* need to free? */
+	char	   *decode_buffer_head; /* write head */
+	char	   *decode_buffer_tail; /* read head */
+
+	/*
+	 * Queue of records that have been decoded.  This is a linked list that
+	 * usually consists of consecutive records in decode_buffer, but may also
+	 * contain oversized records allocated with palloc().
+	 */
+	DecodedXLogRecord *decode_queue_head;	/* newest decoded record */
+	DecodedXLogRecord *decode_queue_tail;	/* oldest decoded record */
+
 	/*
 	 * Buffer for currently read page (XLOG_BLCKSZ bytes, valid up to at least
 	 * readLen bytes)
@@ -262,8 +305,25 @@ struct XLogReaderState
 
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
+	bool		errormsg_deferred;
+
+	/*
+	 * Flag to indicate to XLogPageReadCB that it should not block, during
+	 * read ahead.
+	 */
+	bool		nonblocking;
 };
 
+/*
+ * Check if XLogNextRecord() has any more queued records or errors.  This
+ * can be used by a read_page callback to decide whether it should block.
+ */
+static inline bool
+XLogReaderHasQueuedRecordOrError(XLogReaderState *state)
+{
+	return (state->decode_queue_head != NULL) || state->errormsg_deferred;
+}
+
 /* Get a new XLogReader */
 extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
 										   const char *waldir,
@@ -274,16 +334,40 @@ extern XLogReaderRoutine *LocalXLogReaderRoutine(void);
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
+/* Optionally provide a circular decoding buffer to allow readahead. */
+extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
+									  void *buffer,
+									  size_t size);
+
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
 
+/* Return values from XLogPageReadCB. */
+typedef enum XLogPageReadResult
+{
+	XLREAD_SUCCESS = 0,			/* record is successfully read */
+	XLREAD_FAIL = -1,			/* failed during reading a record */
+	XLREAD_WOULDBLOCK = -2		/* nonblocking mode only, no data */
+} XLogPageReadResult;
+
 /* Read the next XLog record. Returns NULL on end-of-WAL or failure */
-extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
+extern XLogRecord *XLogReadRecord(XLogReaderState *state,
+								  char **errormsg);
+
+/* Consume the next record or error. */
+extern DecodedXLogRecord *XLogNextRecord(XLogReaderState *state,
 										 char **errormsg);
 
+/* Release the previously returned record, if necessary. */
+extern void XLogReleasePreviousRecord(XLogReaderState *state);
+
+/* Try to read ahead, if there is data and space. */
+extern DecodedXLogRecord *XLogReadAhead(XLogReaderState *state,
+										bool nonblocking);
+
 /* Validate a page */
 extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 										 XLogRecPtr recptr, char *phdr);
@@ -307,25 +391,32 @@ extern bool WALRead(XLogReaderState *state,
 
 /* Functions for decoding an XLogRecord */
 
-extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
+extern size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+extern bool DecodeXLogRecord(XLogReaderState *state,
+							 DecodedXLogRecord *decoded,
+							 XLogRecord *record,
+							 XLogRecPtr lsn,
 							 char **errmsg);
 
-#define XLogRecGetTotalLen(decoder) ((decoder)->decoded_record->xl_tot_len)
-#define XLogRecGetPrev(decoder) ((decoder)->decoded_record->xl_prev)
-#define XLogRecGetInfo(decoder) ((decoder)->decoded_record->xl_info)
-#define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
-#define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
-#define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
-#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
-#define XLogRecGetData(decoder) ((decoder)->main_data)
-#define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
-#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
-#define XLogRecHasBlockRef(decoder, block_id) \
-	((decoder)->blocks[block_id].in_use)
-#define XLogRecHasBlockImage(decoder, block_id) \
-	((decoder)->blocks[block_id].has_image)
-#define XLogRecBlockImageApply(decoder, block_id) \
-	((decoder)->blocks[block_id].apply_image)
+#define XLogRecGetTotalLen(decoder) ((decoder)->record->header.xl_tot_len)
+#define XLogRecGetPrev(decoder) ((decoder)->record->header.xl_prev)
+#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
+#define XLogRecGetRmid(decoder) ((decoder)->record->header.xl_rmid)
+#define XLogRecGetXid(decoder) ((decoder)->record->header.xl_xid)
+#define XLogRecGetOrigin(decoder) ((decoder)->record->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->record->toplevel_xid)
+#define XLogRecGetData(decoder) ((decoder)->record->main_data)
+#define XLogRecGetDataLen(decoder) ((decoder)->record->main_data_len)
+#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->record->max_block_id >= 0)
+#define XLogRecMaxBlockId(decoder) ((decoder)->record->max_block_id)
+#define XLogRecGetBlock(decoder, i) (&(decoder)->record->blocks[(i)])
+#define XLogRecHasBlockRef(decoder, block_id)			\
+	(((decoder)->record->max_block_id >= (block_id)) &&	\
+	 ((decoder)->record->blocks[block_id].in_use))
+#define XLogRecHasBlockImage(decoder, block_id)		\
+	((decoder)->record->blocks[block_id].has_image)
+#define XLogRecBlockImageApply(decoder, block_id)		\
+	((decoder)->record->blocks[block_id].apply_image)
 
 #ifndef FRONTEND
 extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f093605472..c7bec37355 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -533,6 +533,7 @@ DeadLockState
 DeallocateStmt
 DeclareCursorStmt
 DecodedBkpBlock
+DecodedXLogRecord
 DecodingOutputState
 DefElem
 DefElemAction
@@ -2936,6 +2937,7 @@ XLogPageHeader
 XLogPageHeaderData
 XLogPageReadCB
 XLogPageReadPrivate
+XLogPageReadResult
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.30.2

v20-0002-Prefetch-referenced-data-in-recovery-take-II.patch (text/x-patch)
From dc48c3eebadbfab83ea04d2838b1e35ec4847eee Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 9 Nov 2021 16:43:45 +1300
Subject: [PATCH v20 2/2] Prefetch referenced data in recovery, take II.

Introduce a new GUC recovery_prefetch, disabled by default.  When
enabled, look ahead in the WAL and try to initiate asynchronous reading
of referenced data blocks that are not yet cached in our buffer pool.
For now, this is done with posix_fadvise(), which has several caveats.
Better mechanisms will follow in later work on the I/O subsystem.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size limits the maximum distance we are
prepared to read ahead in the WAL to find uncached blocks.

Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (earlier version)
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> (earlier version)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (earlier version)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> (earlier version)
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> (earlier version)
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  61 ++
 doc/src/sgml/monitoring.sgml                  |  77 +-
 doc/src/sgml/wal.sgml                         |  12 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             | 168 +++-
 src/backend/access/transam/xlogprefetcher.c   | 945 ++++++++++++++++++
 src/backend/access/transam/xlogreader.c       |  13 +
 src/backend/access/transam/xlogutils.c        |  27 +-
 src/backend/catalog/system_views.sql          |  13 +
 src/backend/storage/freespace/freespace.c     |   3 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  39 +-
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/access/xlog.h                     |   1 +
 src/include/access/xlogprefetcher.h           |  43 +
 src/include/access/xlogreader.h               |   8 +
 src/include/access/xlogutils.h                |   3 +-
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/utils/guc.h                       |   4 +
 src/test/regress/expected/rules.out           |  10 +
 src/tools/pgindent/typedefs.list              |   7 +
 21 files changed, 1391 insertions(+), 60 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetcher.c
 create mode 100644 src/include/access/xlogprefetcher.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index afbb6c35e3..1d0366181c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3621,6 +3621,67 @@ include_dir 'conf.d'
      </variablelist>
     </sect2>
 
+   <sect2 id="runtime-config-wal-recovery">
+
+    <title>Recovery</title>
+
+     <indexterm>
+      <primary>configuration</primary>
+      <secondary>of recovery</secondary>
+      <tertiary>general settings</tertiary>
+     </indexterm>
+
+    <para>
+     This section describes the settings that apply to recovery in general,
+     affecting crash recovery, streaming replication and archive-based
+     replication.
+    </para>
+
+
+    <variablelist>
+     <varlistentry id="guc-recovery-prefetch" xreflabel="recovery_prefetch">
+      <term><varname>recovery_prefetch</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to try to prefetch blocks that are referenced in the WAL that
+        are not yet in the buffer pool, during recovery.  Prefetching blocks
+        that will soon be needed can reduce I/O wait times in some workloads.
+        See also the <xref linkend="guc-wal-decode-buffer-size"/> and
+        <xref linkend="guc-maintenance-io-concurrency"/> settings, which limit
+        prefetching activity.
+        This setting is disabled by default.
+       </para>
+       <para>
+        This feature currently depends on an effective
+        <function>posix_fadvise</function> function, which some
+        operating systems lack.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-decode-buffer-size" xreflabel="wal_decode_buffer_size">
+      <term><varname>wal_decode_buffer_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_decode_buffer_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        A limit on how far ahead the server can look in the WAL, to find
+        blocks to prefetch.  If this value is specified without units, it is
+        taken as bytes.
+        The default is 512kB.
+       </para>
+      </listitem>
+     </varlistentry>
+
+    </variablelist>
+   </sect2>
+
   <sect2 id="runtime-config-wal-archive-recovery">
 
     <title>Archive Recovery</title>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 62f2a3332b..5f38d26f0e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -328,6 +328,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2959,6 +2966,69 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    copy of the subscribed tables.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_init</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they would be zero-initialized</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they didn't exist yet</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL</entry>
+    </row>
+    <row>
+     <entry><structfield>wal_distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many bytes ahead the prefetcher is looking</entry>
+    </row>
+    <row>
+     <entry><structfield>block_distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many blocks ahead the prefetcher is looking</entry>
+    </row>
+    <row>
+     <entry><structfield>io_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
+   for more information.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="1">
@@ -5213,8 +5283,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
         view, <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
-        to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+        the <structname>pg_stat_archiver</structname> view,
+        <literal>wal</literal> to reset all the counters shown in the
+        <structname>pg_stat_wal</structname> view or
+        <literal>prefetch_recovery</literal> to reset all the counters shown
+        in the <structname>pg_stat_prefetch_recovery</structname> view.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 24e1c89503..f5de473acd 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -803,6 +803,18 @@
    counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
    in <structname>pg_stat_wal</structname>, respectively.
   </para>
+
+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.
+   By default, prefetching in recovery is disabled.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..20e044c7c8 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetcher.o \
 	xlogreader.o \
 	xlogutils.o
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 150b6803a8..bcffb1e2d4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
 #include "catalog/catversion.h"
@@ -114,6 +115,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
@@ -930,10 +932,13 @@ static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
-static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-										bool fetching_ckpt, XLogRecPtr tliRecPtr,
-										TimeLineID replayTLI,
-										XLogRecPtr replayLSN);
+static XLogPageReadResult WaitForWALToBecomeAvailable(XLogRecPtr RecPtr,
+													  bool randAccess,
+													  bool fetching_ckpt,
+													  XLogRecPtr tliRecPtr,
+													  TimeLineID replayTLI,
+													  XLogRecPtr replayLSN,
+													  bool nonblocking);
 static void XLogShutdownWalRcv(void);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
@@ -947,12 +952,12 @@ static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
 static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
-static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
+static XLogRecord *ReadRecord(XLogPrefetcher *xlogprefetcher,
 							  int emode, bool fetching_ckpt,
 							  TimeLineID replayTLI);
 static void CheckRecoveryConsistency(void);
 static bool PerformRecoveryXLogAction(void);
-static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
+static XLogRecord *ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher,
 										XLogRecPtr RecPtr, int whichChkpt, bool report,
 										TimeLineID replayTLI);
 static bool rescanLatestTimeLine(TimeLineID replayTLI,
@@ -1494,7 +1499,7 @@ checkXLogConsistency(XLogReaderState *record)
 		 * temporary page.
 		 */
 		buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									 RBM_NORMAL_NO_LOG);
+									 RBM_NORMAL_NO_LOG, InvalidBuffer);
 		if (!BufferIsValid(buf))
 			continue;
 
@@ -3798,7 +3803,6 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
 					 xlogfname);
 			set_ps_display(activitymsg);
-
 			if (!RestoreArchivedFile(path, xlogfname,
 									 "RECOVERYXLOG",
 									 wal_segment_size,
@@ -4456,17 +4460,19 @@ CleanupBackupHistory(void)
  * Attempt to read the next XLOG record.
  *
  * Before first call, the reader needs to be positioned to the first record
- * by calling XLogBeginRead().
+ * by calling XLogPrefetcherBeginRead().
  *
  * If no valid record is available, returns NULL, or fails if emode is PANIC.
  * (emode must be either PANIC, LOG). In standby mode, retries until a valid
  * record is available.
  */
 static XLogRecord *
-ReadRecord(XLogReaderState *xlogreader, int emode,
+ReadRecord(XLogPrefetcher *xlogprefetcher,
+		   int emode,
 		   bool fetching_ckpt, TimeLineID replayTLI)
 {
 	XLogRecord *record;
+	XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher);
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 
 	/* Pass through parameters to XLogPageRead */
@@ -4482,7 +4488,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	{
 		char	   *errormsg;
 
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
 		{
 			/*
@@ -6697,6 +6703,7 @@ StartupXLOG(void)
 	bool		backupEndRequired = false;
 	bool		backupFromStandby = false;
 	DBState		dbstate_at_startup;
+	XLogPrefetcher *xlogprefetcher;
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
 	bool		promoted = false;
@@ -6876,6 +6883,15 @@ StartupXLOG(void)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Set the WAL decode buffer size.  This limits how far ahead we can read
+	 * in the WAL.
+	 */
+	XLogReaderSetDecodeBuffer(xlogreader, NULL, wal_decode_buffer_size);
+
+	/* Create a WAL prefetcher. */
+	xlogprefetcher = XLogPrefetcherAllocate(xlogreader);
+
 	/*
 	 * Allocate two page buffers dedicated to WAL consistency checks.  We do
 	 * it this way, rather than just making static arrays, for two reasons:
@@ -6904,7 +6920,8 @@ StartupXLOG(void)
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
 		 */
-		record = ReadCheckpointRecord(xlogreader, checkPointLoc, 0, true,
+		record = ReadCheckpointRecord(xlogprefetcher,
+									  checkPointLoc, 0, true,
 									  replayTLI);
 		if (record != NULL)
 		{
@@ -6923,8 +6940,9 @@ StartupXLOG(void)
 			 */
 			if (checkPoint.redo < checkPointLoc)
 			{
-				XLogBeginRead(xlogreader, checkPoint.redo);
-				if (!ReadRecord(xlogreader, LOG, false,
+				XLogPrefetcherBeginRead(xlogprefetcher, checkPoint.redo);
+				if (!ReadRecord(xlogprefetcher,
+								LOG, false,
 								checkPoint.ThisTimeLineID))
 					ereport(FATAL,
 							(errmsg("could not find redo location referenced by checkpoint record"),
@@ -7042,7 +7060,8 @@ StartupXLOG(void)
 		checkPointLoc = ControlFile->checkPoint;
 		RedoStartLSN = ControlFile->checkPointCopy.redo;
 		replayTLI = ControlFile->checkPointCopy.ThisTimeLineID;
-		record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, true,
+		record = ReadCheckpointRecord(xlogprefetcher,
+									  checkPointLoc, 1, true,
 									  replayTLI);
 		if (record != NULL)
 		{
@@ -7539,13 +7558,17 @@ StartupXLOG(void)
 		if (checkPoint.redo < RecPtr)
 		{
 			/* back up to find the record */
-			XLogBeginRead(xlogreader, checkPoint.redo);
-			record = ReadRecord(xlogreader, PANIC, false, replayTLI);
+			XLogPrefetcherBeginRead(xlogprefetcher, checkPoint.redo);
+			record = ReadRecord(xlogprefetcher,
+								PANIC, false,
+								replayTLI);
 		}
 		else
 		{
 			/* just have to read next record after CheckPoint */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher,
+								LOG, false,
+								replayTLI);
 		}
 
 		if (record != NULL)
@@ -7773,6 +7796,9 @@ StartupXLOG(void)
 					 */
 					if (AllowCascadeReplication())
 						WalSndWakeup();
+
+					/* Reset the prefetcher. */
+					XLogPrefetchReconfigure();
 				}
 
 				/* Exit loop if we reached inclusive recovery target */
@@ -7783,7 +7809,9 @@ StartupXLOG(void)
 				}
 
 				/* Else, try to fetch the next WAL record */
-				record = ReadRecord(xlogreader, LOG, false, replayTLI);
+				record = ReadRecord(xlogprefetcher,
+									LOG, false,
+									replayTLI);
 			} while (record != NULL);
 
 			/*
@@ -7907,7 +7935,8 @@ StartupXLOG(void)
 	 * what we consider the valid portion of WAL.
 	 */
 	XLogBeginRead(xlogreader, LastRec);
-	record = ReadRecord(xlogreader, PANIC, false, replayTLI);
+	record = ReadRecord(xlogprefetcher,
+						PANIC, false, replayTLI);
 	EndOfLog = xlogreader->EndRecPtr;
 
 	/*
@@ -8143,6 +8172,8 @@ StartupXLOG(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	XLogPrefetcherFree(xlogprefetcher);
+
 	/* Enable WAL writes for this backend only. */
 	LocalSetXLogInsertAllowed();
 
@@ -8546,7 +8577,8 @@ LocalSetXLogInsertAllowed(void)
  * 1 for "primary", 0 for "other" (backup_label)
  */
 static XLogRecord *
-ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher,
+					 XLogRecPtr RecPtr,
 					 int whichChkpt, bool report, TimeLineID replayTLI)
 {
 	XLogRecord *record;
@@ -8571,8 +8603,9 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 		return NULL;
 	}
 
-	XLogBeginRead(xlogreader, RecPtr);
-	record = ReadRecord(xlogreader, LOG, true, replayTLI);
+	XLogPrefetcherBeginRead(xlogprefetcher, RecPtr);
+	record = ReadRecord(xlogprefetcher,
+						LOG, true, replayTLI);
 
 	if (record == NULL)
 	{
@@ -12379,6 +12412,9 @@ CancelBackup(void)
  * and call XLogPageRead() again with the same arguments. This lets
  * XLogPageRead() to try fetching the record from another source, or to
  * sleep and retry.
+ *
+ * While prefetching, xlogreader->nonblocking may be set.  In that case,
+ * return XLREAD_WOULDBLOCK if we'd otherwise have to wait.
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
@@ -12428,20 +12464,31 @@ retry:
 		(readSource == XLOG_FROM_STREAM &&
 		 flushedUpto < targetPagePtr + reqLen))
 	{
-		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
-										 private->randAccess,
-										 private->fetching_ckpt,
-										 targetRecPtr,
-										 private->replayTLI,
-										 xlogreader->EndRecPtr))
+		if (readFile >= 0 &&
+			xlogreader->nonblocking &&
+			readSource == XLOG_FROM_STREAM &&
+			flushedUpto < targetPagePtr + reqLen)
+			return XLREAD_WOULDBLOCK;
+
+		switch (WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
+											private->randAccess,
+											private->fetching_ckpt,
+											targetRecPtr,
+											private->replayTLI,
+											xlogreader->EndRecPtr,
+											xlogreader->nonblocking))
 		{
-			if (readFile >= 0)
-				close(readFile);
-			readFile = -1;
-			readLen = 0;
-			readSource = XLOG_FROM_ANY;
-
-			return -1;
+			case XLREAD_WOULDBLOCK:
+				return XLREAD_WOULDBLOCK;
+			case XLREAD_FAIL:
+				if (readFile >= 0)
+					close(readFile);
+				readFile = -1;
+				readLen = 0;
+				readSource = XLOG_FROM_ANY;
+				return XLREAD_FAIL;
+			case XLREAD_SUCCESS:
+				break;
 		}
 	}
 
@@ -12566,7 +12613,7 @@ next_record_is_invalid:
 	if (StandbyMode)
 		goto retry;
 	else
-		return -1;
+		return XLREAD_FAIL;
 }
 
 /*
@@ -12598,11 +12645,15 @@ next_record_is_invalid:
  * containing it (if not open already), and returns true. When end of standby
  * mode is triggered by the user, and there is no more WAL available, returns
  * false.
+ *
+ * If nonblocking is true, then give up immediately if we can't satisfy the
+ * request, returning XLREAD_WOULDBLOCK instead of waiting.
  */
-static bool
+static XLogPageReadResult
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							bool fetching_ckpt, XLogRecPtr tliRecPtr,
-							TimeLineID replayTLI, XLogRecPtr replayLSN)
+							TimeLineID replayTLI, XLogRecPtr replayLSN,
+							bool nonblocking)
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
@@ -12656,6 +12707,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 */
 		if (lastSourceFailed)
 		{
+			/*
+			 * Don't allow any retry loops to occur during nonblocking
+			 * readahead.  Let the caller process everything that has been
+			 * decoded already first.
+			 */
+			if (nonblocking)
+				return XLREAD_WOULDBLOCK;
+
 			switch (currentSource)
 			{
 				case XLOG_FROM_ARCHIVE:
@@ -12670,7 +12729,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (StandbyMode && CheckForStandbyTrigger())
 					{
 						XLogShutdownWalRcv();
-						return false;
+						return XLREAD_FAIL;
 					}
 
 					/*
@@ -12678,7 +12737,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * and pg_wal.
 					 */
 					if (!StandbyMode)
-						return false;
+						return XLREAD_FAIL;
 
 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
@@ -12819,7 +12878,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 											  currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
 											  currentSource);
 				if (readFile >= 0)
-					return true;	/* success! */
+					return XLREAD_SUCCESS;	/* success! */
 
 				/*
 				 * Nope, not found in archive or pg_wal.
@@ -12943,6 +13002,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						else
 							havedata = false;
 					}
+
 					if (havedata)
 					{
 						/*
@@ -12976,11 +13036,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							/* just make sure source info is correct... */
 							readSource = XLOG_FROM_STREAM;
 							XLogReceiptSource = XLOG_FROM_STREAM;
-							return true;
+							return XLREAD_SUCCESS;
 						}
 						break;
 					}
 
+					/* In nonblocking mode, return rather than sleeping. */
+					if (nonblocking)
+						return XLREAD_WOULDBLOCK;
+
 					/*
 					 * Data not here yet. Check for trigger, then wait for
 					 * walreceiver to wake us up when new WAL arrives.
@@ -12988,13 +13052,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (CheckForStandbyTrigger())
 					{
 						/*
-						 * Note that we don't "return false" immediately here.
-						 * After being triggered, we still want to replay all
-						 * the WAL that was already streamed. It's in pg_wal
-						 * now, so we just treat this as a failure, and the
-						 * state machine will move on to replay the streamed
-						 * WAL from pg_wal, and then recheck the trigger and
-						 * exit replay.
+						 * Note that we don't return XLREAD_FAIL immediately
+						 * here. After being triggered, we still want to
+						 * replay all the WAL that was already streamed. It's
+						 * in pg_wal now, so we just treat this as a failure,
+						 * and the state machine will move on to replay the
+						 * streamed WAL from pg_wal, and then recheck the
+						 * trigger and exit replay.
 						 */
 						lastSourceFailed = true;
 						break;
@@ -13036,7 +13100,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 */
 		if (((volatile XLogCtlData *) XLogCtl)->recoveryPauseState !=
 			RECOVERY_NOT_PAUSED)
+		{
 			recoveryPausesHere(false);
+		}
 
 		/*
 		 * This possibly-long loop needs to handle interrupts of startup
@@ -13045,7 +13111,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		HandleStartupProcInterrupts();
 	}
 
-	return false;				/* not reached */
+	return XLREAD_FAIL;			/* not reached */
 }
 
 /*
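
For illustration only, not part of the patch: the nonblocking page-read
contract introduced above amounts to something like the sketch below.
XLogPageReadResult and the XLREAD_* values are defined elsewhere in the
patch; the demo_* helpers are invented placeholders.

    /*
     * Hedged sketch of a page-read callback that honours
     * xlogreader->nonblocking.  demo_wal_is_available() and
     * demo_wait_for_wal() are hypothetical.
     */
    static int
    demo_page_read(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
                   int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
    {
        if (!demo_wal_is_available(targetPagePtr + reqLen))
        {
            /* During readahead we must not sleep: report that we'd block. */
            if (xlogreader->nonblocking)
                return XLREAD_WOULDBLOCK;

            /* Otherwise wait for more WAL to arrive, or give up. */
            if (!demo_wait_for_wal(targetPagePtr + reqLen))
                return XLREAD_FAIL;
        }

        /* ... copy the requested page into readBuf ... */
        return reqLen;
    }
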
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
new file mode 100644
index 0000000000..61b50fe400
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -0,0 +1,945 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetcher.c
+ *
+ * This module provides a drop-in replacement for an XLogReader that tries to
+ * minimize I/O stalls by looking up future blocks in the buffer cache, and
+ * initiating I/Os that might complete before the caller eventually needs the
+ * data.  XLogReadBufferForRedo() cooperates by using information stored in
+ * the decoded record to find buffers efficiently.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetcher.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_class.h"
+#include "catalog/storage_xlog.h"
+#include "commands/dbcommands_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/* Every time we process this much WAL, we update dynamic values in shm. */
+#define XLOGPREFETCHER_STATS_SHM_DISTANCE BLCKSZ
+
+/* GUCs */
+bool		recovery_prefetch = false;
+
+static int	XLogPrefetchReconfigureCount = 0;
+
+/*
+ * Enum used to report whether an IO should be started.
+ */
+typedef enum
+{
+	LRQ_NEXT_NO_IO,
+	LRQ_NEXT_IO,
+	LRQ_NEXT_AGAIN
+} LsnReadQueueNextStatus;
+
+/*
+ * Type of callback that can decide which block to prefetch next.  For now
+ * there is only one.
+ */
+typedef LsnReadQueueNextStatus (*LsnReadQueueNextFun) (uintptr_t lrq_private,
+													   XLogRecPtr *lsn);
+
+/*
+ * A simple circular queue of LSNs, used to control the number of
+ * (potentially) inflight IOs.  This stands in for a later more general IO
+ * control mechanism, which is why it has the apparently unnecessary
+ * indirection through a function pointer.
+ */
+typedef struct LsnReadQueue
+{
+	LsnReadQueueNextFun next;
+	uintptr_t	lrq_private;
+	uint32		max_inflight;
+	uint32		inflight;
+	uint32		completed;
+	uint32		head;
+	uint32		tail;
+	uint32		size;
+	struct
+	{
+		bool		io;
+		XLogRecPtr	lsn;
+	}			queue[FLEXIBLE_ARRAY_MEMBER];
+} LsnReadQueue;
+
+/*
+ * A prefetcher.  This is a mechanism that wraps an XLogReader, prefetching
+ * blocks that will soon be referenced, to try to avoid IO stalls.
+ */
+struct XLogPrefetcher
+{
+	/* WAL reader and current reading state. */
+	XLogReaderState *reader;
+	DecodedXLogRecord *record;
+	int			next_block_id;
+
+	/* When to publish stats. */
+	XLogRecPtr	next_stats_shm_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB	   *filter_table;
+	dlist_head	filter_queue;
+
+	/* IO depth manager. */
+	LsnReadQueue *streaming_read;
+
+	XLogRecPtr	begin_ptr;
+
+	int			reconfigure_count;
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that (we assume) have already been dropped, or will be created by bulk WAL
+ * operations.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode rnode;
+	XLogRecPtr	filter_until_replayed;
+	BlockNumber filter_from_block;
+	dlist_node	link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time;	/* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 hit;		/* Blocks already in cache. */
+	pg_atomic_uint64 skip_init; /* Zero-inited blocks skipped. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32		reset_handled;
+
+	/* Dynamic values */
+	int			wal_distance;	/* Number of WAL bytes ahead. */
+	int			block_distance; /* Number of block references ahead. */
+	int			io_depth;		/* Number of I/Os in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static LsnReadQueueNextStatus XLogPrefetcherNextBlock(uintptr_t pgsr_private,
+													  XLogRecPtr *lsn);
+
+static XLogPrefetchStats *SharedStats;
+
+static inline LsnReadQueue *
+lrq_alloc(uint32 max_distance,
+		  uint32 max_inflight,
+		  uintptr_t lrq_private,
+		  LsnReadQueueNextFun next)
+{
+	LsnReadQueue *lrq;
+	uint32		size;
+
+	Assert(max_distance >= max_inflight);
+
+	size = max_distance + 1;	/* full ring buffer has a gap */
+	lrq = palloc(offsetof(LsnReadQueue, queue) + sizeof(lrq->queue[0]) * size);
+	lrq->lrq_private = lrq_private;
+	lrq->max_inflight = max_inflight;
+	lrq->size = size;
+	lrq->next = next;
+	lrq->head = 0;
+	lrq->tail = 0;
+	lrq->inflight = 0;
+	lrq->completed = 0;
+
+	return lrq;
+}
+
+static inline void
+lrq_free(LsnReadQueue *lrq)
+{
+	pfree(lrq);
+}
+
+static inline uint32
+lrq_inflight(LsnReadQueue *lrq)
+{
+	return lrq->inflight;
+}
+
+static inline uint32
+lrq_completed(LsnReadQueue *lrq)
+{
+	return lrq->completed;
+}
+
+static inline void
+lrq_prefetch(LsnReadQueue *lrq)
+{
+	/* Try to start as many IOs as we can within our limits. */
+	while (lrq->inflight < lrq->max_inflight &&
+		   lrq->inflight + lrq->completed < lrq->size - 1)
+	{
+		Assert(((lrq->head + 1) % lrq->size) != lrq->tail);
+		switch (lrq->next(lrq->lrq_private, &lrq->queue[lrq->head].lsn))
+		{
+			case LRQ_NEXT_AGAIN:
+				return;
+			case LRQ_NEXT_IO:
+				lrq->queue[lrq->head].io = true;
+				lrq->inflight++;
+				break;
+			case LRQ_NEXT_NO_IO:
+				lrq->queue[lrq->head].io = false;
+				lrq->completed++;
+				break;
+		}
+		lrq->head++;
+		if (lrq->head == lrq->size)
+			lrq->head = 0;
+	}
+}
+
+static inline void
+lrq_complete_lsn(LsnReadQueue *lrq, XLogRecPtr lsn)
+{
+	/*
+	 * We know that LSNs before 'lsn' have been replayed, so we can now assume
+	 * that any IOs that were started before then have finished.
+	 */
+	while (lrq->tail != lrq->head &&
+		   lrq->queue[lrq->tail].lsn < lsn)
+	{
+		if (lrq->queue[lrq->tail].io)
+			lrq->inflight--;
+		else
+			lrq->completed--;
+		lrq->tail++;
+		if (lrq->tail == lrq->size)
+			lrq->tail = 0;
+	}
+	if (recovery_prefetch)
+		lrq_prefetch(lrq);
+}
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&SharedStats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&SharedStats->prefetch, 0);
+	pg_atomic_write_u64(&SharedStats->hit, 0);
+	pg_atomic_write_u64(&SharedStats->skip_init, 0);
+	pg_atomic_write_u64(&SharedStats->skip_new, 0);
+	pg_atomic_write_u64(&SharedStats->skip_fpw, 0);
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	SharedStats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+
+	if (!found)
+	{
+		pg_atomic_init_u32(&SharedStats->reset_request, 0);
+		SharedStats->reset_handled = 0;
+
+		pg_atomic_init_u64(&SharedStats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&SharedStats->prefetch, 0);
+		pg_atomic_init_u64(&SharedStats->hit, 0);
+		pg_atomic_init_u64(&SharedStats->skip_init, 0);
+		pg_atomic_init_u64(&SharedStats->skip_new, 0);
+		pg_atomic_init_u64(&SharedStats->skip_fpw, 0);
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&SharedStats->reset_request, 1);
+}
+
+/*
+ * Increment a counter in shared memory.  This is equivalent to (*counter)++
+ * on a plain uint64 without any memory barrier or locking, except on platforms
+ * where readers can't read uint64 without possibly observing a torn value.
+ */
+static inline void
+XLogPrefetchIncrement(pg_atomic_uint64 *counter)
+{
+	Assert(AmStartupProcess() || !IsUnderPostmaster);
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL records.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogReaderState *reader)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	prefetcher = palloc0(sizeof(XLogPrefetcher));
+
+	prefetcher->reader = reader;
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	SharedStats->wal_distance = 0;
+	SharedStats->block_distance = 0;
+	SharedStats->io_depth = 0;
+
+	/* First usage will cause streaming_read to be allocated. */
+	prefetcher->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	lrq_free(prefetcher->streaming_read);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Provide access to the reader.
+ */
+XLogReaderState *
+XLogPrefetcherReader(XLogPrefetcher *prefetcher)
+{
+	return prefetcher->reader;
+}
+
+static void
+XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher, XLogRecPtr lsn)
+{
+	uint32		io_depth;
+	uint32		completed;
+	uint32		reset_request;
+	int64		wal_distance;
+
+
+	/* How far ahead of replay are we now? */
+	if (prefetcher->record)
+		wal_distance = prefetcher->record->lsn - prefetcher->reader->record->lsn;
+	else
+		wal_distance = 0;
+
+	/* How many IOs are currently in flight and completed? */
+	io_depth = lrq_inflight(prefetcher->streaming_read);
+	completed = lrq_completed(prefetcher->streaming_read);
+
+	/* Update the instantaneous stats visible in pg_stat_prefetch_recovery. */
+	SharedStats->io_depth = io_depth;
+	SharedStats->block_distance = io_depth + completed;
+	SharedStats->wal_distance = wal_distance;
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&SharedStats->reset_request);
+	if (reset_request != SharedStats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		SharedStats->reset_handled = reset_request;
+	}
+
+	prefetcher->next_stats_shm_lsn = lsn + XLOGPREFETCHER_STATS_SHM_DISTANCE;
+}
+
+/*
+ * A callback that reads ahead in the WAL and tries to initiate one IO.
+ */
+static LsnReadQueueNextStatus
+XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
+{
+	XLogPrefetcher *prefetcher = (XLogPrefetcher *) pgsr_private;
+	XLogReaderState *reader = prefetcher->reader;
+	XLogRecPtr	replaying_lsn = reader->ReadRecPtr;
+
+	/*
+	 * We keep track of the record and block we're up to between calls with
+	 * prefetcher->record and prefetcher->next_block_id.
+	 */
+	for (;;)
+	{
+		DecodedXLogRecord *record;
+
+		/* Try to read a new future record, if we don't already have one. */
+		if (prefetcher->record == NULL)
+		{
+			bool		nonblocking;
+
+			/*
+			 * If there are already records or an error queued up that could
+			 * be replayed, we don't want to block here.  Otherwise, it's OK
+			 * to block waiting for more data: presumably the caller has
+			 * nothing else to do.
+			 */
+			nonblocking = XLogReaderHasQueuedRecordOrError(reader);
+
+			record = XLogReadAhead(prefetcher->reader, nonblocking);
+			if (record == NULL)
+			{
+				/*
+				 * We can't read any more, due to an error or lack of data in
+				 * nonblocking mode.
+				 */
+				return LRQ_NEXT_AGAIN;
+			}
+
+			/*
+			 * If prefetching is disabled, we don't need to analyze the record
+			 * or issue any prefetches.  We just need to cause one record to
+			 * be decoded.
+			 */
+			if (!recovery_prefetch)
+			{
+				*lsn = InvalidXLogRecPtr;
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* We have a new record to process. */
+			prefetcher->record = record;
+			prefetcher->next_block_id = 0;
+		}
+		else
+		{
+			/* Continue to process from last call, or last loop. */
+			record = prefetcher->record;
+		}
+
+		/*
+		 * Check for operations that change the identity of buffer tags. These
+		 * must be treated as barriers that prevent prefetching for certain
+		 * ranges of buffer tags, so that we can't be confused by OID
+		 * wraparound (and later we might pin buffers).
+		 *
+		 * XXX Perhaps this information could be derived automatically if we
+		 * had some standardized header flags and fields for these operations,
+		 * instead of special logic.
+		 *
+		 * XXX Are there other operations that need this treatment?
+		 */
+		if (replaying_lsn < record->lsn)
+		{
+			uint8		rmid = record->header.xl_rmid;
+			uint8		record_type = record->header.xl_info & ~XLR_INFO_MASK;
+
+			if (rmid == RM_DBASE_ID)
+			{
+				if (record_type == XLOG_DBASE_CREATE)
+				{
+					xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *)
+					record->main_data;
+					RelFileNode rnode = {InvalidOid, xlrec->db_id, InvalidOid};
+
+					/*
+					 * Don't try to prefetch anything in this database until
+					 * it has been created, or we might confuse blocks on OID
+					 * wraparound.  (We could use XLOG_DBASE_DROP instead, but
+					 * there shouldn't be any reference to blocks in a
+					 * database between DROP and CREATE for the same OID, and
+					 * doing it on CREATE avoids the more expensive
+					 * ENOENT-handling if we didn't treat CREATE as a
+					 * barrier).
+					 */
+					XLogPrefetcherAddFilter(prefetcher, rnode, 0, record->lsn);
+				}
+			}
+			else if (rmid == RM_SMGR_ID)
+			{
+				if (record_type == XLOG_SMGR_CREATE)
+				{
+					xl_smgr_create *xlrec = (xl_smgr_create *)
+					record->main_data;
+
+					/*
+					 * Don't prefetch anything for this whole relation until
+					 * it has been created, or we might confuse blocks on OID
+					 * wraparound.
+					 */
+					XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+											record->lsn);
+				}
+				else if (record_type == XLOG_SMGR_TRUNCATE)
+				{
+					xl_smgr_truncate *xlrec = (xl_smgr_truncate *)
+					record->main_data;
+
+					/*
+					 * Don't prefetch anything in the truncated range until
+					 * the truncation has been performed.
+					 */
+					XLogPrefetcherAddFilter(prefetcher, xlrec->rnode,
+											xlrec->blkno,
+											record->lsn);
+				}
+			}
+		}
+
+		/* Scan the block references, starting where we left off last time. */
+		while (prefetcher->next_block_id <= record->max_block_id)
+		{
+			int			block_id = prefetcher->next_block_id++;
+			DecodedBkpBlock *block = &record->blocks[block_id];
+			SMgrRelation reln;
+			PrefetchBufferResult result;
+
+			if (!block->in_use)
+				continue;
+
+			Assert(!BufferIsValid(block->prefetch_buffer));
+
+			/*
+			 * Record the LSN of this record.  When it's replayed,
+			 * LsnReadQueue will consider any IOs submitted for earlier LSNs
+			 * to be finished.
+			 */
+			*lsn = record->lsn;
+
+			/* We don't try to prefetch anything but the main fork for now. */
+			if (block->forknum != MAIN_FORKNUM)
+			{
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/*
+			 * If there is a full page image attached, we won't be reading the
+			 * page, so don't bother trying to prefetch.
+			 */
+			if (block->has_image)
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_fpw);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* There is no point in reading a page that will be zeroed. */
+			if (block->flags & BKPBLOCK_WILL_INIT)
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_init);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* Should we skip prefetching this block due to a filter? */
+			if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/*
+			 * We could try to have a fast path for repeated references to the
+			 * same relation (with some scheme to handle invalidations
+			 * safely), but for now we'll call smgropen() every time.
+			 */
+			reln = smgropen(block->rnode, InvalidBackendId);
+
+			/*
+			 * If the block is past the end of the relation, filter out
+			 * further accesses until this record is replayed.
+			 */
+			if (block->blkno >= smgrnblocks(reln, block->forknum))
+			{
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+										record->lsn);
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* Try to initiate prefetching. */
+			result = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+			if (BufferIsValid(result.recent_buffer))
+			{
+				/* Cache hit, nothing to do. */
+				XLogPrefetchIncrement(&SharedStats->hit);
+				block->prefetch_buffer = result.recent_buffer;
+				return LRQ_NEXT_NO_IO;
+			}
+			else if (result.initiated_io)
+			{
+				/* Cache miss, I/O (presumably) started. */
+				XLogPrefetchIncrement(&SharedStats->prefetch);
+				block->prefetch_buffer = InvalidBuffer;
+				return LRQ_NEXT_IO;
+			}
+			else
+			{
+				/*
+				 * Neither cached nor initiated.  The underlying segment file
+				 * doesn't exist. (ENOENT)
+				 *
+				 * It might be missing because it was unlinked and we then
+				 * crashed, and now we're replaying WAL.  Recovery will
+				 * correct this problem or complain if something is wrong.
+				 */
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+										record->lsn);
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+		}
+
+		/*
+		 * Several callsites need to be able to read exactly one record
+		 * without any internal readahead.  Examples: xlog.c reading
+		 * checkpoint records with emode set to PANIC, which might otherwise
+		 * cause XLogPageRead() to panic on some future page, and xlog.c
+		 * determining where to start writing WAL next, which depends on the
+		 * contents of the reader's internal buffer after reading one record.
+		 * Therefore, don't even think about prefetching until the first
+		 * record after XLogPrefetcherBeginRead() has been consumed.
+		 */
+#if 1
+		if (prefetcher->reader->decode_queue_tail &&
+			prefetcher->reader->decode_queue_tail->lsn == prefetcher->begin_ptr)
+			return LRQ_NEXT_AGAIN;
+#endif
+
+		/* Advance to the next record. */
+		prefetcher->record = NULL;
+	}
+	pg_unreachable();
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&SharedStats->reset_request) != SharedStats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&SharedStats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_init));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_new));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_fpw));
+	values[6] = Int32GetDatum(SharedStats->wal_distance);
+	values[7] = Int32GetDatum(SharedStats->block_distance);
+	values[8] = Int32GetDatum(SharedStats->io_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the lower of the block numbers
+		 * there because we don't want to have to track individual blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+		filter->filter_from_block = Min(filter->filter_from_block, blockno);
+	}
+}
+
+/*
+ * Have we replayed any records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can stop filtering out accesses to a given
+ * relfilenode.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of
+	 * the time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter;
+
+		/* See if the block range is filtered. */
+		filter = hash_search(prefetcher->filter_table, &rnode, HASH_FIND, NULL);
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+
+		/* See if the whole database is filtered. */
+		rnode.relNode = InvalidOid;
+		filter = hash_search(prefetcher->filter_table, &rnode, HASH_FIND, NULL);
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * A wrapper for XLogBeginRead() that also resets the prefetcher.
+ */
+void
+XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
+						XLogRecPtr recPtr)
+{
+	/* This will forget about any in-flight IO. */
+	prefetcher->reconfigure_count--;
+
+	/* Book-keeping to avoid readahead on first read. */
+	prefetcher->begin_ptr = recPtr;
+
+	/* This will forget about any queued up records in the decoder. */
+	XLogBeginRead(prefetcher->reader, recPtr);
+}
+
+/*
+ * A wrapper for XLogReadRecord() that provides the same interface, but also
+ * tries to initiate IO ahead of time unless asked not to.
+ */
+XLogRecord *
+XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
+						 char **errmsg)
+{
+	DecodedXLogRecord *record;
+
+	/*
+	 * See if it's time to reset the prefetching machinery, because a relevant
+	 * GUC was changed.
+	 */
+	if (unlikely(XLogPrefetchReconfigureCount != prefetcher->reconfigure_count))
+	{
+		if (prefetcher->streaming_read)
+			lrq_free(prefetcher->streaming_read);
+
+		/*
+		 * Arbitrarily look up to 4 times further ahead than the number of IOs
+		 * we're allowed to run concurrently.
+		 */
+		prefetcher->streaming_read =
+			lrq_alloc(recovery_prefetch ? maintenance_io_concurrency * 4 : 1,
+					  recovery_prefetch ? maintenance_io_concurrency : 1,
+					  (uintptr_t) prefetcher,
+					  XLogPrefetcherNextBlock);
+
+		prefetcher->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	/*
+	 * Release last returned record, if there is one.  We need to do this so
+	 * that we can check for empty decode queue accurately.
+	 */
+	XLogReleasePreviousRecord(prefetcher->reader);
+
+	/* If there's nothing queued yet, then start prefetching. */
+	if (!XLogReaderHasQueuedRecordOrError(prefetcher->reader))
+		lrq_prefetch(prefetcher->streaming_read);
+
+	/* Read the next record. */
+	record = XLogNextRecord(prefetcher->reader, errmsg);
+	if (!record)
+		return NULL;
+
+	/*
+	 * The record we just got is the "current" one, for the benefit of the
+	 * XLogRecXXX() macros.
+	 */
+	Assert(record == prefetcher->reader->record);
+
+	/*
+	 * Can we drop any prefetch filters yet, given the record we're about to
+	 * return?  This assumes that any records with earlier LSNs have been
+	 * replayed, so if we were waiting for a relation to be created or
+	 * extended, it is now OK to access blocks in the covered range.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, record->lsn);
+
+	/*
+	 * See if it's time to compute some statistics, because enough WAL has
+	 * been processed.
+	 */
+	if (unlikely(record->lsn >= prefetcher->next_stats_shm_lsn))
+		XLogPrefetcherComputeStats(prefetcher, record->lsn);
+
+	/*
+	 * The caller is about to replay this record, so we can now report that
+	 * all IO initiated for earlier parts of the WAL must be finished.  This
+	 * may trigger more readahead.
+	 */
+	lrq_complete_lsn(prefetcher->streaming_read, record->lsn);
+
+	Assert(record == prefetcher->reader->record);
+
+	return &record->header;
+}
+
+bool
+check_recovery_prefetch(bool *new_value, void **extra, GucSource source)
+{
+#ifndef USE_PREFETCH
+	if (*new_value)
+	{
+		GUC_check_errdetail("recovery_prefetch must be set to off on platforms that lack posix_fadvise().");
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_recovery_prefetch(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
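
For illustration only, not part of the patch: the LsnReadQueue contract can
be summarised with a small sketch.  The demo_* names and constants are
invented; the lrq_* helpers and LRQ_NEXT_* values are the ones defined in the
file above, and the callback's job is to report one block reference per call.

    /* Invented callback: pretend records are 8kB apart and never need I/O. */
    static LsnReadQueueNextStatus
    demo_next(uintptr_t lrq_private, XLogRecPtr *lsn)
    {
        XLogRecPtr *cursor = (XLogRecPtr *) lrq_private;

        *lsn = *cursor;            /* LSN of the referencing record */
        *cursor += 8192;
        return LRQ_NEXT_NO_IO;     /* or LRQ_NEXT_IO / LRQ_NEXT_AGAIN */
    }

    static void
    demo_drive(XLogRecPtr start)
    {
        XLogRecPtr   cursor = start;
        LsnReadQueue *lrq;

        /* Look up to 40 references ahead, with at most 10 I/Os in flight. */
        lrq = lrq_alloc(40, 10, (uintptr_t) &cursor, demo_next);

        /* Start as much readahead as the limits allow. */
        lrq_prefetch(lrq);

        /* As replay progresses, retire old entries and read further ahead. */
        lrq_complete_lsn(lrq, start + 4 * 8192);

        lrq_free(lrq);
    }
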
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e89fe7d928..2d2ca7642b 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1710,6 +1710,8 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
 
+			blk->prefetch_buffer = InvalidBuffer;
+
 			COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 			/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
 			if (blk->has_data && blk->data_len == 0)
@@ -1916,6 +1918,15 @@ err:
 bool
 XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 				   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+	return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL);
+}
+
+bool
+XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+					RelFileNode *rnode, ForkNumber *forknum,
+					BlockNumber *blknum,
+					Buffer *prefetch_buffer)
 {
 	DecodedBkpBlock *bkpb;
 
@@ -1930,6 +1941,8 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 		*forknum = bkpb->forknum;
 	if (blknum)
 		*blknum = bkpb->blkno;
+	if (prefetch_buffer)
+		*prefetch_buffer = bkpb->prefetch_buffer;
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 84109f1e48..156957db88 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -22,6 +22,7 @@
 #include "access/timeline.h"
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -355,11 +356,13 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber blkno;
+	Buffer		prefetch_buffer;
 	Page		page;
 	bool		zeromode;
 	bool		willinit;
 
-	if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+	if (!XLogRecGetBlockInfo(record, block_id, &rnode, &forknum, &blkno,
+							 &prefetch_buffer))
 	{
 		/* Caller specified a bogus block_id */
 		elog(PANIC, "failed to locate backup block with ID %d", block_id);
@@ -381,7 +384,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	{
 		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
+									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
+									  prefetch_buffer);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			elog(ERROR, "failed to restore block image");
@@ -410,7 +414,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	}
 	else
 	{
-		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode);
+		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode, prefetch_buffer);
 		if (BufferIsValid(*buf))
 		{
 			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
@@ -450,6 +454,10 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  * exist, and we don't check for all-zeroes.  Thus, no log entry is made
  * to imply that the page should be dropped or truncated later.
  *
+ * Optionally, recent_buffer can be used to provide a hint about the location
+ * of the page in the buffer pool; it does not have to be correct, but avoids
+ * a buffer mapping table probe if it is.
+ *
  * NB: A redo function should normally not call this directly. To get a page
  * to modify, use XLogReadBufferForRedoExtended instead. It is important that
  * all pages modified by a WAL record are registered in the WAL records, or
@@ -457,7 +465,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  */
 Buffer
 XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-					   BlockNumber blkno, ReadBufferMode mode)
+					   BlockNumber blkno, ReadBufferMode mode,
+					   Buffer recent_buffer)
 {
 	BlockNumber lastblock;
 	Buffer		buffer;
@@ -465,6 +474,15 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 
 	Assert(blkno != P_NEW);
 
+	/* Do we have a clue where the buffer might be already? */
+	if (BufferIsValid(recent_buffer) &&
+		mode == RBM_NORMAL &&
+		ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+	{
+		buffer = recent_buffer;
+		goto recent_buffer_fast_path;
+	}
+
 	/* Open the relation at smgr level */
 	smgr = smgropen(rnode, InvalidBackendId);
 
@@ -523,6 +541,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 		}
 	}
 
+recent_buffer_fast_path:
 	if (mode == RBM_NORMAL)
 	{
 		/* check that page has been initialized */
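
For illustration only, not part of the patch: the prefetch_buffer hint flows
from the decoded block reference into XLogReadBufferExtended() roughly as in
the sketch below.  demo_read_block() is an invented name; the APIs are the
ones changed above.

    static Buffer
    demo_read_block(XLogReaderState *record, uint8 block_id)
    {
        RelFileNode rnode;
        ForkNumber  forknum;
        BlockNumber blkno;
        Buffer      prefetch_buffer;

        if (!XLogRecGetBlockInfo(record, block_id, &rnode, &forknum, &blkno,
                                 &prefetch_buffer))
            elog(PANIC, "failed to locate backup block with ID %d", block_id);

        /*
         * InvalidBuffer is always a safe hint: it just disables the
         * ReadRecentBuffer() fast path and falls back to a normal lookup.
         */
        return XLogReadBufferExtended(rnode, forknum, blkno, RBM_NORMAL,
                                      prefetch_buffer);
    }
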
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 61b515cdb8..dfbf37b07b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -903,6 +903,19 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.hit,
+            s.skip_init,
+            s.skip_new,
+            s.skip_fpw,
+            s.wal_distance,
+            s.block_distance,
+            s.io_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 839d81662f..26752e1551 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -210,7 +210,8 @@ XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 	blkno = fsm_logical_to_physical(addr);
 
 	/* If the page doesn't exist already, extend */
-	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR);
+	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR,
+								 InvalidBuffer);
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(buf);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 9fa3e0631e..2a6c07cea3 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/xlogprefetcher.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -118,6 +119,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, LockShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
+	size = add_size(size, XLogPrefetchShmemSize());
 	size = add_size(size, XLOGShmemSize());
 	size = add_size(size, CLOGShmemSize());
 	size = add_size(size, CommitTsShmemSize());
@@ -241,6 +243,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f9504d3aec..90c7c68867 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -41,6 +41,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetcher.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/storage.h"
@@ -211,6 +212,7 @@ static bool check_effective_io_concurrency(int *newval, void **extra, GucSource
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
 static bool check_client_connection_check_interval(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1313,6 +1315,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch referenced blocks during recovery."),
+			gettext_noop("Read ahead of the current replay position to find uncached blocks.")
+		},
+		&recovery_prefetch,
+		false,
+		check_recovery_prefetch, assign_recovery_prefetch, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2773,6 +2784,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
+			gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
+			GUC_UNIT_BYTE
+		},
+		&wal_decode_buffer_size,
+		512 * 1024, 64 * 1024, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the size of WAL files held for standby servers."),
@@ -3096,7 +3118,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -12140,6 +12163,20 @@ check_client_connection_check_interval(int *newval, void **extra, GucSource sour
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a1acd46b61..dfb4d08c27 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -241,6 +241,11 @@
 #max_wal_size = 1GB
 #min_wal_size = 80MB
 
+# - Prefetching during recovery -
+
+#wal_decode_buffer_size = 512kB		# lookahead window used for prefetching
+#recovery_prefetch = off		# prefetch pages referenced in the WAL?
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 34f6c89f06..f03d834297 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -88,6 +88,7 @@ extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
 extern bool track_wal_io_timing;
+extern int	wal_decode_buffer_size;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
new file mode 100644
index 0000000000..f5bdb920d5
--- /dev/null
+++ b/src/include/access/xlogprefetcher.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetcher.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCHER_H
+#define XLOGPREFETCHER_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern bool recovery_prefetch;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+
+extern void XLogPrefetchReconfigure(void);
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchRequestResetStats(void);
+
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogReaderState *reader);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+
+extern XLogReaderState *XLogPrefetcherReader(XLogPrefetcher *prefetcher);
+
+extern void XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
+									XLogRecPtr recPtr);
+
+extern XLogRecord *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
+											char **errmsg);
+
+#endif
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 372ba1cc45..1e31b9987d 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -39,6 +39,7 @@
 #endif
 
 #include "access/xlogrecord.h"
+#include "storage/buf.h"
 
 /* WALOpenSegment represents a WAL segment being read. */
 typedef struct WALOpenSegment
@@ -125,6 +126,9 @@ typedef struct
 	ForkNumber	forknum;
 	BlockNumber blkno;
 
+	/* Prefetching workspace. */
+	Buffer		prefetch_buffer;
+
 	/* copy of the fork_flags field from the XLogRecordBlockHeader */
 	uint8		flags;
 
@@ -427,5 +431,9 @@ extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *
 extern bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 							   RelFileNode *rnode, ForkNumber *forknum,
 							   BlockNumber *blknum);
+extern bool XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+								RelFileNode *rnode, ForkNumber *forknum,
+								BlockNumber *blknum,
+								Buffer *prefetch_buffer);
 
 #endif							/* XLOGREADER_H */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index eebc91f3a5..c0eafdc517 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -84,7 +84,8 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 													Buffer *buf);
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-									 BlockNumber blkno, ReadBufferMode mode);
+									 BlockNumber blkno, ReadBufferMode mode,
+									 Buffer recent_buffer);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4d992dc224..0d1a7b94f2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6355,6 +6355,14 @@
   prorettype => 'text', proargtypes => '',
   prosrc => 'pg_get_wal_replay_pause_state' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,int4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,hit,skip_init,skip_new,skip_fpw,wal_distance,block_distance,io_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index aa18d304ac..97af4dd97c 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -447,4 +447,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetcher.c */
+extern bool check_recovery_prefetch(bool *new_value, void **extra, GucSource source);
+extern void assign_recovery_prefetch(bool new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b58b062b10..efafec0b7d 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1879,6 +1879,16 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.hit,
+    s.skip_init,
+    s.skip_new,
+    s.skip_fpw,
+    s.wal_distance,
+    s.block_distance,
+    s.io_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, hit, skip_init, skip_new, skip_fpw, wal_distance, block_distance, io_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c7bec37355..d1f1cf6a95 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1407,6 +1407,9 @@ LogicalRepWorker
 LogicalRewriteMappingData
 LogicalTape
 LogicalTapeSet
+LsnReadQueue
+LsnReadQueueNextFun
+LsnReadQueueNextStatus
 LtreeGistOptions
 LtreeSignature
 MAGIC
@@ -2938,6 +2941,10 @@ XLogPageHeaderData
 XLogPageReadCB
 XLogPageReadPrivate
 XLogPageReadResult
+XLogPrefetcher
+XLogPrefetcherFilter
+XLogPrefetchState
+XLogPrefetchStats
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.30.2

#145Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#144)
Re: WIP: WAL prefetch (another approach)

Thomas Munro <thomas.munro@gmail.com> writes:

FWIW I don't think we include updates to typedefs.list in patches.

Seems pretty harmless? And useful to keep around in development
branches because I like to pgindent stuff...

As far as that goes, my habit is to pull down
https://buildfarm.postgresql.org/cgi-bin/typedefs.pl
on a regular basis and pgindent against that. There have been
some discussions about formalizing that process a bit more,
but we've not come to any conclusions.

regards, tom lane

#146Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#144)
Re: WIP: WAL prefetch (another approach)

Hi,

On 2021-12-29 17:29:52 +1300, Thomas Munro wrote:

FWIW I don't think we include updates to typedefs.list in patches.

Seems pretty harmless? And useful to keep around in development
branches because I like to pgindent stuff...

I think it's even helpful. As long as it's done with a bit of manual
oversight, I don't see a meaningful downside of doing so. One needs to be
careful not to remove platform-dependent typedefs, but that's it. And
especially for long-lived feature branches it's much less work to keep the
typedefs.list changes in the tree, rather than coming up with them locally
over and over / across multiple people working on a branch.

Greetings,

Andres Freund

#147Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#144)
2 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Wed, Dec 29, 2021 at 5:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:

https://github.com/macdice/postgres/tree/recovery-prefetch-ii

Here's a rebase. This mostly involved moving hunks over to the new
xlogrecovery.c file. One thing that seemed a little strange to me
with the new layout is that xlogreader is now a global variable. I
followed that pattern and made xlogprefetcher a global variable too,
for now.

There is one functional change: now I block readahead at records that
might change the timeline ID. This removes the need to think about
scenarios where "replay TLI" and "read TLI" might differ. I don't
know of a concrete problem in that area with the previous version, but
the recent introduction of the variable(s) "replayTLI" and associated
comments in master made me realise I hadn't analysed the hazards here
enough. Since timelines are tricky things and timeline changes are
extremely infrequent, it seemed better to simplify matters by putting
up a big road block there.
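
To illustrate the shape of that road block, the check could look
something like this (a sketch only, with an invented function name;
timeline changes are carried by end-of-recovery records and shutdown
checkpoints, but the test in the attached patch may differ):

#include "postgres.h"

#include "access/rmgr.h"
#include "access/xlogreader.h"
#include "catalog/pg_control.h"

/* Sketch: should the prefetcher stop looking ahead at this record? */
static bool
record_might_change_timeline(DecodedXLogRecord *record)
{
	uint8		info = record->header.xl_info & ~XLR_INFO_MASK;

	return record->header.xl_rmid == RM_XLOG_ID &&
		(info == XLOG_END_OF_RECOVERY ||
		 info == XLOG_CHECKPOINT_SHUTDOWN);
}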

I'm now starting to think about committing this soon.

Attachments:

v21-0001-Add-circular-WAL-decoding-buffer-take-II.patch (application/x-patch)
From 21be9c98ab8ae7a6e3dbc406825dba0d99fe31f9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 9 Nov 2021 16:33:10 +1300
Subject: [PATCH v21 1/2] Add circular WAL decoding buffer, take II.

Teach xlogreader.c to decode its output into a circular buffer, to
support upcoming optimizations based on looking ahead.

 * XLogReadRecord() works as before, consuming records one by one, and
   allowing them to be examined via the traditional XLogRecGetXXX()
   macros, and the traditional members like xlogreader->ReadRecPtr.

 * An alternative new interface XLogReadAhead()/XLogNextRecord() is
   added that returns pointers to DecodedXLogRecord
   objects so that it's possible to look ahead in the WAL stream.

 * In order to be able to use the new interface effectively, client
   code should provide a page_read() callback that responds to
   a new nonblocking mode by returning XLREAD_WOULDBLOCK to avoid
   waiting.  No such implementation is included in this commit,
   and other code that is unaware of the new mechanism doesn't need
   to change.

The buffer's size can be set by the client of xlogreader.c.  Large
records that don't fit in the circular buffer are called "oversized" and
allocated separately with palloc().

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
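
As a rough illustration (not code from this commit), a caller that
wants lookahead might drive the new interface like this; the reader's
page_read callback is assumed to honour the nonblocking flag by
returning XLREAD_WOULDBLOCK, and the function name is invented:

#include "postgres.h"

#include "access/xlogreader.h"

static void
consume_wal_with_lookahead(XLogReaderState *reader, XLogRecPtr start_lsn)
{
	/* Decode into a 512kB circular buffer, palloc'd on first use. */
	XLogReaderSetDecodeBuffer(reader, NULL, 512 * 1024);
	XLogBeginRead(reader, start_lsn);

	for (;;)
	{
		DecodedXLogRecord *record;
		char	   *errormsg;

		/* Queue up as many records as data and buffer space allow. */
		while (XLogReadAhead(reader, true /* nonblocking */ ) != NULL)
			;

		/* Make sure at least one record or error is queued, blocking if needed. */
		if (!XLogReaderHasQueuedRecordOrError(reader))
			XLogReadAhead(reader, false /* nonblocking */ );

		/* Consume the oldest queued record. */
		record = XLogNextRecord(reader, &errormsg);
		if (record == NULL)
			break;				/* end of WAL, or errormsg is set */

		/* ... examine record->blocks[] and record->main_data here ... */
	}
}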
---
 src/backend/access/transam/generic_xlog.c |   6 +-
 src/backend/access/transam/xlog.c         |   2 +-
 src/backend/access/transam/xlogreader.c   | 637 +++++++++++++++++-----
 src/backend/access/transam/xlogrecovery.c |   4 +-
 src/backend/access/transam/xlogutils.c    |   2 +-
 src/backend/replication/logical/decode.c  |   2 +-
 src/bin/pg_rewind/parsexlog.c             |   2 +-
 src/bin/pg_waldump/pg_waldump.c           |  22 +-
 src/include/access/xlogreader.h           | 153 ++++--
 src/tools/pgindent/typedefs.list          |   2 +
 10 files changed, 657 insertions(+), 175 deletions(-)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 4b0c63817f..bbb542b322 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -482,10 +482,10 @@ generic_redo(XLogReaderState *record)
 	uint8		block_id;
 
 	/* Protect limited size of buffers[] array */
-	Assert(record->max_block_id < MAX_GENERIC_XLOG_PAGES);
+	Assert(XLogRecMaxBlockId(record) < MAX_GENERIC_XLOG_PAGES);
 
 	/* Iterate over blocks */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		XLogRedoAction action;
 
@@ -525,7 +525,7 @@ generic_redo(XLogReaderState *record)
 	}
 
 	/* Changes are done: unlock and release all buffers */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (BufferIsValid(buffers[block_id]))
 			UnlockReleaseBuffer(buffers[block_id]);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a357..2b4e591736 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7736,7 +7736,7 @@ xlog_redo(XLogReaderState *record)
 		 * resource manager needs to generate conflicts, it has to define a
 		 * separate WAL record type and redo routine.
 		 */
-		for (uint8 block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (uint8 block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			Buffer		buffer;
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..cb491cb18d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -42,6 +42,7 @@ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
+static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool non_blocking);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -53,6 +54,12 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 /* size of the buffer allocated for error message. */
 #define MAX_ERRORMSG_LEN 1000
 
+/*
+ * Default size; large enough that typical users of XLogReader won't often need
+ * to use the 'oversized' memory allocation code path.
+ */
+#define DEFAULT_DECODE_BUFFER_SIZE (64 * 1024)
+
 /*
  * Construct a string in state->errormsg_buf explaining what's wrong with
  * the current record being read.
@@ -67,6 +74,24 @@ report_invalid_record(XLogReaderState *state, const char *fmt,...)
 	va_start(args, fmt);
 	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
 	va_end(args);
+
+	state->errormsg_deferred = true;
+}
+
+/*
+ * Set the size of the decoding buffer.  A pointer to a caller supplied memory
+ * region may also be passed in, in which case non-oversized records will be
+ * decoded there.
+ */
+void
+XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size)
+{
+	Assert(state->decode_buffer == NULL);
+
+	state->decode_buffer = buffer;
+	state->decode_buffer_size = size;
+	state->decode_buffer_head = buffer;
+	state->decode_buffer_tail = buffer;
 }
 
 /*
@@ -89,8 +114,6 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	/* initialize caller-provided support functions */
 	state->routine = *routine;
 
-	state->max_block_id = -1;
-
 	/*
 	 * Permanently allocate readBuf.  We do it this way, rather than just
 	 * making a static array, for two reasons: (1) no need to waste the
@@ -141,18 +164,11 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 void
 XLogReaderFree(XLogReaderState *state)
 {
-	int			block_id;
-
 	if (state->seg.ws_file != -1)
 		state->routine.segment_close(state);
 
-	for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
-	{
-		if (state->blocks[block_id].data)
-			pfree(state->blocks[block_id].data);
-	}
-	if (state->main_data)
-		pfree(state->main_data);
+	if (state->decode_buffer && state->free_decode_buffer)
+		pfree(state->decode_buffer);
 
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
@@ -248,7 +264,132 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
 
 	/* Begin at the passed-in record pointer. */
 	state->EndRecPtr = RecPtr;
+	state->NextRecPtr = RecPtr;
 	state->ReadRecPtr = InvalidXLogRecPtr;
+	state->DecodeRecPtr = InvalidXLogRecPtr;
+}
+
+/*
+ * See if we can release the last record that was returned by
+ * XLogNextRecord(), to free up space.
+ */
+void
+XLogReleasePreviousRecord(XLogReaderState *state)
+{
+	DecodedXLogRecord *record;
+
+	if (!state->record)
+		return;
+
+	/*
+	 * Remove it from the decoded record queue.  It must be the oldest item
+	 * decoded, decode_queue_tail.
+	 */
+	record = state->record;
+	Assert(record == state->decode_queue_tail);
+	state->record = NULL;
+	state->decode_queue_tail = record->next;
+
+	/* It might also be the newest item decoded, decode_queue_head. */
+	if (state->decode_queue_head == record)
+		state->decode_queue_head = NULL;
+
+	/* Release the space. */
+	if (unlikely(record->oversized))
+	{
+		/* It's not in the decode buffer, so free it to release space. */
+		pfree(record);
+	}
+	else
+	{
+		/* It must be the tail record in the decode buffer. */
+		Assert(state->decode_buffer_tail == (char *) record);
+
+		/*
+		 * We need to update tail to point to the next record that is in the
+		 * decode buffer, if any, being careful to skip oversized ones
+		 * (they're not in the decode buffer).
+		 */
+		record = record->next;
+		while (unlikely(record && record->oversized))
+			record = record->next;
+
+		if (record)
+		{
+			/* Adjust tail to release space up to the next record. */
+			state->decode_buffer_tail = (char *) record;
+		}
+		else
+		{
+			/*
+			 * Otherwise we might as well just reset head and tail to the
+			 * start of the buffer space, because we're empty.  This means
+			 * we'll keep overwriting the same piece of memory if we're not
+			 * doing any prefetching.
+			 */
+			state->decode_buffer_tail = state->decode_buffer;
+			state->decode_buffer_head = state->decode_buffer;
+		}
+	}
+}
+
+/*
+ * Attempt to read an XLOG record.
+ *
+ * XLogBeginRead() or XLogFindNextRecord() and then XLogReadAhead() must be
+ * called before the first call to XLogNextRecord().  This function returns
+ * records and errors that were put into an internal queue by XLogReadAhead().
+ *
+ * On success, a record is returned.
+ *
+ * The returned record (or *errormsg) points to an internal buffer that's
+ * valid until the next call to XLogNextRecord.
+ */
+DecodedXLogRecord *
+XLogNextRecord(XLogReaderState *state, char **errormsg)
+{
+	/* Release the last record returned by XLogNextRecord(). */
+	XLogReleasePreviousRecord(state);
+
+	if (state->decode_queue_tail == NULL)
+	{
+		*errormsg = NULL;
+		if (state->errormsg_deferred)
+		{
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			state->errormsg_deferred = false;
+		}
+
+		/*
+		 * state->EndRecPtr is expected to have been set by the last call to
+		 * XLogBeginRead() or XLogNextRecord(), and is the location of the
+		 * error.
+		 */
+
+		return NULL;
+	}
+
+	/*
+	 * Record this as the most recent record returned, so that we'll release
+	 * it next time.  This also exposes it to the traditional
+	 * XLogRecXXX(xlogreader) macros, which work with the decoder rather than
+	 * the record for historical reasons.
+	 */
+	state->record = state->decode_queue_tail;
+
+	/*
+	 * Update the pointers to the beginning and one-past-the-end of this
+	 * record, again for the benefit of historical code that expected the
+	 * decoder to track this rather than accessing these fields of the record
+	 * itself.
+	 */
+	state->ReadRecPtr = state->record->lsn;
+	state->EndRecPtr = state->record->next_lsn;
+
+	*errormsg = NULL;
+
+	return state->record;
 }
 
 /*
@@ -258,17 +399,125 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  * to XLogReadRecord().
  *
  * If the page_read callback fails to read the requested data, NULL is
- * returned.  The callback is expected to have reported the error; errormsg
- * is set to NULL.
+ * returned.  The callback is expected to have reported the error; errormsg is
+ * set to NULL.
  *
  * If the reading fails for some other reason, NULL is also returned, and
  * *errormsg is set to a string with details of the failure.
  *
- * The returned pointer (or *errormsg) points to an internal buffer that's
- * valid until the next call to XLogReadRecord.
+ * On success, a record is returned.
+ *
+ * The returned record (or *errormsg) points to an internal buffer that's
+ * valid until the next call to XLogReadRecord.
  */
 XLogRecord *
 XLogReadRecord(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *decoded;
+
+	/*
+	 * Release last returned record, if there is one.  We need to do this so
+	 * that we can check for empty decode queue accurately.
+	 */
+	XLogReleasePreviousRecord(state);
+
+	/*
+	 * Call XLogReadAhead() in blocking mode to make sure there is something
+	 * in the queue, though we don't use the result.
+	 */
+	if (!XLogReaderHasQueuedRecordOrError(state))
+		XLogReadAhead(state, false /* nonblocking */ );
+
+	/* Consume the tail record or error. */
+	decoded = XLogNextRecord(state, errormsg);
+	if (decoded)
+	{
+		/*
+		 * XLogReadRecord() returns a pointer to the record's header, not the
+		 * actual decoded record.  The caller will access the decoded record
+		 * through the XLogRecGetXXX() macros, which reach the decoded
+		 * record via xlogreader->record.
+		 */
+		Assert(state->record == decoded);
+		return &decoded->header;
+	}
+
+	return NULL;
+}
+
+/*
+ * Allocate space for a decoded record.  The only member of the returned
+ * object that is initialized is the 'oversized' flag, indicating that the
+ * decoded record wouldn't fit in the decode buffer and must eventually be
+ * freed explicitly.
+ *
+ * Return NULL if there is no space in the decode buffer and allow_oversized
+ * is false, or if memory allocation fails for an oversized buffer.
+ */
+static DecodedXLogRecord *
+XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)
+{
+	size_t		required_space = DecodeXLogRecordRequiredSpace(xl_tot_len);
+	DecodedXLogRecord *decoded = NULL;
+
+	/* Allocate a circular decode buffer if we don't have one already. */
+	if (unlikely(state->decode_buffer == NULL))
+	{
+		if (state->decode_buffer_size == 0)
+			state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+		state->decode_buffer = palloc(state->decode_buffer_size);
+		state->decode_buffer_head = state->decode_buffer;
+		state->decode_buffer_tail = state->decode_buffer;
+		state->free_decode_buffer = true;
+	}
+	if (state->decode_buffer_head >= state->decode_buffer_tail)
+	{
+		/* Empty, or head is to the right of tail. */
+		if (state->decode_buffer_head + required_space <=
+			state->decode_buffer + state->decode_buffer_size)
+		{
+			/* There is space between head and end. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+		else if (state->decode_buffer + required_space <
+				 state->decode_buffer_tail)
+		{
+			/* There is space between start and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+	else
+	{
+		/* Head is to the left of tail. */
+		if (state->decode_buffer_head + required_space <
+			state->decode_buffer_tail)
+		{
+			/* There is space between head and tail. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+
+	/* Not enough space in the decode buffer.  Are we allowed to allocate? */
+	if (allow_oversized)
+	{
+		decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
+		if (decoded == NULL)
+			return NULL;
+		decoded->oversized = true;
+		return decoded;
+	}
+
+	return decoded;
+}
+
+static XLogPageReadResult
+XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 {
 	XLogRecPtr	RecPtr;
 	XLogRecord *record;
@@ -281,6 +530,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	bool		assembled;
 	bool		gotheader;
 	int			readOff;
+	DecodedXLogRecord *decoded;
+	char	   *errormsg;		/* not used */
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -290,21 +541,20 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	randAccess = false;
 
 	/* reset error state */
-	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	decoded = NULL;
 
-	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
 
-	RecPtr = state->EndRecPtr;
+	RecPtr = state->NextRecPtr;
 
-	if (state->ReadRecPtr != InvalidXLogRecPtr)
+	if (state->DecodeRecPtr != InvalidXLogRecPtr)
 	{
 		/* read the record after the one we just read */
 
 		/*
-		 * EndRecPtr is pointing to end+1 of the previous WAL record.  If
+		 * NextRecPtr is pointing to end+1 of the previous WAL record.  If
 		 * we're at a page boundary, no more records can fit on the current
 		 * page. We must skip over the page header, but we can't do that until
 		 * we've read in the page, since the header size is variable.
@@ -315,7 +565,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		/*
 		 * Caller supplied a position to start at.
 		 *
-		 * In this case, EndRecPtr should already be pointing to a valid
+		 * In this case, NextRecPtr should already be pointing to a valid
 		 * record starting position.
 		 */
 		Assert(XRecOffIsValid(RecPtr));
@@ -323,6 +573,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	}
 
 restart:
+	state->nonblocking = nonblocking;
 	state->currRecPtr = RecPtr;
 	assembled = false;
 
@@ -336,7 +587,9 @@ restart:
 	 */
 	readOff = ReadPageInternal(state, targetPagePtr,
 							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
-	if (readOff < 0)
+	if (readOff == XLREAD_WOULDBLOCK)
+		return XLREAD_WOULDBLOCK;
+	else if (readOff < 0)
 		goto err;
 
 	/*
@@ -392,7 +645,7 @@ restart:
 	 */
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
-		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
+		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
 								   randAccess))
 			goto err;
 		gotheader = true;
@@ -411,6 +664,31 @@ restart:
 		gotheader = false;
 	}
 
+	/*
+	 * Find space to decode this record.  Don't allow oversized allocation if
+	 * the caller requested nonblocking.  Otherwise, we *have* to try to
+	 * decode the record now because the caller has nothing else to do, so
+	 * allow an oversized record to be palloc'd if that turns out to be
+	 * necessary.
+	 */
+	decoded = XLogReadRecordAlloc(state,
+								  total_len,
+								  !nonblocking /* allow_oversized */ );
+	if (decoded == NULL)
+	{
+		/*
+		 * There is no space in the decode buffer.  The caller should help
+		 * with that problem by consuming some records.
+		 */
+		if (nonblocking)
+			return XLREAD_WOULDBLOCK;
+
+		/* We failed to allocate memory for an oversized record. */
+		report_invalid_record(state,
+							  "out of memory while trying to decode a record of length %u", total_len);
+		goto err;
+	}
+
 	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
 	if (total_len > len)
 	{
@@ -450,7 +728,9 @@ restart:
 									   Min(total_len - gotlen + SizeOfXLogShortPHD,
 										   XLOG_BLCKSZ));
 
-			if (readOff < 0)
+			if (readOff == XLREAD_WOULDBLOCK)
+				return XLREAD_WOULDBLOCK;
+			else if (readOff < 0)
 				goto err;
 
 			Assert(SizeOfXLogShortPHD <= readOff);
@@ -468,7 +748,7 @@ restart:
 			if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD)
 			{
 				state->overwrittenRecPtr = RecPtr;
-				ResetDecoder(state);
+				//ResetDecoder(state);
 				RecPtr = targetPagePtr;
 				goto restart;
 			}
@@ -523,7 +803,7 @@ restart:
 			if (!gotheader)
 			{
 				record = (XLogRecord *) state->readRecordBuf;
-				if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
+				if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr,
 										   record, randAccess))
 					goto err;
 				gotheader = true;
@@ -537,8 +817,8 @@ restart:
 			goto err;
 
 		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
-		state->ReadRecPtr = RecPtr;
-		state->EndRecPtr = targetPagePtr + pageHeaderSize
+		state->DecodeRecPtr = RecPtr;
+		state->NextRecPtr = targetPagePtr + pageHeaderSize
 			+ MAXALIGN(pageHeader->xlp_rem_len);
 	}
 	else
@@ -546,16 +826,18 @@ restart:
 		/* Wait for the record data to become available */
 		readOff = ReadPageInternal(state, targetPagePtr,
 								   Min(targetRecOff + total_len, XLOG_BLCKSZ));
-		if (readOff < 0)
+		if (readOff == XLREAD_WOULDBLOCK)
+			return XLREAD_WOULDBLOCK;
+		else if (readOff < 0)
 			goto err;
 
 		/* Record does not cross a page boundary */
 		if (!ValidXLogRecord(state, record, RecPtr))
 			goto err;
 
-		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+		state->NextRecPtr = RecPtr + MAXALIGN(total_len);
 
-		state->ReadRecPtr = RecPtr;
+		state->DecodeRecPtr = RecPtr;
 	}
 
 	/*
@@ -565,14 +847,40 @@ restart:
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->segcxt.ws_segsize - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->segcxt.ws_segsize);
+		state->NextRecPtr += state->segcxt.ws_segsize - 1;
+		state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
 	}
 
-	if (DecodeXLogRecord(state, record, errormsg))
-		return record;
+	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
+	{
+		/* Record the location of the next record. */
+		decoded->next_lsn = state->NextRecPtr;
+
+		/*
+		 * If it's in the decode buffer, mark the decode buffer space as
+		 * occupied.
+		 */
+		if (!decoded->oversized)
+		{
+			/* The new decode buffer head must be MAXALIGNed. */
+			Assert(decoded->size == MAXALIGN(decoded->size));
+			if ((char *) decoded == state->decode_buffer)
+				state->decode_buffer_head = state->decode_buffer + decoded->size;
+			else
+				state->decode_buffer_head += decoded->size;
+		}
+
+		/* Insert it into the queue of decoded records. */
+		Assert(state->decode_queue_head != decoded);
+		if (state->decode_queue_head)
+			state->decode_queue_head->next = decoded;
+		state->decode_queue_head = decoded;
+		if (!state->decode_queue_tail)
+			state->decode_queue_tail = decoded;
+		return XLREAD_SUCCESS;
+	}
 	else
-		return NULL;
+		return XLREAD_FAIL;
 
 err:
 	if (assembled)
@@ -590,14 +898,46 @@ err:
 		state->missingContrecPtr = targetPagePtr;
 	}
 
+	if (decoded && decoded->oversized)
+		pfree(decoded);
+
 	/*
 	 * Invalidate the read state. We might read from a different source after
 	 * failure.
 	 */
 	XLogReaderInvalReadState(state);
 
-	if (state->errormsg_buf[0] != '\0')
-		*errormsg = state->errormsg_buf;
+	/*
+	 * If an error was written to errormsg_buf, it'll be returned to the caller
+	 * of XLogReadRecord() after all successfully decoded records from the
+	 * read queue.
+	 */
+
+	return XLREAD_FAIL;
+}
+
+/*
+ * Try to decode the next available record, and return it.  The record will
+ * also be returned by XLogNextRecord(), which must be called to 'consume'
+ * each record.
+ *
+ * If nonblocking is true, may return NULL due to lack of data or WAL decoding
+ * space.
+ */
+DecodedXLogRecord *
+XLogReadAhead(XLogReaderState *state, bool nonblocking)
+{
+	XLogPageReadResult result;
+
+	if (state->errormsg_deferred)
+		return NULL;
+
+	result = XLogDecodeNextRecord(state, nonblocking);
+	if (result == XLREAD_SUCCESS)
+	{
+		Assert(state->decode_queue_head != NULL);
+		return state->decode_queue_head;
+	}
 
 	return NULL;
 }
@@ -649,7 +989,9 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
 										   state->currRecPtr,
 										   state->readBuf);
-		if (readLen < 0)
+		if (readLen == XLREAD_WOULDBLOCK)
+			return XLREAD_WOULDBLOCK;
+		else if (readLen < 0)
 			goto err;
 
 		/* we can be sure to have enough WAL available, we scrolled back */
@@ -667,7 +1009,9 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 									   state->currRecPtr,
 									   state->readBuf);
-	if (readLen < 0)
+	if (readLen == XLREAD_WOULDBLOCK)
+		return XLREAD_WOULDBLOCK;
+	else if (readLen < 0)
 		goto err;
 
 	Assert(readLen <= XLOG_BLCKSZ);
@@ -686,7 +1030,9 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
 										   state->currRecPtr,
 										   state->readBuf);
-		if (readLen < 0)
+		if (readLen == XLREAD_WOULDBLOCK)
+			return XLREAD_WOULDBLOCK;
+		else if (readLen < 0)
 			goto err;
 	}
 
@@ -704,8 +1050,12 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
-	XLogReaderInvalReadState(state);
-	return -1;
+	if (state->errormsg_buf[0] != '\0')
+	{
+		state->errormsg_deferred = true;
+		XLogReaderInvalReadState(state);
+	}
+	return XLREAD_FAIL;
 }
 
 /*
@@ -1062,7 +1412,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	 * or we just jumped over the remaining data of a continuation.
 	 */
 	XLogBeginRead(state, tmpRecPtr);
-	while (XLogReadRecord(state, &errormsg) != NULL)
+	while (XLogReadRecord(state, &errormsg))
 	{
 		/* past the record we've found, break out */
 		if (RecPtr <= state->ReadRecPtr)
@@ -1184,34 +1534,83 @@ WALRead(XLogReaderState *state,
  * ----------------------------------------
  */
 
-/* private function to reset the state between records */
+/*
+ * Private function to reset the state, forgetting all decoded records, if we
+ * are asked to move to a new read position.
+ */
 static void
 ResetDecoder(XLogReaderState *state)
 {
-	int			block_id;
-
-	state->decoded_record = NULL;
+	DecodedXLogRecord *r;
 
-	state->main_data_len = 0;
-
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	/* Reset the decoded record queue, freeing any oversized records. */
+	while ((r = state->decode_queue_tail))
 	{
-		state->blocks[block_id].in_use = false;
-		state->blocks[block_id].has_image = false;
-		state->blocks[block_id].has_data = false;
-		state->blocks[block_id].apply_image = false;
+		state->decode_queue_tail = r->next;
+		if (r->oversized)
+			pfree(r);
 	}
-	state->max_block_id = -1;
+	state->decode_queue_head = NULL;
+	state->decode_queue_tail = NULL;
+	state->record = NULL;
+
+	/* Reset the decode buffer to empty. */
+	state->decode_buffer_head = state->decode_buffer;
+	state->decode_buffer_tail = state->decode_buffer;
+
+	/* Clear error state. */
+	state->errormsg_buf[0] = '\0';
+	state->errormsg_deferred = false;
+}
+
+/*
+ * Compute the maximum possible amount of padding that could be required to
+ * decode a record, given xl_tot_len from the record's header.  This is the
+ * amount of output buffer space that we need to decode a record, though we
+ * might not finish up using it all.
+ *
+ * This computation is pessimistic and assumes the maximum possible number of
+ * blocks, due to lack of better information.
+ */
+size_t
+DecodeXLogRecordRequiredSpace(size_t xl_tot_len)
+{
+	size_t		size = 0;
+
+	/* Account for the fixed size part of the decoded record struct. */
+	size += offsetof(DecodedXLogRecord, blocks[0]);
+	/* Account for the flexible blocks array of maximum possible size. */
+	size += sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1);
+	/* Account for all the raw main and block data. */
+	size += xl_tot_len;
+	/* We might insert padding before main_data. */
+	size += (MAXIMUM_ALIGNOF - 1);
+	/* We might insert padding before each block's data. */
+	size += (MAXIMUM_ALIGNOF - 1) * (XLR_MAX_BLOCK_ID + 1);
+	/* We might insert padding at the end. */
+	size += (MAXIMUM_ALIGNOF - 1);
+
+	return size;
 }
 
 /*
- * Decode the previously read record.
+ * Decode a record.  "decoded" must point to a MAXALIGNed memory area that has
+ * space for at least DecodeXLogRecordRequiredSpace(record) bytes.  On
+ * success, decoded->size contains the actual space occupied by the decoded
+ * record, which may turn out to be less.
+ *
+ * Only the decoded->oversized member must be initialized already; it will
+ * not be modified.  Other members will be initialized as required.
  *
  * On error, a human-readable error message is returned in *errormsg, and
  * the return value is false.
  */
 bool
-DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
+DecodeXLogRecord(XLogReaderState *state,
+				 DecodedXLogRecord *decoded,
+				 XLogRecord *record,
+				 XLogRecPtr lsn,
+				 char **errormsg)
 {
 	/*
 	 * read next _size bytes from record buffer, but check for overrun first.
@@ -1226,17 +1625,20 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	} while(0)
 
 	char	   *ptr;
+	char	   *out;
 	uint32		remaining;
 	uint32		datatotal;
 	RelFileNode *rnode = NULL;
 	uint8		block_id;
 
-	ResetDecoder(state);
-
-	state->decoded_record = record;
-	state->record_origin = InvalidRepOriginId;
-	state->toplevel_xid = InvalidTransactionId;
-
+	decoded->header = *record;
+	decoded->lsn = lsn;
+	decoded->next = NULL;
+	decoded->record_origin = InvalidRepOriginId;
+	decoded->toplevel_xid = InvalidTransactionId;
+	decoded->main_data = NULL;
+	decoded->main_data_len = 0;
+	decoded->max_block_id = -1;
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
 	remaining = record->xl_tot_len - SizeOfXLogRecord;
@@ -1254,7 +1656,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint8));
 
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
@@ -1265,18 +1667,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			uint32		main_data_len;
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint32));
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
 		}
 		else if (block_id == XLR_BLOCK_ID_ORIGIN)
 		{
-			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
+			COPY_HEADER_FIELD(&decoded->record_origin, sizeof(RepOriginId));
 		}
 		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
 		{
-			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+			COPY_HEADER_FIELD(&decoded->toplevel_xid, sizeof(TransactionId));
 		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
@@ -1284,7 +1686,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			DecodedBkpBlock *blk;
 			uint8		fork_flags;
 
-			if (block_id <= state->max_block_id)
+			/* mark any intervening block IDs as not in use */
+			for (int i = decoded->max_block_id + 1; i < block_id; ++i)
+				decoded->blocks[i].in_use = false;
+
+			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
 									  "out-of-order block_id %u at %X/%X",
@@ -1292,9 +1698,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
 				goto err;
 			}
-			state->max_block_id = block_id;
+			decoded->max_block_id = block_id;
 
-			blk = &state->blocks[block_id];
+			blk = &decoded->blocks[block_id];
 			blk->in_use = true;
 			blk->apply_image = false;
 
@@ -1437,17 +1843,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	/*
 	 * Ok, we've parsed the fragment headers, and verified that the total
 	 * length of the payload in the fragments is equal to the amount of data
-	 * left. Copy the data of each fragment to a separate buffer.
-	 *
-	 * We could just set up pointers into readRecordBuf, but we want to align
-	 * the data for the convenience of the callers. Backup images are not
-	 * copied, however; they don't need alignment.
+	 * left.  Copy the data of each fragment to contiguous space after the
+	 * blocks array, inserting alignment padding before the data fragments so
+	 * they can be cast to struct pointers by REDO routines.
 	 */
+	out = ((char *) decoded) +
+		offsetof(DecodedXLogRecord, blocks) +
+		sizeof(decoded->blocks[0]) * (decoded->max_block_id + 1);
 
 	/* block data first */
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	for (block_id = 0; block_id <= decoded->max_block_id; block_id++)
 	{
-		DecodedBkpBlock *blk = &state->blocks[block_id];
+		DecodedBkpBlock *blk = &decoded->blocks[block_id];
 
 		if (!blk->in_use)
 			continue;
@@ -1456,58 +1863,37 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (blk->has_image)
 		{
-			blk->bkp_image = ptr;
+			/* no need to align image */
+			blk->bkp_image = out;
+			memcpy(out, ptr, blk->bimg_len);
 			ptr += blk->bimg_len;
+			out += blk->bimg_len;
 		}
 		if (blk->has_data)
 		{
-			if (!blk->data || blk->data_len > blk->data_bufsz)
-			{
-				if (blk->data)
-					pfree(blk->data);
-
-				/*
-				 * Force the initial request to be BLCKSZ so that we don't
-				 * waste time with lots of trips through this stanza as a
-				 * result of WAL compression.
-				 */
-				blk->data_bufsz = MAXALIGN(Max(blk->data_len, BLCKSZ));
-				blk->data = palloc(blk->data_bufsz);
-			}
+			out = (char *) MAXALIGN(out);
+			blk->data = out;
 			memcpy(blk->data, ptr, blk->data_len);
 			ptr += blk->data_len;
+			out += blk->data_len;
 		}
 	}
 
 	/* and finally, the main data */
-	if (state->main_data_len > 0)
+	if (decoded->main_data_len > 0)
 	{
-		if (!state->main_data || state->main_data_len > state->main_data_bufsz)
-		{
-			if (state->main_data)
-				pfree(state->main_data);
-
-			/*
-			 * main_data_bufsz must be MAXALIGN'ed.  In many xlog record
-			 * types, we omit trailing struct padding on-disk to save a few
-			 * bytes; but compilers may generate accesses to the xlog struct
-			 * that assume that padding bytes are present.  If the palloc
-			 * request is not large enough to include such padding bytes then
-			 * we'll get valgrind complaints due to otherwise-harmless fetches
-			 * of the padding bytes.
-			 *
-			 * In addition, force the initial request to be reasonably large
-			 * so that we don't waste time with lots of trips through this
-			 * stanza.  BLCKSZ / 2 seems like a good compromise choice.
-			 */
-			state->main_data_bufsz = MAXALIGN(Max(state->main_data_len,
-												  BLCKSZ / 2));
-			state->main_data = palloc(state->main_data_bufsz);
-		}
-		memcpy(state->main_data, ptr, state->main_data_len);
-		ptr += state->main_data_len;
+		out = (char *) MAXALIGN(out);
+		decoded->main_data = out;
+		memcpy(decoded->main_data, ptr, decoded->main_data_len);
+		ptr += decoded->main_data_len;
+		out += decoded->main_data_len;
 	}
 
+	/* Report the actual size we used. */
+	decoded->size = MAXALIGN(out - (char *) decoded);
+	Assert(DecodeXLogRecordRequiredSpace(record->xl_tot_len) >=
+		   decoded->size);
+
 	return true;
 
 shortdata_err:
@@ -1533,10 +1919,11 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	if (rnode)
 		*rnode = bkpb->rnode;
 	if (forknum)
@@ -1556,10 +1943,11 @@ XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return NULL;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 
 	if (!bkpb->has_data)
 	{
@@ -1587,12 +1975,13 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	char	   *ptr;
 	PGAlignedBlock tmp;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
-	if (!record->blocks[block_id].has_image)
+	if (!record->record->blocks[block_id].has_image)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	ptr = bkpb->bkp_image;
 
 	if (BKPIMAGE_COMPRESSED(bkpb->bimg_info))
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..9feea3e6ec 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2139,7 +2139,7 @@ xlog_block_info(StringInfo buf, XLogReaderState *record)
 	int			block_id;
 
 	/* decode block references */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
@@ -2271,7 +2271,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
 
 	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		Buffer		buf;
 		Page		page;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 54d5f20734..0053cfea42 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -370,7 +370,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * going to initialize it. And vice versa.
 	 */
 	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
-	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
+	willinit = (record->record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
 	if (!willinit && zeromode)
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 18cf931822..b68c1174a7 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -111,7 +111,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 	{
 		ReorderBufferAssignChild(ctx->reorder,
 								 txid,
-								 record->decoded_record->xl_xid,
+								 XLogRecGetXid(record),
 								 buf.origptr);
 	}
 
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 56df08c64f..7cfa169e9b 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -432,7 +432,7 @@ extractPageInfo(XLogReaderState *record)
 				 RmgrNames[rmid], info);
 	}
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 2340dc247b..c129df44ac 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -407,10 +407,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
 	 * add an accessor macro for this.
 	 */
 	*fpi_len = 0;
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (XLogRecHasBlockImage(record, block_id))
-			*fpi_len += record->blocks[block_id].bimg_len;
+			*fpi_len += record->record->blocks[block_id].bimg_len;
 	}
 
 	/*
@@ -508,7 +508,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	if (!config->bkp_details)
 	{
 		/* print block references (short format) */
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -539,7 +539,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	{
 		/* print block references (detailed format) */
 		putchar('\n');
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -552,7 +552,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				   blk);
 			if (XLogRecHasBlockImage(record, block_id))
 			{
-				uint8		bimg_info = record->blocks[block_id].bimg_info;
+				uint8		bimg_info = record->record->blocks[block_id].bimg_info;
 
 				if (BKPIMAGE_COMPRESSED(bimg_info))
 				{
@@ -569,11 +569,11 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 						   "compression saved: %u, method: %s",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length,
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length,
 						   BLCKSZ -
-						   record->blocks[block_id].hole_length -
-						   record->blocks[block_id].bimg_len,
+						   record->record->blocks[block_id].hole_length -
+						   record->record->blocks[block_id].bimg_len,
 						   method);
 				}
 				else
@@ -581,8 +581,8 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 					printf(" (FPW%s); hole: offset: %u, length: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length);
+						   record->record->blocks[block_id].hole_offset,
+						   record->record->blocks[block_id].hole_length);
 				}
 			}
 			putchar('\n');
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..debd78545c 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -144,6 +144,30 @@ typedef struct
 	uint16		data_bufsz;
 } DecodedBkpBlock;
 
+/*
+ * The decoded contents of a record.  This occupies a contiguous region of
+ * memory, with main_data and blocks[n].data pointing to memory after the
+ * members declared here.
+ */
+typedef struct DecodedXLogRecord
+{
+	/* Private member used for resource management. */
+	size_t		size;			/* total size of decoded record */
+	bool		oversized;		/* outside the regular decode buffer? */
+	struct DecodedXLogRecord *next; /* decoded record queue link */
+
+	/* Public members. */
+	XLogRecPtr	lsn;			/* location */
+	XLogRecPtr	next_lsn;		/* location of next record */
+	XLogRecord	header;			/* header */
+	RepOriginId record_origin;
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+	char	   *main_data;		/* record's main data portion */
+	uint32		main_data_len;	/* main data portion's length */
+	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	DecodedBkpBlock blocks[FLEXIBLE_ARRAY_MEMBER];
+} DecodedXLogRecord;
+
 struct XLogReaderState
 {
 	/*
@@ -171,6 +195,9 @@ struct XLogReaderState
 	 * Start and end point of last record read.  EndRecPtr is also used as the
 	 * position to read next.  Calling XLogBeginRead() sets EndRecPtr to the
 	 * starting position and ReadRecPtr to invalid.
+	 *
+	 * Start and end point of last record returned by XLogReadRecord().  These
+	 * are also available as record->lsn and record->next_lsn.
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
@@ -192,27 +219,43 @@ struct XLogReaderState
 	 * Use XLogRecGet* functions to investigate the record; these fields
 	 * should not be accessed directly.
 	 * ----------------------------------------
+	 * Start and end point of the last record read and decoded by
+	 * XLogDecodeNextRecord().  NextRecPtr is also used as the position to
+	 * decode next.  Calling XLogBeginRead() sets NextRecPtr and EndRecPtr to
+	 * the requested starting position.
 	 */
-	XLogRecord *decoded_record; /* currently decoded record */
+	XLogRecPtr	DecodeRecPtr;	/* start of last record decoded */
+	XLogRecPtr	NextRecPtr;		/* end+1 of last record decoded */
+	XLogRecPtr	PrevRecPtr;		/* start of previous record decoded */
 
-	char	   *main_data;		/* record's main data portion */
-	uint32		main_data_len;	/* main data portion's length */
-	uint32		main_data_bufsz;	/* allocated size of the buffer */
-
-	RepOriginId record_origin;
-
-	TransactionId toplevel_xid; /* XID of top-level transaction */
-
-	/* information about blocks referenced by the record. */
-	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
-
-	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	/* Last record returned by XLogReadRecord(). */
+	DecodedXLogRecord *record;
 
 	/* ----------------------------------------
 	 * private/internal state
 	 * ----------------------------------------
 	 */
 
+	/*
+	 * Buffer for decoded records.  This is a circular buffer, though
+	 * individual records can't be split in the middle, so some space is often
+	 * wasted at the end.  Oversized records that don't fit in this space are
+	 * allocated separately.
+	 */
+	char	   *decode_buffer;
+	size_t		decode_buffer_size;
+	bool		free_decode_buffer; /* need to free? */
+	char	   *decode_buffer_head; /* write head */
+	char	   *decode_buffer_tail; /* read head */
+
+	/*
+	 * Queue of records that have been decoded.  This is a linked list that
+	 * usually consists of consecutive records in decode_buffer, but may also
+	 * contain oversized records allocated with palloc().
+	 */
+	DecodedXLogRecord *decode_queue_head;	/* newest decoded record */
+	DecodedXLogRecord *decode_queue_tail;	/* oldest decoded record */
+
 	/*
 	 * Buffer for currently read page (XLOG_BLCKSZ bytes, valid up to at least
 	 * readLen bytes)
@@ -262,8 +305,25 @@ struct XLogReaderState
 
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
+	bool		errormsg_deferred;
+
+	/*
+	 * Flag to indicate to XLogPageReadCB that it should not block, during
+	 * read ahead.
+	 */
+	bool		nonblocking;
 };
 
+/*
+ * Check if the XLogNextRecord() has any more queued records or errors.  This
+ * can be used by a read_page callback to decide whether it should block.
+ */
+static inline bool
+XLogReaderHasQueuedRecordOrError(XLogReaderState *state)
+{
+	return (state->decode_queue_head != NULL) || state->errormsg_deferred;
+}
+
 /* Get a new XLogReader */
 extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
 										   const char *waldir,
@@ -274,16 +334,40 @@ extern XLogReaderRoutine *LocalXLogReaderRoutine(void);
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
+/* Optionally provide a circular decoding buffer to allow readahead. */
+extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
+									  void *buffer,
+									  size_t size);
+
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
 
+/* Return values from XLogPageReadCB. */
+typedef enum XLogPageReadResult
+{
+	XLREAD_SUCCESS = 0,			/* record is successfully read */
+	XLREAD_FAIL = -1,			/* failed during reading a record */
+	XLREAD_WOULDBLOCK = -2		/* nonblocking mode only, no data */
+} XLogPageReadResult;
+
 /* Read the next XLog record. Returns NULL on end-of-WAL or failure */
-extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
+extern XLogRecord *XLogReadRecord(XLogReaderState *state,
+								  char **errormsg);
+
+/* Consume the next record or error. */
+extern DecodedXLogRecord *XLogNextRecord(XLogReaderState *state,
 										 char **errormsg);
 
+/* Release the previously returned record, if necessary. */
+extern void XLogReleasePreviousRecord(XLogReaderState *state);
+
+/* Try to read ahead, if there is data and space. */
+extern DecodedXLogRecord *XLogReadAhead(XLogReaderState *state,
+										bool nonblocking);
+
 /* Validate a page */
 extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 										 XLogRecPtr recptr, char *phdr);
@@ -307,25 +391,32 @@ extern bool WALRead(XLogReaderState *state,
 
 /* Functions for decoding an XLogRecord */
 
-extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
+extern size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+extern bool DecodeXLogRecord(XLogReaderState *state,
+							 DecodedXLogRecord *decoded,
+							 XLogRecord *record,
+							 XLogRecPtr lsn,
 							 char **errmsg);
 
-#define XLogRecGetTotalLen(decoder) ((decoder)->decoded_record->xl_tot_len)
-#define XLogRecGetPrev(decoder) ((decoder)->decoded_record->xl_prev)
-#define XLogRecGetInfo(decoder) ((decoder)->decoded_record->xl_info)
-#define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
-#define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
-#define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
-#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
-#define XLogRecGetData(decoder) ((decoder)->main_data)
-#define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
-#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
-#define XLogRecHasBlockRef(decoder, block_id) \
-	((decoder)->blocks[block_id].in_use)
-#define XLogRecHasBlockImage(decoder, block_id) \
-	((decoder)->blocks[block_id].has_image)
-#define XLogRecBlockImageApply(decoder, block_id) \
-	((decoder)->blocks[block_id].apply_image)
+#define XLogRecGetTotalLen(decoder) ((decoder)->record->header.xl_tot_len)
+#define XLogRecGetPrev(decoder) ((decoder)->record->header.xl_prev)
+#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
+#define XLogRecGetRmid(decoder) ((decoder)->record->header.xl_rmid)
+#define XLogRecGetXid(decoder) ((decoder)->record->header.xl_xid)
+#define XLogRecGetOrigin(decoder) ((decoder)->record->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->record->toplevel_xid)
+#define XLogRecGetData(decoder) ((decoder)->record->main_data)
+#define XLogRecGetDataLen(decoder) ((decoder)->record->main_data_len)
+#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->record->max_block_id >= 0)
+#define XLogRecMaxBlockId(decoder) ((decoder)->record->max_block_id)
+#define XLogRecGetBlock(decoder, i) (&(decoder)->record->blocks[(i)])
+#define XLogRecHasBlockRef(decoder, block_id)			\
+	(((decoder)->record->max_block_id >= (block_id)) &&	\
+	 ((decoder)->record->blocks[block_id].in_use))
+#define XLogRecHasBlockImage(decoder, block_id)		\
+	((decoder)->record->blocks[block_id].has_image)
+#define XLogRecBlockImageApply(decoder, block_id)		\
+	((decoder)->record->blocks[block_id].apply_image)
 
 #ifndef FRONTEND
 extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d9b83f744f..a5cd996089 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -533,6 +533,7 @@ DeadLockState
 DeallocateStmt
 DeclareCursorStmt
 DecodedBkpBlock
+DecodedXLogRecord
 DecodingOutputState
 DefElem
 DefElemAction
@@ -2939,6 +2940,7 @@ XLogPageHeader
 XLogPageHeaderData
 XLogPageReadCB
 XLogPageReadPrivate
+XLogPageReadResult
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.30.2

Attachment: v21-0002-Prefetch-referenced-data-in-recovery-take-II.patch (application/x-patch)
From 1564f2f06a6b385a0b6b8dc31e1266e813e9c4ac Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 9 Nov 2021 16:43:45 +1300
Subject: [PATCH v21 2/2] Prefetch referenced data in recovery, take II.

Introduce a new GUC recovery_prefetch, disabled by default.  When
enabled, look ahead in the WAL and try to initiate asynchronous reading
of referenced data blocks that are not yet cached in our buffer pool.
For now, this is done with posix_fadvise(), which has several caveats.
Better mechanisms will follow in later work on the I/O subsystem.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size limits the maximum distance we are
prepared to read ahead in the WAL to find uncached blocks.

Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (earlier version)
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> (earlier version)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (earlier version)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> (earlier version)
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> (earlier version)
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  61 ++
 doc/src/sgml/monitoring.sgml                  |  77 +-
 doc/src/sgml/wal.sgml                         |  12 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             |   2 +
 src/backend/access/transam/xlogprefetcher.c   | 962 ++++++++++++++++++
 src/backend/access/transam/xlogreader.c       |  13 +
 src/backend/access/transam/xlogrecovery.c     | 160 ++-
 src/backend/access/transam/xlogutils.c        |  27 +-
 src/backend/catalog/system_views.sql          |  13 +
 src/backend/storage/freespace/freespace.c     |   3 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  39 +-
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/access/xlog.h                     |   1 +
 src/include/access/xlogprefetcher.h           |  43 +
 src/include/access/xlogreader.h               |   8 +
 src/include/access/xlogutils.h                |   3 +-
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/utils/guc.h                       |   4 +
 src/test/regress/expected/rules.out           |  10 +
 src/tools/pgindent/typedefs.list              |   7 +
 22 files changed, 1400 insertions(+), 62 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetcher.c
 create mode 100644 src/include/access/xlogprefetcher.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7ed8c82a9d..a086efb5e3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3641,6 +3641,67 @@ include_dir 'conf.d'
      </variablelist>
     </sect2>
 
+   <sect2 id="runtime-config-wal-recovery">
+
+    <title>Recovery</title>
+
+     <indexterm>
+      <primary>configuration</primary>
+      <secondary>of recovery</secondary>
+      <tertiary>general settings</tertiary>
+     </indexterm>
+
+    <para>
+     This section describes the settings that apply to recovery in general,
+     affecting crash recovery, streaming replication and archive-based
+     replication.
+    </para>
+
+
+    <variablelist>
+     <varlistentry id="guc-recovery-prefetch" xreflabel="recovery_prefetch">
+      <term><varname>recovery_prefetch</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to try to prefetch blocks that are referenced in the WAL that
+        are not yet in the buffer pool, during recovery.  Prefetching blocks
+        that will soon be needed can reduce I/O wait times in some workloads.
+        See also the <xref linkend="guc-wal-decode-buffer-size"/> and
+        <xref linkend="guc-maintenance-io-concurrency"/> settings, which limit
+        prefetching activity.
+        This setting is disabled by default.
+       </para>
+       <para>
+        This feature currently depends on an effective
+        <function>posix_fadvise</function> function, which some
+        operating systems lack.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-decode-buffer-size" xreflabel="wal_decode_buffer_size">
+      <term><varname>wal_decode_buffer_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_decode_buffer_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        A limit on how far ahead the server can look in the WAL, to find
+        blocks to prefetch.  If this value is specified without units, it is
+        taken as bytes.
+        The default is 512kB.
+       </para>
+      </listitem>
+     </varlistentry>
+
+    </variablelist>
+   </sect2>
+
   <sect2 id="runtime-config-wal-archive-recovery">
 
     <title>Archive Recovery</title>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9fb62fec8e..2e3b73f49e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -328,6 +328,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2958,6 +2965,69 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    copy of the subscribed tables.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_init</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they would be zero-initialized</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they didn't exist yet</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL</entry>
+    </row>
+    <row>
+     <entry><structfield>wal_distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many bytes ahead the prefetcher is looking</entry>
+    </row>
+    <row>
+     <entry><structfield>block_distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many blocks ahead the prefetcher is looking</entry>
+    </row>
+    <row>
+     <entry><structfield>io_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
+   for more information.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="1">
@@ -5177,8 +5247,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
         view, <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
-        to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+        the <structname>pg_stat_archiver</structname> view,
+        <literal>wal</literal> to reset all the counters shown in the
+        <structname>pg_stat_wal</structname> view or
+        <literal>prefetch_recovery</literal> to reset all the counters shown
+        in the <structname>pg_stat_prefetch_recovery</structname> view.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 2bb27a8468..8566f297d3 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -803,6 +803,18 @@
    counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
    in <structname>pg_stat_wal</structname>, respectively.
   </para>
+
+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.
+   By default, prefetching in recovery is disabled.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 79314c69ab..8c17c88dfc 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetcher.o \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogutils.o
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2b4e591736..23ecf0a237 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -59,6 +59,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
@@ -132,6 +133,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
new file mode 100644
index 0000000000..ec24cbf386
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -0,0 +1,962 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetcher.c
+ *
+ * This module provides a drop-in replacement for an XLogReader that tries to
+ * minimize I/O stalls by looking up future blocks in the buffer cache, and
+ * initiating I/Os that might complete before the caller eventually needs the
+ * data.  XLogReadBufferForRedo() cooperates by using information stored in
+ * the decoded record to find buffers efficiently.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetcher.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_control.h"
+#include "catalog/storage_xlog.h"
+#include "commands/dbcommands_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/* Every time we process this much WAL, we update dynamic values in shm. */
+#define XLOGPREFETCHER_STATS_SHM_DISTANCE BLCKSZ
+
+/* GUCs */
+bool		recovery_prefetch = false;
+
+static int	XLogPrefetchReconfigureCount = 0;
+
+/*
+ * Enum used to report whether an IO should be started.
+ */
+typedef enum
+{
+	LRQ_NEXT_NO_IO,
+	LRQ_NEXT_IO,
+	LRQ_NEXT_AGAIN
+} LsnReadQueueNextStatus;
+
+/*
+ * Type of callback that can decide which block to prefetch next.  For now
+ * there is only one.
+ */
+typedef LsnReadQueueNextStatus (*LsnReadQueueNextFun) (uintptr_t lrq_private,
+													   XLogRecPtr *lsn);
+
+/*
+ * A simple circular queue of LSNs, used to control the number of
+ * (potentially) inflight IOs.  This stands in for a later more general IO
+ * control mechanism, which is why it has the apparently unnecessary
+ * indirection through a function pointer.
+ */
+typedef struct LsnReadQueue
+{
+	LsnReadQueueNextFun next;
+	uintptr_t	lrq_private;
+	uint32		max_inflight;
+	uint32		inflight;
+	uint32		completed;
+	uint32		head;
+	uint32		tail;
+	uint32		size;
+	struct
+	{
+		bool		io;
+		XLogRecPtr	lsn;
+	}			queue[FLEXIBLE_ARRAY_MEMBER];
+} LsnReadQueue;
+
+/*
+ * A prefetcher.  This is a mechanism that wraps an XLogReader, prefetching
+ * blocks that will soon be referenced, to try to avoid IO stalls.
+ */
+struct XLogPrefetcher
+{
+	/* WAL reader and current reading state. */
+	XLogReaderState *reader;
+	DecodedXLogRecord *record;
+	int			next_block_id;
+
+	/* When to publish stats. */
+	XLogRecPtr	next_stats_shm_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB	   *filter_table;
+	dlist_head	filter_queue;
+
+	/* Book-keeping for readahead barriers. */
+	XLogRecPtr	no_readahead_until;
+
+	/* IO depth manager. */
+	LsnReadQueue *streaming_read;
+
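+	/*
+	 * LSN passed to XLogPrefetcherBeginRead().  Readahead is suppressed until
+	 * the record at this position has been consumed.
+	 */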
+	XLogRecPtr	begin_ptr;
+
+	int			reconfigure_count;
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that (we assume) have already been dropped, or will be created by bulk WAL
+ * operations.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode rnode;
+	XLogRecPtr	filter_until_replayed;
+	BlockNumber filter_from_block;
+	dlist_node	link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time;	/* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 hit;		/* Blocks already in cache. */
+	pg_atomic_uint64 skip_init; /* Zero-inited blocks skipped. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32		reset_handled;
+
+	/* Dynamic values */
+	int			wal_distance;	/* Number of WAL bytes ahead. */
+	int			block_distance; /* Number of block references ahead. */
+	int			io_depth;		/* Number of I/Os in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static LsnReadQueueNextStatus XLogPrefetcherNextBlock(uintptr_t pgsr_private,
+													  XLogRecPtr *lsn);
+
+static XLogPrefetchStats *SharedStats;
+
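+/*
+ * Allocate an LsnReadQueue that can track up to 'max_distance' block
+ * references, of which at most 'max_inflight' may have IOs in flight at
+ * once.  'next' is the callback used to choose the next block, and
+ * 'lrq_private' is passed through to it.
+ */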
+static inline LsnReadQueue *
+lrq_alloc(uint32 max_distance,
+		  uint32 max_inflight,
+		  uintptr_t lrq_private,
+		  LsnReadQueueNextFun next)
+{
+	LsnReadQueue *lrq;
+	uint32		size;
+
+	Assert(max_distance >= max_inflight);
+
+	size = max_distance + 1;	/* full ring buffer has a gap */
+	lrq = palloc(offsetof(LsnReadQueue, queue) + sizeof(lrq->queue[0]) * size);
+	lrq->lrq_private = lrq_private;
+	lrq->max_inflight = max_inflight;
+	lrq->size = size;
+	lrq->next = next;
+	lrq->head = 0;
+	lrq->tail = 0;
+	lrq->inflight = 0;
+	lrq->completed = 0;
+
+	return lrq;
+}
+
+static inline void
+lrq_free(LsnReadQueue *lrq)
+{
+	pfree(lrq);
+}
+
+static inline uint32
+lrq_inflight(LsnReadQueue *lrq)
+{
+	return lrq->inflight;
+}
+
+static inline uint32
+lrq_completed(LsnReadQueue *lrq)
+{
+	return lrq->completed;
+}
+
+static inline void
+lrq_prefetch(LsnReadQueue *lrq)
+{
+	/* Try to start as many IOs as we can within our limits. */
+	while (lrq->inflight < lrq->max_inflight &&
+		   lrq->inflight + lrq->completed < lrq->size - 1)
+	{
+		Assert(((lrq->head + 1) % lrq->size) != lrq->tail);
+		switch (lrq->next(lrq->lrq_private, &lrq->queue[lrq->head].lsn))
+		{
+			case LRQ_NEXT_AGAIN:
+				return;
+			case LRQ_NEXT_IO:
+				lrq->queue[lrq->head].io = true;
+				lrq->inflight++;
+				break;
+			case LRQ_NEXT_NO_IO:
+				lrq->queue[lrq->head].io = false;
+				lrq->completed++;
+				break;
+		}
+		lrq->head++;
+		if (lrq->head == lrq->size)
+			lrq->head = 0;
+	}
+}
+
+static inline void
+lrq_complete_lsn(LsnReadQueue *lrq, XLogRecPtr lsn)
+{
+	/*
+	 * We know that LSNs before 'lsn' have been replayed, so we can now assume
+	 * that any IOs that were started before then have finished.
+	 */
+	while (lrq->tail != lrq->head &&
+		   lrq->queue[lrq->tail].lsn < lsn)
+	{
+		if (lrq->queue[lrq->tail].io)
+			lrq->inflight--;
+		else
+			lrq->completed--;
+		lrq->tail++;
+		if (lrq->tail == lrq->size)
+			lrq->tail = 0;
+	}
+	if (recovery_prefetch)
+		lrq_prefetch(lrq);
+}
+
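+/*
+ * Report the amount of shared memory needed for recovery prefetch statistics.
+ */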
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
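+/*
+ * Reset all prefetch counters to zero and record the time of the reset.
+ */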
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&SharedStats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&SharedStats->prefetch, 0);
+	pg_atomic_write_u64(&SharedStats->hit, 0);
+	pg_atomic_write_u64(&SharedStats->skip_init, 0);
+	pg_atomic_write_u64(&SharedStats->skip_new, 0);
+	pg_atomic_write_u64(&SharedStats->skip_fpw, 0);
+}
+
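+/*
+ * Attach to (or create) the shared memory area for prefetch statistics,
+ * initializing the counters when it is first created.
+ */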
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	SharedStats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+
+	if (!found)
+	{
+		pg_atomic_init_u32(&SharedStats->reset_request, 0);
+		SharedStats->reset_handled = 0;
+
+		pg_atomic_init_u64(&SharedStats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&SharedStats->prefetch, 0);
+		pg_atomic_init_u64(&SharedStats->hit, 0);
+		pg_atomic_init_u64(&SharedStats->skip_init, 0);
+		pg_atomic_init_u64(&SharedStats->skip_new, 0);
+		pg_atomic_init_u64(&SharedStats->skip_fpw, 0);
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&SharedStats->reset_request, 1);
+}
+
+/*
+ * Increment a counter in shared memory.  This is equivalent to (*counter)++
+ * on a plain uint64 without any memory barrier or locking, except on platforms
+ * where readers can't read uint64 without possibly observing a torn value.
+ */
+static inline void
+XLogPrefetchIncrement(pg_atomic_uint64 *counter)
+{
+	Assert(AmStartupProcess() || !IsUnderPostmaster);
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL records.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogReaderState *reader)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	prefetcher = palloc0(sizeof(XLogPrefetcher));
+
+	prefetcher->reader = reader;
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	SharedStats->wal_distance = 0;
+	SharedStats->block_distance = 0;
+	SharedStats->io_depth = 0;
+
+	/* First usage will cause streaming_read to be allocated. */
+	prefetcher->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	lrq_free(prefetcher->streaming_read);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Provide access to the reader.
+ */
+XLogReaderState *
+XLogPrefetcherReader(XLogPrefetcher *prefetcher)
+{
+	return prefetcher->reader;
+}
+
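+/*
+ * Refresh the dynamic values shown in pg_stat_prefetch_recovery, and handle
+ * any pending request to reset the counters.
+ */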
+static void
+XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher, XLogRecPtr lsn)
+{
+	uint32		io_depth;
+	uint32		completed;
+	uint32		reset_request;
+	int64		wal_distance;
+
+	/* How far ahead of replay are we now? */
+	if (prefetcher->record)
+		wal_distance = prefetcher->record->lsn - prefetcher->reader->record->lsn;
+	else
+		wal_distance = 0;
+
+	/* How many IOs are currently in flight and completed? */
+	io_depth = lrq_inflight(prefetcher->streaming_read);
+	completed = lrq_completed(prefetcher->streaming_read);
+
+	/* Update the instantaneous stats visible in pg_stat_prefetch_recovery. */
+	SharedStats->io_depth = io_depth;
+	SharedStats->block_distance = io_depth + completed;
+	SharedStats->wal_distance = wal_distance;
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&SharedStats->reset_request);
+	if (reset_request != SharedStats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		SharedStats->reset_handled = reset_request;
+	}
+
+	prefetcher->next_stats_shm_lsn = lsn + XLOGPREFETCHER_STATS_SHM_DISTANCE;
+}
+
+/*
+ * A callback that reads ahead in the WAL and tries to initiate one IO.
+ */
+static LsnReadQueueNextStatus
+XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
+{
+	XLogPrefetcher *prefetcher = (XLogPrefetcher *) pgsr_private;
+	XLogReaderState *reader = prefetcher->reader;
+	XLogRecPtr	replaying_lsn = reader->ReadRecPtr;
+
+	/*
+	 * We keep track of the record and block we're up to between calls with
+	 * prefetcher->record and prefetcher->next_block_id.
+	 */
+	for (;;)
+	{
+		DecodedXLogRecord *record;
+
+		/* Try to read a new future record, if we don't already have one. */
+		if (prefetcher->record == NULL)
+		{
+			bool		nonblocking;
+
+			/*
+			 * If there are already records or an error queued up that could
+			 * be replayed, we don't want to block here.  Otherwise, it's OK
+			 * to block waiting for more data: presumably the caller has
+			 * nothing else to do.
+			 */
+			nonblocking = XLogReaderHasQueuedRecordOrError(reader);
+
+			/* Certain records act as barriers for all readahead. */
+			if (nonblocking && replaying_lsn < prefetcher->no_readahead_until)
+				return LRQ_NEXT_AGAIN;
+
+			record = XLogReadAhead(prefetcher->reader, nonblocking);
+			if (record == NULL)
+			{
+				/*
+				 * We can't read any more, due to an error or lack of data in
+				 * nonblocking mode.
+				 */
+				return LRQ_NEXT_AGAIN;
+			}
+
+			/*
+			 * If prefetching is disabled, we don't need to analyze the record
+			 * or issue any prefetches.  We just need to cause one record to
+			 * be decoded.
+			 */
+			if (!recovery_prefetch)
+			{
+				*lsn = InvalidXLogRecPtr;
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* We have a new record to process. */
+			prefetcher->record = record;
+			prefetcher->next_block_id = 0;
+		}
+		else
+		{
+			/* Continue to process from last call, or last loop. */
+			record = prefetcher->record;
+		}
+
+		/*
+		 * Check for operations that require us to filter out block ranges, or
+		 * stop readahead completely.
+		 *
+		 * XXX Perhaps this information could be derived automatically if we
+		 * had some standardized header flags and fields for these cases,
+		 * instead of special logic.
+		 *
+		 * XXX Are there other operations that need this treatment?
+		 */
+		if (replaying_lsn < record->lsn)
+		{
+			uint8		rmid = record->header.xl_rmid;
+			uint8		record_type = record->header.xl_info & ~XLR_INFO_MASK;
+
+			if (rmid == RM_XLOG_ID)
+			{
+				if (record_type == XLOG_CHECKPOINT_SHUTDOWN ||
+					record_type == XLOG_END_OF_RECOVERY)
+				{
+					/*
+					 * These records might change the TLI.  Avoid potential
+					 * bugs if we were to allow "read TLI" and "replay TLI" to
+					 * differ without more analysis.
+					 */
+					prefetcher->no_readahead_until = record->lsn;
+				}
+			}
+			else if (rmid == RM_DBASE_ID)
+			{
+				if (record_type == XLOG_DBASE_CREATE)
+				{
+					xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *)
+					record->main_data;
+					RelFileNode rnode = {InvalidOid, xlrec->db_id, InvalidOid};
+
+					/*
+					 * Don't try to prefetch anything in this database until
+					 * it has been created, or we might confuse blocks on OID
+					 * wraparound.  (We could use XLOG_DBASE_DROP instead, but
+					 * there shouldn't be any reference to blocks in a
+					 * database between DROP and CREATE for the same OID, and
+					 * doing it on CREATE avoids the more expensive
+					 * ENOENT handling we would need if we didn't treat
+					 * CREATE as a barrier).
+					 */
+					XLogPrefetcherAddFilter(prefetcher, rnode, 0, record->lsn);
+				}
+			}
+			else if (rmid == RM_SMGR_ID)
+			{
+				if (record_type == XLOG_SMGR_CREATE)
+				{
+					xl_smgr_create *xlrec = (xl_smgr_create *)
+					record->main_data;
+
+					/*
+					 * Don't prefetch anything for this whole relation until
+					 * it has been created, or we might confuse blocks on OID
+					 * wraparound.
+					 */
+					XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+											record->lsn);
+				}
+				else if (record_type == XLOG_SMGR_TRUNCATE)
+				{
+					xl_smgr_truncate *xlrec = (xl_smgr_truncate *)
+					record->main_data;
+
+					/*
+					 * Don't prefetch anything in the truncated range until
+					 * the truncation has been performed.
+					 */
+					XLogPrefetcherAddFilter(prefetcher, xlrec->rnode,
+											xlrec->blkno,
+											record->lsn);
+				}
+			}
+		}
+
+		/* Scan the block references, starting where we left off last time. */
+		while (prefetcher->next_block_id <= record->max_block_id)
+		{
+			int			block_id = prefetcher->next_block_id++;
+			DecodedBkpBlock *block = &record->blocks[block_id];
+			SMgrRelation reln;
+			PrefetchBufferResult result;
+
+			if (!block->in_use)
+				continue;
+
+			Assert(!BufferIsValid(block->prefetch_buffer));
+
+			/*
+			 * Record the LSN of this record.  When it's replayed,
+			 * LsnReadQueue will consider any IOs submitted for earlier LSNs
+			 * to be finished.
+			 */
+			*lsn = record->lsn;
+
+			/* We don't try to prefetch anything but the main fork for now. */
+			if (block->forknum != MAIN_FORKNUM)
+			{
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/*
+			 * If there is a full page image attached, we won't be reading the
+			 * page, so don't bother trying to prefetch.
+			 */
+			if (block->has_image)
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_fpw);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* There is no point in reading a page that will be zeroed. */
+			if (block->flags & BKPBLOCK_WILL_INIT)
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_init);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* Should we skip prefetching this block due to a filter? */
+			if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/*
+			 * We could try to have a fast path for repeated references to the
+			 * same relation (with some scheme to handle invalidations
+			 * safely), but for now we'll call smgropen() every time.
+			 */
+			reln = smgropen(block->rnode, InvalidBackendId);
+
+			/*
+			 * If the block is past the end of the relation, filter out
+			 * further accesses until this record is replayed.
+			 */
+			if (block->blkno >= smgrnblocks(reln, block->forknum))
+			{
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+										record->lsn);
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* Try to initiate prefetching. */
+			result = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+			if (BufferIsValid(result.recent_buffer))
+			{
+				/* Cache hit, nothing to do. */
+				XLogPrefetchIncrement(&SharedStats->hit);
+				block->prefetch_buffer = result.recent_buffer;
+				return LRQ_NEXT_NO_IO;
+			}
+			else if (result.initiated_io)
+			{
+				/* Cache miss, I/O (presumably) started. */
+				XLogPrefetchIncrement(&SharedStats->prefetch);
+				block->prefetch_buffer = InvalidBuffer;
+				return LRQ_NEXT_IO;
+			}
+			else
+			{
+				/*
+				 * Neither cached nor initiated.  The underlying segment file
+				 * doesn't exist. (ENOENT)
+				 *
+				 * It might be missing because it was unlinked, we crashed,
+				 * and now we're replaying WAL.  Recovery will correct this
+				 * problem or complain if something is wrong.
+				 */
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+										record->lsn);
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+		}
+
+		/*
+		 * Several callsites need to be able to read exactly one record
+		 * without any internal readahead.  Examples: xlog.c reading
+		 * checkpoint records with emode set to PANIC, which might otherwise
+		 * cause XLogPageRead() to panic on some future page, and xlog.c
+		 * determining where to start writing WAL next, which depends on the
+		 * contents of the reader's internal buffer after reading one record.
+		 * Therefore, don't even think about prefetching until the first
+		 * record after XLogPrefetcherBeginRead() has been consumed.
+		 */
+		if (prefetcher->reader->decode_queue_tail &&
+			prefetcher->reader->decode_queue_tail->lsn == prefetcher->begin_ptr)
+			return LRQ_NEXT_AGAIN;
+
+		/* Advance to the next record. */
+		prefetcher->record = NULL;
+	}
+	pg_unreachable();
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mod required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&SharedStats->reset_request) != SharedStats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&SharedStats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_init));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_new));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_fpw));
+	values[6] = Int32GetDatum(SharedStats->wal_distance);
+	values[7] = Int32GetDatum(SharedStats->block_distance);
+	values[8] = Int32GetDatum(SharedStats->io_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the lower of the block numbers
+		 * there because we don't want to have to track individual blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+		filter->filter_from_block = Min(filter->filter_from_block, blockno);
+	}
+}
+
+/*
+ * Have we replayed any records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can stop filtering out accesses to a given
+ * relfilenode.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of
+	 * the time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter;
+
+		/* See if the block range is filtered. */
+		filter = hash_search(prefetcher->filter_table, &rnode, HASH_FIND, NULL);
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+
+		/* See if the whole database is filtered. */
+		rnode.relNode = InvalidOid;
+		filter = hash_search(prefetcher->filter_table, &rnode, HASH_FIND, NULL);
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * A wrapper for XLogBeginRead() that also resets the prefetcher.
+ */
+void
+XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher, XLogRecPtr recPtr)
+{
+	/* This will forget about any in-flight IO. */
+	prefetcher->reconfigure_count--;
+
+	/* Book-keeping to avoid readahead on first read. */
+	prefetcher->begin_ptr = recPtr;
+
+	prefetcher->no_readahead_until = 0;
+
+	/* This will forget about any queued up records in the decoder. */
+	XLogBeginRead(prefetcher->reader, recPtr);
+}
+
+/*
+ * A wrapper for XLogReadRecord() that provides the same interface, but also
+ * tries to initiate I/O for blocks referenced in future WAL records.
+ */
+XLogRecord *
+XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
+{
+	DecodedXLogRecord *record;
+
+	/*
+	 * See if it's time to reset the prefetching machinery, because a relevant
+	 * GUC was changed.
+	 */
+	if (unlikely(XLogPrefetchReconfigureCount != prefetcher->reconfigure_count))
+	{
+		if (prefetcher->streaming_read)
+			lrq_free(prefetcher->streaming_read);
+
+		/*
+		 * Arbitrarily look up to 4 times further ahead than the number of IOs
+		 * we're allowed to run concurrently.
+		 */
+		prefetcher->streaming_read =
+			lrq_alloc(recovery_prefetch ? maintenance_io_concurrency * 4 : 1,
+					  recovery_prefetch ? maintenance_io_concurrency : 1,
+					  (uintptr_t) prefetcher,
+					  XLogPrefetcherNextBlock);
+
+		prefetcher->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	/*
+	 * Release the last returned record, if there is one.  We need to do this
+	 * so that we can check for an empty decode queue accurately.
+	 */
+	XLogReleasePreviousRecord(prefetcher->reader);
+
+	/* If there's nothing queued yet, then start prefetching. */
+	if (!XLogReaderHasQueuedRecordOrError(prefetcher->reader))
+		lrq_prefetch(prefetcher->streaming_read);
+
+	/* Read the next record. */
+	record = XLogNextRecord(prefetcher->reader, errmsg);
+	if (!record)
+		return NULL;
+
+	/*
+	 * The record we just got is the "current" one, for the benefit of the
+	 * XLogRecXXX() macros.
+	 */
+	Assert(record == prefetcher->reader->record);
+
+	/*
+	 * Can we drop any prefetch filters yet, given the record we're about to
+	 * return?  This assumes that any records with earlier LSNs have been
+	 * replayed, so if we were waiting for a relation to be created or
+	 * extended, it is now OK to access blocks in the covered range.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, record->lsn);
+
+	/*
+	 * See if it's time to compute some statistics, because enough WAL has
+	 * been processed.
+	 */
+	if (unlikely(record->lsn >= prefetcher->next_stats_shm_lsn))
+		XLogPrefetcherComputeStats(prefetcher, record->lsn);
+
+	/*
+	 * The caller is about to replay this record, so we can now report that
+	 * all IO initiated for earlier WAL records must be finished.  This may
+	 * trigger more readahead.
+	 */
+	lrq_complete_lsn(prefetcher->streaming_read, record->lsn);
+
+	Assert(record == prefetcher->reader->record);
+
+	return &record->header;
+}
+
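+/*
+ * GUC check hook for recovery_prefetch.  Reject enabling it on builds that
+ * lack posix_fadvise() support.
+ */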
+bool
+check_recovery_prefetch(bool *new_value, void **extra, GucSource source)
+{
+#ifndef USE_PREFETCH
+	if (*new_value)
+	{
+		GUC_check_errdetail("recovery_prefetch must be set to off on platforms that lack posix_fadvise().");
+		return false;
+	}
+#endif
+
+	return true;
+}
+
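+/*
+ * GUC assign hook for recovery_prefetch.
+ */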
+void
+assign_recovery_prefetch(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cb491cb18d..86a7b4c5c8 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1710,6 +1710,8 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
 
+			blk->prefetch_buffer = InvalidBuffer;
+
 			COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 			/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
 			if (blk->has_data && blk->data_len == 0)
@@ -1916,6 +1918,15 @@ err:
 bool
 XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 				   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+	return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL);
+}
+
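+/*
+ * Like XLogRecGetBlockTag(), but also return the prefetch_buffer recorded
+ * for this block reference by the prefetcher, if the caller asks for it.
+ */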
+bool
+XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+					RelFileNode *rnode, ForkNumber *forknum,
+					BlockNumber *blknum,
+					Buffer *prefetch_buffer)
 {
 	DecodedBkpBlock *bkpb;
 
@@ -1930,6 +1941,8 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 		*forknum = bkpb->forknum;
 	if (blknum)
 		*blknum = bkpb->blkno;
+	if (prefetch_buffer)
+		*prefetch_buffer = bkpb->prefetch_buffer;
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 9feea3e6ec..e5e7821c79 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -36,6 +36,7 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
@@ -183,6 +184,9 @@ static bool doRequestWalReceiverReply;
 /* XLogReader object used to parse the WAL records */
 static XLogReaderState *xlogreader = NULL;
 
+/* XLogPrefetcher object used to consume WAL records with read-ahead */
+static XLogPrefetcher *xlogprefetcher = NULL;
+
 /* Parameters passed down from ReadRecord to the XLogPageRead callback. */
 typedef struct XLogPageReadPrivate
 {
@@ -404,18 +408,21 @@ static void recoveryPausesHere(bool endOfRecovery);
 static bool recoveryApplyDelay(XLogReaderState *record);
 static void ConfirmRecoveryPaused(void);
 
-static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
-							  int emode, bool fetching_ckpt, TimeLineID replayTLI);
+static XLogRecord *ReadRecord(XLogPrefetcher *xlogprefetcher,
+							  int emode, bool fetching_ckpt,
+							  TimeLineID replayTLI);
 
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
-static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-										bool fetching_ckpt,
-										XLogRecPtr tliRecPtr,
-										TimeLineID replayTLI,
-										XLogRecPtr replayLSN);
+static XLogPageReadResult WaitForWALToBecomeAvailable(XLogRecPtr RecPtr,
+													  bool randAccess,
+													  bool fetching_ckpt,
+													  XLogRecPtr tliRecPtr,
+													  TimeLineID replayTLI,
+													  XLogRecPtr replayLSN,
+													  bool nonblocking);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
-static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+static XLogRecord *ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher, XLogRecPtr RecPtr,
 										int whichChkpt, bool report, TimeLineID replayTLI);
 static bool rescanLatestTimeLine(TimeLineID replayTLI, XLogRecPtr replayLSN);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
@@ -561,6 +568,15 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 				 errdetail("Failed while allocating a WAL reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Set the WAL decode buffer size.  This limits how far ahead we can read
+	 * in the WAL.
+	 */
+	XLogReaderSetDecodeBuffer(xlogreader, NULL, wal_decode_buffer_size);
+
+	/* Create a WAL prefetcher. */
+	xlogprefetcher = XLogPrefetcherAllocate(xlogreader);
+
 	/*
 	 * Allocate two page buffers dedicated to WAL consistency checks.  We do
 	 * it this way, rather than just making static arrays, for two reasons:
@@ -589,7 +605,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
 		 */
-		record = ReadCheckpointRecord(xlogreader, CheckPointLoc, 0, true, CheckPointTLI);
+		record = ReadCheckpointRecord(xlogprefetcher, CheckPointLoc, 0, true,
+									  CheckPointTLI);
 		if (record != NULL)
 		{
 			memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
@@ -607,8 +624,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			 */
 			if (checkPoint.redo < CheckPointLoc)
 			{
-				XLogBeginRead(xlogreader, checkPoint.redo);
-				if (!ReadRecord(xlogreader, LOG, false,
+				XLogPrefetcherBeginRead(xlogprefetcher, checkPoint.redo);
+				if (!ReadRecord(xlogprefetcher, LOG, false,
 								checkPoint.ThisTimeLineID))
 					ereport(FATAL,
 							(errmsg("could not find redo location referenced by checkpoint record"),
@@ -727,7 +744,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		CheckPointTLI = ControlFile->checkPointCopy.ThisTimeLineID;
 		RedoStartLSN = ControlFile->checkPointCopy.redo;
 		RedoStartTLI = ControlFile->checkPointCopy.ThisTimeLineID;
-		record = ReadCheckpointRecord(xlogreader, CheckPointLoc, 1, true,
+		record = ReadCheckpointRecord(xlogprefetcher, CheckPointLoc, 1, true,
 									  CheckPointTLI);
 		if (record != NULL)
 		{
@@ -1403,8 +1420,8 @@ FinishWalRecovery(void)
 		lastRec = XLogRecoveryCtl->lastReplayedReadRecPtr;
 		lastRecTLI = XLogRecoveryCtl->lastReplayedTLI;
 	}
-	XLogBeginRead(xlogreader, lastRec);
-	(void) ReadRecord(xlogreader, PANIC, false, lastRecTLI);
+	XLogPrefetcherBeginRead(xlogprefetcher, lastRec);
+	(void) ReadRecord(xlogprefetcher, PANIC, false, lastRecTLI);
 	endOfLog = xlogreader->EndRecPtr;
 
 	/*
@@ -1501,6 +1518,8 @@ ShutdownWalRecovery(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	XLogPrefetcherFree(xlogprefetcher);
+
 	if (ArchiveRecoveryRequested)
 	{
 		/*
@@ -1584,15 +1603,15 @@ PerformWalRecovery(void)
 	{
 		/* back up to find the record */
 		replayTLI = RedoStartTLI;
-		XLogBeginRead(xlogreader, RedoStartLSN);
-		record = ReadRecord(xlogreader, PANIC, false, replayTLI);
+		XLogPrefetcherBeginRead(xlogprefetcher, RedoStartLSN);
+		record = ReadRecord(xlogprefetcher, PANIC, false, replayTLI);
 	}
 	else
 	{
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogreader, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1706,7 +1725,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1922,6 +1941,9 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 		 */
 		if (AllowCascadeReplication())
 			WalSndWakeup();
+
+		/* Reset the prefetcher. */
+		XLogPrefetchReconfigure();
 	}
 }
 
@@ -2302,7 +2324,8 @@ verifyBackupPageConsistency(XLogReaderState *record)
 		 * temporary page.
 		 */
 		buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									 RBM_NORMAL_NO_LOG);
+									 RBM_NORMAL_NO_LOG,
+									 InvalidBuffer);
 		if (!BufferIsValid(buf))
 			continue;
 
@@ -2914,17 +2937,18 @@ ConfirmRecoveryPaused(void)
  * Attempt to read the next XLOG record.
  *
  * Before first call, the reader needs to be positioned to the first record
- * by calling XLogBeginRead().
+ * by calling XLogPrefetcherBeginRead().
  *
  * If no valid record is available, returns NULL, or fails if emode is PANIC.
  * (emode must be either PANIC, LOG). In standby mode, retries until a valid
  * record is available.
  */
 static XLogRecord *
-ReadRecord(XLogReaderState *xlogreader, int emode,
+ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		   bool fetching_ckpt, TimeLineID replayTLI)
 {
 	XLogRecord *record;
+	XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher);
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 
 	/* Pass through parameters to XLogPageRead */
@@ -2940,7 +2964,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	{
 		char	   *errormsg;
 
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
 		{
 			/*
@@ -3073,6 +3097,9 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
  * and call XLogPageRead() again with the same arguments. This lets
  * XLogPageRead() to try fetching the record from another source, or to
  * sleep and retry.
+ *
+ * While prefetching, xlogreader->nonblocking may be set.  In that case,
+ * return XLREAD_WOULDBLOCK if we'd otherwise have to wait.
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
@@ -3122,20 +3149,31 @@ retry:
 		(readSource == XLOG_FROM_STREAM &&
 		 flushedUpto < targetPagePtr + reqLen))
 	{
-		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
-										 private->randAccess,
-										 private->fetching_ckpt,
-										 targetRecPtr,
-										 private->replayTLI,
-										 xlogreader->EndRecPtr))
+		if (readFile >= 0 &&
+			xlogreader->nonblocking &&
+			readSource == XLOG_FROM_STREAM &&
+			flushedUpto < targetPagePtr + reqLen)
+			return XLREAD_WOULDBLOCK;
+
+		switch (WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
+											private->randAccess,
+											private->fetching_ckpt,
+											targetRecPtr,
+											private->replayTLI,
+											xlogreader->EndRecPtr,
+											xlogreader->nonblocking))
 		{
-			if (readFile >= 0)
-				close(readFile);
-			readFile = -1;
-			readLen = 0;
-			readSource = XLOG_FROM_ANY;
-
-			return -1;
+			case XLREAD_WOULDBLOCK:
+				return XLREAD_WOULDBLOCK;
+			case XLREAD_FAIL:
+				if (readFile >= 0)
+					close(readFile);
+				readFile = -1;
+				readLen = 0;
+				readSource = XLOG_FROM_ANY;
+				return XLREAD_FAIL;
+			case XLREAD_SUCCESS:
+				break;
 		}
 	}
 
@@ -3260,7 +3298,7 @@ next_record_is_invalid:
 	if (StandbyMode)
 		goto retry;
 	else
-		return -1;
+		return XLREAD_FAIL;
 }
 
 /*
@@ -3292,11 +3330,15 @@ next_record_is_invalid:
  * containing it (if not open already), and returns true. When end of standby
  * mode is triggered by the user, and there is no more WAL available, returns
  * false.
+ *
+ * If nonblocking is true, then give up immediately if we can't satisfy the
+ * request, returning XLREAD_WOULDBLOCK instead of waiting.
  */
-static bool
+static XLogPageReadResult
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							bool fetching_ckpt, XLogRecPtr tliRecPtr,
-							TimeLineID replayTLI, XLogRecPtr replayLSN)
+							TimeLineID replayTLI, XLogRecPtr replayLSN,
+							bool nonblocking)
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
@@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 */
 		if (lastSourceFailed)
 		{
+			/*
+			 * Don't allow any retry loops to occur during nonblocking
+			 * readahead.  Let the caller process everything that has been
+			 * decoded already first.
+			 */
+			if (nonblocking)
+				return XLREAD_WOULDBLOCK;
+
 			switch (currentSource)
 			{
 				case XLOG_FROM_ARCHIVE:
@@ -3364,7 +3414,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (StandbyMode && CheckForStandbyTrigger())
 					{
 						XLogShutdownWalRcv();
-						return false;
+						return XLREAD_FAIL;
 					}
 
 					/*
@@ -3372,7 +3422,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * and pg_wal.
 					 */
 					if (!StandbyMode)
-						return false;
+						return XLREAD_FAIL;
 
 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
@@ -3516,7 +3566,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 											  currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
 											  currentSource);
 				if (readFile >= 0)
-					return true;	/* success! */
+					return XLREAD_SUCCESS;	/* success! */
 
 				/*
 				 * Nope, not found in archive or pg_wal.
@@ -3671,11 +3721,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							/* just make sure source info is correct... */
 							readSource = XLOG_FROM_STREAM;
 							XLogReceiptSource = XLOG_FROM_STREAM;
-							return true;
+							return XLREAD_SUCCESS;
 						}
 						break;
 					}
 
+					/* In nonblocking mode, return rather than sleeping. */
+					if (nonblocking)
+						return XLREAD_WOULDBLOCK;
+
 					/*
 					 * Data not here yet. Check for trigger, then wait for
 					 * walreceiver to wake us up when new WAL arrives.
@@ -3683,13 +3737,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (CheckForStandbyTrigger())
 					{
 						/*
-						 * Note that we don't "return false" immediately here.
-						 * After being triggered, we still want to replay all
-						 * the WAL that was already streamed. It's in pg_wal
-						 * now, so we just treat this as a failure, and the
-						 * state machine will move on to replay the streamed
-						 * WAL from pg_wal, and then recheck the trigger and
-						 * exit replay.
+						 * Note that we don't return XLREAD_FAIL immediately
+						 * here. After being triggered, we still want to
+						 * replay all the WAL that was already streamed. It's
+						 * in pg_wal now, so we just treat this as a failure,
+						 * and the state machine will move on to replay the
+						 * streamed WAL from pg_wal, and then recheck the
+						 * trigger and exit replay.
 						 */
 						lastSourceFailed = true;
 						break;
@@ -3740,7 +3794,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		HandleStartupProcInterrupts();
 	}
 
-	return false;				/* not reached */
+	return XLREAD_FAIL;				/* not reached */
 }
 
 
@@ -3785,7 +3839,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
  * 1 for "primary", 0 for "other" (backup_label)
  */
 static XLogRecord *
-ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher, XLogRecPtr RecPtr,
 					 int whichChkpt, bool report, TimeLineID replayTLI)
 {
 	XLogRecord *record;
@@ -3812,8 +3866,8 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 		return NULL;
 	}
 
-	XLogBeginRead(xlogreader, RecPtr);
-	record = ReadRecord(xlogreader, LOG, true, replayTLI);
+	XLogPrefetcherBeginRead(xlogprefetcher, RecPtr);
+	record = ReadRecord(xlogprefetcher, LOG, true, replayTLI);
 
 	if (record == NULL)
 	{
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 0053cfea42..44d9313422 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -22,6 +22,7 @@
 #include "access/timeline.h"
 #include "access/xlogrecovery.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -355,11 +356,13 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber blkno;
+	Buffer		prefetch_buffer;
 	Page		page;
 	bool		zeromode;
 	bool		willinit;
 
-	if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+	if (!XLogRecGetBlockInfo(record, block_id, &rnode, &forknum, &blkno,
+							 &prefetch_buffer))
 	{
 		/* Caller specified a bogus block_id */
 		elog(PANIC, "failed to locate backup block with ID %d", block_id);
@@ -381,7 +384,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	{
 		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
+									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
+									  prefetch_buffer);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			elog(ERROR, "failed to restore block image");
@@ -410,7 +414,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	}
 	else
 	{
-		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode);
+		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode, prefetch_buffer);
 		if (BufferIsValid(*buf))
 		{
 			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
@@ -450,6 +454,10 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  * exist, and we don't check for all-zeroes.  Thus, no log entry is made
  * to imply that the page should be dropped or truncated later.
  *
+ * Optionally, recent_buffer can be used to provide a hint about the location
+ * of the page in the buffer pool; it does not have to be correct, but avoids
+ * a buffer mapping table probe if it is.
+ *
  * NB: A redo function should normally not call this directly. To get a page
  * to modify, use XLogReadBufferForRedoExtended instead. It is important that
  * all pages modified by a WAL record are registered in the WAL records, or
@@ -457,7 +465,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  */
 Buffer
 XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-					   BlockNumber blkno, ReadBufferMode mode)
+					   BlockNumber blkno, ReadBufferMode mode,
+					   Buffer recent_buffer)
 {
 	BlockNumber lastblock;
 	Buffer		buffer;
@@ -465,6 +474,15 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 
 	Assert(blkno != P_NEW);
 
+	/* Do we have a clue where the buffer might be already? */
+	if (BufferIsValid(recent_buffer) &&
+		mode == RBM_NORMAL &&
+		ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+	{
+		buffer = recent_buffer;
+		goto recent_buffer_fast_path;
+	}
+
 	/* Open the relation at smgr level */
 	smgr = smgropen(rnode, InvalidBackendId);
 
@@ -523,6 +541,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 		}
 	}
 
+recent_buffer_fast_path:
 	if (mode == RBM_NORMAL)
 	{
 		/* check that page has been initialized */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 40b7bca5a9..4608140bb5 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -905,6 +905,19 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.hit,
+            s.skip_init,
+            s.skip_new,
+            s.skip_fpw,
+            s.wal_distance,
+            s.block_distance,
+            s.io_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 78c073b7c9..d41ae37090 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -211,7 +211,8 @@ XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 	blkno = fsm_logical_to_physical(addr);
 
 	/* If the page doesn't exist already, extend */
-	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR);
+	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR,
+								 InvalidBuffer);
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(buf);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index cd4ebe2fc5..17f54b153b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -119,6 +120,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, LockShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
+	size = add_size(size, XLogPrefetchShmemSize());
 	size = add_size(size, XLOGShmemSize());
 	size = add_size(size, XLogRecoveryShmemSize());
 	size = add_size(size, CLOGShmemSize());
@@ -243,6 +245,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	XLogRecoveryShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6d11f9c71b..8a156250b7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -41,6 +41,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
@@ -215,6 +216,7 @@ static bool check_effective_io_concurrency(int *newval, void **extra, GucSource
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
 static bool check_client_connection_check_interval(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1318,6 +1320,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch referenced blocks during recovery"),
+			gettext_noop("Read ahead of the current replay position to find uncached blocks.")
+		},
+		&recovery_prefetch,
+		false,
+		check_recovery_prefetch, assign_recovery_prefetch, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2789,6 +2800,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
+			gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
+			GUC_UNIT_BYTE
+		},
+		&wal_decode_buffer_size,
+		512 * 1024, 64 * 1024, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the size of WAL files held for standby servers."),
@@ -3112,7 +3134,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -12208,6 +12231,20 @@ check_client_connection_check_interval(int *newval, void **extra, GucSource sour
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4a094bb38b..e2838ad4cd 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -241,6 +241,11 @@
 #max_wal_size = 1GB
 #min_wal_size = 80MB
 
+# - Prefetching during recovery -
+
+#wal_decode_buffer_size = 512kB		# lookahead window used for prefetching
+#recovery_prefetch = off		# prefetch pages referenced in the WAL?
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4b45ac64db..4aee501d64 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -50,6 +50,7 @@ extern bool *wal_consistency_checking;
 extern char *wal_consistency_checking_string;
 extern bool log_checkpoints;
 extern bool track_wal_io_timing;
+extern int	wal_decode_buffer_size;
 
 extern int	CheckPointSegments;
 
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
new file mode 100644
index 0000000000..f5bdb920d5
--- /dev/null
+++ b/src/include/access/xlogprefetcher.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetcher.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCHER_H
+#define XLOGPREFETCHER_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern bool recovery_prefetch;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+
+extern void XLogPrefetchReconfigure(void);
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchRequestResetStats(void);
+
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogReaderState *reader);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+
+extern XLogReaderState *XLogPrefetcherReader(XLogPrefetcher *prefetcher);
+
+extern void XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
+									XLogRecPtr recPtr);
+
+extern XLogRecord *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
+											char **errmsg);
+
+#endif
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index debd78545c..86a26a9231 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -39,6 +39,7 @@
 #endif
 
 #include "access/xlogrecord.h"
+#include "storage/buf.h"
 
 /* WALOpenSegment represents a WAL segment being read. */
 typedef struct WALOpenSegment
@@ -125,6 +126,9 @@ typedef struct
 	ForkNumber	forknum;
 	BlockNumber blkno;
 
+	/* Prefetching workspace. */
+	Buffer		prefetch_buffer;
+
 	/* copy of the fork_flags field from the XLogRecordBlockHeader */
 	uint8		flags;
 
@@ -427,5 +431,9 @@ extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *
 extern bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 							   RelFileNode *rnode, ForkNumber *forknum,
 							   BlockNumber *blknum);
+extern bool XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+								RelFileNode *rnode, ForkNumber *forknum,
+								BlockNumber *blknum,
+								Buffer *prefetch_buffer);
 
 #endif							/* XLOGREADER_H */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 64708949db..ff40f96e42 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -84,7 +84,8 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 													Buffer *buf);
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-									 BlockNumber blkno, ReadBufferMode mode);
+									 BlockNumber blkno, ReadBufferMode mode,
+									 Buffer recent_buffer);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d8e8715ed1..534ad0a5fb 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6360,6 +6360,14 @@
   prorettype => 'text', proargtypes => '',
   prosrc => 'pg_get_wal_replay_pause_state' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,int4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,hit,skip_init,skip_new,skip_fpw,wal_distance,block_distance,io_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ea774968f0..de59b08772 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -450,4 +450,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetcher.c */
+extern bool check_recovery_prefetch(bool *new_value, void **extra, GucSource source);
+extern void assign_recovery_prefetch(bool new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac468568a1..8ad54191cd 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1857,6 +1857,16 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.hit,
+    s.skip_init,
+    s.skip_new,
+    s.skip_fpw,
+    s.wal_distance,
+    s.block_distance,
+    s.io_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, hit, skip_init, skip_new, skip_fpw, wal_distance, block_distance, io_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a5cd996089..46c0c199e9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1408,6 +1408,9 @@ LogicalRepWorker
 LogicalRewriteMappingData
 LogicalTape
 LogicalTapeSet
+LsnReadQueue
+LsnReadQueueNextFun
+LsnReadQueueNextStatus
 LtreeGistOptions
 LtreeSignature
 MAGIC
@@ -2941,6 +2944,10 @@ XLogPageHeaderData
 XLogPageReadCB
 XLogPageReadPrivate
 XLogPageReadResult
+XLogPrefetcher
+XLogPrefetcherFilter
+XLogPrefetchState
+XLogPrefetchStats
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.30.2

#148Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Thomas Munro (#147)
Re: WIP: WAL prefetch (another approach)

On 3/8/22 06:15, Thomas Munro wrote:

On Wed, Dec 29, 2021 at 5:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:

https://github.com/macdice/postgres/tree/recovery-prefetch-ii

Here's a rebase. This mostly involved moving hunks over to the new
xlogrecovery.c file. One thing that seemed a little strange to me
with the new layout is that xlogreader is now a global variable. I
followed that pattern and made xlogprefetcher a global variable too,
for now.

There is one functional change: now I block readahead at records that
might change the timeline ID. This removes the need to think about
scenarios where "replay TLI" and "read TLI" might differ. I don't
know of a concrete problem in that area with the previous version, but
the recent introduction of the variable(s) "replayTLI" and associated
comments in master made me realise I hadn't analysed the hazards here
enough. Since timelines are tricky things and timeline changes are
extremely infrequent, it seemed better to simplify matters by putting
up a big road block there.

I'm now starting to think about committing this soon.

+1. I don't have the capacity/hardware to do more testing at the moment,
but all of this looks reasonable.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#149Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#147)
Re: WIP: WAL prefetch (another approach)

Hi,

On 2022-03-08 18:15:43 +1300, Thomas Munro wrote:

I'm now starting to think about committing this soon.

+1

Are you thinking of committing both patches at once, or with a bit of
distance?

I think something in the regression tests ought to enable
recovery_prefetch. 027_stream_regress or 001_stream_rep seem like the obvious
candidates?

- Andres

#150Julien Rouhaud
rjuju123@gmail.com
In reply to: Thomas Munro (#147)
Re: WIP: WAL prefetch (another approach)

Hi,

On Tue, Mar 08, 2022 at 06:15:43PM +1300, Thomas Munro wrote:

On Wed, Dec 29, 2021 at 5:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:

https://github.com/macdice/postgres/tree/recovery-prefetch-ii

Here's a rebase. This mostly involved moving hunks over to the new
xlogrecovery.c file. One thing that seemed a little strange to me
with the new layout is that xlogreader is now a global variable. I
followed that pattern and made xlogprefetcher a global variable too,
for now.

For now I went through 0001; TL;DR, the patch looks good to me. I have a few
minor comments though, mostly to make things a bit clearer (at least to me).

diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 2340dc247b..c129df44ac 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -407,10 +407,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
     * add an accessor macro for this.
     */
    *fpi_len = 0;
+   for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
    {
        if (XLogRecHasBlockImage(record, block_id))
-           *fpi_len += record->blocks[block_id].bimg_len;
+           *fpi_len += record->record->blocks[block_id].bimg_len;
    }
(and similar in that file, xlogutils.c and xlogreader.c)

This could use XLogRecGetBlock? Note that this macro is for now never used.

xlogreader.c also has some similar forgotten code that could use
XLogRecMaxBlockId.

+ * See if we can release the last record that was returned by
+ * XLogNextRecord(), to free up space.
+ */
+void
+XLogReleasePreviousRecord(XLogReaderState *state)

The comment seems a bit misleading: I first read it as saying that the release
could be optional even if the record exists. Maybe something more like "Release
the last record if any"?

+    * Remove it from the decoded record queue.  It must be the oldest item
+    * decoded, decode_queue_tail.
+    */
+   record = state->record;
+   Assert(record == state->decode_queue_tail);
+   state->record = NULL;
+   state->decode_queue_tail = record->next;

The naming is a bit counterintuitive to me, as before reading the rest of the
code I wasn't expecting the item at the tail of the queue to have a next
element. Maybe just inverting tail and head would make it clearer?

+DecodedXLogRecord *
+XLogNextRecord(XLogReaderState *state, char **errormsg)
+{
[...]
+       /*
+        * state->EndRecPtr is expected to have been set by the last call to
+        * XLogBeginRead() or XLogNextRecord(), and is the location of the
+        * error.
+        */
+
+       return NULL;

The comment should refer to XLogFindNextRecord, not XLogNextRecord?
Also, is it worth an assert (likely at the top of the function) for that?

 XLogRecord *
 XLogReadRecord(XLogReaderState *state, char **errormsg)
+{
[...]
+   if (decoded)
+   {
+       /*
+        * XLogReadRecord() returns a pointer to the record's header, not the
+        * actual decoded record.  The caller will access the decoded record
+        * through the XLogRecGetXXX() macros, which reach the decoded
+        * recorded as xlogreader->record.
+        */
+       Assert(state->record == decoded);
+       return &decoded->header;

I find it a bit weird to mention XLogReadRecord() as it's the current function.

+/*
+ * Allocate space for a decoded record.  The only member of the returned
+ * object that is initialized is the 'oversized' flag, indicating that the
+ * decoded record wouldn't fit in the decode buffer and must eventually be
+ * freed explicitly.
+ *
+ * Return NULL if there is no space in the decode buffer and allow_oversized
+ * is false, or if memory allocation fails for an oversized buffer.
+ */
+static DecodedXLogRecord *
+XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)

Is it worth clearly stating that it's the responsibility of the caller to update
the decode_buffer_head (with the real size) after a successful decoding of this
buffer?

+   if (unlikely(state->decode_buffer == NULL))
+   {
+       if (state->decode_buffer_size == 0)
+           state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+       state->decode_buffer = palloc(state->decode_buffer_size);
+       state->decode_buffer_head = state->decode_buffer;
+       state->decode_buffer_tail = state->decode_buffer;
+       state->free_decode_buffer = true;
+   }

Maybe change XLogReaderSetDecodeBuffer to also handle allocation and use it
here too? Otherwise XLogReaderSetDecodeBuffer should probably go in 0002 as
the only caller is the recovery prefetching.

+ return decoded;
+}

I would find it a bit clearer to explicitly return NULL here.

    readOff = ReadPageInternal(state, targetPagePtr,
                               Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
-   if (readOff < 0)
+   if (readOff == XLREAD_WOULDBLOCK)
+       return XLREAD_WOULDBLOCK;
+   else if (readOff < 0)

ReadPageInternal comment should be updated to mention the new XLREAD_WOULDBLOCK
possible return value.

It's also not particularly obvious why XLogFindNextRecord() doesn't check for
this value. AFAICS callers don't (and should never) call it with a
nonblocking == true state, maybe add an assert for that?

@@ -468,7 +748,7 @@ restart:
            if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD)
            {
                state->overwrittenRecPtr = RecPtr;
-               ResetDecoder(state);
+               //ResetDecoder(state);

AFAICS this is indeed not necessary anymore, so it can be removed?

 static void
 ResetDecoder(XLogReaderState *state)
 {
[...]
+   /* Reset the decoded record queue, freeing any oversized records. */
+   while ((r = state->decode_queue_tail))

nit: I think it's better to explicitly check for the assignment being != NULL,
and existing code is more frequently written this way AFAICS.

+/* Return values from XLogPageReadCB. */
+typedef enum XLogPageReadResultResult

typo

#151Thomas Munro
thomas.munro@gmail.com
In reply to: Julien Rouhaud (#150)
2 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Wed, Mar 9, 2022 at 7:47 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

For now I went through 0001; TL;DR, the patch looks good to me. I have a few
minor comments though, mostly to make things a bit clearer (at least to me).

Hi Julien,

Thanks for your review of 0001! It gave me a few things to think
about and some good improvements.

diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 2340dc247b..c129df44ac 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -407,10 +407,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
* add an accessor macro for this.
*/
*fpi_len = 0;
+   for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
{
if (XLogRecHasBlockImage(record, block_id))
-           *fpi_len += record->blocks[block_id].bimg_len;
+           *fpi_len += record->record->blocks[block_id].bimg_len;
}
(and similar in that file, xlogutils.c and xlogreader.c)

This could use XLogRecGetBlock? Note that this macro is for now never used.

Yeah, I think that is a good idea for pg_waldump.c and xlogutils.c. Done.
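
In case it's useful to see it spelled out, the pg_waldump.c loop ends up
looking roughly like this (a sketch only, assuming XLogRecGetBlock(record, i)
returns a pointer to the decoded block for block_id i, as 0001 defines it):

	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
	{
		if (XLogRecHasBlockImage(record, block_id))
			*fpi_len += XLogRecGetBlock(record, block_id)->bimg_len;
	}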

xlogreader.c also has some similar forgotten code that could use
XLogRecMaxBlockId.

That is true, but I was thinking of it like this: most of the existing
code that interacts with xlogreader.c is working with the old model,
where the XLogReader object holds only one "current" record. For that
reason the XLogRecXXX() macros continue to work as before, implicitly
referring to the record that XLogReadRecord() most recently returned.
For xlogreader.c code, I prefer not to use the XLogRecXXX() macros,
even when referring to the "current" record, since xlogreader.c has
switched to a new multi-record model. In other words, they're sort of
'old API' accessors provided for continuity. Does this make sense?
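
To make the two models concrete, here's roughly how a caller sees them (an
illustrative fragment only: the functions, macros and struct fields are from
0001, but the surrounding loops and the xlogreader/errormsg variables are
just assumed to exist and to have been set up with XLogBeginRead()):

	char	   *errormsg;

	/* Old interface: one "current" record, exposed via XLogRecGetXXX(). */
	for (;;)
	{
		XLogRecord *record = XLogReadRecord(xlogreader, &errormsg);

		if (record == NULL)
			break;
		/* XLogRecGetRmid(xlogreader) etc. refer to this record. */
	}

	/* New interface: queue up decoded records, then consume them one by one. */
	while (XLogReadAhead(xlogreader, true /* nonblocking */ ) != NULL)
		;
	for (;;)
	{
		DecodedXLogRecord *decoded = XLogNextRecord(xlogreader, &errormsg);

		if (decoded == NULL)
			break;
		/* decoded->lsn, decoded->next_lsn and decoded->header are available. */
	}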

+ * See if we can release the last record that was returned by
+ * XLogNextRecord(), to free up space.
+ */
+void
+XLogReleasePreviousRecord(XLogReaderState *state)

The comment seems a bit misleading: I first read it as saying that the release
could be optional even if the record exists. Maybe something more like "Release
the last record if any"?

Done.

+    * Remove it from the decoded record queue.  It must be the oldest item
+    * decoded, decode_queue_tail.
+    */
+   record = state->record;
+   Assert(record == state->decode_queue_tail);
+   state->record = NULL;
+   state->decode_queue_tail = record->next;

The naming is a bit counterintuitive to me, as before reading the rest of the
code I wasn't expecting the item at the tail of the queue to have a next
element. Maybe just inverting tail and head would make it clearer?

Yeah, after mulling this over for a day, I agree. I've flipped it around.

Explanation: You're quite right, singly-linked lists traditionally
have a 'tail' that points to null, so it makes sense for new items to
be added there and older items to be consumed from the 'head' end, as
you expected. But... it's also typical (I think?) in ring buffers AKA
circular buffers to insert at the 'head', and remove from the 'tail'.
This code has both a linked-list (the chain of decoded records with a
->next pointer), and the underlying storage, which is a circular
buffer of bytes. I didn't want them to use opposite terminology, and
since I started by writing the ring buffer part, that's where I
finished up... I agree that it's an improvement to flip them.
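
For anyone following along, a toy version of the space check with the v22
orientation looks like this (not the real code, just the same idea, with
made-up names: head is where the oldest record lives, tail is where the next
record will be written):

	typedef struct ToyRing
	{
		char	   *buffer;		/* start of the circular storage */
		size_t		size;
		char	   *head;		/* oldest decoded record starts here */
		char	   *tail;		/* next record will be written here */
	} ToyRing;

	static bool
	toy_have_space(ToyRing *ring, size_t need)
	{
		if (ring->tail >= ring->head)
		{
			/*
			 * Empty, or tail is to the right of head: try the space after
			 * tail first, then try wrapping around to the start.
			 */
			return (ring->tail + need <= ring->buffer + ring->size) ||
				(ring->buffer + need < ring->head);
		}

		/* Tail has wrapped and is to the left of head. */
		return ring->tail + need < ring->head;
	}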

+DecodedXLogRecord *
+XLogNextRecord(XLogReaderState *state, char **errormsg)
+{
[...]
+       /*
+        * state->EndRecPtr is expected to have been set by the last call to
+        * XLogBeginRead() or XLogNextRecord(), and is the location of the
+        * error.
+        */
+
+       return NULL;

The comment should refer to XLogFindNextRecord, not XLogNextRecord?

No, it does mean to refer to XLogNextRecord() (i.e. the last time you
called XLogNextRecord() and successfully dequeued a record, we put its
end LSN there, so if there is a deferred error, that's the
corresponding LSN). Make sense?

Also, is it worth an assert (likely at the top of the function) for that?

How could I assert that EndRecPtr has the right value?

XLogRecord *
XLogReadRecord(XLogReaderState *state, char **errormsg)
+{
[...]
+   if (decoded)
+   {
+       /*
+        * XLogReadRecord() returns a pointer to the record's header, not the
+        * actual decoded record.  The caller will access the decoded record
+        * through the XLogRecGetXXX() macros, which reach the decoded
+        * recorded as xlogreader->record.
+        */
+       Assert(state->record == decoded);
+       return &decoded->header;

I find it a bit weird to mention XLogReadRecord() as it's the current function.

Changed to "This function ...".

+/*
+ * Allocate space for a decoded record.  The only member of the returned
+ * object that is initialized is the 'oversized' flag, indicating that the
+ * decoded record wouldn't fit in the decode buffer and must eventually be
+ * freed explicitly.
+ *
+ * Return NULL if there is no space in the decode buffer and allow_oversized
+ * is false, or if memory allocation fails for an oversized buffer.
+ */
+static DecodedXLogRecord *
+XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)

Is it worth clearly stating that it's the responsibility of the caller to update
the decode_buffer_head (with the real size) after a successful decoding of this
buffer?

Comment added.

+   if (unlikely(state->decode_buffer == NULL))
+   {
+       if (state->decode_buffer_size == 0)
+           state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+       state->decode_buffer = palloc(state->decode_buffer_size);
+       state->decode_buffer_head = state->decode_buffer;
+       state->decode_buffer_tail = state->decode_buffer;
+       state->free_decode_buffer = true;
+   }

Maybe change XLogReaderSetDecodeBuffer to also handle allocation and use it
here too? Otherwise XLogReaderSetDecodeBuffer should probably go in 0002 as
the only caller is the recovery prefetching.

I don't think it matters much?

+ return decoded;
+}

I would find it a bit clearer to explicitly return NULL here.

Done.

readOff = ReadPageInternal(state, targetPagePtr,
Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
-   if (readOff < 0)
+   if (readOff == XLREAD_WOULDBLOCK)
+       return XLREAD_WOULDBLOCK;
+   else if (readOff < 0)

ReadPageInternal comment should be updated to mention the new XLREAD_WOULDBLOCK
possible return value.

Yeah. Done.

It's also not particularly obvious why XLogFindNextRecord() doesn't check for
this value. AFAICS callers don't (and should never) call it with a
nonblocking == true state, maybe add an assert for that?

Fair point. I have now explicitly cleared that flag. (I don't much
like state->nonblocking, which might be better as an argument to
page_read(), but in fact I don't like the fact that page_read
callbacks are blocking in the first place, which is why I liked
Horiguchi-san's patch to get rid of that... but that can be a subject
for later work.)
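
For illustration, a page_read callback that plays along with the new flag
would look something like this (a sketch only: wal_available_up_to() and
copy_wal_page() are made-up helpers, while the callback signature,
xlogreader->nonblocking, XLREAD_WOULDBLOCK and XLREAD_FAIL are from the
patch):

	static int
	example_page_read(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
					  int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
	{
		/* During look-ahead, don't wait for WAL that hasn't arrived yet. */
		if (xlogreader->nonblocking &&
			!wal_available_up_to(targetPagePtr + reqLen))
			return XLREAD_WOULDBLOCK;

		/* Blocking path: wait if necessary, then copy the page into readBuf. */
		if (!copy_wal_page(targetPagePtr, readBuf))
			return XLREAD_FAIL;

		return XLOG_BLCKSZ;		/* number of valid bytes now in readBuf */
	}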

@@ -468,7 +748,7 @@ restart:
if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD)
{
state->overwrittenRecPtr = RecPtr;
-               ResetDecoder(state);
+               //ResetDecoder(state);

AFAICS this is indeed not necessary anymore, so it can be removed?

Oops, yeah I use C++ comments when there's something I intended to
remove. Done.

static void
ResetDecoder(XLogReaderState *state)
{
[...]
+   /* Reset the decoded record queue, freeing any oversized records. */
+   while ((r = state->decode_queue_tail))

nit: I think it's better to explicitly check for the assignment being != NULL,
and existing code is more frequently written this way AFAICS.

I think it's perfectly normal idiomatic C, but if you think it's
clearer that way, OK, done like that.

+/* Return values from XLogPageReadCB. */
+typedef enum XLogPageReadResultResult

typo

Fixed.

I realised that this version has broken -DWAL_DEBUG. I'll fix that
shortly, but I wanted to post this update ASAP, so here's a new
version. The other thing I need to change is that I should turn on
recovery_prefetch in the tests for platforms that support it (i.e.
Linux and maybe NetBSD only, for now). Right now you need to put
recovery_prefetch=on in a file and then run the tests with
"TEMP_CONFIG=path_to_that make -C src/test/recovery check" to
exercise much of 0002.

Attachments:

v22-0001-Add-circular-WAL-decoding-buffer-take-II.patch (text/x-patch, US-ASCII)
From e08d29de5ae85746e7a9c5a5539aa2c531d9c6e5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 9 Nov 2021 16:33:10 +1300
Subject: [PATCH v22 1/2] Add circular WAL decoding buffer, take II.

Teach xlogreader.c to decode its output into a circular buffer, to
support upcoming optimizations based on looking ahead.

 * XLogReadRecord() works as before, consuming records one by one, and
   allowing them to be examined via the traditional XLogRecGetXXX()
   macros, and the traditional members like xlogreader->ReadRecPtr.

 * An alternative new interface XLogReadAhead()/XLogNextRecord() is
   added that returns pointers to DecodedXLogRecord
   objects so that it's possible to look ahead in the WAL stream.

 * In order to be able to use the new interface effectively, client
   code should provide a page_read() callback that responds to
   a new nonblocking mode by returning XLREAD_WOULDBLOCK to avoid
   waiting.  No such implementation is included in this commit,
   and other code that is unaware of the new mechanism doesn't need
   to change.

The buffer's size can be set by the client of xlogreader.c.  Large
records that don't fit in the circular buffer are called "oversized" and
allocated separately with palloc().

Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/access/transam/generic_xlog.c |   6 +-
 src/backend/access/transam/xlog.c         |   2 +-
 src/backend/access/transam/xlogreader.c   | 656 +++++++++++++++++-----
 src/backend/access/transam/xlogrecovery.c |   4 +-
 src/backend/access/transam/xlogutils.c    |   2 +-
 src/backend/replication/logical/decode.c  |   2 +-
 src/bin/pg_rewind/parsexlog.c             |   2 +-
 src/bin/pg_waldump/pg_waldump.c           |  25 +-
 src/include/access/xlogreader.h           | 153 ++++-
 src/tools/pgindent/typedefs.list          |   2 +
 10 files changed, 675 insertions(+), 179 deletions(-)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 4b0c63817f..bbb542b322 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -482,10 +482,10 @@ generic_redo(XLogReaderState *record)
 	uint8		block_id;
 
 	/* Protect limited size of buffers[] array */
-	Assert(record->max_block_id < MAX_GENERIC_XLOG_PAGES);
+	Assert(XLogRecMaxBlockId(record) < MAX_GENERIC_XLOG_PAGES);
 
 	/* Iterate over blocks */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		XLogRedoAction action;
 
@@ -525,7 +525,7 @@ generic_redo(XLogReaderState *record)
 	}
 
 	/* Changes are done: unlock and release all buffers */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (BufferIsValid(buffers[block_id]))
 			UnlockReleaseBuffer(buffers[block_id]);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a357..2b4e591736 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7736,7 +7736,7 @@ xlog_redo(XLogReaderState *record)
 		 * resource manager needs to generate conflicts, it has to define a
 		 * separate WAL record type and redo routine.
 		 */
-		for (uint8 block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (uint8 block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			Buffer		buffer;
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index b7c06da255..1a8c651767 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -45,6 +45,7 @@ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
+static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool non_blocking);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -56,6 +57,12 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 /* size of the buffer allocated for error message. */
 #define MAX_ERRORMSG_LEN 1000
 
+/*
+ * Default size; large enough that typical users of XLogReader won't often need
+ * to use the 'oversized' memory allocation code path.
+ */
+#define DEFAULT_DECODE_BUFFER_SIZE (64 * 1024)
+
 /*
  * Construct a string in state->errormsg_buf explaining what's wrong with
  * the current record being read.
@@ -70,6 +77,24 @@ report_invalid_record(XLogReaderState *state, const char *fmt,...)
 	va_start(args, fmt);
 	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
 	va_end(args);
+
+	state->errormsg_deferred = true;
+}
+
+/*
+ * Set the size of the decoding buffer.  A pointer to a caller supplied memory
+ * region may also be passed in, in which case non-oversized records will be
+ * decoded there.
+ */
+void
+XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size)
+{
+	Assert(state->decode_buffer == NULL);
+
+	state->decode_buffer = buffer;
+	state->decode_buffer_size = size;
+	state->decode_buffer_tail = buffer;
+	state->decode_buffer_head = buffer;
 }
 
 /*
@@ -92,8 +117,6 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	/* initialize caller-provided support functions */
 	state->routine = *routine;
 
-	state->max_block_id = -1;
-
 	/*
 	 * Permanently allocate readBuf.  We do it this way, rather than just
 	 * making a static array, for two reasons: (1) no need to waste the
@@ -144,18 +167,11 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 void
 XLogReaderFree(XLogReaderState *state)
 {
-	int			block_id;
-
 	if (state->seg.ws_file != -1)
 		state->routine.segment_close(state);
 
-	for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
-	{
-		if (state->blocks[block_id].data)
-			pfree(state->blocks[block_id].data);
-	}
-	if (state->main_data)
-		pfree(state->main_data);
+	if (state->decode_buffer && state->free_decode_buffer)
+		pfree(state->decode_buffer);
 
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
@@ -251,7 +267,132 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
 
 	/* Begin at the passed-in record pointer. */
 	state->EndRecPtr = RecPtr;
+	state->NextRecPtr = RecPtr;
 	state->ReadRecPtr = InvalidXLogRecPtr;
+	state->DecodeRecPtr = InvalidXLogRecPtr;
+}
+
+/*
+ * See if we can release the last record that was returned by
+ * XLogNextRecord(), if any, to free up space.
+ */
+void
+XLogReleasePreviousRecord(XLogReaderState *state)
+{
+	DecodedXLogRecord *record;
+
+	if (!state->record)
+		return;
+
+	/*
+	 * Remove it from the decoded record queue.  It must be the oldest item
+	 * decoded, decode_queue_head.
+	 */
+	record = state->record;
+	Assert(record == state->decode_queue_head);
+	state->record = NULL;
+	state->decode_queue_head = record->next;
+
+	/* It might also be the newest item decoded, decode_queue_tail. */
+	if (state->decode_queue_tail == record)
+		state->decode_queue_tail = NULL;
+
+	/* Release the space. */
+	if (unlikely(record->oversized))
+	{
+		/* It's not in the decode buffer, so free it to release space. */
+		pfree(record);
+	}
+	else
+	{
+		/* It must be the head (oldest) record in the decode buffer. */
+		Assert(state->decode_buffer_head == (char *) record);
+
+		/*
+		 * We need to update head to point to the next record that is in the
+		 * decode buffer, if any, being careful to skip oversized ones
+		 * (they're not in the decode buffer).
+		 */
+		record = record->next;
+		while (unlikely(record && record->oversized))
+			record = record->next;
+
+		if (record)
+		{
+			/* Adjust head to release space up to the next record. */
+			state->decode_buffer_head = (char *) record;
+		}
+		else
+		{
+			/*
+			 * Otherwise we might as well just reset head and tail to the
+			 * start of the buffer space, because we're empty.  This means
+			 * we'll keep overwriting the same piece of memory if we're not
+			 * doing any prefetching.
+			 */
+			state->decode_buffer_head = state->decode_buffer;
+			state->decode_buffer_tail = state->decode_buffer;
+		}
+	}
+}
+
+/*
+ * Attempt to read an XLOG record.
+ *
+ * XLogBeginRead() or XLogFindNextRecord() and then XLogReadAhead() must be
+ * called before the first call to XLogNextRecord().  This functions returns
+ * records and errors that were put into an internal queue by XLogReadAhead().
+ *
+ * On success, a record is returned.
+ *
+ * The returned record (or *errormsg) points to an internal buffer that's
+ * valid until the next call to XLogNextRecord.
+ */
+DecodedXLogRecord *
+XLogNextRecord(XLogReaderState *state, char **errormsg)
+{
+	/* Release the last record returned by XLogNextRecord(). */
+	XLogReleasePreviousRecord(state);
+
+	if (state->decode_queue_head == NULL)
+	{
+		*errormsg = NULL;
+		if (state->errormsg_deferred)
+		{
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			state->errormsg_deferred = false;
+		}
+
+		/*
+		 * state->EndRecPtr is expected to have been set by the last call to
+		 * XLogBeginRead() or XLogNextRecord(), and is the location of the
+		 * error.
+		 */
+
+		return NULL;
+	}
+
+	/*
+	 * Record this as the most recent record returned, so that we'll release
+	 * it next time.  This also exposes it to the traditional
+	 * XLogRecXXX(xlogreader) macros, which work with the decoder rather than
+	 * the record for historical reasons.
+	 */
+	state->record = state->decode_queue_head;
+
+	/*
+	 * Update the pointers to the beginning and one-past-the-end of this
+	 * record, again for the benefit of historical code that expected the
+	 * decoder to track this rather than accessing these fields of the record
+	 * itself.
+	 */
+	state->ReadRecPtr = state->record->lsn;
+	state->EndRecPtr = state->record->next_lsn;
+
+	*errormsg = NULL;
+
+	return state->record;
 }
 
 /*
@@ -261,17 +402,132 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  * to XLogReadRecord().
  *
  * If the page_read callback fails to read the requested data, NULL is
- * returned.  The callback is expected to have reported the error; errormsg
- * is set to NULL.
+ * returned.  The callback is expected to have reported the error; errormsg is
+ * set to NULL.
  *
  * If the reading fails for some other reason, NULL is also returned, and
  * *errormsg is set to a string with details of the failure.
  *
- * The returned pointer (or *errormsg) points to an internal buffer that's
- * valid until the next call to XLogReadRecord.
+ * On success, a record is returned.
+ *
+ * The returned record (or *errormsg) points to an internal buffer that's
 + * valid until the next call to XLogReadRecord.
  */
 XLogRecord *
 XLogReadRecord(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *decoded;
+
+	/*
+	 * Release last returned record, if there is one.  We need to do this so
+	 * that we can check for empty decode queue accurately.
+	 */
+	XLogReleasePreviousRecord(state);
+
+	/*
+	 * Call XLogReadAhead() in blocking mode to make sure there is something
+	 * in the queue, though we don't use the result.
+	 */
+	if (!XLogReaderHasQueuedRecordOrError(state))
+		XLogReadAhead(state, false /* nonblocking */ );
+
+	/* Consume the head record or error. */
+	decoded = XLogNextRecord(state, errormsg);
+	if (decoded)
+	{
+		/*
+		 * This function returns a pointer to the record's header, not the
+		 * actual decoded record.  The caller will access the decoded record
+		 * through the XLogRecGetXXX() macros, which reach the decoded
+		 * record via xlogreader->record.
+		 */
+		Assert(state->record == decoded);
+		return &decoded->header;
+	}
+
+	return NULL;
+}
+
+/*
+ * Allocate space for a decoded record.  The only member of the returned
+ * object that is initialized is the 'oversized' flag, indicating that the
+ * decoded record wouldn't fit in the decode buffer and must eventually be
+ * freed explicitly.
+ *
+ * The caller is responsible for adjusting decode_buffer_tail with the real
+ * size after successfully decoding a record into this space.  This way, if
+ * decoding fails, then there is nothing to undo unless the 'oversized' flag
+ * was set and pfree() must be called.
+ *
+ * Return NULL if there is no space in the decode buffer and allow_oversized
+ * is false, or if memory allocation fails for an oversized buffer.
+ */
+static DecodedXLogRecord *
+XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)
+{
+	size_t		required_space = DecodeXLogRecordRequiredSpace(xl_tot_len);
+	DecodedXLogRecord *decoded = NULL;
+
+	/* Allocate a circular decode buffer if we don't have one already. */
+	if (unlikely(state->decode_buffer == NULL))
+	{
+		if (state->decode_buffer_size == 0)
+			state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+		state->decode_buffer = palloc(state->decode_buffer_size);
+		state->decode_buffer_head = state->decode_buffer;
+		state->decode_buffer_tail = state->decode_buffer;
+		state->free_decode_buffer = true;
+	}
+
+	/* Try to allocate space in the circular decode buffer. */
+	if (state->decode_buffer_tail >= state->decode_buffer_head)
+	{
+		/* Empty, or tail is to the right of head. */
+		if (state->decode_buffer_tail + required_space <=
+			state->decode_buffer + state->decode_buffer_size)
+		{
+			/* There is space between tail and end. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_tail;
+			decoded->oversized = false;
+			return decoded;
+		}
+		else if (state->decode_buffer + required_space <
+				 state->decode_buffer_head)
+		{
+			/* There is space between start and head. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+	else
+	{
+		/* Tail is to the left of head. */
+		if (state->decode_buffer_tail + required_space <
+			state->decode_buffer_head)
+		{
+			/* There is space between tail and head. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_tail;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+
+	/* Not enough space in the decode buffer.  Are we allowed to allocate? */
+	if (allow_oversized)
+	{
+		decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
+		if (decoded == NULL)
+			return NULL;
+		decoded->oversized = true;
+		return decoded;
+	}
+
+	return NULL;
+}
+
+static XLogPageReadResult
+XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 {
 	XLogRecPtr	RecPtr;
 	XLogRecord *record;
@@ -284,6 +540,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	bool		assembled;
 	bool		gotheader;
 	int			readOff;
+	DecodedXLogRecord *decoded;
+	char	   *errormsg;		/* not used */
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -293,21 +551,20 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	randAccess = false;
 
 	/* reset error state */
-	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	decoded = NULL;
 
-	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
 
-	RecPtr = state->EndRecPtr;
+	RecPtr = state->NextRecPtr;
 
-	if (state->ReadRecPtr != InvalidXLogRecPtr)
+	if (state->DecodeRecPtr != InvalidXLogRecPtr)
 	{
 		/* read the record after the one we just read */
 
 		/*
-		 * EndRecPtr is pointing to end+1 of the previous WAL record.  If
+		 * NextRecPtr is pointing to end+1 of the previous WAL record.  If
 		 * we're at a page boundary, no more records can fit on the current
 		 * page. We must skip over the page header, but we can't do that until
 		 * we've read in the page, since the header size is variable.
@@ -318,7 +575,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		/*
 		 * Caller supplied a position to start at.
 		 *
-		 * In this case, EndRecPtr should already be pointing to a valid
+		 * In this case, NextRecPtr should already be pointing to a valid
 		 * record starting position.
 		 */
 		Assert(XRecOffIsValid(RecPtr));
@@ -326,6 +583,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	}
 
 restart:
+	state->nonblocking = nonblocking;
 	state->currRecPtr = RecPtr;
 	assembled = false;
 
@@ -339,7 +597,9 @@ restart:
 	 */
 	readOff = ReadPageInternal(state, targetPagePtr,
 							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
-	if (readOff < 0)
+	if (readOff == XLREAD_WOULDBLOCK)
+		return XLREAD_WOULDBLOCK;
+	else if (readOff < 0)
 		goto err;
 
 	/*
@@ -395,7 +655,7 @@ restart:
 	 */
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
-		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
+		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
 								   randAccess))
 			goto err;
 		gotheader = true;
@@ -414,6 +674,31 @@ restart:
 		gotheader = false;
 	}
 
+	/*
+	 * Find space to decode this record.  Don't allow oversized allocation if
+	 * the caller requested nonblocking.  Otherwise, we *have* to try to
+	 * decode the record now because the caller has nothing else to do, so
+	 * allow an oversized record to be palloc'd if that turns out to be
+	 * necessary.
+	 */
+	decoded = XLogReadRecordAlloc(state,
+								  total_len,
+								  !nonblocking /* allow_oversized */ );
+	if (decoded == NULL)
+	{
+		/*
+		 * There is no space in the decode buffer.  The caller should help
+		 * with that problem by consuming some records.
+		 */
+		if (nonblocking)
+			return XLREAD_WOULDBLOCK;
+
+		/* We failed to allocate memory for an oversized record. */
+		report_invalid_record(state,
+							  "out of memory while trying to decode a record of length %u", total_len);
+		goto err;
+	}
+
 	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
 	if (total_len > len)
 	{
@@ -453,7 +738,9 @@ restart:
 									   Min(total_len - gotlen + SizeOfXLogShortPHD,
 										   XLOG_BLCKSZ));
 
-			if (readOff < 0)
+			if (readOff == XLREAD_WOULDBLOCK)
+				return XLREAD_WOULDBLOCK;
+			else if (readOff < 0)
 				goto err;
 
 			Assert(SizeOfXLogShortPHD <= readOff);
@@ -471,7 +758,6 @@ restart:
 			if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD)
 			{
 				state->overwrittenRecPtr = RecPtr;
-				ResetDecoder(state);
 				RecPtr = targetPagePtr;
 				goto restart;
 			}
@@ -526,7 +812,7 @@ restart:
 			if (!gotheader)
 			{
 				record = (XLogRecord *) state->readRecordBuf;
-				if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
+				if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr,
 										   record, randAccess))
 					goto err;
 				gotheader = true;
@@ -540,8 +826,8 @@ restart:
 			goto err;
 
 		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
-		state->ReadRecPtr = RecPtr;
-		state->EndRecPtr = targetPagePtr + pageHeaderSize
+		state->DecodeRecPtr = RecPtr;
+		state->NextRecPtr = targetPagePtr + pageHeaderSize
 			+ MAXALIGN(pageHeader->xlp_rem_len);
 	}
 	else
@@ -549,16 +835,18 @@ restart:
 		/* Wait for the record data to become available */
 		readOff = ReadPageInternal(state, targetPagePtr,
 								   Min(targetRecOff + total_len, XLOG_BLCKSZ));
-		if (readOff < 0)
+		if (readOff == XLREAD_WOULDBLOCK)
+			return XLREAD_WOULDBLOCK;
+		else if (readOff < 0)
 			goto err;
 
 		/* Record does not cross a page boundary */
 		if (!ValidXLogRecord(state, record, RecPtr))
 			goto err;
 
-		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+		state->NextRecPtr = RecPtr + MAXALIGN(total_len);
 
-		state->ReadRecPtr = RecPtr;
+		state->DecodeRecPtr = RecPtr;
 	}
 
 	/*
@@ -568,14 +856,40 @@ restart:
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->segcxt.ws_segsize - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->segcxt.ws_segsize);
+		state->NextRecPtr += state->segcxt.ws_segsize - 1;
+		state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
 	}
 
-	if (DecodeXLogRecord(state, record, errormsg))
-		return record;
+	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
+	{
+		/* Record the location of the next record. */
+		decoded->next_lsn = state->NextRecPtr;
+
+		/*
+		 * If it's in the decode buffer, mark the decode buffer space as
+		 * occupied.
+		 */
+		if (!decoded->oversized)
+		{
+			/* The new decode buffer tail must be MAXALIGNed. */
+			Assert(decoded->size == MAXALIGN(decoded->size));
+			if ((char *) decoded == state->decode_buffer)
+				state->decode_buffer_tail = state->decode_buffer + decoded->size;
+			else
+				state->decode_buffer_tail += decoded->size;
+		}
+
+		/* Insert it into the queue of decoded records. */
+		Assert(state->decode_queue_tail != decoded);
+		if (state->decode_queue_tail)
+			state->decode_queue_tail->next = decoded;
+		state->decode_queue_tail = decoded;
+		if (!state->decode_queue_head)
+			state->decode_queue_head = decoded;
+		return XLREAD_SUCCESS;
+	}
 	else
-		return NULL;
+		return XLREAD_FAIL;
 
 err:
 	if (assembled)
@@ -593,14 +907,46 @@ err:
 		state->missingContrecPtr = targetPagePtr;
 	}
 
+	if (decoded && decoded->oversized)
+		pfree(decoded);
+
 	/*
 	 * Invalidate the read state. We might read from a different source after
 	 * failure.
 	 */
 	XLogReaderInvalReadState(state);
 
-	if (state->errormsg_buf[0] != '\0')
-		*errormsg = state->errormsg_buf;
+	/*
+	 * If an error was written to errormsg_buf, it'll be returned to the
+	 * caller of XLogReadRecord() after all successfully decoded records from
+	 * the read queue have been returned.
+	 */
+
+	return XLREAD_FAIL;
+}
+
+/*
+ * Try to decode the next available record, and return it.  The record will
+ * also be returned by XLogNextRecord(), which must be called to 'consume'
+ * each record.
+ *
+ * If nonblocking is true, may return NULL due to lack of data or WAL decoding
+ * space.
+ */
+DecodedXLogRecord *
+XLogReadAhead(XLogReaderState *state, bool nonblocking)
+{
+	XLogPageReadResult result;
+
+	if (state->errormsg_deferred)
+		return NULL;
+
+	result = XLogDecodeNextRecord(state, nonblocking);
+	if (result == XLREAD_SUCCESS)
+	{
+		Assert(state->decode_queue_tail != NULL);
+		return state->decode_queue_tail;
+	}
 
 	return NULL;
 }
@@ -609,8 +955,14 @@ err:
  * Read a single xlog page including at least [pageptr, reqLen] of valid data
  * via the page_read() callback.
  *
- * Returns -1 if the required page cannot be read for some reason; errormsg_buf
- * is set in that case (unless the error occurs in the page_read callback).
+ * Returns XLREAD_FAIL if the required page cannot be read for some
+ * reason; errormsg_buf is set in that case (unless the error occurs in the
+ * page_read callback).
+ *
+ * Returns XLREAD_WOULDBLOCK if the requested data can't be read without
+ * waiting.  This can be returned only if the installed page_read callback
+ * respects the state->nonblocking flag, and cannot read the requested data
+ * immediately.
  *
  * We fetch the page from a reader-local cache if we know we have the required
  * data and if there hasn't been any error since caching the data.
@@ -652,7 +1004,9 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
 										   state->currRecPtr,
 										   state->readBuf);
-		if (readLen < 0)
+		if (readLen == XLREAD_WOULDBLOCK)
+			return XLREAD_WOULDBLOCK;
+		else if (readLen < 0)
 			goto err;
 
 		/* we can be sure to have enough WAL available, we scrolled back */
@@ -670,7 +1024,9 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 									   state->currRecPtr,
 									   state->readBuf);
-	if (readLen < 0)
+	if (readLen == XLREAD_WOULDBLOCK)
+		return XLREAD_WOULDBLOCK;
+	else if (readLen < 0)
 		goto err;
 
 	Assert(readLen <= XLOG_BLCKSZ);
@@ -689,7 +1045,9 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
 										   state->currRecPtr,
 										   state->readBuf);
-		if (readLen < 0)
+		if (readLen == XLREAD_WOULDBLOCK)
+			return XLREAD_WOULDBLOCK;
+		else if (readLen < 0)
 			goto err;
 	}
 
@@ -707,8 +1065,12 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
-	XLogReaderInvalReadState(state);
-	return -1;
+	if (state->errormsg_buf[0] != '\0')
+	{
+		state->errormsg_deferred = true;
+		XLogReaderInvalReadState(state);
+	}
+	return XLREAD_FAIL;
 }
 
 /*
@@ -987,6 +1349,9 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 
 	Assert(!XLogRecPtrIsInvalid(RecPtr));
 
+	/* Make sure ReadPageInternal() can't return XLREAD_WOULDBLOCK. */
+	state->nonblocking = false;
+
 	/*
 	 * skip over potential continuation data, keeping in mind that it may span
 	 * multiple pages
@@ -1065,7 +1430,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	 * or we just jumped over the remaining data of a continuation.
 	 */
 	XLogBeginRead(state, tmpRecPtr);
-	while (XLogReadRecord(state, &errormsg) != NULL)
+	while (XLogReadRecord(state, &errormsg))
 	{
 		/* past the record we've found, break out */
 		if (RecPtr <= state->ReadRecPtr)
@@ -1187,34 +1552,83 @@ WALRead(XLogReaderState *state,
  * ----------------------------------------
  */
 
-/* private function to reset the state between records */
+/*
+ * Private function to reset the state, forgetting all decoded records, if we
+ * are asked to move to a new read position.
+ */
 static void
 ResetDecoder(XLogReaderState *state)
 {
-	int			block_id;
-
-	state->decoded_record = NULL;
-
-	state->main_data_len = 0;
+	DecodedXLogRecord *r;
 
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	/* Reset the decoded record queue, freeing any oversized records. */
+	while ((r = state->decode_queue_head) != NULL)
 	{
-		state->blocks[block_id].in_use = false;
-		state->blocks[block_id].has_image = false;
-		state->blocks[block_id].has_data = false;
-		state->blocks[block_id].apply_image = false;
+		state->decode_queue_head = r->next;
+		if (r->oversized)
+			pfree(r);
 	}
-	state->max_block_id = -1;
+	state->decode_queue_tail = NULL;
+	state->decode_queue_head = NULL;
+	state->record = NULL;
+
+	/* Reset the decode buffer to empty. */
+	state->decode_buffer_tail = state->decode_buffer;
+	state->decode_buffer_head = state->decode_buffer;
+
+	/* Clear error state. */
+	state->errormsg_buf[0] = '\0';
+	state->errormsg_deferred = false;
 }
 
 /*
- * Decode the previously read record.
+ * Compute the maximum possible amount of space that could be required to
+ * decode a record, given xl_tot_len from the record's header.  This is the
+ * amount of output buffer space that we need to decode a record, though we
+ * might not end up using it all.
+ *
+ * This computation is pessimistic and assumes the maximum possible number of
+ * blocks, due to lack of better information.
+ */
+size_t
+DecodeXLogRecordRequiredSpace(size_t xl_tot_len)
+{
+	size_t		size = 0;
+
+	/* Account for the fixed size part of the decoded record struct. */
+	size += offsetof(DecodedXLogRecord, blocks[0]);
+	/* Account for the flexible blocks array of maximum possible size. */
+	size += sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1);
+	/* Account for all the raw main and block data. */
+	size += xl_tot_len;
+	/* We might insert padding before main_data. */
+	size += (MAXIMUM_ALIGNOF - 1);
+	/* We might insert padding before each block's data. */
+	size += (MAXIMUM_ALIGNOF - 1) * (XLR_MAX_BLOCK_ID + 1);
+	/* We might insert padding at the end. */
+	size += (MAXIMUM_ALIGNOF - 1);
+
+	return size;
+}
+
+/*
+ * Decode a record.  "decoded" must point to a MAXALIGNed memory area that has
+ * space for at least DecodeXLogRecordRequiredSpace(record) bytes.  On
+ * success, decoded->size contains the actual space occupied by the decoded
+ * record, which may turn out to be less.
+ *
+ * Only the decoded->oversized member must be initialized already; it will
+ * not be modified.  Other members will be initialized as required.
  *
  * On error, a human-readable error message is returned in *errormsg, and
  * the return value is false.
  */
 bool
-DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
+DecodeXLogRecord(XLogReaderState *state,
+				 DecodedXLogRecord *decoded,
+				 XLogRecord *record,
+				 XLogRecPtr lsn,
+				 char **errormsg)
 {
 	/*
 	 * read next _size bytes from record buffer, but check for overrun first.
@@ -1229,17 +1643,20 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	} while(0)
 
 	char	   *ptr;
+	char	   *out;
 	uint32		remaining;
 	uint32		datatotal;
 	RelFileNode *rnode = NULL;
 	uint8		block_id;
 
-	ResetDecoder(state);
-
-	state->decoded_record = record;
-	state->record_origin = InvalidRepOriginId;
-	state->toplevel_xid = InvalidTransactionId;
-
+	decoded->header = *record;
+	decoded->lsn = lsn;
+	decoded->next = NULL;
+	decoded->record_origin = InvalidRepOriginId;
+	decoded->toplevel_xid = InvalidTransactionId;
+	decoded->main_data = NULL;
+	decoded->main_data_len = 0;
+	decoded->max_block_id = -1;
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
 	remaining = record->xl_tot_len - SizeOfXLogRecord;
@@ -1257,7 +1674,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint8));
 
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
@@ -1268,18 +1685,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			uint32		main_data_len;
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint32));
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
 		}
 		else if (block_id == XLR_BLOCK_ID_ORIGIN)
 		{
-			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
+			COPY_HEADER_FIELD(&decoded->record_origin, sizeof(RepOriginId));
 		}
 		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
 		{
-			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+			COPY_HEADER_FIELD(&decoded->toplevel_xid, sizeof(TransactionId));
 		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
@@ -1287,7 +1704,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			DecodedBkpBlock *blk;
 			uint8		fork_flags;
 
-			if (block_id <= state->max_block_id)
+			/* mark any intervening block IDs as not in use */
+			for (int i = decoded->max_block_id + 1; i < block_id; ++i)
+				decoded->blocks[i].in_use = false;
+
+			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
 									  "out-of-order block_id %u at %X/%X",
@@ -1295,9 +1716,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
 				goto err;
 			}
-			state->max_block_id = block_id;
+			decoded->max_block_id = block_id;
 
-			blk = &state->blocks[block_id];
+			blk = &decoded->blocks[block_id];
 			blk->in_use = true;
 			blk->apply_image = false;
 
@@ -1440,17 +1861,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	/*
 	 * Ok, we've parsed the fragment headers, and verified that the total
 	 * length of the payload in the fragments is equal to the amount of data
-	 * left. Copy the data of each fragment to a separate buffer.
-	 *
-	 * We could just set up pointers into readRecordBuf, but we want to align
-	 * the data for the convenience of the callers. Backup images are not
-	 * copied, however; they don't need alignment.
+	 * left.  Copy the data of each fragment to contiguous space after the
+	 * blocks array, inserting alignment padding before the data fragments so
+	 * they can be cast to struct pointers by REDO routines.
 	 */
+	out = ((char *) decoded) +
+		offsetof(DecodedXLogRecord, blocks) +
+		sizeof(decoded->blocks[0]) * (decoded->max_block_id + 1);
 
 	/* block data first */
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	for (block_id = 0; block_id <= decoded->max_block_id; block_id++)
 	{
-		DecodedBkpBlock *blk = &state->blocks[block_id];
+		DecodedBkpBlock *blk = &decoded->blocks[block_id];
 
 		if (!blk->in_use)
 			continue;
@@ -1459,58 +1881,37 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (blk->has_image)
 		{
-			blk->bkp_image = ptr;
+			/* no need to align image */
+			blk->bkp_image = out;
+			memcpy(out, ptr, blk->bimg_len);
 			ptr += blk->bimg_len;
+			out += blk->bimg_len;
 		}
 		if (blk->has_data)
 		{
-			if (!blk->data || blk->data_len > blk->data_bufsz)
-			{
-				if (blk->data)
-					pfree(blk->data);
-
-				/*
-				 * Force the initial request to be BLCKSZ so that we don't
-				 * waste time with lots of trips through this stanza as a
-				 * result of WAL compression.
-				 */
-				blk->data_bufsz = MAXALIGN(Max(blk->data_len, BLCKSZ));
-				blk->data = palloc(blk->data_bufsz);
-			}
+			out = (char *) MAXALIGN(out);
+			blk->data = out;
 			memcpy(blk->data, ptr, blk->data_len);
 			ptr += blk->data_len;
+			out += blk->data_len;
 		}
 	}
 
 	/* and finally, the main data */
-	if (state->main_data_len > 0)
+	if (decoded->main_data_len > 0)
 	{
-		if (!state->main_data || state->main_data_len > state->main_data_bufsz)
-		{
-			if (state->main_data)
-				pfree(state->main_data);
-
-			/*
-			 * main_data_bufsz must be MAXALIGN'ed.  In many xlog record
-			 * types, we omit trailing struct padding on-disk to save a few
-			 * bytes; but compilers may generate accesses to the xlog struct
-			 * that assume that padding bytes are present.  If the palloc
-			 * request is not large enough to include such padding bytes then
-			 * we'll get valgrind complaints due to otherwise-harmless fetches
-			 * of the padding bytes.
-			 *
-			 * In addition, force the initial request to be reasonably large
-			 * so that we don't waste time with lots of trips through this
-			 * stanza.  BLCKSZ / 2 seems like a good compromise choice.
-			 */
-			state->main_data_bufsz = MAXALIGN(Max(state->main_data_len,
-												  BLCKSZ / 2));
-			state->main_data = palloc(state->main_data_bufsz);
-		}
-		memcpy(state->main_data, ptr, state->main_data_len);
-		ptr += state->main_data_len;
+		out = (char *) MAXALIGN(out);
+		decoded->main_data = out;
+		memcpy(decoded->main_data, ptr, decoded->main_data_len);
+		ptr += decoded->main_data_len;
+		out += decoded->main_data_len;
 	}
 
+	/* Report the actual size we used. */
+	decoded->size = MAXALIGN(out - (char *) decoded);
+	Assert(DecodeXLogRecordRequiredSpace(record->xl_tot_len) >=
+		   decoded->size);
+
 	return true;
 
 shortdata_err:
@@ -1536,10 +1937,11 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	if (rnode)
 		*rnode = bkpb->rnode;
 	if (forknum)
@@ -1559,10 +1961,11 @@ XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return NULL;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 
 	if (!bkpb->has_data)
 	{
@@ -1590,12 +1993,13 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	char	   *ptr;
 	PGAlignedBlock tmp;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
-	if (!record->blocks[block_id].has_image)
+	if (!record->record->blocks[block_id].has_image)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	ptr = bkpb->bkp_image;
 
 	if (BKPIMAGE_COMPRESSED(bkpb->bimg_info))
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..9feea3e6ec 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2139,7 +2139,7 @@ xlog_block_info(StringInfo buf, XLogReaderState *record)
 	int			block_id;
 
 	/* decode block references */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
@@ -2271,7 +2271,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
 
 	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		Buffer		buf;
 		Page		page;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 54d5f20734..511f2f186f 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -370,7 +370,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * going to initialize it. And vice versa.
 	 */
 	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
-	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
+	willinit = (XLogRecGetBlock(record, block_id)->flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
 	if (!willinit && zeromode)
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 8c00a73cb9..77bc7aea7a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -111,7 +111,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 	{
 		ReorderBufferAssignChild(ctx->reorder,
 								 txid,
-								 record->decoded_record->xl_xid,
+								 XLogRecGetXid(record),
 								 buf.origptr);
 	}
 
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 56df08c64f..7cfa169e9b 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -432,7 +432,7 @@ extractPageInfo(XLogReaderState *record)
 				 RmgrNames[rmid], info);
 	}
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index f128050b4e..fc081adfb8 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -403,14 +403,13 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
 	 * Calculate the amount of FPI data in the record.
 	 *
 	 * XXX: We peek into xlogreader's private decoded backup blocks for the
-	 * bimg_len indicating the length of FPI data. It doesn't seem worth it to
-	 * add an accessor macro for this.
+	 * bimg_len indicating the length of FPI data.
 	 */
 	*fpi_len = 0;
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (XLogRecHasBlockImage(record, block_id))
-			*fpi_len += record->blocks[block_id].bimg_len;
+			*fpi_len += XLogRecGetBlock(record, block_id)->bimg_len;
 	}
 
 	/*
@@ -508,7 +507,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	if (!config->bkp_details)
 	{
 		/* print block references (short format) */
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -539,7 +538,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	{
 		/* print block references (detailed format) */
 		putchar('\n');
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -552,7 +551,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				   blk);
 			if (XLogRecHasBlockImage(record, block_id))
 			{
-				uint8		bimg_info = record->blocks[block_id].bimg_info;
+				uint8		bimg_info = XLogRecGetBlock(record, block_id)->bimg_info;
 
 				if (BKPIMAGE_COMPRESSED(bimg_info))
 				{
@@ -571,11 +570,11 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 						   "compression saved: %u, method: %s",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length,
+						   XLogRecGetBlock(record, block_id)->hole_offset,
+						   XLogRecGetBlock(record, block_id)->hole_length,
 						   BLCKSZ -
-						   record->blocks[block_id].hole_length -
-						   record->blocks[block_id].bimg_len,
+						   XLogRecGetBlock(record, block_id)->hole_length -
+						   XLogRecGetBlock(record, block_id)->bimg_len,
 						   method);
 				}
 				else
@@ -583,8 +582,8 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 					printf(" (FPW%s); hole: offset: %u, length: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length);
+						   XLogRecGetBlock(record, block_id)->hole_offset,
+						   XLogRecGetBlock(record, block_id)->hole_length);
 				}
 			}
 			putchar('\n');
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..d1f364f4e8 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -144,6 +144,30 @@ typedef struct
 	uint16		data_bufsz;
 } DecodedBkpBlock;
 
+/*
+ * The decoded contents of a record.  This occupies a contiguous region of
+ * memory, with main_data and blocks[n].data pointing to memory after the
+ * members declared here.
+ */
+typedef struct DecodedXLogRecord
+{
+	/* Private member used for resource management. */
+	size_t		size;			/* total size of decoded record */
+	bool		oversized;		/* outside the regular decode buffer? */
+	struct DecodedXLogRecord *next; /* decoded record queue link */
+
+	/* Public members. */
+	XLogRecPtr	lsn;			/* location */
+	XLogRecPtr	next_lsn;		/* location of next record */
+	XLogRecord	header;			/* header */
+	RepOriginId record_origin;
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+	char	   *main_data;		/* record's main data portion */
+	uint32		main_data_len;	/* main data portion's length */
+	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	DecodedBkpBlock blocks[FLEXIBLE_ARRAY_MEMBER];
+} DecodedXLogRecord;
+
 struct XLogReaderState
 {
 	/*
@@ -171,6 +195,9 @@ struct XLogReaderState
 	 * Start and end point of last record read.  EndRecPtr is also used as the
 	 * position to read next.  Calling XLogBeginRead() sets EndRecPtr to the
 	 * starting position and ReadRecPtr to invalid.
+	 *
+	 * Start and end point of last record returned by XLogReadRecord().  These
+	 * are also available as record->lsn and record->next_lsn.
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
@@ -192,27 +219,43 @@ struct XLogReaderState
 	 * Use XLogRecGet* functions to investigate the record; these fields
 	 * should not be accessed directly.
 	 * ----------------------------------------
+	 * Start and end point of the last record read and decoded by
+	 * XLogReadRecordInternal().  NextRecPtr is also used as the position to
+	 * decode next.  Calling XLogBeginRead() sets NextRecPtr and EndRecPtr to
+	 * the requested starting position.
 	 */
-	XLogRecord *decoded_record; /* currently decoded record */
+	XLogRecPtr	DecodeRecPtr;	/* start of last record decoded */
+	XLogRecPtr	NextRecPtr;		/* end+1 of last record decoded */
+	XLogRecPtr	PrevRecPtr;		/* start of previous record decoded */
 
-	char	   *main_data;		/* record's main data portion */
-	uint32		main_data_len;	/* main data portion's length */
-	uint32		main_data_bufsz;	/* allocated size of the buffer */
-
-	RepOriginId record_origin;
-
-	TransactionId toplevel_xid; /* XID of top-level transaction */
-
-	/* information about blocks referenced by the record. */
-	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
-
-	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	/* Last record returned by XLogReadRecord(). */
+	DecodedXLogRecord *record;
 
 	/* ----------------------------------------
 	 * private/internal state
 	 * ----------------------------------------
 	 */
 
+	/*
+	 * Buffer for decoded records.  This is a circular buffer, though
+	 * individual records can't be split in the middle, so some space is often
+	 * wasted at the end.  Oversized records that don't fit in this space are
+	 * allocated separately.
+	 */
+	char	   *decode_buffer;
+	size_t		decode_buffer_size;
+	bool		free_decode_buffer; /* need to free? */
+	char	   *decode_buffer_head; /* data is read from the head */
+	char	   *decode_buffer_tail; /* new data is written at the tail */
+
+	/*
+	 * Queue of records that have been decoded.  This is a linked list that
+	 * usually consists of consecutive records in decode_buffer, but may also
+	 * contain oversized records allocated with palloc().
+	 */
+	DecodedXLogRecord *decode_queue_head;	/* oldest decoded record */
+	DecodedXLogRecord *decode_queue_tail;	/* newest decoded record */
+
 	/*
 	 * Buffer for currently read page (XLOG_BLCKSZ bytes, valid up to at least
 	 * readLen bytes)
@@ -262,8 +305,25 @@ struct XLogReaderState
 
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
+	bool		errormsg_deferred;
+
+	/*
+	 * Flag to indicate to XLogPageReadCB that it should not block, during
+	 * read ahead.
+	 */
+	bool		nonblocking;
 };
 
+/*
+ * Check whether XLogNextRecord() has any more queued records or a deferred
+ * error.  This can be used by a read_page callback to decide whether to block.
+ */
+static inline bool
+XLogReaderHasQueuedRecordOrError(XLogReaderState *state)
+{
+	return (state->decode_queue_head != NULL) || state->errormsg_deferred;
+}
+
 /* Get a new XLogReader */
 extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
 										   const char *waldir,
@@ -274,16 +334,40 @@ extern XLogReaderRoutine *LocalXLogReaderRoutine(void);
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
+/* Optionally provide a circular decoding buffer to allow readahead. */
+extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
+									  void *buffer,
+									  size_t size);
+
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
 
+/* Return values from XLogPageReadCB. */
+typedef enum XLogPageReadResult
+{
+	XLREAD_SUCCESS = 0,			/* record is successfully read */
+	XLREAD_FAIL = -1,			/* failed during reading a record */
+	XLREAD_WOULDBLOCK = -2		/* nonblocking mode only, no data */
+} XLogPageReadResult;
+
 /* Read the next XLog record. Returns NULL on end-of-WAL or failure */
-extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
+extern XLogRecord *XLogReadRecord(XLogReaderState *state,
+								  char **errormsg);
+
+/* Consume the next record or error. */
+extern DecodedXLogRecord *XLogNextRecord(XLogReaderState *state,
 										 char **errormsg);
 
+/* Release the previously returned record, if necessary. */
+extern void XLogReleasePreviousRecord(XLogReaderState *state);
+
+/* Try to read ahead, if there is data and space. */
+extern DecodedXLogRecord *XLogReadAhead(XLogReaderState *state,
+										bool nonblocking);
+
 /* Validate a page */
 extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 										 XLogRecPtr recptr, char *phdr);
@@ -307,25 +391,32 @@ extern bool WALRead(XLogReaderState *state,
 
 /* Functions for decoding an XLogRecord */
 
-extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
+extern size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+extern bool DecodeXLogRecord(XLogReaderState *state,
+							 DecodedXLogRecord *decoded,
+							 XLogRecord *record,
+							 XLogRecPtr lsn,
 							 char **errmsg);
 
-#define XLogRecGetTotalLen(decoder) ((decoder)->decoded_record->xl_tot_len)
-#define XLogRecGetPrev(decoder) ((decoder)->decoded_record->xl_prev)
-#define XLogRecGetInfo(decoder) ((decoder)->decoded_record->xl_info)
-#define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
-#define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
-#define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
-#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
-#define XLogRecGetData(decoder) ((decoder)->main_data)
-#define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
-#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
-#define XLogRecHasBlockRef(decoder, block_id) \
-	((decoder)->blocks[block_id].in_use)
-#define XLogRecHasBlockImage(decoder, block_id) \
-	((decoder)->blocks[block_id].has_image)
-#define XLogRecBlockImageApply(decoder, block_id) \
-	((decoder)->blocks[block_id].apply_image)
+#define XLogRecGetTotalLen(decoder) ((decoder)->record->header.xl_tot_len)
+#define XLogRecGetPrev(decoder) ((decoder)->record->header.xl_prev)
+#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
+#define XLogRecGetRmid(decoder) ((decoder)->record->header.xl_rmid)
+#define XLogRecGetXid(decoder) ((decoder)->record->header.xl_xid)
+#define XLogRecGetOrigin(decoder) ((decoder)->record->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->record->toplevel_xid)
+#define XLogRecGetData(decoder) ((decoder)->record->main_data)
+#define XLogRecGetDataLen(decoder) ((decoder)->record->main_data_len)
+#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->record->max_block_id >= 0)
+#define XLogRecMaxBlockId(decoder) ((decoder)->record->max_block_id)
+#define XLogRecGetBlock(decoder, i) (&(decoder)->record->blocks[(i)])
+#define XLogRecHasBlockRef(decoder, block_id)			\
+	(((decoder)->record->max_block_id >= (block_id)) &&	\
+	 ((decoder)->record->blocks[block_id].in_use))
+#define XLogRecHasBlockImage(decoder, block_id)		\
+	((decoder)->record->blocks[block_id].has_image)
+#define XLogRecBlockImageApply(decoder, block_id)		\
+	((decoder)->record->blocks[block_id].apply_image)
 
 #ifndef FRONTEND
 extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index eaf3e7a8d4..f57f7e0f53 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -533,6 +533,7 @@ DeadLockState
 DeallocateStmt
 DeclareCursorStmt
 DecodedBkpBlock
+DecodedXLogRecord
 DecodingOutputState
 DefElem
 DefElemAction
@@ -2939,6 +2940,7 @@ XLogPageHeader
 XLogPageHeaderData
 XLogPageReadCB
 XLogPageReadPrivate
+XLogPageReadResult
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.30.2
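
To make the new interface concrete, here is a minimal sketch (not part of the
patch) of how a caller might drive the reader through the decoded-record
queue.  The function name replay_range, the 256kB buffer size and the elog()
error handling are illustrative only; the reader is assumed to have been set
up elsewhere with XLogReaderAllocate() and a suitable page_read callback.

	#include "postgres.h"
	#include "access/xlogreader.h"

	static void
	replay_range(XLogReaderState *reader, XLogRecPtr start_lsn)
	{
		char	   *errormsg = NULL;
		size_t		bufsz = 256 * 1024;

		/* Optional: give the reader a circular buffer so it can queue records. */
		XLogReaderSetDecodeBuffer(reader, palloc(bufsz), bufsz);

		XLogBeginRead(reader, start_lsn);
		while (XLogReadRecord(reader, &errormsg) != NULL)
		{
			/* Block references are reached through the XLogRecGet* macros. */
			for (int block_id = 0; block_id <= XLogRecMaxBlockId(reader); block_id++)
			{
				if (!XLogRecHasBlockRef(reader, block_id))
					continue;
				/* ... redo work using XLogRecGetBlock(reader, block_id) ... */
			}
		}

		if (errormsg)
			elog(ERROR, "could not read WAL: %s", errormsg);
	}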

v22-0002-Prefetch-referenced-data-in-recovery-take-II.patch (text/x-patch; charset=US-ASCII)
From 05fc2981d5c6d5b1603080c45b11d96562e9f44d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 9 Nov 2021 16:43:45 +1300
Subject: [PATCH v22 2/2] Prefetch referenced data in recovery, take II.

Introduce a new GUC recovery_prefetch, disabled by default.  When
enabled, look ahead in the WAL and try to initiate asynchronous reading
of referenced data blocks that are not yet cached in our buffer pool.
For now, this is done with posix_fadvise(), which has several caveats.
Better mechanisms will follow in later work on the I/O subsystem.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size limits the maximum distance we are
prepared to read ahead in the WAL to find uncached blocks.

Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (earlier version)
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> (earlier version)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (earlier version)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> (earlier version)
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> (earlier version)
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  61 ++
 doc/src/sgml/monitoring.sgml                  |  77 +-
 doc/src/sgml/wal.sgml                         |  12 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             |   2 +
 src/backend/access/transam/xlogprefetcher.c   | 962 ++++++++++++++++++
 src/backend/access/transam/xlogreader.c       |  13 +
 src/backend/access/transam/xlogrecovery.c     | 160 ++-
 src/backend/access/transam/xlogutils.c        |  27 +-
 src/backend/catalog/system_views.sql          |  13 +
 src/backend/storage/freespace/freespace.c     |   3 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  39 +-
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/access/xlog.h                     |   1 +
 src/include/access/xlogprefetcher.h           |  43 +
 src/include/access/xlogreader.h               |   8 +
 src/include/access/xlogutils.h                |   3 +-
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/utils/guc.h                       |   4 +
 src/test/regress/expected/rules.out           |  10 +
 src/tools/pgindent/typedefs.list              |   7 +
 22 files changed, 1400 insertions(+), 62 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetcher.c
 create mode 100644 src/include/access/xlogprefetcher.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5612e80453..4244e0a7bb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3644,6 +3644,67 @@ include_dir 'conf.d'
      </variablelist>
     </sect2>
 
+   <sect2 id="runtime-config-wal-recovery">
+
+    <title>Recovery</title>
+
+     <indexterm>
+      <primary>configuration</primary>
+      <secondary>of recovery</secondary>
+      <tertiary>general settings</tertiary>
+     </indexterm>
+
+    <para>
+     This section describes the settings that apply to recovery in general,
+     affecting crash recovery, streaming replication and archive-based
+     replication.
+    </para>
+
+
+    <variablelist>
+     <varlistentry id="guc-recovery-prefetch" xreflabel="recovery_prefetch">
+      <term><varname>recovery_prefetch</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to try to prefetch blocks that are referenced in the WAL but
+        are not yet in the buffer pool, during recovery.  Prefetching blocks
+        that will soon be needed can reduce I/O wait times in some workloads.
+        See also the <xref linkend="guc-wal-decode-buffer-size"/> and
+        <xref linkend="guc-maintenance-io-concurrency"/> settings, which limit
+        prefetching activity.
+        This setting is disabled by default.
+       </para>
+       <para>
+        This feature currently depends on an effective
+        <function>posix_fadvise</function> function, which some
+        operating systems lack.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-decode-buffer-size" xreflabel="wal_decode_buffer_size">
+      <term><varname>wal_decode_buffer_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_decode_buffer_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        A limit on how far ahead the server can look in the WAL, to find
+        blocks to prefetch.  If this value is specified without units, it is
+        taken as bytes.
+        The default is 512kB.
+       </para>
+      </listitem>
+     </varlistentry>
+
+    </variablelist>
+   </sect2>
+
   <sect2 id="runtime-config-wal-archive-recovery">
 
     <title>Archive Recovery</title>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9fb62fec8e..2e3b73f49e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -328,6 +328,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2958,6 +2965,69 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    copy of the subscribed tables.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_init</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they would be zero-initialized</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they didn't exist yet</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL</entry>
+    </row>
+    <row>
+     <entry><structfield>wal_distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many bytes ahead the prefetcher is looking</entry>
+    </row>
+    <row>
+     <entry><structfield>block_distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many blocks ahead the prefetcher is looking</entry>
+    </row>
+    <row>
+     <entry><structfield>io_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
+   for more information.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="1">
@@ -5177,8 +5247,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
         view, <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
-        to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+        the <structname>pg_stat_archiver</structname> view,
+        <literal>wal</literal> to reset all the counters shown in the
+        <structname>pg_stat_wal</structname> view or
+        <literal>prefetch_recovery</literal> to reset all the counters shown
+        in the <structname>pg_stat_prefetch_recovery</structname> view.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 2bb27a8468..8566f297d3 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -803,6 +803,18 @@
    counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
    in <structname>pg_stat_wal</structname>, respectively.
   </para>
+
+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.
+   By default, prefetching in recovery is disabled.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 79314c69ab..8c17c88dfc 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetcher.o \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogutils.o
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2b4e591736..23ecf0a237 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -59,6 +59,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
@@ -132,6 +133,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
new file mode 100644
index 0000000000..ec24cbf386
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -0,0 +1,962 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetcher.c
+ *
+ * This module provides a drop-in replacement for an XLogReader that tries to
+ * minimize I/O stalls by looking up future blocks in the buffer cache, and
+ * initiating I/Os that might complete before the caller eventually needs the
+ * data.  XLogReadBufferForRedo() cooperates by using information stored in the
+ * decoded record to find buffers efficiently.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetcher.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_control.h"
+#include "catalog/storage_xlog.h"
+#include "commands/dbcommands_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/* Every time we process this much WAL, we update dynamic values in shm. */
+#define XLOGPREFETCHER_STATS_SHM_DISTANCE BLCKSZ
+
+/* GUCs */
+bool		recovery_prefetch = false;
+
+static int	XLogPrefetchReconfigureCount = 0;
+
+/*
+ * Enum used to report whether an IO should be started.
+ */
+typedef enum
+{
+	LRQ_NEXT_NO_IO,
+	LRQ_NEXT_IO,
+	LRQ_NEXT_AGAIN
+} LsnReadQueueNextStatus;
+
+/*
+ * Type of callback that can decide which block to prefetch next.  For now
+ * there is only one.
+ */
+typedef LsnReadQueueNextStatus (*LsnReadQueueNextFun) (uintptr_t lrq_private,
+													   XLogRecPtr *lsn);
+
+/*
+ * A simple circular queue of LSNs, used to control the number of
+ * (potentially) inflight IOs.  This stands in for a later more general IO
+ * control mechanism, which is why it has the apparently unnecessary
+ * indirection through a function pointer.
+ */
+typedef struct LsnReadQueue
+{
+	LsnReadQueueNextFun next;
+	uintptr_t	lrq_private;
+	uint32		max_inflight;
+	uint32		inflight;
+	uint32		completed;
+	uint32		head;
+	uint32		tail;
+	uint32		size;
+	struct
+	{
+		bool		io;
+		XLogRecPtr	lsn;
+	}			queue[FLEXIBLE_ARRAY_MEMBER];
+} LsnReadQueue;
+
+/*
+ * A prefetcher.  This is a mechanism that wraps an XLogReader, prefetching
+ * blocks that will soon be referenced, to try to avoid IO stalls.
+ */
+struct XLogPrefetcher
+{
+	/* WAL reader and current reading state. */
+	XLogReaderState *reader;
+	DecodedXLogRecord *record;
+	int			next_block_id;
+
+	/* When to publish stats. */
+	XLogRecPtr	next_stats_shm_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB	   *filter_table;
+	dlist_head	filter_queue;
+
+	/* Book-keeping for readahead barriers. */
+	XLogRecPtr	no_readahead_until;
+
+	/* IO depth manager. */
+	LsnReadQueue *streaming_read;
+
+	XLogRecPtr	begin_ptr;
+
+	int			reconfigure_count;
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that (we assume) have already been dropped, or will be created by bulk WAL
+ * operators.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode rnode;
+	XLogRecPtr	filter_until_replayed;
+	BlockNumber filter_from_block;
+	dlist_node	link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time;	/* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 hit;		/* Blocks already in cache. */
+	pg_atomic_uint64 skip_init; /* Zero-inited blocks skipped. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32		reset_handled;
+
+	/* Dynamic values */
+	int			wal_distance;	/* Number of WAL bytes ahead. */
+	int			block_distance; /* Number of block references ahead. */
+	int			io_depth;		/* Number of I/Os in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static LsnReadQueueNextStatus XLogPrefetcherNextBlock(uintptr_t pgsr_private,
+													  XLogRecPtr *lsn);
+
+static XLogPrefetchStats *SharedStats;
+
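+/*
+ * Allocate an LsnReadQueue that can track up to 'max_distance' entries, with
+ * at most 'max_inflight' IOs in flight at any one time.  'next' is the
+ * callback used to choose the next block and report whether an IO was
+ * initiated for it; 'lrq_private' is passed through to that callback.
+ */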
+static inline LsnReadQueue *
+lrq_alloc(uint32 max_distance,
+		  uint32 max_inflight,
+		  uintptr_t lrq_private,
+		  LsnReadQueueNextFun next)
+{
+	LsnReadQueue *lrq;
+	uint32		size;
+
+	Assert(max_distance >= max_inflight);
+
+	size = max_distance + 1;	/* full ring buffer has a gap */
+	lrq = palloc(offsetof(LsnReadQueue, queue) + sizeof(lrq->queue[0]) * size);
+	lrq->lrq_private = lrq_private;
+	lrq->max_inflight = max_inflight;
+	lrq->size = size;
+	lrq->next = next;
+	lrq->head = 0;
+	lrq->tail = 0;
+	lrq->inflight = 0;
+	lrq->completed = 0;
+
+	return lrq;
+}
+
+static inline void
+lrq_free(LsnReadQueue *lrq)
+{
+	pfree(lrq);
+}
+
+static inline uint32
+lrq_inflight(LsnReadQueue *lrq)
+{
+	return lrq->inflight;
+}
+
+static inline uint32
+lrq_completed(LsnReadQueue *lrq)
+{
+	return lrq->completed;
+}
+
+static inline void
+lrq_prefetch(LsnReadQueue *lrq)
+{
+	/* Try to start as many IOs as we can within our limits. */
+	while (lrq->inflight < lrq->max_inflight &&
+		   lrq->inflight + lrq->completed < lrq->size - 1)
+	{
+		Assert(((lrq->head + 1) % lrq->size) != lrq->tail);
+		switch (lrq->next(lrq->lrq_private, &lrq->queue[lrq->head].lsn))
+		{
+			case LRQ_NEXT_AGAIN:
+				return;
+			case LRQ_NEXT_IO:
+				lrq->queue[lrq->head].io = true;
+				lrq->inflight++;
+				break;
+			case LRQ_NEXT_NO_IO:
+				lrq->queue[lrq->head].io = false;
+				lrq->completed++;
+				break;
+		}
+		lrq->head++;
+		if (lrq->head == lrq->size)
+			lrq->head = 0;
+	}
+}
+
+static inline void
+lrq_complete_lsn(LsnReadQueue *lrq, XLogRecPtr lsn)
+{
+	/*
+	 * We know that LSNs before 'lsn' have been replayed, so we can now assume
+	 * that any IOs that were started before then have finished.
+	 */
+	while (lrq->tail != lrq->head &&
+		   lrq->queue[lrq->tail].lsn < lsn)
+	{
+		if (lrq->queue[lrq->tail].io)
+			lrq->inflight--;
+		else
+			lrq->completed--;
+		lrq->tail++;
+		if (lrq->tail == lrq->size)
+			lrq->tail = 0;
+	}
+	if (recovery_prefetch)
+		lrq_prefetch(lrq);
+}
+
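+/*
+ * Report the amount of shared memory needed for the prefetcher's shared
+ * statistics.
+ */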
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
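+/*
+ * Reset all counters to zero and record the time of the reset.
+ */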
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&SharedStats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&SharedStats->prefetch, 0);
+	pg_atomic_write_u64(&SharedStats->hit, 0);
+	pg_atomic_write_u64(&SharedStats->skip_init, 0);
+	pg_atomic_write_u64(&SharedStats->skip_new, 0);
+	pg_atomic_write_u64(&SharedStats->skip_fpw, 0);
+}
+
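+/*
+ * Initialize the shared memory area used for prefetch statistics, or attach
+ * to the existing one if it has already been created.
+ */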
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	SharedStats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+
+	if (!found)
+	{
+		pg_atomic_init_u32(&SharedStats->reset_request, 0);
+		SharedStats->reset_handled = 0;
+
+		pg_atomic_init_u64(&SharedStats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&SharedStats->prefetch, 0);
+		pg_atomic_init_u64(&SharedStats->hit, 0);
+		pg_atomic_init_u64(&SharedStats->skip_init, 0);
+		pg_atomic_init_u64(&SharedStats->skip_new, 0);
+		pg_atomic_init_u64(&SharedStats->skip_fpw, 0);
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&SharedStats->reset_request, 1);
+}
+
+/*
+ * Increment a counter in shared memory.  This is equivalent to *counter++ on a
+ * plain uint64 without any memory barrier or locking, except on platforms
+ * where readers can't read uint64 without possibly observing a torn value.
+ */
+static inline void
+XLogPrefetchIncrement(pg_atomic_uint64 *counter)
+{
+	Assert(AmStartupProcess() || !IsUnderPostmaster);
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL records.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogReaderState *reader)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	prefetcher = palloc0(sizeof(XLogPrefetcher));
+
+	prefetcher->reader = reader;
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	SharedStats->wal_distance = 0;
+	SharedStats->block_distance = 0;
+	SharedStats->io_depth = 0;
+
+	/* First usage will cause streaming_read to be allocated. */
+	prefetcher->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	lrq_free(prefetcher->streaming_read);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Provide access to the reader.
+ */
+XLogReaderState *
+XLogPrefetcherReader(XLogPrefetcher *prefetcher)
+{
+	return prefetcher->reader;
+}
+
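+/*
+ * Publish updated values for the dynamic columns of
+ * pg_stat_prefetch_recovery (io_depth, block_distance and wal_distance),
+ * handle any pending request to reset the counters, and schedule the next
+ * update.
+ */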
+static void
+XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher, XLogRecPtr lsn)
+{
+	uint32		io_depth;
+	uint32		completed;
+	uint32		reset_request;
+	int64		wal_distance;
+
+	/* How far ahead of replay are we now? */
+	if (prefetcher->record)
+		wal_distance = prefetcher->record->lsn - prefetcher->reader->record->lsn;
+	else
+		wal_distance = 0;
+
+	/* How many IOs are currently in flight and completed? */
+	io_depth = lrq_inflight(prefetcher->streaming_read);
+	completed = lrq_completed(prefetcher->streaming_read);
+
+	/* Update the instantaneous stats visible in pg_stat_prefetch_recovery. */
+	SharedStats->io_depth = io_depth;
+	SharedStats->block_distance = io_depth + completed;
+	SharedStats->wal_distance = wal_distance;
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&SharedStats->reset_request);
+	if (reset_request != SharedStats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		SharedStats->reset_handled = reset_request;
+	}
+
+	prefetcher->next_stats_shm_lsn = lsn + XLOGPREFETCHER_STATS_SHM_DISTANCE;
+}
+
+/*
+ * A callback that reads ahead in the WAL and tries to initiate one IO.
+ */
+static LsnReadQueueNextStatus
+XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
+{
+	XLogPrefetcher *prefetcher = (XLogPrefetcher *) pgsr_private;
+	XLogReaderState *reader = prefetcher->reader;
+	XLogRecPtr	replaying_lsn = reader->ReadRecPtr;
+
+	/*
+	 * We keep track of the record and block we're up to between calls with
+	 * prefetcher->record and prefetcher->next_block_id.
+	 */
+	for (;;)
+	{
+		DecodedXLogRecord *record;
+
+		/* Try to read a new future record, if we don't already have one. */
+		if (prefetcher->record == NULL)
+		{
+			bool		nonblocking;
+
+			/*
+			 * If there are already records or an error queued up that could
+			 * be replayed, we don't want to block here.  Otherwise, it's OK
+			 * to block waiting for more data: presumably the caller has
+			 * nothing else to do.
+			 */
+			nonblocking = XLogReaderHasQueuedRecordOrError(reader);
+
+			/* Certain records act as barriers for all readahead. */
+			if (nonblocking && replaying_lsn < prefetcher->no_readahead_until)
+				return LRQ_NEXT_AGAIN;
+
+			record = XLogReadAhead(prefetcher->reader, nonblocking);
+			if (record == NULL)
+			{
+				/*
+				 * We can't read any more, due to an error or lack of data in
+				 * nonblocking mode.
+				 */
+				return LRQ_NEXT_AGAIN;
+			}
+
+			/*
+			 * If prefetching is disabled, we don't need to analyze the record
+			 * or issue any prefetches.  We just need to cause one record to
+			 * be decoded.
+			 */
+			if (!recovery_prefetch)
+			{
+				*lsn = InvalidXLogRecPtr;
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* We have a new record to process. */
+			prefetcher->record = record;
+			prefetcher->next_block_id = 0;
+		}
+		else
+		{
+			/* Continue to process from last call, or last loop. */
+			record = prefetcher->record;
+		}
+
+		/*
+		 * Check for operations that require us to filter out block ranges, or
+		 * stop readahead completely.
+		 *
+		 * XXX Perhaps this information could be derived automatically if we
+		 * had some standardized header flags and fields for these cases,
+		 * instead of special logic.
+		 *
+		 * XXX Are there other operations that need this treatment?
+		 */
+		if (replaying_lsn < record->lsn)
+		{
+			uint8		rmid = record->header.xl_rmid;
+			uint8		record_type = record->header.xl_info & ~XLR_INFO_MASK;
+
+			if (rmid == RM_XLOG_ID)
+			{
+				if (record_type == XLOG_CHECKPOINT_SHUTDOWN ||
+					record_type == XLOG_END_OF_RECOVERY)
+				{
+					/*
+					 * These records might change the TLI.  Avoid potential
+					 * bugs if we were to allow "read TLI" and "replay TLI" to
+					 * differ without more analysis.
+					 */
+					prefetcher->no_readahead_until = record->lsn;
+				}
+			}
+			else if (rmid == RM_DBASE_ID)
+			{
+				if (record_type == XLOG_DBASE_CREATE)
+				{
+					xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *)
+					record->main_data;
+					RelFileNode rnode = {InvalidOid, xlrec->db_id, InvalidOid};
+
+					/*
+					 * Don't try to prefetch anything in this database until
+					 * it has been created, or we might confuse blocks on OID
+					 * wraparound.  (We could use XLOG_DBASE_DROP instead, but
+					 * there shouldn't be any reference to blocks in a
+					 * database between DROP and CREATE for the same OID, and
+					 * treating CREATE as the barrier avoids the more
+					 * expensive ENOENT handling that we would otherwise
+					 * need.)
+					 */
+					XLogPrefetcherAddFilter(prefetcher, rnode, 0, record->lsn);
+				}
+			}
+			else if (rmid == RM_SMGR_ID)
+			{
+				if (record_type == XLOG_SMGR_CREATE)
+				{
+					xl_smgr_create *xlrec = (xl_smgr_create *)
+					record->main_data;
+
+					/*
+					 * Don't prefetch anything for this whole relation until
+					 * it has been created, or we might confuse blocks on OID
+					 * wraparound.
+					 */
+					XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+											record->lsn);
+				}
+				else if (record_type == XLOG_SMGR_TRUNCATE)
+				{
+					xl_smgr_truncate *xlrec = (xl_smgr_truncate *)
+					record->main_data;
+
+					/*
+					 * Don't prefetch anything in the truncated range until
+					 * the truncation has been performed.
+					 */
+					XLogPrefetcherAddFilter(prefetcher, xlrec->rnode,
+											xlrec->blkno,
+											record->lsn);
+				}
+			}
+		}
+
+		/* Scan the block references, starting where we left off last time. */
+		while (prefetcher->next_block_id <= record->max_block_id)
+		{
+			int			block_id = prefetcher->next_block_id++;
+			DecodedBkpBlock *block = &record->blocks[block_id];
+			SMgrRelation reln;
+			PrefetchBufferResult result;
+
+			if (!block->in_use)
+				continue;
+
+			Assert(!BufferIsValid(block->prefetch_buffer));
+
+			/*
+			 * Record the LSN of this record.  When it's replayed,
+			 * LsnReadQueue will consider any IOs submitted for earlier LSNs
+			 * to be finished.
+			 */
+			*lsn = record->lsn;
+
+			/* We don't try to prefetch anything but the main fork for now. */
+			if (block->forknum != MAIN_FORKNUM)
+			{
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/*
+			 * If there is a full page image attached, we won't be reading the
+			 * page, so don't bother trying to prefetch.
+			 */
+			if (block->has_image)
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_fpw);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* There is no point in reading a page that will be zeroed. */
+			if (block->flags & BKPBLOCK_WILL_INIT)
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_init);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* Should we skip prefetching this block due to a filter? */
+			if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/*
+			 * We could try to have a fast path for repeated references to the
+			 * same relation (with some scheme to handle invalidations
+			 * safely), but for now we'll call smgropen() every time.
+			 */
+			reln = smgropen(block->rnode, InvalidBackendId);
+
+			/*
+			 * If the block is past the end of the relation, filter out
+			 * further accesses until this record is replayed.
+			 */
+			if (block->blkno >= smgrnblocks(reln, block->forknum))
+			{
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+										record->lsn);
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* Try to initiate prefetching. */
+			result = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+			if (BufferIsValid(result.recent_buffer))
+			{
+				/* Cache hit, nothing to do. */
+				XLogPrefetchIncrement(&SharedStats->hit);
+				block->prefetch_buffer = result.recent_buffer;
+				return LRQ_NEXT_NO_IO;
+			}
+			else if (result.initiated_io)
+			{
+				/* Cache miss, I/O (presumably) started. */
+				XLogPrefetchIncrement(&SharedStats->prefetch);
+				block->prefetch_buffer = InvalidBuffer;
+				return LRQ_NEXT_IO;
+			}
+			else
+			{
+				/*
+				 * Neither cached nor initiated.  The underlying segment file
+				 * doesn't exist (ENOENT).
+				 *
+				 * It might be missing because it was unlinked, we crashed,
+				 * and now we're replaying WAL.  Recovery will correct this
+				 * problem or complain if something is wrong.
+				 */
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+										record->lsn);
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+		}
+
+		/*
+		 * Several callsites need to be able to read exactly one record
+		 * without any internal readahead.  Examples: xlog.c reading
+		 * checkpoint records with emode set to PANIC, which might otherwise
+		 * cause XLogPageRead() to panic on some future page, and xlog.c
+		 * determining where to start writing WAL next, which depends on the
+		 * contents of the reader's internal buffer after reading one record.
+		 * Therefore, don't even think about prefetching until the first
+		 * record after XLogPrefetcherBeginRead() has been consumed.
+		 */
+		if (prefetcher->reader->decode_queue_tail &&
+			prefetcher->reader->decode_queue_tail->lsn == prefetcher->begin_ptr)
+			return LRQ_NEXT_AGAIN;
+
+		/* Advance to the next record. */
+		prefetcher->record = NULL;
+	}
+	pg_unreachable();
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 9
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&SharedStats->reset_request) != SharedStats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&SharedStats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_init));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_new));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_fpw));
+	values[6] = Int32GetDatum(SharedStats->wal_distance);
+	values[7] = Int32GetDatum(SharedStats->block_distance);
+	values[8] = Int32GetDatum(SharedStats->io_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the lower of the block numbers
+		 * there because we don't want to have to track individual blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+		filter->filter_from_block = Min(filter->filter_from_block, blockno);
+	}
+}
+
+/*
+ * Have we replayed any records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can stop filtering out accesses to a given
+ * relfilenode.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of
+	 * the time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter;
+
+		/* See if the block range is filtered. */
+		filter = hash_search(prefetcher->filter_table, &rnode, HASH_FIND, NULL);
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+
+		/* See if the whole database is filtered. */
+		rnode.relNode = InvalidOid;
+		filter = hash_search(prefetcher->filter_table, &rnode, HASH_FIND, NULL);
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * A wrapper for XLogBeginRead() that also resets the prefetcher.
+ */
+void
+XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher, XLogRecPtr recPtr)
+{
+	/* This will forget about any in-flight IO. */
+	prefetcher->reconfigure_count--;
+
+	/* Book-keeping to avoid readahead on first read. */
+	prefetcher->begin_ptr = recPtr;
+
+	prefetcher->no_readahead_until = 0;
+
+	/* This will forget about any queued up records in the decoder. */
+	XLogBeginRead(prefetcher->reader, recPtr);
+}
+
+/*
+ * A wrapper for XLogReadRecord() that provides the same interface, but also
+ * tries to initiate I/O for blocks referenced in future WAL records.
+ */
+XLogRecord *
+XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
+{
+	DecodedXLogRecord *record;
+
+	/*
+	 * See if it's time to reset the prefetching machinery, because a relevant
+	 * GUC was changed.
+	 */
+	if (unlikely(XLogPrefetchReconfigureCount != prefetcher->reconfigure_count))
+	{
+		if (prefetcher->streaming_read)
+			lrq_free(prefetcher->streaming_read);
+
+		/*
+		 * Arbitrarily look up to 4 times further ahead than the number of IOs
+		 * we're allowed to run concurrently.
+		 */
+		prefetcher->streaming_read =
+			lrq_alloc(recovery_prefetch ? maintenance_io_concurrency * 4 : 1,
+					  recovery_prefetch ? maintenance_io_concurrency : 1,
+					  (uintptr_t) prefetcher,
+					  XLogPrefetcherNextBlock);
+
+		prefetcher->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	/*
+	 * Release the last returned record, if there is one.  We need to do
+	 * this so that we can check for an empty decode queue accurately.
+	 */
+	XLogReleasePreviousRecord(prefetcher->reader);
+
+	/* If there's nothing queued yet, then start prefetching. */
+	if (!XLogReaderHasQueuedRecordOrError(prefetcher->reader))
+		lrq_prefetch(prefetcher->streaming_read);
+
+	/* Read the next record. */
+	record = XLogNextRecord(prefetcher->reader, errmsg);
+	if (!record)
+		return NULL;
+
+	/*
+	 * The record we just got is the "current" one, for the benefit of the
+	 * XLogRecXXX() macros.
+	 */
+	Assert(record == prefetcher->reader->record);
+
+	/*
+	 * Can we drop any prefetch filters yet, given the record we're about to
+	 * return?  This assumes that any records with earlier LSNs have been
+	 * replayed, so if we were waiting for a relation to be created or
+	 * extended, it is now OK to access blocks in the covered range.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, record->lsn);
+
+	/*
+	 * See if it's time to compute some statistics, because enough WAL has
+	 * been processed.
+	 */
+	if (unlikely(record->lsn >= prefetcher->next_stats_shm_lsn))
+		XLogPrefetcherComputeStats(prefetcher, record->lsn);
+
+	/*
+	 * The caller is about to replay this record, so we can now report that
+	 * any IO initiated for earlier WAL records must have finished.  This
+	 * may trigger more readahead.
+	 */
+	lrq_complete_lsn(prefetcher->streaming_read, record->lsn);
+
+	Assert(record == prefetcher->reader->record);
+
+	return &record->header;
+}
+
+bool
+check_recovery_prefetch(bool *new_value, void **extra, GucSource source)
+{
+#ifndef USE_PREFETCH
+	if (*new_value)
+	{
+		GUC_check_errdetail("recovery_prefetch must be set to off on platforms that lack posix_fadvise().");
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_recovery_prefetch(bool new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 1a8c651767..94681c3d80 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1728,6 +1728,8 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
 
+			blk->prefetch_buffer = InvalidBuffer;
+
 			COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 			/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
 			if (blk->has_data && blk->data_len == 0)
@@ -1934,6 +1936,15 @@ err:
 bool
 XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 				   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+	return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL);
+}
+
+bool
+XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+					RelFileNode *rnode, ForkNumber *forknum,
+					BlockNumber *blknum,
+					Buffer *prefetch_buffer)
 {
 	DecodedBkpBlock *bkpb;
 
@@ -1948,6 +1959,8 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 		*forknum = bkpb->forknum;
 	if (blknum)
 		*blknum = bkpb->blkno;
+	if (prefetch_buffer)
+		*prefetch_buffer = bkpb->prefetch_buffer;
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 9feea3e6ec..e5e7821c79 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -36,6 +36,7 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
@@ -183,6 +184,9 @@ static bool doRequestWalReceiverReply;
 /* XLogReader object used to parse the WAL records */
 static XLogReaderState *xlogreader = NULL;
 
+/* XLogPrefetcher object used to consume WAL records with read-ahead */
+static XLogPrefetcher *xlogprefetcher = NULL;
+
 /* Parameters passed down from ReadRecord to the XLogPageRead callback. */
 typedef struct XLogPageReadPrivate
 {
@@ -404,18 +408,21 @@ static void recoveryPausesHere(bool endOfRecovery);
 static bool recoveryApplyDelay(XLogReaderState *record);
 static void ConfirmRecoveryPaused(void);
 
-static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
-							  int emode, bool fetching_ckpt, TimeLineID replayTLI);
+static XLogRecord *ReadRecord(XLogPrefetcher *xlogprefetcher,
+							  int emode, bool fetching_ckpt,
+							  TimeLineID replayTLI);
 
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
-static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-										bool fetching_ckpt,
-										XLogRecPtr tliRecPtr,
-										TimeLineID replayTLI,
-										XLogRecPtr replayLSN);
+static XLogPageReadResult WaitForWALToBecomeAvailable(XLogRecPtr RecPtr,
+													  bool randAccess,
+													  bool fetching_ckpt,
+													  XLogRecPtr tliRecPtr,
+													  TimeLineID replayTLI,
+													  XLogRecPtr replayLSN,
+													  bool nonblocking);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
-static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+static XLogRecord *ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher, XLogRecPtr RecPtr,
 										int whichChkpt, bool report, TimeLineID replayTLI);
 static bool rescanLatestTimeLine(TimeLineID replayTLI, XLogRecPtr replayLSN);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
@@ -561,6 +568,15 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 				 errdetail("Failed while allocating a WAL reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Set the WAL decode buffer size.  This limits how far ahead we can read
+	 * in the WAL.
+	 */
+	XLogReaderSetDecodeBuffer(xlogreader, NULL, wal_decode_buffer_size);
+
+	/* Create a WAL prefetcher. */
+	xlogprefetcher = XLogPrefetcherAllocate(xlogreader);
+
 	/*
 	 * Allocate two page buffers dedicated to WAL consistency checks.  We do
 	 * it this way, rather than just making static arrays, for two reasons:
@@ -589,7 +605,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
 		 */
-		record = ReadCheckpointRecord(xlogreader, CheckPointLoc, 0, true, CheckPointTLI);
+		record = ReadCheckpointRecord(xlogprefetcher, CheckPointLoc, 0, true,
+									  CheckPointTLI);
 		if (record != NULL)
 		{
 			memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
@@ -607,8 +624,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			 */
 			if (checkPoint.redo < CheckPointLoc)
 			{
-				XLogBeginRead(xlogreader, checkPoint.redo);
-				if (!ReadRecord(xlogreader, LOG, false,
+				XLogPrefetcherBeginRead(xlogprefetcher, checkPoint.redo);
+				if (!ReadRecord(xlogprefetcher, LOG, false,
 								checkPoint.ThisTimeLineID))
 					ereport(FATAL,
 							(errmsg("could not find redo location referenced by checkpoint record"),
@@ -727,7 +744,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		CheckPointTLI = ControlFile->checkPointCopy.ThisTimeLineID;
 		RedoStartLSN = ControlFile->checkPointCopy.redo;
 		RedoStartTLI = ControlFile->checkPointCopy.ThisTimeLineID;
-		record = ReadCheckpointRecord(xlogreader, CheckPointLoc, 1, true,
+		record = ReadCheckpointRecord(xlogprefetcher, CheckPointLoc, 1, true,
 									  CheckPointTLI);
 		if (record != NULL)
 		{
@@ -1403,8 +1420,8 @@ FinishWalRecovery(void)
 		lastRec = XLogRecoveryCtl->lastReplayedReadRecPtr;
 		lastRecTLI = XLogRecoveryCtl->lastReplayedTLI;
 	}
-	XLogBeginRead(xlogreader, lastRec);
-	(void) ReadRecord(xlogreader, PANIC, false, lastRecTLI);
+	XLogPrefetcherBeginRead(xlogprefetcher, lastRec);
+	(void) ReadRecord(xlogprefetcher, PANIC, false, lastRecTLI);
 	endOfLog = xlogreader->EndRecPtr;
 
 	/*
@@ -1501,6 +1518,8 @@ ShutdownWalRecovery(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	XLogPrefetcherFree(xlogprefetcher);
+
 	if (ArchiveRecoveryRequested)
 	{
 		/*
@@ -1584,15 +1603,15 @@ PerformWalRecovery(void)
 	{
 		/* back up to find the record */
 		replayTLI = RedoStartTLI;
-		XLogBeginRead(xlogreader, RedoStartLSN);
-		record = ReadRecord(xlogreader, PANIC, false, replayTLI);
+		XLogPrefetcherBeginRead(xlogprefetcher, RedoStartLSN);
+		record = ReadRecord(xlogprefetcher, PANIC, false, replayTLI);
 	}
 	else
 	{
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogreader, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1706,7 +1725,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1922,6 +1941,9 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 		 */
 		if (AllowCascadeReplication())
 			WalSndWakeup();
+
+		/* Reset the prefetcher. */
+		XLogPrefetchReconfigure();
 	}
 }
 
@@ -2302,7 +2324,8 @@ verifyBackupPageConsistency(XLogReaderState *record)
 		 * temporary page.
 		 */
 		buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									 RBM_NORMAL_NO_LOG);
+									 RBM_NORMAL_NO_LOG,
+									 InvalidBuffer);
 		if (!BufferIsValid(buf))
 			continue;
 
@@ -2914,17 +2937,18 @@ ConfirmRecoveryPaused(void)
  * Attempt to read the next XLOG record.
  *
  * Before first call, the reader needs to be positioned to the first record
- * by calling XLogBeginRead().
+ * by calling XLogPrefetcherBeginRead().
  *
  * If no valid record is available, returns NULL, or fails if emode is PANIC.
  * (emode must be either PANIC, LOG). In standby mode, retries until a valid
  * record is available.
  */
 static XLogRecord *
-ReadRecord(XLogReaderState *xlogreader, int emode,
+ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		   bool fetching_ckpt, TimeLineID replayTLI)
 {
 	XLogRecord *record;
+	XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher);
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 
 	/* Pass through parameters to XLogPageRead */
@@ -2940,7 +2964,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	{
 		char	   *errormsg;
 
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
 		{
 			/*
@@ -3073,6 +3097,9 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
  * and call XLogPageRead() again with the same arguments. This lets
  * XLogPageRead() to try fetching the record from another source, or to
  * sleep and retry.
+ *
+ * While prefetching, xlogreader->nonblocking may be set.  In that case,
+ * return XLREAD_WOULDBLOCK if we'd otherwise have to wait.
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
@@ -3122,20 +3149,31 @@ retry:
 		(readSource == XLOG_FROM_STREAM &&
 		 flushedUpto < targetPagePtr + reqLen))
 	{
-		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
-										 private->randAccess,
-										 private->fetching_ckpt,
-										 targetRecPtr,
-										 private->replayTLI,
-										 xlogreader->EndRecPtr))
+		if (readFile >= 0 &&
+			xlogreader->nonblocking &&
+			readSource == XLOG_FROM_STREAM &&
+			flushedUpto < targetPagePtr + reqLen)
+			return XLREAD_WOULDBLOCK;
+
+		switch (WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
+											private->randAccess,
+											private->fetching_ckpt,
+											targetRecPtr,
+											private->replayTLI,
+											xlogreader->EndRecPtr,
+											xlogreader->nonblocking))
 		{
-			if (readFile >= 0)
-				close(readFile);
-			readFile = -1;
-			readLen = 0;
-			readSource = XLOG_FROM_ANY;
-
-			return -1;
+			case XLREAD_WOULDBLOCK:
+				return XLREAD_WOULDBLOCK;
+			case XLREAD_FAIL:
+				if (readFile >= 0)
+					close(readFile);
+				readFile = -1;
+				readLen = 0;
+				readSource = XLOG_FROM_ANY;
+				return XLREAD_FAIL;
+			case XLREAD_SUCCESS:
+				break;
 		}
 	}
 
@@ -3260,7 +3298,7 @@ next_record_is_invalid:
 	if (StandbyMode)
 		goto retry;
 	else
-		return -1;
+		return XLREAD_FAIL;
 }
 
 /*
@@ -3292,11 +3330,15 @@ next_record_is_invalid:
  * containing it (if not open already), and returns true. When end of standby
  * mode is triggered by the user, and there is no more WAL available, returns
  * false.
+ *
+ * If nonblocking is true, then give up immediately if we can't satisfy the
+ * request, returning XLREAD_WOULDBLOCK instead of waiting.
  */
-static bool
+static XLogPageReadResult
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							bool fetching_ckpt, XLogRecPtr tliRecPtr,
-							TimeLineID replayTLI, XLogRecPtr replayLSN)
+							TimeLineID replayTLI, XLogRecPtr replayLSN,
+							bool nonblocking)
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
@@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 */
 		if (lastSourceFailed)
 		{
+			/*
+			 * Don't allow any retry loops to occur during nonblocking
+			 * readahead.  Let the caller process everything that has been
+			 * decoded already first.
+			 */
+			if (nonblocking)
+				return XLREAD_WOULDBLOCK;
+
 			switch (currentSource)
 			{
 				case XLOG_FROM_ARCHIVE:
@@ -3364,7 +3414,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (StandbyMode && CheckForStandbyTrigger())
 					{
 						XLogShutdownWalRcv();
-						return false;
+						return XLREAD_FAIL;
 					}
 
 					/*
@@ -3372,7 +3422,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * and pg_wal.
 					 */
 					if (!StandbyMode)
-						return false;
+						return XLREAD_FAIL;
 
 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
@@ -3516,7 +3566,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 											  currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
 											  currentSource);
 				if (readFile >= 0)
-					return true;	/* success! */
+					return XLREAD_SUCCESS;	/* success! */
 
 				/*
 				 * Nope, not found in archive or pg_wal.
@@ -3671,11 +3721,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							/* just make sure source info is correct... */
 							readSource = XLOG_FROM_STREAM;
 							XLogReceiptSource = XLOG_FROM_STREAM;
-							return true;
+							return XLREAD_SUCCESS;
 						}
 						break;
 					}
 
+					/* In nonblocking mode, return rather than sleeping. */
+					if (nonblocking)
+						return XLREAD_WOULDBLOCK;
+
 					/*
 					 * Data not here yet. Check for trigger, then wait for
 					 * walreceiver to wake us up when new WAL arrives.
@@ -3683,13 +3737,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (CheckForStandbyTrigger())
 					{
 						/*
-						 * Note that we don't "return false" immediately here.
-						 * After being triggered, we still want to replay all
-						 * the WAL that was already streamed. It's in pg_wal
-						 * now, so we just treat this as a failure, and the
-						 * state machine will move on to replay the streamed
-						 * WAL from pg_wal, and then recheck the trigger and
-						 * exit replay.
+						 * Note that we don't return XLREAD_FAIL immediately
+						 * here. After being triggered, we still want to
+						 * replay all the WAL that was already streamed. It's
+						 * in pg_wal now, so we just treat this as a failure,
+						 * and the state machine will move on to replay the
+						 * streamed WAL from pg_wal, and then recheck the
+						 * trigger and exit replay.
 						 */
 						lastSourceFailed = true;
 						break;
@@ -3740,7 +3794,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		HandleStartupProcInterrupts();
 	}
 
-	return false;				/* not reached */
+	return XLREAD_FAIL;				/* not reached */
 }
 
 
@@ -3785,7 +3839,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
  * 1 for "primary", 0 for "other" (backup_label)
  */
 static XLogRecord *
-ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher, XLogRecPtr RecPtr,
 					 int whichChkpt, bool report, TimeLineID replayTLI)
 {
 	XLogRecord *record;
@@ -3812,8 +3866,8 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 		return NULL;
 	}
 
-	XLogBeginRead(xlogreader, RecPtr);
-	record = ReadRecord(xlogreader, LOG, true, replayTLI);
+	XLogPrefetcherBeginRead(xlogprefetcher, RecPtr);
+	record = ReadRecord(xlogprefetcher, LOG, true, replayTLI);
 
 	if (record == NULL)
 	{
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 511f2f186f..ea22577b41 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -22,6 +22,7 @@
 #include "access/timeline.h"
 #include "access/xlogrecovery.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -355,11 +356,13 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber blkno;
+	Buffer		prefetch_buffer;
 	Page		page;
 	bool		zeromode;
 	bool		willinit;
 
-	if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+	if (!XLogRecGetBlockInfo(record, block_id, &rnode, &forknum, &blkno,
+							 &prefetch_buffer))
 	{
 		/* Caller specified a bogus block_id */
 		elog(PANIC, "failed to locate backup block with ID %d", block_id);
@@ -381,7 +384,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	{
 		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
+									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
+									  prefetch_buffer);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			elog(ERROR, "failed to restore block image");
@@ -410,7 +414,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	}
 	else
 	{
-		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode);
+		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode, prefetch_buffer);
 		if (BufferIsValid(*buf))
 		{
 			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
@@ -450,6 +454,10 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  * exist, and we don't check for all-zeroes.  Thus, no log entry is made
  * to imply that the page should be dropped or truncated later.
  *
+ * Optionally, recent_buffer can be used to provide a hint about the location
+ * of the page in the buffer pool; it does not have to be correct, but avoids
+ * a buffer mapping table probe if it is.
+ *
  * NB: A redo function should normally not call this directly. To get a page
  * to modify, use XLogReadBufferForRedoExtended instead. It is important that
  * all pages modified by a WAL record are registered in the WAL records, or
@@ -457,7 +465,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  */
 Buffer
 XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-					   BlockNumber blkno, ReadBufferMode mode)
+					   BlockNumber blkno, ReadBufferMode mode,
+					   Buffer recent_buffer)
 {
 	BlockNumber lastblock;
 	Buffer		buffer;
@@ -465,6 +474,15 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 
 	Assert(blkno != P_NEW);
 
+	/* Do we have a clue where the buffer might be already? */
+	if (BufferIsValid(recent_buffer) &&
+		mode == RBM_NORMAL &&
+		ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+	{
+		buffer = recent_buffer;
+		goto recent_buffer_fast_path;
+	}
+
 	/* Open the relation at smgr level */
 	smgr = smgropen(rnode, InvalidBackendId);
 
@@ -523,6 +541,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 		}
 	}
 
+recent_buffer_fast_path:
 	if (mode == RBM_NORMAL)
 	{
 		/* check that page has been initialized */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 40b7bca5a9..4608140bb5 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -905,6 +905,19 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.hit,
+            s.skip_init,
+            s.skip_new,
+            s.skip_fpw,
+            s.wal_distance,
+            s.block_distance,
+            s.io_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 78c073b7c9..d41ae37090 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -211,7 +211,8 @@ XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 	blkno = fsm_logical_to_physical(addr);
 
 	/* If the page doesn't exist already, extend */
-	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR);
+	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR,
+								 InvalidBuffer);
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(buf);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index cd4ebe2fc5..17f54b153b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -119,6 +120,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, LockShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
+	size = add_size(size, XLogPrefetchShmemSize());
 	size = add_size(size, XLOGShmemSize());
 	size = add_size(size, XLogRecoveryShmemSize());
 	size = add_size(size, CLOGShmemSize());
@@ -243,6 +245,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	XLogRecoveryShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e7f0a380e6..38f93f204f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -41,6 +41,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
@@ -215,6 +216,7 @@ static bool check_effective_io_concurrency(int *newval, void **extra, GucSource
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
 static bool check_client_connection_check_interval(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -1321,6 +1323,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch referenced blocks during recovery."),
+			gettext_noop("Read ahead of the current replay position to find uncached blocks.")
+		},
+		&recovery_prefetch,
+		false,
+		check_recovery_prefetch, assign_recovery_prefetch, NULL
+	},
 
 	{
 		{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2792,6 +2803,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
+			gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
+			GUC_UNIT_BYTE
+		},
+		&wal_decode_buffer_size,
+		512 * 1024, 64 * 1024, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the size of WAL files held for standby servers."),
@@ -3115,7 +3137,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -12211,6 +12234,20 @@ check_client_connection_check_interval(int *newval, void **extra, GucSource sour
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4cf5b26a36..0a6c7bd83e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -241,6 +241,11 @@
 #max_wal_size = 1GB
 #min_wal_size = 80MB
 
+# - Prefetching during recovery -
+
+#wal_decode_buffer_size = 512kB		# lookahead window used for prefetching
+#recovery_prefetch = off		# prefetch pages referenced in the WAL?
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 09f6464331..1df9dd2fbe 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -50,6 +50,7 @@ extern bool *wal_consistency_checking;
 extern char *wal_consistency_checking_string;
 extern bool log_checkpoints;
 extern bool track_wal_io_timing;
+extern int	wal_decode_buffer_size;
 
 extern int	CheckPointSegments;
 
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
new file mode 100644
index 0000000000..f5bdb920d5
--- /dev/null
+++ b/src/include/access/xlogprefetcher.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetcher.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCHER_H
+#define XLOGPREFETCHER_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern bool recovery_prefetch;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+
+extern void XLogPrefetchReconfigure(void);
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchRequestResetStats(void);
+
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogReaderState *reader);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+
+extern XLogReaderState *XLogPrefetcherReader(XLogPrefetcher *prefetcher);
+
+extern void XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
+									XLogRecPtr recPtr);
+
+extern XLogRecord *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
+											char **errmsg);
+
+#endif
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index d1f364f4e8..8446050225 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -39,6 +39,7 @@
 #endif
 
 #include "access/xlogrecord.h"
+#include "storage/buf.h"
 
 /* WALOpenSegment represents a WAL segment being read. */
 typedef struct WALOpenSegment
@@ -125,6 +126,9 @@ typedef struct
 	ForkNumber	forknum;
 	BlockNumber blkno;
 
+	/* Prefetching workspace. */
+	Buffer		prefetch_buffer;
+
 	/* copy of the fork_flags field from the XLogRecordBlockHeader */
 	uint8		flags;
 
@@ -427,5 +431,9 @@ extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *
 extern bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 							   RelFileNode *rnode, ForkNumber *forknum,
 							   BlockNumber *blknum);
+extern bool XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+								RelFileNode *rnode, ForkNumber *forknum,
+								BlockNumber *blknum,
+								Buffer *prefetch_buffer);
 
 #endif							/* XLOGREADER_H */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 64708949db..ff40f96e42 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -84,7 +84,8 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 													Buffer *buf);
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-									 BlockNumber blkno, ReadBufferMode mode);
+									 BlockNumber blkno, ReadBufferMode mode,
+									 Buffer recent_buffer);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d8e8715ed1..534ad0a5fb 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6360,6 +6360,14 @@
   prorettype => 'text', proargtypes => '',
   prosrc => 'pg_get_wal_replay_pause_state' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,int4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,hit,skip_init,skip_new,skip_fpw,wal_distance,block_distance,io_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ea774968f0..de59b08772 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -450,4 +450,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetcher.c */
+extern bool check_recovery_prefetch(bool *new_value, void **extra, GucSource source);
+extern void assign_recovery_prefetch(bool new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac468568a1..8ad54191cd 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1857,6 +1857,16 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.hit,
+    s.skip_init,
+    s.skip_new,
+    s.skip_fpw,
+    s.wal_distance,
+    s.block_distance,
+    s.io_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, hit, skip_init, skip_new, skip_fpw, wal_distance, block_distance, io_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f57f7e0f53..3a008ef433 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1408,6 +1408,9 @@ LogicalRepWorker
 LogicalRewriteMappingData
 LogicalTape
 LogicalTapeSet
+LsnReadQueue
+LsnReadQueueNextFun
+LsnReadQueueNextStatus
 LtreeGistOptions
 LtreeSignature
 MAGIC
@@ -2941,6 +2944,10 @@ XLogPageHeaderData
 XLogPageReadCB
 XLogPageReadPrivate
 XLogPageReadResult
+XLogPrefetcher
+XLogPrefetcherFilter
+XLogPrefetchState
+XLogPrefetchStats
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.30.2
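
For anyone who wants to try this out, here is a minimal usage sketch based on the GUCs and view added by the patch above (illustrative only, not part of the patch): set recovery_prefetch = on in postgresql.conf and reload (wal_decode_buffer_size is PGC_POSTMASTER, so changing it needs a restart), then poll the new view from any backend on a hot standby while it replays WAL.

-- Sketch only: all names come from the patch above.
-- maintenance_io_concurrency caps the number of prefetches in flight.
SHOW recovery_prefetch;
SHOW wal_decode_buffer_size;
SHOW maintenance_io_concurrency;

SELECT stats_reset, prefetch, hit,
       skip_init, skip_new, skip_fpw,
       wal_distance, block_distance, io_depth
  FROM pg_stat_prefetch_recovery;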

#152Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#151)
1 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Fri, Mar 11, 2022 at 6:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Thanks for your review of 0001! It gave me a few things to think
about and some good improvements.

And just in case it's useful, here's what changed between v21 and v22..

Attachments:

change-after-juliens-review.txt (text/plain; charset=US-ASCII)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 86a7b4c5c8..0d0c556b7c 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -90,8 +90,8 @@ XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size)
 
 	state->decode_buffer = buffer;
 	state->decode_buffer_size = size;
-	state->decode_buffer_head = buffer;
 	state->decode_buffer_tail = buffer;
+	state->decode_buffer_head = buffer;
 }
 
 /*
@@ -271,7 +271,7 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
 
 /*
  * See if we can release the last record that was returned by
- * XLogNextRecord(), to free up space.
+ * XLogNextRecord(), if any, to free up space.
  */
 void
 XLogReleasePreviousRecord(XLogReaderState *state)
@@ -283,16 +283,16 @@ XLogReleasePreviousRecord(XLogReaderState *state)
 
 	/*
 	 * Remove it from the decoded record queue.  It must be the oldest item
-	 * decoded, decode_queue_tail.
+	 * decoded, decode_queue_head.
 	 */
 	record = state->record;
-	Assert(record == state->decode_queue_tail);
+	Assert(record == state->decode_queue_head);
 	state->record = NULL;
-	state->decode_queue_tail = record->next;
+	state->decode_queue_head = record->next;
 
-	/* It might also be the newest item decoded, decode_queue_head. */
-	if (state->decode_queue_head == record)
-		state->decode_queue_head = NULL;
+	/* It might also be the newest item decoded, decode_queue_tail. */
+	if (state->decode_queue_tail == record)
+		state->decode_queue_tail = NULL;
 
 	/* Release the space. */
 	if (unlikely(record->oversized))
@@ -302,11 +302,11 @@ XLogReleasePreviousRecord(XLogReaderState *state)
 	}
 	else
 	{
-		/* It must be the tail record in the decode buffer. */
-		Assert(state->decode_buffer_tail == (char *) record);
+		/* It must be the head (oldest) record in the decode buffer. */
+		Assert(state->decode_buffer_head == (char *) record);
 
 		/*
-		 * We need to update tail to point to the next record that is in the
+		 * We need to update head to point to the next record that is in the
 		 * decode buffer, if any, being careful to skip oversized ones
 		 * (they're not in the decode buffer).
 		 */
@@ -316,8 +316,8 @@ XLogReleasePreviousRecord(XLogReaderState *state)
 
 		if (record)
 		{
-			/* Adjust tail to release space up to the next record. */
-			state->decode_buffer_tail = (char *) record;
+			/* Adjust head to release space up to the next record. */
+			state->decode_buffer_head = (char *) record;
 		}
 		else
 		{
@@ -327,8 +327,8 @@ XLogReleasePreviousRecord(XLogReaderState *state)
 			 * we'll keep overwriting the same piece of memory if we're not
 			 * doing any prefetching.
 			 */
-			state->decode_buffer_tail = state->decode_buffer;
 			state->decode_buffer_head = state->decode_buffer;
+			state->decode_buffer_tail = state->decode_buffer;
 		}
 	}
 }
@@ -351,7 +351,7 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
 	/* Release the last record returned by XLogNextRecord(). */
 	XLogReleasePreviousRecord(state);
 
-	if (state->decode_queue_tail == NULL)
+	if (state->decode_queue_head == NULL)
 	{
 		*errormsg = NULL;
 		if (state->errormsg_deferred)
@@ -376,7 +376,7 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
 	 * XLogRecXXX(xlogreader) macros, which work with the decoder rather than
 	 * the record for historical reasons.
 	 */
-	state->record = state->decode_queue_tail;
+	state->record = state->decode_queue_head;
 
 	/*
 	 * Update the pointers to the beginning and one-past-the-end of this
@@ -428,12 +428,12 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	if (!XLogReaderHasQueuedRecordOrError(state))
 		XLogReadAhead(state, false /* nonblocking */ );
 
-	/* Consume the tail record or error. */
+	/* Consume the head record or error. */
 	decoded = XLogNextRecord(state, errormsg);
 	if (decoded)
 	{
 		/*
-		 * XLogReadRecord() returns a pointer to the record's header, not the
+		 * This function returns a pointer to the record's header, not the
 		 * actual decoded record.  The caller will access the decoded record
 		 * through the XLogRecGetXXX() macros, which reach the decoded
 		 * recorded as xlogreader->record.
@@ -451,6 +451,11 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
  * decoded record wouldn't fit in the decode buffer and must eventually be
  * freed explicitly.
  *
+ * The caller is responsible for adjusting decode_buffer_tail with the real
+ * size after successfully decoding a record into this space.  This way, if
+ * decoding fails, then there is nothing to undo unless the 'oversized' flag
+ * was set and pfree() must be called.
+ *
  * Return NULL if there is no space in the decode buffer and allow_oversized
  * is false, or if memory allocation fails for an oversized buffer.
  */
@@ -470,21 +475,23 @@ XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversi
 		state->decode_buffer_tail = state->decode_buffer;
 		state->free_decode_buffer = true;
 	}
-	if (state->decode_buffer_head >= state->decode_buffer_tail)
+
+	/* Try to allocate space in the circular decode buffer. */
+	if (state->decode_buffer_tail >= state->decode_buffer_head)
 	{
-		/* Empty, or head is to the right of tail. */
-		if (state->decode_buffer_head + required_space <=
+		/* Empty, or tail is to the right of head. */
+		if (state->decode_buffer_tail + required_space <=
 			state->decode_buffer + state->decode_buffer_size)
 		{
-			/* There is space between head and end. */
-			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			/* There is space between tail and end. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_tail;
 			decoded->oversized = false;
 			return decoded;
 		}
 		else if (state->decode_buffer + required_space <
-				 state->decode_buffer_tail)
+				 state->decode_buffer_head)
 		{
-			/* There is space between start and tail. */
+			/* There is space between start and head. */
 			decoded = (DecodedXLogRecord *) state->decode_buffer;
 			decoded->oversized = false;
 			return decoded;
@@ -492,12 +499,12 @@ XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversi
 	}
 	else
 	{
-		/* Head is to the left of tail. */
-		if (state->decode_buffer_head + required_space <
-			state->decode_buffer_tail)
+		/* Tail is to the left of head. */
+		if (state->decode_buffer_tail + required_space <
+			state->decode_buffer_head)
 		{
-			/* There is space between head and tail. */
-			decoded = (DecodedXLogRecord *) state->decode_buffer_head;
+			/* There is space between tail and head. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_tail;
 			decoded->oversized = false;
 			return decoded;
 		}
@@ -513,7 +520,7 @@ XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversi
 		return decoded;
 	}
 
-	return decoded;
+	return NULL;
 }
 
 static XLogPageReadResult
@@ -748,7 +755,6 @@ restart:
 			if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD)
 			{
 				state->overwrittenRecPtr = RecPtr;
-				//ResetDecoder(state);
 				RecPtr = targetPagePtr;
 				goto restart;
 			}
@@ -865,18 +871,18 @@ restart:
 			/* The new decode buffer head must be MAXALIGNed. */
 			Assert(decoded->size == MAXALIGN(decoded->size));
 			if ((char *) decoded == state->decode_buffer)
-				state->decode_buffer_head = state->decode_buffer + decoded->size;
+				state->decode_buffer_tail = state->decode_buffer + decoded->size;
 			else
-				state->decode_buffer_head += decoded->size;
+				state->decode_buffer_tail += decoded->size;
 		}
 
 		/* Insert it into the queue of decoded records. */
-		Assert(state->decode_queue_head != decoded);
-		if (state->decode_queue_head)
-			state->decode_queue_head->next = decoded;
-		state->decode_queue_head = decoded;
-		if (!state->decode_queue_tail)
-			state->decode_queue_tail = decoded;
+		Assert(state->decode_queue_tail != decoded);
+		if (state->decode_queue_tail)
+			state->decode_queue_tail->next = decoded;
+		state->decode_queue_tail = decoded;
+		if (!state->decode_queue_head)
+			state->decode_queue_head = decoded;
 		return XLREAD_SUCCESS;
 	}
 	else
@@ -935,8 +941,8 @@ XLogReadAhead(XLogReaderState *state, bool nonblocking)
 	result = XLogDecodeNextRecord(state, nonblocking);
 	if (result == XLREAD_SUCCESS)
 	{
-		Assert(state->decode_queue_head != NULL);
-		return state->decode_queue_head;
+		Assert(state->decode_queue_tail != NULL);
+		return state->decode_queue_tail;
 	}
 
 	return NULL;
@@ -946,8 +952,14 @@ XLogReadAhead(XLogReaderState *state, bool nonblocking)
  * Read a single xlog page including at least [pageptr, reqLen] of valid data
  * via the page_read() callback.
  *
- * Returns -1 if the required page cannot be read for some reason; errormsg_buf
- * is set in that case (unless the error occurs in the page_read callback).
+ * Returns XLREAD_FAIL if the required page cannot be read for some
+ * reason; errormsg_buf is set in that case (unless the error occurs in the
+ * page_read callback).
+ *
+ * Returns XLREAD_WOULDBLOCK if the requested data can't be read without
+ * waiting.  This can be returned only if the installed page_read callback
+ * respects the state->nonblocking flag, and cannot read the requested data
+ * immediately.
  *
  * We fetch the page from a reader-local cache if we know we have the required
  * data and if there hasn't been any error since caching the data.
@@ -1334,6 +1346,9 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 
 	Assert(!XLogRecPtrIsInvalid(RecPtr));
 
+	/* Make sure ReadPageInternal() can't return XLREAD_WOULDBLOCK. */
+	state->nonblocking = false;
+
 	/*
 	 * skip over potential continuation data, keeping in mind that it may span
 	 * multiple pages
@@ -1544,19 +1559,19 @@ ResetDecoder(XLogReaderState *state)
 	DecodedXLogRecord *r;
 
 	/* Reset the decoded record queue, freeing any oversized records. */
-	while ((r = state->decode_queue_tail))
+	while ((r = state->decode_queue_head) != NULL)
 	{
-		state->decode_queue_tail = r->next;
+		state->decode_queue_head = r->next;
 		if (r->oversized)
 			pfree(r);
 	}
-	state->decode_queue_head = NULL;
 	state->decode_queue_tail = NULL;
+	state->decode_queue_head = NULL;
 	state->record = NULL;
 
 	/* Reset the decode buffer to empty. */
-	state->decode_buffer_head = state->decode_buffer;
 	state->decode_buffer_tail = state->decode_buffer;
+	state->decode_buffer_head = state->decode_buffer;
 
 	/* Clear error state. */
 	state->errormsg_buf[0] = '\0';
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 44d9313422..ea22577b41 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -373,7 +373,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * going to initialize it. And vice versa.
 	 */
 	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
-	willinit = (record->record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
+	willinit = (XLogRecGetBlock(record, block_id)->flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
 	if (!willinit && zeromode)
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index c129df44ac..a33ad034c0 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -403,14 +403,13 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
 	 * Calculate the amount of FPI data in the record.
 	 *
 	 * XXX: We peek into xlogreader's private decoded backup blocks for the
-	 * bimg_len indicating the length of FPI data. It doesn't seem worth it to
-	 * add an accessor macro for this.
+	 * bimg_len indicating the length of FPI data.
 	 */
 	*fpi_len = 0;
 	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (XLogRecHasBlockImage(record, block_id))
-			*fpi_len += record->record->blocks[block_id].bimg_len;
+			*fpi_len += XLogRecGetBlock(record, block_id)->bimg_len;
 	}
 
 	/*
@@ -552,7 +551,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				   blk);
 			if (XLogRecHasBlockImage(record, block_id))
 			{
-				uint8		bimg_info = record->record->blocks[block_id].bimg_info;
+				uint8		bimg_info = XLogRecGetBlock(record, block_id)->bimg_info;
 
 				if (BKPIMAGE_COMPRESSED(bimg_info))
 				{
@@ -569,11 +568,11 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 						   "compression saved: %u, method: %s",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->record->blocks[block_id].hole_offset,
-						   record->record->blocks[block_id].hole_length,
+						   XLogRecGetBlock(record, block_id)->hole_offset,
+						   XLogRecGetBlock(record, block_id)->hole_length,
 						   BLCKSZ -
-						   record->record->blocks[block_id].hole_length -
-						   record->record->blocks[block_id].bimg_len,
+						   XLogRecGetBlock(record, block_id)->hole_length -
+						   XLogRecGetBlock(record, block_id)->bimg_len,
 						   method);
 				}
 				else
@@ -581,8 +580,8 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 					printf(" (FPW%s); hole: offset: %u, length: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->record->blocks[block_id].hole_offset,
-						   record->record->blocks[block_id].hole_length);
+						   XLogRecGetBlock(record, block_id)->hole_offset,
+						   XLogRecGetBlock(record, block_id)->hole_length);
 				}
 			}
 			putchar('\n');
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 86a26a9231..8446050225 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -249,16 +249,16 @@ struct XLogReaderState
 	char	   *decode_buffer;
 	size_t		decode_buffer_size;
 	bool		free_decode_buffer; /* need to free? */
-	char	   *decode_buffer_head; /* write head */
-	char	   *decode_buffer_tail; /* read head */
+	char	   *decode_buffer_head; /* data is read from the head */
+	char	   *decode_buffer_tail; /* new data is written at the tail */
 
 	/*
 	 * Queue of records that have been decoded.  This is a linked list that
 	 * usually consists of consecutive records in decode_buffer, but may also
 	 * contain oversized records allocated with palloc().
 	 */
-	DecodedXLogRecord *decode_queue_head;	/* newest decoded record */
-	DecodedXLogRecord *decode_queue_tail;	/* oldest decoded record */
+	DecodedXLogRecord *decode_queue_head;	/* oldest decoded record */
+	DecodedXLogRecord *decode_queue_tail;	/* newest decoded record */
 
 	/*
 	 * Buffer for currently read page (XLOG_BLCKSZ bytes, valid up to at least
@@ -350,7 +350,7 @@ extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
 
 /* Return values from XLogPageReadCB. */
-typedef enum XLogPageReadResultResult
+typedef enum XLogPageReadResult
 {
 	XLREAD_SUCCESS = 0,			/* record is successfully read */
 	XLREAD_FAIL = -1,			/* failed during reading a record */
#153Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#151)
Re: WIP: WAL prefetch (another approach)

On March 10, 2022 9:31:13 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:

The other thing I need to change is that I should turn on
recovery_prefetch for platforms that support it (ie Linux and maybe
NetBSD only for now), in the tests.

Could a setting of "try" make sense?

#154Julien Rouhaud
rjuju123@gmail.com
In reply to: Thomas Munro (#151)
Re: WIP: WAL prefetch (another approach)

On Fri, Mar 11, 2022 at 06:31:13PM +1300, Thomas Munro wrote:

On Wed, Mar 9, 2022 at 7:47 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

This could use XLogRecGetBlock? Note that this macro is for now never used.
xlogreader.c also has some similar forgotten code that could use
XLogRecMaxBlockId.

That is true, but I was thinking of it like this: most of the existing
code that interacts with xlogreader.c is working with the old model,
where the XLogReader object holds only one "current" record. For that
reason the XLogRecXXX() macros continue to work as before, implicitly
referring to the record that XLogReadRecord() most recently returned.
For xlogreader.c code, I prefer not to use the XLogRecXXX() macros,
even when referring to the "current" record, since xlogreader.c has
switched to a new multi-record model. In other words, they're sort of
'old API' accessors provided for continuity. Does this make sense?

Ah I see, it does make sense. I'm wondering whether there should be a comment
at the top of the file to mention it, as otherwise someone may be tempted to
change it to avoid the record->record->xxx usage.
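
To spell the convention out for anyone skimming the thread, something like
this (a sketch, not code from the patches; the function names are invented):

/*
 * Outside xlogreader.c: redo-side code keeps using the 'old API' macros,
 * which implicitly refer to the record last returned by XLogReadRecord().
 */
static void
example_redo_side(XLogReaderState *record)
{
	for (uint8 block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
	{
		if (XLogRecHasBlockImage(record, block_id))
		{
			/* e.g. look at XLogRecGetBlock(record, block_id)->bimg_len */
		}
	}
}

/*
 * Inside xlogreader.c: the multi-record model is explicit, so the code goes
 * through the DecodedXLogRecord directly (hence record->record->xxx when the
 * XLogReaderState variable happens to be called "record").
 */
static void
example_reader_internal(XLogReaderState *state)
{
	DecodedXLogRecord *decoded = state->record;

	if (decoded != NULL && decoded->max_block_id >= 0)
	{
		/* e.g. look at decoded->blocks[0].bimg_len */
	}
}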

+DecodedXLogRecord *
+XLogNextRecord(XLogReaderState *state, char **errormsg)
+{
[...]
+       /*
+        * state->EndRecPtr is expected to have been set by the last call to
+        * XLogBeginRead() or XLogNextRecord(), and is the location of the
+        * error.
+        */
+
+       return NULL;

The comment should refer to XLogFindNextRecord, not XLogNextRecord?

No, it does mean to refer to the XLogNextRecord() (ie the last time
you called XLogNextRecord and successfully dequeued a record, we put
its end LSN there, so if there is a deferred error, that's the
corresponding LSN). Make sense?

It does, thanks!

Also, is it worth an assert (likely at the top of the function) for that?

How could I assert that EndRecPtr has the right value?

Sorry, I meant to assert that some value was assigned (!XLogRecPtrIsInvalid).
It can only make sure that the first call is done after XLogBeginRead /
XLogFindNextRecord, but that's better than nothing and consistent with the top
comment.
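
Concretely, what the assertion would protect is a caller pattern like this
(purely illustrative, names invented):

static void
consume_next_record(XLogReaderState *reader)
{
	DecodedXLogRecord *record;
	char	   *errormsg = NULL;

	/*
	 * XLogBeginRead() or XLogFindNextRecord() must already have been called,
	 * so reader->EndRecPtr is already valid; that is what the proposed
	 * Assert(!XLogRecPtrIsInvalid(...)) in XLogNextRecord() checks.
	 */
	record = XLogNextRecord(reader, &errormsg);
	if (record == NULL)
	{
		if (errormsg)
			elog(WARNING, "could not read WAL at %X/%X: %s",
				 LSN_FORMAT_ARGS(reader->EndRecPtr), errormsg);
		return;
	}

	/* ... hand the record to redo ... */
}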

+   if (unlikely(state->decode_buffer == NULL))
+   {
+       if (state->decode_buffer_size == 0)
+           state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+       state->decode_buffer = palloc(state->decode_buffer_size);
+       state->decode_buffer_head = state->decode_buffer;
+       state->decode_buffer_tail = state->decode_buffer;
+       state->free_decode_buffer = true;
+   }

Maybe change XLogReaderSetDecodeBuffer to also handle allocation and use it
here too? Otherwise XLogReaderSetDecodeBuffer should probably go in 0002 as
the only caller is the recovery prefetching.

I don't think it matters much?

The thing is that for now the only caller to XLogReaderSetDecodeBuffer (in
0002) only uses it to set the length, so a buffer is actually never passed to
that function. Since frontend code can rely on a palloc emulation, is there
really a use case to use e.g. some stack buffer there, or something in a
specific memory context? Those seem to be the only use cases for having
XLogReaderSetDecodeBuffer() rather than simply an
XLogReaderSetDecodeBufferSize(). But overall I agree it doesn't matter much,
so no objection to keeping it as-is.
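
For reference, the two calling styles the function as written supports would
look like this (sketch only; the sizes and the static buffer are invented):

/* Size-only, as 0002 does: xlogreader pallocs the buffer lazily on first use. */
static void
set_decode_buffer_size_only(XLogReaderState *xlogreader)
{
	XLogReaderSetDecodeBuffer(xlogreader, NULL, 512 * 1024);
}

/* Caller-supplied memory: non-oversized records are decoded into this region. */
static char static_decode_space[128 * 1024];

static void
set_decode_buffer_external(XLogReaderState *xlogreader)
{
	XLogReaderSetDecodeBuffer(xlogreader, static_decode_space,
							  sizeof(static_decode_space));
}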

It's also not particularly obvious why XLogFindNextRecord() doesn't check for
this value. AFAICS callers don't (and should never) call it with a
nonblocking == true state, maybe add an assert for that?

Fair point. I have now explicitly cleared that flag. (I don't much
like state->nonblocking, which might be better as an argument to
page_read(), but in fact I don't like the fact that page_read
callbacks are blocking in the first place, which is why I liked
Horiguchi-san's patch to get rid of that... but that can be a subject
for later work.)

Agreed.

static void
ResetDecoder(XLogReaderState *state)
{
[...]
+   /* Reset the decoded record queue, freeing any oversized records. */
+   while ((r = state->decode_queue_tail))

nit: I think it's better to explicitly check for the assignment being != NULL,
and existing code is more frequently written this way AFAICS.

I think it's perfectly normal idiomatic C, but if you think it's
clearer that way, OK, done like that.

The thing I don't like about this form is that you can never be sure that an
assignment was really meant unless you read the rest of the nearby code. Other
than that, agreed, it's perfectly normal idiomatic C.
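
Spelled out side by side with a throwaway list type (illustration only, not
code from the patches):

#include <stdlib.h>

typedef struct Node { struct Node *next; } Node;

static void
drain_terse(Node **head)
{
	Node	   *r;

	/* The assignment itself is the loop condition. */
	while ((r = *head))
	{
		*head = r->next;
		free(r);
	}
}

static void
drain_explicit(Node **head)
{
	Node	   *r;

	/* The explicit != NULL makes it obvious the assignment is intentional. */
	while ((r = *head) != NULL)
	{
		*head = r->next;
		free(r);
	}
}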

I realised that this version has broken -DWAL_DEBUG. I'll fix that
shortly, but I wanted to post this update ASAP, so here's a new
version.

+ * Returns XLREAD_WOULDBLOCK if he requested data can't be read without
+ * waiting.  This can be returned only if the installed page_read callback

typo: "the" requested data.

Other than that it all looks good to me!

The other thing I need to change is that I should turn on
recovery_prefetch for platforms that support it (ie Linux and maybe
NetBSD only for now), in the tests. Right now you need to put
recovery_prefetch=on in a file and then run the tests with
"TEMP_CONFIG=path_to_that make -C src/test/recovery check" to
exercise much of 0002.

+1 with Andres' idea to have a "try" setting.

#155Thomas Munro
thomas.munro@gmail.com
In reply to: Julien Rouhaud (#154)
2 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Fri, Mar 11, 2022 at 9:27 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

Also, is it worth an assert (likely at the top of the function) for that?

How could I assert that EndRecPtr has the right value?

Sorry, I meant to assert that some value was assigned (!XLogRecPtrIsInvalid).
It can only make sure that the first call is done after XLogBeginRead /
XLogFindNextRecord, but that's better than nothing and consistent with the top
comment.

Done.

+ * Returns XLREAD_WOULDBLOCK if he requested data can't be read without
+ * waiting.  This can be returned only if the installed page_read callback

typo: "the" requested data.

Fixed.

Other than that it all looks good to me!

Thanks!

The other thing I need to change is that I should turn on
recovery_prefetch for platforms that support it (ie Linux and maybe
NetBSD only for now), in the tests. Right now you need to put
recovery_prefetch=on in a file and then run the tests with
"TEMP_CONFIG=path_to_that make -C src/test/recovery check" to
exercise much of 0002.

+1 with Andres' idea to have a "try" setting.

Done. The default is still "off" for now, but in
027_stream_regress.pl I set it to "try".
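
For clarity, the intended "try" behaviour is roughly this (a sketch, not the
actual GUC code; the enum and helper are illustrative):

typedef enum
{
	RECOVERY_PREFETCH_OFF,
	RECOVERY_PREFETCH_ON,
	RECOVERY_PREFETCH_TRY
} RecoveryPrefetchSetting;

static bool
recovery_prefetch_effective(RecoveryPrefetchSetting setting)
{
#ifdef USE_PREFETCH
	/* Platform has posix_fadvise(POSIX_FADV_WILLNEED) or equivalent. */
	return setting != RECOVERY_PREFETCH_OFF;
#else
	/* "try" silently degrades to off; plain "on" could warn or error instead. */
	return false;
#endif
}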

I also fixed the compile failure with -DWAL_DEBUG, and checked that
output looks sane with wal_debug=on.

Attachments:

v23-0001-Add-circular-WAL-decoding-buffer-take-II.patch (text/x-patch; charset=US-ASCII)
From eb7c00315df09a11b3af3ae2bde8dc8714df6384 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 9 Nov 2021 16:33:10 +1300
Subject: [PATCH v23 1/2] Add circular WAL decoding buffer, take II.

Teach xlogreader.c to decode its output into a circular buffer, to
support upcoming optimizations based on looking ahead.

 * XLogReadRecord() works as before, consuming records one by one, and
   allowing them to be examined via the traditional XLogRecGetXXX()
   macros, and the traditional members like xlogreader->ReadRecPtr.

 * An alternative new interface XLogReadAhead()/XLogNextRecord() is
   added that returns pointers to DecodedXLogRecord
   objects so that it's possible to look ahead in the WAL stream.

 * In order to be able to use the new interface effectively, client
   code should provide a page_read() callback that responds to
   a new nonblocking mode by returning XLREAD_WOULDBLOCK to avoid
   waiting.  No such implementation is included in this commit,
   and other code that is unaware of the new mechanism doesn't need
   to change.

The buffer's size can be set by the client of xlogreader.c.  Large
records that don't fit in the circular buffer are called "oversized" and
allocated separately with palloc().

Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
---
 src/backend/access/transam/generic_xlog.c |   6 +-
 src/backend/access/transam/xlog.c         |  18 +-
 src/backend/access/transam/xlogreader.c   | 657 +++++++++++++++++-----
 src/backend/access/transam/xlogrecovery.c |   4 +-
 src/backend/access/transam/xlogutils.c    |   2 +-
 src/backend/replication/logical/decode.c  |   2 +-
 src/bin/pg_rewind/parsexlog.c             |   2 +-
 src/bin/pg_waldump/pg_waldump.c           |  25 +-
 src/include/access/xlogreader.h           | 153 ++++-
 src/tools/pgindent/typedefs.list          |   2 +
 10 files changed, 691 insertions(+), 180 deletions(-)

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 4b0c63817f..bbb542b322 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -482,10 +482,10 @@ generic_redo(XLogReaderState *record)
 	uint8		block_id;
 
 	/* Protect limited size of buffers[] array */
-	Assert(record->max_block_id < MAX_GENERIC_XLOG_PAGES);
+	Assert(XLogRecMaxBlockId(record) < MAX_GENERIC_XLOG_PAGES);
 
 	/* Iterate over blocks */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		XLogRedoAction action;
 
@@ -525,7 +525,7 @@ generic_redo(XLogReaderState *record)
 	}
 
 	/* Changes are done: unlock and release all buffers */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (BufferIsValid(buffers[block_id]))
 			UnlockReleaseBuffer(buffers[block_id]);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a357..6430b7b0dd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -970,6 +970,8 @@ XLogInsertRecord(XLogRecData *rdata,
 	if (XLOG_DEBUG)
 	{
 		static XLogReaderState *debug_reader = NULL;
+		XLogRecord *record;
+		DecodedXLogRecord *decoded;
 		StringInfoData buf;
 		StringInfoData recordBuf;
 		char	   *errormsg = NULL;
@@ -989,6 +991,11 @@ XLogInsertRecord(XLogRecData *rdata,
 		for (; rdata != NULL; rdata = rdata->next)
 			appendBinaryStringInfo(&recordBuf, rdata->data, rdata->len);
 
+		/* We also need temporary space to decode the record. */
+		record = (XLogRecord *) recordBuf.data;
+		decoded = (DecodedXLogRecord *)
+			palloc(DecodeXLogRecordRequiredSpace(record->xl_tot_len));
+
 		if (!debug_reader)
 			debug_reader = XLogReaderAllocate(wal_segment_size, NULL,
 											  XL_ROUTINE(), NULL);
@@ -997,7 +1004,10 @@ XLogInsertRecord(XLogRecData *rdata,
 		{
 			appendStringInfoString(&buf, "error decoding record: out of memory while allocating a WAL reading processor");
 		}
-		else if (!DecodeXLogRecord(debug_reader, (XLogRecord *) recordBuf.data,
+		else if (!DecodeXLogRecord(debug_reader,
+								   decoded,
+								   record,
+								   EndPos,
 								   &errormsg))
 		{
 			appendStringInfo(&buf, "error decoding record: %s",
@@ -1006,10 +1016,14 @@ XLogInsertRecord(XLogRecData *rdata,
 		else
 		{
 			appendStringInfoString(&buf, " - ");
+
+			debug_reader->record = decoded;
 			xlog_outdesc(&buf, debug_reader);
+			debug_reader->record = NULL;
 		}
 		elog(LOG, "%s", buf.data);
 
+		pfree(decoded);
 		pfree(buf.data);
 		pfree(recordBuf.data);
 		MemoryContextSwitchTo(oldCxt);
@@ -7736,7 +7750,7 @@ xlog_redo(XLogReaderState *record)
 		 * resource manager needs to generate conflicts, it has to define a
 		 * separate WAL record type and redo routine.
 		 */
-		for (uint8 block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (uint8 block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			Buffer		buffer;
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index b7c06da255..36fbcfa326 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -45,6 +45,7 @@ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
+static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool non_blocking);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -56,6 +57,12 @@ static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
 /* size of the buffer allocated for error message. */
 #define MAX_ERRORMSG_LEN 1000
 
+/*
+ * Default size; large enough that typical users of XLogReader won't often need
+ * to use the 'oversized' memory allocation code path.
+ */
+#define DEFAULT_DECODE_BUFFER_SIZE (64 * 1024)
+
 /*
  * Construct a string in state->errormsg_buf explaining what's wrong with
  * the current record being read.
@@ -70,6 +77,24 @@ report_invalid_record(XLogReaderState *state, const char *fmt,...)
 	va_start(args, fmt);
 	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
 	va_end(args);
+
+	state->errormsg_deferred = true;
+}
+
+/*
+ * Set the size of the decoding buffer.  A pointer to a caller supplied memory
+ * region may also be passed in, in which case non-oversized records will be
+ * decoded there.
+ */
+void
+XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size)
+{
+	Assert(state->decode_buffer == NULL);
+
+	state->decode_buffer = buffer;
+	state->decode_buffer_size = size;
+	state->decode_buffer_tail = buffer;
+	state->decode_buffer_head = buffer;
 }
 
 /*
@@ -92,8 +117,6 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 	/* initialize caller-provided support functions */
 	state->routine = *routine;
 
-	state->max_block_id = -1;
-
 	/*
 	 * Permanently allocate readBuf.  We do it this way, rather than just
 	 * making a static array, for two reasons: (1) no need to waste the
@@ -144,18 +167,11 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 void
 XLogReaderFree(XLogReaderState *state)
 {
-	int			block_id;
-
 	if (state->seg.ws_file != -1)
 		state->routine.segment_close(state);
 
-	for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
-	{
-		if (state->blocks[block_id].data)
-			pfree(state->blocks[block_id].data);
-	}
-	if (state->main_data)
-		pfree(state->main_data);
+	if (state->decode_buffer && state->free_decode_buffer)
+		pfree(state->decode_buffer);
 
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
@@ -251,7 +267,133 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
 
 	/* Begin at the passed-in record pointer. */
 	state->EndRecPtr = RecPtr;
+	state->NextRecPtr = RecPtr;
 	state->ReadRecPtr = InvalidXLogRecPtr;
+	state->DecodeRecPtr = InvalidXLogRecPtr;
+}
+
+/*
+ * See if we can release the last record that was returned by
+ * XLogNextRecord(), if any, to free up space.
+ */
+void
+XLogReleasePreviousRecord(XLogReaderState *state)
+{
+	DecodedXLogRecord *record;
+
+	if (!state->record)
+		return;
+
+	/*
+	 * Remove it from the decoded record queue.  It must be the oldest item
+	 * decoded, decode_queue_head.
+	 */
+	record = state->record;
+	Assert(record == state->decode_queue_head);
+	state->record = NULL;
+	state->decode_queue_head = record->next;
+
+	/* It might also be the newest item decoded, decode_queue_tail. */
+	if (state->decode_queue_tail == record)
+		state->decode_queue_tail = NULL;
+
+	/* Release the space. */
+	if (unlikely(record->oversized))
+	{
+		/* It's not in the decode buffer, so free it to release space. */
+		pfree(record);
+	}
+	else
+	{
+		/* It must be the head (oldest) record in the decode buffer. */
+		Assert(state->decode_buffer_head == (char *) record);
+
+		/*
+		 * We need to update head to point to the next record that is in the
+		 * decode buffer, if any, being careful to skip oversized ones
+		 * (they're not in the decode buffer).
+		 */
+		record = record->next;
+		while (unlikely(record && record->oversized))
+			record = record->next;
+
+		if (record)
+		{
+			/* Adjust head to release space up to the next record. */
+			state->decode_buffer_head = (char *) record;
+		}
+		else
+		{
+			/*
+			 * Otherwise we might as well just reset head and tail to the
+			 * start of the buffer space, because we're empty.  This means
+			 * we'll keep overwriting the same piece of memory if we're not
+			 * doing any prefetching.
+			 */
+			state->decode_buffer_head = state->decode_buffer;
+			state->decode_buffer_tail = state->decode_buffer;
+		}
+	}
+}
+
+/*
+ * Attempt to read an XLOG record.
+ *
+ * XLogBeginRead() or XLogFindNextRecord() and then XLogReadAhead() must be
+ * called before the first call to XLogNextRecord().  This function returns
+ * records and errors that were put into an internal queue by XLogReadAhead().
+ *
+ * On success, a record is returned.
+ *
+ * The returned record (or *errormsg) points to an internal buffer that's
+ * valid until the next call to XLogNextRecord.
+ */
+DecodedXLogRecord *
+XLogNextRecord(XLogReaderState *state, char **errormsg)
+{
+	/* Release the last record returned by XLogNextRecord(). */
+	XLogReleasePreviousRecord(state);
+
+	if (state->decode_queue_head == NULL)
+	{
+		*errormsg = NULL;
+		if (state->errormsg_deferred)
+		{
+			if (state->errormsg_buf[0] != '\0')
+				*errormsg = state->errormsg_buf;
+			state->errormsg_deferred = false;
+		}
+
+		/*
+		 * state->EndRecPtr is expected to have been set by the last call to
+		 * XLogBeginRead() or XLogNextRecord(), and is the location of the
+		 * error.
+		 */
+		Assert(!XLogRecPtrIsInvalid(state->EndRecPtr));
+
+		return NULL;
+	}
+
+	/*
+	 * Record this as the most recent record returned, so that we'll release
+	 * it next time.  This also exposes it to the traditional
+	 * XLogRecXXX(xlogreader) macros, which work with the decoder rather than
+	 * the record for historical reasons.
+	 */
+	state->record = state->decode_queue_head;
+
+	/*
+	 * Update the pointers to the beginning and one-past-the-end of this
+	 * record, again for the benefit of historical code that expected the
+	 * decoder to track this rather than accessing these fields of the record
+	 * itself.
+	 */
+	state->ReadRecPtr = state->record->lsn;
+	state->EndRecPtr = state->record->next_lsn;
+
+	*errormsg = NULL;
+
+	return state->record;
 }
 
 /*
@@ -261,17 +403,132 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
  * to XLogReadRecord().
  *
  * If the page_read callback fails to read the requested data, NULL is
- * returned.  The callback is expected to have reported the error; errormsg
- * is set to NULL.
+ * returned.  The callback is expected to have reported the error; errormsg is
+ * set to NULL.
  *
  * If the reading fails for some other reason, NULL is also returned, and
  * *errormsg is set to a string with details of the failure.
  *
- * The returned pointer (or *errormsg) points to an internal buffer that's
- * valid until the next call to XLogReadRecord.
+ * On success, a record is returned.
+ *
+ * The returned record (or *errormsg) points to an internal buffer that's
+ * valid until the next call to XLogReadRecord.
  */
 XLogRecord *
 XLogReadRecord(XLogReaderState *state, char **errormsg)
+{
+	DecodedXLogRecord *decoded;
+
+	/*
+	 * Release last returned record, if there is one.  We need to do this so
+	 * that we can check for empty decode queue accurately.
+	 */
+	XLogReleasePreviousRecord(state);
+
+	/*
+	 * Call XLogReadAhead() in blocking mode to make sure there is something
+	 * in the queue, though we don't use the result.
+	 */
+	if (!XLogReaderHasQueuedRecordOrError(state))
+		XLogReadAhead(state, false /* nonblocking */ );
+
+	/* Consume the head record or error. */
+	decoded = XLogNextRecord(state, errormsg);
+	if (decoded)
+	{
+		/*
+		 * This function returns a pointer to the record's header, not the
+		 * actual decoded record.  The caller will access the decoded record
+		 * through the XLogRecGetXXX() macros, which reach the decoded
+		 * record via xlogreader->record.
+		 */
+		Assert(state->record == decoded);
+		return &decoded->header;
+	}
+
+	return NULL;
+}
+
+/*
+ * Allocate space for a decoded record.  The only member of the returned
+ * object that is initialized is the 'oversized' flag, indicating that the
+ * decoded record wouldn't fit in the decode buffer and must eventually be
+ * freed explicitly.
+ *
+ * The caller is responsible for adjusting decode_buffer_tail with the real
+ * size after successfully decoding a record into this space.  This way, if
+ * decoding fails, then there is nothing to undo unless the 'oversized' flag
+ * was set and pfree() must be called.
+ *
+ * Return NULL if there is no space in the decode buffer and allow_oversized
+ * is false, or if memory allocation fails for an oversized buffer.
+ */
+static DecodedXLogRecord *
+XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)
+{
+	size_t		required_space = DecodeXLogRecordRequiredSpace(xl_tot_len);
+	DecodedXLogRecord *decoded = NULL;
+
+	/* Allocate a circular decode buffer if we don't have one already. */
+	if (unlikely(state->decode_buffer == NULL))
+	{
+		if (state->decode_buffer_size == 0)
+			state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+		state->decode_buffer = palloc(state->decode_buffer_size);
+		state->decode_buffer_head = state->decode_buffer;
+		state->decode_buffer_tail = state->decode_buffer;
+		state->free_decode_buffer = true;
+	}
+
+	/* Try to allocate space in the circular decode buffer. */
+	if (state->decode_buffer_tail >= state->decode_buffer_head)
+	{
+		/* Empty, or tail is to the right of head. */
+		if (state->decode_buffer_tail + required_space <=
+			state->decode_buffer + state->decode_buffer_size)
+		{
+			/* There is space between tail and end. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_tail;
+			decoded->oversized = false;
+			return decoded;
+		}
+		else if (state->decode_buffer + required_space <
+				 state->decode_buffer_head)
+		{
+			/* There is space between start and head. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+	else
+	{
+		/* Tail is to the left of head. */
+		if (state->decode_buffer_tail + required_space <
+			state->decode_buffer_head)
+		{
+			/* There is space between tail and head. */
+			decoded = (DecodedXLogRecord *) state->decode_buffer_tail;
+			decoded->oversized = false;
+			return decoded;
+		}
+	}
+
+	/* Not enough space in the decode buffer.  Are we allowed to allocate? */
+	if (allow_oversized)
+	{
+		decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
+		if (decoded == NULL)
+			return NULL;
+		decoded->oversized = true;
+		return decoded;
+	}
+
+	return NULL;
+}
+
+static XLogPageReadResult
+XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 {
 	XLogRecPtr	RecPtr;
 	XLogRecord *record;
@@ -284,6 +541,8 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	bool		assembled;
 	bool		gotheader;
 	int			readOff;
+	DecodedXLogRecord *decoded;
+	char	   *errormsg;		/* not used */
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -293,21 +552,20 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	randAccess = false;
 
 	/* reset error state */
-	*errormsg = NULL;
 	state->errormsg_buf[0] = '\0';
+	decoded = NULL;
 
-	ResetDecoder(state);
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
 
-	RecPtr = state->EndRecPtr;
+	RecPtr = state->NextRecPtr;
 
-	if (state->ReadRecPtr != InvalidXLogRecPtr)
+	if (state->DecodeRecPtr != InvalidXLogRecPtr)
 	{
 		/* read the record after the one we just read */
 
 		/*
-		 * EndRecPtr is pointing to end+1 of the previous WAL record.  If
+		 * NextRecPtr is pointing to end+1 of the previous WAL record.  If
 		 * we're at a page boundary, no more records can fit on the current
 		 * page. We must skip over the page header, but we can't do that until
 		 * we've read in the page, since the header size is variable.
@@ -318,7 +576,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		/*
 		 * Caller supplied a position to start at.
 		 *
-		 * In this case, EndRecPtr should already be pointing to a valid
+		 * In this case, NextRecPtr should already be pointing to a valid
 		 * record starting position.
 		 */
 		Assert(XRecOffIsValid(RecPtr));
@@ -326,6 +584,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	}
 
 restart:
+	state->nonblocking = nonblocking;
 	state->currRecPtr = RecPtr;
 	assembled = false;
 
@@ -339,7 +598,9 @@ restart:
 	 */
 	readOff = ReadPageInternal(state, targetPagePtr,
 							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
-	if (readOff < 0)
+	if (readOff == XLREAD_WOULDBLOCK)
+		return XLREAD_WOULDBLOCK;
+	else if (readOff < 0)
 		goto err;
 
 	/*
@@ -395,7 +656,7 @@ restart:
 	 */
 	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
 	{
-		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
+		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
 								   randAccess))
 			goto err;
 		gotheader = true;
@@ -414,6 +675,31 @@ restart:
 		gotheader = false;
 	}
 
+	/*
+	 * Find space to decode this record.  Don't allow oversized allocation if
+	 * the caller requested nonblocking.  Otherwise, we *have* to try to
+	 * decode the record now because the caller has nothing else to do, so
+	 * allow an oversized record to be palloc'd if that turns out to be
+	 * necessary.
+	 */
+	decoded = XLogReadRecordAlloc(state,
+								  total_len,
+								  !nonblocking /* allow_oversized */ );
+	if (decoded == NULL)
+	{
+		/*
+		 * There is no space in the decode buffer.  The caller should help
+		 * with that problem by consuming some records.
+		 */
+		if (nonblocking)
+			return XLREAD_WOULDBLOCK;
+
+		/* We failed to allocate memory for an  oversized record. */
+		report_invalid_record(state,
+							  "out of memory while trying to decode a record of length %u", total_len);
+		goto err;
+	}
+
 	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
 	if (total_len > len)
 	{
@@ -453,7 +739,9 @@ restart:
 									   Min(total_len - gotlen + SizeOfXLogShortPHD,
 										   XLOG_BLCKSZ));
 
-			if (readOff < 0)
+			if (readOff == XLREAD_WOULDBLOCK)
+				return XLREAD_WOULDBLOCK;
+			else if (readOff < 0)
 				goto err;
 
 			Assert(SizeOfXLogShortPHD <= readOff);
@@ -471,7 +759,6 @@ restart:
 			if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD)
 			{
 				state->overwrittenRecPtr = RecPtr;
-				ResetDecoder(state);
 				RecPtr = targetPagePtr;
 				goto restart;
 			}
@@ -526,7 +813,7 @@ restart:
 			if (!gotheader)
 			{
 				record = (XLogRecord *) state->readRecordBuf;
-				if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
+				if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr,
 										   record, randAccess))
 					goto err;
 				gotheader = true;
@@ -540,8 +827,8 @@ restart:
 			goto err;
 
 		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
-		state->ReadRecPtr = RecPtr;
-		state->EndRecPtr = targetPagePtr + pageHeaderSize
+		state->DecodeRecPtr = RecPtr;
+		state->NextRecPtr = targetPagePtr + pageHeaderSize
 			+ MAXALIGN(pageHeader->xlp_rem_len);
 	}
 	else
@@ -549,16 +836,18 @@ restart:
 		/* Wait for the record data to become available */
 		readOff = ReadPageInternal(state, targetPagePtr,
 								   Min(targetRecOff + total_len, XLOG_BLCKSZ));
-		if (readOff < 0)
+		if (readOff == XLREAD_WOULDBLOCK)
+			return XLREAD_WOULDBLOCK;
+		else if (readOff < 0)
 			goto err;
 
 		/* Record does not cross a page boundary */
 		if (!ValidXLogRecord(state, record, RecPtr))
 			goto err;
 
-		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+		state->NextRecPtr = RecPtr + MAXALIGN(total_len);
 
-		state->ReadRecPtr = RecPtr;
+		state->DecodeRecPtr = RecPtr;
 	}
 
 	/*
@@ -568,14 +857,40 @@ restart:
 		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		state->EndRecPtr += state->segcxt.ws_segsize - 1;
-		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->segcxt.ws_segsize);
+		state->NextRecPtr += state->segcxt.ws_segsize - 1;
+		state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
 	}
 
-	if (DecodeXLogRecord(state, record, errormsg))
-		return record;
+	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
+	{
+		/* Record the location of the next record. */
+		decoded->next_lsn = state->NextRecPtr;
+
+		/*
+		 * If it's in the decode buffer, mark the decode buffer space as
+		 * occupied.
+		 */
+		if (!decoded->oversized)
+		{
+			/* The new decode buffer head must be MAXALIGNed. */
+			Assert(decoded->size == MAXALIGN(decoded->size));
+			if ((char *) decoded == state->decode_buffer)
+				state->decode_buffer_tail = state->decode_buffer + decoded->size;
+			else
+				state->decode_buffer_tail += decoded->size;
+		}
+
+		/* Insert it into the queue of decoded records. */
+		Assert(state->decode_queue_tail != decoded);
+		if (state->decode_queue_tail)
+			state->decode_queue_tail->next = decoded;
+		state->decode_queue_tail = decoded;
+		if (!state->decode_queue_head)
+			state->decode_queue_head = decoded;
+		return XLREAD_SUCCESS;
+	}
 	else
-		return NULL;
+		return XLREAD_FAIL;
 
 err:
 	if (assembled)
@@ -593,14 +908,46 @@ err:
 		state->missingContrecPtr = targetPagePtr;
 	}
 
+	if (decoded && decoded->oversized)
+		pfree(decoded);
+
 	/*
 	 * Invalidate the read state. We might read from a different source after
 	 * failure.
 	 */
 	XLogReaderInvalReadState(state);
 
-	if (state->errormsg_buf[0] != '\0')
-		*errormsg = state->errormsg_buf;
+	/*
+	 * If an error was written to errormsg_buf, it'll be returned to the caller
+	 * of XLogReadRecord() after all successfully decoded records from the
+	 * read queue.
+	 */
+
+	return XLREAD_FAIL;
+}
+
+/*
+ * Try to decode the next available record, and return it.  The record will
+ * also be returned to XLogNextRecord(), which must be called to 'consume'
+ * each record.
+ *
+ * If nonblocking is true, may return NULL due to lack of data or WAL decoding
+ * space.
+ */
+DecodedXLogRecord *
+XLogReadAhead(XLogReaderState *state, bool nonblocking)
+{
+	XLogPageReadResult result;
+
+	if (state->errormsg_deferred)
+		return NULL;
+
+	result = XLogDecodeNextRecord(state, nonblocking);
+	if (result == XLREAD_SUCCESS)
+	{
+		Assert(state->decode_queue_tail != NULL);
+		return state->decode_queue_tail;
+	}
 
 	return NULL;
 }
@@ -609,8 +956,14 @@ err:
  * Read a single xlog page including at least [pageptr, reqLen] of valid data
  * via the page_read() callback.
  *
- * Returns -1 if the required page cannot be read for some reason; errormsg_buf
- * is set in that case (unless the error occurs in the page_read callback).
+ * Returns XLREAD_FAIL if the required page cannot be read for some
+ * reason; errormsg_buf is set in that case (unless the error occurs in the
+ * page_read callback).
+ *
+ * Returns XLREAD_WOULDBLOCK if the requested data can't be read without
+ * waiting.  This can be returned only if the installed page_read callback
+ * respects the state->nonblocking flag, and cannot read the requested data
+ * immediately.
  *
  * We fetch the page from a reader-local cache if we know we have the required
  * data and if there hasn't been any error since caching the data.
@@ -652,7 +1005,9 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
 										   state->currRecPtr,
 										   state->readBuf);
-		if (readLen < 0)
+		if (readLen == XLREAD_WOULDBLOCK)
+			return XLREAD_WOULDBLOCK;
+		else if (readLen < 0)
 			goto err;
 
 		/* we can be sure to have enough WAL available, we scrolled back */
@@ -670,7 +1025,9 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
 									   state->currRecPtr,
 									   state->readBuf);
-	if (readLen < 0)
+	if (readLen == XLREAD_WOULDBLOCK)
+		return XLREAD_WOULDBLOCK;
+	else if (readLen < 0)
 		goto err;
 
 	Assert(readLen <= XLOG_BLCKSZ);
@@ -689,7 +1046,9 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
 										   state->currRecPtr,
 										   state->readBuf);
-		if (readLen < 0)
+		if (readLen == XLREAD_WOULDBLOCK)
+			return XLREAD_WOULDBLOCK;
+		else if (readLen < 0)
 			goto err;
 	}
 
@@ -707,8 +1066,12 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	return readLen;
 
 err:
-	XLogReaderInvalReadState(state);
-	return -1;
+	if (state->errormsg_buf[0] != '\0')
+	{
+		state->errormsg_deferred = true;
+		XLogReaderInvalReadState(state);
+	}
+	return XLREAD_FAIL;
 }
 
 /*
@@ -987,6 +1350,9 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 
 	Assert(!XLogRecPtrIsInvalid(RecPtr));
 
+	/* Make sure ReadPageInternal() can't return XLREAD_WOULDBLOCK. */
+	state->nonblocking = false;
+
 	/*
 	 * skip over potential continuation data, keeping in mind that it may span
 	 * multiple pages
@@ -1065,7 +1431,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	 * or we just jumped over the remaining data of a continuation.
 	 */
 	XLogBeginRead(state, tmpRecPtr);
-	while (XLogReadRecord(state, &errormsg) != NULL)
+	while (XLogReadRecord(state, &errormsg))
 	{
 		/* past the record we've found, break out */
 		if (RecPtr <= state->ReadRecPtr)
@@ -1187,34 +1553,83 @@ WALRead(XLogReaderState *state,
  * ----------------------------------------
  */
 
-/* private function to reset the state between records */
+/*
+ * Private function to reset the state, forgetting all decoded records, if we
+ * are asked to move to a new read position.
+ */
 static void
 ResetDecoder(XLogReaderState *state)
 {
-	int			block_id;
-
-	state->decoded_record = NULL;
-
-	state->main_data_len = 0;
+	DecodedXLogRecord *r;
 
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	/* Reset the decoded record queue, freeing any oversized records. */
+	while ((r = state->decode_queue_head) != NULL)
 	{
-		state->blocks[block_id].in_use = false;
-		state->blocks[block_id].has_image = false;
-		state->blocks[block_id].has_data = false;
-		state->blocks[block_id].apply_image = false;
+		state->decode_queue_head = r->next;
+		if (r->oversized)
+			pfree(r);
 	}
-	state->max_block_id = -1;
+	state->decode_queue_tail = NULL;
+	state->decode_queue_head = NULL;
+	state->record = NULL;
+
+	/* Reset the decode buffer to empty. */
+	state->decode_buffer_tail = state->decode_buffer;
+	state->decode_buffer_head = state->decode_buffer;
+
+	/* Clear error state. */
+	state->errormsg_buf[0] = '\0';
+	state->errormsg_deferred = false;
 }
 
 /*
- * Decode the previously read record.
+ * Compute the maximum possible amount of padding that could be required to
+ * decode a record, given xl_tot_len from the record's header.  This is the
+ * amount of output buffer space that we need to decode a record, though we
+ * might not finish up using it all.
+ *
+ * This computation is pessimistic and assumes the maximum possible number of
+ * blocks, due to lack of better information.
+ */
+size_t
+DecodeXLogRecordRequiredSpace(size_t xl_tot_len)
+{
+	size_t		size = 0;
+
+	/* Account for the fixed size part of the decoded record struct. */
+	size += offsetof(DecodedXLogRecord, blocks[0]);
+	/* Account for the flexible blocks array of maximum possible size. */
+	size += sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1);
+	/* Account for all the raw main and block data. */
+	size += xl_tot_len;
+	/* We might insert padding before main_data. */
+	size += (MAXIMUM_ALIGNOF - 1);
+	/* We might insert padding before each block's data. */
+	size += (MAXIMUM_ALIGNOF - 1) * (XLR_MAX_BLOCK_ID + 1);
+	/* We might insert padding at the end. */
+	size += (MAXIMUM_ALIGNOF - 1);
+
+	return size;
+}
+
+/*
+ * Decode a record.  "decoded" must point to a MAXALIGNed memory area that has
+ * space for at least DecodeXLogRecordRequiredSpace(record) bytes.  On
+ * success, decoded->size contains the actual space occupied by the decoded
+ * record, which may turn out to be less.
+ *
+ * Only the decoded->oversized member must be initialized already, and will not be
+ * modified.  Other members will be initialized as required.
  *
  * On error, a human-readable error message is returned in *errormsg, and
  * the return value is false.
  */
 bool
-DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
+DecodeXLogRecord(XLogReaderState *state,
+				 DecodedXLogRecord *decoded,
+				 XLogRecord *record,
+				 XLogRecPtr lsn,
+				 char **errormsg)
 {
 	/*
 	 * read next _size bytes from record buffer, but check for overrun first.
@@ -1229,17 +1644,20 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	} while(0)
 
 	char	   *ptr;
+	char	   *out;
 	uint32		remaining;
 	uint32		datatotal;
 	RelFileNode *rnode = NULL;
 	uint8		block_id;
 
-	ResetDecoder(state);
-
-	state->decoded_record = record;
-	state->record_origin = InvalidRepOriginId;
-	state->toplevel_xid = InvalidTransactionId;
-
+	decoded->header = *record;
+	decoded->lsn = lsn;
+	decoded->next = NULL;
+	decoded->record_origin = InvalidRepOriginId;
+	decoded->toplevel_xid = InvalidTransactionId;
+	decoded->main_data = NULL;
+	decoded->main_data_len = 0;
+	decoded->max_block_id = -1;
 	ptr = (char *) record;
 	ptr += SizeOfXLogRecord;
 	remaining = record->xl_tot_len - SizeOfXLogRecord;
@@ -1257,7 +1675,7 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint8));
 
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
@@ -1268,18 +1686,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			uint32		main_data_len;
 
 			COPY_HEADER_FIELD(&main_data_len, sizeof(uint32));
-			state->main_data_len = main_data_len;
+			decoded->main_data_len = main_data_len;
 			datatotal += main_data_len;
 			break;				/* by convention, the main data fragment is
 								 * always last */
 		}
 		else if (block_id == XLR_BLOCK_ID_ORIGIN)
 		{
-			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
+			COPY_HEADER_FIELD(&decoded->record_origin, sizeof(RepOriginId));
 		}
 		else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
 		{
-			COPY_HEADER_FIELD(&state->toplevel_xid, sizeof(TransactionId));
+			COPY_HEADER_FIELD(&decoded->toplevel_xid, sizeof(TransactionId));
 		}
 		else if (block_id <= XLR_MAX_BLOCK_ID)
 		{
@@ -1287,7 +1705,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			DecodedBkpBlock *blk;
 			uint8		fork_flags;
 
-			if (block_id <= state->max_block_id)
+			/* mark any intervening block IDs as not in use */
+			for (int i = decoded->max_block_id + 1; i < block_id; ++i)
+				decoded->blocks[i].in_use = false;
+
+			if (block_id <= decoded->max_block_id)
 			{
 				report_invalid_record(state,
 									  "out-of-order block_id %u at %X/%X",
@@ -1295,9 +1717,9 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 									  LSN_FORMAT_ARGS(state->ReadRecPtr));
 				goto err;
 			}
-			state->max_block_id = block_id;
+			decoded->max_block_id = block_id;
 
-			blk = &state->blocks[block_id];
+			blk = &decoded->blocks[block_id];
 			blk->in_use = true;
 			blk->apply_image = false;
 
@@ -1440,17 +1862,18 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 	/*
 	 * Ok, we've parsed the fragment headers, and verified that the total
 	 * length of the payload in the fragments is equal to the amount of data
-	 * left. Copy the data of each fragment to a separate buffer.
-	 *
-	 * We could just set up pointers into readRecordBuf, but we want to align
-	 * the data for the convenience of the callers. Backup images are not
-	 * copied, however; they don't need alignment.
+	 * left.  Copy the data of each fragment to contiguous space after the
+	 * blocks array, inserting alignment padding before the data fragments so
+	 * they can be cast to struct pointers by REDO routines.
 	 */
+	out = ((char *) decoded) +
+		offsetof(DecodedXLogRecord, blocks) +
+		sizeof(decoded->blocks[0]) * (decoded->max_block_id + 1);
 
 	/* block data first */
-	for (block_id = 0; block_id <= state->max_block_id; block_id++)
+	for (block_id = 0; block_id <= decoded->max_block_id; block_id++)
 	{
-		DecodedBkpBlock *blk = &state->blocks[block_id];
+		DecodedBkpBlock *blk = &decoded->blocks[block_id];
 
 		if (!blk->in_use)
 			continue;
@@ -1459,58 +1882,37 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 
 		if (blk->has_image)
 		{
-			blk->bkp_image = ptr;
+			/* no need to align image */
+			blk->bkp_image = out;
+			memcpy(out, ptr, blk->bimg_len);
 			ptr += blk->bimg_len;
+			out += blk->bimg_len;
 		}
 		if (blk->has_data)
 		{
-			if (!blk->data || blk->data_len > blk->data_bufsz)
-			{
-				if (blk->data)
-					pfree(blk->data);
-
-				/*
-				 * Force the initial request to be BLCKSZ so that we don't
-				 * waste time with lots of trips through this stanza as a
-				 * result of WAL compression.
-				 */
-				blk->data_bufsz = MAXALIGN(Max(blk->data_len, BLCKSZ));
-				blk->data = palloc(blk->data_bufsz);
-			}
+			out = (char *) MAXALIGN(out);
+			blk->data = out;
 			memcpy(blk->data, ptr, blk->data_len);
 			ptr += blk->data_len;
+			out += blk->data_len;
 		}
 	}
 
 	/* and finally, the main data */
-	if (state->main_data_len > 0)
+	if (decoded->main_data_len > 0)
 	{
-		if (!state->main_data || state->main_data_len > state->main_data_bufsz)
-		{
-			if (state->main_data)
-				pfree(state->main_data);
-
-			/*
-			 * main_data_bufsz must be MAXALIGN'ed.  In many xlog record
-			 * types, we omit trailing struct padding on-disk to save a few
-			 * bytes; but compilers may generate accesses to the xlog struct
-			 * that assume that padding bytes are present.  If the palloc
-			 * request is not large enough to include such padding bytes then
-			 * we'll get valgrind complaints due to otherwise-harmless fetches
-			 * of the padding bytes.
-			 *
-			 * In addition, force the initial request to be reasonably large
-			 * so that we don't waste time with lots of trips through this
-			 * stanza.  BLCKSZ / 2 seems like a good compromise choice.
-			 */
-			state->main_data_bufsz = MAXALIGN(Max(state->main_data_len,
-												  BLCKSZ / 2));
-			state->main_data = palloc(state->main_data_bufsz);
-		}
-		memcpy(state->main_data, ptr, state->main_data_len);
-		ptr += state->main_data_len;
+		out = (char *) MAXALIGN(out);
+		decoded->main_data = out;
+		memcpy(decoded->main_data, ptr, decoded->main_data_len);
+		ptr += decoded->main_data_len;
+		out += decoded->main_data_len;
 	}
 
+	/* Report the actual size we used. */
+	decoded->size = MAXALIGN(out - (char *) decoded);
+	Assert(DecodeXLogRecordRequiredSpace(record->xl_tot_len) >=
+		   decoded->size);
+
 	return true;
 
 shortdata_err:
@@ -1536,10 +1938,11 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	if (rnode)
 		*rnode = bkpb->rnode;
 	if (forknum)
@@ -1559,10 +1962,11 @@ XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
 {
 	DecodedBkpBlock *bkpb;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return NULL;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 
 	if (!bkpb->has_data)
 	{
@@ -1590,12 +1994,13 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 	char	   *ptr;
 	PGAlignedBlock tmp;
 
-	if (!record->blocks[block_id].in_use)
+	if (block_id > record->record->max_block_id ||
+		!record->record->blocks[block_id].in_use)
 		return false;
-	if (!record->blocks[block_id].has_image)
+	if (!record->record->blocks[block_id].has_image)
 		return false;
 
-	bkpb = &record->blocks[block_id];
+	bkpb = &record->record->blocks[block_id];
 	ptr = bkpb->bkp_image;
 
 	if (BKPIMAGE_COMPRESSED(bkpb->bimg_info))
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..9feea3e6ec 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2139,7 +2139,7 @@ xlog_block_info(StringInfo buf, XLogReaderState *record)
 	int			block_id;
 
 	/* decode block references */
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
@@ -2271,7 +2271,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
 
 	Assert((XLogRecGetInfo(record) & XLR_CHECK_CONSISTENCY) != 0);
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		Buffer		buf;
 		Page		page;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 54d5f20734..511f2f186f 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -370,7 +370,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * going to initialize it. And vice versa.
 	 */
 	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
-	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
+	willinit = (XLogRecGetBlock(record, block_id)->flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
 	if (!willinit && zeromode)
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 8c00a73cb9..77bc7aea7a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -111,7 +111,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 	{
 		ReorderBufferAssignChild(ctx->reorder,
 								 txid,
-								 record->decoded_record->xl_xid,
+								 XLogRecGetXid(record),
 								 buf.origptr);
 	}
 
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 56df08c64f..7cfa169e9b 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -432,7 +432,7 @@ extractPageInfo(XLogReaderState *record)
 				 RmgrNames[rmid], info);
 	}
 
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		RelFileNode rnode;
 		ForkNumber	forknum;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index f128050b4e..fc081adfb8 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -403,14 +403,13 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
 	 * Calculate the amount of FPI data in the record.
 	 *
 	 * XXX: We peek into xlogreader's private decoded backup blocks for the
-	 * bimg_len indicating the length of FPI data. It doesn't seem worth it to
-	 * add an accessor macro for this.
+	 * bimg_len indicating the length of FPI data.
 	 */
 	*fpi_len = 0;
-	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 	{
 		if (XLogRecHasBlockImage(record, block_id))
-			*fpi_len += record->blocks[block_id].bimg_len;
+			*fpi_len += XLogRecGetBlock(record, block_id)->bimg_len;
 	}
 
 	/*
@@ -508,7 +507,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	if (!config->bkp_details)
 	{
 		/* print block references (short format) */
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -539,7 +538,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	{
 		/* print block references (detailed format) */
 		putchar('\n');
-		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 		{
 			if (!XLogRecHasBlockRef(record, block_id))
 				continue;
@@ -552,7 +551,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 				   blk);
 			if (XLogRecHasBlockImage(record, block_id))
 			{
-				uint8		bimg_info = record->blocks[block_id].bimg_info;
+				uint8		bimg_info = XLogRecGetBlock(record, block_id)->bimg_info;
 
 				if (BKPIMAGE_COMPRESSED(bimg_info))
 				{
@@ -571,11 +570,11 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 						   "compression saved: %u, method: %s",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length,
+						   XLogRecGetBlock(record, block_id)->hole_offset,
+						   XLogRecGetBlock(record, block_id)->hole_length,
 						   BLCKSZ -
-						   record->blocks[block_id].hole_length -
-						   record->blocks[block_id].bimg_len,
+						   XLogRecGetBlock(record, block_id)->hole_length -
+						   XLogRecGetBlock(record, block_id)->bimg_len,
 						   method);
 				}
 				else
@@ -583,8 +582,8 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 					printf(" (FPW%s); hole: offset: %u, length: %u",
 						   XLogRecBlockImageApply(record, block_id) ?
 						   "" : " for WAL verification",
-						   record->blocks[block_id].hole_offset,
-						   record->blocks[block_id].hole_length);
+						   XLogRecGetBlock(record, block_id)->hole_offset,
+						   XLogRecGetBlock(record, block_id)->hole_length);
 				}
 			}
 			putchar('\n');
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..d1f364f4e8 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -144,6 +144,30 @@ typedef struct
 	uint16		data_bufsz;
 } DecodedBkpBlock;
 
+/*
+ * The decoded contents of a record.  This occupies a contiguous region of
+ * memory, with main_data and blocks[n].data pointing to memory after the
+ * members declared here.
+ */
+typedef struct DecodedXLogRecord
+{
+	/* Private member used for resource management. */
+	size_t		size;			/* total size of decoded record */
+	bool		oversized;		/* outside the regular decode buffer? */
+	struct DecodedXLogRecord *next; /* decoded record queue link */
+
+	/* Public members. */
+	XLogRecPtr	lsn;			/* location */
+	XLogRecPtr	next_lsn;		/* location of next record */
+	XLogRecord	header;			/* header */
+	RepOriginId record_origin;
+	TransactionId toplevel_xid; /* XID of top-level transaction */
+	char	   *main_data;		/* record's main data portion */
+	uint32		main_data_len;	/* main data portion's length */
+	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	DecodedBkpBlock blocks[FLEXIBLE_ARRAY_MEMBER];
+} DecodedXLogRecord;
+
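[Editorial aside, not part of the patch: as a rough sketch of the contiguous layout described in the comment above, the space a decoded record needs can be computed up front from the on-disk record length, along the lines of DecodeXLogRecordRequiredSpace() declared further down in this header. The helper name below is invented for illustration.]

static size_t
decoded_record_space_sketch(size_t xl_tot_len)
{
	size_t		size;

	/* The fixed members, followed by the full blocks[] array. */
	size = offsetof(DecodedXLogRecord, blocks[0]);
	size += sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1);

	/* Worst case for main_data and any per-block data that follows. */
	size += xl_tot_len;

	/* Assume alignment padding may be needed before each data chunk. */
	size += (MAXIMUM_ALIGNOF - 1) * (XLR_MAX_BLOCK_ID + 2);

	return size;
}
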
 struct XLogReaderState
 {
 	/*
@@ -171,6 +195,9 @@ struct XLogReaderState
 	 * Start and end point of last record read.  EndRecPtr is also used as the
 	 * position to read next.  Calling XLogBeginRead() sets EndRecPtr to the
 	 * starting position and ReadRecPtr to invalid.
+	 *
+	 * Start and end point of last record returned by XLogReadRecord().  These
+	 * are also available as record->lsn and record->next_lsn.
 	 */
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
@@ -192,27 +219,43 @@ struct XLogReaderState
 	 * Use XLogRecGet* functions to investigate the record; these fields
 	 * should not be accessed directly.
 	 * ----------------------------------------
+	 * Start and end point of the last record read and decoded by
+	 * XLogReadRecordInternal().  NextRecPtr is also used as the position to
+	 * decode next.  Calling XLogBeginRead() sets NextRecPtr and EndRecPtr to
+	 * the requested starting position.
 	 */
-	XLogRecord *decoded_record; /* currently decoded record */
+	XLogRecPtr	DecodeRecPtr;	/* start of last record decoded */
+	XLogRecPtr	NextRecPtr;		/* end+1 of last record decoded */
+	XLogRecPtr	PrevRecPtr;		/* start of previous record decoded */
 
-	char	   *main_data;		/* record's main data portion */
-	uint32		main_data_len;	/* main data portion's length */
-	uint32		main_data_bufsz;	/* allocated size of the buffer */
-
-	RepOriginId record_origin;
-
-	TransactionId toplevel_xid; /* XID of top-level transaction */
-
-	/* information about blocks referenced by the record. */
-	DecodedBkpBlock blocks[XLR_MAX_BLOCK_ID + 1];
-
-	int			max_block_id;	/* highest block_id in use (-1 if none) */
+	/* Last record returned by XLogReadRecord(). */
+	DecodedXLogRecord *record;
 
 	/* ----------------------------------------
 	 * private/internal state
 	 * ----------------------------------------
 	 */
 
+	/*
+	 * Buffer for decoded records.  This is a circular buffer, though
+	 * individual records can't be split in the middle, so some space is often
+	 * wasted at the end.  Oversized records that don't fit in this space are
+	 * allocated separately.
+	 */
+	char	   *decode_buffer;
+	size_t		decode_buffer_size;
+	bool		free_decode_buffer; /* need to free? */
+	char	   *decode_buffer_head; /* data is read from the head */
+	char	   *decode_buffer_tail; /* new data is written at the tail */
+
+	/*
+	 * Queue of records that have been decoded.  This is a linked list that
+	 * usually consists of consecutive records in decode_buffer, but may also
+	 * contain oversized records allocated with palloc().
+	 */
+	DecodedXLogRecord *decode_queue_head;	/* oldest decoded record */
+	DecodedXLogRecord *decode_queue_tail;	/* newest decoded record */
+
 	/*
 	 * Buffer for currently read page (XLOG_BLCKSZ bytes, valid up to at least
 	 * readLen bytes)
@@ -262,8 +305,25 @@ struct XLogReaderState
 
 	/* Buffer to hold error message */
 	char	   *errormsg_buf;
+	bool		errormsg_deferred;
+
+	/*
+	 * Flag to indicate to XLogPageReadCB that it should not block during
+	 * read ahead.
+	 */
+	bool		nonblocking;
 };
 
+/*
+ * Check if XLogNextRecord() has any more queued records or errors.  This
+ * can be used by a read_page callback to decide whether it should block.
+ */
+static inline bool
+XLogReaderHasQueuedRecordOrError(XLogReaderState *state)
+{
+	return (state->decode_queue_head != NULL) || state->errormsg_deferred;
+}
+
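[Editorial aside, not part of the patch: a minimal sketch of how a page-read callback might use the nonblocking flag and XLogReaderHasQueuedRecordOrError() to avoid stalling read-ahead. The three helpers are invented placeholders, and XLREAD_WOULDBLOCK is the enumerator defined a little further down in this header.]

/* Invented placeholders, for this sketch only. */
extern bool wal_is_available(XLogRecPtr upto);
extern void wait_for_more_wal(void);
extern int	read_page_bytes(XLogRecPtr pageptr, int reqLen, char *readBuf);

static int
sketch_read_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
				 int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
{
	while (!wal_is_available(targetPagePtr + reqLen))
	{
		/*
		 * We're reading ahead and replay already has work queued, so give
		 * up rather than stall; the prefetcher will retry later.
		 */
		if (state->nonblocking && XLogReaderHasQueuedRecordOrError(state))
			return XLREAD_WOULDBLOCK;

		wait_for_more_wal();
	}

	return read_page_bytes(targetPagePtr, reqLen, readBuf);
}
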
 /* Get a new XLogReader */
 extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
 										   const char *waldir,
@@ -274,16 +334,40 @@ extern XLogReaderRoutine *LocalXLogReaderRoutine(void);
 /* Free an XLogReader */
 extern void XLogReaderFree(XLogReaderState *state);
 
+/* Optionally provide a circular decoding buffer to allow readahead. */
+extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
+									  void *buffer,
+									  size_t size);
+
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
 #ifdef FRONTEND
 extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
 #endif							/* FRONTEND */
 
+/* Return values from XLogPageReadCB. */
+typedef enum XLogPageReadResult
+{
+	XLREAD_SUCCESS = 0,			/* record is successfully read */
+	XLREAD_FAIL = -1,			/* failed during reading a record */
+	XLREAD_WOULDBLOCK = -2		/* nonblocking mode only, no data */
+} XLogPageReadResult;
+
 /* Read the next XLog record. Returns NULL on end-of-WAL or failure */
-extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
+extern XLogRecord *XLogReadRecord(XLogReaderState *state,
+								  char **errormsg);
+
+/* Consume the next record or error. */
+extern DecodedXLogRecord *XLogNextRecord(XLogReaderState *state,
 										 char **errormsg);
 
+/* Release the previously returned record, if necessary. */
+extern void XLogReleasePreviousRecord(XLogReaderState *state);
+
+/* Try to read ahead, if there is data and space. */
+extern DecodedXLogRecord *XLogReadAhead(XLogReaderState *state,
+										bool nonblocking);
+
 /* Validate a page */
 extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
 										 XLogRecPtr recptr, char *phdr);
@@ -307,25 +391,32 @@ extern bool WALRead(XLogReaderState *state,
 
 /* Functions for decoding an XLogRecord */
 
-extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
+extern size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
+extern bool DecodeXLogRecord(XLogReaderState *state,
+							 DecodedXLogRecord *decoded,
+							 XLogRecord *record,
+							 XLogRecPtr lsn,
 							 char **errmsg);
 
-#define XLogRecGetTotalLen(decoder) ((decoder)->decoded_record->xl_tot_len)
-#define XLogRecGetPrev(decoder) ((decoder)->decoded_record->xl_prev)
-#define XLogRecGetInfo(decoder) ((decoder)->decoded_record->xl_info)
-#define XLogRecGetRmid(decoder) ((decoder)->decoded_record->xl_rmid)
-#define XLogRecGetXid(decoder) ((decoder)->decoded_record->xl_xid)
-#define XLogRecGetOrigin(decoder) ((decoder)->record_origin)
-#define XLogRecGetTopXid(decoder) ((decoder)->toplevel_xid)
-#define XLogRecGetData(decoder) ((decoder)->main_data)
-#define XLogRecGetDataLen(decoder) ((decoder)->main_data_len)
-#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block_id >= 0)
-#define XLogRecHasBlockRef(decoder, block_id) \
-	((decoder)->blocks[block_id].in_use)
-#define XLogRecHasBlockImage(decoder, block_id) \
-	((decoder)->blocks[block_id].has_image)
-#define XLogRecBlockImageApply(decoder, block_id) \
-	((decoder)->blocks[block_id].apply_image)
+#define XLogRecGetTotalLen(decoder) ((decoder)->record->header.xl_tot_len)
+#define XLogRecGetPrev(decoder) ((decoder)->record->header.xl_prev)
+#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
+#define XLogRecGetRmid(decoder) ((decoder)->record->header.xl_rmid)
+#define XLogRecGetXid(decoder) ((decoder)->record->header.xl_xid)
+#define XLogRecGetOrigin(decoder) ((decoder)->record->record_origin)
+#define XLogRecGetTopXid(decoder) ((decoder)->record->toplevel_xid)
+#define XLogRecGetData(decoder) ((decoder)->record->main_data)
+#define XLogRecGetDataLen(decoder) ((decoder)->record->main_data_len)
+#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->record->max_block_id >= 0)
+#define XLogRecMaxBlockId(decoder) ((decoder)->record->max_block_id)
+#define XLogRecGetBlock(decoder, i) (&(decoder)->record->blocks[(i)])
+#define XLogRecHasBlockRef(decoder, block_id)			\
+	(((decoder)->record->max_block_id >= (block_id)) &&	\
+	 ((decoder)->record->blocks[block_id].in_use))
+#define XLogRecHasBlockImage(decoder, block_id)		\
+	((decoder)->record->blocks[block_id].has_image)
+#define XLogRecBlockImageApply(decoder, block_id)		\
+	((decoder)->record->blocks[block_id].apply_image)
 
 #ifndef FRONTEND
 extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index eaf3e7a8d4..f57f7e0f53 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -533,6 +533,7 @@ DeadLockState
 DeallocateStmt
 DeclareCursorStmt
 DecodedBkpBlock
+DecodedXLogRecord
 DecodingOutputState
 DefElem
 DefElemAction
@@ -2939,6 +2940,7 @@ XLogPageHeader
 XLogPageHeaderData
 XLogPageReadCB
 XLogPageReadPrivate
+XLogPageReadResult
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.30.2

v23-0002-Prefetch-referenced-data-in-recovery-take-II.patchtext/x-patch; charset=US-ASCII; name=v23-0002-Prefetch-referenced-data-in-recovery-take-II.patchDownload
From 723e4a75ab267e82506b7549e5ba5fff175d5150 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 9 Nov 2021 16:43:45 +1300
Subject: [PATCH v23 2/2] Prefetch referenced data in recovery, take II.

Introduce a new GUC recovery_prefetch, disabled by default.  When
enabled, look ahead in the WAL and try to initiate asynchronous reading
of referenced data blocks that are not yet cached in our buffer pool.
For now, this is done with posix_fadvise(), which has several caveats.
Better mechanisms will follow in later work on the I/O subsystem.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size limits the maximum distance we are
prepared to read ahead in the WAL to find uncached blocks.
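
[Editorial aside, not part of the patch: an illustration-only sketch of how these settings bound the prefetcher, mirroring the lrq_alloc() call in XLogPrefetcherReadRecord() later in this patch; the function and variable names here are made up.]

static void
prefetch_limits_illustration(void)
{
	/* Hard cap on decoded-but-not-yet-replayed WAL, in bytes. */
	int			max_wal_lookahead = wal_decode_buffer_size; /* default 512kB */

	/* Cap on the number of I/Os assumed to be in flight at once. */
	int			max_inflight_ios = maintenance_io_concurrency; /* default 10 */

	/* Block references tracked ahead of replay: a bit wider than the I/O cap. */
	int			max_block_lookahead = maintenance_io_concurrency * 4;

	(void) max_wal_lookahead;
	(void) max_inflight_ios;
	(void) max_block_lookahead;
}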

Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (earlier version)
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> (earlier version)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (earlier version)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> (earlier version)
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> (earlier version)
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  64 ++
 doc/src/sgml/monitoring.sgml                  |  77 +-
 doc/src/sgml/wal.sgml                         |  12 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             |   2 +
 src/backend/access/transam/xlogprefetcher.c   | 968 ++++++++++++++++++
 src/backend/access/transam/xlogreader.c       |  13 +
 src/backend/access/transam/xlogrecovery.c     | 160 ++-
 src/backend/access/transam/xlogutils.c        |  27 +-
 src/backend/catalog/system_views.sql          |  13 +
 src/backend/storage/freespace/freespace.c     |   3 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  53 +-
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/access/xlog.h                     |   1 +
 src/include/access/xlogprefetcher.h           |  51 +
 src/include/access/xlogreader.h               |   8 +
 src/include/access/xlogutils.h                |   3 +-
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/utils/guc.h                       |   4 +
 src/test/recovery/t/027_stream_regress.pl     |   3 +
 src/test/regress/expected/rules.out           |  10 +
 src/tools/pgindent/typedefs.list              |   7 +
 23 files changed, 1434 insertions(+), 62 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetcher.c
 create mode 100644 src/include/access/xlogprefetcher.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5b763bf60f..55fe603dd2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3644,6 +3644,70 @@ include_dir 'conf.d'
      </variablelist>
     </sect2>
 
+   <sect2 id="runtime-config-wal-recovery">
+
+    <title>Recovery</title>
+
+     <indexterm>
+      <primary>configuration</primary>
+      <secondary>of recovery</secondary>
+      <tertiary>general settings</tertiary>
+     </indexterm>
+
+    <para>
+     This section describes the settings that apply to recovery in general,
+     affecting crash recovery, streaming replication and archive-based
+     replication.
+    </para>
+
+
+    <variablelist>
+     <varlistentry id="guc-recovery-prefetch" xreflabel="recovery_prefetch">
+      <term><varname>recovery_prefetch</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to try to prefetch blocks that are referenced in the WAL that
+        are not yet in the buffer pool, during recovery.  Valid values are
+        <literal>off</literal> (the default), <literal>on</literal> and
+        <literal>try</literal>.  The setting <literal>try</literal> enables
+        prefetching only if the operating system provides the
+        <function>posix_fadvise</function> function, which is currently used
+        to implement prefetching.  Note that some operating systems provide the
+        function, but don't actually perform any prefetching.
+       </para>
+       <para>
+        Prefetching blocks that will soon be needed can reduce I/O wait times
+        during recovery with some workloads.
+        See also the <xref linkend="guc-wal-decode-buffer-size"/> and
+        <xref linkend="guc-maintenance-io-concurrency"/> settings, which limit
+        prefetching activity.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-decode-buffer-size" xreflabel="wal_decode_buffer_size">
+      <term><varname>wal_decode_buffer_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_decode_buffer_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        A limit on how far ahead the server can look in the WAL, to find
+        blocks to prefetch.  If this value is specified without units, it is
+        taken as bytes.
+        The default is 512kB.
+       </para>
+      </listitem>
+     </varlistentry>
+
+    </variablelist>
+   </sect2>
+
   <sect2 id="runtime-config-wal-archive-recovery">
 
     <title>Archive Recovery</title>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9fb62fec8e..2e3b73f49e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -328,6 +328,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2958,6 +2965,69 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    copy of the subscribed tables.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_init</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they would be zero-initialized</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they didn't exist yet</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL</entry>
+    </row>
+    <row>
+     <entry><structfield>wal_distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many bytes ahead the prefetcher is looking</entry>
+    </row>
+    <row>
+     <entry><structfield>block_distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many blocks ahead the prefetcher is looking</entry>
+    </row>
+    <row>
+     <entry><structfield>io_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
+   for more information.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="1">
@@ -5177,8 +5247,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
         view, <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
-        to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+        the <structname>pg_stat_archiver</structname> view,
+        <literal>wal</literal> to reset all the counters shown in the
+        <structname>pg_stat_wal</structname> view or
+        <literal>prefetch_recovery</literal> to reset all the counters shown
+        in the <structname>pg_stat_prefetch_recovery</structname> view.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 2bb27a8468..8566f297d3 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -803,6 +803,18 @@
    counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
    in <structname>pg_stat_wal</structname>, respectively.
   </para>
+
+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.
+   By default, prefetching in recovery is disabled.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 79314c69ab..8c17c88dfc 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetcher.o \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogutils.o
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6430b7b0dd..96a26f0998 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -59,6 +59,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
@@ -132,6 +133,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
new file mode 100644
index 0000000000..dc630ecc5a
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -0,0 +1,968 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetcher.c
+ *
+ * This module provides a drop-in replacement for an XLogReader that tries to
+ * minimize I/O stalls by looking up future blocks in the buffer cache, and
+ * initiating I/Os that might complete before the caller eventually needs the
+ * data.  XLogReadBufferForRedo() cooperates by using information stored in
+ * the decoded record to find buffers efficiently.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetcher.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_control.h"
+#include "catalog/storage_xlog.h"
+#include "commands/dbcommands_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/* Every time we process this much WAL, we update dynamic values in shm. */
+#define XLOGPREFETCHER_STATS_SHM_DISTANCE BLCKSZ
+
+/* GUCs */
+int			recovery_prefetch = RECOVERY_PREFETCH_OFF;
+
+#ifdef USE_PREFETCH
+#define RecoveryPrefetchEnabled() (recovery_prefetch != RECOVERY_PREFETCH_OFF)
+#else
+#define RecoveryPrefetchEnabled() false
+#endif
+
+static int	XLogPrefetchReconfigureCount = 0;
+
+/*
+ * Enum used to report whether an IO should be started.
+ */
+typedef enum
+{
+	LRQ_NEXT_NO_IO,
+	LRQ_NEXT_IO,
+	LRQ_NEXT_AGAIN
+} LsnReadQueueNextStatus;
+
+/*
+ * Type of callback that can decide which block to prefetch next.  For now
+ * there is only one.
+ */
+typedef LsnReadQueueNextStatus (*LsnReadQueueNextFun) (uintptr_t lrq_private,
+													   XLogRecPtr *lsn);
+
+/*
+ * A simple circular queue of LSNs, used to control the number of
+ * (potentially) inflight IOs.  This stands in for a later more general IO
+ * control mechanism, which is why it has the apparently unnecessary
+ * indirection through a function pointer.
+ */
+typedef struct LsnReadQueue
+{
+	LsnReadQueueNextFun next;
+	uintptr_t	lrq_private;
+	uint32		max_inflight;
+	uint32		inflight;
+	uint32		completed;
+	uint32		head;
+	uint32		tail;
+	uint32		size;
+	struct
+	{
+		bool		io;
+		XLogRecPtr	lsn;
+	}			queue[FLEXIBLE_ARRAY_MEMBER];
+} LsnReadQueue;
+
+/*
+ * A prefetcher.  This is a mechanism that wraps an XLogReader, prefetching
+ * blocks that will soon be referenced, to try to avoid IO stalls.
+ */
+struct XLogPrefetcher
+{
+	/* WAL reader and current reading state. */
+	XLogReaderState *reader;
+	DecodedXLogRecord *record;
+	int			next_block_id;
+
+	/* When to publish stats. */
+	XLogRecPtr	next_stats_shm_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB	   *filter_table;
+	dlist_head	filter_queue;
+
+	/* Book-keeping for readahead barriers. */
+	XLogRecPtr	no_readahead_until;
+
+	/* IO depth manager. */
+	LsnReadQueue *streaming_read;
+
+	XLogRecPtr	begin_ptr;
+
+	int			reconfigure_count;
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that (we assume) have already been dropped, or will be created by bulk WAL
+ * operators.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode rnode;
+	XLogRecPtr	filter_until_replayed;
+	BlockNumber filter_from_block;
+	dlist_node	link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time;	/* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 hit;		/* Blocks already in cache. */
+	pg_atomic_uint64 skip_init; /* Zero-inited blocks skipped. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32		reset_handled;
+
+	/* Dynamic values */
+	int			wal_distance;	/* Number of WAL bytes ahead. */
+	int			block_distance; /* Number of block references ahead. */
+	int			io_depth;		/* Number of I/Os in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static LsnReadQueueNextStatus XLogPrefetcherNextBlock(uintptr_t pgsr_private,
+													  XLogRecPtr *lsn);
+
+static XLogPrefetchStats *SharedStats;
+
+static inline LsnReadQueue *
+lrq_alloc(uint32 max_distance,
+		  uint32 max_inflight,
+		  uintptr_t lrq_private,
+		  LsnReadQueueNextFun next)
+{
+	LsnReadQueue *lrq;
+	uint32		size;
+
+	Assert(max_distance >= max_inflight);
+
+	size = max_distance + 1;	/* full ring buffer has a gap */
+	lrq = palloc(offsetof(LsnReadQueue, queue) + sizeof(lrq->queue[0]) * size);
+	lrq->lrq_private = lrq_private;
+	lrq->max_inflight = max_inflight;
+	lrq->size = size;
+	lrq->next = next;
+	lrq->head = 0;
+	lrq->tail = 0;
+	lrq->inflight = 0;
+	lrq->completed = 0;
+
+	return lrq;
+}
+
+static inline void
+lrq_free(LsnReadQueue *lrq)
+{
+	pfree(lrq);
+}
+
+static inline uint32
+lrq_inflight(LsnReadQueue *lrq)
+{
+	return lrq->inflight;
+}
+
+static inline uint32
+lrq_completed(LsnReadQueue *lrq)
+{
+	return lrq->completed;
+}
+
+static inline void
+lrq_prefetch(LsnReadQueue *lrq)
+{
+	/* Try to start as many IOs as we can within our limits. */
+	while (lrq->inflight < lrq->max_inflight &&
+		   lrq->inflight + lrq->completed < lrq->size - 1)
+	{
+		Assert(((lrq->head + 1) % lrq->size) != lrq->tail);
+		switch (lrq->next(lrq->lrq_private, &lrq->queue[lrq->head].lsn))
+		{
+			case LRQ_NEXT_AGAIN:
+				return;
+			case LRQ_NEXT_IO:
+				lrq->queue[lrq->head].io = true;
+				lrq->inflight++;
+				break;
+			case LRQ_NEXT_NO_IO:
+				lrq->queue[lrq->head].io = false;
+				lrq->completed++;
+				break;
+		}
+		lrq->head++;
+		if (lrq->head == lrq->size)
+			lrq->head = 0;
+	}
+}
+
+static inline void
+lrq_complete_lsn(LsnReadQueue *lrq, XLogRecPtr lsn)
+{
+	/*
+	 * We know that LSNs before 'lsn' have been replayed, so we can now assume
+	 * that any IOs that were started before then have finished.
+	 */
+	while (lrq->tail != lrq->head &&
+		   lrq->queue[lrq->tail].lsn < lsn)
+	{
+		if (lrq->queue[lrq->tail].io)
+			lrq->inflight--;
+		else
+			lrq->completed--;
+		lrq->tail++;
+		if (lrq->tail == lrq->size)
+			lrq->tail = 0;
+	}
+	if (RecoveryPrefetchEnabled())
+		lrq_prefetch(lrq);
+}
+
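[Editorial aside, not part of the patch: a hedged sketch of how a caller is expected to drive the LsnReadQueue, mirroring what XLogPrefetcherReadRecord() does further down in this file; the callback does nothing and the LSN constant is arbitrary.]

static LsnReadQueueNextStatus
sketch_next_block(uintptr_t lrq_private, XLogRecPtr *lsn)
{
	/* Nothing further to look at right now; try again later. */
	return LRQ_NEXT_AGAIN;
}

static void
lsn_read_queue_usage_sketch(void)
{
	LsnReadQueue *lrq;

	/* Track up to 40 block references ahead, with at most 10 I/Os in flight. */
	lrq = lrq_alloc(40, 10, (uintptr_t) 0, sketch_next_block);

	/* Fill the queue as far as the limits and the callback allow. */
	lrq_prefetch(lrq);

	/*
	 * Once replay has passed some LSN, the queue assumes that I/Os queued
	 * for earlier LSNs have completed, which frees slots and triggers more
	 * read-ahead.
	 */
	lrq_complete_lsn(lrq, (XLogRecPtr) 0x1000000);

	lrq_free(lrq);
}
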
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&SharedStats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&SharedStats->prefetch, 0);
+	pg_atomic_write_u64(&SharedStats->hit, 0);
+	pg_atomic_write_u64(&SharedStats->skip_init, 0);
+	pg_atomic_write_u64(&SharedStats->skip_new, 0);
+	pg_atomic_write_u64(&SharedStats->skip_fpw, 0);
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	SharedStats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+
+	if (!found)
+	{
+		pg_atomic_init_u32(&SharedStats->reset_request, 0);
+		SharedStats->reset_handled = 0;
+
+		pg_atomic_init_u64(&SharedStats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&SharedStats->prefetch, 0);
+		pg_atomic_init_u64(&SharedStats->hit, 0);
+		pg_atomic_init_u64(&SharedStats->skip_init, 0);
+		pg_atomic_init_u64(&SharedStats->skip_new, 0);
+		pg_atomic_init_u64(&SharedStats->skip_fpw, 0);
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&SharedStats->reset_request, 1);
+}
+
+/*
+ * Increment a counter in shared memory.  This is equivalent to *counter++ on a
+ * plain uint64 without any memory barrier or locking, except on platforms
+ * where readers can't read uint64 without possibly observing a torn value.
+ */
+static inline void
+XLogPrefetchIncrement(pg_atomic_uint64 *counter)
+{
+	Assert(AmStartupProcess() || !IsUnderPostmaster);
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL records.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogReaderState *reader)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	prefetcher = palloc0(sizeof(XLogPrefetcher));
+
+	prefetcher->reader = reader;
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	SharedStats->wal_distance = 0;
+	SharedStats->block_distance = 0;
+	SharedStats->io_depth = 0;
+
+	/* First usage will cause streaming_read to be allocated. */
+	prefetcher->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	lrq_free(prefetcher->streaming_read);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Provide access to the reader.
+ */
+XLogReaderState *
+XLogPrefetcherReader(XLogPrefetcher *prefetcher)
+{
+	return prefetcher->reader;
+}
+
+static void
+XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher, XLogRecPtr lsn)
+{
+	uint32		io_depth;
+	uint32		completed;
+	uint32		reset_request;
+	int64		wal_distance;
+
+
+	/* How far ahead of replay are we now? */
+	if (prefetcher->record)
+		wal_distance = prefetcher->record->lsn - prefetcher->reader->record->lsn;
+	else
+		wal_distance = 0;
+
+	/* How many IOs are currently in flight and completed? */
+	io_depth = lrq_inflight(prefetcher->streaming_read);
+	completed = lrq_completed(prefetcher->streaming_read);
+
+	/* Update the instantaneous stats visible in pg_stat_prefetch_recovery. */
+	SharedStats->io_depth = io_depth;
+	SharedStats->block_distance = io_depth + completed;
+	SharedStats->wal_distance = wal_distance;
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&SharedStats->reset_request);
+	if (reset_request != SharedStats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		SharedStats->reset_handled = reset_request;
+	}
+
+	prefetcher->next_stats_shm_lsn = lsn + XLOGPREFETCHER_STATS_SHM_DISTANCE;
+}
+
+/*
+ * A callback that reads ahead in the WAL and tries to initiate one IO.
+ */
+static LsnReadQueueNextStatus
+XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
+{
+	XLogPrefetcher *prefetcher = (XLogPrefetcher *) pgsr_private;
+	XLogReaderState *reader = prefetcher->reader;
+	XLogRecPtr	replaying_lsn = reader->ReadRecPtr;
+
+	/*
+	 * We keep track of the record and block we're up to between calls with
+	 * prefetcher->record and prefetcher->next_block_id.
+	 */
+	for (;;)
+	{
+		DecodedXLogRecord *record;
+
+		/* Try to read a new future record, if we don't already have one. */
+		if (prefetcher->record == NULL)
+		{
+			bool		nonblocking;
+
+			/*
+			 * If there are already records or an error queued up that could
+			 * be replayed, we don't want to block here.  Otherwise, it's OK
+			 * to block waiting for more data: presumably the caller has
+			 * nothing else to do.
+			 */
+			nonblocking = XLogReaderHasQueuedRecordOrError(reader);
+
+			/* Certain records act as barriers for all readahead. */
+			if (nonblocking && replaying_lsn < prefetcher->no_readahead_until)
+				return LRQ_NEXT_AGAIN;
+
+			record = XLogReadAhead(prefetcher->reader, nonblocking);
+			if (record == NULL)
+			{
+				/*
+				 * We can't read any more, due to an error or lack of data in
+				 * nonblocking mode.
+				 */
+				return LRQ_NEXT_AGAIN;
+			}
+
+			/*
+			 * If prefetching is disabled, we don't need to analyze the record
+			 * or issue any prefetches.  We just need to cause one record to
+			 * be decoded.
+			 */
+			if (!RecoveryPrefetchEnabled())
+			{
+				*lsn = InvalidXLogRecPtr;
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* We have a new record to process. */
+			prefetcher->record = record;
+			prefetcher->next_block_id = 0;
+		}
+		else
+		{
+			/* Continue to process from last call, or last loop. */
+			record = prefetcher->record;
+		}
+
+		/*
+		 * Check for operations that require us to filter out block ranges, or
+		 * stop readahead completely.
+		 *
+		 * XXX Perhaps this information could be derived automatically if we
+		 * had some standardized header flags and fields for these fields,
+		 * instead of special logic.
+		 *
+		 * XXX Are there other operations that need this treatment?
+		 */
+		if (replaying_lsn < record->lsn)
+		{
+			uint8		rmid = record->header.xl_rmid;
+			uint8		record_type = record->header.xl_info & ~XLR_INFO_MASK;
+
+			if (rmid == RM_XLOG_ID)
+			{
+				if (record_type == XLOG_CHECKPOINT_SHUTDOWN ||
+					record_type == XLOG_END_OF_RECOVERY)
+				{
+					/*
+					 * These records might change the TLI.  Avoid potential
+					 * bugs if we were to allow "read TLI" and "replay TLI" to
+					 * differ without more analysis.
+					 */
+					prefetcher->no_readahead_until = record->lsn;
+				}
+			}
+			else if (rmid == RM_DBASE_ID)
+			{
+				if (record_type == XLOG_DBASE_CREATE)
+				{
+					xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *)
+					record->main_data;
+					RelFileNode rnode = {InvalidOid, xlrec->db_id, InvalidOid};
+
+					/*
+					 * Don't try to prefetch anything in this database until
+					 * it has been created, or we might confuse blocks on OID
+					 * wraparound.  (We could use XLOG_DBASE_DROP instead, but
+					 * there shouldn't be any reference to blocks in a
+					 * database between DROP and CREATE for the same OID, and
+					 * doing it on CREATE avoids the more expensive
+					 * ENOENT-handling we would need if we didn't treat
+					 * CREATE as a barrier).
+					 */
+					XLogPrefetcherAddFilter(prefetcher, rnode, 0, record->lsn);
+				}
+			}
+			else if (rmid == RM_SMGR_ID)
+			{
+				if (record_type == XLOG_SMGR_CREATE)
+				{
+					xl_smgr_create *xlrec = (xl_smgr_create *)
+					record->main_data;
+
+					/*
+					 * Don't prefetch anything for this whole relation until
+					 * it has been created, or we might confuse blocks on OID
+					 * wraparound.
+					 */
+					XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+											record->lsn);
+				}
+				else if (record_type == XLOG_SMGR_TRUNCATE)
+				{
+					xl_smgr_truncate *xlrec = (xl_smgr_truncate *)
+					record->main_data;
+
+					/*
+					 * Don't prefetch anything in the truncated range until
+					 * the truncation has been performed.
+					 */
+					XLogPrefetcherAddFilter(prefetcher, xlrec->rnode,
+											xlrec->blkno,
+											record->lsn);
+				}
+			}
+		}
+
+		/* Scan the block references, starting where we left off last time. */
+		while (prefetcher->next_block_id <= record->max_block_id)
+		{
+			int			block_id = prefetcher->next_block_id++;
+			DecodedBkpBlock *block = &record->blocks[block_id];
+			SMgrRelation reln;
+			PrefetchBufferResult result;
+
+			if (!block->in_use)
+				continue;
+
+			Assert(!BufferIsValid(block->prefetch_buffer));
+
+			/*
+			 * Record the LSN of this record.  When it's replayed,
+			 * LsnReadQueue will consider any IOs submitted for earlier LSNs
+			 * to be finished.
+			 */
+			*lsn = record->lsn;
+
+			/* We don't try to prefetch anything but the main fork for now. */
+			if (block->forknum != MAIN_FORKNUM)
+			{
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/*
+			 * If there is a full page image attached, we won't be reading the
+			 * page, so don't bother trying to prefetch.
+			 */
+			if (block->has_image)
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_fpw);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* There is no point in reading a page that will be zeroed. */
+			if (block->flags & BKPBLOCK_WILL_INIT)
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_init);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* Should we skip prefetching this block due to a filter? */
+			if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/*
+			 * We could try to have a fast path for repeated references to the
+			 * same relation (with some scheme to handle invalidations
+			 * safely), but for now we'll call smgropen() every time.
+			 */
+			reln = smgropen(block->rnode, InvalidBackendId);
+
+			/*
+			 * If the block is past the end of the relation, filter out
+			 * further accesses until this record is replayed.
+			 */
+			if (block->blkno >= smgrnblocks(reln, block->forknum))
+			{
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+										record->lsn);
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* Try to initiate prefetching. */
+			result = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+			if (BufferIsValid(result.recent_buffer))
+			{
+				/* Cache hit, nothing to do. */
+				XLogPrefetchIncrement(&SharedStats->hit);
+				block->prefetch_buffer = result.recent_buffer;
+				return LRQ_NEXT_NO_IO;
+			}
+			else if (result.initiated_io)
+			{
+				/* Cache miss, I/O (presumably) started. */
+				XLogPrefetchIncrement(&SharedStats->prefetch);
+				block->prefetch_buffer = InvalidBuffer;
+				return LRQ_NEXT_IO;
+			}
+			else
+			{
+				/*
+				 * Neither cached nor initiated.  The underlying segment file
+				 * doesn't exist. (ENOENT)
+				 *
+				 * It might be missing because it was unlinked, we crashed,
+				 * and now we're replaying WAL.  Recovery will correct this
+				 * problem or complain if something is wrong.
+				 */
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+										record->lsn);
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+		}
+
+		/*
+		 * Several callsites need to be able to read exactly one record
+		 * without any internal readahead.  Examples: xlog.c reading
+		 * checkpoint records with emode set to PANIC, which might otherwise
+		 * cause XLogPageRead() to panic on some future page, and xlog.c
+		 * determining where to start writing WAL next, which depends on the
+		 * contents of the reader's internal buffer after reading one record.
+		 * Therefore, don't even think about prefetching until the first
+		 * record after XLogPrefetcherBeginRead() has been consumed.
+		 */
+		if (prefetcher->reader->decode_queue_tail &&
+			prefetcher->reader->decode_queue_tail->lsn == prefetcher->begin_ptr)
+			return LRQ_NEXT_AGAIN;
+
+		/* Advance to the next record. */
+		prefetcher->record = NULL;
+	}
+	pg_unreachable();
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&SharedStats->reset_request) != SharedStats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&SharedStats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_init));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_new));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_fpw));
+	values[6] = Int32GetDatum(SharedStats->wal_distance);
+	values[7] = Int32GetDatum(SharedStats->block_distance);
+	values[8] = Int32GetDatum(SharedStats->io_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the lower of the block numbers
+		 * there because we don't want to have to track individual blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+		filter->filter_from_block = Min(filter->filter_from_block, blockno);
+	}
+}
+
+/*
+ * Have we replayed any records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can stop filtering out accesses to a given
+ * relfilenode.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of
+	 * the time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter;
+
+		/* See if the block range is filtered. */
+		filter = hash_search(prefetcher->filter_table, &rnode, HASH_FIND, NULL);
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+
+		/* See if the whole database is filtered. */
+		rnode.relNode = InvalidOid;
+		filter = hash_search(prefetcher->filter_table, &rnode, HASH_FIND, NULL);
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
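[Editorial aside, not part of the patch: a hypothetical sequence illustrating the filter lifecycle implemented by XLogPrefetcherAddFilter(), XLogPrefetcherIsFiltered() and XLogPrefetcherCompleteFilters(); the LSN values are arbitrary.]

static void
filter_lifecycle_sketch(XLogPrefetcher *prefetcher, RelFileNode rnode)
{
	/* Suppose the decoder saw XLOG_SMGR_CREATE for rnode at LSN 0x2000. */
	XLogPrefetcherAddFilter(prefetcher, rnode, 0, (XLogRecPtr) 0x2000);

	/* Until that record is replayed, all of rnode's blocks are skipped. */
	Assert(XLogPrefetcherIsFiltered(prefetcher, rnode, 7));

	/* Replay has now moved past the creating record... */
	XLogPrefetcherCompleteFilters(prefetcher, (XLogRecPtr) 0x2008);

	/* ...so prefetching of that relation may resume. */
	Assert(!XLogPrefetcherIsFiltered(prefetcher, rnode, 7));
}
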
+/*
+ * A wrapper for XLogBeginRead() that also resets the prefetcher.
+ */
+void
+XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher, XLogRecPtr recPtr)
+{
+	/* This will forget about any in-flight IO. */
+	prefetcher->reconfigure_count--;
+
+	/* Book-keeping to avoid readahead on first read. */
+	prefetcher->begin_ptr = recPtr;
+
+	prefetcher->no_readahead_until = 0;
+
+	/* This will forget about any queued up records in the decoder. */
+	XLogBeginRead(prefetcher->reader, recPtr);
+}
+
+/*
+ * A wrapper for XLogReadRecord() that provides the same interface, but also
+ * tries to initiate I/O for blocks referenced in future WAL records.
+ */
+XLogRecord *
+XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
+{
+	DecodedXLogRecord *record;
+
+	/*
+	 * See if it's time to reset the prefetching machinery, because a relevant
+	 * GUC was changed.
+	 */
+	if (unlikely(XLogPrefetchReconfigureCount != prefetcher->reconfigure_count))
+	{
+		if (prefetcher->streaming_read)
+			lrq_free(prefetcher->streaming_read);
+
+		/*
+		 * Arbitrarily look up to 4 times further ahead than the number of IOs
+		 * we're allowed to run concurrently.
+		 */
+		prefetcher->streaming_read =
+			lrq_alloc(RecoveryPrefetchEnabled() ? maintenance_io_concurrency * 4 : 1,
+					  RecoveryPrefetchEnabled() ? maintenance_io_concurrency : 1,
+					  (uintptr_t) prefetcher,
+					  XLogPrefetcherNextBlock);
+
+		prefetcher->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	/*
+	 * Release last returned record, if there is one.  We need to do this so
+	 * that we can check for empty decode queue accurately.
+	 */
+	XLogReleasePreviousRecord(prefetcher->reader);
+
+	/* If there's nothing queued yet, then start prefetching. */
+	if (!XLogReaderHasQueuedRecordOrError(prefetcher->reader))
+		lrq_prefetch(prefetcher->streaming_read);
+
+	/* Read the next record. */
+	record = XLogNextRecord(prefetcher->reader, errmsg);
+	if (!record)
+		return NULL;
+
+	/*
+	 * The record we just got is the "current" one, for the benefit of the
+	 * XLogRecXXX() macros.
+	 */
+	Assert(record == prefetcher->reader->record);
+
+	/*
+	 * Can we drop any prefetch filters yet, given the record we're about to
+	 * return?  This assumes that any records with earlier LSNs have been
+	 * replayed, so if we were waiting for a relation to be created or
+	 * extended, it is now OK to access blocks in the covered range.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, record->lsn);
+
+	/*
+	 * See if it's time to compute some statistics, because enough WAL has
+	 * been processed.
+	 */
+	if (unlikely(record->lsn >= prefetcher->next_stats_shm_lsn))
+		XLogPrefetcherComputeStats(prefetcher, record->lsn);
+
+	/*
+	 * The caller is about to replay this record, so we can now report that
+	 * all IO initiated for earlier WAL records must be finished.  This may
+	 * trigger more readahead.
+	 */
+	lrq_complete_lsn(prefetcher->streaming_read, record->lsn);
+
+	Assert(record == prefetcher->reader->record);
+
+	return &record->header;
+}
+
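[Editorial aside, not part of the patch: a compact sketch of the intended calling pattern, which the xlogrecovery.c hunks below follow via ReadRecord(); redo_one_record() is an invented placeholder for the rm_redo dispatch.]

/* Invented placeholder, for this sketch only. */
extern void redo_one_record(XLogReaderState *reader);

static void
prefetcher_usage_sketch(XLogReaderState *reader, XLogRecPtr start_lsn)
{
	XLogPrefetcher *prefetcher = XLogPrefetcherAllocate(reader);
	XLogRecord *record;
	char	   *errormsg;

	XLogPrefetcherBeginRead(prefetcher, start_lsn);
	while ((record = XLogPrefetcherReadRecord(prefetcher, &errormsg)) != NULL)
	{
		/* Replay the record; referenced blocks may already be in flight. */
		redo_one_record(reader);
	}

	XLogPrefetcherFree(prefetcher);
}
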
+bool
+check_recovery_prefetch(int *new_value, void **extra, GucSource source)
+{
+#ifndef USE_PREFETCH
+	if (*new_value == RECOVERY_PREFETCH_ON)
+	{
+		GUC_check_errdetail("recovery_prefetch not supported on platforms that lack posix_fadvise().");
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_recovery_prefetch(int new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 36fbcfa326..9b3bfa329f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1729,6 +1729,8 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
 
+			blk->prefetch_buffer = InvalidBuffer;
+
 			COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 			/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
 			if (blk->has_data && blk->data_len == 0)
@@ -1935,6 +1937,15 @@ err:
 bool
 XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 				   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+	return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL);
+}
+
+bool
+XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+					RelFileNode *rnode, ForkNumber *forknum,
+					BlockNumber *blknum,
+					Buffer *prefetch_buffer)
 {
 	DecodedBkpBlock *bkpb;
 
@@ -1949,6 +1960,8 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 		*forknum = bkpb->forknum;
 	if (blknum)
 		*blknum = bkpb->blkno;
+	if (prefetch_buffer)
+		*prefetch_buffer = bkpb->prefetch_buffer;
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 9feea3e6ec..e5e7821c79 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -36,6 +36,7 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
@@ -183,6 +184,9 @@ static bool doRequestWalReceiverReply;
 /* XLogReader object used to parse the WAL records */
 static XLogReaderState *xlogreader = NULL;
 
+/* XLogPrefetcher object used to consume WAL records with read-ahead */
+static XLogPrefetcher *xlogprefetcher = NULL;
+
 /* Parameters passed down from ReadRecord to the XLogPageRead callback. */
 typedef struct XLogPageReadPrivate
 {
@@ -404,18 +408,21 @@ static void recoveryPausesHere(bool endOfRecovery);
 static bool recoveryApplyDelay(XLogReaderState *record);
 static void ConfirmRecoveryPaused(void);
 
-static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
-							  int emode, bool fetching_ckpt, TimeLineID replayTLI);
+static XLogRecord *ReadRecord(XLogPrefetcher *xlogprefetcher,
+							  int emode, bool fetching_ckpt,
+							  TimeLineID replayTLI);
 
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
-static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-										bool fetching_ckpt,
-										XLogRecPtr tliRecPtr,
-										TimeLineID replayTLI,
-										XLogRecPtr replayLSN);
+static XLogPageReadResult WaitForWALToBecomeAvailable(XLogRecPtr RecPtr,
+													  bool randAccess,
+													  bool fetching_ckpt,
+													  XLogRecPtr tliRecPtr,
+													  TimeLineID replayTLI,
+													  XLogRecPtr replayLSN,
+													  bool nonblocking);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
-static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+static XLogRecord *ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher, XLogRecPtr RecPtr,
 										int whichChkpt, bool report, TimeLineID replayTLI);
 static bool rescanLatestTimeLine(TimeLineID replayTLI, XLogRecPtr replayLSN);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
@@ -561,6 +568,15 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 				 errdetail("Failed while allocating a WAL reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Set the WAL decode buffer size.  This limits how far ahead we can read
+	 * in the WAL.
+	 */
+	XLogReaderSetDecodeBuffer(xlogreader, NULL, wal_decode_buffer_size);
+
+	/* Create a WAL prefetcher. */
+	xlogprefetcher = XLogPrefetcherAllocate(xlogreader);
+
 	/*
 	 * Allocate two page buffers dedicated to WAL consistency checks.  We do
 	 * it this way, rather than just making static arrays, for two reasons:
@@ -589,7 +605,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
 		 */
-		record = ReadCheckpointRecord(xlogreader, CheckPointLoc, 0, true, CheckPointTLI);
+		record = ReadCheckpointRecord(xlogprefetcher, CheckPointLoc, 0, true,
+									  CheckPointTLI);
 		if (record != NULL)
 		{
 			memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
@@ -607,8 +624,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			 */
 			if (checkPoint.redo < CheckPointLoc)
 			{
-				XLogBeginRead(xlogreader, checkPoint.redo);
-				if (!ReadRecord(xlogreader, LOG, false,
+				XLogPrefetcherBeginRead(xlogprefetcher, checkPoint.redo);
+				if (!ReadRecord(xlogprefetcher, LOG, false,
 								checkPoint.ThisTimeLineID))
 					ereport(FATAL,
 							(errmsg("could not find redo location referenced by checkpoint record"),
@@ -727,7 +744,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		CheckPointTLI = ControlFile->checkPointCopy.ThisTimeLineID;
 		RedoStartLSN = ControlFile->checkPointCopy.redo;
 		RedoStartTLI = ControlFile->checkPointCopy.ThisTimeLineID;
-		record = ReadCheckpointRecord(xlogreader, CheckPointLoc, 1, true,
+		record = ReadCheckpointRecord(xlogprefetcher, CheckPointLoc, 1, true,
 									  CheckPointTLI);
 		if (record != NULL)
 		{
@@ -1403,8 +1420,8 @@ FinishWalRecovery(void)
 		lastRec = XLogRecoveryCtl->lastReplayedReadRecPtr;
 		lastRecTLI = XLogRecoveryCtl->lastReplayedTLI;
 	}
-	XLogBeginRead(xlogreader, lastRec);
-	(void) ReadRecord(xlogreader, PANIC, false, lastRecTLI);
+	XLogPrefetcherBeginRead(xlogprefetcher, lastRec);
+	(void) ReadRecord(xlogprefetcher, PANIC, false, lastRecTLI);
 	endOfLog = xlogreader->EndRecPtr;
 
 	/*
@@ -1501,6 +1518,8 @@ ShutdownWalRecovery(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	XLogPrefetcherFree(xlogprefetcher);
+
 	if (ArchiveRecoveryRequested)
 	{
 		/*
@@ -1584,15 +1603,15 @@ PerformWalRecovery(void)
 	{
 		/* back up to find the record */
 		replayTLI = RedoStartTLI;
-		XLogBeginRead(xlogreader, RedoStartLSN);
-		record = ReadRecord(xlogreader, PANIC, false, replayTLI);
+		XLogPrefetcherBeginRead(xlogprefetcher, RedoStartLSN);
+		record = ReadRecord(xlogprefetcher, PANIC, false, replayTLI);
 	}
 	else
 	{
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogreader, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1706,7 +1725,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1922,6 +1941,9 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 		 */
 		if (AllowCascadeReplication())
 			WalSndWakeup();
+
+		/* Reset the prefetcher. */
+		XLogPrefetchReconfigure();
 	}
 }
 
@@ -2302,7 +2324,8 @@ verifyBackupPageConsistency(XLogReaderState *record)
 		 * temporary page.
 		 */
 		buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									 RBM_NORMAL_NO_LOG);
+									 RBM_NORMAL_NO_LOG,
+									 InvalidBuffer);
 		if (!BufferIsValid(buf))
 			continue;
 
@@ -2914,17 +2937,18 @@ ConfirmRecoveryPaused(void)
  * Attempt to read the next XLOG record.
  *
  * Before first call, the reader needs to be positioned to the first record
- * by calling XLogBeginRead().
+ * by calling XLogPrefetcherBeginRead().
  *
  * If no valid record is available, returns NULL, or fails if emode is PANIC.
  * (emode must be either PANIC, LOG). In standby mode, retries until a valid
  * record is available.
  */
 static XLogRecord *
-ReadRecord(XLogReaderState *xlogreader, int emode,
+ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		   bool fetching_ckpt, TimeLineID replayTLI)
 {
 	XLogRecord *record;
+	XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher);
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 
 	/* Pass through parameters to XLogPageRead */
@@ -2940,7 +2964,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	{
 		char	   *errormsg;
 
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
 		{
 			/*
@@ -3073,6 +3097,9 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
  * and call XLogPageRead() again with the same arguments. This lets
  * XLogPageRead() to try fetching the record from another source, or to
  * sleep and retry.
+ *
+ * While prefetching, xlogreader->nonblocking may be set.  In that case,
+ * return XLREAD_WOULDBLOCK if we'd otherwise have to wait.
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
@@ -3122,20 +3149,31 @@ retry:
 		(readSource == XLOG_FROM_STREAM &&
 		 flushedUpto < targetPagePtr + reqLen))
 	{
-		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
-										 private->randAccess,
-										 private->fetching_ckpt,
-										 targetRecPtr,
-										 private->replayTLI,
-										 xlogreader->EndRecPtr))
+		if (readFile >= 0 &&
+			xlogreader->nonblocking &&
+			readSource == XLOG_FROM_STREAM &&
+			flushedUpto < targetPagePtr + reqLen)
+			return XLREAD_WOULDBLOCK;
+
+		switch (WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
+											private->randAccess,
+											private->fetching_ckpt,
+											targetRecPtr,
+											private->replayTLI,
+											xlogreader->EndRecPtr,
+											xlogreader->nonblocking))
 		{
-			if (readFile >= 0)
-				close(readFile);
-			readFile = -1;
-			readLen = 0;
-			readSource = XLOG_FROM_ANY;
-
-			return -1;
+			case XLREAD_WOULDBLOCK:
+				return XLREAD_WOULDBLOCK;
+			case XLREAD_FAIL:
+				if (readFile >= 0)
+					close(readFile);
+				readFile = -1;
+				readLen = 0;
+				readSource = XLOG_FROM_ANY;
+				return XLREAD_FAIL;
+			case XLREAD_SUCCESS:
+				break;
 		}
 	}
 
@@ -3260,7 +3298,7 @@ next_record_is_invalid:
 	if (StandbyMode)
 		goto retry;
 	else
-		return -1;
+		return XLREAD_FAIL;
 }
 
 /*
@@ -3292,11 +3330,15 @@ next_record_is_invalid:
  * containing it (if not open already), and returns true. When end of standby
  * mode is triggered by the user, and there is no more WAL available, returns
  * false.
+ *
+ * If nonblocking is true, then give up immediately if we can't satisfy the
+ * request, returning XLREAD_WOULDBLOCK instead of waiting.
  */
-static bool
+static XLogPageReadResult
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							bool fetching_ckpt, XLogRecPtr tliRecPtr,
-							TimeLineID replayTLI, XLogRecPtr replayLSN)
+							TimeLineID replayTLI, XLogRecPtr replayLSN,
+							bool nonblocking)
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
@@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 */
 		if (lastSourceFailed)
 		{
+			/*
+			 * Don't allow any retry loops to occur during nonblocking
+			 * readahead.  Let the caller process everything that has been
+			 * decoded already first.
+			 */
+			if (nonblocking)
+				return XLREAD_WOULDBLOCK;
+
 			switch (currentSource)
 			{
 				case XLOG_FROM_ARCHIVE:
@@ -3364,7 +3414,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (StandbyMode && CheckForStandbyTrigger())
 					{
 						XLogShutdownWalRcv();
-						return false;
+						return XLREAD_FAIL;
 					}
 
 					/*
@@ -3372,7 +3422,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * and pg_wal.
 					 */
 					if (!StandbyMode)
-						return false;
+						return XLREAD_FAIL;
 
 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
@@ -3516,7 +3566,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 											  currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
 											  currentSource);
 				if (readFile >= 0)
-					return true;	/* success! */
+					return XLREAD_SUCCESS;	/* success! */
 
 				/*
 				 * Nope, not found in archive or pg_wal.
@@ -3671,11 +3721,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							/* just make sure source info is correct... */
 							readSource = XLOG_FROM_STREAM;
 							XLogReceiptSource = XLOG_FROM_STREAM;
-							return true;
+							return XLREAD_SUCCESS;
 						}
 						break;
 					}
 
+					/* In nonblocking mode, return rather than sleeping. */
+					if (nonblocking)
+						return XLREAD_WOULDBLOCK;
+
 					/*
 					 * Data not here yet. Check for trigger, then wait for
 					 * walreceiver to wake us up when new WAL arrives.
@@ -3683,13 +3737,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (CheckForStandbyTrigger())
 					{
 						/*
-						 * Note that we don't "return false" immediately here.
-						 * After being triggered, we still want to replay all
-						 * the WAL that was already streamed. It's in pg_wal
-						 * now, so we just treat this as a failure, and the
-						 * state machine will move on to replay the streamed
-						 * WAL from pg_wal, and then recheck the trigger and
-						 * exit replay.
+						 * Note that we don't return XLREAD_FAIL immediately
+						 * here. After being triggered, we still want to
+						 * replay all the WAL that was already streamed. It's
+						 * in pg_wal now, so we just treat this as a failure,
+						 * and the state machine will move on to replay the
+						 * streamed WAL from pg_wal, and then recheck the
+						 * trigger and exit replay.
 						 */
 						lastSourceFailed = true;
 						break;
@@ -3740,7 +3794,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		HandleStartupProcInterrupts();
 	}
 
-	return false;				/* not reached */
+	return XLREAD_FAIL;				/* not reached */
 }
 
 
@@ -3785,7 +3839,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
  * 1 for "primary", 0 for "other" (backup_label)
  */
 static XLogRecord *
-ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher, XLogRecPtr RecPtr,
 					 int whichChkpt, bool report, TimeLineID replayTLI)
 {
 	XLogRecord *record;
@@ -3812,8 +3866,8 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 		return NULL;
 	}
 
-	XLogBeginRead(xlogreader, RecPtr);
-	record = ReadRecord(xlogreader, LOG, true, replayTLI);
+	XLogPrefetcherBeginRead(xlogprefetcher, RecPtr);
+	record = ReadRecord(xlogprefetcher, LOG, true, replayTLI);
 
 	if (record == NULL)
 	{
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 511f2f186f..ea22577b41 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -22,6 +22,7 @@
 #include "access/timeline.h"
 #include "access/xlogrecovery.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -355,11 +356,13 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber blkno;
+	Buffer		prefetch_buffer;
 	Page		page;
 	bool		zeromode;
 	bool		willinit;
 
-	if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+	if (!XLogRecGetBlockInfo(record, block_id, &rnode, &forknum, &blkno,
+							 &prefetch_buffer))
 	{
 		/* Caller specified a bogus block_id */
 		elog(PANIC, "failed to locate backup block with ID %d", block_id);
@@ -381,7 +384,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	{
 		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
+									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
+									  prefetch_buffer);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			elog(ERROR, "failed to restore block image");
@@ -410,7 +414,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	}
 	else
 	{
-		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode);
+		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode, prefetch_buffer);
 		if (BufferIsValid(*buf))
 		{
 			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
@@ -450,6 +454,10 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  * exist, and we don't check for all-zeroes.  Thus, no log entry is made
  * to imply that the page should be dropped or truncated later.
  *
+ * Optionally, recent_buffer can be used to provide a hint about the location
+ * of the page in the buffer pool; it does not have to be correct, but avoids
+ * a buffer mapping table probe if it is.
+ *
  * NB: A redo function should normally not call this directly. To get a page
  * to modify, use XLogReadBufferForRedoExtended instead. It is important that
  * all pages modified by a WAL record are registered in the WAL records, or
@@ -457,7 +465,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  */
 Buffer
 XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-					   BlockNumber blkno, ReadBufferMode mode)
+					   BlockNumber blkno, ReadBufferMode mode,
+					   Buffer recent_buffer)
 {
 	BlockNumber lastblock;
 	Buffer		buffer;
@@ -465,6 +474,15 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 
 	Assert(blkno != P_NEW);
 
+	/* Do we have a clue where the buffer might be already? */
+	if (BufferIsValid(recent_buffer) &&
+		mode == RBM_NORMAL &&
+		ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+	{
+		buffer = recent_buffer;
+		goto recent_buffer_fast_path;
+	}
+
 	/* Open the relation at smgr level */
 	smgr = smgropen(rnode, InvalidBackendId);
 
@@ -523,6 +541,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 		}
 	}
 
+recent_buffer_fast_path:
 	if (mode == RBM_NORMAL)
 	{
 		/* check that page has been initialized */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index bb1ac30cd1..f7b4999caf 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -905,6 +905,19 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.hit,
+            s.skip_init,
+            s.skip_new,
+            s.skip_fpw,
+            s.wal_distance,
+            s.block_distance,
+            s.io_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 78c073b7c9..d41ae37090 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -211,7 +211,8 @@ XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 	blkno = fsm_logical_to_physical(addr);
 
 	/* If the page doesn't exist already, extend */
-	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR);
+	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR,
+								 InvalidBuffer);
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(buf);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index cd4ebe2fc5..17f54b153b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -119,6 +120,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, LockShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
+	size = add_size(size, XLogPrefetchShmemSize());
 	size = add_size(size, XLOGShmemSize());
 	size = add_size(size, XLogRecoveryShmemSize());
 	size = add_size(size, CLOGShmemSize());
@@ -243,6 +245,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	XLogRecoveryShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e7f0a380e6..e34821e98e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -41,6 +41,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
@@ -215,6 +216,7 @@ static bool check_effective_io_concurrency(int *newval, void **extra, GucSource
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
 static bool check_client_connection_check_interval(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -479,6 +481,19 @@ static const struct config_enum_entry huge_pages_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry recovery_prefetch_options[] = {
+	{"off", RECOVERY_PREFETCH_OFF, false},
+	{"on", RECOVERY_PREFETCH_ON, false},
+	{"try", RECOVERY_PREFETCH_TRY, false},
+	{"true", RECOVERY_PREFETCH_ON, true},
+	{"false", RECOVERY_PREFETCH_OFF, true},
+	{"yes", RECOVERY_PREFETCH_ON, true},
+	{"no", RECOVERY_PREFETCH_OFF, true},
+	{"1", RECOVERY_PREFETCH_ON, true},
+	{"0", RECOVERY_PREFETCH_OFF, true},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry force_parallel_mode_options[] = {
 	{"off", FORCE_PARALLEL_OFF, false},
 	{"on", FORCE_PARALLEL_ON, false},
@@ -2792,6 +2807,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
+			gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
+			GUC_UNIT_BYTE
+		},
+		&wal_decode_buffer_size,
+		512 * 1024, 64 * 1024, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the size of WAL files held for standby servers."),
@@ -3115,7 +3141,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -4975,6 +5002,16 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch referenced blocks during recovery."),
+			gettext_noop("Read ahead of the current replay position to find uncached blocks.")
+		},
+		&recovery_prefetch,
+		RECOVERY_PREFETCH_OFF, recovery_prefetch_options,
+		check_recovery_prefetch, assign_recovery_prefetch, NULL
+	},
+
 	{
 		{"force_parallel_mode", PGC_USERSET, DEVELOPER_OPTIONS,
 			gettext_noop("Forces use of parallel query facilities."),
@@ -12211,6 +12248,20 @@ check_client_connection_check_interval(int *newval, void **extra, GucSource sour
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4cf5b26a36..0a6c7bd83e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -241,6 +241,11 @@
 #max_wal_size = 1GB
 #min_wal_size = 80MB
 
+# - Prefetching during recovery -
+
+#wal_decode_buffer_size = 512kB		# lookahead window used for prefetching
+#recovery_prefetch = off		# prefetch pages referenced in the WAL?
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 09f6464331..1df9dd2fbe 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -50,6 +50,7 @@ extern bool *wal_consistency_checking;
 extern char *wal_consistency_checking_string;
 extern bool log_checkpoints;
 extern bool track_wal_io_timing;
+extern int	wal_decode_buffer_size;
 
 extern int	CheckPointSegments;
 
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
new file mode 100644
index 0000000000..8aaf8681b5
--- /dev/null
+++ b/src/include/access/xlogprefetcher.h
@@ -0,0 +1,51 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetcher.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCHER_H
+#define XLOGPREFETCHER_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern int recovery_prefetch;
+
+/* Possible values for recovery_prefetch */
+typedef enum
+{
+	RECOVERY_PREFETCH_OFF,
+	RECOVERY_PREFETCH_ON,
+	RECOVERY_PREFETCH_TRY
+}			RecoveryPrefetchValue;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+
+extern void XLogPrefetchReconfigure(void);
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchRequestResetStats(void);
+
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogReaderState *reader);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+
+extern XLogReaderState *XLogPrefetcherReader(XLogPrefetcher *prefetcher);
+
+extern void XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
+									XLogRecPtr recPtr);
+
+extern XLogRecord *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
+											char **errmsg);
+
+#endif
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index d1f364f4e8..8446050225 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -39,6 +39,7 @@
 #endif
 
 #include "access/xlogrecord.h"
+#include "storage/buf.h"
 
 /* WALOpenSegment represents a WAL segment being read. */
 typedef struct WALOpenSegment
@@ -125,6 +126,9 @@ typedef struct
 	ForkNumber	forknum;
 	BlockNumber blkno;
 
+	/* Prefetching workspace. */
+	Buffer		prefetch_buffer;
+
 	/* copy of the fork_flags field from the XLogRecordBlockHeader */
 	uint8		flags;
 
@@ -427,5 +431,9 @@ extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *
 extern bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 							   RelFileNode *rnode, ForkNumber *forknum,
 							   BlockNumber *blknum);
+extern bool XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+								RelFileNode *rnode, ForkNumber *forknum,
+								BlockNumber *blknum,
+								Buffer *prefetch_buffer);
 
 #endif							/* XLOGREADER_H */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 64708949db..ff40f96e42 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -84,7 +84,8 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 													Buffer *buf);
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-									 BlockNumber blkno, ReadBufferMode mode);
+									 BlockNumber blkno, ReadBufferMode mode,
+									 Buffer recent_buffer);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d8e8715ed1..534ad0a5fb 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6360,6 +6360,14 @@
   prorettype => 'text', proargtypes => '',
   prosrc => 'pg_get_wal_replay_pause_state' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,int4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,hit,skip_init,skip_new,skip_fpw,wal_distance,block_distance,io_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ea774968f0..c9b258508d 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -450,4 +450,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetcher.c */
+extern bool check_recovery_prefetch(int *new_value, void **extra, GucSource source);
+extern void assign_recovery_prefetch(int new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/recovery/t/027_stream_regress.pl b/src/test/recovery/t/027_stream_regress.pl
index c40951b7ba..93ef4ef436 100644
--- a/src/test/recovery/t/027_stream_regress.pl
+++ b/src/test/recovery/t/027_stream_regress.pl
@@ -19,6 +19,9 @@ $node_primary->init(allows_streaming => 1);
 $node_primary->adjust_conf('postgresql.conf', 'max_connections', '25');
 $node_primary->append_conf('postgresql.conf', 'max_prepared_transactions = 10');
 
+# Enable recovery prefetch, if available on this platform
+$node_primary->append_conf('postgresql.conf', 'recovery_prefetch = try');
+
 # WAL consistency checking is resource intensive so require opt-in with the
 # PG_TEST_EXTRA environment variable.
 if ($ENV{PG_TEST_EXTRA} &&
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac468568a1..8ad54191cd 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1857,6 +1857,16 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.hit,
+    s.skip_init,
+    s.skip_new,
+    s.skip_fpw,
+    s.wal_distance,
+    s.block_distance,
+    s.io_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, hit, skip_init, skip_new, skip_fpw, wal_distance, block_distance, io_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f57f7e0f53..3a008ef433 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1408,6 +1408,9 @@ LogicalRepWorker
 LogicalRewriteMappingData
 LogicalTape
 LogicalTapeSet
+LsnReadQueue
+LsnReadQueueNextFun
+LsnReadQueueNextStatus
 LtreeGistOptions
 LtreeSignature
 MAGIC
@@ -2941,6 +2944,10 @@ XLogPageHeaderData
 XLogPageReadCB
 XLogPageReadPrivate
 XLogPageReadResult
+XLogPrefetcher
+XLogPrefetcherFilter
+XLogPrefetchState
+XLogPrefetchStats
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.30.2

#156Julien Rouhaud
rjuju123@gmail.com
In reply to: Thomas Munro (#155)
Re: WIP: WAL prefetch (another approach)

On Mon, Mar 14, 2022 at 06:15:59PM +1300, Thomas Munro wrote:

On Fri, Mar 11, 2022 at 9:27 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

Also, is it worth an assert (likely at the top of the function) for that?

How could I assert that EndRecPtr has the right value?

Sorry, I meant to assert that some value was assigned (!XLogRecPtrIsInvalid).
It can only make sure that the first call is done after XLogBeginRead /
XLogFindNextRecord, but that's better than nothing and consistent with the top
comment.

Done.

Just a small detail: I would move that assert to the top of the function, as it
should always be valid.

I also fixed the compile failure with -DWAL_DEBUG, and checked that
output looks sane with wal_debug=on.

Great! I'm happy with 0001 and I think it's good to go!

The other thing I need to change is that I should turn on
recovery_prefetch for platforms that support it (ie Linux and maybe
NetBSD only for now), in the tests. Right now you need to put
recovery_prefetch=on in a file and then run the tests with
"TEMP_CONFIG=path_to_that make -C src/test/recovery check" to
exercise much of 0002.

+1 with Andres' idea to have a "try" setting.

Done. The default is still "off" for now, but in
027_stream_regress.pl I set it to "try".

Great too! Unless you want to commit both patches right now, I'd like to review
0002 too (this week), as I've barely looked into it so far.
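
As a concrete illustration of the "try" setting discussed above (assuming the
GUC, view and reset-function names from the patch as posted, and a hot standby
that accepts connections), one way to confirm that prefetching is active and
to watch it at work is:

    -- settings recovery is currently running with
    SHOW recovery_prefetch;
    SHOW wal_decode_buffer_size;

    -- activity counters while WAL is being replayed
    SELECT prefetch, hit, skip_init, skip_new, skip_fpw,
           wal_distance, block_distance, io_depth
    FROM pg_stat_prefetch_recovery;

    -- zero the counters before a measurement run
    SELECT pg_stat_reset_shared('prefetch_recovery');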

#157Thomas Munro
thomas.munro@gmail.com
In reply to: Julien Rouhaud (#156)
Re: WIP: WAL prefetch (another approach)

On Mon, Mar 14, 2022 at 8:17 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

Great! I'm happy with 0001 and I think it's good to go!

I'll push 0001 today to let the build farm chew on it for a few days
before moving to 0002.

#158Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#157)
1 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Fri, Mar 18, 2022 at 9:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:

I'll push 0001 today to let the build farm chew on it for a few days
before moving to 0002.

Clearly 018_wal_optimize.pl is flapping and causing recoveryCheck to
fail occasionally, but that predates the above commit. I didn't
follow the existing discussion on that, so I'll try to look into that
tomorrow.

Here's a rebase of the 0002 patch, now called 0001.
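
For reviewers trying out the attached version, the limits described in its
commit message (maintenance_io_concurrency capping the number of I/Os assumed
to be in flight, wal_decode_buffer_size capping the lookahead distance) can be
observed from another session. A rough sketch, assuming the view added by the
patch and a hot standby:

    -- raise the in-flight I/O cap; the startup process picks it up on reload
    ALTER SYSTEM SET maintenance_io_concurrency = 100;
    SELECT pg_reload_conf();

    -- io_depth is expected to stay at or below that cap, and block_distance
    -- within the window allowed by wal_decode_buffer_size
    SELECT io_depth, block_distance, wal_distance
    FROM pg_stat_prefetch_recovery;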

Attachments:

v24-0001-Prefetch-referenced-data-in-recovery-take-II.patchtext/x-patch; charset=US-ASCII; name=v24-0001-Prefetch-referenced-data-in-recovery-take-II.patchDownload
From 3ac04122e635b98c50d6e48677fe74535d631388 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 20 Mar 2022 16:56:12 +1300
Subject: [PATCH v24] Prefetch referenced data in recovery, take II.

Introduce a new GUC recovery_prefetch, disabled by default.  When
enabled, look ahead in the WAL and try to initiate asynchronous reading
of referenced data blocks that are not yet cached in our buffer pool.
For now, this is done with posix_fadvise(), which has several caveats.
Since not all OSes have that system call, the setting "try" is provided
so that prefetching can be enabled on operating systems where it is
available; 027_stream_regress.pl uses "try" so that the build farm
effectively exercises both the on and off behaviors.  Better mechanisms
will follow in later
work on the I/O subsystem.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size limits the maximum distance we are
prepared to read ahead in the WAL to find uncached blocks.

Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (earlier version)
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> (earlier version)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (earlier version)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> (earlier version)
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> (earlier version)
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  64 ++
 doc/src/sgml/monitoring.sgml                  |  77 +-
 doc/src/sgml/wal.sgml                         |  12 +
 src/backend/access/transam/Makefile           |   1 +
 src/backend/access/transam/xlog.c             |   2 +
 src/backend/access/transam/xlogprefetcher.c   | 968 ++++++++++++++++++
 src/backend/access/transam/xlogreader.c       |  13 +
 src/backend/access/transam/xlogrecovery.c     | 160 ++-
 src/backend/access/transam/xlogutils.c        |  27 +-
 src/backend/catalog/system_views.sql          |  13 +
 src/backend/storage/freespace/freespace.c     |   3 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  53 +-
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/access/xlog.h                     |   1 +
 src/include/access/xlogprefetcher.h           |  51 +
 src/include/access/xlogreader.h               |   8 +
 src/include/access/xlogutils.h                |   3 +-
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/utils/guc.h                       |   4 +
 src/test/recovery/t/027_stream_regress.pl     |   3 +
 src/test/regress/expected/rules.out           |  10 +
 src/tools/pgindent/typedefs.list              |   7 +
 23 files changed, 1434 insertions(+), 62 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetcher.c
 create mode 100644 src/include/access/xlogprefetcher.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7a48973b3c..ce84f379a8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3644,6 +3644,70 @@ include_dir 'conf.d'
      </variablelist>
     </sect2>
 
+   <sect2 id="runtime-config-wal-recovery">
+
+    <title>Recovery</title>
+
+     <indexterm>
+      <primary>configuration</primary>
+      <secondary>of recovery</secondary>
+      <tertiary>general settings</tertiary>
+     </indexterm>
+
+    <para>
+     This section describes the settings that apply to recovery in general,
+     affecting crash recovery, streaming replication and archive-based
+     replication.
+    </para>
+
+
+    <variablelist>
+     <varlistentry id="guc-recovery-prefetch" xreflabel="recovery_prefetch">
+      <term><varname>recovery_prefetch</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to try to prefetch blocks that are referenced in the WAL but
+        are not yet in the buffer pool, during recovery.  Valid values are
+        <literal>off</literal> (the default), <literal>on</literal> and
+        <literal>try</literal>.  The setting <literal>try</literal> enables
+        prefetching only if the operating system provides the
+        <function>posix_fadvise</function> function, which is currently used
+        to implement prefetching.  Note that some operating systems provide the
+        function, but don't actually perform any prefetching.
+       </para>
+       <para>
+        Prefetching blocks that will soon be needed can reduce I/O wait times
+        during recovery with some workloads.
+        See also the <xref linkend="guc-wal-decode-buffer-size"/> and
+        <xref linkend="guc-maintenance-io-concurrency"/> settings, which limit
+        prefetching activity.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-decode-buffer-size" xreflabel="wal_decode_buffer_size">
+      <term><varname>wal_decode_buffer_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_decode_buffer_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        A limit on how far ahead the server can look in the WAL, to find
+        blocks to prefetch.  If this value is specified without units, it is
+        taken as bytes.
+        The default is 512kB.
+       </para>
+      </listitem>
+     </varlistentry>
+
+    </variablelist>
+   </sect2>
+
   <sect2 id="runtime-config-wal-archive-recovery">
 
     <title>Archive Recovery</title>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 35b2923c5e..b78081e6d7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -328,6 +328,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2967,6 +2974,69 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    copy of the subscribed tables.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_init</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they would be zero-initialized</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they didn't exist yet</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL</entry>
+    </row>
+    <row>
+     <entry><structfield>wal_distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many bytes ahead the prefetcher is looking</entry>
+    </row>
+    <row>
+     <entry><structfield>block_distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many blocks ahead the prefetcher is looking</entry>
+    </row>
+    <row>
+     <entry><structfield>io_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
+   for more information.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="1">
@@ -5186,8 +5256,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
         view, <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
-        to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+        the <structname>pg_stat_archiver</structname> view,
+        <literal>wal</literal> to reset all the counters shown in the
+        <structname>pg_stat_wal</structname> view or
+        <literal>prefetch_recovery</literal> to reset all the counters shown
+        in the <structname>pg_stat_prefetch_recovery</structname> view.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 2bb27a8468..8566f297d3 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -803,6 +803,18 @@
    counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
    in <structname>pg_stat_wal</structname>, respectively.
   </para>
+
+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.
+   By default, prefetching in recovery is disabled.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 79314c69ab..8c17c88dfc 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetcher.o \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogutils.o
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4ac3871c74..a1544c052e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -59,6 +59,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
@@ -133,6 +134,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
new file mode 100644
index 0000000000..537b0b192a
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -0,0 +1,968 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetcher.c
+ *
+ * This module provides a drop-in replacement for an XLogReader that tries to
+ * minimize I/O stalls by looking up future blocks in the buffer cache, and
+ * initiating I/Os that might complete before the caller eventually needs the
+ * data.  XLogReadBufferForRedo() cooperates by using information stored in
+ * the decoded record to find buffers efficiently.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetcher.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_control.h"
+#include "catalog/storage_xlog.h"
+#include "commands/dbcommands_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/* Every time we process this much WAL, we update dynamic values in shm. */
+#define XLOGPREFETCHER_STATS_SHM_DISTANCE BLCKSZ
+
+/* GUCs */
+int			recovery_prefetch = RECOVERY_PREFETCH_OFF;
+
+#ifdef USE_PREFETCH
+#define RecoveryPrefetchEnabled() (recovery_prefetch != RECOVERY_PREFETCH_OFF)
+#else
+#define RecoveryPrefetchEnabled() false
+#endif
+
+static int	XLogPrefetchReconfigureCount = 0;
+
+/*
+ * Enum used to report whether an IO should be started.
+ */
+typedef enum
+{
+	LRQ_NEXT_NO_IO,
+	LRQ_NEXT_IO,
+	LRQ_NEXT_AGAIN
+} LsnReadQueueNextStatus;
+
+/*
+ * Type of callback that can decide which block to prefetch next.  For now
+ * there is only one.
+ */
+typedef LsnReadQueueNextStatus (*LsnReadQueueNextFun) (uintptr_t lrq_private,
+													   XLogRecPtr *lsn);
+
+/*
+ * A simple circular queue of LSNs, used to control the number of
+ * (potentially) inflight IOs.  This stands in for a later more general IO
+ * control mechanism, which is why it has the apparently unnecessary
+ * indirection through a function pointer.
+ */
+typedef struct LsnReadQueue
+{
+	LsnReadQueueNextFun next;
+	uintptr_t	lrq_private;
+	uint32		max_inflight;
+	uint32		inflight;
+	uint32		completed;
+	uint32		head;
+	uint32		tail;
+	uint32		size;
+	struct
+	{
+		bool		io;
+		XLogRecPtr	lsn;
+	}			queue[FLEXIBLE_ARRAY_MEMBER];
+} LsnReadQueue;
+
+/*
+ * A prefetcher.  This is a mechanism that wraps an XLogReader, prefetching
+ * blocks that will soon be referenced, to try to avoid IO stalls.
+ */
+struct XLogPrefetcher
+{
+	/* WAL reader and current reading state. */
+	XLogReaderState *reader;
+	DecodedXLogRecord *record;
+	int			next_block_id;
+
+	/* When to publish stats. */
+	XLogRecPtr	next_stats_shm_lsn;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB	   *filter_table;
+	dlist_head	filter_queue;
+
+	/* Book-keeping for readahead barriers. */
+	XLogRecPtr	no_readahead_until;
+
+	/* IO depth manager. */
+	LsnReadQueue *streaming_read;
+
+	XLogRecPtr	begin_ptr;
+
+	int			reconfigure_count;
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that (we assume) have already been dropped, or will be created by bulk WAL
+ * operators.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode rnode;
+	XLogRecPtr	filter_until_replayed;
+	BlockNumber filter_from_block;
+	dlist_node	link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time;	/* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 hit;		/* Blocks already in cache. */
+	pg_atomic_uint64 skip_init; /* Zero-inited blocks skipped. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32		reset_handled;
+
+	/* Dynamic values */
+	int			wal_distance;	/* Number of WAL bytes ahead. */
+	int			block_distance; /* Number of block references ahead. */
+	int			io_depth;		/* Number of I/Os in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static LsnReadQueueNextStatus XLogPrefetcherNextBlock(uintptr_t pgsr_private,
+													  XLogRecPtr *lsn);
+
+static XLogPrefetchStats *SharedStats;
+
+static inline LsnReadQueue *
+lrq_alloc(uint32 max_distance,
+		  uint32 max_inflight,
+		  uintptr_t lrq_private,
+		  LsnReadQueueNextFun next)
+{
+	LsnReadQueue *lrq;
+	uint32		size;
+
+	Assert(max_distance >= max_inflight);
+
+	size = max_distance + 1;	/* full ring buffer has a gap */
+	lrq = palloc(offsetof(LsnReadQueue, queue) + sizeof(lrq->queue[0]) * size);
+	lrq->lrq_private = lrq_private;
+	lrq->max_inflight = max_inflight;
+	lrq->size = size;
+	lrq->next = next;
+	lrq->head = 0;
+	lrq->tail = 0;
+	lrq->inflight = 0;
+	lrq->completed = 0;
+
+	return lrq;
+}
+
+static inline void
+lrq_free(LsnReadQueue *lrq)
+{
+	pfree(lrq);
+}
+
+static inline uint32
+lrq_inflight(LsnReadQueue *lrq)
+{
+	return lrq->inflight;
+}
+
+static inline uint32
+lrq_completed(LsnReadQueue *lrq)
+{
+	return lrq->completed;
+}
+
+static inline void
+lrq_prefetch(LsnReadQueue *lrq)
+{
+	/* Try to start as many IOs as we can within our limits. */
+	while (lrq->inflight < lrq->max_inflight &&
+		   lrq->inflight + lrq->completed < lrq->size - 1)
+	{
+		Assert(((lrq->head + 1) % lrq->size) != lrq->tail);
+		switch (lrq->next(lrq->lrq_private, &lrq->queue[lrq->head].lsn))
+		{
+			case LRQ_NEXT_AGAIN:
+				return;
+			case LRQ_NEXT_IO:
+				lrq->queue[lrq->head].io = true;
+				lrq->inflight++;
+				break;
+			case LRQ_NEXT_NO_IO:
+				lrq->queue[lrq->head].io = false;
+				lrq->completed++;
+				break;
+		}
+		lrq->head++;
+		if (lrq->head == lrq->size)
+			lrq->head = 0;
+	}
+}
+
+static inline void
+lrq_complete_lsn(LsnReadQueue *lrq, XLogRecPtr lsn)
+{
+	/*
+	 * We know that LSNs before 'lsn' have been replayed, so we can now assume
+	 * that any IOs that were started before then have finished.
+	 */
+	while (lrq->tail != lrq->head &&
+		   lrq->queue[lrq->tail].lsn < lsn)
+	{
+		if (lrq->queue[lrq->tail].io)
+			lrq->inflight--;
+		else
+			lrq->completed--;
+		lrq->tail++;
+		if (lrq->tail == lrq->size)
+			lrq->tail = 0;
+	}
+	if (RecoveryPrefetchEnabled())
+		lrq_prefetch(lrq);
+}
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&SharedStats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&SharedStats->prefetch, 0);
+	pg_atomic_write_u64(&SharedStats->hit, 0);
+	pg_atomic_write_u64(&SharedStats->skip_init, 0);
+	pg_atomic_write_u64(&SharedStats->skip_new, 0);
+	pg_atomic_write_u64(&SharedStats->skip_fpw, 0);
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	SharedStats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+
+	if (!found)
+	{
+		pg_atomic_init_u32(&SharedStats->reset_request, 0);
+		SharedStats->reset_handled = 0;
+
+		pg_atomic_init_u64(&SharedStats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&SharedStats->prefetch, 0);
+		pg_atomic_init_u64(&SharedStats->hit, 0);
+		pg_atomic_init_u64(&SharedStats->skip_init, 0);
+		pg_atomic_init_u64(&SharedStats->skip_new, 0);
+		pg_atomic_init_u64(&SharedStats->skip_fpw, 0);
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&SharedStats->reset_request, 1);
+}
+
+/*
+ * Increment a counter in shared memory.  This is equivalent to *counter++ on a
+ * plain uint64 without any memory barrier or locking, except on platforms
+ * where readers can't read uint64 without possibly observing a torn value.
+ */
+static inline void
+XLogPrefetchIncrement(pg_atomic_uint64 *counter)
+{
+	Assert(AmStartupProcess() || !IsUnderPostmaster);
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL records.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogReaderState *reader)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	prefetcher = palloc0(sizeof(XLogPrefetcher));
+
+	prefetcher->reader = reader;
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	SharedStats->wal_distance = 0;
+	SharedStats->block_distance = 0;
+	SharedStats->io_depth = 0;
+
+	/* First usage will cause streaming_read to be allocated. */
+	prefetcher->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	lrq_free(prefetcher->streaming_read);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Provide access to the reader.
+ */
+XLogReaderState *
+XLogPrefetcherReader(XLogPrefetcher *prefetcher)
+{
+	return prefetcher->reader;
+}
+
+static void
+XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher, XLogRecPtr lsn)
+{
+	uint32		io_depth;
+	uint32		completed;
+	uint32		reset_request;
+	int64		wal_distance;
+
+
+	/* How far ahead of replay are we now? */
+	if (prefetcher->record)
+		wal_distance = prefetcher->record->lsn - prefetcher->reader->record->lsn;
+	else
+		wal_distance = 0;
+
+	/* How many IOs are currently in flight and completed? */
+	io_depth = lrq_inflight(prefetcher->streaming_read);
+	completed = lrq_completed(prefetcher->streaming_read);
+
+	/* Update the instantaneous stats visible in pg_stat_prefetch_recovery. */
+	SharedStats->io_depth = io_depth;
+	SharedStats->block_distance = io_depth + completed;
+	SharedStats->wal_distance = wal_distance;
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&SharedStats->reset_request);
+	if (reset_request != SharedStats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		SharedStats->reset_handled = reset_request;
+	}
+
+	prefetcher->next_stats_shm_lsn = lsn + XLOGPREFETCHER_STATS_SHM_DISTANCE;
+}
+
+/*
+ * A callback that reads ahead in the WAL and tries to initiate one IO.
+ */
+static LsnReadQueueNextStatus
+XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
+{
+	XLogPrefetcher *prefetcher = (XLogPrefetcher *) pgsr_private;
+	XLogReaderState *reader = prefetcher->reader;
+	XLogRecPtr	replaying_lsn = reader->ReadRecPtr;
+
+	/*
+	 * We keep track of the record and block we're up to between calls with
+	 * prefetcher->record and prefetcher->next_block_id.
+	 */
+	for (;;)
+	{
+		DecodedXLogRecord *record;
+
+		/* Try to read a new future record, if we don't already have one. */
+		if (prefetcher->record == NULL)
+		{
+			bool		nonblocking;
+
+			/*
+			 * If there are already records or an error queued up that could
+			 * be replayed, we don't want to block here.  Otherwise, it's OK
+			 * to block waiting for more data: presumably the caller has
+			 * nothing else to do.
+			 */
+			nonblocking = XLogReaderHasQueuedRecordOrError(reader);
+
+			/* Certain records act as barriers for all readahead. */
+			if (nonblocking && replaying_lsn < prefetcher->no_readahead_until)
+				return LRQ_NEXT_AGAIN;
+
+			record = XLogReadAhead(prefetcher->reader, nonblocking);
+			if (record == NULL)
+			{
+				/*
+				 * We can't read any more, due to an error or lack of data in
+				 * nonblocking mode.
+				 */
+				return LRQ_NEXT_AGAIN;
+			}
+
+			/*
+			 * If prefetching is disabled, we don't need to analyze the record
+			 * or issue any prefetches.  We just need to cause one record to
+			 * be decoded.
+			 */
+			if (!RecoveryPrefetchEnabled())
+			{
+				*lsn = InvalidXLogRecPtr;
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* We have a new record to process. */
+			prefetcher->record = record;
+			prefetcher->next_block_id = 0;
+		}
+		else
+		{
+			/* Continue to process from last call, or last loop. */
+			record = prefetcher->record;
+		}
+
+		/*
+		 * Check for operations that require us to filter out block ranges, or
+		 * stop readahead completely.
+		 *
+		 * XXX Perhaps this information could be derived automatically if we
+		 * had some standardized header flags and fields for these fields,
+		 * instead of special logic.
+		 *
+		 * XXX Are there other operations that need this treatment?
+		 */
+		if (replaying_lsn < record->lsn)
+		{
+			uint8		rmid = record->header.xl_rmid;
+			uint8		record_type = record->header.xl_info & ~XLR_INFO_MASK;
+
+			if (rmid == RM_XLOG_ID)
+			{
+				if (record_type == XLOG_CHECKPOINT_SHUTDOWN ||
+					record_type == XLOG_END_OF_RECOVERY)
+				{
+					/*
+					 * These records might change the TLI.  Avoid potential
+					 * bugs if we were to allow "read TLI" and "replay TLI" to
+					 * differ without more analysis.
+					 */
+					prefetcher->no_readahead_until = record->lsn;
+				}
+			}
+			else if (rmid == RM_DBASE_ID)
+			{
+				if (record_type == XLOG_DBASE_CREATE)
+				{
+					xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *)
+					record->main_data;
+					RelFileNode rnode = {InvalidOid, xlrec->db_id, InvalidOid};
+
+					/*
+					 * Don't try to prefetch anything in this database until
+					 * it has been created, or we might confuse blocks on OID
+					 * wraparound.  (We could use XLOG_DBASE_DROP instead, but
+					 * there shouldn't be any reference to blocks in a
+					 * database between DROP and CREATE for the same OID, and
+					 * doing it on CREATE avoids the more expensive
+					 * ENOENT-handling if we didn't treat CREATE as a
+					 * barrier).
+					 */
+					XLogPrefetcherAddFilter(prefetcher, rnode, 0, record->lsn);
+				}
+			}
+			else if (rmid == RM_SMGR_ID)
+			{
+				if (record_type == XLOG_SMGR_CREATE)
+				{
+					xl_smgr_create *xlrec = (xl_smgr_create *)
+					record->main_data;
+
+					/*
+					 * Don't prefetch anything for this whole relation until
+					 * it has been created, or we might confuse blocks on OID
+					 * wraparound.
+					 */
+					XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+											record->lsn);
+				}
+				else if (record_type == XLOG_SMGR_TRUNCATE)
+				{
+					xl_smgr_truncate *xlrec = (xl_smgr_truncate *)
+					record->main_data;
+
+					/*
+					 * Don't prefetch anything in the truncated range until
+					 * the truncation has been performed.
+					 */
+					XLogPrefetcherAddFilter(prefetcher, xlrec->rnode,
+											xlrec->blkno,
+											record->lsn);
+				}
+			}
+		}
+
+		/* Scan the block references, starting where we left off last time. */
+		while (prefetcher->next_block_id <= record->max_block_id)
+		{
+			int			block_id = prefetcher->next_block_id++;
+			DecodedBkpBlock *block = &record->blocks[block_id];
+			SMgrRelation reln;
+			PrefetchBufferResult result;
+
+			if (!block->in_use)
+				continue;
+
+			Assert(!BufferIsValid(block->prefetch_buffer));
+
+			/*
+			 * Record the LSN of this record.  When it's replayed,
+			 * LsnReadQueue will consider any IOs submitted for earlier LSNs
+			 * to be finished.
+			 */
+			*lsn = record->lsn;
+
+			/* We don't try to prefetch anything but the main fork for now. */
+			if (block->forknum != MAIN_FORKNUM)
+			{
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/*
+			 * If there is a full page image attached, we won't be reading the
+			 * page, so don't bother trying to prefetch.
+			 */
+			if (block->has_image)
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_fpw);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* There is no point in reading a page that will be zeroed. */
+			if (block->flags & BKPBLOCK_WILL_INIT)
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_init);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* Should we skip prefetching this block due to a filter? */
+			if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/*
+			 * We could try to have a fast path for repeated references to the
+			 * same relation (with some scheme to handle invalidations
+			 * safely), but for now we'll call smgropen() every time.
+			 */
+			reln = smgropen(block->rnode, InvalidBackendId);
+
+			/*
+			 * If the block is past the end of the relation, filter out
+			 * further accesses until this record is replayed.
+			 */
+			if (block->blkno >= smgrnblocks(reln, block->forknum))
+			{
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+										record->lsn);
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* Try to initiate prefetching. */
+			result = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+			if (BufferIsValid(result.recent_buffer))
+			{
+				/* Cache hit, nothing to do. */
+				XLogPrefetchIncrement(&SharedStats->hit);
+				block->prefetch_buffer = result.recent_buffer;
+				return LRQ_NEXT_NO_IO;
+			}
+			else if (result.initiated_io)
+			{
+				/* Cache miss, I/O (presumably) started. */
+				XLogPrefetchIncrement(&SharedStats->prefetch);
+				block->prefetch_buffer = InvalidBuffer;
+				return LRQ_NEXT_IO;
+			}
+			else
+			{
+				/*
+				 * Neither cached nor initiated.  The underlying segment file
+				 * doesn't exist. (ENOENT)
+				 *
+				 * It might be missing because it was unlinked, we crashed,
+				 * and now we're replaying WAL.  Recovery will correct this
+				 * problem or complain if something is wrong.
+				 */
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+										record->lsn);
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+		}
+
+		/*
+		 * Several callsites need to be able to read exactly one record
+		 * without any internal readahead.  Examples: xlog.c reading
+		 * checkpoint records with emode set to PANIC, which might otherwise
+		 * cause XLogPageRead() to panic on some future page, and xlog.c
+		 * determining where to start writing WAL next, which depends on the
+		 * contents of the reader's internal buffer after reading one record.
+		 * Therefore, don't even think about prefetching until the first
+		 * record after XLogPrefetcherBeginRead() has been consumed.
+		 */
+		if (prefetcher->reader->decode_queue_tail &&
+			prefetcher->reader->decode_queue_tail->lsn == prefetcher->begin_ptr)
+			return LRQ_NEXT_AGAIN;
+
+		/* Advance to the next record. */
+		prefetcher->record = NULL;
+	}
+	pg_unreachable();
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 9
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (pg_atomic_read_u32(&SharedStats->reset_request) != SharedStats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&SharedStats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_init));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_new));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_fpw));
+	values[6] = Int32GetDatum(SharedStats->wal_distance);
+	values[7] = Int32GetDatum(SharedStats->block_distance);
+	values[8] = Int32GetDatum(SharedStats->io_depth);
+	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the lower of the block numbers
+		 * there because we don't want to have to track individual blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+		filter->filter_from_block = Min(filter->filter_from_block, blockno);
+	}
+}
+
+/*
+ * Have we replayed any records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can stop filtering out accesses to a given
+ * relfilenode.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of
+	 * the time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter;
+
+		/* See if the block range is filtered. */
+		filter = hash_search(prefetcher->filter_table, &rnode, HASH_FIND, NULL);
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+
+		/* See if the whole database is filtered. */
+		rnode.relNode = InvalidOid;
+		filter = hash_search(prefetcher->filter_table, &rnode, HASH_FIND, NULL);
+		if (filter && filter->filter_from_block <= blockno)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * A wrapper for XLogBeginRead() that also resets the prefetcher.
+ */
+void
+XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher, XLogRecPtr recPtr)
+{
+	/* This will forget about any in-flight IO. */
+	prefetcher->reconfigure_count--;
+
+	/* Book-keeping to avoid readahead on first read. */
+	prefetcher->begin_ptr = recPtr;
+
+	prefetcher->no_readahead_until = 0;
+
+	/* This will forget about any queued up records in the decoder. */
+	XLogBeginRead(prefetcher->reader, recPtr);
+}
+
+/*
+ * A wrapper for XLogReadRecord() that provides the same interface, but also
+ * tries to initiate I/O for blocks referenced in future WAL records.
+ */
+XLogRecord *
+XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
+{
+	DecodedXLogRecord *record;
+
+	/*
+	 * See if it's time to reset the prefetching machinery, because a relevant
+	 * GUC was changed.
+	 */
+	if (unlikely(XLogPrefetchReconfigureCount != prefetcher->reconfigure_count))
+	{
+		if (prefetcher->streaming_read)
+			lrq_free(prefetcher->streaming_read);
+
+		/*
+		 * Arbitrarily look up to 4 times further ahead than the number of IOs
+		 * we're allowed to run concurrently.
+		 */
+		prefetcher->streaming_read =
+			lrq_alloc(RecoveryPrefetchEnabled() ? maintenance_io_concurrency * 4 : 1,
+					  RecoveryPrefetchEnabled() ? maintenance_io_concurrency : 1,
+					  (uintptr_t) prefetcher,
+					  XLogPrefetcherNextBlock);
+
+		prefetcher->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	/*
+	 * Release last returned record, if there is one.  We need to do this so
+	 * that we can check for empty decode queue accurately.
+	 */
+	XLogReleasePreviousRecord(prefetcher->reader);
+
+	/* If there's nothing queued yet, then start prefetching. */
+	if (!XLogReaderHasQueuedRecordOrError(prefetcher->reader))
+		lrq_prefetch(prefetcher->streaming_read);
+
+	/* Read the next record. */
+	record = XLogNextRecord(prefetcher->reader, errmsg);
+	if (!record)
+		return NULL;
+
+	/*
+	 * The record we just got is the "current" one, for the benefit of the
+	 * XLogRecXXX() macros.
+	 */
+	Assert(record == prefetcher->reader->record);
+
+	/*
+	 * Can we drop any prefetch filters yet, given the record we're about to
+	 * return?  This assumes that any records with earlier LSNs have been
+	 * replayed, so if we were waiting for a relation to be created or
+	 * extended, it is now OK to access blocks in the covered range.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, record->lsn);
+
+	/*
+	 * See if it's time to compute some statistics, because enough WAL has
+	 * been processed.
+	 */
+	if (unlikely(record->lsn >= prefetcher->next_stats_shm_lsn))
+		XLogPrefetcherComputeStats(prefetcher, record->lsn);
+
+	/*
+	 * The caller is about to replay this record, so we can now report that
+	 * all IO initiated because of early WAL must be finished. This may
+	 * trigger more readahead.
+	 */
+	lrq_complete_lsn(prefetcher->streaming_read, record->lsn);
+
+	Assert(record == prefetcher->reader->record);
+
+	return &record->header;
+}
+
+bool
+check_recovery_prefetch(int *new_value, void **extra, GucSource source)
+{
+#ifndef USE_PREFETCH
+	if (*new_value == RECOVERY_PREFETCH_ON)
+	{
+		GUC_check_errdetail("recovery_prefetch is not supported on platforms that lack posix_fadvise().");
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_recovery_prefetch(int new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e437c42992..8800e88ad0 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1727,6 +1727,8 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
 
+			blk->prefetch_buffer = InvalidBuffer;
+
 			COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 			/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
 			if (blk->has_data && blk->data_len == 0)
@@ -1933,6 +1935,15 @@ err:
 bool
 XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 				   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+	return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL);
+}
+
+bool
+XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+					RelFileNode *rnode, ForkNumber *forknum,
+					BlockNumber *blknum,
+					Buffer *prefetch_buffer)
 {
 	DecodedBkpBlock *bkpb;
 
@@ -1947,6 +1958,8 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 		*forknum = bkpb->forknum;
 	if (blknum)
 		*blknum = bkpb->blkno;
+	if (prefetch_buffer)
+		*prefetch_buffer = bkpb->prefetch_buffer;
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 9feea3e6ec..e5e7821c79 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -36,6 +36,7 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
@@ -183,6 +184,9 @@ static bool doRequestWalReceiverReply;
 /* XLogReader object used to parse the WAL records */
 static XLogReaderState *xlogreader = NULL;
 
+/* XLogPrefetcher object used to consume WAL records with read-ahead */
+static XLogPrefetcher *xlogprefetcher = NULL;
+
 /* Parameters passed down from ReadRecord to the XLogPageRead callback. */
 typedef struct XLogPageReadPrivate
 {
@@ -404,18 +408,21 @@ static void recoveryPausesHere(bool endOfRecovery);
 static bool recoveryApplyDelay(XLogReaderState *record);
 static void ConfirmRecoveryPaused(void);
 
-static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
-							  int emode, bool fetching_ckpt, TimeLineID replayTLI);
+static XLogRecord *ReadRecord(XLogPrefetcher *xlogprefetcher,
+							  int emode, bool fetching_ckpt,
+							  TimeLineID replayTLI);
 
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
-static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-										bool fetching_ckpt,
-										XLogRecPtr tliRecPtr,
-										TimeLineID replayTLI,
-										XLogRecPtr replayLSN);
+static XLogPageReadResult WaitForWALToBecomeAvailable(XLogRecPtr RecPtr,
+													  bool randAccess,
+													  bool fetching_ckpt,
+													  XLogRecPtr tliRecPtr,
+													  TimeLineID replayTLI,
+													  XLogRecPtr replayLSN,
+													  bool nonblocking);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
-static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+static XLogRecord *ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher, XLogRecPtr RecPtr,
 										int whichChkpt, bool report, TimeLineID replayTLI);
 static bool rescanLatestTimeLine(TimeLineID replayTLI, XLogRecPtr replayLSN);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
@@ -561,6 +568,15 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 				 errdetail("Failed while allocating a WAL reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Set the WAL decode buffer size.  This limits how far ahead we can read
+	 * in the WAL.
+	 */
+	XLogReaderSetDecodeBuffer(xlogreader, NULL, wal_decode_buffer_size);
+
+	/* Create a WAL prefetcher. */
+	xlogprefetcher = XLogPrefetcherAllocate(xlogreader);
+
 	/*
 	 * Allocate two page buffers dedicated to WAL consistency checks.  We do
 	 * it this way, rather than just making static arrays, for two reasons:
@@ -589,7 +605,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
 		 */
-		record = ReadCheckpointRecord(xlogreader, CheckPointLoc, 0, true, CheckPointTLI);
+		record = ReadCheckpointRecord(xlogprefetcher, CheckPointLoc, 0, true,
+									  CheckPointTLI);
 		if (record != NULL)
 		{
 			memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
@@ -607,8 +624,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			 */
 			if (checkPoint.redo < CheckPointLoc)
 			{
-				XLogBeginRead(xlogreader, checkPoint.redo);
-				if (!ReadRecord(xlogreader, LOG, false,
+				XLogPrefetcherBeginRead(xlogprefetcher, checkPoint.redo);
+				if (!ReadRecord(xlogprefetcher, LOG, false,
 								checkPoint.ThisTimeLineID))
 					ereport(FATAL,
 							(errmsg("could not find redo location referenced by checkpoint record"),
@@ -727,7 +744,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		CheckPointTLI = ControlFile->checkPointCopy.ThisTimeLineID;
 		RedoStartLSN = ControlFile->checkPointCopy.redo;
 		RedoStartTLI = ControlFile->checkPointCopy.ThisTimeLineID;
-		record = ReadCheckpointRecord(xlogreader, CheckPointLoc, 1, true,
+		record = ReadCheckpointRecord(xlogprefetcher, CheckPointLoc, 1, true,
 									  CheckPointTLI);
 		if (record != NULL)
 		{
@@ -1403,8 +1420,8 @@ FinishWalRecovery(void)
 		lastRec = XLogRecoveryCtl->lastReplayedReadRecPtr;
 		lastRecTLI = XLogRecoveryCtl->lastReplayedTLI;
 	}
-	XLogBeginRead(xlogreader, lastRec);
-	(void) ReadRecord(xlogreader, PANIC, false, lastRecTLI);
+	XLogPrefetcherBeginRead(xlogprefetcher, lastRec);
+	(void) ReadRecord(xlogprefetcher, PANIC, false, lastRecTLI);
 	endOfLog = xlogreader->EndRecPtr;
 
 	/*
@@ -1501,6 +1518,8 @@ ShutdownWalRecovery(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	XLogPrefetcherFree(xlogprefetcher);
+
 	if (ArchiveRecoveryRequested)
 	{
 		/*
@@ -1584,15 +1603,15 @@ PerformWalRecovery(void)
 	{
 		/* back up to find the record */
 		replayTLI = RedoStartTLI;
-		XLogBeginRead(xlogreader, RedoStartLSN);
-		record = ReadRecord(xlogreader, PANIC, false, replayTLI);
+		XLogPrefetcherBeginRead(xlogprefetcher, RedoStartLSN);
+		record = ReadRecord(xlogprefetcher, PANIC, false, replayTLI);
 	}
 	else
 	{
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogreader, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1706,7 +1725,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1922,6 +1941,9 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 		 */
 		if (AllowCascadeReplication())
 			WalSndWakeup();
+
+		/* Reset the prefetcher. */
+		XLogPrefetchReconfigure();
 	}
 }
 
@@ -2302,7 +2324,8 @@ verifyBackupPageConsistency(XLogReaderState *record)
 		 * temporary page.
 		 */
 		buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									 RBM_NORMAL_NO_LOG);
+									 RBM_NORMAL_NO_LOG,
+									 InvalidBuffer);
 		if (!BufferIsValid(buf))
 			continue;
 
@@ -2914,17 +2937,18 @@ ConfirmRecoveryPaused(void)
  * Attempt to read the next XLOG record.
  *
  * Before first call, the reader needs to be positioned to the first record
- * by calling XLogBeginRead().
+ * by calling XLogPrefetcherBeginRead().
  *
  * If no valid record is available, returns NULL, or fails if emode is PANIC.
  * (emode must be either PANIC, LOG). In standby mode, retries until a valid
  * record is available.
  */
 static XLogRecord *
-ReadRecord(XLogReaderState *xlogreader, int emode,
+ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		   bool fetching_ckpt, TimeLineID replayTLI)
 {
 	XLogRecord *record;
+	XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher);
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 
 	/* Pass through parameters to XLogPageRead */
@@ -2940,7 +2964,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	{
 		char	   *errormsg;
 
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
 		{
 			/*
@@ -3073,6 +3097,9 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
  * and call XLogPageRead() again with the same arguments. This lets
  * XLogPageRead() to try fetching the record from another source, or to
  * sleep and retry.
+ *
+ * While prefetching, xlogreader->nonblocking may be set.  In that case,
+ * return XLREAD_WOULDBLOCK if we'd otherwise have to wait.
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
@@ -3122,20 +3149,31 @@ retry:
 		(readSource == XLOG_FROM_STREAM &&
 		 flushedUpto < targetPagePtr + reqLen))
 	{
-		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
-										 private->randAccess,
-										 private->fetching_ckpt,
-										 targetRecPtr,
-										 private->replayTLI,
-										 xlogreader->EndRecPtr))
+		if (readFile >= 0 &&
+			xlogreader->nonblocking &&
+			readSource == XLOG_FROM_STREAM &&
+			flushedUpto < targetPagePtr + reqLen)
+			return XLREAD_WOULDBLOCK;
+
+		switch (WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
+											private->randAccess,
+											private->fetching_ckpt,
+											targetRecPtr,
+											private->replayTLI,
+											xlogreader->EndRecPtr,
+											xlogreader->nonblocking))
 		{
-			if (readFile >= 0)
-				close(readFile);
-			readFile = -1;
-			readLen = 0;
-			readSource = XLOG_FROM_ANY;
-
-			return -1;
+			case XLREAD_WOULDBLOCK:
+				return XLREAD_WOULDBLOCK;
+			case XLREAD_FAIL:
+				if (readFile >= 0)
+					close(readFile);
+				readFile = -1;
+				readLen = 0;
+				readSource = XLOG_FROM_ANY;
+				return XLREAD_FAIL;
+			case XLREAD_SUCCESS:
+				break;
 		}
 	}
 
@@ -3260,7 +3298,7 @@ next_record_is_invalid:
 	if (StandbyMode)
 		goto retry;
 	else
-		return -1;
+		return XLREAD_FAIL;
 }
 
 /*
@@ -3292,11 +3330,15 @@ next_record_is_invalid:
  * containing it (if not open already), and returns true. When end of standby
  * mode is triggered by the user, and there is no more WAL available, returns
  * false.
+ *
+ * If nonblocking is true, then give up immediately if we can't satisfy the
+ * request, returning XLREAD_WOULDBLOCK instead of waiting.
  */
-static bool
+static XLogPageReadResult
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							bool fetching_ckpt, XLogRecPtr tliRecPtr,
-							TimeLineID replayTLI, XLogRecPtr replayLSN)
+							TimeLineID replayTLI, XLogRecPtr replayLSN,
+							bool nonblocking)
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
@@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 */
 		if (lastSourceFailed)
 		{
+			/*
+			 * Don't allow any retry loops to occur during nonblocking
+			 * readahead.  Let the caller process everything that has been
+			 * decoded already first.
+			 */
+			if (nonblocking)
+				return XLREAD_WOULDBLOCK;
+
 			switch (currentSource)
 			{
 				case XLOG_FROM_ARCHIVE:
@@ -3364,7 +3414,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (StandbyMode && CheckForStandbyTrigger())
 					{
 						XLogShutdownWalRcv();
-						return false;
+						return XLREAD_FAIL;
 					}
 
 					/*
@@ -3372,7 +3422,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * and pg_wal.
 					 */
 					if (!StandbyMode)
-						return false;
+						return XLREAD_FAIL;
 
 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
@@ -3516,7 +3566,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 											  currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
 											  currentSource);
 				if (readFile >= 0)
-					return true;	/* success! */
+					return XLREAD_SUCCESS;	/* success! */
 
 				/*
 				 * Nope, not found in archive or pg_wal.
@@ -3671,11 +3721,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							/* just make sure source info is correct... */
 							readSource = XLOG_FROM_STREAM;
 							XLogReceiptSource = XLOG_FROM_STREAM;
-							return true;
+							return XLREAD_SUCCESS;
 						}
 						break;
 					}
 
+					/* In nonblocking mode, return rather than sleeping. */
+					if (nonblocking)
+						return XLREAD_WOULDBLOCK;
+
 					/*
 					 * Data not here yet. Check for trigger, then wait for
 					 * walreceiver to wake us up when new WAL arrives.
@@ -3683,13 +3737,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (CheckForStandbyTrigger())
 					{
 						/*
-						 * Note that we don't "return false" immediately here.
-						 * After being triggered, we still want to replay all
-						 * the WAL that was already streamed. It's in pg_wal
-						 * now, so we just treat this as a failure, and the
-						 * state machine will move on to replay the streamed
-						 * WAL from pg_wal, and then recheck the trigger and
-						 * exit replay.
+						 * Note that we don't return XLREAD_FAIL immediately
+						 * here. After being triggered, we still want to
+						 * replay all the WAL that was already streamed. It's
+						 * in pg_wal now, so we just treat this as a failure,
+						 * and the state machine will move on to replay the
+						 * streamed WAL from pg_wal, and then recheck the
+						 * trigger and exit replay.
 						 */
 						lastSourceFailed = true;
 						break;
@@ -3740,7 +3794,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		HandleStartupProcInterrupts();
 	}
 
-	return false;				/* not reached */
+	return XLREAD_FAIL;				/* not reached */
 }
 
 
@@ -3785,7 +3839,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
  * 1 for "primary", 0 for "other" (backup_label)
  */
 static XLogRecord *
-ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher, XLogRecPtr RecPtr,
 					 int whichChkpt, bool report, TimeLineID replayTLI)
 {
 	XLogRecord *record;
@@ -3812,8 +3866,8 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 		return NULL;
 	}
 
-	XLogBeginRead(xlogreader, RecPtr);
-	record = ReadRecord(xlogreader, LOG, true, replayTLI);
+	XLogPrefetcherBeginRead(xlogprefetcher, RecPtr);
+	record = ReadRecord(xlogprefetcher, LOG, true, replayTLI);
 
 	if (record == NULL)
 	{
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 511f2f186f..ea22577b41 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -22,6 +22,7 @@
 #include "access/timeline.h"
 #include "access/xlogrecovery.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -355,11 +356,13 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber blkno;
+	Buffer		prefetch_buffer;
 	Page		page;
 	bool		zeromode;
 	bool		willinit;
 
-	if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+	if (!XLogRecGetBlockInfo(record, block_id, &rnode, &forknum, &blkno,
+							 &prefetch_buffer))
 	{
 		/* Caller specified a bogus block_id */
 		elog(PANIC, "failed to locate backup block with ID %d", block_id);
@@ -381,7 +384,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	{
 		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
+									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
+									  prefetch_buffer);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			elog(ERROR, "failed to restore block image");
@@ -410,7 +414,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	}
 	else
 	{
-		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode);
+		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode, prefetch_buffer);
 		if (BufferIsValid(*buf))
 		{
 			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
@@ -450,6 +454,10 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  * exist, and we don't check for all-zeroes.  Thus, no log entry is made
  * to imply that the page should be dropped or truncated later.
  *
+ * Optionally, recent_buffer can be used to provide a hint about the location
+ * of the page in the buffer pool; it does not have to be correct, but avoids
+ * a buffer mapping table probe if it is.
+ *
  * NB: A redo function should normally not call this directly. To get a page
  * to modify, use XLogReadBufferForRedoExtended instead. It is important that
  * all pages modified by a WAL record are registered in the WAL records, or
@@ -457,7 +465,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  */
 Buffer
 XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-					   BlockNumber blkno, ReadBufferMode mode)
+					   BlockNumber blkno, ReadBufferMode mode,
+					   Buffer recent_buffer)
 {
 	BlockNumber lastblock;
 	Buffer		buffer;
@@ -465,6 +474,15 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 
 	Assert(blkno != P_NEW);
 
+	/* Do we have a clue where the buffer might be already? */
+	if (BufferIsValid(recent_buffer) &&
+		mode == RBM_NORMAL &&
+		ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+	{
+		buffer = recent_buffer;
+		goto recent_buffer_fast_path;
+	}
+
 	/* Open the relation at smgr level */
 	smgr = smgropen(rnode, InvalidBackendId);
 
@@ -523,6 +541,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 		}
 	}
 
+recent_buffer_fast_path:
 	if (mode == RBM_NORMAL)
 	{
 		/* check that page has been initialized */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index bb1ac30cd1..f7b4999caf 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -905,6 +905,19 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.hit,
+            s.skip_init,
+            s.skip_new,
+            s.skip_fpw,
+            s.wal_distance,
+            s.block_distance,
+            s.io_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 78c073b7c9..d41ae37090 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -211,7 +211,8 @@ XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 	blkno = fsm_logical_to_physical(addr);
 
 	/* If the page doesn't exist already, extend */
-	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR);
+	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR,
+								 InvalidBuffer);
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(buf);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index cd4ebe2fc5..17f54b153b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -119,6 +120,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, LockShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
+	size = add_size(size, XLogPrefetchShmemSize());
 	size = add_size(size, XLOGShmemSize());
 	size = add_size(size, XLogRecoveryShmemSize());
 	size = add_size(size, CLOGShmemSize());
@@ -243,6 +245,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	XLogRecoveryShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e7f0a380e6..e34821e98e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -41,6 +41,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
@@ -215,6 +216,7 @@ static bool check_effective_io_concurrency(int *newval, void **extra, GucSource
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
 static bool check_client_connection_check_interval(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -479,6 +481,19 @@ static const struct config_enum_entry huge_pages_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry recovery_prefetch_options[] = {
+	{"off", RECOVERY_PREFETCH_OFF, false},
+	{"on", RECOVERY_PREFETCH_ON, false},
+	{"try", RECOVERY_PREFETCH_TRY, false},
+	{"true", RECOVERY_PREFETCH_ON, true},
+	{"false", RECOVERY_PREFETCH_OFF, true},
+	{"yes", RECOVERY_PREFETCH_ON, true},
+	{"no", RECOVERY_PREFETCH_OFF, true},
+	{"1", RECOVERY_PREFETCH_ON, true},
+	{"0", RECOVERY_PREFETCH_OFF, true},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry force_parallel_mode_options[] = {
 	{"off", FORCE_PARALLEL_OFF, false},
 	{"on", FORCE_PARALLEL_ON, false},
@@ -2792,6 +2807,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
+			gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
+			GUC_UNIT_BYTE
+		},
+		&wal_decode_buffer_size,
+		512 * 1024, 64 * 1024, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the size of WAL files held for standby servers."),
@@ -3115,7 +3141,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -4975,6 +5002,16 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch referenced blocks during recovery."),
+			gettext_noop("Read ahead of the current replay position to find uncached blocks.")
+		},
+		&recovery_prefetch,
+		RECOVERY_PREFETCH_OFF, recovery_prefetch_options,
+		check_recovery_prefetch, assign_recovery_prefetch, NULL
+	},
+
 	{
 		{"force_parallel_mode", PGC_USERSET, DEVELOPER_OPTIONS,
 			gettext_noop("Forces use of parallel query facilities."),
@@ -12211,6 +12248,20 @@ check_client_connection_check_interval(int *newval, void **extra, GucSource sour
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4cf5b26a36..0a6c7bd83e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -241,6 +241,11 @@
 #max_wal_size = 1GB
 #min_wal_size = 80MB
 
+# - Prefetching during recovery -
+
+#wal_decode_buffer_size = 512kB		# lookahead window used for prefetching
+#recovery_prefetch = off		# prefetch pages referenced in the WAL?
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 09f6464331..1df9dd2fbe 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -50,6 +50,7 @@ extern bool *wal_consistency_checking;
 extern char *wal_consistency_checking_string;
 extern bool log_checkpoints;
 extern bool track_wal_io_timing;
+extern int	wal_decode_buffer_size;
 
 extern int	CheckPointSegments;
 
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
new file mode 100644
index 0000000000..03f0cefecd
--- /dev/null
+++ b/src/include/access/xlogprefetcher.h
@@ -0,0 +1,51 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetcher.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCHER_H
+#define XLOGPREFETCHER_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern int recovery_prefetch;
+
+/* Possible values for recovery_prefetch */
+typedef enum
+{
+	RECOVERY_PREFETCH_OFF,
+	RECOVERY_PREFETCH_ON,
+	RECOVERY_PREFETCH_TRY
+}			RecoveryPrefetchValue;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+
+extern void XLogPrefetchReconfigure(void);
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchRequestResetStats(void);
+
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogReaderState *reader);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+
+extern XLogReaderState *XLogPrefetcherReader(XLogPrefetcher *prefetcher);
+
+extern void XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
+									XLogRecPtr recPtr);
+
+extern XLogRecord *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
+											char **errmsg);
+
+#endif
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f4388cc9be..be266296d5 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -39,6 +39,7 @@
 #endif
 
 #include "access/xlogrecord.h"
+#include "storage/buf.h"
 
 /* WALOpenSegment represents a WAL segment being read. */
 typedef struct WALOpenSegment
@@ -125,6 +126,9 @@ typedef struct
 	ForkNumber	forknum;
 	BlockNumber blkno;
 
+	/* Prefetching workspace. */
+	Buffer		prefetch_buffer;
+
 	/* copy of the fork_flags field from the XLogRecordBlockHeader */
 	uint8		flags;
 
@@ -430,5 +434,9 @@ extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *
 extern bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 							   RelFileNode *rnode, ForkNumber *forknum,
 							   BlockNumber *blknum);
+extern bool XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+								RelFileNode *rnode, ForkNumber *forknum,
+								BlockNumber *blknum,
+								Buffer *prefetch_buffer);
 
 #endif							/* XLOGREADER_H */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 64708949db..ff40f96e42 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -84,7 +84,8 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 													Buffer *buf);
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-									 BlockNumber blkno, ReadBufferMode mode);
+									 BlockNumber blkno, ReadBufferMode mode,
+									 Buffer recent_buffer);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d8e8715ed1..534ad0a5fb 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6360,6 +6360,14 @@
   prorettype => 'text', proargtypes => '',
   prosrc => 'pg_get_wal_replay_pause_state' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,int4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,hit,skip_init,skip_new,skip_fpw,wal_distance,block_distance,io_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ea774968f0..c9b258508d 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -450,4 +450,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetcher.c */
+extern bool check_recovery_prefetch(int *new_value, void **extra, GucSource source);
+extern void assign_recovery_prefetch(int new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/recovery/t/027_stream_regress.pl b/src/test/recovery/t/027_stream_regress.pl
index c40951b7ba..93ef4ef436 100644
--- a/src/test/recovery/t/027_stream_regress.pl
+++ b/src/test/recovery/t/027_stream_regress.pl
@@ -19,6 +19,9 @@ $node_primary->init(allows_streaming => 1);
 $node_primary->adjust_conf('postgresql.conf', 'max_connections', '25');
 $node_primary->append_conf('postgresql.conf', 'max_prepared_transactions = 10');
 
+# Enable recovery prefetch, if available on this platform
+$node_primary->append_conf('postgresql.conf', 'recovery_prefetch = try');
+
 # WAL consistency checking is resource intensive so require opt-in with the
 # PG_TEST_EXTRA environment variable.
 if ($ENV{PG_TEST_EXTRA} &&
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac468568a1..8ad54191cd 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1857,6 +1857,16 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.hit,
+    s.skip_init,
+    s.skip_new,
+    s.skip_fpw,
+    s.wal_distance,
+    s.block_distance,
+    s.io_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, hit, skip_init, skip_new, skip_fpw, wal_distance, block_distance, io_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 93d5190508..7790573105 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1408,6 +1408,9 @@ LogicalRepWorker
 LogicalRewriteMappingData
 LogicalTape
 LogicalTapeSet
+LsnReadQueue
+LsnReadQueueNextFun
+LsnReadQueueNextStatus
 LtreeGistOptions
 LtreeSignature
 MAGIC
@@ -2946,6 +2949,10 @@ XLogPageHeaderData
 XLogPageReadCB
 XLogPageReadPrivate
 XLogPageReadResult
+XLogPrefetcher
+XLogPrefetcherFilter
+XLogPrefetchState
+XLogPrefetchStats
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.30.2

#159Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#158)
Re: WIP: WAL prefetch (another approach)

On Sun, Mar 20, 2022 at 5:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Clearly 018_wal_optimize.pl is flapping

Correction, 019_replslot_limit.pl, discussed at
/messages/by-id/83b46e5f-2a52-86aa-fa6c-8174908174b8@iki.fi
.

#160Julien Rouhaud
rjuju123@gmail.com
In reply to: Thomas Munro (#158)
Re: WIP: WAL prefetch (another approach)

Hi,

On Sun, Mar 20, 2022 at 05:36:38PM +1300, Thomas Munro wrote:

On Fri, Mar 18, 2022 at 9:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:

I'll push 0001 today to let the build farm chew on it for a few days
before moving to 0002.

Clearly 018_wal_optimize.pl is flapping and causing recoveryCheck to
fail occasionally, but that predates the above commit. I didn't
follow the existing discussion on that, so I'll try to look into that
tomorrow.

Here's a rebase of the 0002 patch, now called 0001

So I finally finished looking at this patch. Here again, AFAICS the feature is
working as expected and I didn't find any problem. I just have some minor
comments, like for the previous patch.

For the docs:

+        Whether to try to prefetch blocks that are referenced in the WAL that
+        are not yet in the buffer pool, during recovery.  Valid values are
+        <literal>off</literal> (the default), <literal>on</literal> and
+        <literal>try</literal>.  The setting <literal>try</literal> enables
+        prefetching only if the operating system provides the
+        <function>posix_fadvise</function> function, which is currently used
+        to implement prefetching.  Note that some operating systems provide the
+        function, but don't actually perform any prefetching.

Is there any reason not to change it to try? I'm wondering if some system says
that the function exists but simply raises an error if you actually try to use
it. I think that at least WSL does that for some functions.

+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.
+   By default, prefetching in recovery is disabled.
+  </para>

I think that "improving I/O performance" is a bit misleading, maybe reduce I/O
wait time or something like that? Also, I don't know if we need to be that
precise, but maybe we should say that it's the underlying kernel that will
(asynchronously) initiate the reads, and postgres will simply notifies it.

+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
+   for more information.
+  </para>

That's not the implemented behavior as far as I can see. It just prints whatever is in SharedStats
regardless of the recovery state or the prefetch_wal setting (assuming that
there's no pending reset request). Similarly, there's a mention that
pg_stat_reset_shared('wal') will reset the stats, but I don't see anything
calling XLogPrefetchRequestResetStats().

Finally, I think we should document the cumulated counters in that
view (that should get reset) and the dynamic counters (that shouldn't get
reset).
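
For reference (column names taken from the view definition in the patch),
checking the counters on a standby would just be:

    SELECT * FROM pg_stat_prefetch_recovery;

From a quick read of the code, prefetch, hit and the skip_* columns look
like the cumulated counters, while wal_distance, block_distance and
io_depth look like the dynamic ones.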

For the code:

 bool
 XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
                   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+   return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL);
+}
+
+bool
+XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+                   RelFileNode *rnode, ForkNumber *forknum,
+                   BlockNumber *blknum,
+                   Buffer *prefetch_buffer)
 {

It's missing comments on that function. XLogRecGetBlockTag comments should
probably be reworded at the same time.
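
Maybe something along these lines for the new function (just a suggestion):

    /*
     * Like XLogRecGetBlockTag(), but also return the buffer hint that the
     * prefetcher stored for this block reference via *prefetch_buffer, or
     * InvalidBuffer if there is none.
     */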

+ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
           bool fetching_ckpt, TimeLineID replayTLI)
 {
    XLogRecord *record;
+   XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher);

nit: maybe name it XLogPrefetcherGetReader()?

  * containing it (if not open already), and returns true. When end of standby
  * mode is triggered by the user, and there is no more WAL available, returns
  * false.
+ *
+ * If nonblocking is true, then give up immediately if we can't satisfy the
+ * request, returning XLREAD_WOULDBLOCK instead of waiting.
  */
-static bool
+static XLogPageReadResult
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,

The comment still mentions a couple of times returning true/false rather than
XLREAD_*, same for at least XLogPageRead().

@@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
         */
        if (lastSourceFailed)
        {
+           /*
+            * Don't allow any retry loops to occur during nonblocking
+            * readahead.  Let the caller process everything that has been
+            * decoded already first.
+            */
+           if (nonblocking)
+               return XLREAD_WOULDBLOCK;

Is that really enough? I'm wondering if the code path in ReadRecord() that
forces lastSourceFailed to False while it actually failed when switching into
archive recovery (xlogrecovery.c around line 3044) can be problematic here.

{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
GUC_UNIT_BYTE
},
&wal_decode_buffer_size,
512 * 1024, 64 * 1024, INT_MAX,

Should the max be MaxAllocSize?

+   /* Do we have a clue where the buffer might be already? */
+   if (BufferIsValid(recent_buffer) &&
+       mode == RBM_NORMAL &&
+       ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+   {
+       buffer = recent_buffer;
+       goto recent_buffer_fast_path;
+   }

Should this increment (local|shared)_blks_hit, since ReadRecentBuffer doesn't?
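
Something like this inside that fast path is what I had in mind (untested,
and assuming pgBufferUsage from executor/instrument.h is the right counter
to bump here):

    {
        /* ReadRecentBuffer() doesn't touch pgBufferUsage, so count the hit here */
        pgBufferUsage.shared_blks_hit++;
        buffer = recent_buffer;
        goto recent_buffer_fast_path;
    }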

Missed in the previous patch: XLogDecodeNextRecord() isn't a trivial function,
so some comments would be helpful.

xlogprefetcher.c:

+ * data.  XLogRecBufferForRedo() cooperates uses information stored in the
+ * decoded record to find buffers efficiently.

I'm not sure what you wanted to say here. Also, I don't see any
XLogRecBufferForRedo() anywhere, I'm assuming it's
XLogReadBufferForRedo?

+/*
+ * A callback that reads ahead in the WAL and tries to initiate one IO.
+ */
+static LsnReadQueueNextStatus
+XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)

Should there be a few more comments about what this function is supposed to
enforce?

I'm wondering if it's a bit overkill to implement this as a callback. Do you
have near future use cases in mind? For now no other code could use the
infrastructure at all as the lrq is private, so some changes will be needed to
make it truly configurable anyway.

If we keep it as a callback, I think it would make sense to extract some part,
like the main prefetch filters / global-limit logic, so other possible
implementations can use it if needed. It would also help to reduce this
function a bit, as it's somewhat long.

Also, about those filters:

+           if (rmid == RM_XLOG_ID)
+           {
+               if (record_type == XLOG_CHECKPOINT_SHUTDOWN ||
+                   record_type == XLOG_END_OF_RECOVERY)
+               {
+                   /*
+                    * These records might change the TLI.  Avoid potential
+                    * bugs if we were to allow "read TLI" and "replay TLI" to
+                    * differ without more analysis.
+                    */
+                   prefetcher->no_readahead_until = record->lsn;
+               }
+           }

Should there be a note that it's still ok to process this record in the loop
just after, as it won't contain any prefetchable data, or simply jump to the
end of that loop?

+/*
+ * Increment a counter in shared memory.  This is equivalent to *counter++ on a
+ * plain uint64 without any memory barrier or locking, except on platforms
+ * where readers can't read uint64 without possibly observing a torn value.
+ */
+static inline void
+XLogPrefetchIncrement(pg_atomic_uint64 *counter)
+{
+   Assert(AmStartupProcess() || !IsUnderPostmaster);
+   pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}

I'm curious about this one. Is it to avoid expensive locking on platforms that
don't have a lockless pg_atomic_fetch_add_u64?

Also, it's only correct because there can only be a single prefetcher, so you
can't have concurrent increment of the same counter right?

+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
[...]

This function could use the new SetSingleFuncCall() function introduced in
9e98583898c.
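
Roughly (untested sketch, assuming the interface from that commit), the
boilerplate at the top would reduce to something like:

    Datum
    pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
    {
        ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
        Datum       values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
        bool        nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];

        /* replaces the tupdesc/tuplestore/returnMode boilerplate */
        SetSingleFuncCall(fcinfo, 0);

        /* ... fill values[] / nulls[] as before ... */

        tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
        return (Datum) 0;
    }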

And finally:

diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4cf5b26a36..0a6c7bd83e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -241,6 +241,11 @@
 #max_wal_size = 1GB
 #min_wal_size = 80MB
+# - Prefetching during recovery -
+
+#wal_decode_buffer_size = 512kB        # lookahead window used for prefetching

This one should be documented as "(change requires restart)"

#161Thomas Munro
thomas.munro@gmail.com
In reply to: Julien Rouhaud (#160)
1 attachment(s)
Re: WIP: WAL prefetch (another approach)

On Mon, Mar 21, 2022 at 9:29 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

So I finally finished looking at this patch. Here again, AFAICS the feature is
working as expected and I didn't find any problem. I just have some minor
comments, like for the previous patch.

Thanks very much for the review. I've attached a new version
addressing most of your feedback, and also rebasing over the new
WAL-logged CREATE DATABASE. I've also fixed a couple of bugs (see
end).

For the docs:

+        Whether to try to prefetch blocks that are referenced in the WAL that
+        are not yet in the buffer pool, during recovery.  Valid values are
+        <literal>off</literal> (the default), <literal>on</literal> and
+        <literal>try</literal>.  The setting <literal>try</literal> enables
+        prefetching only if the operating system provides the
+        <function>posix_fadvise</function> function, which is currently used
+        to implement prefetching.  Note that some operating systems provide the
+        function, but don't actually perform any prefetching.

Is there any reason not to change it to try? I'm wondering if some systems say
that the function exists but simply raise an error if you actually try to use
it. I think that at least WSL does that for some functions.

Yeah, we could just default it to try. Whether we should ship that
way is another question, but done for now.

I don't think there are any supported systems that have a
posix_fadvise() that fails with -1, or we'd know about it, because
we already use it in other places. We do support one OS that provides
a dummy function in libc that does nothing at all (Solaris/illumos),
and at least a couple that enter the kernel but are known to do
nothing at all for WILLNEED (AIX, FreeBSD).
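
For anyone following along, the hint in question is just
posix_fadvise(POSIX_FADV_WILLNEED) on a file region.  A minimal standalone
illustration (not code from the patch, just a toy) looks like this:

#define _POSIX_C_SOURCE 200112L

#include <fcntl.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
	int			fd;
	int			rc;

	if (argc < 2)
	{
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* Hint that we'll soon read the first 8kB block, without blocking. */
	rc = posix_fadvise(fd, 0, 8192, POSIX_FADV_WILLNEED);
	if (rc != 0)
		fprintf(stderr, "posix_fadvise returned %d (may be a no-op here)\n", rc);

	return 0;
}

On the systems mentioned above, that call succeeds but does nothing, which is
why "try" can't promise an actual speed-up everywhere.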

+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can
+   be used to improve I/O performance during recovery by instructing
+   <productname>PostgreSQL</productname> to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.
+   By default, prefetching in recovery is disabled.
+  </para>

I think that "improving I/O performance" is a bit misleading, maybe reduce I/O
wait time or something like that? Also, I don't know if we need to be that
precise, but maybe we should say that it's the underlying kernel that will
(asynchronously) initiate the reads, and postgres will simply notify it.

Updated with this new text:

The <xref linkend="guc-recovery-prefetch"/> parameter can be used to reduce
I/O wait times during recovery by instructing the kernel to initiate reads
of disk blocks that will soon be needed but are not currently in
<productname>PostgreSQL</productname>'s buffer pool.

+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-recovery-prefetch"/>
+   for more information.
+  </para>

That's not the implemented behavior as far as I can see. It just prints whatever is in SharedStats
regardless of the recovery state or the prefetch_wal setting (assuming that
there's no pending reset request).

Yeah. Updated text: "It is filled with nulls if recovery has not run
or ...".

Similarly, there's a mention that
pg_stat_reset_shared('wal') will reset the stats, but I don't see anything
calling XLogPrefetchRequestResetStats().

It's 'prefetch_recovery', not 'wal', but yeah, oops, it looks like I
got carried away between v18 and v19 while simplifying the stats and
lost a hunk I should have kept. Fixed.

Finally, I think we should document which counters in that view are cumulative
(and should get reset) and which are dynamic (and shouldn't get reset).

OK, done.

For the code:

bool
XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+   return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL);
+}
+
+bool
+XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+                   RelFileNode *rnode, ForkNumber *forknum,
+                   BlockNumber *blknum,
+                   Buffer *prefetch_buffer)
{

It's missing comments on that function. XLogRecGetBlockTag comments should
probably be reworded at the same time.

New comment added for XLogRecGetBlockInfo(). Wish I could come up
with a better name for that... Not quite sure what you thought I should
change about XLogRecGetBlockTag().

+ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
bool fetching_ckpt, TimeLineID replayTLI)
{
XLogRecord *record;
+   XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher);

nit: maybe name it XLogPrefetcherGetReader()?

OK.

* containing it (if not open already), and returns true. When end of standby
* mode is triggered by the user, and there is no more WAL available, returns
* false.
+ *
+ * If nonblocking is true, then give up immediately if we can't satisfy the
+ * request, returning XLREAD_WOULDBLOCK instead of waiting.
*/
-static bool
+static XLogPageReadResult
WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,

The comment still mentions returning true/false in a couple of places rather than
XLREAD_*, same for at least XLogPageRead().

Fixed.

@@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
*/
if (lastSourceFailed)
{
+           /*
+            * Don't allow any retry loops to occur during nonblocking
+            * readahead.  Let the caller process everything that has been
+            * decoded already first.
+            */
+           if (nonblocking)
+               return XLREAD_WOULDBLOCK;

Is that really enough? I'm wondering if the code path in ReadRecord() that
forces lastSourceFailed to False while it actually failed when switching into
archive recovery (xlogrecovery.c around line 3044) can be problematic here.

I don't see the problem scenario, could you elaborate?

{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
GUC_UNIT_BYTE
},
&wal_decode_buffer_size,
512 * 1024, 64 * 1024, INT_MAX,

Should the max be MaxAllocSize?

Hmm. OK, done.

+   /* Do we have a clue where the buffer might be already? */
+   if (BufferIsValid(recent_buffer) &&
+       mode == RBM_NORMAL &&
+       ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+   {
+       buffer = recent_buffer;
+       goto recent_buffer_fast_path;
+   }

Should this increment (local|shared)_blks_hit, since ReadRecentBuffer doesn't?

Hmm. I guess ReadRecentBuffer() should really do that. Done.

Missed in the previous patch: XLogDecodeNextRecord() isn't a trivial function,
so some comments would be helpful.

OK, I'll come back to that.

xlogprefetcher.c:

+ * data.  XLogRecBufferForRedo() cooperates uses information stored in the
+ * decoded record to find buffers ently.

I'm not sure what you wanted to say here. Also, I don't see any
XLogRecBufferForRedo() anywhere, I'm assuming it's
XLogReadBufferForRedo?

Yeah, typos. I rewrote that comment.

+/*
+ * A callback that reads ahead in the WAL and tries to initiate one IO.
+ */
+static LsnReadQueueNextStatus
+XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)

Should there be a bit more commentary about what this function is supposed to
enforce?

I have added a comment to explain.

I'm wondering if it's a bit overkill to implement this as a callback. Do you
have near future use cases in mind? For now no other code could use the
infrastructure at all as the lrq is private, so some changes will be needed to
make it truly configurable anyway.

Yeah. Actually, in the next step I want to throw away the lrq part,
and keep just the XLogPrefetcherNextBlock() function, with some small
modifications.

Admittedly the control flow is a little confusing, but the point of
this architecture is to separate "how to prefetch one more thing" from
"when to prefetch, considering I/O depth and related constraints".
The first thing, "how", is represented by XLogPrefetcherNextBlock().
The second thing, "when", is represented here by the
LsnReadQueue/lrq_XXX stuff that is private in this file for now, but
later I will propose to replace that second thing with the
pg_streaming_read facility of commitfest entry 38/3316. This is a way
of getting there step by step. I also wrote briefly about that here:

/messages/by-id/CA+hUKGJ7OqpdnbSTq5oK=djSeVW2JMnrVPSm8JC-_dbN6Y7bpw@mail.gmail.com
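
To make that split a little more concrete, here is a tiny standalone sketch of
a depth-limited queue driving a "next block" callback.  All the names are made
up and it only models the control flow, not the patch's actual LsnReadQueue:

#include <stdint.h>
#include <stdio.h>

/* What the "how" callback reports back to the "when" machinery. */
typedef enum
{
	NEXT_IO,					/* an I/O was started for *lsn */
	NEXT_NO_IO,					/* nothing to do for *lsn */
	NEXT_AGAIN					/* no more data available for now */
} NextStatus;

typedef NextStatus (*NextFun) (uintptr_t arg, uint64_t *lsn);

/* The "when" side: a trivial depth-limited driver. */
typedef struct
{
	NextFun		next;
	uintptr_t	arg;
	int			max_inflight;
	int			inflight;
} ReadQueue;

static void
queue_prefetch(ReadQueue *q)
{
	uint64_t	lsn;

	while (q->inflight < q->max_inflight)
	{
		switch (q->next(q->arg, &lsn))
		{
			case NEXT_AGAIN:
				return;
			case NEXT_IO:
				printf("started I/O for block referenced at LSN %llu\n",
					   (unsigned long long) lsn);
				q->inflight++;
				break;
			case NEXT_NO_IO:
				printf("no I/O needed at LSN %llu\n",
					   (unsigned long long) lsn);
				break;
		}
	}
}

/* The "how" side: pretend even LSNs miss the cache and need an I/O. */
static NextStatus
demo_next(uintptr_t arg, uint64_t *lsn)
{
	uint64_t   *cursor = (uint64_t *) arg;

	if (*cursor >= 10)
		return NEXT_AGAIN;
	*lsn = (*cursor)++;
	return (*lsn % 2 == 0) ? NEXT_IO : NEXT_NO_IO;
}

int
main(void)
{
	uint64_t	cursor = 0;
	ReadQueue	q = {demo_next, (uintptr_t) &cursor, 3, 0};

	queue_prefetch(&q);			/* stops once 3 I/Os are in flight */
	q.inflight = 0;				/* pretend replay consumed those LSNs */
	queue_prefetch(&q);			/* picks up where the callback left off */
	return 0;
}

The idea is that only the driver knows about I/O depth limits, and only the
callback knows how to decode WAL and decide what's worth prefetching.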

If we keep it as a callback, I think it would make sense to extract some part,
like the main prefetch filters / global-limit logic, so other possible
implementations can use it if needed. It would also help to reduce this
function a bit, as it's somewhat long.

I can't imagine reusing any of those filtering things anywhere else.
I admit that the function is kinda long...

Also, about those filters:

+           if (rmid == RM_XLOG_ID)
+           {
+               if (record_type == XLOG_CHECKPOINT_SHUTDOWN ||
+                   record_type == XLOG_END_OF_RECOVERY)
+               {
+                   /*
+                    * These records might change the TLI.  Avoid potential
+                    * bugs if we were to allow "read TLI" and "replay TLI" to
+                    * differ without more analysis.
+                    */
+                   prefetcher->no_readahead_until = record->lsn;
+               }
+           }

Should there be a note that it's still ok to process this record in the loop
just after, as it won't contain any prefetchable data, or simply jump to the
end of that loop?

Comment added.

+/*
+ * Increment a counter in shared memory.  This is equivalent to *counter++ on a
+ * plain uint64 without any memory barrier or locking, except on platforms
+ * where readers can't read uint64 without possibly observing a torn value.
+ */
+static inline void
+XLogPrefetchIncrement(pg_atomic_uint64 *counter)
+{
+   Assert(AmStartupProcess() || !IsUnderPostmaster);
+   pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}

I'm curious about this one. Is it to avoid expensive locking on platforms that
don't have a lockless pg_atomic_fetch_add_u64?

My goal here is only to make sure that systems without
PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY don't see bogus/torn values. On
more typical systems, I just want plain old counter++, for the CPU to
feel free to reorder, without the overheads of LOCK XADD.

+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
[...]

This function could use the new SetSingleFuncCall() function introduced in
9e98583898c.

Oh, yeah, that looks much nicer!

+# - Prefetching during recovery -
+
+#wal_decode_buffer_size = 512kB        # lookahead window used for prefetching

This one should be documented as "(change requires restart)"

Done.

Other changes:

1. The logic for handling relations and blocks that don't exist
(presumably, yet) wasn't quite right. The previous version could
raise an error in smgrnblocks() if a referenced relation doesn't exist
at all on disk. I don't know how to actually reach that case
(considering the analysis this thing does of SMGR create etc to avoid
touching relations that haven't been created yet), but if it is
possible somehow, it is now handled gracefully.

To check for missing relations I use smgrexists(). To make that fast,
I changed it to not close segments when in recovery, which is OK
because recovery already closes SMGR relations when replaying anything
that would unlink files.

2. The logic for filtering out access to an entire database wasn't
quite right. In this new version, that's necessary only for
file-based CREATE DATABASE, since that does bulk creation of relations
without any individual WAL records to analyse. This works by using
{inv, dbNode, inv} as a key in the filter hash table, but I was trying
to look things up by {spcNode, dbNode, inv}. Fixed.

3. The handling for XLOG_SMGR_CREATE was firing for every fork, but
it really only needed to fire for the main fork, for now. (There's no
reason at all this thing shouldn't prefetch other forks, that's just
left for later).

4. To make it easier to see the filtering logic at work, I added code
to log messages about that if you #define XLOGPREFETCHER_DEBUG_LEVEL.
Could be extended to show more internal state and events...

5. While retesting various scenarios, it bothered me that big seq
scan UPDATEs would repeatedly issue posix_fadvise() for the same block
(because multiple rows in a page are touched by consecutive records,
and the page doesn't make it into the buffer pool until a bit later).
I resurrected the defences I had against that a few versions back
using a small window of recent prefetches, which I'd originally
developed as a way to avoid explicit prefetches of sequential scans
(prefetch 1, 2, 3, ...). That turned out to be useless superstition
based on ancient discussions in this mailing list, but I think it's
still useful to avoid obviously stupid sequences of repeat system
calls (prefetch 1, 1, 1, ...). So now it has a little one-cache-line
sized window of history, to avoid doing that.
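
To illustrate that last point, here is a simplified standalone toy of the
recent-prefetch window (the names and the fixed window size are invented for
illustration; it's not the patch's code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WINDOW_SIZE 4			/* a cache-line-ish amount of history */

typedef struct
{
	uint32_t	spc;
	uint32_t	db;
	uint32_t	rel;
} Rel;

static Rel	recent_rel[WINDOW_SIZE];
static uint32_t recent_block[WINDOW_SIZE];
static int	recent_idx;

/*
 * Return true if (rel, blkno) was prefetched very recently; otherwise
 * remember it, overwriting the oldest entry, and return false.
 */
static bool
seen_recently(Rel rel, uint32_t blkno)
{
	for (int i = 0; i < WINDOW_SIZE; i++)
	{
		if (recent_block[i] == blkno &&
			memcmp(&recent_rel[i], &rel, sizeof(Rel)) == 0)
			return true;
	}

	recent_rel[recent_idx] = rel;
	recent_block[recent_idx] = blkno;
	recent_idx = (recent_idx + 1) % WINDOW_SIZE;
	return false;
}

int
main(void)
{
	Rel			r = {1663, 5, 16384};

	/* Consecutive records touching the same heap page: only the first one
	 * would reach posix_fadvise(). */
	printf("first reference skipped? %d\n", seen_recently(r, 42));
	printf("repeat reference skipped? %d\n", seen_recently(r, 42));
	return 0;
}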

I need to re-profile a few workloads after these changes, and then
there are a couple of bikeshed-colour items:

1. It's completely arbitrary that it limits its lookahead to
maintenance_io_concurrency * 4 blockrefs ahead in the WAL. I have no
principled reason to choose 4. In the AIO version of this (to
follow), that number of blocks finishes up getting pinned at the same
time, so more thought might be needed on that, but that doesn't apply
here yet, so it's a bit arbitrary.

2. Defaults for wal_decode_buffer_size and maintenance_io_concurrency
are likewise arbitrary.

3. At some point in this long thread I was convinced to name the view
pg_stat_prefetch_recovery, but the GUC is called recovery_prefetch.
That seems silly...

Attachments:

v25-0001-Prefetch-referenced-data-in-recovery-take-II.patch (text/x-patch)
From 26ff673dfaaec48fc4fa2858fd0fec953419d317 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 20 Mar 2022 16:56:12 +1300
Subject: [PATCH v25] Prefetch referenced data in recovery, take II.

Introduce a new GUC recovery_prefetch.  When enabled, look ahead in the
WAL and try to initiate asynchronous reading of referenced data blocks
that are not yet cached in our buffer pool, during recovery.

For now, this is done with posix_fadvise(), which has several caveats.
Since not all OSes have that system call, the GUC can be set to "try" so
that it is enabled on operating systems where it is available.  For now
"try" is the default.  Better prefetching mechanisms will follow in
later work on the I/O subsystem.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size limits the maximum distance we are
prepared to read ahead in the WAL to find uncached blocks.

Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (earlier version)
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> (earlier version)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (earlier version)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> (earlier version)
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> (earlier version)
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |   64 +
 doc/src/sgml/monitoring.sgml                  |   86 +-
 doc/src/sgml/wal.sgml                         |   11 +
 src/backend/access/transam/Makefile           |    1 +
 src/backend/access/transam/xlog.c             |    2 +
 src/backend/access/transam/xlogprefetcher.c   | 1078 +++++++++++++++++
 src/backend/access/transam/xlogreader.c       |   21 +
 src/backend/access/transam/xlogrecovery.c     |  174 ++-
 src/backend/access/transam/xlogutils.c        |   27 +-
 src/backend/catalog/system_views.sql          |   14 +
 src/backend/postmaster/pgstat.c               |    8 +-
 src/backend/storage/buffer/bufmgr.c           |    4 +
 src/backend/storage/freespace/freespace.c     |    3 +-
 src/backend/storage/ipc/ipci.c                |    3 +
 src/backend/storage/smgr/md.c                 |    6 +-
 src/backend/utils/misc/guc.c                  |   53 +-
 src/backend/utils/misc/postgresql.conf.sample |    6 +
 src/include/access/xlog.h                     |    1 +
 src/include/access/xlogprefetcher.h           |   51 +
 src/include/access/xlogreader.h               |    8 +
 src/include/access/xlogutils.h                |    3 +-
 src/include/catalog/pg_proc.dat               |    8 +
 src/include/utils/guc.h                       |    4 +
 src/test/regress/expected/rules.out           |   11 +
 src/tools/pgindent/typedefs.list              |    7 +
 25 files changed, 1582 insertions(+), 72 deletions(-)
 create mode 100644 src/backend/access/transam/xlogprefetcher.c
 create mode 100644 src/include/access/xlogprefetcher.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 43e4ade83e..1183cd0ed1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3650,6 +3650,70 @@ include_dir 'conf.d'
      </variablelist>
     </sect2>
 
+   <sect2 id="runtime-config-wal-recovery">
+
+    <title>Recovery</title>
+
+     <indexterm>
+      <primary>configuration</primary>
+      <secondary>of recovery</secondary>
+      <tertiary>general settings</tertiary>
+     </indexterm>
+
+    <para>
+     This section describes the settings that apply to recovery in general,
+     affecting crash recovery, streaming replication and archive-based
+     replication.
+    </para>
+
+
+    <variablelist>
+     <varlistentry id="guc-recovery-prefetch" xreflabel="recovery_prefetch">
+      <term><varname>recovery_prefetch</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>recovery_prefetch</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to try to prefetch blocks that are referenced in the WAL that
+        are not yet in the buffer pool, during recovery.  Valid values are
+        <literal>off</literal>, <literal>on</literal> and
+        <literal>try</literal> (the default).  The setting <literal>try</literal> enables
+        prefetching only if the operating system provides the
+        <function>posix_fadvise</function> function, which is currently used
+        to implement prefetching.  Note that some operating systems provide the
+        function, but don't actually perform any prefetching.
+       </para>
+       <para>
+        Prefetching blocks that will soon be needed can reduce I/O wait times
+        during recovery with some workloads.
+        See also the <xref linkend="guc-wal-decode-buffer-size"/> and
+        <xref linkend="guc-maintenance-io-concurrency"/> settings, which limit
+        prefetching activity.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-wal-decode-buffer-size" xreflabel="wal_decode_buffer_size">
+      <term><varname>wal_decode_buffer_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_decode_buffer_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        A limit on how far ahead the server can look in the WAL, to find
+        blocks to prefetch.  If this value is specified without units, it is
+        taken as bytes.
+        The default is 512kB.
+       </para>
+      </listitem>
+     </varlistentry>
+
+    </variablelist>
+   </sect2>
+
   <sect2 id="runtime-config-wal-archive-recovery">
 
     <title>Archive Recovery</title>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3b9172f65b..4d11d6e292 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -328,6 +328,13 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
       <entry>At least one row per subscription, showing information about
@@ -2971,6 +2978,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    copy of the subscribed tables.
   </para>
 
+  <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+   <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>prefetch</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>hit</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_init</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they would be zero-initialized</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_new</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they didn't exist yet</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_fpw</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because a full page image was included in the WAL</entry>
+    </row>
+    <row>
+     <entry><structfield>skip_seq</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of blocks not prefetched because they were already recently prefetched</entry>
+    </row>
+    <row>
+     <entry><structfield>wal_distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many bytes ahead the prefetcher is looking</entry>
+    </row>
+    <row>
+     <entry><structfield>block_distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many blocks ahead the prefetcher is looking</entry>
+    </row>
+    <row>
+     <entry><structfield>io_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_prefetch_recovery</structname> view will contain
+   only one row.  It is filled with nulls if recovery has not run or
+   <xref linkend="guc-recovery-prefetch"/> is not enabled.  The
+   columns <structfield>wal_distance</structfield>,
+   <structfield>block_distance</structfield>
+   and <structfield>io_depth</structfield> show current values, and the
+   other columns show cumulative counters that can be reset
+   with the <function>pg_stat_reset_shared</function> function.
+  </para>
+
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
    <title><structname>pg_stat_subscription</structname> View</title>
    <tgroup cols="1">
@@ -5190,8 +5269,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         all the counters shown in
         the <structname>pg_stat_bgwriter</structname>
         view, <literal>archiver</literal> to reset all the counters shown in
-        the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
-        to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+        the <structname>pg_stat_archiver</structname> view,
+        <literal>wal</literal> to reset all the counters shown in the
+        <structname>pg_stat_wal</structname> view or
+        <literal>prefetch_recovery</literal> to reset all the counters shown
+        in the <structname>pg_stat_prefetch_recovery</structname> view.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 2bb27a8468..02d2b65725 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -803,6 +803,17 @@
    counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
    in <structname>pg_stat_wal</structname>, respectively.
   </para>
+
+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can be used to reduce
+   I/O wait times during recovery by instructing the kernel to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.  By default, prefetching in
+   recovery is set to <literal>try</literal>.
+  </para>
  </sect1>
 
  <sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 79314c69ab..8c17c88dfc 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
 	xlogarchive.o \
 	xlogfuncs.o \
 	xloginsert.o \
+	xlogprefetcher.o \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogutils.o
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 17a56152f1..d1e8da52c6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -59,6 +59,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
 #include "access/xloginsert.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
@@ -133,6 +134,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
+int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
new file mode 100644
index 0000000000..1398676b96
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -0,0 +1,1078 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ *		Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *		src/backend/access/transam/xlogprefetcher.c
+ *
+ * This module provides a drop-in replacement for an XLogReader that tries to
+ * minimize I/O stalls by looking up future blocks in the buffer cache, and
+ * initiating I/Os that might complete before the caller eventually needs the
+ * data.  When referenced blocks are found in the buffer pool already, the
+ * buffer is recorded in the decoded record so that XLogReadBufferForRedo()
+ * can avoid a second buffer mapping table lookup.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetcher.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_control.h"
+#include "catalog/storage_xlog.h"
+#include "commands/dbcommands_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/* Every time we process this much WAL, we update dynamic values in shm. */
+#define XLOGPREFETCHER_STATS_SHM_DISTANCE BLCKSZ
+
+/*
+ * To detect repeat access to the same block and skip useless extra system
+ * calls, we remember a small window of recently prefetched blocks.
+ */
+#define XLOGPREFETCHER_SEQ_WINDOW_SIZE 4
+
+/* Define to log internal debugging messages. */
+/* #define XLOGPREFETCHER_DEBUG_LEVEL LOG */
+
+/* GUCs */
+int			recovery_prefetch = RECOVERY_PREFETCH_TRY;
+
+#ifdef USE_PREFETCH
+#define RecoveryPrefetchEnabled() (recovery_prefetch != RECOVERY_PREFETCH_OFF)
+#else
+#define RecoveryPrefetchEnabled() false
+#endif
+
+static int	XLogPrefetchReconfigureCount = 0;
+
+/*
+ * Enum used to report whether an IO should be started.
+ */
+typedef enum
+{
+	LRQ_NEXT_NO_IO,
+	LRQ_NEXT_IO,
+	LRQ_NEXT_AGAIN
+} LsnReadQueueNextStatus;
+
+/*
+ * Type of callback that can decide which block to prefetch next.  For now
+ * there is only one.
+ */
+typedef LsnReadQueueNextStatus (*LsnReadQueueNextFun) (uintptr_t lrq_private,
+													   XLogRecPtr *lsn);
+
+/*
+ * A simple circular queue of LSNs, used to control the number of
+ * (potentially) inflight IOs.  This stands in for a later more general IO
+ * control mechanism, which is why it has the apparently unnecessary
+ * indirection through a function pointer.
+ */
+typedef struct LsnReadQueue
+{
+	LsnReadQueueNextFun next;
+	uintptr_t	lrq_private;
+	uint32		max_inflight;
+	uint32		inflight;
+	uint32		completed;
+	uint32		head;
+	uint32		tail;
+	uint32		size;
+	struct
+	{
+		bool		io;
+		XLogRecPtr	lsn;
+	}			queue[FLEXIBLE_ARRAY_MEMBER];
+} LsnReadQueue;
+
+/*
+ * A prefetcher.  This is a mechanism that wraps an XLogReader, prefetching
+ * blocks that will soon be referenced, to try to avoid IO stalls.
+ */
+struct XLogPrefetcher
+{
+	/* WAL reader and current reading state. */
+	XLogReaderState *reader;
+	DecodedXLogRecord *record;
+	int			next_block_id;
+
+	/* When to publish stats. */
+	XLogRecPtr	next_stats_shm_lsn;
+
+	/* Book-keeping to avoid accessing blocks that don't exist yet. */
+	HTAB	   *filter_table;
+	dlist_head	filter_queue;
+
+	/* Book-keeping to avoid repeat prefetches. */
+	RelFileNode recent_rnode[XLOGPREFETCHER_SEQ_WINDOW_SIZE];
+	BlockNumber recent_block[XLOGPREFETCHER_SEQ_WINDOW_SIZE];
+	int			recent_idx;
+
+	/* Book-keeping to disable prefetching temporarily. */
+	XLogRecPtr	no_readahead_until;
+
+	/* IO depth manager. */
+	LsnReadQueue *streaming_read;
+
+	XLogRecPtr	begin_ptr;
+
+	int			reconfigure_count;
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that (we assume) have already been dropped, or will be created by bulk WAL
+ * operators.
+ */
+typedef struct XLogPrefetcherFilter
+{
+	RelFileNode rnode;
+	XLogRecPtr	filter_until_replayed;
+	BlockNumber filter_from_block;
+	dlist_node	link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+	pg_atomic_uint64 reset_time;	/* Time of last reset. */
+	pg_atomic_uint64 prefetch;	/* Prefetches initiated. */
+	pg_atomic_uint64 hit;		/* Blocks already in cache. */
+	pg_atomic_uint64 skip_init; /* Zero-inited blocks skipped. */
+	pg_atomic_uint64 skip_new;	/* New/missing blocks filtered. */
+	pg_atomic_uint64 skip_fpw;	/* FPWs skipped. */
+	pg_atomic_uint64 skip_seq;	/* Repeat accesses skipped. */
+
+	/* Reset counters */
+	pg_atomic_uint32 reset_request;
+	uint32		reset_handled;
+
+	/* Dynamic values */
+	int			wal_distance;	/* Number of WAL bytes ahead. */
+	int			block_distance; /* Number of block references ahead. */
+	int			io_depth;		/* Number of I/Os in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+										   RelFileNode rnode,
+										   BlockNumber blockno,
+										   XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+											RelFileNode rnode,
+											BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+												 XLogRecPtr replaying_lsn);
+static LsnReadQueueNextStatus XLogPrefetcherNextBlock(uintptr_t pgsr_private,
+													  XLogRecPtr *lsn);
+
+static XLogPrefetchStats *SharedStats;
+
+static inline LsnReadQueue *
+lrq_alloc(uint32 max_distance,
+		  uint32 max_inflight,
+		  uintptr_t lrq_private,
+		  LsnReadQueueNextFun next)
+{
+	LsnReadQueue *lrq;
+	uint32		size;
+
+	Assert(max_distance >= max_inflight);
+
+	size = max_distance + 1;	/* full ring buffer has a gap */
+	lrq = palloc(offsetof(LsnReadQueue, queue) + sizeof(lrq->queue[0]) * size);
+	lrq->lrq_private = lrq_private;
+	lrq->max_inflight = max_inflight;
+	lrq->size = size;
+	lrq->next = next;
+	lrq->head = 0;
+	lrq->tail = 0;
+	lrq->inflight = 0;
+	lrq->completed = 0;
+
+	return lrq;
+}
+
+static inline void
+lrq_free(LsnReadQueue *lrq)
+{
+	pfree(lrq);
+}
+
+static inline uint32
+lrq_inflight(LsnReadQueue *lrq)
+{
+	return lrq->inflight;
+}
+
+static inline uint32
+lrq_completed(LsnReadQueue *lrq)
+{
+	return lrq->completed;
+}
+
+static inline void
+lrq_prefetch(LsnReadQueue *lrq)
+{
+	/* Try to start as many IOs as we can within our limits. */
+	while (lrq->inflight < lrq->max_inflight &&
+		   lrq->inflight + lrq->completed < lrq->size - 1)
+	{
+		Assert(((lrq->head + 1) % lrq->size) != lrq->tail);
+		switch (lrq->next(lrq->lrq_private, &lrq->queue[lrq->head].lsn))
+		{
+			case LRQ_NEXT_AGAIN:
+				return;
+			case LRQ_NEXT_IO:
+				lrq->queue[lrq->head].io = true;
+				lrq->inflight++;
+				break;
+			case LRQ_NEXT_NO_IO:
+				lrq->queue[lrq->head].io = false;
+				lrq->completed++;
+				break;
+		}
+		lrq->head++;
+		if (lrq->head == lrq->size)
+			lrq->head = 0;
+	}
+}
+
+static inline void
+lrq_complete_lsn(LsnReadQueue *lrq, XLogRecPtr lsn)
+{
+	/*
+	 * We know that LSNs before 'lsn' have been replayed, so we can now assume
+	 * that any IOs that were started before then have finished.
+	 */
+	while (lrq->tail != lrq->head &&
+		   lrq->queue[lrq->tail].lsn < lsn)
+	{
+		if (lrq->queue[lrq->tail].io)
+			lrq->inflight--;
+		else
+			lrq->completed--;
+		lrq->tail++;
+		if (lrq->tail == lrq->size)
+			lrq->tail = 0;
+	}
+	if (RecoveryPrefetchEnabled())
+		lrq_prefetch(lrq);
+}
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+	return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+	pg_atomic_write_u64(&SharedStats->reset_time, GetCurrentTimestamp());
+	pg_atomic_write_u64(&SharedStats->prefetch, 0);
+	pg_atomic_write_u64(&SharedStats->hit, 0);
+	pg_atomic_write_u64(&SharedStats->skip_init, 0);
+	pg_atomic_write_u64(&SharedStats->skip_new, 0);
+	pg_atomic_write_u64(&SharedStats->skip_fpw, 0);
+	pg_atomic_write_u64(&SharedStats->skip_seq, 0);
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+	bool		found;
+
+	SharedStats = (XLogPrefetchStats *)
+		ShmemInitStruct("XLogPrefetchStats",
+						sizeof(XLogPrefetchStats),
+						&found);
+
+	if (!found)
+	{
+		pg_atomic_init_u32(&SharedStats->reset_request, 0);
+		SharedStats->reset_handled = 0;
+
+		pg_atomic_init_u64(&SharedStats->reset_time, GetCurrentTimestamp());
+		pg_atomic_init_u64(&SharedStats->prefetch, 0);
+		pg_atomic_init_u64(&SharedStats->hit, 0);
+		pg_atomic_init_u64(&SharedStats->skip_init, 0);
+		pg_atomic_init_u64(&SharedStats->skip_new, 0);
+		pg_atomic_init_u64(&SharedStats->skip_fpw, 0);
+	}
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+	XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+	pg_atomic_fetch_add_u32(&SharedStats->reset_request, 1);
+}
+
+/*
+ * Increment a counter in shared memory.  This is equivalent to (*counter)++ on a
+ * plain uint64 without any memory barrier or locking, except on platforms
+ * where readers can't read uint64 without possibly observing a torn value.
+ */
+static inline void
+XLogPrefetchIncrement(pg_atomic_uint64 *counter)
+{
+	Assert(AmStartupProcess() || !IsUnderPostmaster);
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL records.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogReaderState *reader)
+{
+	XLogPrefetcher *prefetcher;
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+
+	prefetcher = palloc0(sizeof(XLogPrefetcher));
+
+	prefetcher->reader = reader;
+	prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	SharedStats->wal_distance = 0;
+	SharedStats->block_distance = 0;
+	SharedStats->io_depth = 0;
+
+	/* First usage will cause streaming_read to be allocated. */
+	prefetcher->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	lrq_free(prefetcher->streaming_read);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher);
+}
+
+/*
+ * Provide access to the reader.
+ */
+XLogReaderState *
+XLogPrefetcherGetReader(XLogPrefetcher *prefetcher)
+{
+	return prefetcher->reader;
+}
+
+static void
+XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher, XLogRecPtr lsn)
+{
+	uint32		io_depth;
+	uint32		completed;
+	uint32		reset_request;
+	int64		wal_distance;
+
+
+	/* How far ahead of replay are we now? */
+	if (prefetcher->record)
+		wal_distance = prefetcher->record->lsn - prefetcher->reader->record->lsn;
+	else
+		wal_distance = 0;
+
+	/* How many IOs are currently in flight and completed? */
+	io_depth = lrq_inflight(prefetcher->streaming_read);
+	completed = lrq_completed(prefetcher->streaming_read);
+
+	/* Update the instantaneous stats visible in pg_stat_prefetch_recovery. */
+	SharedStats->io_depth = io_depth;
+	SharedStats->block_distance = io_depth + completed;
+	SharedStats->wal_distance = wal_distance;
+
+	/*
+	 * Have we been asked to reset our stats counters?  This is checked with
+	 * an unsynchronized memory read, but we'll see it eventually and we'll be
+	 * accessing that cache line anyway.
+	 */
+	reset_request = pg_atomic_read_u32(&SharedStats->reset_request);
+	if (reset_request != SharedStats->reset_handled)
+	{
+		XLogPrefetchResetStats();
+		SharedStats->reset_handled = reset_request;
+	}
+
+	prefetcher->next_stats_shm_lsn = lsn + XLOGPREFETCHER_STATS_SHM_DISTANCE;
+}
+
+/*
+ * A callback that examines the next block reference in the WAL.
+ *
+ * Returns LRQ_NEXT_AGAIN if no more WAL data is available yet.
+ *
+ * Returns LRQ_NEXT_IO if we examined the next block reference, it wasn't in
+ * the buffer pool, and the kernel has been asked to start reading it to make
+ * a future read faster.  An LSN is written to *lsn, and the I/O will be
+ * considered to have completed once that LSN is replayed.
+ *
+ * Returns LRQ_NEXT_NO_IO if we examined the next block reference and found
+ * that it was already in the buffer pool, or decided not to prefetch it.
+ */
+static LsnReadQueueNextStatus
+XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
+{
+	XLogPrefetcher *prefetcher = (XLogPrefetcher *) pgsr_private;
+	XLogReaderState *reader = prefetcher->reader;
+	XLogRecPtr	replaying_lsn = reader->ReadRecPtr;
+
+	/*
+	 * We keep track of the record and block we're up to between calls with
+	 * prefetcher->record and prefetcher->next_block_id.
+	 */
+	for (;;)
+	{
+		DecodedXLogRecord *record;
+
+		/* Try to read a new future record, if we don't already have one. */
+		if (prefetcher->record == NULL)
+		{
+			bool		nonblocking;
+
+			/*
+			 * If there are already records or an error queued up that could
+			 * be replayed, we don't want to block here.  Otherwise, it's OK
+			 * to block waiting for more data: presumably the caller has
+			 * nothing else to do.
+			 */
+			nonblocking = XLogReaderHasQueuedRecordOrError(reader);
+
+			/* Certain records act as barriers for all readahead. */
+			if (nonblocking && replaying_lsn < prefetcher->no_readahead_until)
+				return LRQ_NEXT_AGAIN;
+
+			record = XLogReadAhead(prefetcher->reader, nonblocking);
+			if (record == NULL)
+			{
+				/*
+				 * We can't read any more, due to an error or lack of data in
+				 * nonblocking mode.
+				 */
+				return LRQ_NEXT_AGAIN;
+			}
+
+			/*
+			 * If prefetching is disabled, we don't need to analyze the record
+			 * or issue any prefetches.  We just need to cause one record to
+			 * be decoded.
+			 */
+			if (!RecoveryPrefetchEnabled())
+			{
+				*lsn = InvalidXLogRecPtr;
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* We have a new record to process. */
+			prefetcher->record = record;
+			prefetcher->next_block_id = 0;
+		}
+		else
+		{
+			/* Continue to process from last call, or last loop. */
+			record = prefetcher->record;
+		}
+
+		/*
+		 * Check for operations that require us to filter out block ranges, or
+		 * pause readahead completely.
+		 */
+		if (replaying_lsn < record->lsn)
+		{
+			uint8		rmid = record->header.xl_rmid;
+			uint8		record_type = record->header.xl_info & ~XLR_INFO_MASK;
+
+			if (rmid == RM_XLOG_ID)
+			{
+				if (record_type == XLOG_CHECKPOINT_SHUTDOWN ||
+					record_type == XLOG_END_OF_RECOVERY)
+				{
+					/*
+					 * These records might change the TLI.  Avoid potential
+					 * bugs if we were to allow "read TLI" and "replay TLI" to
+					 * differ without more analysis.
+					 */
+					prefetcher->no_readahead_until = record->lsn;
+
+#ifdef XLOGPREFETCHER_DEBUG_LEVEL
+					elog(XLOGPREFETCHER_DEBUG_LEVEL,
+						 "suppressing all readahead until %X/%X is replayed due to possible TLI change",
+						 LSN_FORMAT_ARGS(record->lsn));
+#endif
+
+					/* Fall through so we move past this record. */
+				}
+			}
+			else if (rmid == RM_DBASE_ID)
+			{
+				/*
+				 * When databases are created with the file-copy strategy,
+				 * there are no WAL records to tell us about the creation of
+				 * individual relations.
+				 */
+				if (record_type == XLOG_DBASE_CREATE_FILE_COPY)
+				{
+					xl_dbase_create_file_copy_rec *xlrec =
+					(xl_dbase_create_file_copy_rec *) record->main_data;
+					RelFileNode rnode = {InvalidOid, xlrec->db_id, InvalidOid};
+
+					/*
+					 * Don't try to prefetch anything in this database until
+					 * it has been created, or we might confuse the blocks of
+					 * different generations, if a database OID or relfilenode
+					 * is reused.  It's also more efficient than discovering
+					 * that relations don't exist on disk yet with ENOENT
+					 * errors.
+					 */
+					XLogPrefetcherAddFilter(prefetcher, rnode, 0, record->lsn);
+
+#ifdef XLOGPREFETCHER_DEBUG_LEVEL
+					elog(XLOGPREFETCHER_DEBUG_LEVEL,
+						 "suppressing prefetch in database %u until %X/%X is replayed due to raw file copy",
+						 rnode.dbNode,
+						 LSN_FORMAT_ARGS(record->lsn));
+#endif
+				}
+			}
+			else if (rmid == RM_SMGR_ID)
+			{
+				if (record_type == XLOG_SMGR_CREATE)
+				{
+					xl_smgr_create *xlrec = (xl_smgr_create *)
+					record->main_data;
+
+					if (xlrec->forkNum == MAIN_FORKNUM)
+					{
+						/*
+						 * Don't prefetch anything for this whole relation
+						 * until it has been created.  Otherwise we might
+						 * confuse the blocks of different generations, if a
+						 * relfilenode is reused.  This also avoids the need
+						 * to discover the problem via extra syscalls that
+						 * report ENOENT.
+						 */
+						XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+												record->lsn);
+
+#ifdef XLOGPREFETCHER_DEBUG_LEVEL
+						elog(XLOGPREFETCHER_DEBUG_LEVEL,
+							 "suppressing prefetch in relation %u/%u/%u until %X/%X is replayed, which creates the relation",
+							 xlrec->rnode.spcNode,
+							 xlrec->rnode.dbNode,
+							 xlrec->rnode.relNode,
+							 LSN_FORMAT_ARGS(record->lsn));
+#endif
+					}
+				}
+				else if (record_type == XLOG_SMGR_TRUNCATE)
+				{
+					xl_smgr_truncate *xlrec = (xl_smgr_truncate *)
+					record->main_data;
+
+					/*
+					 * Don't consider prefetching anything in the truncated
+					 * range until the truncation has been performed.
+					 */
+					XLogPrefetcherAddFilter(prefetcher, xlrec->rnode,
+											xlrec->blkno,
+											record->lsn);
+
+#ifdef XLOGPREFETCHER_DEBUG_LEVEL
+					elog(XLOGPREFETCHER_DEBUG_LEVEL,
+						 "suppressing prefetch in relation %u/%u/%u from block %u until %X/%X is replayed, which truncates the relation",
+						 xlrec->rnode.spcNode,
+						 xlrec->rnode.dbNode,
+						 xlrec->rnode.relNode,
+						 xlrec->blkno,
+						 LSN_FORMAT_ARGS(record->lsn));
+#endif
+				}
+			}
+		}
+
+		/* Scan the block references, starting where we left off last time. */
+		while (prefetcher->next_block_id <= record->max_block_id)
+		{
+			int			block_id = prefetcher->next_block_id++;
+			DecodedBkpBlock *block = &record->blocks[block_id];
+			SMgrRelation reln;
+			PrefetchBufferResult result;
+
+			if (!block->in_use)
+				continue;
+
+			Assert(!BufferIsValid(block->prefetch_buffer));
+
+			/*
+			 * Record the LSN of this record.  When it's replayed,
+			 * LsnReadQueue will consider any IOs submitted for earlier LSNs
+			 * to be finished.
+			 */
+			*lsn = record->lsn;
+
+			/* We don't try to prefetch anything but the main fork for now. */
+			if (block->forknum != MAIN_FORKNUM)
+			{
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/*
+			 * If there is a full page image attached, we won't be reading the
+			 * page, so don't bother trying to prefetch.
+			 */
+			if (block->has_image)
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_fpw);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* There is no point in reading a page that will be zeroed. */
+			if (block->flags & BKPBLOCK_WILL_INIT)
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_init);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* Should we skip prefetching this block due to a filter? */
+			if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+			{
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/*
+			 * There is no point in repeatedly prefetching the same block. XXX
+			 * This book-keeping could also be used to avoid explicitly
+			 * prefetching sequential blocks.
+			 */
+			for (int i = 0; i < XLOGPREFETCHER_SEQ_WINDOW_SIZE; ++i)
+			{
+				if (block->blkno == prefetcher->recent_block[i] &&
+					RelFileNodeEquals(block->rnode, prefetcher->recent_rnode[i]))
+				{
+					XLogPrefetchIncrement(&SharedStats->skip_seq);
+					return LRQ_NEXT_NO_IO;
+				}
+			}
+			prefetcher->recent_rnode[prefetcher->recent_idx] = block->rnode;
+			prefetcher->recent_block[prefetcher->recent_idx] = block->blkno;
+			prefetcher->recent_idx = (prefetcher->recent_idx + 1) % XLOGPREFETCHER_SEQ_WINDOW_SIZE;
+
+			/*
+			 * We could try to have a fast path for repeated references to the
+			 * same relation (with some scheme to handle invalidations
+			 * safely), but for now we'll call smgropen() every time.
+			 */
+			reln = smgropen(block->rnode, InvalidBackendId);
+
+			/*
+			 * If the relation file doesn't exist on disk, for example because
+			 * we're replaying after a crash and the file will be created and
+			 * then unlinked by WAL that hasn't been replayed yet, suppress
+			 * further prefetching in the relation until this record is
+			 * replayed.
+			 */
+			if (!smgrexists(reln, MAIN_FORKNUM))
+			{
+#ifdef XLOGPREFETCHER_DEBUG_LEVEL
+				elog(XLOGPREFETCHER_DEBUG_LEVEL,
+					 "suppressing all prefetch in relation %u/%u/%u until %X/%X is replayed, because the relation does not exist on disk",
+					 reln->smgr_rnode.node.spcNode,
+					 reln->smgr_rnode.node.dbNode,
+					 reln->smgr_rnode.node.relNode,
+					 LSN_FORMAT_ARGS(record->lsn));
+#endif
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+										record->lsn);
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/*
+			 * If the relation isn't big enough to contain the referenced
+			 * block yet, suppress prefetching of this block and higher until
+			 * this record is replayed.
+			 */
+			if (block->blkno >= smgrnblocks(reln, block->forknum))
+			{
+#ifdef XLOGPREFETCHER_DEBUG_LEVEL
+				elog(XLOGPREFETCHER_DEBUG_LEVEL,
+					 "suppressing prefetch in relation %u/%u/%u from block %u until %X/%X is replayed, because the relation is too small",
+					 reln->smgr_rnode.node.spcNode,
+					 reln->smgr_rnode.node.dbNode,
+					 reln->smgr_rnode.node.relNode,
+					 block->blkno,
+					 LSN_FORMAT_ARGS(record->lsn));
+#endif
+				XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+										record->lsn);
+				XLogPrefetchIncrement(&SharedStats->skip_new);
+				return LRQ_NEXT_NO_IO;
+			}
+
+			/* Try to initiate prefetching. */
+			result = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+			if (BufferIsValid(result.recent_buffer))
+			{
+				/* Cache hit, nothing to do. */
+				XLogPrefetchIncrement(&SharedStats->hit);
+				block->prefetch_buffer = result.recent_buffer;
+				return LRQ_NEXT_NO_IO;
+			}
+			else if (result.initiated_io)
+			{
+				/* Cache miss, I/O (presumably) started. */
+				XLogPrefetchIncrement(&SharedStats->prefetch);
+				block->prefetch_buffer = InvalidBuffer;
+				return LRQ_NEXT_IO;
+			}
+			else
+			{
+				/*
+				 * This shouldn't be possible, because we already determined
+				 * that the relation exists on disk and is big enough.
+				 * Something is wrong with the cache invalidation for
+				 * smgrexists(), smgrnblocks(), or the file was unlinked or
+				 * truncated beneath our feet?
+				 */
+				elog(ERROR,
+					 "could not prefetch relation %u/%u/%u block %u",
+					 reln->smgr_rnode.node.spcNode,
+					 reln->smgr_rnode.node.dbNode,
+					 reln->smgr_rnode.node.relNode,
+					 block->blkno);
+			}
+		}
+
+		/*
+		 * Several callsites need to be able to read exactly one record
+		 * without any internal readahead.  Examples: xlog.c reading
+		 * checkpoint records with emode set to PANIC, which might otherwise
+		 * cause XLogPageRead() to panic on some future page, and xlog.c
+		 * determining where to start writing WAL next, which depends on the
+		 * contents of the reader's internal buffer after reading one record.
+		 * Therefore, don't even think about prefetching until the first
+		 * record after XLogPrefetcherBeginRead() has been consumed.
+		 */
+		if (prefetcher->reader->decode_queue_tail &&
+			prefetcher->reader->decode_queue_tail->lsn == prefetcher->begin_ptr)
+			return LRQ_NEXT_AGAIN;
+
+		/* Advance to the next record. */
+		prefetcher->record = NULL;
+	}
+	pg_unreachable();
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Datum		values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+	bool		nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+	SetSingleFuncCall(fcinfo, 0);
+
+	if (pg_atomic_read_u32(&SharedStats->reset_request) != SharedStats->reset_handled)
+	{
+		/* There's an unhandled reset request, so just show NULLs */
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = true;
+	}
+	else
+	{
+		for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+			nulls[i] = false;
+	}
+
+	values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&SharedStats->reset_time));
+	values[1] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->prefetch));
+	values[2] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->hit));
+	values[3] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_init));
+	values[4] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_new));
+	values[5] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_fpw));
+	values[6] = Int64GetDatum(pg_atomic_read_u64(&SharedStats->skip_seq));
+	values[7] = Int32GetDatum(SharedStats->wal_distance);
+	values[8] = Int32GetDatum(SharedStats->block_distance);
+	values[9] = Int32GetDatum(SharedStats->io_depth);
+	tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+
+	return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						BlockNumber blockno, XLogRecPtr lsn)
+{
+	XLogPrefetcherFilter *filter;
+	bool		found;
+
+	filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+	if (!found)
+	{
+		/*
+		 * Don't allow any prefetching of this block or higher until replayed.
+		 */
+		filter->filter_until_replayed = lsn;
+		filter->filter_from_block = blockno;
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+	}
+	else
+	{
+		/*
+		 * We were already filtering this rnode.  Extend the filter's lifetime
+		 * to cover this WAL record, but leave the lower of the block numbers
+		 * there because we don't want to have to track individual blocks.
+		 */
+		filter->filter_until_replayed = lsn;
+		dlist_delete(&filter->link);
+		dlist_push_head(&prefetcher->filter_queue, &filter->link);
+		filter->filter_from_block = Min(filter->filter_from_block, blockno);
+	}
+}
+
+/*
+ * Have we replayed any records that caused us to begin filtering a block
+ * range?  That means that relations should have been created, extended or
+ * dropped as required, so we can stop filtering out accesses to a given
+ * relfilenode.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+														  link,
+														  &prefetcher->filter_queue);
+
+		if (filter->filter_until_replayed >= replaying_lsn)
+			break;
+
+		dlist_delete(&filter->link);
+		hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+	}
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+						 BlockNumber blockno)
+{
+	/*
+	 * Test for empty queue first, because we expect it to be empty most of
+	 * the time and we can avoid the hash table lookup in that case.
+	 */
+	if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+	{
+		XLogPrefetcherFilter *filter;
+
+		/* See if the block range is filtered. */
+		filter = hash_search(prefetcher->filter_table, &rnode, HASH_FIND, NULL);
+		if (filter && filter->filter_from_block <= blockno)
+		{
+#ifdef XLOGPREFETCHER_DEBUG_LEVEL
+			elog(XLOGPREFETCHER_DEBUG_LEVEL,
+				 "prefetch of %u/%u/%u block %u suppressed; filtering until LSN %X/%X is replayed (blocks >= %u filtered)",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode, blockno,
+				 LSN_FORMAT_ARGS(filter->filter_until_replayed),
+				 filter->filter_from_block);
+#endif
+			return true;
+		}
+
+		/* See if the whole database is filtered. */
+		rnode.relNode = InvalidOid;
+		rnode.spcNode = InvalidOid;
+		filter = hash_search(prefetcher->filter_table, &rnode, HASH_FIND, NULL);
+		if (filter)
+		{
+#ifdef XLOGPREFETCHER_DEBUG_LEVEL
+			elog(XLOGPREFETCHER_DEBUG_LEVEL,
+				 "prefetch of %u/%u/%u block %u suppressed; filtering until LSN %X/%X is replayed (whole database)",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode, blockno,
+				 LSN_FORMAT_ARGS(filter->filter_until_replayed));
+#endif
+			return true;
+		}
+	}
+
+	return false;
+}
+
+/*
+ * A wrapper for XLogBeginRead() that also resets the prefetcher.
+ */
+void
+XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher, XLogRecPtr recPtr)
+{
+	/* This will forget about any in-flight IO. */
+	prefetcher->reconfigure_count--;
+
+	/* Book-keeping to avoid readahead on first read. */
+	prefetcher->begin_ptr = recPtr;
+
+	prefetcher->no_readahead_until = 0;
+
+	/* This will forget about any queued up records in the decoder. */
+	XLogBeginRead(prefetcher->reader, recPtr);
+}
+
+/*
+ * A wrapper for XLogReadRecord() that provides the same interface, but also
+ * tries to initiate I/O for blocks referenced in future WAL records.
+ */
+XLogRecord *
+XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
+{
+	DecodedXLogRecord *record;
+
+	/*
+	 * See if it's time to reset the prefetching machinery, because a relevant
+	 * GUC was changed.
+	 */
+	if (unlikely(XLogPrefetchReconfigureCount != prefetcher->reconfigure_count))
+	{
+		if (prefetcher->streaming_read)
+			lrq_free(prefetcher->streaming_read);
+
+		/*
+		 * Arbitrarily look up to 4 times further ahead than the number of IOs
+		 * we're allowed to run concurrently.
+		 */
+		prefetcher->streaming_read =
+			lrq_alloc(RecoveryPrefetchEnabled() ? maintenance_io_concurrency * 4 : 1,
+					  RecoveryPrefetchEnabled() ? maintenance_io_concurrency : 1,
+					  (uintptr_t) prefetcher,
+					  XLogPrefetcherNextBlock);
+
+		prefetcher->reconfigure_count = XLogPrefetchReconfigureCount;
+	}
+
+	/*
+	 * Release last returned record, if there is one.  We need to do this so
+	 * that we can check for empty decode queue accurately.
+	 */
+	XLogReleasePreviousRecord(prefetcher->reader);
+
+	/* If there's nothing queued yet, then start prefetching. */
+	if (!XLogReaderHasQueuedRecordOrError(prefetcher->reader))
+		lrq_prefetch(prefetcher->streaming_read);
+
+	/* Read the next record. */
+	record = XLogNextRecord(prefetcher->reader, errmsg);
+	if (!record)
+		return NULL;
+
+	/*
+	 * The record we just got is the "current" one, for the benefit of the
+	 * XLogRecXXX() macros.
+	 */
+	Assert(record == prefetcher->reader->record);
+
+	/*
+	 * Can we drop any prefetch filters yet, given the record we're about to
+	 * return?  This assumes that any records with earlier LSNs have been
+	 * replayed, so if we were waiting for a relation to be created or
+	 * extended, it is now OK to access blocks in the covered range.
+	 */
+	XLogPrefetcherCompleteFilters(prefetcher, record->lsn);
+
+	/*
+	 * See if it's time to compute some statistics, because enough WAL has
+	 * been processed.
+	 */
+	if (unlikely(record->lsn >= prefetcher->next_stats_shm_lsn))
+		XLogPrefetcherComputeStats(prefetcher, record->lsn);
+
+	/*
+	 * The caller is about to replay this record, so we can now report that
+	 * all IO initiated because of early WAL must be finished. This may
+	 * trigger more readahead.
+	 */
+	lrq_complete_lsn(prefetcher->streaming_read, record->lsn);
+
+	Assert(record == prefetcher->reader->record);
+
+	return &record->header;
+}
+
+bool
+check_recovery_prefetch(int *new_value, void **extra, GucSource source)
+{
+#ifndef USE_PREFETCH
+	if (*new_value == RECOVERY_PREFETCH_ON)
+	{
+		GUC_check_errdetail("recovery_prefetch not supported on platforms that lack posix_fadvise().");
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_recovery_prefetch(int new_value, void *extra)
+{
+	/* Reconfigure prefetching, because a setting it depends on changed. */
+	recovery_prefetch = new_value;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+}
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e437c42992..8a48d4f6f7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1727,6 +1727,8 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
 
+			blk->prefetch_buffer = InvalidBuffer;
+
 			COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 			/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
 			if (blk->has_data && blk->data_len == 0)
@@ -1933,6 +1935,23 @@ err:
 bool
 XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 				   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+	return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL);
+}
+
+/*
+ * Returns information about the block that a block reference refers to,
+ * optionally including the buffer that the block may already be in.
+ *
+ * If the WAL record contains a block reference with the given ID, *rnode,
+ * *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and
+ * returns true.  Otherwise returns false.
+ */
+bool
+XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+					RelFileNode *rnode, ForkNumber *forknum,
+					BlockNumber *blknum,
+					Buffer *prefetch_buffer)
 {
 	DecodedBkpBlock *bkpb;
 
@@ -1947,6 +1966,8 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 		*forknum = bkpb->forknum;
 	if (blknum)
 		*blknum = bkpb->blkno;
+	if (prefetch_buffer)
+		*prefetch_buffer = bkpb->prefetch_buffer;
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 8d2395dae2..5736458263 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -36,6 +36,7 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
@@ -183,6 +184,9 @@ static bool doRequestWalReceiverReply;
 /* XLogReader object used to parse the WAL records */
 static XLogReaderState *xlogreader = NULL;
 
+/* XLogPrefetcher object used to consume WAL records with read-ahead */
+static XLogPrefetcher *xlogprefetcher = NULL;
+
 /* Parameters passed down from ReadRecord to the XLogPageRead callback. */
 typedef struct XLogPageReadPrivate
 {
@@ -404,18 +408,21 @@ static void recoveryPausesHere(bool endOfRecovery);
 static bool recoveryApplyDelay(XLogReaderState *record);
 static void ConfirmRecoveryPaused(void);
 
-static XLogRecord *ReadRecord(XLogReaderState *xlogreader,
-							  int emode, bool fetching_ckpt, TimeLineID replayTLI);
+static XLogRecord *ReadRecord(XLogPrefetcher *xlogprefetcher,
+							  int emode, bool fetching_ckpt,
+							  TimeLineID replayTLI);
 
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
-static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-										bool fetching_ckpt,
-										XLogRecPtr tliRecPtr,
-										TimeLineID replayTLI,
-										XLogRecPtr replayLSN);
+static XLogPageReadResult WaitForWALToBecomeAvailable(XLogRecPtr RecPtr,
+													  bool randAccess,
+													  bool fetching_ckpt,
+													  XLogRecPtr tliRecPtr,
+													  TimeLineID replayTLI,
+													  XLogRecPtr replayLSN,
+													  bool nonblocking);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
-static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+static XLogRecord *ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher, XLogRecPtr RecPtr,
 										int whichChkpt, bool report, TimeLineID replayTLI);
 static bool rescanLatestTimeLine(TimeLineID replayTLI, XLogRecPtr replayLSN);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
@@ -561,6 +568,15 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 				 errdetail("Failed while allocating a WAL reading processor.")));
 	xlogreader->system_identifier = ControlFile->system_identifier;
 
+	/*
+	 * Set the WAL decode buffer size.  This limits how far ahead we can read
+	 * in the WAL.
+	 */
+	XLogReaderSetDecodeBuffer(xlogreader, NULL, wal_decode_buffer_size);
+
+	/* Create a WAL prefetcher. */
+	xlogprefetcher = XLogPrefetcherAllocate(xlogreader);
+
 	/*
 	 * Allocate two page buffers dedicated to WAL consistency checks.  We do
 	 * it this way, rather than just making static arrays, for two reasons:
@@ -589,7 +605,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
 		 */
-		record = ReadCheckpointRecord(xlogreader, CheckPointLoc, 0, true, CheckPointTLI);
+		record = ReadCheckpointRecord(xlogprefetcher, CheckPointLoc, 0, true,
+									  CheckPointTLI);
 		if (record != NULL)
 		{
 			memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
@@ -607,8 +624,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			 */
 			if (checkPoint.redo < CheckPointLoc)
 			{
-				XLogBeginRead(xlogreader, checkPoint.redo);
-				if (!ReadRecord(xlogreader, LOG, false,
+				XLogPrefetcherBeginRead(xlogprefetcher, checkPoint.redo);
+				if (!ReadRecord(xlogprefetcher, LOG, false,
 								checkPoint.ThisTimeLineID))
 					ereport(FATAL,
 							(errmsg("could not find redo location referenced by checkpoint record"),
@@ -727,7 +744,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		CheckPointTLI = ControlFile->checkPointCopy.ThisTimeLineID;
 		RedoStartLSN = ControlFile->checkPointCopy.redo;
 		RedoStartTLI = ControlFile->checkPointCopy.ThisTimeLineID;
-		record = ReadCheckpointRecord(xlogreader, CheckPointLoc, 1, true,
+		record = ReadCheckpointRecord(xlogprefetcher, CheckPointLoc, 1, true,
 									  CheckPointTLI);
 		if (record != NULL)
 		{
@@ -1403,8 +1420,8 @@ FinishWalRecovery(void)
 		lastRec = XLogRecoveryCtl->lastReplayedReadRecPtr;
 		lastRecTLI = XLogRecoveryCtl->lastReplayedTLI;
 	}
-	XLogBeginRead(xlogreader, lastRec);
-	(void) ReadRecord(xlogreader, PANIC, false, lastRecTLI);
+	XLogPrefetcherBeginRead(xlogprefetcher, lastRec);
+	(void) ReadRecord(xlogprefetcher, PANIC, false, lastRecTLI);
 	endOfLog = xlogreader->EndRecPtr;
 
 	/*
@@ -1501,6 +1518,8 @@ ShutdownWalRecovery(void)
 	}
 	XLogReaderFree(xlogreader);
 
+	XLogPrefetcherFree(xlogprefetcher);
+
 	if (ArchiveRecoveryRequested)
 	{
 		/*
@@ -1584,15 +1603,15 @@ PerformWalRecovery(void)
 	{
 		/* back up to find the record */
 		replayTLI = RedoStartTLI;
-		XLogBeginRead(xlogreader, RedoStartLSN);
-		record = ReadRecord(xlogreader, PANIC, false, replayTLI);
+		XLogPrefetcherBeginRead(xlogprefetcher, RedoStartLSN);
+		record = ReadRecord(xlogprefetcher, PANIC, false, replayTLI);
 	}
 	else
 	{
 		/* just have to read next record after CheckPoint */
 		Assert(xlogreader->ReadRecPtr == CheckPointLoc);
 		replayTLI = CheckPointTLI;
-		record = ReadRecord(xlogreader, LOG, false, replayTLI);
+		record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 	}
 
 	if (record != NULL)
@@ -1706,7 +1725,7 @@ PerformWalRecovery(void)
 			}
 
 			/* Else, try to fetch the next WAL record */
-			record = ReadRecord(xlogreader, LOG, false, replayTLI);
+			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
 
 		/*
@@ -1922,6 +1941,9 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 		 */
 		if (AllowCascadeReplication())
 			WalSndWakeup();
+
+		/* Reset the prefetcher. */
+		XLogPrefetchReconfigure();
 	}
 }
 
@@ -2306,7 +2328,8 @@ verifyBackupPageConsistency(XLogReaderState *record)
 		 * temporary page.
 		 */
 		buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									 RBM_NORMAL_NO_LOG);
+									 RBM_NORMAL_NO_LOG,
+									 InvalidBuffer);
 		if (!BufferIsValid(buf))
 			continue;
 
@@ -2918,17 +2941,18 @@ ConfirmRecoveryPaused(void)
  * Attempt to read the next XLOG record.
  *
  * Before first call, the reader needs to be positioned to the first record
- * by calling XLogBeginRead().
+ * by calling XLogPrefetcherBeginRead().
  *
  * If no valid record is available, returns NULL, or fails if emode is PANIC.
  * (emode must be either PANIC, LOG). In standby mode, retries until a valid
  * record is available.
  */
 static XLogRecord *
-ReadRecord(XLogReaderState *xlogreader, int emode,
+ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		   bool fetching_ckpt, TimeLineID replayTLI)
 {
 	XLogRecord *record;
+	XLogReaderState *xlogreader = XLogPrefetcherGetReader(xlogprefetcher);
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 
 	/* Pass through parameters to XLogPageRead */
@@ -2944,7 +2968,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	{
 		char	   *errormsg;
 
-		record = XLogReadRecord(xlogreader, &errormsg);
+		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
 		if (record == NULL)
 		{
 			/*
@@ -3057,9 +3081,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 
 /*
  * Read the XLOG page containing RecPtr into readBuf (if not read already).
- * Returns number of bytes read, if the page is read successfully, or -1
- * in case of errors.  When errors occur, they are ereport'ed, but only
- * if they have not been previously reported.
+ * Returns number of bytes read, if the page is read successfully, or
+ * XLREAD_FAIL in case of errors.  When errors occur, they are ereport'ed, but
+ * only if they have not been previously reported.
+ *
+ * While prefetching, xlogreader->nonblocking may be set.  In that case,
+ * returns XLREAD_WOULDBLOCK if we'd otherwise have to wait for more WAL.
  *
  * This is responsible for restoring files from archive as needed, as well
  * as for waiting for the requested WAL record to arrive in standby mode.
@@ -3067,7 +3094,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
  * 'emode' specifies the log level used for reporting "file not found" or
  * "end of WAL" situations in archive recovery, or in standby mode when a
  * trigger file is found. If set to WARNING or below, XLogPageRead() returns
- * false in those situations, on higher log levels the ereport() won't
+ * XLREAD_FAIL in those situations, on higher log levels the ereport() won't
  * return.
  *
  * In standby mode, if after a successful return of XLogPageRead() the
@@ -3126,20 +3153,31 @@ retry:
 		(readSource == XLOG_FROM_STREAM &&
 		 flushedUpto < targetPagePtr + reqLen))
 	{
-		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
-										 private->randAccess,
-										 private->fetching_ckpt,
-										 targetRecPtr,
-										 private->replayTLI,
-										 xlogreader->EndRecPtr))
+		if (readFile >= 0 &&
+			xlogreader->nonblocking &&
+			readSource == XLOG_FROM_STREAM &&
+			flushedUpto < targetPagePtr + reqLen)
+			return XLREAD_WOULDBLOCK;
+
+		switch (WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
+											private->randAccess,
+											private->fetching_ckpt,
+											targetRecPtr,
+											private->replayTLI,
+											xlogreader->EndRecPtr,
+											xlogreader->nonblocking))
 		{
-			if (readFile >= 0)
-				close(readFile);
-			readFile = -1;
-			readLen = 0;
-			readSource = XLOG_FROM_ANY;
-
-			return -1;
+			case XLREAD_WOULDBLOCK:
+				return XLREAD_WOULDBLOCK;
+			case XLREAD_FAIL:
+				if (readFile >= 0)
+					close(readFile);
+				readFile = -1;
+				readLen = 0;
+				readSource = XLOG_FROM_ANY;
+				return XLREAD_FAIL;
+			case XLREAD_SUCCESS:
+				break;
 		}
 	}
 
@@ -3264,7 +3302,7 @@ next_record_is_invalid:
 	if (StandbyMode)
 		goto retry;
 	else
-		return -1;
+		return XLREAD_FAIL;
 }
 
 /*
@@ -3293,14 +3331,18 @@ next_record_is_invalid:
  * available.
  *
  * When the requested record becomes available, the function opens the file
- * containing it (if not open already), and returns true. When end of standby
- * mode is triggered by the user, and there is no more WAL available, returns
- * false.
+ * containing it (if not open already), and returns XLREAD_SUCCESS. When end
+ * of standby mode is triggered by the user, and there is no more WAL
+ * available, returns XLREAD_FAIL.
+ *
+ * If nonblocking is true, then give up immediately if we can't satisfy the
+ * request, returning XLREAD_WOULDBLOCK instead of waiting.
  */
-static bool
+static XLogPageReadResult
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							bool fetching_ckpt, XLogRecPtr tliRecPtr,
-							TimeLineID replayTLI, XLogRecPtr replayLSN)
+							TimeLineID replayTLI, XLogRecPtr replayLSN,
+							bool nonblocking)
 {
 	static TimestampTz last_fail_time = 0;
 	TimestampTz now;
@@ -3354,6 +3396,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 */
 		if (lastSourceFailed)
 		{
+			/*
+			 * Don't allow any retry loops to occur during nonblocking
+			 * readahead.  Let the caller process everything that has been
+			 * decoded already first.
+			 */
+			if (nonblocking)
+				return XLREAD_WOULDBLOCK;
+
 			switch (currentSource)
 			{
 				case XLOG_FROM_ARCHIVE:
@@ -3368,7 +3418,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (StandbyMode && CheckForStandbyTrigger())
 					{
 						XLogShutdownWalRcv();
-						return false;
+						return XLREAD_FAIL;
 					}
 
 					/*
@@ -3376,7 +3426,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * and pg_wal.
 					 */
 					if (!StandbyMode)
-						return false;
+						return XLREAD_FAIL;
 
 					/*
 					 * Move to XLOG_FROM_STREAM state, and set to start a
@@ -3520,7 +3570,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 											  currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
 											  currentSource);
 				if (readFile >= 0)
-					return true;	/* success! */
+					return XLREAD_SUCCESS;	/* success! */
 
 				/*
 				 * Nope, not found in archive or pg_wal.
@@ -3675,11 +3725,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							/* just make sure source info is correct... */
 							readSource = XLOG_FROM_STREAM;
 							XLogReceiptSource = XLOG_FROM_STREAM;
-							return true;
+							return XLREAD_SUCCESS;
 						}
 						break;
 					}
 
+					/* In nonblocking mode, return rather than sleeping. */
+					if (nonblocking)
+						return XLREAD_WOULDBLOCK;
+
 					/*
 					 * Data not here yet. Check for trigger, then wait for
 					 * walreceiver to wake us up when new WAL arrives.
@@ -3687,13 +3741,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (CheckForStandbyTrigger())
 					{
 						/*
-						 * Note that we don't "return false" immediately here.
-						 * After being triggered, we still want to replay all
-						 * the WAL that was already streamed. It's in pg_wal
-						 * now, so we just treat this as a failure, and the
-						 * state machine will move on to replay the streamed
-						 * WAL from pg_wal, and then recheck the trigger and
-						 * exit replay.
+						 * Note that we don't return XLREAD_FAIL immediately
+						 * here. After being triggered, we still want to
+						 * replay all the WAL that was already streamed. It's
+						 * in pg_wal now, so we just treat this as a failure,
+						 * and the state machine will move on to replay the
+						 * streamed WAL from pg_wal, and then recheck the
+						 * trigger and exit replay.
 						 */
 						lastSourceFailed = true;
 						break;
@@ -3744,7 +3798,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		HandleStartupProcInterrupts();
 	}
 
-	return false;				/* not reached */
+	return XLREAD_FAIL;				/* not reached */
 }
 
 
@@ -3789,7 +3843,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
  * 1 for "primary", 0 for "other" (backup_label)
  */
 static XLogRecord *
-ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher, XLogRecPtr RecPtr,
 					 int whichChkpt, bool report, TimeLineID replayTLI)
 {
 	XLogRecord *record;
@@ -3816,8 +3870,8 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 		return NULL;
 	}
 
-	XLogBeginRead(xlogreader, RecPtr);
-	record = ReadRecord(xlogreader, LOG, true, replayTLI);
+	XLogPrefetcherBeginRead(xlogprefetcher, RecPtr);
+	record = ReadRecord(xlogprefetcher, LOG, true, replayTLI);
 
 	if (record == NULL)
 	{
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index a4dedc58b7..e9685a87a3 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -22,6 +22,7 @@
 #include "access/timeline.h"
 #include "access/xlogrecovery.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -355,11 +356,13 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	RelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber blkno;
+	Buffer		prefetch_buffer;
 	Page		page;
 	bool		zeromode;
 	bool		willinit;
 
-	if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+	if (!XLogRecGetBlockInfo(record, block_id, &rnode, &forknum, &blkno,
+							 &prefetch_buffer))
 	{
 		/* Caller specified a bogus block_id */
 		elog(PANIC, "failed to locate backup block with ID %d", block_id);
@@ -381,7 +384,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	{
 		Assert(XLogRecHasBlockImage(record, block_id));
 		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
-									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
+									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
+									  prefetch_buffer);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			elog(ERROR, "failed to restore block image");
@@ -410,7 +414,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	}
 	else
 	{
-		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode);
+		*buf = XLogReadBufferExtended(rnode, forknum, blkno, mode, prefetch_buffer);
 		if (BufferIsValid(*buf))
 		{
 			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
@@ -450,6 +454,10 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  * exist, and we don't check for all-zeroes.  Thus, no log entry is made
  * to imply that the page should be dropped or truncated later.
  *
+ * Optionally, recent_buffer can be used to provide a hint about the location
+ * of the page in the buffer pool; it does not have to be correct, but avoids
+ * a buffer mapping table probe if it is.
+ *
  * NB: A redo function should normally not call this directly. To get a page
  * to modify, use XLogReadBufferForRedoExtended instead. It is important that
  * all pages modified by a WAL record are registered in the WAL records, or
@@ -457,7 +465,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  */
 Buffer
 XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-					   BlockNumber blkno, ReadBufferMode mode)
+					   BlockNumber blkno, ReadBufferMode mode,
+					   Buffer recent_buffer)
 {
 	BlockNumber lastblock;
 	Buffer		buffer;
@@ -465,6 +474,15 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 
 	Assert(blkno != P_NEW);
 
+	/* Do we have a clue where the buffer might be already? */
+	if (BufferIsValid(recent_buffer) &&
+		mode == RBM_NORMAL &&
+		ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+	{
+		buffer = recent_buffer;
+		goto recent_buffer_fast_path;
+	}
+
 	/* Open the relation at smgr level */
 	smgr = smgropen(rnode, InvalidBackendId);
 
@@ -523,6 +541,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 		}
 	}
 
+recent_buffer_fast_path:
 	if (mode == RBM_NORMAL)
 	{
 		/* check that page has been initialized */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 9eaa51df29..a91442f643 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -930,6 +930,20 @@ CREATE VIEW pg_stat_wal_receiver AS
     FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;
 
+CREATE VIEW pg_stat_prefetch_recovery AS
+    SELECT
+            s.stats_reset,
+            s.prefetch,
+            s.hit,
+            s.skip_init,
+            s.skip_new,
+            s.skip_fpw,
+            s.skip_seq,
+            s.wal_distance,
+            s.block_distance,
+            s.io_depth
+     FROM pg_stat_get_prefetch_recovery() s;
+
 CREATE VIEW pg_stat_subscription AS
     SELECT
             su.oid AS subid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c10311e036..840b9600e8 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -37,6 +37,7 @@
 #include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "access/xlogprefetcher.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
@@ -1320,11 +1321,16 @@ pgstat_reset_shared_counters(const char *target)
 		msg.m_resettarget = RESET_BGWRITER;
 	else if (strcmp(target, "wal") == 0)
 		msg.m_resettarget = RESET_WAL;
+	else if (strcmp(target, "prefetch_recovery") == 0)
+	{
+		XLogPrefetchRequestResetStats();
+		return;
+	}
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("unrecognized reset target: \"%s\"", target),
-				 errhint("Target must be \"archiver\", \"bgwriter\", or \"wal\".")));
+				 errhint("Target must be \"archiver\", \"bgwriter\", \"wal\", or \"prefetch_recovery\".")));
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
 	pgstat_send(&msg, sizeof(msg));
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d73a40c1bc..1c33774f35 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -649,6 +649,8 @@ ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum,
 				pg_atomic_write_u32(&bufHdr->state,
 									buf_state + BUF_USAGECOUNT_ONE);
 
+			pgBufferUsage.local_blks_hit++;
+
 			return true;
 		}
 	}
@@ -680,6 +682,8 @@ ReadRecentBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum,
 			else
 				PinBuffer_Locked(bufHdr);	/* pin for first time */
 
+			pgBufferUsage.shared_blks_hit++;
+
 			return true;
 		}
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 78c073b7c9..d41ae37090 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -211,7 +211,8 @@ XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 	blkno = fsm_logical_to_physical(addr);
 
 	/* If the page doesn't exist already, extend */
-	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR);
+	buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR,
+								 InvalidBuffer);
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(buf);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index cd4ebe2fc5..17f54b153b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/subtrans.h"
 #include "access/syncscan.h"
 #include "access/twophase.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
 #include "miscadmin.h"
@@ -119,6 +120,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, LockShmemSize());
 	size = add_size(size, PredicateLockShmemSize());
 	size = add_size(size, ProcGlobalShmemSize());
+	size = add_size(size, XLogPrefetchShmemSize());
 	size = add_size(size, XLOGShmemSize());
 	size = add_size(size, XLogRecoveryShmemSize());
 	size = add_size(size, CLOGShmemSize());
@@ -243,6 +245,7 @@ CreateSharedMemoryAndSemaphores(void)
 	 * Set up xlog, clog, and buffers
 	 */
 	XLOGShmemInit();
+	XLogPrefetchShmemInit();
 	XLogRecoveryShmemInit();
 	CLOGShmemInit();
 	CommitTsShmemInit();
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 879f647dbc..286dd3f755 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -162,9 +162,11 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
 {
 	/*
 	 * Close it first, to ensure that we notice if the fork has been unlinked
-	 * since we opened it.
+	 * since we opened it.  As an optimization, we can skip that in recovery,
+	 * which already closes relations when dropping them.
 	 */
-	mdclose(reln, forkNum);
+	if (!InRecovery)
+		mdclose(reln, forkNum);
 
 	return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
 }
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9e8ab1420d..bd94af2905 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -41,6 +41,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "catalog/namespace.h"
 #include "catalog/objectaccess.h"
@@ -216,6 +217,7 @@ static bool check_effective_io_concurrency(int *newval, void **extra, GucSource
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
 static bool check_client_connection_check_interval(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
@@ -480,6 +482,19 @@ static const struct config_enum_entry huge_pages_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry recovery_prefetch_options[] = {
+	{"off", RECOVERY_PREFETCH_OFF, false},
+	{"on", RECOVERY_PREFETCH_ON, false},
+	{"try", RECOVERY_PREFETCH_TRY, false},
+	{"true", RECOVERY_PREFETCH_ON, true},
+	{"false", RECOVERY_PREFETCH_OFF, true},
+	{"yes", RECOVERY_PREFETCH_ON, true},
+	{"no", RECOVERY_PREFETCH_OFF, true},
+	{"1", RECOVERY_PREFETCH_ON, true},
+	{"0", RECOVERY_PREFETCH_OFF, true},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry force_parallel_mode_options[] = {
 	{"off", FORCE_PARALLEL_OFF, false},
 	{"on", FORCE_PARALLEL_ON, false},
@@ -2803,6 +2818,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
+			gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
+			gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
+			GUC_UNIT_BYTE
+		},
+		&wal_decode_buffer_size,
+		512 * 1024, 64 * 1024, MaxAllocSize,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
 			gettext_noop("Sets the size of WAL files held for standby servers."),
@@ -3126,7 +3152,8 @@ static struct config_int ConfigureNamesInt[] =
 		0,
 #endif
 		0, MAX_IO_CONCURRENCY,
-		check_maintenance_io_concurrency, NULL, NULL
+		check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+		NULL
 	},
 
 	{
@@ -4998,6 +5025,16 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS,
+			gettext_noop("Prefetch referenced blocks during recovery"),
+			gettext_noop("Read ahead of the current replay position to find uncached blocks.")
+		},
+		&recovery_prefetch,
+		RECOVERY_PREFETCH_TRY, recovery_prefetch_options,
+		check_recovery_prefetch, assign_recovery_prefetch, NULL
+	},
+
 	{
 		{"force_parallel_mode", PGC_USERSET, DEVELOPER_OPTIONS,
 			gettext_noop("Forces use of parallel query facilities."),
@@ -12250,6 +12287,20 @@ check_client_connection_check_interval(int *newval, void **extra, GucSource sour
 	return true;
 }
 
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+	/*
+	 * Reconfigure recovery prefetching, because a setting it depends on
+	 * changed.
+	 */
+	maintenance_io_concurrency = newval;
+	if (AmStartupProcess())
+		XLogPrefetchReconfigure();
+#endif
+}
+
 static void
 assign_pgstat_temp_directory(const char *newval, void *extra)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 93d221a37b..7cac856451 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -241,6 +241,12 @@
 #max_wal_size = 1GB
 #min_wal_size = 80MB
 
+# - Prefetching during recovery -
+
+#recovery_prefetch = try		# prefetch pages referenced in the WAL?
+#wal_decode_buffer_size = 512kB		# lookahead window used for prefetching
+					# (change requires restart)
+
 # - Archiving -
 
 #archive_mode = off		# enables archiving; off, on, or always
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 09f6464331..1df9dd2fbe 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -50,6 +50,7 @@ extern bool *wal_consistency_checking;
 extern char *wal_consistency_checking_string;
 extern bool log_checkpoints;
 extern bool track_wal_io_timing;
+extern int	wal_decode_buffer_size;
 
 extern int	CheckPointSegments;
 
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
new file mode 100644
index 0000000000..5ef74c1eb9
--- /dev/null
+++ b/src/include/access/xlogprefetcher.h
@@ -0,0 +1,51 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.h
+ *		Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/access/xlogprefetcher.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCHER_H
+#define XLOGPREFETCHER_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern int	recovery_prefetch;
+
+/* Possible values for recovery_prefetch */
+typedef enum
+{
+	RECOVERY_PREFETCH_OFF,
+	RECOVERY_PREFETCH_ON,
+	RECOVERY_PREFETCH_TRY
+}			RecoveryPrefetchValue;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+
+extern void XLogPrefetchReconfigure(void);
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchRequestResetStats(void);
+
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogReaderState *reader);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+
+extern XLogReaderState *XLogPrefetcherGetReader(XLogPrefetcher *prefetcher);
+
+extern void XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
+									XLogRecPtr recPtr);
+
+extern XLogRecord *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
+											char **errmsg);
+
+#endif
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f4388cc9be..be266296d5 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -39,6 +39,7 @@
 #endif
 
 #include "access/xlogrecord.h"
+#include "storage/buf.h"
 
 /* WALOpenSegment represents a WAL segment being read. */
 typedef struct WALOpenSegment
@@ -125,6 +126,9 @@ typedef struct
 	ForkNumber	forknum;
 	BlockNumber blkno;
 
+	/* Prefetching workspace. */
+	Buffer		prefetch_buffer;
+
 	/* copy of the fork_flags field from the XLogRecordBlockHeader */
 	uint8		flags;
 
@@ -430,5 +434,9 @@ extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *
 extern bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
 							   RelFileNode *rnode, ForkNumber *forknum,
 							   BlockNumber *blknum);
+extern bool XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+								RelFileNode *rnode, ForkNumber *forknum,
+								BlockNumber *blknum,
+								Buffer *prefetch_buffer);
 
 #endif							/* XLOGREADER_H */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 64708949db..ff40f96e42 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -84,7 +84,8 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 													Buffer *buf);
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
-									 BlockNumber blkno, ReadBufferMode mode);
+									 BlockNumber blkno, ReadBufferMode mode,
+									 Buffer recent_buffer);
 
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 25304430f4..9bbe539385 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6366,6 +6366,14 @@
   prorettype => 'text', proargtypes => '',
   prosrc => 'pg_get_wal_replay_pause_state' },
 
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+  proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+  proretset => 't', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int8,int4,int4,int4}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{stats_reset,prefetch,hit,skip_init,skip_new,skip_fpw,skip_seq,wal_distance,block_distance,io_depth}',
+  prosrc => 'pg_stat_get_prefetch_recovery' },
+
 { oid => '2621', descr => 'reload configuration files',
   proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ea774968f0..c9b258508d 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -450,4 +450,8 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+/* in access/transam/xlogprefetcher.c */
+extern bool check_recovery_prefetch(int *new_value, void **extra, GucSource source);
+extern void assign_recovery_prefetch(int new_value, void *extra);
+
 #endif							/* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 423b9b99fb..ac473dd98b 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1871,6 +1871,17 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+    s.prefetch,
+    s.hit,
+    s.skip_init,
+    s.skip_new,
+    s.skip_fpw,
+    s.skip_seq,
+    s.wal_distance,
+    s.block_distance,
+    s.io_depth
+   FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, hit, skip_init, skip_new, skip_fpw, skip_seq, wal_distance, block_distance, io_depth);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 72fafb795b..7f51cca0a5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1410,6 +1410,9 @@ LogicalRepWorker
 LogicalRewriteMappingData
 LogicalTape
 LogicalTapeSet
+LsnReadQueue
+LsnReadQueueNextFun
+LsnReadQueueNextStatus
 LtreeGistOptions
 LtreeSignature
 MAGIC
@@ -2953,6 +2956,10 @@ XLogPageHeaderData
 XLogPageReadCB
 XLogPageReadPrivate
 XLogPageReadResult
+XLogPrefetcher
+XLogPrefetcherFilter
+XLogPrefetchState
+XLogPrefetchStats
 XLogReaderRoutine
 XLogReaderState
 XLogRecData
-- 
2.35.1

#162Julien Rouhaud
rjuju123@gmail.com
In reply to: Thomas Munro (#161)
Re: WIP: WAL prefetch (another approach)

On Thu, Mar 31, 2022 at 10:49:32PM +1300, Thomas Munro wrote:

On Mon, Mar 21, 2022 at 9:29 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

So I finally finished looking at this patch. Here again, AFAICS the feature is
working as expected and I didn't find any problem. I just have some minor
comments, like for the previous patch.

Thanks very much for the review. I've attached a new version
addressing most of your feedback, and also rebasing over the new
WAL-logged CREATE DATABASE. I've also fixed a couple of bugs (see
end).

For the docs:

+        Whether to try to prefetch blocks that are referenced in the WAL that
+        are not yet in the buffer pool, during recovery.  Valid values are
+        <literal>off</literal> (the default), <literal>on</literal> and
+        <literal>try</literal>.  The setting <literal>try</literal> enables
+        prefetching only if the operating system provides the
+        <function>posix_fadvise</function> function, which is currently used
+        to implement prefetching.  Note that some operating systems provide the
+        function, but don't actually perform any prefetching.

Is there any reason not to change it to try? I'm wondering if some system says
that the function exists but simply raises an error if you actually try to use
it. I think that at least WSL does that for some functions.

Yeah, we could just default it to try. Whether we should ship that
way is another question, but done for now.

Should there be an associated pg15 open item for that once the patch is
committed? Note that in wal.sgml, the patch still says:

+ [...] By default, prefetching in
+ recovery is disabled.

I guess this should be changed even if we eventually choose to disable it by
default?

I don't think there are any supported systems that have a
posix_fadvise() that fails with -1, or we'd know about it, because
we already use it in other places. We do support one OS that provides
a dummy function in libc that does nothing at all (Solaris/illumos),
and at least a couple that enter the kernel but are known to do
nothing at all for WILLNEED (AIX, FreeBSD).
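
For concreteness, the kind of call being discussed looks roughly like this
(a sketch only, assuming the standard posix_fadvise() signature and the
USE_PREFETCH configure symbol; hint_willneed is a made-up wrapper name):

#include <fcntl.h>

/* hint_willneed is a hypothetical wrapper, shown only for illustration */
static void
hint_willneed(int fd, off_t offset, off_t len)
{
#ifdef USE_PREFETCH
	/* On some systems this succeeds but does nothing for WILLNEED. */
	(void) posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
#else
	(void) fd;
	(void) offset;
	(void) len;
#endif
}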

Ah, I didn't know that, thanks for the info!

bool
XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+{
+   return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL);
+}
+
+bool
+XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+                   RelFileNode *rnode, ForkNumber *forknum,
+                   BlockNumber *blknum,
+                   Buffer *prefetch_buffer)
{

It's missing comments on that function. XLogRecGetBlockTag comments should
probably be reworded at the same time.

New comment added for XLogRecGetBlockInfo(). Wish I could come up
with a better name for that... Not quite sure what you thought I should
change about XLogRecGetBlockTag().

Since XLogRecGetBlockTag is now a wrapper for XLogRecGetBlockInfo, I thought it
would be better to document only the specific behavior for this one (so no
prefetch_buffer), rather than duplicating the whole description in both places.
It seems like a good recipe to miss one of the comments the next time something
is changed there.

For the name, why not the usual XLogRecGetBlockTagExtended()?

@@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
*/
if (lastSourceFailed)
{
+           /*
+            * Don't allow any retry loops to occur during nonblocking
+            * readahead.  Let the caller process everything that has been
+            * decoded already first.
+            */
+           if (nonblocking)
+               return XLREAD_WOULDBLOCK;

Is that really enough? I'm wondering if the code path in ReadRecord() that
forces lastSourceFailed to False while it actually failed when switching into
archive recovery (xlogrecovery.c around line 3044) can be problematic here.

I don't see the problem scenario, could you elaborate?

Sorry, I missed that in standby mode ReadRecord would keep going until a record
is found, so no problem indeed.

+   /* Do we have a clue where the buffer might be already? */
+   if (BufferIsValid(recent_buffer) &&
+       mode == RBM_NORMAL &&
+       ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+   {
+       buffer = recent_buffer;
+       goto recent_buffer_fast_path;
+   }

Should this increment (local|shared)_blks_hit, since ReadRecentBuffer doesn't?

Hmm. I guess ReadRecentBuffer() should really do that. Done.

Ah, I also thought it'd be better there but was assuming that there was some
possible usage where it's not wanted. Good then!

Should ReadRecentBuffer comment be updated to mention that pgBufferUsage is
incremented as appropriate? FWIW that's the first place I looked when checking
if the stats would be incremented.

Missed in the previous patch: XLogDecodeNextRecord() isn't a trivial function,
so some comments would be helpful.

OK, I'll come back to that.

Ok!

+/*
+ * A callback that reads ahead in the WAL and tries to initiate one IO.
+ */
+static LsnReadQueueNextStatus
+XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)

Should there be a bit more comments about what this function is supposed to
enforce?

I have added a comment to explain.

small typos:

+ * Returns LRQ_NEXT_IO if the next block reference and it isn't in the buffer
+ * pool, [...]

I guess s/if the next block/if there's a next block/ or s/and it//.

+ * Returns LRQ_NO_IO if we examined the next block reference and found that it
+ * was already in the buffer pool.

should be LRQ_NEXT_NO_IO, and also this is returned if prefetching is disabled
or if the next block isn't prefetchable.

I'm wondering if it's a bit overkill to implement this as a callback. Do you
have near future use cases in mind? For now no other code could use the
infrastructure at all as the lrq is private, so some changes will be needed to
make it truly configurable anyway.

Yeah. Actually, in the next step I want to throw away the lrq part,
and keep just the XLogPrefetcherNextBlock() function, with some small
modifications.

Ah I see, that makes sense then.

Admittedly the control flow is a little confusing, but the point of
this architecture is to separate "how to prefetch one more thing" from
"when to prefetch, considering I/O depth and related constraints".
The first thing, "how", is represented by XLogPrefetcherNextBlock().
The second thing, "when", is represented here by the
LsnReadQueue/lrq_XXX stuff that is private in this file for now, but
later I will propose to replace that second thing with the
pg_streaming_read facility of commitfest entry 38/3316. This is a way
of getting there step by step. I also wrote briefly about that here:

/messages/by-id/CA+hUKGJ7OqpdnbSTq5oK=djSeVW2JMnrVPSm8JC-_dbN6Y7bpw@mail.gmail.com
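
To make that split concrete, here is a rough sketch -- not the patch's actual
code; the struct and function names below are simplified stand-ins -- of a
queue that decides when to start another IO by repeatedly asking a callback
that knows how to produce the next block reference:

#include <stdint.h>
#include <stdio.h>

typedef enum
{
	LRQ_NEXT_IO,				/* an IO was initiated for the next block */
	LRQ_NEXT_NO_IO,				/* next block needed no IO (already cached) */
	LRQ_NEXT_AGAIN				/* nothing more can be examined right now */
} NextStatusSketch;

/* "How": produce one more block reference, possibly initiating an IO. */
typedef NextStatusSketch (*next_block_fn) (uintptr_t private_data, uint64_t *lsn);

/* "When": track in-flight IOs and decide whether to ask for more. */
typedef struct
{
	uintptr_t	private_data;
	next_block_fn next;
	int			inflight;		/* IOs started but not yet replayed */
	int			max_inflight;	/* cf. maintenance_io_concurrency */
} LsnReadQueueSketch;

static void
sketch_prefetch(LsnReadQueueSketch *lrq)
{
	uint64_t	lsn;

	while (lrq->inflight < lrq->max_inflight)
	{
		NextStatusSketch status = lrq->next(lrq->private_data, &lsn);

		if (status == LRQ_NEXT_AGAIN)
			break;				/* ran out of decoded WAL; try again later */
		if (status == LRQ_NEXT_IO)
			lrq->inflight++;	/* e.g. a posix_fadvise() hint was issued */
		/* LRQ_NEXT_NO_IO: keep looking ahead without using IO budget */
	}
}

/* A stand-in "how" callback: pretend every third block needs an IO. */
static NextStatusSketch
demo_next_block(uintptr_t private_data, uint64_t *lsn)
{
	static uint64_t n = 0;

	(void) private_data;
	*lsn = ++n;
	return (n % 3 == 0) ? LRQ_NEXT_IO : LRQ_NEXT_NO_IO;
}

int
main(void)
{
	LsnReadQueueSketch lrq = {0, demo_next_block, 0, 10};

	sketch_prefetch(&lrq);
	printf("IOs in flight after one prefetch round: %d\n", lrq.inflight);
	return 0;
}

In the patch itself, XLogPrefetcherNextBlock() plays the role of the callback
and the private lrq_* machinery plays the role of the queue; the sketch only
shows how the two responsibilities stay separated.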

I unsurprisingly didn't read the direct IO patch, and also joined the
prefetching thread quite recently so I missed that mail. Thanks for the
pointer!

If we keep it as a callback, I think it would make sense to extract some part,
like the main prefetch filters / global-limit logic, so other possible
implementations can use it if needed. It would also help to reduce this
function a bit, as it's somewhat long.

I can't imagine reusing any of those filtering things anywhere else.
I admit that the function is kinda long...

Yeah, I thought your plan was to provide a custom prefetching method or
something like that. As-is, apart from making the function shorter, it
wouldn't do much.

Other changes:
[...]
3. The handling for XLOG_SMGR_CREATE was firing for every fork, but
it really only needed to fire for the main fork, for now. (There's no
reason at all this thing shouldn't prefetch other forks, that's just
left for later).

Ah indeed. While at it, should there be some comments on top of the file
mentioning that only the main fork is prefetched?

4. To make it easier to see the filtering logic at work, I added code
to log messages about that if you #define XLOGPREFETCHER_DEBUG_LEVEL.
Could be extended to show more internal state and events...

FTR I also tested the patch defining this. I will probably define it on my
buildfarm animal when the patch is committed to make sure it doesn't get
broken.

5. While retesting various scenarios, it bothered me that big seq
scan UPDATEs would repeatedly issue posix_fadvise() for the same block
(because multiple rows in a page are touched by consecutive records,
and the page doesn't make it into the buffer pool until a bit later).
I resurrected the defences I had against that a few versions back
using a small window of recent prefetches, which I'd originally
developed as a way to avoid explicit prefetches of sequential scans
(prefetch 1, 2, 3, ...). That turned out to be useless superstition
based on ancient discussions in this mailing list, but I think it's
still useful to avoid obviously stupid sequences of repeat system
calls (prefetch 1, 1, 1, ...). So now it has a little one-cache-line
sized window of history, to avoid doing that.
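
Purely as an illustration (the names and layout below are assumptions, not
the committed code), the idea is a tiny round-robin window of recently
prefetched blocks, consulted before issuing another hint:

#include <stdbool.h>
#include <stdint.h>

#define RECENT_WINDOW_SIZE 8	/* 8 x 8-byte keys, roughly one cache line */

typedef struct
{
	uint64_t	keys[RECENT_WINDOW_SIZE];	/* packed relfilenode/block keys */
	int			next;						/* next slot to overwrite */
} RecentPrefetchWindow;

/*
 * Return true if this block was prefetched very recently; otherwise remember
 * it and return false so the caller knows to go ahead and issue the hint.
 * (Assumes 0 is never used as a real key, since the window starts zeroed.)
 */
bool
recently_prefetched(RecentPrefetchWindow *window, uint64_t key)
{
	for (int i = 0; i < RECENT_WINDOW_SIZE; i++)
	{
		if (window->keys[i] == key)
			return true;		/* skip the repeated system call */
	}
	window->keys[window->next] = key;
	window->next = (window->next + 1) % RECENT_WINDOW_SIZE;
	return false;
}

In the patch, a check along these lines gates the prefetch call, so a run of
consecutive records touching the same page costs only one system call.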

Nice!

+ * To detect repeat access to the same block and skip useless extra system
+ * calls, we remember a small windows of recently prefetched blocks.

Should it be "repeated" access, and small window (singular)?

Also, I'm wondering if the "seq" part of the related pieces is a bit too
specific, as there could be other workloads that lead to repeated updates of
the same blocks. Maybe it's ok to use it for internal variables, but the new
skip_seq field seems a bit too obscure for a user-facing thing. Maybe
skip_same, skip_repeated or something like that?

I need to re-profile a few workloads after these changes, and then
there are a couple of bikeshed-colour items:

1. It's completely arbitrary that it limits its lookahead to
maintenance_io_concurrency * 4 blockrefs ahead in the WAL. I have no
principled reason to choose 4. In the AIO version of this (to
follow), that number of blocks finishes up getting pinned at the same
time, so more thought might be needed on that, but that doesn't apply
here yet, so it's a bit arbitrary.

Yeah, I don't see that as a blocker for now. Maybe use some #define to make it
more obvious though, as it's a bit hidden in the code right now?

3. At some point in this long thread I was convinced to name the view
pg_stat_prefetch_recovery, but the GUC is called recovery_prefetch.
That seems silly...

FWIW I prefer recovery_prefetch to prefetch_recovery.

#163Thomas Munro
thomas.munro@gmail.com
In reply to: Julien Rouhaud (#162)
Re: WIP: WAL prefetch (another approach)

On Mon, Apr 4, 2022 at 3:12 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

[review]

Thanks! I took almost all of your suggestions about renaming things,
comments, docs and moving a magic number into a macro.

Minor changes:

1. Rebased over the shmem stats changes and others that have just
landed today (woo!). The way my simple SharedStats object works and
is reset looks a little primitive next to the shiny new stats
infrastructure, but I can always adjust that in a follow-up patch if
required.

2. It was a bit annoying that the pg_stat_recovery_prefetch view
would sometimes show stale numbers while waiting for WAL to be
streamed, since the stats were only updated at arbitrary points X
bytes apart in the WAL. Now the update also happens before
sleeping/waiting and when recovery ends.

3. Last year, commit a55a9847 synchronised config.sgml with guc.c's
categories. A couple of hunks in there modified the previous version
of this work before it all got reverted, so I've re-added the
WAL_RECOVERY GUC category to match the new section in config.sgml.

About test coverage, the most interesting lines of xlogprefetcher.c
that stand out as unreached in a gcov report are in the special
handling for the new CREATE DATABASE in file-copy mode -- but that's
probably something to raise in the thread that introduced that new
functionality without a test. I've tested that code locally; if you
define XLOGPREFETCHER_DEBUG_LEVEL you'll see that it won't touch
anything in the new database until recovery has replayed the
file-copy.

As for current CI-vs-buildfarm blind spots that recently bit me and
others, I also tested -m32 and -fsanitize=undefined,unaligned builds.

I reran one of the quick pgbench/crash/drop-caches/recover tests I had
lying around and saw a 17s -> 6s speedup with FPW off (you need much
longer tests to see speedup with them on, so this is a good way for
quick sanity checks -- see Tomas V's results for long runs with FPWs
and curved effects).

With that... I've finally pushed the 0002 patch and will be watching
the build farm.

#164Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#87)
Re: WIP: WAL prefetch (another approach)

The docs seem to be wrong about the default.

+        are not yet in the buffer pool, during recovery.  Valid values are
+        <literal>off</literal> (the default), <literal>on</literal> and
+        <literal>try</literal>.  The setting <literal>try</literal> enables
+   concurrency and distance, respectively.  By default, it is set to
+   <literal>try</literal>, which enabled the feature on systems where
+   <function>posix_fadvise</function> is available.

Should say "which enables".

+       {
+               {"recovery_prefetch", PGC_SIGHUP, WAL_RECOVERY,
+                       gettext_noop("Prefetch referenced blocks during recovery"),
+                       gettext_noop("Look ahead in the WAL to find references to uncached data.")
+               },
+               &recovery_prefetch,
+               RECOVERY_PREFETCH_TRY, recovery_prefetch_options,
+               check_recovery_prefetch, assign_recovery_prefetch, NULL
+       },

Curiously, I reported a similar issue last year.

On Thu, Apr 08, 2021 at 10:37:04PM -0500, Justin Pryzby wrote:

--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -816,9 +816,7 @@
prefetching mechanism is most likely to be effective on systems
with <varname>full_page_writes</varname> set to
<varname>off</varname> (where that is safe), and where the working
-   set is larger than RAM.  By default, prefetching in recovery is enabled
-   on operating systems that have <function>posix_fadvise</function>
-   support.
+   set is larger than RAM.  By default, prefetching in recovery is disabled.
</para>
</sect1>
#165Thomas Munro
thomas.munro@gmail.com
In reply to: Justin Pryzby (#164)
Re: WIP: WAL prefetch (another approach)

On Fri, Apr 8, 2022 at 12:55 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

The docs seem to be wrong about the default.

+        are not yet in the buffer pool, during recovery.  Valid values are
+        <literal>off</literal> (the default), <literal>on</literal> and
+        <literal>try</literal>.  The setting <literal>try</literal> enables

Fixed.

+   concurrency and distance, respectively.  By default, it is set to
+   <literal>try</literal>, which enabled the feature on systems where
+   <function>posix_fadvise</function> is available.

Should say "which enables".

Fixed.

Curiously, I reported a similar issue last year.

Sorry. I guess both times we only agreed on what the default should
be in the final review round before commit, and I let the docs get out
of sync (well, the default is mentioned in two places and I apparently
ended my search too soon, changing only one). I also found another
recently obsoleted sentence: the one about showing nulls sometimes was
no longer true. Removed.

#166Shinoda, Noriyoshi (PN Japan FSIP)
In reply to: Thomas Munro (#165)
1 attachment(s)
RE: WIP: WAL prefetch (another approach)

Hi,
Thank you for developing this great feature. I tested it and checked the documentation. Currently, the documentation for the pg_stat_prefetch_recovery view is included in the description of the pg_stat_subscription view.

https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION

It is also not displayed in the list of "28.2. The Statistics Collector".
https://www.postgresql.org/docs/devel/monitoring.html

The attached patch changes the documentation so that the pg_stat_prefetch_recovery view appears as a separate view.

Regards,
Noriyoshi Shinoda

Attachments:

pg_stat_recovery_prefetch_doc_v1.diff (application/octet-stream)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 87b6e5f..07e61fb 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2977,18 +2977,20 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 
  </sect2>
 
- <sect2 id="monitoring-pg-stat-subscription">
-  <title><structname>pg_stat_subscription</structname></title>
+ <sect2 id="monitoring-pg-stat-recovery-prefetch">
+  <title><structname>pg_stat_recovery_prefetch</structname></title>
 
   <indexterm>
-   <primary>pg_stat_subscription</primary>
+   <primary>pg_stat_recovery_prefetch</primary>
   </indexterm>
 
   <para>
-   The <structname>pg_stat_subscription</structname> view will contain one
-   row per subscription for main worker (with null PID if the worker is
-   not running), and additional rows for workers handling the initial data
-   copy of the subscribed tables.
+   The <structname>pg_stat_recovery_prefetch</structname> view will contain
+   only one row.  The columns <structfield>wal_distance</structfield>,
+   <structfield>block_distance</structfield> and
+   <structfield>io_depth</structfield> show current values, and the
+   other columns show cumulative counters that can be reset
+   with the <function>pg_stat_reset_shared</function> function.
   </para>
 
   <table id="pg-stat-recovery-prefetch-view" xreflabel="pg_stat_recovery_prefetch">
@@ -3052,13 +3054,20 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    </tgroup>
   </table>
 
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-subscription">
+  <title><structname>pg_stat_subscription</structname></title>
+
+  <indexterm>
+   <primary>pg_stat_subscription</primary>
+  </indexterm>
+
   <para>
-   The <structname>pg_stat_recovery_prefetch</structname> view will contain
-   only one row.  The columns <structfield>wal_distance</structfield>,
-   <structfield>block_distance</structfield> and
-   <structfield>io_depth</structfield> show current values, and the
-   other columns show cumulative counters that can be reset
-   with the <function>pg_stat_reset_shared</function> function.
+   The <structname>pg_stat_subscription</structname> view will contain one
+   row per subscription for main worker (with null PID if the worker is
+   not running), and additional rows for workers handling the initial data
+   copy of the subscribed tables.
   </para>
 
   <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
#167Thomas Munro
thomas.munro@gmail.com
In reply to: Shinoda, Noriyoshi (PN Japan FSIP) (#166)
Re: WIP: WAL prefetch (another approach)

On Tue, Apr 12, 2022 at 9:03 PM Shinoda, Noriyoshi (PN Japan FSIP)
<noriyoshi.shinoda@hpe.com> wrote:

Thank you for developing the great feature. I tested this feature and checked the documentation. Currently, the documentation for the pg_stat_prefetch_recovery view is included in the description for the pg_stat_subscription view.

https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION

Hi! Thanks. I had just committed a fix before I saw your message,
because there was already another report here:

/messages/by-id/CAKrAKeVk-LRHMdyT6x_p33eF6dCorM2jed5h_eHdRdv0reSYTA@mail.gmail.com

#168Shinoda, Noriyoshi (PN Japan FSIP)
noriyoshi.shinoda@hpe.com
In reply to: Thomas Munro (#167)
RE: WIP: WAL prefetch (another approach)

Hi,
Thank you for your reply.
I missed the message, sorry.

Regards,
Noriyoshi Shinoda


#169Simon Riggs
simon.riggs@enterprisedb.com
In reply to: Thomas Munro (#163)
Re: WIP: WAL prefetch (another approach)

On Thu, 7 Apr 2022 at 08:46, Thomas Munro <thomas.munro@gmail.com> wrote:

With that... I've finally pushed the 0002 patch and will be watching
the build farm.

This is a nice feature if it is safe to turn off full_page_writes.

When is it safe to do that? On which platform?

I am not aware of any released software that allows full_page_writes
to be safely disabled. Perhaps something has been released recently
that allows this? I think we have substantial documentation about
safety of other settings, so we should carefully document things here
also.

--
Simon Riggs http://www.EnterpriseDB.com/

#170Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Simon Riggs (#169)
Re: WIP: WAL prefetch (another approach)

On 4/12/22 15:58, Simon Riggs wrote:

On Thu, 7 Apr 2022 at 08:46, Thomas Munro <thomas.munro@gmail.com> wrote:

With that... I've finally pushed the 0002 patch and will be watching
the build farm.

This is a nice feature if it is safe to turn off full_page_writes.

When is it safe to do that? On which platform?

I am not aware of any released software that allows full_page_writes
to be safely disabled. Perhaps something has been released recently
that allows this? I think we have substantial documentation about
safety of other settings, so we should carefully document things here
also.

I don't see why/how an async prefetch would make FPW unnecessary. Did
anyone claim that to be the case?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#171Simon Riggs
simon.riggs@enterprisedb.com
In reply to: Tomas Vondra (#170)
Re: WIP: WAL prefetch (another approach)

On Tue, 12 Apr 2022 at 16:41, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 4/12/22 15:58, Simon Riggs wrote:

On Thu, 7 Apr 2022 at 08:46, Thomas Munro <thomas.munro@gmail.com> wrote:

With that... I've finally pushed the 0002 patch and will be watching
the build farm.

This is a nice feature if it is safe to turn off full_page_writes.

When is it safe to do that? On which platform?

I am not aware of any released software that allows full_page_writes
to be safely disabled. Perhaps something has been released recently
that allows this? I think we have substantial documentation about
safety of other settings, so we should carefully document things here
also.

I don't see why/how an async prefetch would make FPW unnecessary. Did
anyone claim that to be the case?

Other way around. FPWs make prefetch unnecessary.
Therefore you would only want prefetch with FPW=off, AFAIK.

Or put this another way: when is it safe and sensible to use
recovery_prefetch != off?

--
Simon Riggs http://www.EnterpriseDB.com/

#172Dagfinn Ilmari Mannsåker
ilmari@ilmari.org
In reply to: Simon Riggs (#169)
Re: WIP: WAL prefetch (another approach)

Simon Riggs <simon.riggs@enterprisedb.com> writes:

On Thu, 7 Apr 2022 at 08:46, Thomas Munro <thomas.munro@gmail.com> wrote:

With that... I've finally pushed the 0002 patch and will be watching
the build farm.

This is a nice feature if it is safe to turn off full_page_writes.

When is it safe to do that? On which platform?

I am not aware of any released software that allows full_page_writes
to be safely disabled. Perhaps something has been released recently
that allows this? I think we have substantial documentation about
safety of other settings, so we should carefully document things here
also.

Our WAL reliability docs claim that ZFS is safe against torn pages:

https://www.postgresql.org/docs/current/wal-reliability.html:

If you have file-system software that prevents partial page writes
(e.g., ZFS), you can turn off this page imaging by turning off the
full_page_writes parameter.

- ilmari

#173Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Simon Riggs (#171)
Re: WIP: WAL prefetch (another approach)

On 4/12/22 17:46, Simon Riggs wrote:

On Tue, 12 Apr 2022 at 16:41, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 4/12/22 15:58, Simon Riggs wrote:

On Thu, 7 Apr 2022 at 08:46, Thomas Munro <thomas.munro@gmail.com> wrote:

With that... I've finally pushed the 0002 patch and will be watching
the build farm.

This is a nice feature if it is safe to turn off full_page_writes.

When is it safe to do that? On which platform?

I am not aware of any released software that allows full_page_writes
to be safely disabled. Perhaps something has been released recently
that allows this? I think we have substantial documentation about
safety of other settings, so we should carefully document things here
also.

I don't see why/how an async prefetch would make FPW unnecessary. Did
anyone claim that to be the case?

Other way around. FPWs make prefetch unnecessary.
Therefore you would only want prefetch with FPW=off, AFAIK.

Or put this another way: when is it safe and sensible to use
recovery_prefetch != off?

That assumes the FPI stays in memory until the next modification, and
that can be untrue for a number of reasons. Long checkpoint interval
with enough random accesses in between is a nice example. See the
benchmarks I did a year ago (regular pgbench).

Or imagine a r/o replica used to run analytics queries, that access so
much data it evicts the buffers initialized by the FPI records.
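
One way to check this in practice on such a replica is to watch the
prefetcher's statistics while it replays; a minimal sketch, assuming the
committed view's column names (prefetch, hit, skip_fpw and friends):

-- run on the standby during replay; a growing prefetch count alongside
-- a non-trivial skip_fpw count shows prefetching still doing useful work
-- even with full_page_writes=on
SELECT prefetch, hit, skip_fpw, block_distance, io_depth
  FROM pg_stat_recovery_prefetch;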

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#174SATYANARAYANA NARLAPURAM
satyanarlapuram@gmail.com
In reply to: Simon Riggs (#171)
Re: WIP: WAL prefetch (another approach)

Other way around. FPWs make prefetch unnecessary.
Therefore you would only want prefetch with FPW=off, AFAIK.

A few scenarios where I can imagine page prefetch helping:
1/ A DR replica that is a smaller instance size than the primary. Page
prefetch can bring pages back into memory in advance when they are
evicted; this speeds up replay and is cost effective.
2/ It allows a larger checkpoint_timeout for the same recovery SLA, and
perhaps improved performance.
3/ WAL prefetch (not the pages themselves) can improve replay by itself
(not sure if it was measured in isolation; Tomas V can comment on it).
4/ The read replica running an analytical workload scenario Tomas V
mentioned earlier.

Or put this another way: when is it safe and sensible to use
recovery_prefetch != off?

When checkpoint_timeout is set large under heavy write activity, or on a
read replica whose working set is larger than memory while it receives
constant updates from the primary. This covers 1 and 4 above.
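
To make that concrete, a minimal configuration sketch for such a replica
(illustrative values only; it assumes ALTER SYSTEM is run on the standby,
that recovery_prefetch and maintenance_io_concurrency can be applied with
a reload, and that wal_decode_buffer_size would need a restart):

ALTER SYSTEM SET recovery_prefetch = 'try';
ALTER SYSTEM SET maintenance_io_concurrency = 64;  -- deeper prefetch queue
SELECT pg_reload_conf();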


#175Thomas Munro
thomas.munro@gmail.com
In reply to: Dagfinn Ilmari Mannsåker (#172)
Re: WIP: WAL prefetch (another approach)

On Wed, Apr 13, 2022 at 3:57 AM Dagfinn Ilmari Mannsåker
<ilmari@ilmari.org> wrote:

Simon Riggs <simon.riggs@enterprisedb.com> writes:

This is a nice feature if it is safe to turn off full_page_writes.

As others have said/shown, it also helps if a block with an FPW is
evicted and then read back in during one checkpoint cycle, in other
words if the working set is larger than shared buffers.

This also provides infrastructure for proposals in the next cycle, as
part of commitfest #3316:
* in direct I/O mode, I/O stalls become more likely due to lack of
kernel prefetching/double-buffering, so prefetching becomes more
essential
* even in buffered I/O mode when benefiting from free
double-buffering, the copy from kernel buffer to user space buffer can
be finished in the background instead of calling pread() when you need
the page, but you need to start it sooner
* adjacent blocks accessed by nearby records can be merged into a
single scatter-read, for example with preadv() in the background
* repeated buffer lookups, pins, locks (and maybe eventually replay)
to the same page can be consolidated

Pie-in-the-sky ideas:
* someone might eventually want to be able to replay in parallel
(hard, but certainly requires lookahead)
* I sure hope we'll eventually use different techniques for torn-page
protection to avoid the high online costs of FPW

When is it safe to do that? On which platform?

I am not aware of any released software that allows full_page_writes
to be safely disabled. Perhaps something has been released recently
that allows this? I think we have substantial documentation about
safety of other settings, so we should carefully document things here
also.

Our WAL reliability docs claim that ZFS is safe against torn pages:

https://www.postgresql.org/docs/current/wal-reliability.html:

If you have file-system software that prevents partial page writes
(e.g., ZFS), you can turn off this page imaging by turning off the
full_page_writes parameter.

Unfortunately, posix_fadvise(WILLNEED) doesn't do anything on ZFS
right now :-(. I have some patches to fix that on Linux[1]https://github.com/openzfs/zfs/pull/9807 and
FreeBSD and it seems like there's a good chance of getting them
committed based on feedback, but it needs some more work on tests and
mmap integration. If anyone's interested in helping get that landed
faster, please ping me off-list.

[1]: https://github.com/openzfs/zfs/pull/9807

#176Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#175)
Re: WIP: WAL prefetch (another approach)

I believe that the WAL prefetch patch probably accounts for the
intermittent errors that buildfarm member topminnow has shown
since it went in, eg [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2022-04-25%2001%3A48%3A47:

diff -U3 /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out
--- /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out	2022-04-10 03:05:15.972622440 +0200
+++ /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out	2022-04-25 05:09:49.861642059 +0200
@@ -34,11 +34,7 @@
 (1 row)
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info_till_end_of_wal(:'wal_lsn1');
- ok 
-----
- t
-(1 row)
-
+ERROR:  could not read WAL at 0/1903E40
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats(:'wal_lsn1', :'wal_lsn2');
  ok 
 ----
@@ -46,11 +42,7 @@
 (1 row)
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal(:'wal_lsn1');
- ok 
-----
- t
-(1 row)
-
+ERROR:  could not read WAL at 0/1903E40
 -- ===================================================================
 -- Test for filtering out WAL records of a particular table
 -- ===================================================================

I've reproduced this manually on that machine, and confirmed that the
proximate cause is that XLogNextRecord() is returning NULL because
state->decode_queue_head == NULL, without bothering to provide an errormsg
(which doesn't seem very well thought out in itself). I obtained the
contents of the xlogreader struct at failure:

(gdb) p *xlogreader
$1 = {routine = {page_read = 0x594270 <read_local_xlog_page_no_wait>,
segment_open = 0x593b44 <wal_segment_open>,
segment_close = 0x593d38 <wal_segment_close>}, system_identifier = 0,
private_data = 0x0, ReadRecPtr = 26230672, EndRecPtr = 26230752,
abortedRecPtr = 26230752, missingContrecPtr = 26230784,
overwrittenRecPtr = 0, DecodeRecPtr = 26230672, NextRecPtr = 26230752,
PrevRecPtr = 0, record = 0x0, decode_buffer = 0xf25428 "\240",
decode_buffer_size = 65536, free_decode_buffer = true,
decode_buffer_head = 0xf25428 "\240", decode_buffer_tail = 0xf25428 "\240",
decode_queue_head = 0x0, decode_queue_tail = 0x0,
readBuf = 0xf173f0 "\020\321\005", readLen = 0, segcxt = {
ws_dir = '\000' <repeats 1023 times>, ws_segsize = 16777216}, seg = {
ws_file = 25, ws_segno = 0, ws_tli = 1}, segoff = 0,
latestPagePtr = 26222592, latestPageTLI = 1, currRecPtr = 26230752,
currTLI = 1, currTLIValidUntil = 0, nextTLI = 0,
readRecordBuf = 0xf1b3f8 "<", readRecordBufSize = 40960,
errormsg_buf = 0xef3270 "", errormsg_deferred = false, nonblocking = false}

I don't have an intuition about where to look beyond that, any
suggestions?

What I do know so far is that while the failure reproduces fairly
reliably under "make check" (more than half the time, which squares
with topminnow's history), it doesn't reproduce at all under "make
installcheck" (after removing NO_INSTALLCHECK), which seems odd.
Maybe it's dependent on how much WAL history the installation has
accumulated?

It could be that this is a bug in pg_walinspect or a fault in its
test case; hard to tell since that got committed at about the same
time as the prefetch changes.

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2022-04-25%2001%3A48%3A47

#177Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#176)
Re: WIP: WAL prefetch (another approach)

Oh, one more bit of data: here's an excerpt from pg_waldump output after
the failed test:

rmgr: Btree len (rec/tot): 72/ 72, tx: 727, lsn: 0/01903BC8, prev 0/01903B70, desc: INSERT_LEAF off 111, blkref #0: rel 1663/16384/2673 blk 9
rmgr: Btree len (rec/tot): 72/ 72, tx: 727, lsn: 0/01903C10, prev 0/01903BC8, desc: INSERT_LEAF off 141, blkref #0: rel 1663/16384/2674 blk 7
rmgr: Standby len (rec/tot): 42/ 42, tx: 727, lsn: 0/01903C58, prev 0/01903C10, desc: LOCK xid 727 db 16384 rel 16391
rmgr: Transaction len (rec/tot): 437/ 437, tx: 727, lsn: 0/01903C88, prev 0/01903C58, desc: COMMIT 2022-04-25 20:16:03.374197 CEST; inval msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 16391
rmgr: Heap len (rec/tot): 59/ 59, tx: 728, lsn: 0/01903E40, prev 0/01903C88, desc: INSERT+INIT off 1 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Heap len (rec/tot): 59/ 59, tx: 728, lsn: 0/01903E80, prev 0/01903E40, desc: INSERT off 2 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Transaction len (rec/tot): 34/ 34, tx: 728, lsn: 0/01903EC0, prev 0/01903E80, desc: COMMIT 2022-04-25 20:16:03.379323 CEST
rmgr: Heap len (rec/tot): 59/ 59, tx: 729, lsn: 0/01903EE8, prev 0/01903EC0, desc: INSERT off 3 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Heap len (rec/tot): 59/ 59, tx: 729, lsn: 0/01903F28, prev 0/01903EE8, desc: INSERT off 4 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Transaction len (rec/tot): 34/ 34, tx: 729, lsn: 0/01903F68, prev 0/01903F28, desc: COMMIT 2022-04-25 20:16:03.381720 CEST

The error is complaining about not being able to read 0/01903E40,
which AFAICT is from the first "INSERT INTO sample_tbl" command,
which most certainly ought to be down to disk at this point.

Also, I modified the test script to see what WAL LSNs it thought
it was dealing with, and got

+\echo 'wal_lsn1 = ' :wal_lsn1
+wal_lsn1 =  0/1903E40
+\echo 'wal_lsn2 = ' :wal_lsn2
+wal_lsn2 =  0/1903EE8

confirming that idea of where 0/01903E40 is in the WAL history.
So this is sure looking like a bug somewhere in xlogreader.c,
not in pg_walinspect.

regards, tom lane

#178Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#177)
Re: WIP: WAL prefetch (another approach)

On Tue, Apr 26, 2022 at 6:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I believe that the WAL prefetch patch probably accounts for the
intermittent errors that buildfarm member topminnow has shown
since it went in, eg [1]:

diff -U3 /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out

Hmm, maybe but I suspect not. I think I might see what's happening here.

+ERROR: could not read WAL at 0/1903E40

I've reproduced this manually on that machine, and confirmed that the
proximate cause is that XLogNextRecord() is returning NULL because
state->decode_queue_head == NULL, without bothering to provide an errormsg
(which doesn't seem very well thought out in itself). I obtained the

Thanks for doing that. After several hours of trying I also managed
to reproduce it on that gcc23 system (not at all sure why it doesn't
show up elsewhere; MIPS 32 bit layout may be a factor), and added some
trace to get some more clues. Still looking into it, but here is the
current hypothesis I'm testing:

1. The reason there's a messageless ERROR in this case is because
there is new read_page callback logic introduced for pg_walinspect,
called via read_local_xlog_page_no_wait(), which is like the old
read_local_xlog_page() except that it returns -1 if you try to read
past the current "flushed" LSN, and we have no queued message. An
error is then reported by XLogReadRecord(), and appears to the user.

2. The reason pg_walinspect tries to read WAL data past the flushed
LSN is because its GetWALRecordsInfo() function keeps calling
XLogReadRecord() until EndRecPtr >= end_lsn, where end_lsn is taken
from a snapshot of the flushed LSN, but I don't see where it takes
into account that the flushed LSN might momentarily fall in the middle
of a record. In that case, xlogreader.c will try to read the next
page, which fails because it's past the flushed LSN (see point 1).

I will poke some more tomorrow to try to confirm this and try to come
up with a fix.
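
For reference, the shape of the pg_walinspect test that trips over this,
per the output quoted upthread (a sketch rather than the exact script; it
assumes sample_tbl was created earlier and that the LSN is captured with
pg_current_wal_lsn()):

SELECT pg_current_wal_lsn() AS wal_lsn1 \gset
INSERT INTO sample_tbl SELECT generate_series(1, 2);
SELECT pg_current_wal_lsn() AS wal_lsn2 \gset
-- fails with "could not read WAL at ..." if the flushed-LSN snapshot taken
-- inside the function falls in the middle of a record (point 2 above)
SELECT COUNT(*) >= 0 AS ok
  FROM pg_get_wal_records_info_till_end_of_wal(:'wal_lsn1');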

#179Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#178)
Re: WIP: WAL prefetch (another approach)

On Tue, Apr 26, 2022 at 6:11 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I will poke some more tomorrow to try to confirm this and try to come
up with a fix.

Done, and moved over to the pg_walinspect commit thread to reach the
right eyeballs:

/messages/by-id/CA+hUKGLtswFk9ZO3WMOqnDkGs6dK5kCdQK9gxJm0N8gip5cpiA@mail.gmail.com

#180Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#175)
Re: WIP: WAL prefetch (another approach)

On Wed, Apr 13, 2022 at 8:05 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Apr 13, 2022 at 3:57 AM Dagfinn Ilmari Mannsåker
<ilmari@ilmari.org> wrote:

Simon Riggs <simon.riggs@enterprisedb.com> writes:

This is a nice feature if it is safe to turn off full_page_writes.

When is it safe to do that? On which platform?

I am not aware of any released software that allows full_page_writes
to be safely disabled. Perhaps something has been released recently
that allows this? I think we have substantial documentation about
safety of other settings, so we should carefully document things here
also.

Our WAL reliability docs claim that ZFS is safe against torn pages:

https://www.postgresql.org/docs/current/wal-reliability.html:

If you have file-system software that prevents partial page writes
(e.g., ZFS), you can turn off this page imaging by turning off the
full_page_writes parameter.

Unfortunately, posix_fadvise(WILLNEED) doesn't do anything on ZFS
right now :-(.

Update: OpenZFS now has this working in its master branch (Linux only
for now), so fingers crossed for the next release.