WIP: WAL prefetch (another approach)
Hello hackers,
Based on ideas from earlier discussions[1][2], here is an experimental
WIP patch to improve recovery speed by prefetching blocks. If you set
wal_prefetch_distance to a positive distance, measured in bytes, then
the recovery loop will look ahead in the WAL and call PrefetchBuffer()
for referenced blocks. This can speed things up with cold caches
(example: after a server reboot) and working sets that don't fit in
memory (example: large scale pgbench).
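In case it helps to visualise the mechanism, here is a very rough sketch
of what the look-ahead amounts to (illustrative only, not the patch code;
the real loop also skips blocks already in the buffer pool, avoids
relations that are later dropped, and limits the number of I/Os in flight):

    /* Sketch only: a second reader positioned at the replay LSN. */
    XLogReaderState *lookahead;
    char            *errormsg;

    while (lookahead->ReadRecPtr < replaying_lsn + wal_prefetch_distance &&
           XLogReadRecord(lookahead, &errormsg) != NULL)
    {
        for (int block_id = 0; block_id <= lookahead->max_block_id; block_id++)
        {
            RelFileNode rnode;
            ForkNumber  forknum;
            BlockNumber blkno;

            /* hint the kernel about each block reference we decode */
            if (XLogRecGetBlockTag(lookahead, block_id, &rnode, &forknum, &blkno))
                smgrprefetch(smgropen(rnode, InvalidBackendId), forknum, blkno);
        }
    }
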
Results vary, but in contrived larger-than-memory pgbench crash
recovery experiments on a Linux development system, I've seen recovery
running as much as 20x faster with full_page_writes=off and
wal_prefetch_distance=8kB. FPWs reduce the potential speed-up as
discussed in the other thread.
Some notes:
* PrefetchBuffer() is only beneficial if your kernel and filesystem
have a working POSIX_FADV_WILLNEED implementation. That includes
Linux ext4 and xfs, but excludes macOS and Windows. In future we
might use asynchronous I/O to bring data all the way into our own
buffer pool; hopefully the PrefetchBuffer() interface wouldn't change
much and this code would automatically benefit.
* For now, for proof-of-concept purposes, the patch uses a second
XLogReader to read ahead in the WAL. I am thinking about how to write
a two-cursor XLogReader that reads and decodes each record just once.
* It can handle simple crash recovery and streaming replication
scenarios, but doesn't yet deal with complications like timeline
changes (the way to do that might depend on how the previous point
works out). The integration with WAL receiver probably needs some
work, I've been testing pretty narrow cases so far, and the way I
hijacked read_local_xlog_page() probably isn't right.
* On filesystems with block size <= BLCKSZ, it's a waste of a syscall
to try to prefetch a block that we have a FPW for, but otherwise it
can avoid a later stall due to a read-before-write at pwrite() time,
so I added a second GUC wal_prefetch_fpw to make that optional.
Earlier work, and how this patch compares:
* Sean Chittenden wrote pg_prefaulter[1], an external process that
uses worker threads to pread() referenced pages some time before
recovery does, and demonstrated very good speed-up, triggering a lot
of discussion of this topic. My WIP patch differs mainly in that it's
integrated with PostgreSQL, and it uses POSIX_FADV_WILLNEED rather
than synchronous I/O from worker threads/processes. Sean wouldn't
have liked my patch much because he was working on ZFS and that
doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS it
works pretty well, and I'll try to get that upstreamed.
* Konstantin Knizhnik proposed a dedicated PostgreSQL process that
would do approximately the same thing[2]. My WIP patch differs mainly
in that it does the prefetching work in the recovery loop itself, and
uses PrefetchBuffer() rather than FilePrefetch() directly. This
avoids a bunch of communication and complications, but admittedly does
introduce new system calls into a hot loop (for now); perhaps I could
pay for that by removing more lseek(SEEK_END) noise. It also deals
with various edge cases relating to created, dropped and truncated
relations a bit differently. It also tries to avoid generating
sequential WILLNEED advice, based on experimental evidence[3] that
that affects Linux's readahead heuristics negatively, though I don't
understand the exact mechanism there.
Here are some cases where I expect this patch to perform badly:
* Your WAL has multiple intermixed sequential access streams (ie
sequential access to N different relations), so that sequential access
is not detected, and then all the WILLNEED advice prevents Linux's
automagic readahead from working well. Perhaps that could be
mitigated by having a system that can detect up to N concurrent
streams, where N is more than the current 1, or by flagging buffers in
the WAL as part of a sequential stream. I haven't looked into this.
* The data is always found in our buffer pool, so PrefetchBuffer() is
doing nothing useful and you might as well not be calling it or doing
the extra work that leads up to that. Perhaps that could be mitigated
with an adaptive approach: too many PrefetchBuffer() hits and we stop
trying to prefetch, too many XLogReadBufferForRedo() misses and we
start trying to prefetch. That might work nicely for systems that
start out with cold caches but eventually warm up. I haven't looked
into this.
* The data is actually always in the kernel's cache, so the advice is
a waste of a syscall. That might imply that you should probably be
running with a larger shared_buffers (?). It's technically possible
to ask the operating system if a region is cached on many systems,
which could in theory be used for some kind of adaptive heuristic that
would disable pointless prefetching, but I'm not proposing that.
Ultimately this problem would be avoided by moving to true async I/O,
where we'd be initiating the read all the way into our buffers (ie it
replaces the later pread() so it's a wash, at worst).
* The prefetch distance is set too low so that pread() waits are not
avoided, or your storage subsystem can't actually perform enough
concurrent I/O to get ahead of the random access pattern you're
generating, so no distance would be far enough ahead. To help with
the former case, perhaps we could invent something smarter than a
user-supplied distance (something like "N cold block references
ahead", possibly using effective_io_concurrency, rather than "N bytes
ahead").
[1]: https://www.pgcon.org/2018/schedule/track/Case%20Studies/1204.en.html
[2]: /messages/by-id/49df9cd2-7086-02d0-3f8d-535a32d44c82@postgrespro.ru
[3]: https://github.com/macdice/some-io-tests
On Thu, Jan 02, 2020 at 02:39:04AM +1300, Thomas Munro wrote:
Hello hackers,
Based on ideas from earlier discussions[1][2], here is an experimental
WIP patch to improve recovery speed by prefetching blocks. If you set
wal_prefetch_distance to a positive distance, measured in bytes, then
the recovery loop will look ahead in the WAL and call PrefetchBuffer()
for referenced blocks. This can speed things up with cold caches
(example: after a server reboot) and working sets that don't fit in
memory (example: large scale pgbench).
Thanks, I only did a very quick review so far, but the patch looks fine.
In general, I find it somewhat non-intuitive to configure prefetching by
specifying WAL distance. I mean, how would you know what's a good value?
If you know the storage hardware, you probably know the optimal queue
depth, i.e. you know the number of requests that gets the best throughput.
But how do you deduce the WAL distance from that? I don't know.
Could we instead specify the number of blocks to prefetch? We'd probably
need to track additional details needed to determine number of blocks to
prefetch (essentially LSN for all prefetch requests).
Another thing to consider might be skipping recently prefetched blocks.
Consider you have a loop that does DML, where each statement creates a
separate WAL record, but it can easily touch the same block over and
over (say inserting to the same page). That means the prefetches are
not really needed, but I'm not sure how expensive it really is.
Results vary, but in contrived larger-than-memory pgbench crash
recovery experiments on a Linux development system, I've seen recovery
running as much as 20x faster with full_page_writes=off and
wal_prefetch_distance=8kB. FPWs reduce the potential speed-up as
discussed in the other thread.
OK, so how did you test that? I'll do some tests with a traditional
streaming replication setup, multiple sessions on the primary (and maybe
a weaker storage system on the replica). I suppose that's another setup
that should benefit from this.
...
Earlier work, and how this patch compares:
* Sean Chittenden wrote pg_prefaulter[1], an external process that
uses worker threads to pread() referenced pages some time before
recovery does, and demonstrated very good speed-up, triggering a lot
of discussion of this topic. My WIP patch differs mainly in that it's
integrated with PostgreSQL, and it uses POSIX_FADV_WILLNEED rather
than synchronous I/O from worker threads/processes. Sean wouldn't
have liked my patch much because he was working on ZFS and that
doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS it
works pretty well, and I'll try to get that upstreamed.
How long would it take to get POSIX_FADV_WILLNEED onto ZFS systems, if
everything goes fine? I'm not sure what the usual life-cycle is, but I
assume it may take a couple of years to get it onto most production systems.
What other common filesystems are missing support for this?
Presumably we could do what Sean's extension does, i.e. use a couple of
bgworkers, each doing simple pread() calls. Of course, that's
unnecessarily complicated on systems that have FADV_WILLNEED.
...
Here are some cases where I expect this patch to perform badly:
* Your WAL has multiple intermixed sequential access streams (ie
sequential access to N different relations), so that sequential access
is not detected, and then all the WILLNEED advice prevents Linux's
automagic readahead from working well. Perhaps that could be
mitigated by having a system that can detect up to N concurrent
streams, where N is more than the current 1, or by flagging buffers in
the WAL as part of a sequential stream. I haven't looked into this.
Hmmm, wouldn't it be enough to prefetch blocks in larger batches (not
one by one), and do some sort of sorting? That should allow readahead
to kick in.
* The data is always found in our buffer pool, so PrefetchBuffer() is
doing nothing useful and you might as well not be calling it or doing
the extra work that leads up to that. Perhaps that could be mitigated
with an adaptive approach: too many PrefetchBuffer() hits and we stop
trying to prefetch, too many XLogReadBufferForRedo() misses and we
start trying to prefetch. That might work nicely for systems that
start out with cold caches but eventually warm up. I haven't looked
into this.
I think the question is what the cost of such an unnecessary prefetch
really is. Presumably it's fairly cheap, especially compared to the
opposite case (not prefetching a block that is not in shared buffers). I wonder
how expensive the adaptive logic would be in cases that never need a
prefetch (i.e. datasets smaller than shared_buffers).
* The data is actually always in the kernel's cache, so the advice is
a waste of a syscall. That might imply that you should probably be
running with a larger shared_buffers (?). It's technically possible
to ask the operating system if a region is cached on many systems,
which could in theory be used for some kind of adaptive heuristic that
would disable pointless prefetching, but I'm not proposing that.
Ultimately this problem would be avoided by moving to true async I/O,
where we'd be initiating the read all the way into our buffers (ie it
replaces the later pread() so it's a wash, at worst).
Makes sense.
* The prefetch distance is set too low so that pread() waits are not
avoided, or your storage subsystem can't actually perform enough
concurrent I/O to get ahead of the random access pattern you're
generating, so no distance would be far enough ahead. To help with
the former case, perhaps we could invent something smarter than a
user-supplied distance (something like "N cold block references
ahead", possibly using effective_io_concurrency, rather than "N bytes
ahead").
In general, I find it quite non-intuitive to configure prefetching by
specifying WAL distance. I mean, how would you know what's a good value?
If you know the storage hardware, you probably know the optimal queue
depth, i.e. you know the number of requests that gets the best throughput.
But how do you deduce the WAL distance from that? I don't know. Plus,
right after a checkpoint the WAL contains FPWs, reducing the number of
blocks in a given amount of WAL (compared to right before a checkpoint).
So I expect users might pick an unnecessarily high WAL distance. OTOH, with
FPWs we don't quite need aggressive prefetching, right?
Could we instead specify the number of blocks to prefetch? We'd probably
need to track additional details needed to determine number of blocks to
prefetch (essentially LSN for all prefetch requests).
Another thing to consider might be skipping recently prefetched blocks.
Consider you have a loop that does DML, where each statement creates a
separate WAL record, but it can easily touch the same block over and
over (say inserting to the same page). That means the prefetches are
not really needed, but I'm not sure how expensive it really is.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 3, 2020 at 7:10 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
On Thu, Jan 02, 2020 at 02:39:04AM +1300, Thomas Munro wrote:
Based on ideas from earlier discussions[1][2], here is an experimental
WIP patch to improve recovery speed by prefetching blocks. If you set
wal_prefetch_distance to a positive distance, measured in bytes, then
the recovery loop will look ahead in the WAL and call PrefetchBuffer()
for referenced blocks. This can speed things up with cold caches
(example: after a server reboot) and working sets that don't fit in
memory (example: large scale pgbench).
Thanks, I only did a very quick review so far, but the patch looks fine.
Thanks for looking!
Results vary, but in contrived larger-than-memory pgbench crash
recovery experiments on a Linux development system, I've seen recovery
running as much as 20x faster with full_page_writes=off and
wal_prefetch_distance=8kB. FPWs reduce the potential speed-up as
discussed in the other thread.
OK, so how did you test that? I'll do some tests with a traditional
streaming replication setup, multiple sessions on the primary (and maybe
a weaker storage system on the replica). I suppose that's another setup
that should benefit from this.
Using a 4GB RAM 16 thread virtual machine running Linux debian10
4.19.0-6-amd64 with an ext4 filesystem on NVMe storage:
postgres -D pgdata \
-c full_page_writes=off \
-c checkpoint_timeout=60min \
-c max_wal_size=10GB \
-c synchronous_commit=off
# in another shell
pgbench -i -s300 postgres
psql postgres -c checkpoint
pgbench -T60 -Mprepared -c4 -j4 postgres
killall -9 postgres
# save the crashed pgdata dir for repeated experiments
mv pgdata pgdata-save
# repeat this with values like wal_prefetch_distance=-1, 1kB, 8kB, 64kB, ...
rm -fr pgdata
cp -r pgdata-save pgdata
postgres -D pgdata -c wal_prefetch_distance=-1
What I see on my desktop machine is around 10x speed-up:
wal_prefetch_distance=-1 -> 62s (same number for unpatched)
wal_prefetch_distance=8kB -> 6s
wal_prefetch_distance=64kB -> 5s
On another dev machine I managed to get a 20x speedup, using a much
longer test. It's probably more interesting to try out some more
realistic workloads rather than this cache-destroying uniform random
stuff, though. It might be interesting to test on systems with high
random read latency, but high concurrency; I can think of a bunch of
network storage environments where that's the case, but I haven't
looked into them, beyond some toy testing with (non-Linux) NFS over a
slow network (results were promising).
Earlier work, and how this patch compares:
* Sean Chittenden wrote pg_prefaulter[1], an external process that
uses worker threads to pread() referenced pages some time before
recovery does, and demonstrated very good speed-up, triggering a lot
of discussion of this topic. My WIP patch differs mainly in that it's
integrated with PostgreSQL, and it uses POSIX_FADV_WILLNEED rather
than synchronous I/O from worker threads/processes. Sean wouldn't
have liked my patch much because he was working on ZFS and that
doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS it
works pretty well, and I'll try to get that upstreamed.
How long would it take to get POSIX_FADV_WILLNEED onto ZFS systems, if
everything goes fine? I'm not sure what the usual life-cycle is, but I
assume it may take a couple of years to get it onto most production systems.
Assuming they like it enough to commit it (and initial informal
feedback on the general concept has been positive -- it's not messing
with their code at all, it's just boilerplate code to connect the
relevant Linux and FreeBSD VFS callbacks), it could indeed be quite a
while before it appears in conservative package repos, but I don't
know; it depends on where you get your OpenZFS/ZoL module from.
What other common filesystems are missing support for this?
Using our build farm as a way to know which operating systems we care
about as a community, in no particular order:
* I don't know for exotic or network filesystems on Linux
* AIX 7.2's manual says "Valid option, but this value does not perform
any action" for every kind of advice except POSIX_FADV_NOWRITEBEHIND
(huh, nonstandard advice).
* Solaris's posix_fadvise() was a dummy libc function, as of 10 years
ago when they closed the source; who knows after that.
* FreeBSD's UFS and NFS support other advice through a default handler
but unfortunately ignore WILLNEED (I have patches for those too, not
good enough to send anywhere yet).
* OpenBSD has no such syscall
* NetBSD has the syscall, and I can see that it's hooked up to
readahead code, so that's probably the only unqualified yes in this
list
* Windows has no equivalent syscall; the closest thing might be to use
ReadFileEx() to initiate an async read into a dummy buffer; maybe you
can use a zero event so it doesn't even try to tell you when the I/O
completes, if you don't care?
* macOS has no such syscall, but you could in theory do an aio_read()
into a dummy buffer. On the other hand I don't think that interface
is a general solution for POSIX systems, because on at least Linux and
Solaris, aio_read() is emulated by libc with a whole bunch of threads
and we are allergic to those things (and even if we weren't, we
wouldn't want a whole threadpool in every PostgreSQL process, so you'd
need to hand off to a worker process, and then why bother?).
* HPUX, I don't know
We could test any of those with a simple test I wrote[1], but I'm not
likely to test any non-open-source OS myself due to lack of access.
Amazingly, HPUX's posix_fadvise() doesn't appear to conform to POSIX:
it sets errno and returns -1, while POSIX says that it should return
an error number. Checking our source tree, I see that in
pg_flush_data(), we also screwed that up and expect errno to be set,
though we got it right in FilePrefetch().
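For anyone following along, the POSIX convention in question is that
posix_fadvise() returns the error number directly and leaves errno alone;
a minimal standalone sketch of the conforming check looks like this:

    #define _POSIX_C_SOURCE 200112L     /* for posix_fadvise() */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    /* Sketch: hint that we'll soon read [offset, offset + len) from fd. */
    static int
    advise_willneed(int fd, off_t offset, off_t len)
    {
        int         rc = posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);

        if (rc != 0)                /* the error number IS the return value */
            fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));
        return rc;
    }
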
In any case, Linux must be at the very least 90% of PostgreSQL
installations. Incidentally, sync_file_range() without wait is a sort
of opposite of WILLNEED (it means something like
"POSIX_FADV_WILLSYNC"), and no one seem terribly upset that we really
only have that on Linux (the emulations are pretty poor AFAICS).
Presumably we could do what Sean's extension does, i.e. use a couple of
bgworkers, each doing simple pread() calls. Of course, that's
unnecessarily complicated on systems that have FADV_WILLNEED.
That is a good idea, and I agree. I have a patch set that does
exactly that. It's nearly independent of the WAL prefetch work; it
just changes how PrefetchBuffer() is implemented, affecting bitmap
index scans, vacuum and any future user of PrefetchBuffer. If you
apply these patches too then WAL prefetch will use it (just set
max_background_readers = 4 or whatever):
https://github.com/postgres/postgres/compare/master...macdice:bgreader
That's simplified from an abandoned patch I had lying around because I
was experimenting with prefetching all the way into shared buffers
this way. The simplified version just does pread() into a dummy
buffer, for the side effect of warming the kernel's cache, pretty much
like pg_prefaulter. There are some tricky questions around whether
it's better to wait or not when the request queue is full; the way I
have that is far too naive, and that question is probably related to
your point about being cleverer about how many prefetch blocks you
should try to have in flight. A future version of PrefetchBuffer()
might lock the buffer then tell the worker (or some kernel async I/O
facility) to write the data into the buffer. If I understand
correctly, to make that work we need Robert's IO lock/condition
variable transplant[2], and Andres's scheme for a suitable
interlocking protocol, and no doubt some bulletproof cleanup
machinery. I'm not working on any of that myself right now because I
don't want to step on Andres's toes.
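For illustration, the essential trick such a background reader performs is
tiny: read the block into a throwaway buffer purely to populate the kernel's
page cache (sketch only, assuming an already-open file descriptor and the
default 8kB block size; the real worker adds a request queue and error
handling):

    #include <unistd.h>

    #define BLCKSZ 8192             /* assuming the default PostgreSQL block size */

    static void
    prewarm_block(int fd, unsigned int blkno)
    {
        char        scratch[BLCKSZ];

        /* We don't care about the data, only the side effect on the OS cache. */
        (void) pread(fd, scratch, BLCKSZ, (off_t) blkno * BLCKSZ);
    }
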
Here are some cases where I expect this patch to perform badly:
* Your WAL has multiple intermixed sequential access streams (ie
sequential access to N different relations), so that sequential access
is not detected, and then all the WILLNEED advice prevents Linux's
automagic readahead from working well. Perhaps that could be
mitigated by having a system that can detect up to N concurrent
streams, where N is more than the current 1, or by flagging buffers in
the WAL as part of a sequential stream. I haven't looked into this.
Hmmm, wouldn't it be enough to prefetch blocks in larger batches (not
one by one), and do some sort of sorting? That should allow readahead
to kick in.
Yeah, but I don't want to do too much work in the startup process, or
get too opinionated about how the underlying I/O stack works. I think
we'd need to do things like that in a direct I/O future, but we'd
probably offload it (?). I figured the best approach for early work
in this space would be to just get out of the way if we detect
sequential access.
* The data is always found in our buffer pool, so PrefetchBuffer() is
doing nothing useful and you might as well not be calling it or doing
the extra work that leads up to that. Perhaps that could be mitigated
with an adaptive approach: too many PrefetchBuffer() hits and we stop
trying to prefetch, too many XLogReadBufferForRedo() misses and we
start trying to prefetch. That might work nicely for systems that
start out with cold caches but eventually warm up. I haven't looked
into this.
I think the question is what the cost of such an unnecessary prefetch
really is. Presumably it's fairly cheap, especially compared to the
opposite case (not prefetching a block that is not in shared buffers). I wonder
how expensive the adaptive logic would be in cases that never need a
prefetch (i.e. datasets smaller than shared_buffers).
Hmm. It's basically a buffer map probe. I think the adaptive logic
would probably be some kind of periodically resetting counter scheme,
but you're probably right to suspect that it might not even be worth
bothering with, especially if a single XLogReader can be made to do
the readahead with no real extra cost. Perhaps we should work on
making the cost of all prefetching overheads as low as possible first,
before trying to figure out whether it's worth building a system for
avoiding it.
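To make that a little more concrete, the periodically resetting counter
scheme I have in mind might look something like this (purely hypothetical,
not in the patch; the sampling window of 1024 is an arbitrary placeholder):

    #include <stdbool.h>

    typedef struct PrefetchAdaptState
    {
        int         prefetch_hits;  /* PrefetchBuffer() found the block cached */
        int         redo_misses;    /* XLogReadBufferForRedo() had to do I/O */
        int         samples;        /* decisions since the last reset */
        bool        enabled;        /* are we currently issuing prefetches? */
    } PrefetchAdaptState;

    static void
    prefetch_adapt(PrefetchAdaptState *state)
    {
        if (++state->samples < 1024)
            return;

        /* Mostly hits: caches are warm, stop. Mostly misses: start again. */
        state->enabled = state->redo_misses > state->prefetch_hits;
        state->prefetch_hits = state->redo_misses = 0;
        state->samples = 0;
    }
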
* The prefetch distance is set too low so that pread() waits are not
avoided, or your storage subsystem can't actually perform enough
concurrent I/O to get ahead of the random access pattern you're
generating, so no distance would be far enough ahead. To help with
the former case, perhaps we could invent something smarter than a
user-supplied distance (something like "N cold block references
ahead", possibly using effective_io_concurrency, rather than "N bytes
ahead").In general, I find it quite non-intuitive to configure prefetching by
specifying WAL distance. I mean, how would you know what's a good value?
If you know the storage hardware, you probably know the optimal queue
depth i.e. you know you the number of requests to get best throughput.
FWIW, on pgbench tests on flash storage I've found that 1KB only helps
a bit, 8KB is great, and more than that doesn't get any better. Of
course, this is meaningless in general; a zipfian workload might need
to look a lot further ahead than a uniform one to find anything worth
prefetching, and that's exactly what you're complaining about, and I
agree.
But how do you deduce the WAL distance from that? I don't know. Plus,
right after a checkpoint the WAL contains FPWs, reducing the number of
blocks in a given amount of WAL (compared to right before a checkpoint).
So I expect users might pick an unnecessarily high WAL distance. OTOH, with
FPWs we don't quite need aggressive prefetching, right?
Yeah, so you need to be touching blocks more than once between
checkpoints, if you want to see speed-up on a system with blocks <=
BLCKSZ and FPW on. If checkpoints are far enough apart you'll
eventually run out of FPWs and start replaying non-FPW stuff. Or you
could be on a filesystem with larger blocks than PostgreSQL.
Could we instead specify the number of blocks to prefetch? We'd probably
need to track additional details needed to determine number of blocks to
prefetch (essentially LSN for all prefetch requests).
Yeah, I think you're right, we should probably try to make a little
queue to track LSNs and count prefetch requests in and out. I think
you'd also want PrefetchBuffer() to tell you if the block was already
in the buffer pool, so that you don't count blocks that it decided not
to prefetch. I guess PrefetchBuffer() needs to return an enum (I
already had it returning a bool for another purpose relating to an
edge case in crash recovery, when relations have been dropped by a
later WAL record). I will think about that.
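Concretely, I'm imagining a small ring of LSNs, something along these lines
(a simplified sketch with illustrative names, ignoring allocation and edge
cases):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for the real typedef */

    typedef struct PrefetchQueue
    {
        XLogRecPtr *lsns;           /* LSN of the record that caused each prefetch */
        int         size;           /* sized from effective_io_concurrency */
        int         head;
        int         tail;
    } PrefetchQueue;

    /* Record that we initiated an I/O for a block referenced at this LSN. */
    static void
    io_initiated(PrefetchQueue *q, XLogRecPtr lsn)
    {
        q->lsns[q->head] = lsn;
        q->head = (q->head + 1) % q->size;
    }

    /*
     * Once replay reaches an LSN, prefetches issued for older records can be
     * assumed complete (recovery has read those blocks), freeing slots.
     */
    static void
    io_completed(PrefetchQueue *q, XLogRecPtr replaying_lsn)
    {
        while (q->tail != q->head && q->lsns[q->tail] < replaying_lsn)
            q->tail = (q->tail + 1) % q->size;
    }

    /* Stop looking further ahead while we're at our concurrency limit. */
    static bool
    saturated(PrefetchQueue *q)
    {
        return (q->head + 1) % q->size == q->tail;
    }
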
Another thing to consider might be skipping recently prefetched blocks.
Consider you have a loop that does DML, where each statement creates a
separate WAL record, but it can easily touch the same block over and
over (say inserting to the same page). That means the prefetches are
not really needed, but I'm not sure how expensive it really is.
There are two levels of defence against repeatedly prefetching the
same block: PrefetchBuffer() checks for blocks that are already in our
cache, and before that, PrefetchState remembers the last block so that
we can avoid fetching that block (or the following block).
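In other words, the second level is roughly this check (a sketch; field and
function names are illustrative, and it assumes the usual PostgreSQL
headers for RelFileNode, BlockNumber and RelFileNodeEquals):

    typedef struct PrefetchState
    {
        RelFileNode last_rnode;     /* relation of the last prefetched block */
        BlockNumber last_blkno;     /* block number of the last prefetched block */
    } PrefetchState;

    /*
     * Skip a block if it is the same as, or immediately follows, the last one
     * we prefetched, so we stay out of the way of sequential readahead.
     */
    static bool
    recently_prefetched(PrefetchState *state, RelFileNode rnode, BlockNumber blkno)
    {
        if (RelFileNodeEquals(rnode, state->last_rnode) &&
            (blkno == state->last_blkno || blkno == state->last_blkno + 1))
            return true;

        state->last_rnode = rnode;
        state->last_blkno = blkno;
        return false;
    }
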
[1]: https://github.com/macdice/some-io-tests
[2]: /messages/by-id/CA+Tgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr=C56Xng@mail.gmail.com
On Fri, Jan 3, 2020 at 5:57 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, Jan 3, 2020 at 7:10 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
Could we instead specify the number of blocks to prefetch? We'd probably
need to track additional details needed to determine number of blocks to
prefetch (essentially LSN for all prefetch requests).
Here is a new WIP version of the patch set that does that. Changes:
1. It now uses effective_io_concurrency to control how many
concurrent prefetches to allow. It's possible that we should have a
different GUC to control "maintenance" users of concurrency I/O as
discussed elsewhere[1], but I'm staying out of that for now; if we
agree to do that for VACUUM etc, we can change it easily here. Note
that the value is percolated through the ComputeIoConcurrency()
function which I think we should discuss, but again that's off topic,
I just want to use the standard infrastructure here.
2. You can now change the relevant GUCs (wal_prefetch_distance,
wal_prefetch_fpw, effective_io_concurrency) at runtime and reload for
them to take immediate effect. For example, you can enable the
feature on a running replica by setting wal_prefetch_distance=8kB
(from the default of -1, which means off), and something like
effective_io_concurrency=10, and telling the postmaster to reload.
3. The new code is moved out to a new file
src/backend/access/transam/xlogprefetcher.c, to minimise new bloat in
the mighty xlog.c file. Functions were renamed to make their purpose
clearer, and a lot of comments were added.
4. The WAL receiver now exposes the current 'write' position via an
atomic value in shared memory, so we don't need to hammer the WAL
receiver's spinlock.
5. There is some rudimentary user documentation of the GUCs.
[1]: /messages/by-id/13619.1557935593@sss.pgh.pa.us
Attachments:
0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRela-v2.patch
From 34a5bcab7eb4a2ac64f0fe9a533cacba0e7481b4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 3 Dec 2019 17:13:40 +1300
Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.
Previously a Relation was required, but it's annoying to have
to create a "fake" one in recovery.
---
src/backend/storage/buffer/bufmgr.c | 77 ++++++++++++++++-------------
src/include/storage/bufmgr.h | 3 ++
2 files changed, 47 insertions(+), 33 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5880054245..6e0875022c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -519,6 +519,48 @@ ComputeIoConcurrency(int io_concurrency, double *target)
return (new_prefetch_pages >= 0.0 && new_prefetch_pages < (double) INT_MAX);
}
+void
+SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum, BlockNumber blockNum)
+{
+#ifdef USE_PREFETCH
+ BufferTag newTag; /* identity of requested block */
+ uint32 newHash; /* hash value for newTag */
+ LWLock *newPartitionLock; /* buffer partition lock for it */
+ int buf_id;
+
+ Assert(BlockNumberIsValid(blockNum));
+
+ /* create a tag so we can lookup the buffer */
+ INIT_BUFFERTAG(newTag, smgr_reln->smgr_rnode.node,
+ forkNum, blockNum);
+
+ /* determine its hash code and partition lock ID */
+ newHash = BufTableHashCode(&newTag);
+ newPartitionLock = BufMappingPartitionLock(newHash);
+
+ /* see if the block is in the buffer pool already */
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = BufTableLookup(&newTag, newHash);
+ LWLockRelease(newPartitionLock);
+
+ /* If not in buffers, initiate prefetch */
+ if (buf_id < 0)
+ smgrprefetch(smgr_reln, forkNum, blockNum);
+
+ /*
+ * If the block *is* in buffers, we do nothing. This is not really ideal:
+ * the block might be just about to be evicted, which would be stupid
+ * since we know we are going to need it soon. But the only easy answer
+ * is to bump the usage_count, which does not seem like a great solution:
+ * when the caller does ultimately touch the block, usage_count would get
+ * bumped again, resulting in too much favoritism for blocks that are
+ * involved in a prefetch sequence. A real fix would involve some
+ * additional per-buffer state, and it's not clear that there's enough of
+ * a problem to justify that.
+ */
+#endif
+}
+
/*
* PrefetchBuffer -- initiate asynchronous read of a block of a relation
*
@@ -550,39 +592,8 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
}
else
{
- BufferTag newTag; /* identity of requested block */
- uint32 newHash; /* hash value for newTag */
- LWLock *newPartitionLock; /* buffer partition lock for it */
- int buf_id;
-
- /* create a tag so we can lookup the buffer */
- INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
- forkNum, blockNum);
-
- /* determine its hash code and partition lock ID */
- newHash = BufTableHashCode(&newTag);
- newPartitionLock = BufMappingPartitionLock(newHash);
-
- /* see if the block is in the buffer pool already */
- LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
- LWLockRelease(newPartitionLock);
-
- /* If not in buffers, initiate prefetch */
- if (buf_id < 0)
- smgrprefetch(reln->rd_smgr, forkNum, blockNum);
-
- /*
- * If the block *is* in buffers, we do nothing. This is not really
- * ideal: the block might be just about to be evicted, which would be
- * stupid since we know we are going to need it soon. But the only
- * easy answer is to bump the usage_count, which does not seem like a
- * great solution: when the caller does ultimately touch the block,
- * usage_count would get bumped again, resulting in too much
- * favoritism for blocks that are involved in a prefetch sequence. A
- * real fix would involve some additional per-buffer state, and it's
- * not clear that there's enough of a problem to justify that.
- */
+ /* pass it to the shared buffer version */
+ SharedPrefetchBuffer(reln->rd_smgr, forkNum, blockNum);
}
#endif /* USE_PREFETCH */
}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 73c7e9ba38..89a47afec1 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -18,6 +18,7 @@
#include "storage/buf.h"
#include "storage/bufpage.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
#include "utils/relcache.h"
#include "utils/snapmgr.h"
@@ -162,6 +163,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
* prototypes for functions in bufmgr.c
*/
extern bool ComputeIoConcurrency(int io_concurrency, double *target);
+extern void SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum,
+ BlockNumber blockNum);
extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
--
2.23.0
0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecP-v2.patch
From 794f6c7d9f8e0b3b3e97aad1ce13d275be25bb4c Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:10:17 +1300
Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().
The new name better reflects the fact that the value it returns
is updated only when received data has been flushed to disk.
An upcoming patch will make use of the latest data that was
written without waiting for it to be flushed, so use more
precise function names.
---
src/backend/access/transam/xlog.c | 4 ++--
src/backend/access/transam/xlogfuncs.c | 2 +-
src/backend/replication/walreceiverfuncs.c | 4 ++--
src/backend/replication/walsender.c | 2 +-
src/include/replication/walreceiver.h | 2 +-
5 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3813eadfb4..0c389e9315 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9261,7 +9261,7 @@ CreateRestartPoint(int flags)
* Retreat _logSegNo using the current end of xlog replayed or received,
* whichever is later.
*/
- receivePtr = GetWalRcvWriteRecPtr(NULL, NULL);
+ receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
replayPtr = GetXLogReplayRecPtr(&replayTLI);
endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
KeepLogSeg(endptr, &_logSegNo);
@@ -12082,7 +12082,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
{
XLogRecPtr latestChunkStart;
- receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI);
+ receivedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
if (RecPtr < receivedUpto && receiveTLI == curFileTLI)
{
havedata = true;
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 20316539b6..e075c1c71b 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -398,7 +398,7 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
{
XLogRecPtr recptr;
- recptr = GetWalRcvWriteRecPtr(NULL, NULL);
+ recptr = GetWalRcvFlushRecPtr(NULL, NULL);
if (recptr == 0)
PG_RETURN_NULL();
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 89c903e45a..9bce63b534 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -286,7 +286,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
}
/*
- * Returns the last+1 byte position that walreceiver has written.
+ * Returns the last+1 byte position that walreceiver has flushed.
*
* Optionally, returns the previous chunk start, that is the first byte
* written in the most recent walreceiver flush cycle. Callers not
@@ -294,7 +294,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
* receiveTLI.
*/
XLogRecPtr
-GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
{
WalRcvData *walrcv = WalRcv;
XLogRecPtr recptr;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index abb533b9d0..1079b3f8cb 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2903,7 +2903,7 @@ GetStandbyFlushRecPtr(void)
* has streamed, but hasn't been replayed yet.
*/
- receivePtr = GetWalRcvWriteRecPtr(NULL, &receiveTLI);
+ receivePtr = GetWalRcvFlushRecPtr(NULL, &receiveTLI);
replayPtr = GetXLogReplayRecPtr(&replayTLI);
ThisTimeLineID = replayTLI;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6548..147b374a26 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -322,7 +322,7 @@ extern bool WalRcvStreaming(void);
extern bool WalRcvRunning(void);
extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
const char *conninfo, const char *slotname);
-extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(void);
--
2.23.0
0003-Add-WalRcvGetWriteRecPtr-new-definition-v2.patch
From 165c9a9c5ecf300c2be1b79e2e480807416b2fae Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).
A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk. To do that,
it needs to be able to see the write pointer advancing in shared
memory.
The function formerly bearing this name was recently renamed to
GetWalRcvFlushRecPtr(), which better describes what it does.
---
src/backend/replication/walreceiver.c | 5 +++++
src/backend/replication/walreceiverfuncs.c | 10 ++++++++++
src/include/replication/walreceiver.h | 9 +++++++++
3 files changed, 24 insertions(+)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2ab15c3cbb..88a51ba35f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -244,6 +244,8 @@ WalReceiverMain(void)
SpinLockRelease(&walrcv->mutex);
+ pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
+
/* Arrange to clean up at walreceiver exit */
on_shmem_exit(WalRcvDie, 0);
@@ -985,6 +987,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
LogstreamResult.Write = recptr;
}
+
+ /* Update shared-memory status */
+ pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
}
/*
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 9bce63b534..14e9a6245a 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -310,6 +310,16 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
return recptr;
}
+/*
+ * Returns the last+1 byte position that walreceiver has written.
+ * This returns a recently written value without taking a lock.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtr(void)
+{
+ return pg_atomic_read_u64(&WalRcv->writtenUpto);
+}
+
/*
* Returns the replication apply delay in ms or -1
* if the apply delay info is not available
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 147b374a26..1e8f304dc4 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -16,6 +16,7 @@
#include "access/xlogdefs.h"
#include "getaddrinfo.h" /* for NI_MAXHOST */
#include "pgtime.h"
+#include "port/atomics.h"
#include "replication/logicalproto.h"
#include "replication/walsender.h"
#include "storage/latch.h"
@@ -83,6 +84,13 @@ typedef struct
XLogRecPtr receivedUpto;
TimeLineID receivedTLI;
+ /*
+ * Same as above, but advanced after writing and before flushing, without
+ * the need to acquire the spin lock. Data can be read by another process
+ * up to this point, but shouldn't be used for data integrity purposes.
+ */
+ pg_atomic_uint64 writtenUpto;
+
/*
* latestChunkStart is the starting byte position of the current "batch"
* of received WAL. It's actually the same as the previous value of
@@ -323,6 +331,7 @@ extern bool WalRcvRunning(void);
extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
const char *conninfo, const char *slotname);
extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(void);
extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(void);
--
2.23.0
0004-Allow-PrefetchBuffer-to-report-missing-file-in-re-v2.patch
From 9d7368f3328fdbe15d2078f44c8e4578bb90b84c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 30 Dec 2019 16:43:50 +1300
Subject: [PATCH 4/5] Allow PrefetchBuffer() to report missing file in
recovery.
Normally, smgrread() in recovery would create any missing files,
on the assumption that a later WAL record must unlink it. In
order to support prefetching buffers during recovery, we must
also handle missing files there. To give the caller the
opportunity to do that, return false to indicate that the
underlying file doesn't exist.
Also report whether a prefetch was actually initiated, so that
callers can limit the number of concurrent IOs they issue without
counting the prefetch calls that did nothing.
---
src/backend/storage/buffer/bufmgr.c | 9 +++++++--
src/backend/storage/smgr/md.c | 9 +++++++--
src/backend/storage/smgr/smgr.c | 9 ++++++---
src/include/storage/bufmgr.h | 12 ++++++++++--
src/include/storage/md.h | 2 +-
src/include/storage/smgr.h | 2 +-
6 files changed, 32 insertions(+), 11 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6e0875022c..5dbbcf8111 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -519,7 +519,7 @@ ComputeIoConcurrency(int io_concurrency, double *target)
return (new_prefetch_pages >= 0.0 && new_prefetch_pages < (double) INT_MAX);
}
-void
+PrefetchBufferResult
SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum, BlockNumber blockNum)
{
#ifdef USE_PREFETCH
@@ -545,7 +545,11 @@ SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum, BlockNumber blo
/* If not in buffers, initiate prefetch */
if (buf_id < 0)
- smgrprefetch(smgr_reln, forkNum, blockNum);
+ {
+ if (!smgrprefetch(smgr_reln, forkNum, blockNum))
+ return PREFETCH_BUFFER_NOREL;
+ return PREFETCH_BUFFER_MISS;
+ }
/*
* If the block *is* in buffers, we do nothing. This is not really ideal:
@@ -559,6 +563,7 @@ SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum, BlockNumber blo
* a problem to justify that.
*/
#endif
+ return PREFETCH_BUFFER_HIT;
}
/*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index c5b771c531..ba12fc2077 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -525,14 +525,17 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
/*
* mdprefetch() -- Initiate asynchronous read of the specified block of a relation
*/
-void
+bool
mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
{
#ifdef USE_PREFETCH
off_t seekpos;
MdfdVec *v;
- v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
+ if (v == NULL)
+ return false;
seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -540,6 +543,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
#endif /* USE_PREFETCH */
+
+ return true;
}
/*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..f6c8a37290 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -49,7 +49,7 @@ typedef struct f_smgr
bool isRedo);
void (*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
- void (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
+ bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
@@ -489,11 +489,14 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
/*
* smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
+ *
+ * In recovery only, this can return false to indicate that a file
+ * doesn't exist (presumably it has been dropped by a later commit).
*/
-void
+bool
smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
{
- smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
+ return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
}
/*
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 89a47afec1..5d7a796ba0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -159,12 +159,20 @@ extern PGDLLIMPORT int32 *LocalRefCount;
*/
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
+typedef enum PrefetchBufferResult
+{
+ PREFETCH_BUFFER_HIT,
+ PREFETCH_BUFFER_MISS,
+ PREFETCH_BUFFER_NOREL
+} PrefetchBufferResult;
+
/*
* prototypes for functions in bufmgr.c
*/
extern bool ComputeIoConcurrency(int io_concurrency, double *target);
-extern void SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum,
- BlockNumber blockNum);
+extern PrefetchBufferResult SharedPrefetchBuffer(SMgrRelation smgr_reln,
+ ForkNumber forkNum,
+ BlockNumber blockNum);
extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index ec7630ce3b..07fd1bb7d0 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -28,7 +28,7 @@ extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
extern void mdextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 243822137c..dc740443e2 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,7 +92,7 @@ extern void smgrdounlink(SMgrRelation reln, bool isRedo);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
--
2.23.0
0005-Prefetch-referenced-blocks-during-recovery-v2.patch
From 545ddb9055dfff3eff520d5fc854a8f4abfdf029 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 12 Feb 2020 18:17:24 +1300
Subject: [PATCH 5/5] Prefetch referenced blocks during recovery.
Introduce a new GUC wal_prefetch_distance. If it is set to a positive
number of bytes, then read ahead in the WAL at most that distance and
initiate asynchronous reading of referenced blocks, in the hope of
avoiding I/O stalls.
The number of concurrent asynchronous reads is limited by both
effective_io_concurrency and wal_prefetch_distance.
Author: Thomas Munro
Reviewed-by: Tomas Vondra
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
doc/src/sgml/config.sgml | 38 ++
src/backend/access/transam/Makefile | 1 +
src/backend/access/transam/xlog.c | 65 +++
src/backend/access/transam/xlogprefetcher.c | 456 ++++++++++++++++++++
src/backend/access/transam/xlogutils.c | 23 +-
src/backend/replication/logical/logical.c | 2 +-
src/backend/utils/misc/guc.c | 25 ++
src/include/access/xlog.h | 4 +
src/include/access/xlogprefetcher.h | 25 ++
src/include/access/xlogutils.h | 20 +
src/include/storage/bufmgr.h | 5 +
src/include/utils/guc.h | 2 +
12 files changed, 664 insertions(+), 2 deletions(-)
create mode 100644 src/backend/access/transam/xlogprefetcher.c
create mode 100644 src/include/access/xlogprefetcher.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f89ec..415b0793e1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3082,6 +3082,44 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-wal-prefetch-distance" xreflabel="wal_prefetch_distance">
+ <term><varname>wal_prefetch_distance</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_prefetch_distance</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ The maximum distance to look ahead in the WAL during recovery, to find
+ blocks to prefetch. Prefetching blocks that will soon be needed can
+ reduce I/O wait times. The number of concurrent prefetches is limited
+ by this setting as well as <xref linkend="guc-effective-io-concurrency"/>.
+ If this value is specified without units, it is taken as bytes.
+ The default is -1, meaning that WAL prefetching is disabled.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw">
+ <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to prefetch blocks with full page images during recovery.
+ Usually this doesn't help, since such blocks will not be read. However,
+ on file systems with a block size larger than
+ <productname>PostgreSQL</productname>'s, prefetching can avoid a costly
+ read-before-write when blocks are later written.
+ This setting has no effect unless
+ <xref linkend="guc-wal-prefetch-distance"/> is set to a positive number.
+ The default is off.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
<sect2 id="runtime-config-wal-archiving">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..20e044c7c8 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
xlogarchive.o \
xlogfuncs.o \
xloginsert.o \
+ xlogprefetcher.o \
xlogreader.o \
xlogutils.o
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0c389e9315..0f27a4da54 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -34,11 +34,13 @@
#include "access/xact.h"
#include "access/xlog_internal.h"
#include "access/xloginsert.h"
+#include "access/xlogprefetcher.h"
#include "access/xlogreader.h"
#include "access/xlogutils.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "catalog/storage_xlog.h"
#include "commands/tablespace.h"
#include "common/controldata_utils.h"
#include "miscadmin.h"
@@ -104,6 +106,8 @@ int wal_level = WAL_LEVEL_MINIMAL;
int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
int wal_retrieve_retry_interval = 5000;
+int wal_prefetch_distance = -1;
+bool wal_prefetch_fpw = false;
#ifdef WAL_DEBUG
bool XLOG_DEBUG = false;
@@ -801,6 +805,7 @@ static XLogSource readSource = 0; /* XLOG_FROM_* code */
*/
static XLogSource currentSource = 0; /* XLOG_FROM_* code */
static bool lastSourceFailed = false;
+static bool reset_wal_prefetcher = false;
typedef struct XLogPageReadPrivate
{
@@ -6191,6 +6196,7 @@ CheckRequiredParameterValues(void)
}
}
+
/*
* This must be called ONCE during postmaster or standalone-backend startup
*/
@@ -7046,6 +7052,7 @@ StartupXLOG(void)
{
ErrorContextCallback errcallback;
TimestampTz xtime;
+ XLogPrefetcher *prefetcher = NULL;
InRedo = true;
@@ -7053,6 +7060,9 @@ StartupXLOG(void)
(errmsg("redo starts at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
+ /* the first time through, see if we need to enable prefetching */
+ ResetWalPrefetcher();
+
/*
* main redo apply loop
*/
@@ -7082,6 +7092,31 @@ StartupXLOG(void)
/* Handle interrupt signals of startup process */
HandleStartupProcInterrupts();
+ /*
+ * The first time through, or if any relevant settings or the
+ * WAL source changes, we'll restart the prefetching machinery
+ * as appropriate. This is simpler than trying to handle
+ * various complicated state changes.
+ */
+ if (unlikely(reset_wal_prefetcher))
+ {
+ /* If we had one already, destroy it. */
+ if (prefetcher)
+ {
+ XLogPrefetcherFree(prefetcher);
+ prefetcher = NULL;
+ }
+ /* If we want one, create it. */
+ if (wal_prefetch_distance > 0)
+ prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr,
+ currentSource == XLOG_FROM_STREAM);
+ reset_wal_prefetcher = false;
+ }
+
+ /* Perform WAL prefetching, if enabled. */
+ if (prefetcher)
+ XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr);
+
/*
* Pause WAL replay, if requested by a hot-standby session via
* SetRecoveryPause().
@@ -7269,6 +7304,8 @@ StartupXLOG(void)
/*
* end of main redo apply loop
*/
+ if (prefetcher)
+ XLogPrefetcherFree(prefetcher);
if (reachedRecoveryTarget)
{
@@ -10128,6 +10165,24 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
}
}
+void
+assign_wal_prefetch_distance(int new_value, void *extra)
+{
+ /* Reset the WAL prefetcher, because a setting it depends on changed. */
+ wal_prefetch_distance = new_value;
+ if (AmStartupProcess())
+ ResetWalPrefetcher();
+}
+
+void
+assign_wal_prefetch_fpw(bool new_value, void *extra)
+{
+ /* Reset the WAL prefetcher, because a setting it depends on changed. */
+ wal_prefetch_fpw = new_value;
+ if (AmStartupProcess())
+ ResetWalPrefetcher();
+}
+
/*
* Issue appropriate kind of fsync (if any) for an XLOG output file.
@@ -11911,6 +11966,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* and move on to the next state.
*/
currentSource = XLOG_FROM_STREAM;
+ ResetWalPrefetcher();
break;
case XLOG_FROM_STREAM:
@@ -12334,3 +12390,12 @@ XLogRequestWalReceiverReply(void)
{
doRequestWalReceiverReply = true;
}
+
+/*
+ * Schedule a WAL prefetcher reset, on change of relevant settings.
+ */
+void
+ResetWalPrefetcher(void)
+{
+ reset_wal_prefetcher = true;
+}
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
new file mode 100644
index 0000000000..6b565dc313
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -0,0 +1,456 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ * Prefetching support for PostgreSQL write-ahead log manager
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogprefetcher.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetcher.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/hsearch.h"
+
+/*
+ * Internal state used for book-keeping.
+ */
+struct XLogPrefetcher
+{
+ /* Reader and current reading state. */
+ XLogReaderState *reader;
+ XLogReadLocalOptions options;
+ bool have_record;
+ bool shutdown;
+ int next_block_id;
+
+ /* Book-keeping required to avoid accessing non-existing blocks. */
+ HTAB *filter_table;
+ dlist_head filter_queue;
+
+ /* Book-keeping required to limit concurrent prefetches. */
+ XLogRecPtr *prefetch_queue;
+ int prefetch_queue_size;
+ int prefetch_head;
+ int prefetch_tail;
+
+ /* Details of last prefetched block. */
+ SMgrRelation last_reln;
+ RelFileNode last_rnode;
+ BlockNumber last_blkno;
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+ RelFileNode rnode;
+ XLogRecPtr filter_until_replayed;
+ BlockNumber filter_from_block;
+ dlist_node link;
+} XLogPrefetcherFilter;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+ RelFileNode rnode,
+ BlockNumber blockno,
+ XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+ RelFileNode rnode,
+ BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
+{
+ static HASHCTL hash_table_ctl = {
+ .keysize = sizeof(RelFileNode),
+ .entrysize = sizeof(XLogPrefetcherFilter)
+ };
+ XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher));
+
+ prefetcher->options.nowait = true;
+ if (streaming)
+ {
+ /*
+ * We're only allowed to read as far as the WAL receiver has written.
+ * We don't have to wait for it to be flushed, though, as recovery
+ * does, so that gives us a chance to get a bit further ahead.
+ */
+ prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+ }
+ else
+ {
+ /* We're allowed to read as far as we can. */
+ prefetcher->options.read_upto_policy = XLRO_LSN;
+ prefetcher->options.lsn = (XLogRecPtr) -1;
+ }
+ prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+ NULL,
+ read_local_xlog_page,
+ &prefetcher->options);
+ prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024,
+ &hash_table_ctl,
+ HASH_ELEM | HASH_BLOBS);
+ dlist_init(&prefetcher->filter_queue);
+
+ /*
+ * The size of the queue is determined by target_prefetch_pages, which is
+ * derived from effective_io_concurrency. In theory we might have a
+ * separate queue for each tablespace, but it's not clear how that should
+ * work, so for now we'll just use the system-wide GUC to rate-limit all
+ * prefetching.
+ */
+ prefetcher->prefetch_queue_size = target_prefetch_pages;
+ prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);
+ prefetcher->prefetch_head = prefetcher->prefetch_tail = 0;
+
+ /* Prepare to read at the given LSN. */
+ XLogBeginRead(prefetcher->reader, lsn);
+
+ return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+ XLogReaderFree(prefetcher->reader);
+ hash_destroy(prefetcher->filter_table);
+ pfree(prefetcher->prefetch_queue);
+ pfree(prefetcher);
+}
+
+/*
+ * Read ahead in the WAL, as far as we can within the limits set by the user.
+ * Begin fetching any referenced blocks that are not already in the buffer
+ * pool.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ /*
+ * If an error has occurred or we've hit the end of the WAL or a timeline
+ * change, do nothing. Eventually we might be restarted by the recovery
+ * loop deciding to reset us due to a new timeline or a GUC change.
+ */
+ if (prefetcher->shutdown)
+ return;
+
+ /*
+ * Have any in-flight prefetches definitely completed, judging by the LSN
+ * that is currently being replayed?
+ */
+ XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+ /*
+ * Do we already have the maximum permitted number of IOs running
+ * (according to the information we have)? If so, we have to wait for at
+ * least one to complete, so give up early.
+ */
+ if (XLogPrefetcherSaturated(prefetcher))
+ return;
+
+ /* Can we drop any filters yet, due to problem records being replayed? */
+ XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+ /* Main prefetch loop. */
+ for (;;)
+ {
+ XLogReaderState *reader = prefetcher->reader;
+ char *error;
+
+ /* If we don't already have a record, then try to read one. */
+ if (!prefetcher->have_record)
+ {
+ if (!XLogReadRecord(reader, &error))
+ {
+ /* If we got an error, log it and give up. */
+ if (error)
+ {
+ elog(LOG, "WAL prefetch: %s", error);
+ prefetcher->shutdown = true;
+ }
+ /* Otherwise, we'll try again later when more data is here. */
+ return;
+ }
+ prefetcher->have_record = true;
+ prefetcher->next_block_id = 0;
+ }
+
+ /* Are we too far ahead of replay? */
+ if (prefetcher->reader->ReadRecPtr >= replaying_lsn + wal_prefetch_distance)
+ break;
+
+ /*
+ * If this is a record that creates a new SMGR relation, we'll avoid
+ * prefetching anything from that rnode until it has been replayed.
+ */
+ if (replaying_lsn < reader->ReadRecPtr &&
+ XLogRecGetRmid(reader) == RM_SMGR_ID &&
+ (XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+ XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+ reader->ReadRecPtr);
+ }
+
+ /*
+ * Scan the record for block references. We might already have been
+ * partway through processing this record when we hit maximum I/O
+ * concurrency, so start where we left off.
+ */
+ for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i)
+ {
+ DecodedBkpBlock *block = &reader->blocks[i];
+ SMgrRelation reln;
+
+ /* Ignore everything but the main fork for now. */
+ if (block->forknum != MAIN_FORKNUM)
+ continue;
+
+ /*
+ * If there is a full page image attached, we won't be reading the
+ * page, so you might think we should skip it. However, if the
+ * underlying filesystem uses larger logical blocks than us, it
+ * might still need to perform a read-before-write some time later.
+ * Therefore, only prefetch if configured to do so.
+ */
+ if (block->has_image && !wal_prefetch_fpw)
+ continue;
+
+ /*
+ * If this block will initialize a new page then it's probably an
+ * extension. Since it might create a new segment, we can't try
+ * to prefetch this block until the record has been replayed, or we
+ * might try to open a file that doesn't exist yet.
+ */
+ if (block->flags & BKPBLOCK_WILL_INIT)
+ {
+ XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+ reader->ReadRecPtr);
+ continue;
+ }
+
+ /* Should we skip this block due to a filter? */
+ if (XLogPrefetcherIsFiltered(prefetcher, block->rnode,
+ block->blkno))
+ continue;
+
+ /* Fast path for repeated references to the same relation. */
+ if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+ {
+ /*
+ * If this is a repeat or sequential access, then skip it. We
+ * expect the kernel to detect sequential access on its own
+ * and do a better job than we could.
+ */
+ if (block->blkno == prefetcher->last_blkno ||
+ block->blkno == prefetcher->last_blkno + 1)
+ {
+ prefetcher->last_blkno = block->blkno;
+ continue;
+ }
+
+ /* We can avoid calling smgropen(). */
+ reln = prefetcher->last_reln;
+ }
+ else
+ {
+ /* Otherwise we have to open it. */
+ reln = smgropen(block->rnode, InvalidBackendId);
+ prefetcher->last_rnode = block->rnode;
+ prefetcher->last_reln = reln;
+ }
+ prefetcher->last_blkno = block->blkno;
+
+ /* Try to prefetch this block! */
+ switch (SharedPrefetchBuffer(reln, block->forknum, block->blkno))
+ {
+ case PREFETCH_BUFFER_HIT:
+ /* It's already cached, so do nothing. */
+ break;
+ case PREFETCH_BUFFER_MISS:
+ /*
+ * I/O has possibly been initiated (though we don't know if it
+ * was already cached by the kernel, so we just have to assume
+ * that it has due to lack of better information). Record
+ * this as an I/O in progress until eventually we replay this
+ * LSN.
+ */
+ XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+ /*
+ * If the queue is now full, we'll have to wait before
+ * processing any more blocks from this record.
+ */
+ if (XLogPrefetcherSaturated(prefetcher))
+ {
+ prefetcher->next_block_id = i + 1;
+ return;
+ }
+ break;
+ case PREFETCH_BUFFER_NOREL:
+ /*
+ * The underlying segment file doesn't exist. Presumably it
+ * will be unlinked by a later WAL record. When recovery
+ * reads this block, it will use the EXTENSION_CREATE_RECOVERY
+ * flag. We certainly don't want to do that sort of thing
+ * while merely prefetching, so let's just ignore references
+ * to this relation until this record is replayed, and let
+ * recovery create the dummy file or complain if something is
+ * wrong.
+ */
+ XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+ reader->ReadRecPtr);
+ break;
+ }
+ }
+
+ /* Advance to the next record. */
+ prefetcher->have_record = false;
+ }
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+ BlockNumber blockno, XLogRecPtr lsn)
+{
+ XLogPrefetcherFilter *filter;
+ bool found;
+
+ filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+ if (!found)
+ {
+ /*
+ * Don't allow any prefetching of this block or higher until replayed.
+ */
+ filter->filter_until_replayed = lsn;
+ filter->filter_from_block = blockno;
+ dlist_push_head(&prefetcher->filter_queue, &filter->link);
+ }
+ else
+ {
+ /*
+ * We were already filtering this rnode. Extend the filter's lifetime
+ * to cover this WAL record, but leave the (presumably lower) block
+ * number there because we don't want to have to track individual
+ * blocks.
+ */
+ filter->filter_until_replayed = lsn;
+ dlist_delete(&filter->link);
+ dlist_push_head(&prefetcher->filter_queue, &filter->link);
+ }
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range? That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+ {
+ XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+ link,
+ &prefetcher->filter_queue);
+
+ if (filter->filter_until_replayed >= replaying_lsn)
+ break;
+ dlist_delete(&filter->link);
+ hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+ }
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+ BlockNumber blockno)
+{
+ /*
+ * Test for empty queue first, because we expect it to be empty most of the
+ * time and we can avoid the hash table lookup in that case.
+ */
+ if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+ {
+ XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+ HASH_FIND, NULL);
+
+ if (filter && filter->filter_from_block <= blockno)
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Insert an LSN into the queue. The queue must not be full already. This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * IO, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr prefetching_lsn)
+{
+ Assert(!XLogPrefetcherSaturated(prefetcher));
+ prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+ prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet? That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches. For now
+ * we don't have any awareness of when IO really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+ prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+ {
+ prefetcher->prefetch_tail++;
+ prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+ }
+}
+
+/*
+ * Check if the maximum allowed number of IOs is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+ return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+ prefetcher->prefetch_tail;
+}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b217ffa52f..fad2acb514 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walreceiver.h"
#include "storage/smgr.h"
#include "utils/guc.h"
#include "utils/hsearch.h"
@@ -827,6 +828,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
TimeLineID tli;
int count;
WALReadError errinfo;
+ XLogReadLocalOptions *options = (XLogReadLocalOptions *) state->private_data;
loc = targetPagePtr + reqLen;
@@ -841,7 +843,23 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
* notices recovery finishes, so we only have to maintain it for the
* local process until recovery ends.
*/
- if (!RecoveryInProgress())
+ if (options)
+ {
+ switch (options->read_upto_policy)
+ {
+ case XLRO_WALRCV_WRITTEN:
+ read_upto = GetWalRcvWriteRecPtr();
+ break;
+ case XLRO_LSN:
+ read_upto = options->lsn;
+ break;
+ default:
+ read_upto = 0;
+ elog(ERROR, "unknown read_upto_policy value");
+ break;
+ }
+ }
+ else if (!RecoveryInProgress())
read_upto = GetFlushRecPtr();
else
read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
@@ -879,6 +897,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
if (loc <= read_upto)
break;
+ if (options && options->nowait)
+ break;
+
CHECK_FOR_INTERRUPTS();
pg_usleep(1000L);
}
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index e3da7d3625..34f3017871 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -169,7 +169,7 @@ StartupDecodingContext(List *output_plugin_options,
ctx->slot = slot;
- ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, ctx);
+ ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, NULL);
if (!ctx->reader)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8228e1f390..2e07e2394a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1240,6 +1240,18 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"wal_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Prefetch blocks that have full page images in the WAL"),
+ gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+ "entirely overwritten, but if the logical page size of the filesystem is "
+ "larger than PostgreSQL's, this can be beneficial. This option has no "
+ "effect unless wal_prefetch_distance is set to a positive number.")
+ },
+ &wal_prefetch_fpw,
+ false,
+ NULL, assign_wal_prefetch_fpw, NULL
+ },
{
{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2626,6 +2638,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"wal_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+ gettext_noop("How many bytes to read ahead in the WAL to prefetch referenced blocks."),
+ gettext_noop("Set to -1 to disable WAL prefetching."),
+ GUC_UNIT_BYTE
+ },
+ &wal_prefetch_distance,
+ -1, -1, INT_MAX,
+ NULL, assign_wal_prefetch_distance, NULL
+ },
+
{
{"wal_keep_segments", PGC_SIGHUP, REPLICATION_SENDING,
gettext_noop("Sets the number of WAL files held for standby servers."),
@@ -11484,6 +11507,8 @@ assign_effective_io_concurrency(int newval, void *extra)
{
#ifdef USE_PREFETCH
target_prefetch_pages = *((int *) extra);
+ if (AmStartupProcess())
+ ResetWalPrefetcher();
#endif /* USE_PREFETCH */
}
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033fc20..0a31edfba4 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -111,6 +111,8 @@ extern int wal_keep_segments;
extern int XLOGbuffers;
extern int XLogArchiveTimeout;
extern int wal_retrieve_retry_interval;
+extern int wal_prefetch_distance;
+extern bool wal_prefetch_fpw;
extern char *XLogArchiveCommand;
extern bool EnableHotStandby;
extern bool fullPageWrites;
@@ -319,6 +321,8 @@ extern void SetWalWriterSleeping(bool sleeping);
extern void XLogRequestWalReceiverReply(void);
+extern void ResetWalPrefetcher(void);
+
extern void assign_max_wal_size(int newval, void *extra);
extern void assign_checkpoint_completion_target(double newval, void *extra);
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
new file mode 100644
index 0000000000..070ffc5c85
--- /dev/null
+++ b/src/include/access/xlogprefetcher.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.h
+ * Declarations for the XLog prefetching facility
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/xlogprefetcher.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCHER_H
+#define XLOGPREFETCHER_H
+
+#include "access/xlogdefs.h"
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch, XLogRecPtr replaying_lsn);
+
+#endif
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5181a077d9..1c8e67d74a 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,6 +47,26 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
extern void FreeFakeRelcacheEntry(Relation fakerel);
+/*
+ * A pointer to an XLogReadLocalOptions struct can be supplied as the private
+ * data for an xlog reader, causing read_local_xlog_page to modify its
+ * behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+ /* Don't block waiting for new WAL to arrive. */
+ bool nowait;
+
+ /* How far to read. */
+ enum {
+ XLRO_WALRCV_WRITTEN,
+ XLRO_LSN
+ } read_upto_policy;
+
+ /* If read_upto_policy is XLRO_LSN, the LSN. */
+ XLogRecPtr lsn;
+} XLogReadLocalOptions;
+
extern int read_local_xlog_page(XLogReaderState *state,
XLogRecPtr targetPagePtr, int reqLen,
XLogRecPtr targetRecPtr, char *cur_page);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 5d7a796ba0..6e91c33f3d 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -159,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
*/
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
+/*
+ * When you try to prefetch a buffer, there are three possibilities: it's
+ * already cached in our buffer pool, it's not cached but we can tell the
+ * kernel that we'll be loading it soon, or the relation file doesn't exist.
+ */
typedef enum PrefetchBufferResult
{
PREFETCH_BUFFER_HIT,
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ce93ace76c..903b0ec02b 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -438,5 +438,7 @@ extern void assign_search_path(const char *newval, void *extra);
/* in access/transam/xlog.c */
extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+extern void assign_wal_prefetch_distance(int new_value, void *extra);
+extern void assign_wal_prefetch_fpw(bool new_value, void *extra);
#endif /* GUC_H */
--
2.23.0
On Wed, Feb 12, 2020 at 7:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
1. It now uses effective_io_concurrency to control how many
concurrent prefetches to allow. It's possible that we should have a
different GUC to control "maintenance" users of concurrency I/O as
discussed elsewhere[1], but I'm staying out of that for now; if we
agree to do that for VACUUM etc, we can change it easily here. Note
that the value is percolated through the ComputeIoConcurrency()
function which I think we should discuss, but again that's off topic,
I just want to use the standard infrastructure here.
I started a separate thread[1]/messages/by-id/CA+hUKGJUw08dPs_3EUcdO6M90GnjofPYrWp4YSLaBkgYwS-AqA@mail.gmail.com to discuss that GUC, because it's
basically an independent question. Meanwhile, here's a new version of
the WAL prefetch patch, with the following changes:
1. A monitoring view:
postgres=# select * from pg_stat_wal_prefetcher ;
prefetch | skip_hit | skip_new | skip_fpw | skip_seq | distance | queue_depth
----------+----------+----------+----------+----------+----------+-------------
95854 | 291458 | 435 | 0 | 26245 | 261800 | 10
(1 row)
That shows a bunch of counters for blocks prefetched and skipped for
various reasons. It also shows the current read-ahead distance (in
bytes of WAL) and queue depth (an approximation of how many I/Os might
be in flight, used for rate limiting; I'm struggling to come up with a
better short name for this). This can be used to see the effects of
experiments with different settings, eg:
alter system set effective_io_concurrency = 20;
alter system set wal_prefetch_distance = '256kB';
select pg_reload_conf();
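While recovery is running, you can then watch the counters move by sampling
the view from psql; this is just an illustrative query over the columns
shown above:
select distance, queue_depth, prefetch, skip_hit, skip_fpw, skip_seq
from pg_stat_wal_prefetcher;
\watch 1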
2. A log message when WAL prefetching begins and ends, so you can see
what it did during crash recovery:
LOG: WAL prefetch finished at 0/C5E98758; prefetch = 1112628,
skip_hit = 3607540, skip_new = 45592, skip_fpw = 0, skip_seq = 177049,
avg_distance = 247907.942532, avg_queue_depth = 22.261352
3. A bit of general user documentation.
[1]: /messages/by-id/CA+hUKGJUw08dPs_3EUcdO6M90GnjofPYrWp4YSLaBkgYwS-AqA@mail.gmail.com
Attachments:
0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRelatio.patch
From a61b4e00c42ace5db1608e02165f89094bf86391 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 3 Dec 2019 17:13:40 +1300
Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.
Previously a Relation was required, but it's annoying to have
to create a "fake" one in recovery.
---
src/backend/storage/buffer/bufmgr.c | 77 ++++++++++++++++-------------
src/include/storage/bufmgr.h | 3 ++
2 files changed, 47 insertions(+), 33 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5880054245..6e0875022c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -519,6 +519,48 @@ ComputeIoConcurrency(int io_concurrency, double *target)
return (new_prefetch_pages >= 0.0 && new_prefetch_pages < (double) INT_MAX);
}
+void
+SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum, BlockNumber blockNum)
+{
+#ifdef USE_PREFETCH
+ BufferTag newTag; /* identity of requested block */
+ uint32 newHash; /* hash value for newTag */
+ LWLock *newPartitionLock; /* buffer partition lock for it */
+ int buf_id;
+
+ Assert(BlockNumberIsValid(blockNum));
+
+ /* create a tag so we can lookup the buffer */
+ INIT_BUFFERTAG(newTag, smgr_reln->smgr_rnode.node,
+ forkNum, blockNum);
+
+ /* determine its hash code and partition lock ID */
+ newHash = BufTableHashCode(&newTag);
+ newPartitionLock = BufMappingPartitionLock(newHash);
+
+ /* see if the block is in the buffer pool already */
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = BufTableLookup(&newTag, newHash);
+ LWLockRelease(newPartitionLock);
+
+ /* If not in buffers, initiate prefetch */
+ if (buf_id < 0)
+ smgrprefetch(smgr_reln, forkNum, blockNum);
+
+ /*
+ * If the block *is* in buffers, we do nothing. This is not really ideal:
+ * the block might be just about to be evicted, which would be stupid
+ * since we know we are going to need it soon. But the only easy answer
+ * is to bump the usage_count, which does not seem like a great solution:
+ * when the caller does ultimately touch the block, usage_count would get
+ * bumped again, resulting in too much favoritism for blocks that are
+ * involved in a prefetch sequence. A real fix would involve some
+ * additional per-buffer state, and it's not clear that there's enough of
+ * a problem to justify that.
+ */
+#endif
+}
+
/*
* PrefetchBuffer -- initiate asynchronous read of a block of a relation
*
@@ -550,39 +592,8 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
}
else
{
- BufferTag newTag; /* identity of requested block */
- uint32 newHash; /* hash value for newTag */
- LWLock *newPartitionLock; /* buffer partition lock for it */
- int buf_id;
-
- /* create a tag so we can lookup the buffer */
- INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
- forkNum, blockNum);
-
- /* determine its hash code and partition lock ID */
- newHash = BufTableHashCode(&newTag);
- newPartitionLock = BufMappingPartitionLock(newHash);
-
- /* see if the block is in the buffer pool already */
- LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
- LWLockRelease(newPartitionLock);
-
- /* If not in buffers, initiate prefetch */
- if (buf_id < 0)
- smgrprefetch(reln->rd_smgr, forkNum, blockNum);
-
- /*
- * If the block *is* in buffers, we do nothing. This is not really
- * ideal: the block might be just about to be evicted, which would be
- * stupid since we know we are going to need it soon. But the only
- * easy answer is to bump the usage_count, which does not seem like a
- * great solution: when the caller does ultimately touch the block,
- * usage_count would get bumped again, resulting in too much
- * favoritism for blocks that are involved in a prefetch sequence. A
- * real fix would involve some additional per-buffer state, and it's
- * not clear that there's enough of a problem to justify that.
- */
+ /* pass it to the shared buffer version */
+ SharedPrefetchBuffer(reln->rd_smgr, forkNum, blockNum);
}
#endif /* USE_PREFETCH */
}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 73c7e9ba38..89a47afec1 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -18,6 +18,7 @@
#include "storage/buf.h"
#include "storage/bufpage.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
#include "utils/relcache.h"
#include "utils/snapmgr.h"
@@ -162,6 +163,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
* prototypes for functions in bufmgr.c
*/
extern bool ComputeIoConcurrency(int io_concurrency, double *target);
+extern void SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum,
+ BlockNumber blockNum);
extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
--
2.20.1
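To illustrate what 0001 buys us, here's a minimal sketch (not part of the
patch), assuming rnode and blkno have already been pulled out of a decoded
WAL record:

  SMgrRelation reln = smgropen(rnode, InvalidBackendId);  /* no Relation needed */

  SharedPrefetchBuffer(reln, MAIN_FORKNUM, blkno);        /* hint that we'll read this soon */

No fake relcache entry has to be created in recovery; the 0005 patch below
uses essentially this pattern in its main loop.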
0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecPtr.patch
From acbff1444d0acce71b0218ce083df03992af1581 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:10:17 +1300
Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().
The new name better reflects the fact that the value it returns
is updated only when received data has been flushed to disk.
An upcoming patch will make use of the latest data that was
written without waiting for it to be flushed, so use more
precise function names.
---
src/backend/access/transam/xlog.c | 4 ++--
src/backend/access/transam/xlogfuncs.c | 2 +-
src/backend/replication/walreceiverfuncs.c | 4 ++--
src/backend/replication/walsender.c | 2 +-
src/include/replication/walreceiver.h | 2 +-
5 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d19408b3be..cc7072ba13 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9283,7 +9283,7 @@ CreateRestartPoint(int flags)
* Retreat _logSegNo using the current end of xlog replayed or received,
* whichever is later.
*/
- receivePtr = GetWalRcvWriteRecPtr(NULL, NULL);
+ receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
replayPtr = GetXLogReplayRecPtr(&replayTLI);
endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
KeepLogSeg(endptr, &_logSegNo);
@@ -12104,7 +12104,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
{
XLogRecPtr latestChunkStart;
- receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI);
+ receivedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
if (RecPtr < receivedUpto && receiveTLI == curFileTLI)
{
havedata = true;
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 20316539b6..e075c1c71b 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -398,7 +398,7 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
{
XLogRecPtr recptr;
- recptr = GetWalRcvWriteRecPtr(NULL, NULL);
+ recptr = GetWalRcvFlushRecPtr(NULL, NULL);
if (recptr == 0)
PG_RETURN_NULL();
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 89c903e45a..9bce63b534 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -286,7 +286,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
}
/*
- * Returns the last+1 byte position that walreceiver has written.
+ * Returns the last+1 byte position that walreceiver has flushed.
*
* Optionally, returns the previous chunk start, that is the first byte
* written in the most recent walreceiver flush cycle. Callers not
@@ -294,7 +294,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
* receiveTLI.
*/
XLogRecPtr
-GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
{
WalRcvData *walrcv = WalRcv;
XLogRecPtr recptr;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index abb533b9d0..1079b3f8cb 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2903,7 +2903,7 @@ GetStandbyFlushRecPtr(void)
* has streamed, but hasn't been replayed yet.
*/
- receivePtr = GetWalRcvWriteRecPtr(NULL, &receiveTLI);
+ receivePtr = GetWalRcvFlushRecPtr(NULL, &receiveTLI);
replayPtr = GetXLogReplayRecPtr(&replayTLI);
ThisTimeLineID = replayTLI;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6548..147b374a26 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -322,7 +322,7 @@ extern bool WalRcvStreaming(void);
extern bool WalRcvRunning(void);
extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
const char *conninfo, const char *slotname);
-extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(void);
--
2.20.1
0003-Add-WalRcvGetWriteRecPtr-new-definition.patch
From d7fa7d82c5f68d0cccf441ce9e8dfa40f64d3e0d Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).
A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk. To do that,
it needs to be able to see the write pointer advancing in shared
memory.
The function formerly bearing this name was recently renamed to
GetWalRcvFlushRecPtr(), which better describes what it does.
---
src/backend/replication/walreceiver.c | 5 +++++
src/backend/replication/walreceiverfuncs.c | 10 ++++++++++
src/include/replication/walreceiver.h | 9 +++++++++
3 files changed, 24 insertions(+)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2ab15c3cbb..88a51ba35f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -244,6 +244,8 @@ WalReceiverMain(void)
SpinLockRelease(&walrcv->mutex);
+ pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
+
/* Arrange to clean up at walreceiver exit */
on_shmem_exit(WalRcvDie, 0);
@@ -985,6 +987,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
LogstreamResult.Write = recptr;
}
+
+ /* Update shared-memory status */
+ pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
}
/*
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 9bce63b534..14e9a6245a 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -310,6 +310,16 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
return recptr;
}
+/*
+ * Returns the last+1 byte position that walreceiver has written.
+ * This returns a recently written value without taking a lock.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtr(void)
+{
+ return pg_atomic_read_u64(&WalRcv->writtenUpto);
+}
+
/*
* Returns the replication apply delay in ms or -1
* if the apply delay info is not available
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 147b374a26..1e8f304dc4 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -16,6 +16,7 @@
#include "access/xlogdefs.h"
#include "getaddrinfo.h" /* for NI_MAXHOST */
#include "pgtime.h"
+#include "port/atomics.h"
#include "replication/logicalproto.h"
#include "replication/walsender.h"
#include "storage/latch.h"
@@ -83,6 +84,13 @@ typedef struct
XLogRecPtr receivedUpto;
TimeLineID receivedTLI;
+ /*
+ * Same as above, but advanced after writing and before flushing, without
+ * the need to acquire the spin lock. Data can be read by another process
+ * up to this point, but shouldn't be used for data integrity purposes.
+ */
+ pg_atomic_uint64 writtenUpto;
+
/*
* latestChunkStart is the starting byte position of the current "batch"
* of received WAL. It's actually the same as the previous value of
@@ -323,6 +331,7 @@ extern bool WalRcvRunning(void);
extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
const char *conninfo, const char *slotname);
extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(void);
extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(void);
--
2.20.1
0004-Allow-PrefetchBuffer-to-report-the-outcome.patch
From f9a53985e0e30659caa41c95c85001c91b3deb5f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 30 Dec 2019 16:43:50 +1300
Subject: [PATCH 4/5] Allow PrefetchBuffer() to report the outcome.
Report when a relation's backing file is missing, to prepare
for use during recovery. This will be used to handle cases of
relations that are referenced in the WAL but have been unlinked
already due to actions covered by WAL records that haven't been
replayed yet, after a crash.
Also report whether a prefetch was actually initiated, so that
callers can limit the number of concurrent I/Os they try to
issue, without counting the prefetch calls that did nothing
because the page was already in our buffers.
Author: Thomas Munro
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/storage/buffer/bufmgr.c | 9 +++++++--
src/backend/storage/smgr/md.c | 9 +++++++--
src/backend/storage/smgr/smgr.c | 10 +++++++---
src/include/storage/bufmgr.h | 12 ++++++++++--
src/include/storage/md.h | 2 +-
src/include/storage/smgr.h | 2 +-
6 files changed, 33 insertions(+), 11 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6e0875022c..5dbbcf8111 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -519,7 +519,7 @@ ComputeIoConcurrency(int io_concurrency, double *target)
return (new_prefetch_pages >= 0.0 && new_prefetch_pages < (double) INT_MAX);
}
-void
+PrefetchBufferResult
SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum, BlockNumber blockNum)
{
#ifdef USE_PREFETCH
@@ -545,7 +545,11 @@ SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum, BlockNumber blo
/* If not in buffers, initiate prefetch */
if (buf_id < 0)
- smgrprefetch(smgr_reln, forkNum, blockNum);
+ {
+ if (!smgrprefetch(smgr_reln, forkNum, blockNum))
+ return PREFETCH_BUFFER_NOREL;
+ return PREFETCH_BUFFER_MISS;
+ }
/*
* If the block *is* in buffers, we do nothing. This is not really ideal:
@@ -559,6 +563,7 @@ SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum, BlockNumber blo
* a problem to justify that.
*/
#endif
+ return PREFETCH_BUFFER_HIT;
}
/*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index c5b771c531..ba12fc2077 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -525,14 +525,17 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
/*
* mdprefetch() -- Initiate asynchronous read of the specified block of a relation
*/
-void
+bool
mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
{
#ifdef USE_PREFETCH
off_t seekpos;
MdfdVec *v;
- v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
+ if (v == NULL)
+ return false;
seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -540,6 +543,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
#endif /* USE_PREFETCH */
+
+ return true;
}
/*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..c39dd533e6 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -49,7 +49,7 @@ typedef struct f_smgr
bool isRedo);
void (*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
- void (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
+ bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
@@ -489,11 +489,15 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
/*
* smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
+ *
+ * In recovery only, this can return false to indicate that a file
+ * doesn't exist (presumably it has been dropped by a later WAL
+ * record).
*/
-void
+bool
smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
{
- smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
+ return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
}
/*
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 89a47afec1..5d7a796ba0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -159,12 +159,20 @@ extern PGDLLIMPORT int32 *LocalRefCount;
*/
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
+typedef enum PrefetchBufferResult
+{
+ PREFETCH_BUFFER_HIT,
+ PREFETCH_BUFFER_MISS,
+ PREFETCH_BUFFER_NOREL
+} PrefetchBufferResult;
+
/*
* prototypes for functions in bufmgr.c
*/
extern bool ComputeIoConcurrency(int io_concurrency, double *target);
-extern void SharedPrefetchBuffer(SMgrRelation smgr_reln, ForkNumber forkNum,
- BlockNumber blockNum);
+extern PrefetchBufferResult SharedPrefetchBuffer(SMgrRelation smgr_reln,
+ ForkNumber forkNum,
+ BlockNumber blockNum);
extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index ec7630ce3b..07fd1bb7d0 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -28,7 +28,7 @@ extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
extern void mdextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 243822137c..dc740443e2 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,7 +92,7 @@ extern void smgrdounlink(SMgrRelation reln, bool isRedo);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
--
2.20.1
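To show how a caller is expected to react to the new return value, here's a
quick sketch, assuming reln, forknum and blkno are at hand; the 0005 patch
below does the real thing in its main prefetch loop:

  switch (SharedPrefetchBuffer(reln, forknum, blkno))
  {
      case PREFETCH_BUFFER_HIT:
          /* already in shared buffers, nothing to do */
          break;
      case PREFETCH_BUFFER_MISS:
          /* an asynchronous read may have been initiated; count it as in flight */
          break;
      case PREFETCH_BUFFER_NOREL:
          /* underlying file is missing; stop prefetching this relation for now */
          break;
  }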
0005-Prefetch-referenced-blocks-during-recovery.patch
From 6dc2cfa4b64ac25513c36538272e08b937bd46a4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 2 Mar 2020 15:33:51 +1300
Subject: [PATCH 5/5] Prefetch referenced blocks during recovery.
Introduce a new GUC wal_prefetch_distance. If it is set to a positive
number of bytes, then read ahead in the WAL at most that distance, and
initiate asynchronous reading of referenced blocks. The goal is to
avoid I/O stalls and benefit from concurrent I/O.
The number of concurrent asynchronous reads is limited by both
effective_io_concurrency and wal_prefetch_distance. The feature is
disabled by default.
Author: Thomas Munro
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
doc/src/sgml/config.sgml | 38 ++
doc/src/sgml/monitoring.sgml | 69 +++
doc/src/sgml/wal.sgml | 12 +
src/backend/access/transam/Makefile | 1 +
src/backend/access/transam/xlog.c | 64 ++
src/backend/access/transam/xlogprefetcher.c | 653 ++++++++++++++++++++
src/backend/access/transam/xlogutils.c | 23 +-
src/backend/catalog/system_views.sql | 11 +
src/backend/replication/logical/logical.c | 2 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/utils/misc/guc.c | 25 +
src/include/access/xlog.h | 4 +
src/include/access/xlogprefetcher.h | 28 +
src/include/access/xlogutils.h | 20 +
src/include/catalog/pg_proc.dat | 8 +
src/include/storage/bufmgr.h | 5 +
src/include/utils/guc.h | 2 +
src/test/regress/expected/rules.out | 8 +
18 files changed, 974 insertions(+), 2 deletions(-)
create mode 100644 src/backend/access/transam/xlogprefetcher.c
create mode 100644 src/include/access/xlogprefetcher.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f89ec..415b0793e1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3082,6 +3082,44 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-wal-prefetch-distance" xreflabel="wal_prefetch_distance">
+ <term><varname>wal_prefetch_distance</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_prefetch_distance</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ The maximum distance to look ahead in the WAL during recovery, to find
+ blocks to prefetch. Prefetching blocks that will soon be needed can
+ reduce I/O wait times. The number of concurrent prefetches is limited
+ by this setting as well as <xref linkend="guc-effective-io-concurrency"/>.
+ If this value is specified without units, it is taken as bytes.
+ The default is -1, meaning that WAL prefetching is disabled.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw">
+ <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to prefetch blocks with full page images during recovery.
+ Usually this doesn't help, since such blocks will not be read. However,
+ on file systems with a block size larger than
+ <productname>PostgreSQL</productname>'s, prefetching can avoid a costly
+ read-before-write when the blocks are later written.
+ This setting has no effect unless
+ <xref linkend="guc-wal-prefetch-distance"/> is set to a positive number.
+ The default is off.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
<sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 87586a7b06..013537d2be 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_wal_prefetcher</structname><indexterm><primary>pg_stat_wal_prefetcher</primary></indexterm></entry>
+ <entry>Only one row, showing statistics about blocks prefetched during recovery.
+ See <xref linkend="pg-stat-wal-prefetcher-view"/> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
<entry>At least one row per subscription, showing information about
@@ -2184,6 +2191,68 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
connected server.
</para>
+ <table id="pg-stat-wal-prefetcher-view" xreflabel="pg_stat_wal_prefetcher">
+ <title><structname>pg_stat_wal_prefetcher</structname> View</title>
+ <tgroup cols="3">
+ <thead>
+ <row>
+ <entry>Column</entry>
+ <entry>Type</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><structfield>prefetch</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_hit</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_new</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_fpw</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-wal-prefetch-fpw"/> was set to <literal>off</literal></entry>
+ </row>
+ <row>
+ <entry><structfield>skip_seq</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+ </row>
+ <row>
+ <entry><structfield>distance</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry>
+ </row>
+ <row>
+ <entry><structfield>queue_depth</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ The <structname>pg_stat_wal_prefetcher</structname> view will contain only
+ one row. It is filled with nulls if recovery is not running or WAL
+ prefetching is not enabled. See <xref linkend="guc-wal-prefetch-distance"/>
+ for more information. The counters in this view are reset whenever the
+ <xref linkend="guc-wal-prefetch-distance"/>,
+ <xref linkend="guc-wal-prefetch-fpw"/> or
+ <xref linkend="guc-effective-io-concurrency"/> setting is changed and
+ the server configuration is reloaded.
+ </para>
+
<table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
<title><structname>pg_stat_subscription</structname> View</title>
<tgroup cols="3">
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 4eb8feb903..943462ca05 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -719,6 +719,18 @@
<acronym>WAL</acronym> call being logged to the server log. This
option might be replaced by a more general mechanism in the future.
</para>
+
+ <para>
+ The <xref linkend="guc-wal-prefetch-distance"/> parameter can be
+ used to improve I/O performance during recovery by instructing
+ <productname>PostgreSQL</productname> to initiate reads
+ of disk blocks that will soon be needed, in combination with the
+ <xref linkend="guc-effective-io-concurrency"/> parameter. The
+ prefetching mechanism is most likely to be effective on systems
+ with <varname>full_page_writes</varname> set to
+ <varname>off</varname> (where that is safe), and where the working
+ set is larger than RAM. By default, WAL prefetching is disabled.
+ </para>
</sect1>
<sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..20e044c7c8 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
xlogarchive.o \
xlogfuncs.o \
xloginsert.o \
+ xlogprefetcher.o \
xlogreader.o \
xlogutils.o
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cc7072ba13..d042ebeaf5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -34,6 +34,7 @@
#include "access/xact.h"
#include "access/xlog_internal.h"
#include "access/xloginsert.h"
+#include "access/xlogprefetcher.h"
#include "access/xlogreader.h"
#include "access/xlogutils.h"
#include "catalog/catversion.h"
@@ -104,6 +105,8 @@ int wal_level = WAL_LEVEL_MINIMAL;
int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
int wal_retrieve_retry_interval = 5000;
+int wal_prefetch_distance = -1;
+bool wal_prefetch_fpw = false;
#ifdef WAL_DEBUG
bool XLOG_DEBUG = false;
@@ -805,6 +808,7 @@ static XLogSource readSource = 0; /* XLOG_FROM_* code */
*/
static XLogSource currentSource = 0; /* XLOG_FROM_* code */
static bool lastSourceFailed = false;
+static bool reset_wal_prefetcher = false;
typedef struct XLogPageReadPrivate
{
@@ -6212,6 +6216,7 @@ CheckRequiredParameterValues(void)
}
}
+
/*
* This must be called ONCE during postmaster or standalone-backend startup
*/
@@ -7068,6 +7073,7 @@ StartupXLOG(void)
{
ErrorContextCallback errcallback;
TimestampTz xtime;
+ XLogPrefetcher *prefetcher = NULL;
InRedo = true;
@@ -7075,6 +7081,9 @@ StartupXLOG(void)
(errmsg("redo starts at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
+ /* the first time through, see if we need to enable prefetching */
+ ResetWalPrefetcher();
+
/*
* main redo apply loop
*/
@@ -7104,6 +7113,31 @@ StartupXLOG(void)
/* Handle interrupt signals of startup process */
HandleStartupProcInterrupts();
+ /*
+ * The first time through, or if any relevant setting or the
+ * WAL source changes, we'll restart the prefetching machinery
+ * as appropriate. This is simpler than trying to handle
+ * various complicated state changes.
+ */
+ if (unlikely(reset_wal_prefetcher))
+ {
+ /* If we had one already, destroy it. */
+ if (prefetcher)
+ {
+ XLogPrefetcherFree(prefetcher);
+ prefetcher = NULL;
+ }
+ /* If we want one, create it. */
+ if (wal_prefetch_distance > 0)
+ prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr,
+ currentSource == XLOG_FROM_STREAM);
+ reset_wal_prefetcher = false;
+ }
+
+ /* Perform WAL prefetching, if enabled. */
+ if (prefetcher)
+ XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr);
+
/*
* Pause WAL replay, if requested by a hot-standby session via
* SetRecoveryPause().
@@ -7291,6 +7325,8 @@ StartupXLOG(void)
/*
* end of main redo apply loop
*/
+ if (prefetcher)
+ XLogPrefetcherFree(prefetcher);
if (reachedRecoveryTarget)
{
@@ -10150,6 +10186,24 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
}
}
+void
+assign_wal_prefetch_distance(int new_value, void *extra)
+{
+ /* Reset the WAL prefetcher, because a setting it depends on changed. */
+ wal_prefetch_distance = new_value;
+ if (AmStartupProcess())
+ ResetWalPrefetcher();
+}
+
+void
+assign_wal_prefetch_fpw(bool new_value, void *extra)
+{
+ /* Reset the WAL prefetcher, because a setting it depends on changed. */
+ wal_prefetch_fpw = new_value;
+ if (AmStartupProcess())
+ ResetWalPrefetcher();
+}
+
/*
* Issue appropriate kind of fsync (if any) for an XLOG output file.
@@ -11933,6 +11987,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* and move on to the next state.
*/
currentSource = XLOG_FROM_STREAM;
+ ResetWalPrefetcher();
break;
case XLOG_FROM_STREAM:
@@ -12356,3 +12411,12 @@ XLogRequestWalReceiverReply(void)
{
doRequestWalReceiverReply = true;
}
+
+/*
+ * Schedule a WAL prefetcher reset, on change of relevant settings.
+ */
+void
+ResetWalPrefetcher(void)
+{
+ reset_wal_prefetcher = true;
+}
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
new file mode 100644
index 0000000000..5b32522bb5
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -0,0 +1,653 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ * Prefetching support for PostgreSQL write-ahead log manager
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogprefetcher.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetcher.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/shmem.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log message
+ * that appears at the end of crash recovery.
+ */
+#define XLOGPREFETCHER_MONITORING_SAMPLE_STEP 32768
+
+/*
+ * Internal state used for book-keeping.
+ */
+struct XLogPrefetcher
+{
+ /* Reader and current reading state. */
+ XLogReaderState *reader;
+ XLogReadLocalOptions options;
+ bool have_record;
+ bool shutdown;
+ int next_block_id;
+
+ /* Book-keeping required to avoid accessing non-existing blocks. */
+ HTAB *filter_table;
+ dlist_head filter_queue;
+
+ /* Book-keeping required to limit concurrent prefetches. */
+ XLogRecPtr *prefetch_queue;
+ int prefetch_queue_size;
+ int prefetch_head;
+ int prefetch_tail;
+
+ /* Details of last prefetch to skip repeats and seq scans. */
+ SMgrRelation last_reln;
+ RelFileNode last_rnode;
+ BlockNumber last_blkno;
+
+ /* Counters used to compute avg_queue_depth and avg_distance. */
+ double samples;
+ double queue_depth_sum;
+ double distance_sum;
+ XLogRecPtr next_sample_lsn;
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+ RelFileNode rnode;
+ XLogRecPtr filter_until_replayed;
+ BlockNumber filter_from_block;
+ dlist_node link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory just for the benefit of monitoring
+ * functions.
+ */
+typedef struct XLogPrefetcherMonitoringStats
+{
+ pg_atomic_uint64 prefetch; /* Prefetches initiated. */
+ pg_atomic_uint64 skip_hit; /* Blocks already buffered. */
+ pg_atomic_uint64 skip_new; /* New/missing blocks filtered. */
+ pg_atomic_uint64 skip_fpw; /* FPWs skipped. */
+ pg_atomic_uint64 skip_seq; /* Sequential/repeat blocks skipped. */
+ int distance; /* Number of bytes ahead in the WAL. */
+ int queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetcherMonitoringStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+ RelFileNode rnode,
+ BlockNumber blockno,
+ XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+ RelFileNode rnode,
+ BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)
+{
+ pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
+static XLogPrefetcherMonitoringStats *MonitoringStats;
+
+size_t
+XLogPrefetcherShmemSize(void)
+{
+ return sizeof(XLogPrefetcherMonitoringStats);
+}
+
+static void
+XLogPrefetcherResetMonitoringStats(void)
+{
+ pg_atomic_init_u64(&MonitoringStats->prefetch, 0);
+ pg_atomic_init_u64(&MonitoringStats->skip_hit, 0);
+ pg_atomic_init_u64(&MonitoringStats->skip_new, 0);
+ pg_atomic_init_u64(&MonitoringStats->skip_fpw, 0);
+ pg_atomic_init_u64(&MonitoringStats->skip_seq, 0);
+ MonitoringStats->distance = -1;
+ MonitoringStats->queue_depth = 0;
+}
+
+void
+XLogPrefetcherShmemInit(void)
+{
+ bool found;
+
+ MonitoringStats = (XLogPrefetcherMonitoringStats *)
+ ShmemInitStruct("XLogPrefetcherMonitoringStats",
+ sizeof(XLogPrefetcherMonitoringStats),
+ &found);
+ if (!found)
+ XLogPrefetcherResetMonitoringStats();
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
+{
+ static HASHCTL hash_table_ctl = {
+ .keysize = sizeof(RelFileNode),
+ .entrysize = sizeof(XLogPrefetcherFilter)
+ };
+ XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher));
+
+ prefetcher->options.nowait = true;
+ if (streaming)
+ {
+ /*
+ * We're only allowed to read as far as the WAL receiver has written.
+ * We don't have to wait for it to be flushed, though, as recovery
+ * does, so that gives us a chance to get a bit further ahead.
+ */
+ prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+ }
+ else
+ {
+ /* We're allowed to read as far as we can. */
+ prefetcher->options.read_upto_policy = XLRO_LSN;
+ prefetcher->options.lsn = (XLogRecPtr) -1;
+ }
+ prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+ NULL,
+ read_local_xlog_page,
+ &prefetcher->options);
+ prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024,
+ &hash_table_ctl,
+ HASH_ELEM | HASH_BLOBS);
+ dlist_init(&prefetcher->filter_queue);
+
+ /*
+ * The size of the queue is determined by target_prefetch_pages, which is
+ * derived from effective_io_concurrency. In theory we might have a
+ * separate queue for each tablespace, but it's not clear how that should
+ * work, so for now we'll just use the system-wide GUC to rate-limit all
+ * prefetching.
+ */
+ prefetcher->prefetch_queue_size = target_prefetch_pages;
+ prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);
+ prefetcher->prefetch_head = prefetcher->prefetch_tail = 0;
+
+ /* Prepare to read at the given LSN. */
+ elog(LOG, "WAL prefetch started at %X/%X",
+ (uint32) (lsn >> 32), (uint32) lsn);
+ XLogBeginRead(prefetcher->reader, lsn);
+
+ XLogPrefetcherResetMonitoringStats();
+
+ return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+ double avg_distance = 0;
+ double avg_queue_depth = 0;
+
+ /* Log final statistics. */
+ if (prefetcher->samples > 0)
+ {
+ avg_distance = prefetcher->distance_sum / prefetcher->samples;
+ avg_queue_depth = prefetcher->queue_depth_sum / prefetcher->samples;
+ }
+ elog(LOG,
+ "WAL prefetch finished at %X/%X; "
+ "prefetch = " UINT64_FORMAT ", "
+ "skip_hit = " UINT64_FORMAT ", "
+ "skip_new = " UINT64_FORMAT ", "
+ "skip_fpw = " UINT64_FORMAT ", "
+ "skip_seq = " UINT64_FORMAT ", "
+ "avg_distance = %f, "
+ "avg_queue_depth = %f",
+ (uint32) (prefetcher->reader->EndRecPtr >> 32),
+ (uint32) (prefetcher->reader->EndRecPtr),
+ pg_atomic_read_u64(&MonitoringStats->prefetch),
+ pg_atomic_read_u64(&MonitoringStats->skip_hit),
+ pg_atomic_read_u64(&MonitoringStats->skip_new),
+ pg_atomic_read_u64(&MonitoringStats->skip_fpw),
+ pg_atomic_read_u64(&MonitoringStats->skip_seq),
+ avg_distance,
+ avg_queue_depth);
+ XLogReaderFree(prefetcher->reader);
+ hash_destroy(prefetcher->filter_table);
+ pfree(prefetcher->prefetch_queue);
+ pfree(prefetcher);
+
+ XLogPrefetcherResetMonitoringStats();
+}
+
+/*
+ * Read ahead in the WAL, as far as we can within the limits set by the user.
+ * Begin fetching any referenced blocks that are not already in the buffer
+ * pool.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ /* If an error has occurred or we've hit the end of the WAL, do nothing. */
+ if (prefetcher->shutdown)
+ return;
+
+ /*
+ * Have any in-flight prefetches definitely completed, judging by the LSN
+ * that is currently being replayed?
+ */
+ XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+ /*
+ * Do we already have the maximum permitted number of I/Os running
+ * (according to the information we have)? If so, we have to wait for at
+ * least one to complete, so give up early.
+ */
+ if (XLogPrefetcherSaturated(prefetcher))
+ return;
+
+ /* Can we drop any filters yet, due to problem records being replayed? */
+ XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+ /* Main prefetch loop. */
+ for (;;)
+ {
+ XLogReaderState *reader = prefetcher->reader;
+ char *error;
+ int64 distance;
+
+ /* If we don't already have a record, then try to read one. */
+ if (!prefetcher->have_record)
+ {
+ if (!XLogReadRecord(reader, &error))
+ {
+ /* If we got an error, log it and give up. */
+ if (error)
+ {
+ elog(LOG, "WAL prefetch error: %s", error);
+ prefetcher->shutdown = true;
+ }
+ /* Otherwise, we'll try again later when more data is here. */
+ return;
+ }
+ prefetcher->have_record = true;
+ prefetcher->next_block_id = 0;
+ }
+
+ /* How far ahead of replay are we now? */
+ distance = prefetcher->reader->ReadRecPtr - replaying_lsn;
+
+ /* Update distance shown in shm. */
+ MonitoringStats->distance = distance;
+
+ /* Sample the averages so we can log them at end of recovery. */
+ if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+ {
+ prefetcher->distance_sum += MonitoringStats->distance;
+ prefetcher->queue_depth_sum += MonitoringStats->queue_depth;
+ prefetcher->samples += 1.0;
+ prefetcher->next_sample_lsn =
+ replaying_lsn + XLOGPREFETCHER_MONITORING_SAMPLE_STEP;
+ }
+
+ /* Are we too far ahead of replay? */
+ if (distance >= wal_prefetch_distance)
+ break;
+
+ /*
+ * If this is a record that creates a new SMGR relation, we'll avoid
+ * prefetching anything from that rnode until it has been replayed.
+ */
+ if (replaying_lsn < reader->ReadRecPtr &&
+ XLogRecGetRmid(reader) == RM_SMGR_ID &&
+ (XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+ XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+ reader->ReadRecPtr);
+ }
+
+ /*
+ * Scan the record for block references. We might already have been
+ * partway through processing this record when we hit maximum I/O
+ * concurrency, so start where we left off.
+ */
+ for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i)
+ {
+ DecodedBkpBlock *block = &reader->blocks[i];
+ SMgrRelation reln;
+
+ /* Ignore everything but the main fork for now. */
+ if (block->forknum != MAIN_FORKNUM)
+ continue;
+
+ /*
+ * If there is a full page image attached, we won't be reading the
+ * page, so you might think we should skip it. However, if the
+ * underlying filesystem uses larger logical blocks than us, it
+ * might still need to perform a read-before-write some time later.
+ * Therefore, only prefetch if configured to do so.
+ */
+ if (block->has_image && !wal_prefetch_fpw)
+ {
+ inc_counter(&MonitoringStats->skip_fpw);
+ continue;
+ }
+
+ /*
+ * If this block will initialize a new page then it's probably an
+ * extension. Since it might create a new segment, we can't try
+ * to prefetch this block until the record has been replayed, or we
+ * might try to open a file that doesn't exist yet.
+ */
+ if (block->flags & BKPBLOCK_WILL_INIT)
+ {
+ XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+ reader->ReadRecPtr);
+ inc_counter(&MonitoringStats->skip_new);
+ continue;
+ }
+
+ /* Should we skip this block due to a filter? */
+ if (XLogPrefetcherIsFiltered(prefetcher, block->rnode,
+ block->blkno))
+ {
+ inc_counter(&MonitoringStats->skip_new);
+ continue;
+ }
+
+ /* Fast path for repeated references to the same relation. */
+ if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+ {
+ /*
+ * If this is a repeat or sequential access, then skip it. We
+ * expect the kernel to detect sequential access on its own
+ * and do a better job than we could.
+ */
+ if (block->blkno == prefetcher->last_blkno ||
+ block->blkno == prefetcher->last_blkno + 1)
+ {
+ prefetcher->last_blkno = block->blkno;
+ inc_counter(&MonitoringStats->skip_seq);
+ continue;
+ }
+
+ /* We can avoid calling smgropen(). */
+ reln = prefetcher->last_reln;
+ }
+ else
+ {
+ /* Otherwise we have to open it. */
+ reln = smgropen(block->rnode, InvalidBackendId);
+ prefetcher->last_rnode = block->rnode;
+ prefetcher->last_reln = reln;
+ }
+ prefetcher->last_blkno = block->blkno;
+
+ /* Try to prefetch this block! */
+ switch (SharedPrefetchBuffer(reln, block->forknum, block->blkno))
+ {
+ case PREFETCH_BUFFER_HIT:
+ /* It's already cached, so do nothing. */
+ inc_counter(&MonitoringStats->skip_hit);
+ break;
+ case PREFETCH_BUFFER_MISS:
+ /*
+ * I/O has possibly been initiated (the block might in fact already
+ * be cached by the kernel, but we have no way to know that, so we
+ * assume an I/O was started, for lack of better information). Record
+ * this as an I/O in progress until eventually we replay this
+ * LSN.
+ */
+ inc_counter(&MonitoringStats->prefetch);
+ XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+ /*
+ * If the queue is now full, we'll have to wait before
+ * processing any more blocks from this record.
+ */
+ if (XLogPrefetcherSaturated(prefetcher))
+ {
+ prefetcher->next_block_id = i + 1;
+ return;
+ }
+ break;
+ case PREFETCH_BUFFER_NOREL:
+ /*
+ * The underlying segment file doesn't exist. Presumably it
+ * will be unlinked by a later WAL record. When recovery
+ * reads this block, it will use the EXTENSION_CREATE_RECOVERY
+ * flag. We certainly don't want to do that sort of thing
+ * while merely prefetching, so let's just ignore references
+ * to this relation until this record is replayed, and let
+ * recovery create the dummy file or complain if something is
+ * wrong.
+ */
+ XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+ reader->ReadRecPtr);
+ inc_counter(&MonitoringStats->skip_new);
+ break;
+ }
+ }
+
+ /* Advance to the next record. */
+ prefetcher->have_record = false;
+ }
+}
+
+/*
+ * Expose statistics about WAL prefetching.
+ */
+Datum
+pg_stat_get_wal_prefetcher(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_WAL_PREFETCHER_COLS 7
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ Datum values[PG_STAT_GET_WAL_PREFETCHER_COLS];
+ bool nulls[PG_STAT_GET_WAL_PREFETCHER_COLS];
+
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mod required, but it is not allowed in this context")));
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ if (MonitoringStats->distance < 0)
+ {
+ for (int i = 0; i < PG_STAT_GET_WAL_PREFETCHER_COLS; ++i)
+ nulls[i] = true;
+ }
+ else
+ {
+ for (int i = 0; i < PG_STAT_GET_WAL_PREFETCHER_COLS; ++i)
+ nulls[i] = false;
+ values[0] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->prefetch));
+ values[1] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_hit));
+ values[2] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_new));
+ values[3] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_fpw));
+ values[4] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_seq));
+ values[5] = Int32GetDatum(MonitoringStats->distance);
+ values[6] = Int32GetDatum(MonitoringStats->queue_depth);
+ }
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+ BlockNumber blockno, XLogRecPtr lsn)
+{
+ XLogPrefetcherFilter *filter;
+ bool found;
+
+ filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+ if (!found)
+ {
+ /*
+ * Don't allow any prefetching of this block or higher until replayed.
+ */
+ filter->filter_until_replayed = lsn;
+ filter->filter_from_block = blockno;
+ dlist_push_head(&prefetcher->filter_queue, &filter->link);
+ }
+ else
+ {
+ /*
+ * We were already filtering this rnode. Extend the filter's lifetime
+ * to cover this WAL record, but leave the (presumably lower) block
+ * number there because we don't want to have to track individual
+ * blocks.
+ */
+ filter->filter_until_replayed = lsn;
+ dlist_delete(&filter->link);
+ dlist_push_head(&prefetcher->filter_queue, &filter->link);
+ }
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range? That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+ {
+ XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+ link,
+ &prefetcher->filter_queue);
+
+ if (filter->filter_until_replayed >= replaying_lsn)
+ break;
+ dlist_delete(&filter->link);
+ hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+ }
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+ BlockNumber blockno)
+{
+ /*
+ * Test for empty queue first, because we expect it to be empty most of the
+ * time and we can avoid the hash table lookup in that case.
+ */
+ if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+ {
+ XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+ HASH_FIND, NULL);
+
+ if (filter && filter->filter_from_block <= blockno)
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Insert an LSN into the queue. The queue must not be full already. This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr prefetching_lsn)
+{
+ Assert(!XLogPrefetcherSaturated(prefetcher));
+ prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+ prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+ MonitoringStats->queue_depth++;
+ Assert(MonitoringStats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet? That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches. For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+ prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+ {
+ prefetcher->prefetch_tail++;
+ prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+ MonitoringStats->queue_depth--;
+ Assert(MonitoringStats->queue_depth >= 0);
+ }
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+ return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+ prefetcher->prefetch_tail;
+}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b217ffa52f..fad2acb514 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walreceiver.h"
#include "storage/smgr.h"
#include "utils/guc.h"
#include "utils/hsearch.h"
@@ -827,6 +828,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
TimeLineID tli;
int count;
WALReadError errinfo;
+ XLogReadLocalOptions *options = (XLogReadLocalOptions *) state->private_data;
loc = targetPagePtr + reqLen;
@@ -841,7 +843,23 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
* notices recovery finishes, so we only have to maintain it for the
* local process until recovery ends.
*/
- if (!RecoveryInProgress())
+ if (options)
+ {
+ switch (options->read_upto_policy)
+ {
+ case XLRO_WALRCV_WRITTEN:
+ read_upto = GetWalRcvWriteRecPtr();
+ break;
+ case XLRO_LSN:
+ read_upto = options->lsn;
+ break;
+ default:
+ read_upto = 0;
+ elog(ERROR, "unknown read_upto_policy value");
+ break;
+ }
+ }
+ else if (!RecoveryInProgress())
read_upto = GetFlushRecPtr();
else
read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
@@ -879,6 +897,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
if (loc <= read_upto)
break;
+ if (options && options->nowait)
+ break;
+
CHECK_FOR_INTERRUPTS();
pg_usleep(1000L);
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f681aafcf9..d0882e5f82 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -811,6 +811,17 @@ CREATE VIEW pg_stat_wal_receiver AS
FROM pg_stat_get_wal_receiver() s
WHERE s.pid IS NOT NULL;
+CREATE VIEW pg_stat_wal_prefetcher AS
+ SELECT
+ s.prefetch,
+ s.skip_hit,
+ s.skip_new,
+ s.skip_fpw,
+ s.skip_seq,
+ s.distance,
+ s.queue_depth
+ FROM pg_stat_get_wal_prefetcher() s;
+
CREATE VIEW pg_stat_subscription AS
SELECT
su.oid AS subid,
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index e3da7d3625..34f3017871 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -169,7 +169,7 @@ StartupDecodingContext(List *output_plugin_options,
ctx->slot = slot;
- ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, ctx);
+ ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, NULL);
if (!ctx->reader)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..5ca98b8886 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
#include "access/nbtree.h"
#include "access/subtrans.h"
#include "access/twophase.h"
+#include "access/xlogprefetcher.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -124,6 +125,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, PredicateLockShmemSize());
size = add_size(size, ProcGlobalShmemSize());
size = add_size(size, XLOGShmemSize());
+ size = add_size(size, XLogPrefetcherShmemSize());
size = add_size(size, CLOGShmemSize());
size = add_size(size, CommitTsShmemSize());
size = add_size(size, SUBTRANSShmemSize());
@@ -212,6 +214,7 @@ CreateSharedMemoryAndSemaphores(void)
* Set up xlog, clog, and buffers
*/
XLOGShmemInit();
+ XLogPrefetcherShmemInit();
CLOGShmemInit();
CommitTsShmemInit();
SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 464f264d9a..893c9478d9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1240,6 +1240,18 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"wal_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Prefetch blocks that have full page images in the WAL"),
+ gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+ "entirely overwritten, but if the logical page size of the filesystem is "
+ "larger than PostgreSQL's, this can be beneficial. This option has no "
+ "effect unless wal_prefetch_distance is set to a positive number.")
+ },
+ &wal_prefetch_fpw,
+ false,
+ NULL, assign_wal_prefetch_fpw, NULL
+ },
{
{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2626,6 +2638,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"wal_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+ gettext_noop("How many bytes to read ahead in the WAL to prefetch referenced blocks."),
+ gettext_noop("Set to -1 to disable WAL prefetching."),
+ GUC_UNIT_BYTE
+ },
+ &wal_prefetch_distance,
+ -1, -1, INT_MAX,
+ NULL, assign_wal_prefetch_distance, NULL
+ },
+
{
{"wal_keep_segments", PGC_SIGHUP, REPLICATION_SENDING,
gettext_noop("Sets the number of WAL files held for standby servers."),
@@ -11484,6 +11507,8 @@ assign_effective_io_concurrency(int newval, void *extra)
{
#ifdef USE_PREFETCH
target_prefetch_pages = *((int *) extra);
+ if (AmStartupProcess())
+ ResetWalPrefetcher();
#endif /* USE_PREFETCH */
}
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033fc20..0a31edfba4 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -111,6 +111,8 @@ extern int wal_keep_segments;
extern int XLOGbuffers;
extern int XLogArchiveTimeout;
extern int wal_retrieve_retry_interval;
+extern int wal_prefetch_distance;
+extern bool wal_prefetch_fpw;
extern char *XLogArchiveCommand;
extern bool EnableHotStandby;
extern bool fullPageWrites;
@@ -319,6 +321,8 @@ extern void SetWalWriterSleeping(bool sleeping);
extern void XLogRequestWalReceiverReply(void);
+extern void ResetWalPrefetcher(void);
+
extern void assign_max_wal_size(int newval, void *extra);
extern void assign_checkpoint_completion_target(double newval, void *extra);
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
new file mode 100644
index 0000000000..585f5564a3
--- /dev/null
+++ b/src/include/access/xlogprefetcher.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.h
+ * Declarations for the XLog prefetching facility
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/xlogprefetcher.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCHER_H
+#define XLOGPREFETCHER_H
+
+#include "access/xlogdefs.h"
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch, XLogRecPtr replaying_lsn);
+
+extern size_t XLogPrefetcherShmemSize(void);
+extern void XLogPrefetcherShmemInit(void);
+
+#endif
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5181a077d9..1c8e67d74a 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,6 +47,26 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
extern void FreeFakeRelcacheEntry(Relation fakerel);
+/*
+ * A pointer to an XLogReadLocalOptions struct can be supplied as the private
+ * data for an xlog reader, causing read_local_xlog_page to modify its
+ * behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+ /* Don't block waiting for new WAL to arrive. */
+ bool nowait;
+
+ /* How far to read. */
+ enum {
+ XLRO_WALRCV_WRITTEN,
+ XLRO_LSN
+ } read_upto_policy;
+
+ /* If read_upto_policy is XLRO_LSN, the LSN. */
+ XLogRecPtr lsn;
+} XLogReadLocalOptions;
+
extern int read_local_xlog_page(XLogReaderState *state,
XLogRecPtr targetPagePtr, int reqLen,
XLogRecPtr targetRecPtr, char *cur_page);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 07a86c7b7b..0bd16c1b77 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6082,6 +6082,14 @@
prorettype => 'bool', proargtypes => '',
prosrc => 'pg_is_wal_replay_paused' },
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+ proname => 'pg_stat_get_wal_prefetcher', prorows => '1', provolatile => 'v',
+ proretset => 't', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,int8,int8,int8,int8,int4,int4}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth}',
+ prosrc => 'pg_stat_get_wal_prefetcher' },
+
{ oid => '2621', descr => 'reload configuration files',
proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 5d7a796ba0..6e91c33f3d 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -159,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
*/
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
+/*
+ * When you try to prefetch a buffer, there are three possibilities: it's
+ * already cached in our buffer pool, it's not cached but we can tell the
+ * kernel that we'll be loading it soon, or the relation file doesn't exist.
+ */
typedef enum PrefetchBufferResult
{
PREFETCH_BUFFER_HIT,
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ce93ace76c..903b0ec02b 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -438,5 +438,7 @@ extern void assign_search_path(const char *newval, void *extra);
/* in access/transam/xlog.c */
extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+extern void assign_wal_prefetch_distance(int new_value, void *extra);
+extern void assign_wal_prefetch_fpw(bool new_value, void *extra);
#endif /* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 634f8256f7..62b1e0e113 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2087,6 +2087,14 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.autoanalyze_count
FROM pg_stat_all_tables
WHERE ((pg_stat_all_tables.schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (pg_stat_all_tables.schemaname !~ '^pg_toast'::text));
+pg_stat_wal_prefetcher| SELECT s.prefetch,
+ s.skip_hit,
+ s.skip_new,
+ s.skip_fpw,
+ s.skip_seq,
+ s.distance,
+ s.queue_depth
+ FROM pg_stat_get_wal_prefetcher() s(prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth);
pg_stat_wal_receiver| SELECT s.pid,
s.status,
s.receive_start_lsn,
--
2.20.1
I tried my luck at a quick read of this patchset.
I didn't manage to go over 0005 though, but I agree with Tomas that
having this be configurable in terms of bytes of WAL is not very
user-friendly.
First of all, let me join the crowd chanting that this is badly needed;
I don't need to repeat what Chittenden's talk showed. "WAL recovery is
now 10x-20x times faster" would be a good item for pg13 press release,
I think.
From a61b4e00c42ace5db1608e02165f89094bf86391 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 3 Dec 2019 17:13:40 +1300
Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.

Previously a Relation was required, but it's annoying to have
to create a "fake" one in recovery.
LGTM.
It's a pity to have to include smgr.h in bufmgr.h. Maybe it'd be sane
to use a forward struct declaration and "struct SMgrRelation *" instead.
From acbff1444d0acce71b0218ce083df03992af1581 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:10:17 +1300
Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().

The new name better reflects the fact that the value it returns
is updated only when received data has been flushed to disk.

An upcoming patch will make use of the latest data that was
written without waiting for it to be flushed, so use more
precise function names.
Ugh. (Not for your patch -- I mean for the existing naming convention).
It would make sense to rename WalRcvData->receivedUpto in this commit,
maybe to flushedUpto.
From d7fa7d82c5f68d0cccf441ce9e8dfa40f64d3e0d Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).

A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk. To do that,
it needs to be able to see the write pointer advancing in shared
memory.

The function formerly bearing this name was recently renamed to
WalRcvGetFlushRecPtr(), which better described what it does.
+ pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
Umm, how come you're using WalRcv here instead of walrcv? I would flag
this patch for sneaky nastiness if this weren't mostly harmless. (I
think we should do away with local walrcv pointers altogether. But that
should be a separate patch, I think.)
+ pg_atomic_uint64 writtenUpto;
Are we already using uint64s for XLogRecPtrs anywhere? This seems
novel. Given this, I wonder if the comment near "mutex" needs an
update ("except where atomics are used"), or perhaps just move the
member to after the line with mutex.
I didn't understand the purpose of inc_counter() as written. Why not
just pg_atomic_fetch_add_u64(..., 1)?
 /*
  * smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
+ *
+ * In recovery only, this can return false to indicate that a file
+ * doesn't exist (presumably it has been dropped by a later WAL
+ * record).
  */
-void
+bool
 smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
I think this API, where the behavior of a low-level module changes
depending on InRecovery, is confusingly crazy. I'd rather have the
callers specifying whether they're OK with a file that doesn't exist.
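Something along these lines is what I have in mind -- just a hypothetical
sketch of the spelling, not a concrete proposal for the patch:

/* Hypothetical: the caller says whether a missing file is acceptable. */
extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
                         BlockNumber blocknum, bool missing_ok);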
+extern PrefetchBufferResult SharedPrefetchBuffer(SMgrRelation smgr_reln,
+                                                 ForkNumber forkNum,
+                                                 BlockNumber blockNum);
 extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
                            BlockNumber blockNum);
Umm, I would keep the return values of both these functions in sync.
It's really strange that PrefetchBuffer does not return
PrefetchBufferResult, don't you think?
--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Alvaro,
On Sat, Mar 14, 2020 at 10:15 AM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
I tried my luck at a quick read of this patchset.
Thanks! Here's a new patch set, and some inline responses to your feedback:
I didn't manage to go over 0005 though, but I agree with Tomas that
having this be configurable in terms of bytes of WAL is not very
user-friendly.
The primary control is now maintenance_io_concurrency, which is
basically what Tomas suggested.
The byte-based control is just a cap to prevent it reading a crazy
distance ahead, that also functions as the on/off switch for the
feature. In this version I've added "max" to the name, to make that
clearer.
First of all, let me join the crowd chanting that this is badly needed;
I don't need to repeat what Chittenden's talk showed. "WAL recovery is
now 10x-20x times faster" would be a good item for pg13 press release,
I think.
We should be careful about over-promising here: Sean basically had a
best-case scenario for this type of technology, partly due to his 16kB
filesystem blocks. Common results may be a lot more pedestrian,
though it could get more interesting if we figure out how to get rid
of FPWs...
From a61b4e00c42ace5db1608e02165f89094bf86391 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 3 Dec 2019 17:13:40 +1300
Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.

Previously a Relation was required, but it's annoying to have
to create a "fake" one in recovery.

LGTM.
It's a pity to have to include smgr.h in bufmgr.h. Maybe it'd be sane
to use a forward struct declaration and "struct SMgrRelation *" instead.
OK, done.
While staring at this, I decided that SharedPrefetchBuffer() was a
weird word order, so I changed it to PrefetchSharedBuffer(). Then, by
analogy, I figured I should also change the pre-existing function
LocalPrefetchBuffer() to PrefetchLocalBuffer(). Do you think this is
an improvement?
From acbff1444d0acce71b0218ce083df03992af1581 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:10:17 +1300
Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().

The new name better reflects the fact that the value it returns
is updated only when received data has been flushed to disk.

An upcoming patch will make use of the latest data that was
written without waiting for it to be flushed, so use more
precise function names.

Ugh. (Not for your patch -- I mean for the existing naming convention).
It would make sense to rename WalRcvData->receivedUpto in this commit,
maybe to flushedUpto.
Ok, I renamed that variable and a related one. There are more things
you could rename if you pull on that thread some more, including
pg_stat_wal_receiver's received_lsn column, but I didn't do that in
this patch.
From d7fa7d82c5f68d0cccf441ce9e8dfa40f64d3e0d Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).

A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk. To do that,
it needs to be able to see the write pointer advancing in shared
memory.

The function formerly bearing this name was recently renamed to
WalRcvGetFlushRecPtr(), which better described what it does.

+ pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
Umm, how come you're using WalRcv here instead of walrcv? I would flag
this patch for sneaky nastiness if this weren't mostly harmless. (I
think we should do away with local walrcv pointers altogether. But that
should be a separate patch, I think.)
OK, done.
+ pg_atomic_uint64 writtenUpto;
Are we already using uint64s for XLogRecPtrs anywhere? This seems
novel. Given this, I wonder if the comment near "mutex" needs an
update ("except where atomics are used"), or perhaps just move the
member to after the line with mutex.
Moved.
We use [u]int64 in various places in the replication code. Ideally
I'd have a magic way to say atomic<XLogRecPtr> so I didn't have to
assume that pg_atomic_uint64 is the right atomic integer width and
signedness, but here we are. In dsa.h I made a special typedef for
the atomic version of something else, but that's because the size of
that thing varied depending on the build, whereas our LSNs are of a
fixed width that ought to be en... <trails off>.
I didn't understand the purpose of inc_counter() as written. Why not
just pg_atomic_fetch_add_u64(..., 1)?
I didn't want counters that wrap at ~4 billion, but I did want to be
able to read and write concurrently without tearing. Instructions
like "lock xadd" would provide more guarantees that I don't need,
since only one thread is doing all the writing and there's no ordering
requirement. It's basically just counter++, but some platforms need a
spinlock to perform atomic read and write of 64 bit wide numbers, so
more hoop jumping is required.
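To make the comparison concrete, here's a quick sketch of the two options
side by side (the first is what the patch does; the second is the fetch-add
you suggested, which pays for a full atomic read-modify-write):

/* Single writer (the startup process); readers only need untorn loads. */
static inline void
inc_counter(pg_atomic_uint64 *counter)
{
    pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
}

/* Same result, but a full atomic RMW (e.g. lock xadd on x86). */
static inline void
inc_counter_rmw(pg_atomic_uint64 *counter)
{
    pg_atomic_fetch_add_u64(counter, 1);
}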
 /*
  * smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
+ *
+ * In recovery only, this can return false to indicate that a file
+ * doesn't exist (presumably it has been dropped by a later WAL
+ * record).
  */
-void
+bool
 smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)

I think this API, where the behavior of a low-level module changes
depending on InRecovery, is confusingly crazy. I'd rather have the
callers specifying whether they're OK with a file that doesn't exist.
Hmm. But... md.c has other code like that. It's true that I'm adding
InRecovery awareness to a function that didn't previously have it, but
that's just because we previously had no reason to prefetch stuff in
recovery.
+extern PrefetchBufferResult SharedPrefetchBuffer(SMgrRelation smgr_reln,
+                                                 ForkNumber forkNum,
+                                                 BlockNumber blockNum);
 extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
                            BlockNumber blockNum);

Umm, I would keep the return values of both these functions in sync.
It's really strange that PrefetchBuffer does not return
PrefetchBufferResult, don't you think?
Agreed, and changed. I suspect that other users of the main
PrefetchBuffer() call will eventually want that, to do a better job of
keeping the request queue full, for example bitmap heap scan and
(hypothetical) btree scan with prefetch.
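For example, a future caller might do something like this -- a hypothetical
sketch, not part of this patch set -- counting only the prefetches that may
have started real I/O so it knows when its request queue is full:

/* Hypothetical helper for a scan that wants to keep its prefetch queue full. */
static bool
try_prefetch(Relation rel, BlockNumber blkno, int *inflight, int max_inflight)
{
    if (*inflight >= max_inflight)
        return false;           /* queue full; caller should wait for I/O */
    if (PrefetchBuffer(rel, MAIN_FORKNUM, blkno) == PREFETCH_BUFFER_MISS)
        (*inflight)++;          /* an I/O may have been initiated */
    return true;
}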
Attachments:
0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRela-v4.patch
From 71641bcfed33c0a89f27b5246734eb4b8196485c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 15:25:55 +1300
Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.
Previously a Relation was required, but it's annoying to have
to create a "fake" one in recovery. A new function
PrefetchSharedBuffer() is provided that works with SMgrRelation, and
LocalPrefetchBuffer() is renamed to PrefetchLocalBuffer() to fit with
that more natural naming scheme.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/storage/buffer/bufmgr.c | 84 ++++++++++++++++-----------
src/backend/storage/buffer/localbuf.c | 4 +-
src/include/storage/buf_internals.h | 2 +-
src/include/storage/bufmgr.h | 6 ++
4 files changed, 59 insertions(+), 37 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e05e2b3456..d30aed6fd9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -466,6 +466,53 @@ static int ckpt_buforder_comparator(const void *pa, const void *pb);
static int ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
+/*
+ * Implementation of PrefetchBuffer() for shared buffers.
+ */
+void
+PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+ ForkNumber forkNum,
+ BlockNumber blockNum)
+{
+#ifdef USE_PREFETCH
+ BufferTag newTag; /* identity of requested block */
+ uint32 newHash; /* hash value for newTag */
+ LWLock *newPartitionLock; /* buffer partition lock for it */
+ int buf_id;
+
+ Assert(BlockNumberIsValid(blockNum));
+
+ /* create a tag so we can lookup the buffer */
+ INIT_BUFFERTAG(newTag, smgr_reln->smgr_rnode.node,
+ forkNum, blockNum);
+
+ /* determine its hash code and partition lock ID */
+ newHash = BufTableHashCode(&newTag);
+ newPartitionLock = BufMappingPartitionLock(newHash);
+
+ /* see if the block is in the buffer pool already */
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = BufTableLookup(&newTag, newHash);
+ LWLockRelease(newPartitionLock);
+
+ /* If not in buffers, initiate prefetch */
+ if (buf_id < 0)
+ smgrprefetch(smgr_reln, forkNum, blockNum);
+
+ /*
+ * If the block *is* in buffers, we do nothing. This is not really ideal:
+ * the block might be just about to be evicted, which would be stupid
+ * since we know we are going to need it soon. But the only easy answer
+ * is to bump the usage_count, which does not seem like a great solution:
+ * when the caller does ultimately touch the block, usage_count would get
+ * bumped again, resulting in too much favoritism for blocks that are
+ * involved in a prefetch sequence. A real fix would involve some
+ * additional per-buffer state, and it's not clear that there's enough of
+ * a problem to justify that.
+ */
+#endif /* USE_PREFETCH */
+}
+
/*
* PrefetchBuffer -- initiate asynchronous read of a block of a relation
*
@@ -493,43 +540,12 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
errmsg("cannot access temporary tables of other sessions")));
/* pass it off to localbuf.c */
- LocalPrefetchBuffer(reln->rd_smgr, forkNum, blockNum);
+ PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
}
else
{
- BufferTag newTag; /* identity of requested block */
- uint32 newHash; /* hash value for newTag */
- LWLock *newPartitionLock; /* buffer partition lock for it */
- int buf_id;
-
- /* create a tag so we can lookup the buffer */
- INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
- forkNum, blockNum);
-
- /* determine its hash code and partition lock ID */
- newHash = BufTableHashCode(&newTag);
- newPartitionLock = BufMappingPartitionLock(newHash);
-
- /* see if the block is in the buffer pool already */
- LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
- LWLockRelease(newPartitionLock);
-
- /* If not in buffers, initiate prefetch */
- if (buf_id < 0)
- smgrprefetch(reln->rd_smgr, forkNum, blockNum);
-
- /*
- * If the block *is* in buffers, we do nothing. This is not really
- * ideal: the block might be just about to be evicted, which would be
- * stupid since we know we are going to need it soon. But the only
- * easy answer is to bump the usage_count, which does not seem like a
- * great solution: when the caller does ultimately touch the block,
- * usage_count would get bumped again, resulting in too much
- * favoritism for blocks that are involved in a prefetch sequence. A
- * real fix would involve some additional per-buffer state, and it's
- * not clear that there's enough of a problem to justify that.
- */
+ /* pass it to the shared buffer version */
+ PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
}
#endif /* USE_PREFETCH */
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index cac08e1b1a..b528bc9553 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -54,14 +54,14 @@ static Block GetLocalBufferStorage(void);
/*
- * LocalPrefetchBuffer -
+ * PrefetchLocalBuffer -
* initiate asynchronous read of a block of a relation
*
* Do PrefetchBuffer's work for temporary relations.
* No-op if prefetching isn't compiled in.
*/
void
-LocalPrefetchBuffer(SMgrRelation smgr, ForkNumber forkNum,
+PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum)
{
#ifdef USE_PREFETCH
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index bf3b8ad340..166fe334c7 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -327,7 +327,7 @@ extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
/* localbuf.c */
-extern void LocalPrefetchBuffer(SMgrRelation smgr, ForkNumber forkNum,
+extern void PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum, bool *foundPtr);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d2a5b52f6e..e00dd3ffb7 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -49,6 +49,9 @@ typedef enum
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;
+/* forward declared, to avoid including smgr.h */
+struct SMgrRelationData;
+
/* in globals.c ... this duplicates miscadmin.h */
extern PGDLLIMPORT int NBuffers;
@@ -159,6 +162,9 @@ extern PGDLLIMPORT int32 *LocalRefCount;
/*
* prototypes for functions in bufmgr.c
*/
+extern void PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+ ForkNumber forkNum,
+ BlockNumber blockNum);
extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
--
2.20.1
0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecP-v4.patch
From a3a22ea59e9a9ac1d03dd3f22708e32a796785af Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 15:28:08 +1300
Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().
The new name better reflects the fact that the value it returns is
updated only when received data has been flushed to disk. Also rename
a couple of variables relating to this value.
An upcoming patch will make use of the latest data that was written
without waiting for it to be flushed, so let's use more precise function
names.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/access/transam/xlog.c | 20 ++++++++++----------
src/backend/access/transam/xlogfuncs.c | 2 +-
src/backend/replication/README | 2 +-
src/backend/replication/walreceiver.c | 10 +++++-----
src/backend/replication/walreceiverfuncs.c | 12 ++++++------
src/backend/replication/walsender.c | 2 +-
src/include/replication/walreceiver.h | 8 ++++----
7 files changed, 28 insertions(+), 28 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4fa446ffa4..fd30e27425 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -205,8 +205,8 @@ HotStandbyState standbyState = STANDBY_DISABLED;
static XLogRecPtr LastRec;
-/* Local copy of WalRcv->receivedUpto */
-static XLogRecPtr receivedUpto = 0;
+/* Local copy of WalRcv->flushedUpto */
+static XLogRecPtr flushedUpto = 0;
static TimeLineID receiveTLI = 0;
/*
@@ -9288,7 +9288,7 @@ CreateRestartPoint(int flags)
* Retreat _logSegNo using the current end of xlog replayed or received,
* whichever is later.
*/
- receivePtr = GetWalRcvWriteRecPtr(NULL, NULL);
+ receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
replayPtr = GetXLogReplayRecPtr(&replayTLI);
endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
KeepLogSeg(endptr, &_logSegNo);
@@ -11682,7 +11682,7 @@ retry:
/* See if we need to retrieve more data */
if (readFile < 0 ||
(readSource == XLOG_FROM_STREAM &&
- receivedUpto < targetPagePtr + reqLen))
+ flushedUpto < targetPagePtr + reqLen))
{
if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
private->randAccess,
@@ -11713,10 +11713,10 @@ retry:
*/
if (readSource == XLOG_FROM_STREAM)
{
- if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
+ if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
readLen = XLOG_BLCKSZ;
else
- readLen = XLogSegmentOffset(receivedUpto, wal_segment_size) -
+ readLen = XLogSegmentOffset(flushedUpto, wal_segment_size) -
targetPageOff;
}
else
@@ -11952,7 +11952,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
curFileTLI = tli;
RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
PrimarySlotName);
- receivedUpto = 0;
+ flushedUpto = 0;
}
/*
@@ -12132,14 +12132,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* XLogReceiptTime will not advance, so the grace time
* allotted to conflicting queries will decrease.
*/
- if (RecPtr < receivedUpto)
+ if (RecPtr < flushedUpto)
havedata = true;
else
{
XLogRecPtr latestChunkStart;
- receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI);
- if (RecPtr < receivedUpto && receiveTLI == curFileTLI)
+ flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
+ if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
{
havedata = true;
if (latestChunkStart <= RecPtr)
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 20316539b6..e075c1c71b 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -398,7 +398,7 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
{
XLogRecPtr recptr;
- recptr = GetWalRcvWriteRecPtr(NULL, NULL);
+ recptr = GetWalRcvFlushRecPtr(NULL, NULL);
if (recptr == 0)
PG_RETURN_NULL();
diff --git a/src/backend/replication/README b/src/backend/replication/README
index 0cbb990613..8ccdd86e74 100644
--- a/src/backend/replication/README
+++ b/src/backend/replication/README
@@ -54,7 +54,7 @@ and WalRcvData->slotname, and initializes the starting point in
WalRcvData->receiveStart.
As walreceiver receives WAL from the master server, and writes and flushes
-it to disk (in pg_wal), it updates WalRcvData->receivedUpto and signals
+it to disk (in pg_wal), it updates WalRcvData->flushedUpto and signals
the startup process to know how far WAL replay can advance.
Walreceiver sends information about replication progress to the master server
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 25e0333c9e..0bdd0c3074 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -12,7 +12,7 @@
* in the primary server), and then keeps receiving XLOG records and
* writing them to the disk as long as the connection is alive. As XLOG
* records are received and flushed to disk, it updates the
- * WalRcv->receivedUpto variable in shared memory, to inform the startup
+ * WalRcv->flushedUpto variable in shared memory, to inform the startup
* process of how far it can proceed with XLOG replay.
*
* If the primary server ends streaming, but doesn't disconnect, walreceiver
@@ -1006,10 +1006,10 @@ XLogWalRcvFlush(bool dying)
/* Update shared-memory status */
SpinLockAcquire(&walrcv->mutex);
- if (walrcv->receivedUpto < LogstreamResult.Flush)
+ if (walrcv->flushedUpto < LogstreamResult.Flush)
{
- walrcv->latestChunkStart = walrcv->receivedUpto;
- walrcv->receivedUpto = LogstreamResult.Flush;
+ walrcv->latestChunkStart = walrcv->flushedUpto;
+ walrcv->flushedUpto = LogstreamResult.Flush;
walrcv->receivedTLI = ThisTimeLineID;
}
SpinLockRelease(&walrcv->mutex);
@@ -1362,7 +1362,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
state = WalRcv->walRcvState;
receive_start_lsn = WalRcv->receiveStart;
receive_start_tli = WalRcv->receiveStartTLI;
- received_lsn = WalRcv->receivedUpto;
+ received_lsn = WalRcv->flushedUpto;
received_tli = WalRcv->receivedTLI;
last_send_time = WalRcv->lastMsgSendTime;
last_receipt_time = WalRcv->lastMsgReceiptTime;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 89c903e45a..31025f97e3 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -264,11 +264,11 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
/*
* If this is the first startup of walreceiver (on this timeline),
- * initialize receivedUpto and latestChunkStart to the starting point.
+ * initialize flushedUpto and latestChunkStart to the starting point.
*/
if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
{
- walrcv->receivedUpto = recptr;
+ walrcv->flushedUpto = recptr;
walrcv->receivedTLI = tli;
walrcv->latestChunkStart = recptr;
}
@@ -286,7 +286,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
}
/*
- * Returns the last+1 byte position that walreceiver has written.
+ * Returns the last+1 byte position that walreceiver has flushed.
*
* Optionally, returns the previous chunk start, that is the first byte
* written in the most recent walreceiver flush cycle. Callers not
@@ -294,13 +294,13 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
* receiveTLI.
*/
XLogRecPtr
-GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
{
WalRcvData *walrcv = WalRcv;
XLogRecPtr recptr;
SpinLockAcquire(&walrcv->mutex);
- recptr = walrcv->receivedUpto;
+ recptr = walrcv->flushedUpto;
if (latestChunkStart)
*latestChunkStart = walrcv->latestChunkStart;
if (receiveTLI)
@@ -327,7 +327,7 @@ GetReplicationApplyDelay(void)
TimestampTz chunkReplayStartTime;
SpinLockAcquire(&walrcv->mutex);
- receivePtr = walrcv->receivedUpto;
+ receivePtr = walrcv->flushedUpto;
SpinLockRelease(&walrcv->mutex);
replayPtr = GetXLogReplayRecPtr(NULL);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3f74bc8493..658e5280fd 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2914,7 +2914,7 @@ GetStandbyFlushRecPtr(void)
* has streamed, but hasn't been replayed yet.
*/
- receivePtr = GetWalRcvWriteRecPtr(NULL, &receiveTLI);
+ receivePtr = GetWalRcvFlushRecPtr(NULL, &receiveTLI);
replayPtr = GetXLogReplayRecPtr(&replayTLI);
ThisTimeLineID = replayTLI;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6548..9ed71139ce 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -74,19 +74,19 @@ typedef struct
TimeLineID receiveStartTLI;
/*
- * receivedUpto-1 is the last byte position that has already been
+ * flushedUpto-1 is the last byte position that has already been
* received, and receivedTLI is the timeline it came from. At the first
* startup of walreceiver, these are set to receiveStart and
* receiveStartTLI. After that, walreceiver updates these whenever it
* flushes the received WAL to disk.
*/
- XLogRecPtr receivedUpto;
+ XLogRecPtr flushedUpto;
TimeLineID receivedTLI;
/*
* latestChunkStart is the starting byte position of the current "batch"
* of received WAL. It's actually the same as the previous value of
- * receivedUpto before the last flush to disk. Startup process can use
+ * flushedUpto before the last flush to disk. Startup process can use
* this to detect whether it's keeping up or not.
*/
XLogRecPtr latestChunkStart;
@@ -322,7 +322,7 @@ extern bool WalRcvStreaming(void);
extern bool WalRcvRunning(void);
extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
const char *conninfo, const char *slotname);
-extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(void);
--
2.20.1
Attachment: 0003-Add-WalRcvGetWriteRecPtr-new-definition-v4.patch (text/x-patch)
From 6ef218f60cab62ecbd5ad120cf535cb4e5045f45 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).
A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk. To do that, it
needs to be able to see the write pointer advancing in shared memory.
The function formerly bearing this name was recently renamed to
GetWalRcvFlushRecPtr(), which better describes what it does.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/replication/walreceiver.c | 5 +++++
src/backend/replication/walreceiverfuncs.c | 12 ++++++++++++
src/include/replication/walreceiver.h | 10 ++++++++++
3 files changed, 27 insertions(+)
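To illustrate the intended use, here is a sketch only (apart from the two
Get* functions, the names are illustrative and not taken from any diff in
this series):

    XLogRecPtr read_upto;

    if (streaming)                          /* hypothetical flag */
        read_upto = GetWalRcvWriteRecPtr(); /* WAL written, not necessarily flushed */
    else
        read_upto = GetFlushRecPtr();       /* the existing, stricter limit */

That is roughly what the read callback in patch 0005 ends up doing when it
runs on a streaming standby.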
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 0bdd0c3074..e250f5583c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -245,6 +245,8 @@ WalReceiverMain(void)
SpinLockRelease(&walrcv->mutex);
+ pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
+
/* Arrange to clean up at walreceiver exit */
on_shmem_exit(WalRcvDie, 0);
@@ -985,6 +987,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
LogstreamResult.Write = recptr;
}
+
+ /* Update shared-memory status */
+ pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
}
/*
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 31025f97e3..96b44e2c88 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -310,6 +310,18 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
return recptr;
}
+/*
+ * Returns the last+1 byte position that walreceiver has written.
+ * This returns a recently written value without taking a lock.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtr(void)
+{
+ WalRcvData *walrcv = WalRcv;
+
+ return pg_atomic_read_u64(&walrcv->writtenUpto);
+}
+
/*
* Returns the replication apply delay in ms or -1
* if the apply delay info is not available
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 9ed71139ce..914e6e3d44 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -16,6 +16,7 @@
#include "access/xlogdefs.h"
#include "getaddrinfo.h" /* for NI_MAXHOST */
#include "pgtime.h"
+#include "port/atomics.h"
#include "replication/logicalproto.h"
#include "replication/walsender.h"
#include "storage/latch.h"
@@ -142,6 +143,14 @@ typedef struct
slock_t mutex; /* locks shared variables shown above */
+ /*
+ * Like flushedUpto, but advanced after writing and before flushing,
+ * without the need to acquire the spin lock. Data can be read by another
+ * process up to this point, but shouldn't be used for data integrity
+ * purposes.
+ */
+ pg_atomic_uint64 writtenUpto;
+
/*
* force walreceiver reply? This doesn't need to be locked; memory
* barriers for ordering are sufficient. But we do need atomic fetch and
@@ -323,6 +332,7 @@ extern bool WalRcvRunning(void);
extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
const char *conninfo, const char *slotname);
extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(void);
extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(void);
--
2.20.1
Attachment: 0004-Allow-PrefetchBuffer-to-report-what-happened-v4.patch (text/x-patch)
From d60e6f15180a40b117b3fc9b330967e52a5b6485 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 17:26:41 +1300
Subject: [PATCH 4/5] Allow PrefetchBuffer() to report what happened.
Report whether a prefetch was actually initiated, so that callers can
limit the number of concurrent I/Os they try to issue, without counting
the prefetch calls that did nothing because the page was already in our
buffers.
Also report when a relation's backing file is missing, to prepare for
use during recovery. This will be used to handle cases of relations
that are referenced in the WAL but have been unlinked already due to
actions covered by WAL records that haven't been replayed yet, after a
crash.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/storage/buffer/bufmgr.c | 15 ++++++++++-----
src/backend/storage/buffer/localbuf.c | 7 +++++--
src/backend/storage/smgr/md.c | 9 +++++++--
src/backend/storage/smgr/smgr.c | 10 +++++++---
src/include/storage/buf_internals.h | 5 +++--
src/include/storage/bufmgr.h | 17 ++++++++++++-----
src/include/storage/md.h | 2 +-
src/include/storage/smgr.h | 2 +-
8 files changed, 46 insertions(+), 21 deletions(-)
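For example, a caller that wants to cap concurrent I/Os might now do
something like this (sketch only; "reln" is an SMgrRelation and "inflight"
is an illustrative counter, neither name comes from this diff):

    switch (PrefetchSharedBuffer(reln, MAIN_FORKNUM, blkno))
    {
        case PREFETCH_BUFFER_HIT:
            /* already in shared buffers, no I/O was issued */
            break;
        case PREFETCH_BUFFER_MISS:
            /* an I/O may have been initiated; count it against the cap */
            inflight++;
            break;
        case PREFETCH_BUFFER_NOREL:
            /* the underlying file is missing; stop touching this relation */
            break;
    }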
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d30aed6fd9..b13e05cce8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -469,7 +469,7 @@ static int ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
/*
* Implementation of PrefetchBuffer() for shared buffers.
*/
-void
+PrefetchBufferResult
PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
ForkNumber forkNum,
BlockNumber blockNum)
@@ -497,7 +497,11 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
/* If not in buffers, initiate prefetch */
if (buf_id < 0)
- smgrprefetch(smgr_reln, forkNum, blockNum);
+ {
+ if (!smgrprefetch(smgr_reln, forkNum, blockNum))
+ return PREFETCH_BUFFER_NOREL;
+ return PREFETCH_BUFFER_MISS;
+ }
/*
* If the block *is* in buffers, we do nothing. This is not really ideal:
@@ -511,6 +515,7 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
* a problem to justify that.
*/
#endif /* USE_PREFETCH */
+ return PREFETCH_BUFFER_HIT;
}
/*
@@ -521,7 +526,7 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
* block will not be delayed by the I/O. Prefetching is optional.
* No-op if prefetching isn't compiled in.
*/
-void
+PrefetchBufferResult
PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
{
#ifdef USE_PREFETCH
@@ -540,12 +545,12 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
errmsg("cannot access temporary tables of other sessions")));
/* pass it off to localbuf.c */
- PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
+ return PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
}
else
{
/* pass it to the shared buffer version */
- PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
+ return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
}
#endif /* USE_PREFETCH */
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index b528bc9553..c728986e12 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -60,7 +60,7 @@ static Block GetLocalBufferStorage(void);
* Do PrefetchBuffer's work for temporary relations.
* No-op if prefetching isn't compiled in.
*/
-void
+PrefetchBufferResult
PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum)
{
@@ -81,11 +81,14 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
if (hresult)
{
/* Yes, so nothing to do */
- return;
+ return PREFETCH_BUFFER_HIT;
}
/* Not in buffers, so initiate prefetch */
smgrprefetch(smgr, forkNum, blockNum);
+ return PREFETCH_BUFFER_MISS;
+#else
+ return PREFETCH_BUFFER_HIT;
#endif /* USE_PREFETCH */
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index c5b771c531..ba12fc2077 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -525,14 +525,17 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
/*
* mdprefetch() -- Initiate asynchronous read of the specified block of a relation
*/
-void
+bool
mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
{
#ifdef USE_PREFETCH
off_t seekpos;
MdfdVec *v;
- v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
+ if (v == NULL)
+ return false;
seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -540,6 +543,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
#endif /* USE_PREFETCH */
+
+ return true;
}
/*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..c39dd533e6 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -49,7 +49,7 @@ typedef struct f_smgr
bool isRedo);
void (*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
- void (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
+ bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
@@ -489,11 +489,15 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
/*
* smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
+ *
+ * In recovery only, this can return false to indicate that a file
+ * doesn't exist (presumably it has been dropped by a later WAL
+ * record).
*/
-void
+bool
smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
{
- smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
+ return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
}
/*
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 166fe334c7..e57f84ee9c 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -327,8 +327,9 @@ extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
/* localbuf.c */
-extern void PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
- BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
+ ForkNumber forkNum,
+ BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum, bool *foundPtr);
extern void MarkLocalBufferDirty(Buffer buffer);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index e00dd3ffb7..1210d1e7e8 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -159,14 +159,21 @@ extern PGDLLIMPORT int32 *LocalRefCount;
*/
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
+typedef enum PrefetchBufferResult
+{
+ PREFETCH_BUFFER_HIT,
+ PREFETCH_BUFFER_MISS,
+ PREFETCH_BUFFER_NOREL
+} PrefetchBufferResult;
+
/*
* prototypes for functions in bufmgr.c
*/
-extern void PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
- ForkNumber forkNum,
- BlockNumber blockNum);
-extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
- BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+ ForkNumber forkNum,
+ BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchBuffer(Relation reln, ForkNumber forkNum,
+ BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index ec7630ce3b..07fd1bb7d0 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -28,7 +28,7 @@ extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
extern void mdextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 243822137c..dc740443e2 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,7 +92,7 @@ extern void smgrdounlink(SMgrRelation reln, bool isRedo);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
--
2.20.1
Attachment: 0005-Prefetch-referenced-blocks-during-recovery-v4.patch (text/x-patch)
From 477ac4a1f280faf189da52e635cea15367a262a8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 2 Mar 2020 15:33:51 +1300
Subject: [PATCH 5/5] Prefetch referenced blocks during recovery.
Introduce a new GUC max_wal_prefetch_distance. If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks. The
goal is to avoid I/O stalls and benefit from concurrent I/O. The number
of concurrent asynchronous reads is capped by the existing
maintenance_io_concurrency GUC. The feature is disabled by default.
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
doc/src/sgml/config.sgml | 38 ++
doc/src/sgml/monitoring.sgml | 69 +++
doc/src/sgml/wal.sgml | 12 +
src/backend/access/transam/Makefile | 1 +
src/backend/access/transam/xlog.c | 64 ++
src/backend/access/transam/xlogprefetcher.c | 654 ++++++++++++++++++++
src/backend/access/transam/xlogutils.c | 23 +-
src/backend/catalog/system_views.sql | 11 +
src/backend/replication/logical/logical.c | 2 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/utils/misc/guc.c | 38 +-
src/include/access/xlog.h | 4 +
src/include/access/xlogprefetcher.h | 28 +
src/include/access/xlogutils.h | 20 +
src/include/catalog/pg_proc.dat | 8 +
src/include/storage/bufmgr.h | 5 +
src/include/utils/guc.h | 2 +
src/test/regress/expected/rules.out | 8 +
18 files changed, 987 insertions(+), 3 deletions(-)
create mode 100644 src/backend/access/transam/xlogprefetcher.c
create mode 100644 src/include/access/xlogprefetcher.h
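Once the feature is enabled, the view added by this patch can be watched
from a hot standby session, for example (illustrative query; the columns
are the ones defined further down):

    SELECT prefetch, skip_hit, skip_new, skip_fpw, skip_seq,
           distance, queue_depth
      FROM pg_stat_wal_prefetcher;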
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 672bf6f1ee..8249ec0139 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3102,6 +3102,44 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-max-wal-prefetch-distance" xreflabel="max_wal_prefetch_distance">
+ <term><varname>max_wal_prefetch_distance</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>max_wal_prefetch_distance</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ The maximum distance to look ahead in the WAL during recovery, to find
+ blocks to prefetch. Prefetching blocks that will soon be needed can
+ reduce I/O wait times. The number of concurrent prefetches is limited
+ by this setting as well as <xref linkend="guc-maintenance-io-concurrency"/>.
+ If this value is specified without units, it is taken as bytes.
+ The default is -1, meaning that WAL prefetching is disabled.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw">
+ <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to prefetch blocks with full page images during recovery.
+ Usually this doesn't help, since such blocks will not be read. However,
+ on file systems with a block size larger than
+ <productname>PostgreSQL</productname>'s, prefetching can avoid a costly
+ read-before-write when the blocks are later written.
+ This setting has no effect unless
+ <xref linkend="guc-max-wal-prefetch-distance"/> is set to a positive number.
+ The default is off.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
<sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 987580d6df..df4291092b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_wal_prefetcher</structname><indexterm><primary>pg_stat_wal_prefetcher</primary></indexterm></entry>
+ <entry>Only one row, showing statistics about blocks prefetched during recovery.
+ See <xref linkend="pg-stat-wal-prefetcher-view"/> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
<entry>At least one row per subscription, showing information about
@@ -2192,6 +2199,68 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
connected server.
</para>
+ <table id="pg-stat-wal-prefetcher-view" xreflabel="pg_stat_wal_prefetcher">
+ <title><structname>pg_stat_wal_prefetcher</structname> View</title>
+ <tgroup cols="3">
+ <thead>
+ <row>
+ <entry>Column</entry>
+ <entry>Type</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><structfield>prefetch</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_hit</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_new</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_fpw</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-wal-prefetch-fpw"/> was set to <literal>off</literal></entry>
+ </row>
+ <row>
+ <entry><structfield>skip_seq</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+ </row>
+ <row>
+ <entry><structfield>distance</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry>
+ </row>
+ <row>
+ <entry><structfield>queue_depth</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ The <structname>pg_stat_wal_prefetcher</structname> view will contain only
+ one row. It is filled with nulls if recovery is not running or WAL
+ prefetching is not enabled. See <xref linkend="guc-max-wal-prefetch-distance"/>
+ for more information. The counters in this view are reset whenever the
+ <xref linkend="guc-max-wal-prefetch-distance"/>,
+ <xref linkend="guc-wal-prefetch-fpw"/> or
+ <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+ the server configuration is reloaded.
+ </para>
+
<table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
<title><structname>pg_stat_subscription</structname> View</title>
<tgroup cols="3">
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index bd9fae544c..9e956ad2a1 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -719,6 +719,18 @@
<acronym>WAL</acronym> call being logged to the server log. This
option might be replaced by a more general mechanism in the future.
</para>
+
+ <para>
+ The <xref linkend="guc-max-wal-prefetch-distance"/> parameter can be
+ used to improve I/O performance during recovery by instructing
+ <productname>PostgreSQL</productname> to initiate reads
+ of disk blocks that will soon be needed, in combination with the
+ <xref linkend="guc-maintenance-io-concurrency"/> parameter. The
+ prefetching mechanism is most likely to be effective on systems
+ with <varname>full_page_writes</varname> set to
+ <literal>off</literal> (where that is safe), and where the working
+ set is larger than RAM. By default, WAL prefetching is disabled.
+ </para>
</sect1>
<sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..20e044c7c8 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
xlogarchive.o \
xlogfuncs.o \
xloginsert.o \
+ xlogprefetcher.o \
xlogreader.o \
xlogutils.o
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fd30e27425..f01a24f577 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -34,6 +34,7 @@
#include "access/xact.h"
#include "access/xlog_internal.h"
#include "access/xloginsert.h"
+#include "access/xlogprefetcher.h"
#include "access/xlogreader.h"
#include "access/xlogutils.h"
#include "catalog/catversion.h"
@@ -105,6 +106,8 @@ int wal_level = WAL_LEVEL_MINIMAL;
int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
int wal_retrieve_retry_interval = 5000;
+int max_wal_prefetch_distance = -1;
+bool wal_prefetch_fpw = false;
#ifdef WAL_DEBUG
bool XLOG_DEBUG = false;
@@ -806,6 +809,7 @@ static XLogSource readSource = XLOG_FROM_ANY;
*/
static XLogSource currentSource = XLOG_FROM_ANY;
static bool lastSourceFailed = false;
+static bool reset_wal_prefetcher = false;
typedef struct XLogPageReadPrivate
{
@@ -6213,6 +6217,7 @@ CheckRequiredParameterValues(void)
}
}
+
/*
* This must be called ONCE during postmaster or standalone-backend startup
*/
@@ -7069,6 +7074,7 @@ StartupXLOG(void)
{
ErrorContextCallback errcallback;
TimestampTz xtime;
+ XLogPrefetcher *prefetcher = NULL;
InRedo = true;
@@ -7076,6 +7082,9 @@ StartupXLOG(void)
(errmsg("redo starts at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
+ /* the first time through, see if we need to enable prefetching */
+ ResetWalPrefetcher();
+
/*
* main redo apply loop
*/
@@ -7105,6 +7114,31 @@ StartupXLOG(void)
/* Handle interrupt signals of startup process */
HandleStartupProcInterrupts();
+ /*
+ * The first time through, or if any relevant setting or the
+ * WAL source changes, we'll restart the prefetching machinery
+ * as appropriate. This is simpler than trying to handle
+ * various complicated state changes.
+ */
+ if (unlikely(reset_wal_prefetcher))
+ {
+ /* If we had one already, destroy it. */
+ if (prefetcher)
+ {
+ XLogPrefetcherFree(prefetcher);
+ prefetcher = NULL;
+ }
+ /* If we want one, create it. */
+ if (max_wal_prefetch_distance > 0)
+ prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr,
+ currentSource == XLOG_FROM_STREAM);
+ reset_wal_prefetcher = false;
+ }
+
+ /* Perform WAL prefetching, if enabled. */
+ if (prefetcher)
+ XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr);
+
/*
* Pause WAL replay, if requested by a hot-standby session via
* SetRecoveryPause().
@@ -7292,6 +7326,8 @@ StartupXLOG(void)
/*
* end of main redo apply loop
*/
+ if (prefetcher)
+ XLogPrefetcherFree(prefetcher);
if (reachedRecoveryTarget)
{
@@ -10155,6 +10191,24 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
}
}
+void
+assign_max_wal_prefetch_distance(int new_value, void *extra)
+{
+ /* Reset the WAL prefetcher, because a setting it depends on changed. */
+ max_wal_prefetch_distance = new_value;
+ if (AmStartupProcess())
+ ResetWalPrefetcher();
+}
+
+void
+assign_wal_prefetch_fpw(bool new_value, void *extra)
+{
+ /* Reset the WAL prefetcher, because a setting it depends on changed. */
+ wal_prefetch_fpw = new_value;
+ if (AmStartupProcess())
+ ResetWalPrefetcher();
+}
+
/*
* Issue appropriate kind of fsync (if any) for an XLOG output file.
@@ -11961,6 +12015,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* and move on to the next state.
*/
currentSource = XLOG_FROM_STREAM;
+ ResetWalPrefetcher();
break;
case XLOG_FROM_STREAM:
@@ -12390,3 +12445,12 @@ XLogRequestWalReceiverReply(void)
{
doRequestWalReceiverReply = true;
}
+
+/*
+ * Schedule a WAL prefetcher reset, on change of relevant settings.
+ */
+void
+ResetWalPrefetcher(void)
+{
+ reset_wal_prefetcher = true;
+}
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
new file mode 100644
index 0000000000..1d0bce692a
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -0,0 +1,654 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ * Prefetching support for PostgreSQL write-ahead log manager
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogprefetcher.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetcher.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log message
+ * that appears at the end of crash recovery.
+ */
+#define XLOGPREFETCHER_MONITORING_SAMPLE_STEP 32768
+
+/*
+ * Internal state used for book-keeping.
+ */
+struct XLogPrefetcher
+{
+ /* Reader and current reading state. */
+ XLogReaderState *reader;
+ XLogReadLocalOptions options;
+ bool have_record;
+ bool shutdown;
+ int next_block_id;
+
+ /* Book-keeping required to avoid accessing non-existing blocks. */
+ HTAB *filter_table;
+ dlist_head filter_queue;
+
+ /* Book-keeping required to limit concurrent prefetches. */
+ XLogRecPtr *prefetch_queue;
+ int prefetch_queue_size;
+ int prefetch_head;
+ int prefetch_tail;
+
+ /* Details of last prefetch to skip repeats and seq scans. */
+ SMgrRelation last_reln;
+ RelFileNode last_rnode;
+ BlockNumber last_blkno;
+
+ /* Counters used to compute avg_queue_depth and avg_distance. */
+ double samples;
+ double queue_depth_sum;
+ double distance_sum;
+ XLogRecPtr next_sample_lsn;
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+ RelFileNode rnode;
+ XLogRecPtr filter_until_replayed;
+ BlockNumber filter_from_block;
+ dlist_node link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory just for the benefit of monitoring
+ * functions.
+ */
+typedef struct XLogPrefetcherMonitoringStats
+{
+ pg_atomic_uint64 prefetch; /* Prefetches initiated. */
+ pg_atomic_uint64 skip_hit; /* Blocks already buffered. */
+ pg_atomic_uint64 skip_new; /* New/missing blocks filtered. */
+ pg_atomic_uint64 skip_fpw; /* FPWs skipped. */
+ pg_atomic_uint64 skip_seq; /* Sequential/repeat blocks skipped. */
+ int distance; /* Number of bytes ahead in the WAL. */
+ int queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetcherMonitoringStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+ RelFileNode rnode,
+ BlockNumber blockno,
+ XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+ RelFileNode rnode,
+ BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)
+{
+ pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
+static XLogPrefetcherMonitoringStats *MonitoringStats;
+
+size_t
+XLogPrefetcherShmemSize(void)
+{
+ return sizeof(XLogPrefetcherMonitoringStats);
+}
+
+static void
+XLogPrefetcherResetMonitoringStats(void)
+{
+ pg_atomic_init_u64(&MonitoringStats->prefetch, 0);
+ pg_atomic_init_u64(&MonitoringStats->skip_hit, 0);
+ pg_atomic_init_u64(&MonitoringStats->skip_new, 0);
+ pg_atomic_init_u64(&MonitoringStats->skip_fpw, 0);
+ pg_atomic_init_u64(&MonitoringStats->skip_seq, 0);
+ MonitoringStats->distance = -1;
+ MonitoringStats->queue_depth = 0;
+}
+
+void
+XLogPrefetcherShmemInit(void)
+{
+ bool found;
+
+ MonitoringStats = (XLogPrefetcherMonitoringStats *)
+ ShmemInitStruct("XLogPrefetcherMonitoringStats",
+ sizeof(XLogPrefetcherMonitoringStats),
+ &found);
+ if (!found)
+ XLogPrefetcherResetMonitoringStats();
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
+{
+ static HASHCTL hash_table_ctl = {
+ .keysize = sizeof(RelFileNode),
+ .entrysize = sizeof(XLogPrefetcherFilter)
+ };
+ XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher));
+
+ prefetcher->options.nowait = true;
+ if (streaming)
+ {
+ /*
+ * We're only allowed to read as far as the WAL receiver has written.
+ * We don't have to wait for it to be flushed, though, as recovery
+ * does, so that gives us a chance to get a bit further ahead.
+ */
+ prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+ }
+ else
+ {
+ /* We're allowed to read as far as we can. */
+ prefetcher->options.read_upto_policy = XLRO_LSN;
+ prefetcher->options.lsn = (XLogRecPtr) -1;
+ }
+ prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+ NULL,
+ read_local_xlog_page,
+ &prefetcher->options);
+ prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024,
+ &hash_table_ctl,
+ HASH_ELEM | HASH_BLOBS);
+ dlist_init(&prefetcher->filter_queue);
+
+ /*
+ * The size of the queue is based on the maintenance_io_concurrency
+ * setting. In theory we might have a separate queue for each tablespace,
+ * but it's not clear how that should work, so for now we'll just use the
+ * general GUC to rate-limit all prefetching.
+ */
+ prefetcher->prefetch_queue_size = maintenance_io_concurrency;
+ prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);
+ prefetcher->prefetch_head = prefetcher->prefetch_tail = 0;
+
+ /* Prepare to read at the given LSN. */
+ elog(LOG, "WAL prefetch started at %X/%X",
+ (uint32) (lsn >> 32), (uint32) lsn);
+ XLogBeginRead(prefetcher->reader, lsn);
+
+ XLogPrefetcherResetMonitoringStats();
+
+ return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+ double avg_distance = 0;
+ double avg_queue_depth = 0;
+
+ /* Log final statistics. */
+ if (prefetcher->samples > 0)
+ {
+ avg_distance = prefetcher->distance_sum / prefetcher->samples;
+ avg_queue_depth = prefetcher->queue_depth_sum / prefetcher->samples;
+ }
+ elog(LOG,
+ "WAL prefetch finished at %X/%X; "
+ "prefetch = " UINT64_FORMAT ", "
+ "skip_hit = " UINT64_FORMAT ", "
+ "skip_new = " UINT64_FORMAT ", "
+ "skip_fpw = " UINT64_FORMAT ", "
+ "skip_seq = " UINT64_FORMAT ", "
+ "avg_distance = %f, "
+ "avg_queue_depth = %f",
+ (uint32) (prefetcher->reader->EndRecPtr >> 32),
+ (uint32) (prefetcher->reader->EndRecPtr),
+ pg_atomic_read_u64(&MonitoringStats->prefetch),
+ pg_atomic_read_u64(&MonitoringStats->skip_hit),
+ pg_atomic_read_u64(&MonitoringStats->skip_new),
+ pg_atomic_read_u64(&MonitoringStats->skip_fpw),
+ pg_atomic_read_u64(&MonitoringStats->skip_seq),
+ avg_distance,
+ avg_queue_depth);
+ XLogReaderFree(prefetcher->reader);
+ hash_destroy(prefetcher->filter_table);
+ pfree(prefetcher->prefetch_queue);
+ pfree(prefetcher);
+
+ XLogPrefetcherResetMonitoringStats();
+}
+
+/*
+ * Read ahead in the WAL, as far as we can within the limits set by the user.
+ * Begin fetching any referenced blocks that are not already in the buffer
+ * pool.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ /* If an error has occurred or we've hit the end of the WAL, do nothing. */
+ if (prefetcher->shutdown)
+ return;
+
+ /*
+ * Have any in-flight prefetches definitely completed, judging by the LSN
+ * that is currently being replayed?
+ */
+ XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+ /*
+ * Do we already have the maximum permitted number of I/Os running
+ * (according to the information we have)? If so, we have to wait for at
+ * least one to complete, so give up early.
+ */
+ if (XLogPrefetcherSaturated(prefetcher))
+ return;
+
+ /* Can we drop any filters yet, due to problem records being replayed? */
+ XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+ /* Main prefetch loop. */
+ for (;;)
+ {
+ XLogReaderState *reader = prefetcher->reader;
+ char *error;
+ int64 distance;
+
+ /* If we don't already have a record, then try to read one. */
+ if (!prefetcher->have_record)
+ {
+ if (!XLogReadRecord(reader, &error))
+ {
+ /* If we got an error, log it and give up. */
+ if (error)
+ {
+ elog(LOG, "WAL prefetch error: %s", error);
+ prefetcher->shutdown = true;
+ }
+ /* Otherwise, we'll try again later when more data is here. */
+ return;
+ }
+ prefetcher->have_record = true;
+ prefetcher->next_block_id = 0;
+ }
+
+ /* How far ahead of replay are we now? */
+ distance = prefetcher->reader->ReadRecPtr - replaying_lsn;
+
+ /* Update distance shown in shm. */
+ MonitoringStats->distance = distance;
+
+ /* Sample the averages so we can log them at end of recovery. */
+ if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+ {
+ prefetcher->distance_sum += MonitoringStats->distance;
+ prefetcher->queue_depth_sum += MonitoringStats->queue_depth;
+ prefetcher->samples += 1.0;
+ prefetcher->next_sample_lsn =
+ replaying_lsn + XLOGPREFETCHER_MONITORING_SAMPLE_STEP;
+ }
+
+ /* Are we too far ahead of replay? */
+ if (distance >= max_wal_prefetch_distance)
+ break;
+
+ /*
+ * If this is a record that creates a new SMGR relation, we'll avoid
+ * prefetching anything from that rnode until it has been replayed.
+ */
+ if (replaying_lsn < reader->ReadRecPtr &&
+ XLogRecGetRmid(reader) == RM_SMGR_ID &&
+ (XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+ XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+ reader->ReadRecPtr);
+ }
+
+ /*
+ * Scan the record for block references. We might already have been
+ * partway through processing this record when we hit maximum I/O
+ * concurrency, so start where we left off.
+ */
+ for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i)
+ {
+ DecodedBkpBlock *block = &reader->blocks[i];
+ SMgrRelation reln;
+
+ /* Ignore everything but the main fork for now. */
+ if (block->forknum != MAIN_FORKNUM)
+ continue;
+
+ /*
+ * If there is a full page image attached, we won't be reading the
+ * page, so you might think we should skip it. However, if the
+ * underlying filesystem uses larger logical blocks than we do, it
+ * might still need to perform a read-before-write some time later.
+ * Therefore, only prefetch if configured to do so.
+ */
+ if (block->has_image && !wal_prefetch_fpw)
+ {
+ inc_counter(&MonitoringStats->skip_fpw);
+ continue;
+ }
+
+ /*
+ * If this block will initialize a new page then it's probably an
+ * extension. Since it might create a new segment, we can't try
+ * to prefetch this block until the record has been replayed, or we
+ * might try to open a file that doesn't exist yet.
+ */
+ if (block->flags & BKPBLOCK_WILL_INIT)
+ {
+ XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+ reader->ReadRecPtr);
+ inc_counter(&MonitoringStats->skip_new);
+ continue;
+ }
+
+ /* Should we skip this block due to a filter? */
+ if (XLogPrefetcherIsFiltered(prefetcher, block->rnode,
+ block->blkno))
+ {
+ inc_counter(&MonitoringStats->skip_new);
+ continue;
+ }
+
+ /* Fast path for repeated references to the same relation. */
+ if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+ {
+ /*
+ * If this is a repeat or sequential access, then skip it. We
+ * expect the kernel to detect sequential access on its own
+ * and do a better job than we could.
+ */
+ if (block->blkno == prefetcher->last_blkno ||
+ block->blkno == prefetcher->last_blkno + 1)
+ {
+ prefetcher->last_blkno = block->blkno;
+ inc_counter(&MonitoringStats->skip_seq);
+ continue;
+ }
+
+ /* We can avoid calling smgropen(). */
+ reln = prefetcher->last_reln;
+ }
+ else
+ {
+ /* Otherwise we have to open it. */
+ reln = smgropen(block->rnode, InvalidBackendId);
+ prefetcher->last_rnode = block->rnode;
+ prefetcher->last_reln = reln;
+ }
+ prefetcher->last_blkno = block->blkno;
+
+ /* Try to prefetch this block! */
+ switch (PrefetchSharedBuffer(reln, block->forknum, block->blkno))
+ {
+ case PREFETCH_BUFFER_HIT:
+ /* It's already cached, so do nothing. */
+ inc_counter(&MonitoringStats->skip_hit);
+ break;
+ case PREFETCH_BUFFER_MISS:
+ /*
+ * I/O has possibly been initiated (we don't know whether the kernel
+ * already had the page cached, so for lack of better information we
+ * just have to assume that it was). Record
+ * this as an I/O in progress until eventually we replay this
+ * LSN.
+ */
+ inc_counter(&MonitoringStats->prefetch);
+ XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+ /*
+ * If the queue is now full, we'll have to wait before
+ * processing any more blocks from this record.
+ */
+ if (XLogPrefetcherSaturated(prefetcher))
+ {
+ prefetcher->next_block_id = i + 1;
+ return;
+ }
+ break;
+ case PREFETCH_BUFFER_NOREL:
+ /*
+ * The underlying segment file doesn't exist. Presumably it
+ * will be unlinked by a later WAL record. When recovery
+ * reads this block, it will use the EXTENSION_CREATE_RECOVERY
+ * flag. We certainly don't want to do that sort of thing
+ * while merely prefetching, so let's just ignore references
+ * to this relation until this record is replayed, and let
+ * recovery create the dummy file or complain if something is
+ * wrong.
+ */
+ XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+ reader->ReadRecPtr);
+ inc_counter(&MonitoringStats->skip_new);
+ break;
+ }
+ }
+
+ /* Advance to the next record. */
+ prefetcher->have_record = false;
+ }
+}
+
+/*
+ * Expose statistics about WAL prefetching.
+ */
+Datum
+pg_stat_get_wal_prefetcher(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_WAL_PREFETCHER_COLS 7
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ Datum values[PG_STAT_GET_WAL_PREFETCHER_COLS];
+ bool nulls[PG_STAT_GET_WAL_PREFETCHER_COLS];
+
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mode required, but it is not allowed in this context")));
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ if (MonitoringStats->distance < 0)
+ {
+ for (int i = 0; i < PG_STAT_GET_WAL_PREFETCHER_COLS; ++i)
+ nulls[i] = true;
+ }
+ else
+ {
+ for (int i = 0; i < PG_STAT_GET_WAL_PREFETCHER_COLS; ++i)
+ nulls[i] = false;
+ values[0] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->prefetch));
+ values[1] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_hit));
+ values[2] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_new));
+ values[3] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_fpw));
+ values[4] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_seq));
+ values[5] = Int32GetDatum(MonitoringStats->distance);
+ values[6] = Int32GetDatum(MonitoringStats->queue_depth);
+ }
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+ BlockNumber blockno, XLogRecPtr lsn)
+{
+ XLogPrefetcherFilter *filter;
+ bool found;
+
+ filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+ if (!found)
+ {
+ /*
+ * Don't allow any prefetching of this block or higher until replayed.
+ */
+ filter->filter_until_replayed = lsn;
+ filter->filter_from_block = blockno;
+ dlist_push_head(&prefetcher->filter_queue, &filter->link);
+ }
+ else
+ {
+ /*
+ * We were already filtering this rnode. Extend the filter's lifetime
+ * to cover this WAL record, but leave the (presumably lower) block
+ * number there because we don't want to have to track individual
+ * blocks.
+ */
+ filter->filter_until_replayed = lsn;
+ dlist_delete(&filter->link);
+ dlist_push_head(&prefetcher->filter_queue, &filter->link);
+ }
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range? That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+ {
+ XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+ link,
+ &prefetcher->filter_queue);
+
+ if (filter->filter_until_replayed >= replaying_lsn)
+ break;
+ dlist_delete(&filter->link);
+ hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+ }
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+ BlockNumber blockno)
+{
+ /*
+ * Test for empty queue first, because we expect it to be empty most of the
+ * time and we can avoid the hash table lookup in that case.
+ */
+ if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+ {
+ XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+ HASH_FIND, NULL);
+
+ if (filter && filter->filter_from_block <= blockno)
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Insert an LSN into the queue. The queue must not be full already. This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr prefetching_lsn)
+{
+ Assert(!XLogPrefetcherSaturated(prefetcher));
+ prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+ prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+ MonitoringStats->queue_depth++;
+ Assert(MonitoringStats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet? That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches. For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+ prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+ {
+ prefetcher->prefetch_tail++;
+ prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+ MonitoringStats->queue_depth--;
+ Assert(MonitoringStats->queue_depth >= 0);
+ }
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+ return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+ prefetcher->prefetch_tail;
+}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b217ffa52f..fad2acb514 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walreceiver.h"
#include "storage/smgr.h"
#include "utils/guc.h"
#include "utils/hsearch.h"
@@ -827,6 +828,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
TimeLineID tli;
int count;
WALReadError errinfo;
+ XLogReadLocalOptions *options = (XLogReadLocalOptions *) state->private_data;
loc = targetPagePtr + reqLen;
@@ -841,7 +843,23 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
* notices recovery finishes, so we only have to maintain it for the
* local process until recovery ends.
*/
- if (!RecoveryInProgress())
+ if (options)
+ {
+ switch (options->read_upto_policy)
+ {
+ case XLRO_WALRCV_WRITTEN:
+ read_upto = GetWalRcvWriteRecPtr();
+ break;
+ case XLRO_LSN:
+ read_upto = options->lsn;
+ break;
+ default:
+ read_upto = 0;
+ elog(ERROR, "unknown read_upto_policy value");
+ break;
+ }
+ }
+ else if (!RecoveryInProgress())
read_upto = GetFlushRecPtr();
else
read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
@@ -879,6 +897,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
if (loc <= read_upto)
break;
+ if (options && options->nowait)
+ break;
+
CHECK_FOR_INTERRUPTS();
pg_usleep(1000L);
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b8a3f46912..7b27ac4805 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -811,6 +811,17 @@ CREATE VIEW pg_stat_wal_receiver AS
FROM pg_stat_get_wal_receiver() s
WHERE s.pid IS NOT NULL;
+CREATE VIEW pg_stat_wal_prefetcher AS
+ SELECT
+ s.prefetch,
+ s.skip_hit,
+ s.skip_new,
+ s.skip_fpw,
+ s.skip_seq,
+ s.distance,
+ s.queue_depth
+ FROM pg_stat_get_wal_prefetcher() s;
+
CREATE VIEW pg_stat_subscription AS
SELECT
su.oid AS subid,
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index e3da7d3625..34f3017871 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -169,7 +169,7 @@ StartupDecodingContext(List *output_plugin_options,
ctx->slot = slot;
- ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, ctx);
+ ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, NULL);
if (!ctx->reader)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..5ca98b8886 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
#include "access/nbtree.h"
#include "access/subtrans.h"
#include "access/twophase.h"
+#include "access/xlogprefetcher.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -124,6 +125,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, PredicateLockShmemSize());
size = add_size(size, ProcGlobalShmemSize());
size = add_size(size, XLOGShmemSize());
+ size = add_size(size, XLogPrefetcherShmemSize());
size = add_size(size, CLOGShmemSize());
size = add_size(size, CommitTsShmemSize());
size = add_size(size, SUBTRANSShmemSize());
@@ -212,6 +214,7 @@ CreateSharedMemoryAndSemaphores(void)
* Set up xlog, clog, and buffers
*/
XLOGShmemInit();
+ XLogPrefetcherShmemInit();
CLOGShmemInit();
CommitTsShmemInit();
SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 68082315ac..a2a9f62160 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -197,6 +197,7 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
static void assign_pgstat_temp_directory(const char *newval, void *extra);
static bool check_application_name(char **newval, void **extra, GucSource source);
static void assign_application_name(const char *newval, void *extra);
@@ -1241,6 +1242,18 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"wal_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Prefetch blocks that have full page images in the WAL"),
+ gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+ "entirely overwritten, but if the logical page size of the filesystem is "
+ "larger than PostgreSQL's, this can be beneficial. This option has no "
+ "effect unless max_wal_prefetch_distance is set to a positive number.")
+ },
+ &wal_prefetch_fpw,
+ false,
+ NULL, assign_wal_prefetch_fpw, NULL
+ },
{
{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2627,6 +2640,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"max_wal_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+ gettext_noop("Maximum number of bytes to read ahead in the WAL to prefetch referenced blocks."),
+ gettext_noop("Set to -1 to disable WAL prefetching."),
+ GUC_UNIT_BYTE
+ },
+ &max_wal_prefetch_distance,
+ -1, -1, INT_MAX,
+ NULL, assign_max_wal_prefetch_distance, NULL
+ },
+
{
{"wal_keep_segments", PGC_SIGHUP, REPLICATION_SENDING,
gettext_noop("Sets the number of WAL files held for standby servers."),
@@ -2900,7 +2924,8 @@ static struct config_int ConfigureNamesInt[] =
0,
#endif
0, MAX_IO_CONCURRENCY,
- check_maintenance_io_concurrency, NULL, NULL
+ check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+ NULL
},
{
@@ -11498,6 +11523,17 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
return true;
}
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+ /* Reset the WAL prefetcher, because a setting it depends on changed. */
+ maintenance_io_concurrency = newval;
+ if (AmStartupProcess())
+ ResetWalPrefetcher();
+#endif
+}
+
static void
assign_pgstat_temp_directory(const char *newval, void *extra)
{
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033fc20..82829d7854 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -111,6 +111,8 @@ extern int wal_keep_segments;
extern int XLOGbuffers;
extern int XLogArchiveTimeout;
extern int wal_retrieve_retry_interval;
+extern int max_wal_prefetch_distance;
+extern bool wal_prefetch_fpw;
extern char *XLogArchiveCommand;
extern bool EnableHotStandby;
extern bool fullPageWrites;
@@ -319,6 +321,8 @@ extern void SetWalWriterSleeping(bool sleeping);
extern void XLogRequestWalReceiverReply(void);
+extern void ResetWalPrefetcher(void);
+
extern void assign_max_wal_size(int newval, void *extra);
extern void assign_checkpoint_completion_target(double newval, void *extra);
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
new file mode 100644
index 0000000000..585f5564a3
--- /dev/null
+++ b/src/include/access/xlogprefetcher.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.h
+ * Declarations for the XLog prefetching facility
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/xlogprefetcher.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCHER_H
+#define XLOGPREFETCHER_H
+
+#include "access/xlogdefs.h"
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch, XLogRecPtr replaying_lsn);
+
+extern size_t XLogPrefetcherShmemSize(void);
+extern void XLogPrefetcherShmemInit(void);
+
+#endif
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5181a077d9..1c8e67d74a 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,6 +47,26 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
extern void FreeFakeRelcacheEntry(Relation fakerel);
+/*
+ * A pointer to an XLogReadLocalOptions struct can be supplied as the private
+ * data for an xlog reader, causing read_local_xlog_page to modify its
+ * behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+ /* Don't block waiting for new WAL to arrive. */
+ bool nowait;
+
+ /* How far to read. */
+ enum {
+ XLRO_WALRCV_WRITTEN,
+ XLRO_LSN
+ } read_upto_policy;
+
+ /* If read_upto_policy is XLRO_LSN, the LSN. */
+ XLogRecPtr lsn;
+} XLogReadLocalOptions;
+
extern int read_local_xlog_page(XLogReaderState *state,
XLogRecPtr targetPagePtr, int reqLen,
XLogRecPtr targetRecPtr, char *cur_page);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7fb574f9dc..742741afa1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6082,6 +6082,14 @@
prorettype => 'bool', proargtypes => '',
prosrc => 'pg_is_wal_replay_paused' },
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+ proname => 'pg_stat_get_wal_prefetcher', prorows => '1', provolatile => 'v',
+ proretset => 't', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,int8,int8,int8,int8,int4,int4}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth}',
+ prosrc => 'pg_stat_get_wal_prefetcher' },
+
{ oid => '2621', descr => 'reload configuration files',
proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 1210d1e7e8..3ca171adb8 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -159,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
*/
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
+/*
+ * When you try to prefetch a buffer, there are three possibilities: it's
+ * already cached in our buffer pool, it's not cached but we can tell the
+ * kernel we'll be loading it soon, or the relation file doesn't exist.
+ */
typedef enum PrefetchBufferResult
{
PREFETCH_BUFFER_HIT,
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ce93ace76c..7d076a9743 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -438,5 +438,7 @@ extern void assign_search_path(const char *newval, void *extra);
/* in access/transam/xlog.c */
extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+extern void assign_max_wal_prefetch_distance(int new_value, void *extra);
+extern void assign_wal_prefetch_fpw(bool new_value, void *extra);
#endif /* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c7304611c3..63bbb796fc 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2102,6 +2102,14 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.autoanalyze_count
FROM pg_stat_all_tables
WHERE ((pg_stat_all_tables.schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (pg_stat_all_tables.schemaname !~ '^pg_toast'::text));
+pg_stat_wal_prefetcher| SELECT s.prefetch,
+ s.skip_hit,
+ s.skip_new,
+ s.skip_fpw,
+ s.skip_seq,
+ s.distance,
+ s.queue_depth
+ FROM pg_stat_get_wal_prefetcher() s(prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth);
pg_stat_wal_receiver| SELECT s.pid,
s.status,
s.receive_start_lsn,
--
2.20.1
On 2020-Mar-17, Thomas Munro wrote:
Hi Thomas
On Sat, Mar 14, 2020 at 10:15 AM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
I didn't manage to go over 0005 though, but I agree with Tomas that
having this be configurable in terms of bytes of WAL is not very
user-friendly.

The primary control is now maintenance_io_concurrency, which is
basically what Tomas suggested.
The byte-based control is just a cap to prevent it reading a crazy
distance ahead, that also functions as the on/off switch for the
feature. In this version I've added "max" to the name, to make that
clearer.
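
For example, with purely illustrative values (not recommendations), the
feature could be switched on with something like this in postgresql.conf:

    max_wal_prefetch_distance = 256kB   # upper bound on WAL read-ahead; -1 disables
    maintenance_io_concurrency = 10     # caps the number of concurrent prefetches
    wal_prefetch_fpw = off              # skip blocks covered by full page images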
Mumble. I guess I should wait to comment on this after reading 0005
more in depth.
First of all, let me join the crowd chanting that this is badly needed;
I don't need to repeat what Chittenden's talk showed. "WAL recovery is
now 10x-20x times faster" would be a good item for pg13 press release,
I think.

We should be careful about over-promising here: Sean basically had a
best case scenario for this type of technology, partly due to his 16kB
filesystem blocks. Common results may be a lot more pedestrian,
though it could get more interesting if we figure out how to get rid
of FPWs...
Well, in my mind it's an established fact that our WAL replay uses far
too little of the available I/O speed. I guess if the system is
generating little WAL, then this change will show no benefit, but that's
not the kind of system that cares about this anyway -- for the others,
the parallelisation gains will be substantial, I'm sure.
From a61b4e00c42ace5db1608e02165f89094bf86391 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 3 Dec 2019 17:13:40 +1300
Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.

Previously a Relation was required, but it's annoying to have
to create a "fake" one in recovery.
While staring at this, I decided that SharedPrefetchBuffer() was a
weird word order, so I changed it to PrefetchSharedBuffer(). Then, by
analogy, I figured I should also change the pre-existing function
LocalPrefetchBuffer() to PrefetchLocalBuffer(). Do you think this is
an improvement?
Looks good. I doubt you'll break anything by renaming that routine.
From acbff1444d0acce71b0218ce083df03992af1581 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:10:17 +1300
Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().

The new name better reflects the fact that the value it returns
is updated only when received data has been flushed to disk.

An upcoming patch will make use of the latest data that was
written without waiting for it to be flushed, so use more
precise function names.

Ugh. (Not for your patch -- I mean for the existing naming convention).
It would make sense to rename WalRcvData->receivedUpto in this commit,
maybe to flushedUpto.

Ok, I renamed that variable and a related one. There are more things
you could rename if you pull on that thread some more, including
pg_stat_wal_receiver's received_lsn column, but I didn't do that in
this patch.
+1 for that approach. Maybe we'll want to rename the SQL-visible name,
but I wouldn't burden this patch with that, lest we lose the entire
series to that :-)
+ pg_atomic_uint64 writtenUpto;
Are we already using uint64s for XLogRecPtrs anywhere? This seems
novel. Given this, I wonder if the comment near "mutex" needs an
update ("except where atomics are used"), or perhaps just move the
member to after the line with mutex.

Moved.
LGTM.
We use [u]int64 in various places in the replication code. Ideally
I'd have a magic way to say atomic<XLogRecPtr> so I didn't have to
assume that pg_atomic_uint64 is the right atomic integer width and
signedness, but here we are. In dsa.h I made a special typedef for
the atomic version of something else, but that's because the size of
that thing varied depending on the build, whereas our LSNs are of a
fixed width that ought to be en... <trails off>.
Let's rewrite Postgres in Rust ...
I didn't understand the purpose of inc_counter() as written. Why not
just pg_atomic_fetch_add_u64(..., 1)?

I didn't want counters that wrap at ~4 billion, but I did want to be
able to read and write concurrently without tearing. Instructions
like "lock xadd" would provide more guarantees that I don't need,
since only one thread is doing all the writing and there's no ordering
requirement. It's basically just counter++, but some platforms need a
spinlock to perform atomic read and write of 64 bit wide numbers, so
more hoop jumping is required.
Ah, I see, you don't want lock xadd ... That's non-obvious. I suppose
the function could use more commentary on *why* you're doing it that way
then.
 /*
  * smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
+ *
+ * In recovery only, this can return false to indicate that a file
+ * doesn't exist (presumably it has been dropped by a later WAL
+ * record).
  */
-void
+bool
 smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)

I think this API, where the behavior of a low-level module changes
depending on InRecovery, is confusingly crazy. I'd rather have the
callers specifying whether they're OK with a file that doesn't exist.

Hmm. But... md.c has other code like that. It's true that I'm adding
InRecovery awareness to a function that didn't previously have it, but
that's just because we previously had no reason to prefetch stuff in
recovery.
True. I'm uncomfortable about it anyway. I also noticed that
_mdfd_getseg() already has InRecovery-specific behavior flags.
Clearly that ship has sailed. Consider my objection^W comment withdrawn.
Umm, I would keep the return values of both these functions in sync.
It's really strange that PrefetchBuffer does not return
PrefetchBufferResult, don't you think?

Agreed, and changed. I suspect that other users of the main
PrefetchBuffer() call will eventually want that, to do a better job of
keeping the request queue full, for example bitmap heap scan and
(hypothetical) btree scan with prefetch.
LGTM.
As before, I didn't get to reading 0005 in depth.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 18, 2020 at 2:47 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2020-Mar-17, Thomas Munro wrote:
I didn't want counters that wrap at ~4 billion, but I did want to be
able to read and write concurrently without tearing. Instructions
like "lock xadd" would provide more guarantees that I don't need,
since only one thread is doing all the writing and there's no ordering
requirement. It's basically just counter++, but some platforms need a
spinlock to perform atomic read and write of 64 bit wide numbers, so
more hoop jumping is required.

Ah, I see, you don't want lock xadd ... That's non-obvious. I suppose
the function could use more commentary on *why* you're doing it that way
then.
I updated the comment:
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically. The counters will only be written to by one process, and there
+ * is no ordering requirement, so there's no point in using higher overhead
+ * pg_atomic_fetch_add_u64().
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)
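
The body, as it appears in the attached 0005 patch, is just a plain
read-modify-write on the atomic counter:

static inline void inc_counter(pg_atomic_uint64 *counter)
{
	/* single writer, no ordering requirement: a torn-write-safe counter++ */
	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
}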
Umm, I would keep the return values of both these functions in sync.
It's really strange that PrefetchBuffer does not return
PrefetchBufferResult, don't you think?

Agreed, and changed. I suspect that other users of the main
PrefetchBuffer() call will eventually want that, to do a better job of
keeping the request queue full, for example bitmap heap scan and
(hypothetical) btree scan with prefetch.

LGTM.
Here's a new version that changes that part just a bit more, after a
brief chat with Andres about his async I/O plans. It seems clear that
returning an enum isn't very extensible, so I decided to try making
PrefetchBufferResult a struct whose contents can be extended in the
future. In this patch set it's still just used to distinguish 3 cases
(hit, miss, no file), but it's now expressed as a buffer and a flag to
indicate whether I/O was initiated. You could imagine that the second
thing might be replaced by a pointer to an async I/O handle you can
wait on or some other magical thing from the future.
The concept here is that eventually we'll have just one XLogReader for
both read ahead and recovery, and we could attach the prefetch results
to the decoded records, and then recovery would try to use already
looked up buffers to avoid a bit of work (and then recheck). In other
words, the WAL would be decoded only once, and the buffers would
hopefully be looked up only once, so you'd claw back all of the
overheads of this patch. For now that's not done, and the buffer in
the result is only compared with InvalidBuffer to check if there was a
hit or not.
Similar things could be done for bitmap heap scan and btree prefetch
with this interface: their prefetch machinery could hold onto these
results in their block arrays and try to avoid a more expensive
ReadBuffer() call if they already have a buffer (though as before,
there's a small chance it turns out to be the wrong one and they need
to fall back to ReadBuffer()).
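
To make the intended recheck discipline concrete, here is a rough sketch
(illustrative only; rel and blkno stand for whatever the caller happens to
be scanning) of how such a caller could interpret the result:

	PrefetchBufferResult result = PrefetchBuffer(rel, MAIN_FORKNUM, blkno);

	if (BufferIsValid(result.buffer))
	{
		/*
		 * The block was already in shared buffers at prefetch time.  The
		 * buffer is not pinned, so it must be rechecked before being used
		 * in place of a full ReadBuffer() call.
		 */
	}
	else if (result.initiated_io)
	{
		/* Cache miss: an asynchronous read has been started. */
	}
	else
	{
		/*
		 * Nothing was done: prefetching isn't compiled in, or (on the
		 * recovery-only PrefetchSharedBuffer() path) the underlying file
		 * doesn't exist any more.
		 */
	}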
As before, I didn't get to reading 0005 in depth.
Updated to account for the above-mentioned change, and with a couple
of elog() calls changed to ereport().
Attachments:
0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRela-v5.patch
From 94df05846b155dfc68997f17899ddb34637d868a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 15:25:55 +1300
Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.
Previously a Relation was required, but it's annoying to have to create
a "fake" one in recovery. A new function PrefetchSharedBuffer() is
provided that works with SMgrRelation, and LocalPrefetchBuffer() is
renamed to PrefetchLocalBuffer() to fit with that more natural naming
scheme.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/storage/buffer/bufmgr.c | 84 ++++++++++++++++-----------
src/backend/storage/buffer/localbuf.c | 4 +-
src/include/storage/buf_internals.h | 2 +-
src/include/storage/bufmgr.h | 6 ++
4 files changed, 59 insertions(+), 37 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e05e2b3456..d30aed6fd9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -466,6 +466,53 @@ static int ckpt_buforder_comparator(const void *pa, const void *pb);
static int ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
+/*
+ * Implementation of PrefetchBuffer() for shared buffers.
+ */
+void
+PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+ ForkNumber forkNum,
+ BlockNumber blockNum)
+{
+#ifdef USE_PREFETCH
+ BufferTag newTag; /* identity of requested block */
+ uint32 newHash; /* hash value for newTag */
+ LWLock *newPartitionLock; /* buffer partition lock for it */
+ int buf_id;
+
+ Assert(BlockNumberIsValid(blockNum));
+
+ /* create a tag so we can lookup the buffer */
+ INIT_BUFFERTAG(newTag, smgr_reln->smgr_rnode.node,
+ forkNum, blockNum);
+
+ /* determine its hash code and partition lock ID */
+ newHash = BufTableHashCode(&newTag);
+ newPartitionLock = BufMappingPartitionLock(newHash);
+
+ /* see if the block is in the buffer pool already */
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = BufTableLookup(&newTag, newHash);
+ LWLockRelease(newPartitionLock);
+
+ /* If not in buffers, initiate prefetch */
+ if (buf_id < 0)
+ smgrprefetch(smgr_reln, forkNum, blockNum);
+
+ /*
+ * If the block *is* in buffers, we do nothing. This is not really ideal:
+ * the block might be just about to be evicted, which would be stupid
+ * since we know we are going to need it soon. But the only easy answer
+ * is to bump the usage_count, which does not seem like a great solution:
+ * when the caller does ultimately touch the block, usage_count would get
+ * bumped again, resulting in too much favoritism for blocks that are
+ * involved in a prefetch sequence. A real fix would involve some
+ * additional per-buffer state, and it's not clear that there's enough of
+ * a problem to justify that.
+ */
+#endif /* USE_PREFETCH */
+}
+
/*
* PrefetchBuffer -- initiate asynchronous read of a block of a relation
*
@@ -493,43 +540,12 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
errmsg("cannot access temporary tables of other sessions")));
/* pass it off to localbuf.c */
- LocalPrefetchBuffer(reln->rd_smgr, forkNum, blockNum);
+ PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
}
else
{
- BufferTag newTag; /* identity of requested block */
- uint32 newHash; /* hash value for newTag */
- LWLock *newPartitionLock; /* buffer partition lock for it */
- int buf_id;
-
- /* create a tag so we can lookup the buffer */
- INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
- forkNum, blockNum);
-
- /* determine its hash code and partition lock ID */
- newHash = BufTableHashCode(&newTag);
- newPartitionLock = BufMappingPartitionLock(newHash);
-
- /* see if the block is in the buffer pool already */
- LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
- LWLockRelease(newPartitionLock);
-
- /* If not in buffers, initiate prefetch */
- if (buf_id < 0)
- smgrprefetch(reln->rd_smgr, forkNum, blockNum);
-
- /*
- * If the block *is* in buffers, we do nothing. This is not really
- * ideal: the block might be just about to be evicted, which would be
- * stupid since we know we are going to need it soon. But the only
- * easy answer is to bump the usage_count, which does not seem like a
- * great solution: when the caller does ultimately touch the block,
- * usage_count would get bumped again, resulting in too much
- * favoritism for blocks that are involved in a prefetch sequence. A
- * real fix would involve some additional per-buffer state, and it's
- * not clear that there's enough of a problem to justify that.
- */
+ /* pass it to the shared buffer version */
+ PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
}
#endif /* USE_PREFETCH */
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index cac08e1b1a..b528bc9553 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -54,14 +54,14 @@ static Block GetLocalBufferStorage(void);
/*
- * LocalPrefetchBuffer -
+ * PrefetchLocalBuffer -
* initiate asynchronous read of a block of a relation
*
* Do PrefetchBuffer's work for temporary relations.
* No-op if prefetching isn't compiled in.
*/
void
-LocalPrefetchBuffer(SMgrRelation smgr, ForkNumber forkNum,
+PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum)
{
#ifdef USE_PREFETCH
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index bf3b8ad340..166fe334c7 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -327,7 +327,7 @@ extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
/* localbuf.c */
-extern void LocalPrefetchBuffer(SMgrRelation smgr, ForkNumber forkNum,
+extern void PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum, bool *foundPtr);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d2a5b52f6e..e00dd3ffb7 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -49,6 +49,9 @@ typedef enum
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;
+/* forward declared, to avoid including smgr.h */
+struct SMgrRelationData;
+
/* in globals.c ... this duplicates miscadmin.h */
extern PGDLLIMPORT int NBuffers;
@@ -159,6 +162,9 @@ extern PGDLLIMPORT int32 *LocalRefCount;
/*
* prototypes for functions in bufmgr.c
*/
+extern void PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+ ForkNumber forkNum,
+ BlockNumber blockNum);
extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
--
2.20.1
0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecP-v5.patch
From 02a03ee9767fbb2ef6fc62bdf1e64c0fe24eccfa Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 15:28:08 +1300
Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().
The new name better reflects the fact that the value it returns is
updated only when received data has been flushed to disk. Also rename a
couple of variables relating to this value.
An upcoming patch will make use of the latest data that was written
without waiting for it to be flushed, so let's use more precise function
names.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/access/transam/xlog.c | 20 ++++++++++----------
src/backend/access/transam/xlogfuncs.c | 2 +-
src/backend/replication/README | 2 +-
src/backend/replication/walreceiver.c | 10 +++++-----
src/backend/replication/walreceiverfuncs.c | 12 ++++++------
src/backend/replication/walsender.c | 2 +-
src/include/replication/walreceiver.h | 8 ++++----
7 files changed, 28 insertions(+), 28 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index de2d4ee582..abb227ce66 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -205,8 +205,8 @@ HotStandbyState standbyState = STANDBY_DISABLED;
static XLogRecPtr LastRec;
-/* Local copy of WalRcv->receivedUpto */
-static XLogRecPtr receivedUpto = 0;
+/* Local copy of WalRcv->flushedUpto */
+static XLogRecPtr flushedUpto = 0;
static TimeLineID receiveTLI = 0;
/*
@@ -9288,7 +9288,7 @@ CreateRestartPoint(int flags)
* Retreat _logSegNo using the current end of xlog replayed or received,
* whichever is later.
*/
- receivePtr = GetWalRcvWriteRecPtr(NULL, NULL);
+ receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
replayPtr = GetXLogReplayRecPtr(&replayTLI);
endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
KeepLogSeg(endptr, &_logSegNo);
@@ -11682,7 +11682,7 @@ retry:
/* See if we need to retrieve more data */
if (readFile < 0 ||
(readSource == XLOG_FROM_STREAM &&
- receivedUpto < targetPagePtr + reqLen))
+ flushedUpto < targetPagePtr + reqLen))
{
if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
private->randAccess,
@@ -11713,10 +11713,10 @@ retry:
*/
if (readSource == XLOG_FROM_STREAM)
{
- if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
+ if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
readLen = XLOG_BLCKSZ;
else
- readLen = XLogSegmentOffset(receivedUpto, wal_segment_size) -
+ readLen = XLogSegmentOffset(flushedUpto, wal_segment_size) -
targetPageOff;
}
else
@@ -11952,7 +11952,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
curFileTLI = tli;
RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
PrimarySlotName);
- receivedUpto = 0;
+ flushedUpto = 0;
}
/*
@@ -12132,14 +12132,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* XLogReceiptTime will not advance, so the grace time
* allotted to conflicting queries will decrease.
*/
- if (RecPtr < receivedUpto)
+ if (RecPtr < flushedUpto)
havedata = true;
else
{
XLogRecPtr latestChunkStart;
- receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI);
- if (RecPtr < receivedUpto && receiveTLI == curFileTLI)
+ flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
+ if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
{
havedata = true;
if (latestChunkStart <= RecPtr)
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 20316539b6..e075c1c71b 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -398,7 +398,7 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
{
XLogRecPtr recptr;
- recptr = GetWalRcvWriteRecPtr(NULL, NULL);
+ recptr = GetWalRcvFlushRecPtr(NULL, NULL);
if (recptr == 0)
PG_RETURN_NULL();
diff --git a/src/backend/replication/README b/src/backend/replication/README
index 0cbb990613..8ccdd86e74 100644
--- a/src/backend/replication/README
+++ b/src/backend/replication/README
@@ -54,7 +54,7 @@ and WalRcvData->slotname, and initializes the starting point in
WalRcvData->receiveStart.
As walreceiver receives WAL from the master server, and writes and flushes
-it to disk (in pg_wal), it updates WalRcvData->receivedUpto and signals
+it to disk (in pg_wal), it updates WalRcvData->flushedUpto and signals
the startup process to know how far WAL replay can advance.
Walreceiver sends information about replication progress to the master server
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 25e0333c9e..0bdd0c3074 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -12,7 +12,7 @@
* in the primary server), and then keeps receiving XLOG records and
* writing them to the disk as long as the connection is alive. As XLOG
* records are received and flushed to disk, it updates the
- * WalRcv->receivedUpto variable in shared memory, to inform the startup
+ * WalRcv->flushedUpto variable in shared memory, to inform the startup
* process of how far it can proceed with XLOG replay.
*
* If the primary server ends streaming, but doesn't disconnect, walreceiver
@@ -1006,10 +1006,10 @@ XLogWalRcvFlush(bool dying)
/* Update shared-memory status */
SpinLockAcquire(&walrcv->mutex);
- if (walrcv->receivedUpto < LogstreamResult.Flush)
+ if (walrcv->flushedUpto < LogstreamResult.Flush)
{
- walrcv->latestChunkStart = walrcv->receivedUpto;
- walrcv->receivedUpto = LogstreamResult.Flush;
+ walrcv->latestChunkStart = walrcv->flushedUpto;
+ walrcv->flushedUpto = LogstreamResult.Flush;
walrcv->receivedTLI = ThisTimeLineID;
}
SpinLockRelease(&walrcv->mutex);
@@ -1362,7 +1362,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
state = WalRcv->walRcvState;
receive_start_lsn = WalRcv->receiveStart;
receive_start_tli = WalRcv->receiveStartTLI;
- received_lsn = WalRcv->receivedUpto;
+ received_lsn = WalRcv->flushedUpto;
received_tli = WalRcv->receivedTLI;
last_send_time = WalRcv->lastMsgSendTime;
last_receipt_time = WalRcv->lastMsgReceiptTime;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 89c903e45a..31025f97e3 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -264,11 +264,11 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
/*
* If this is the first startup of walreceiver (on this timeline),
- * initialize receivedUpto and latestChunkStart to the starting point.
+ * initialize flushedUpto and latestChunkStart to the starting point.
*/
if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
{
- walrcv->receivedUpto = recptr;
+ walrcv->flushedUpto = recptr;
walrcv->receivedTLI = tli;
walrcv->latestChunkStart = recptr;
}
@@ -286,7 +286,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
}
/*
- * Returns the last+1 byte position that walreceiver has written.
+ * Returns the last+1 byte position that walreceiver has flushed.
*
* Optionally, returns the previous chunk start, that is the first byte
* written in the most recent walreceiver flush cycle. Callers not
@@ -294,13 +294,13 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
* receiveTLI.
*/
XLogRecPtr
-GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
{
WalRcvData *walrcv = WalRcv;
XLogRecPtr recptr;
SpinLockAcquire(&walrcv->mutex);
- recptr = walrcv->receivedUpto;
+ recptr = walrcv->flushedUpto;
if (latestChunkStart)
*latestChunkStart = walrcv->latestChunkStart;
if (receiveTLI)
@@ -327,7 +327,7 @@ GetReplicationApplyDelay(void)
TimestampTz chunkReplayStartTime;
SpinLockAcquire(&walrcv->mutex);
- receivePtr = walrcv->receivedUpto;
+ receivePtr = walrcv->flushedUpto;
SpinLockRelease(&walrcv->mutex);
replayPtr = GetXLogReplayRecPtr(NULL);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 76ec3c7dd0..928a27dbaf 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2913,7 +2913,7 @@ GetStandbyFlushRecPtr(void)
* has streamed, but hasn't been replayed yet.
*/
- receivePtr = GetWalRcvWriteRecPtr(NULL, &receiveTLI);
+ receivePtr = GetWalRcvFlushRecPtr(NULL, &receiveTLI);
replayPtr = GetXLogReplayRecPtr(&replayTLI);
ThisTimeLineID = replayTLI;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e08afc6548..9ed71139ce 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -74,19 +74,19 @@ typedef struct
TimeLineID receiveStartTLI;
/*
- * receivedUpto-1 is the last byte position that has already been
+ * flushedUpto-1 is the last byte position that has already been
* received, and receivedTLI is the timeline it came from. At the first
* startup of walreceiver, these are set to receiveStart and
* receiveStartTLI. After that, walreceiver updates these whenever it
* flushes the received WAL to disk.
*/
- XLogRecPtr receivedUpto;
+ XLogRecPtr flushedUpto;
TimeLineID receivedTLI;
/*
* latestChunkStart is the starting byte position of the current "batch"
* of received WAL. It's actually the same as the previous value of
- * receivedUpto before the last flush to disk. Startup process can use
+ * flushedUpto before the last flush to disk. Startup process can use
* this to detect whether it's keeping up or not.
*/
XLogRecPtr latestChunkStart;
@@ -322,7 +322,7 @@ extern bool WalRcvStreaming(void);
extern bool WalRcvRunning(void);
extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
const char *conninfo, const char *slotname);
-extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(void);
--
2.20.1
0003-Add-WalRcvGetWriteRecPtr-new-definition-v5.patch
From 1b03eb5ada24c3b23ab8ca6db50e0c5d90d38259 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH 3/5] Add GetWalRcvWriteRecPtr() (new definition).
A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk. To do that, it
needs to be able to see the write pointer advancing in shared memory.
The function formerly bearing this name was recently renamed to
GetWalRcvFlushRecPtr(), which better describes what it does.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/replication/walreceiver.c | 5 +++++
src/backend/replication/walreceiverfuncs.c | 12 ++++++++++++
src/include/replication/walreceiver.h | 10 ++++++++++
3 files changed, 27 insertions(+)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 0bdd0c3074..e250f5583c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -245,6 +245,8 @@ WalReceiverMain(void)
SpinLockRelease(&walrcv->mutex);
+ pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
+
/* Arrange to clean up at walreceiver exit */
on_shmem_exit(WalRcvDie, 0);
@@ -985,6 +987,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
LogstreamResult.Write = recptr;
}
+
+ /* Update shared-memory status */
+ pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
}
/*
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 31025f97e3..96b44e2c88 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -310,6 +310,18 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
return recptr;
}
+/*
+ * Returns the last+1 byte position that walreceiver has written.
+ * This returns a recently written value without taking a lock.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtr(void)
+{
+ WalRcvData *walrcv = WalRcv;
+
+ return pg_atomic_read_u64(&walrcv->writtenUpto);
+}
+
/*
* Returns the replication apply delay in ms or -1
* if the apply delay info is not available
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 9ed71139ce..914e6e3d44 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -16,6 +16,7 @@
#include "access/xlogdefs.h"
#include "getaddrinfo.h" /* for NI_MAXHOST */
#include "pgtime.h"
+#include "port/atomics.h"
#include "replication/logicalproto.h"
#include "replication/walsender.h"
#include "storage/latch.h"
@@ -142,6 +143,14 @@ typedef struct
slock_t mutex; /* locks shared variables shown above */
+ /*
+ * Like flushedUpto, but advanced after writing and before flushing,
+ * without the need to acquire the spin lock. Data can be read by another
+ * process up to this point, but shouldn't be used for data integrity
+ * purposes.
+ */
+ pg_atomic_uint64 writtenUpto;
+
/*
* force walreceiver reply? This doesn't need to be locked; memory
* barriers for ordering are sufficient. But we do need atomic fetch and
@@ -323,6 +332,7 @@ extern bool WalRcvRunning(void);
extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
const char *conninfo, const char *slotname);
extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(void);
extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(void);
--
2.20.1
0004-Allow-PrefetchBuffer-to-report-what-happened-v5.patch
From c62fde23f70ff06833d743a1c85716e15f3c813c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 17:26:41 +1300
Subject: [PATCH 4/5] Allow PrefetchBuffer() to report what happened.
Report whether a prefetch was actually initiated due to a cache miss, so
that callers can limit the number of concurrent I/Os they try to issue,
without counting the prefetch calls that did nothing because the page
was already in our buffers.
If the requested block was already cached, return a valid buffer. This
might enable future code to avoid a buffer mapping lookup, though it
will need to recheck the buffer before using it because it's not pinned
so could be reclaimed at any time.
Report neither hit nor miss when a relation's backing file is missing,
to prepare for use during recovery. This will be used to handle cases
of relations that are referenced in the WAL but have been unlinked
already due to actions covered by WAL records that haven't been replayed
yet, after a crash.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/storage/buffer/bufmgr.c | 38 +++++++++++++++++++++++----
src/backend/storage/buffer/localbuf.c | 17 ++++++++----
src/backend/storage/smgr/md.c | 9 +++++--
src/backend/storage/smgr/smgr.c | 10 ++++---
src/include/storage/buf_internals.h | 5 ++--
src/include/storage/bufmgr.h | 19 ++++++++++----
src/include/storage/md.h | 2 +-
src/include/storage/smgr.h | 2 +-
8 files changed, 78 insertions(+), 24 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d30aed6fd9..4ceb40a856 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -469,11 +469,13 @@ static int ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
/*
* Implementation of PrefetchBuffer() for shared buffers.
*/
-void
+PrefetchBufferResult
PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
ForkNumber forkNum,
BlockNumber blockNum)
{
+ PrefetchBufferResult result = { InvalidBuffer, false };
+
#ifdef USE_PREFETCH
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
@@ -497,7 +499,23 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
/* If not in buffers, initiate prefetch */
if (buf_id < 0)
- smgrprefetch(smgr_reln, forkNum, blockNum);
+ {
+ /*
+ * Try to initiate an asynchronous read. This returns false in
+ * recovery if the relation file doesn't exist.
+ */
+ if (smgrprefetch(smgr_reln, forkNum, blockNum))
+ result.initiated_io = true;
+ }
+ else
+ {
+ /*
+ * Report the buffer it was in at that time. The caller may be able
+ * to avoid a buffer table lookup, but it's not pinned and it must be
+ * rechecked!
+ */
+ result.buffer = buf_id + 1;
+ }
/*
* If the block *is* in buffers, we do nothing. This is not really ideal:
@@ -511,6 +529,8 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
* a problem to justify that.
*/
#endif /* USE_PREFETCH */
+
+ return result;
}
/*
@@ -520,8 +540,12 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
* buffer. Instead it tries to ensure that a future ReadBuffer for the given
* block will not be delayed by the I/O. Prefetching is optional.
* No-op if prefetching isn't compiled in.
+ *
+ * If the block is already cached, the result includes a valid buffer that can
+ * be used by the caller to avoid the need for a later buffer lookup, but it's
+ * not pinned, so the caller must recheck it.
*/
-void
+PrefetchBufferResult
PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
{
#ifdef USE_PREFETCH
@@ -540,13 +564,17 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
errmsg("cannot access temporary tables of other sessions")));
/* pass it off to localbuf.c */
- PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
+ return PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
}
else
{
/* pass it to the shared buffer version */
- PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
+ return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
}
+#else
+ PrefetchBufferResult result = { InvalidBuffer, false };
+
+ return result;
#endif /* USE_PREFETCH */
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index b528bc9553..18a8614e9b 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -60,10 +60,12 @@ static Block GetLocalBufferStorage(void);
* Do PrefetchBuffer's work for temporary relations.
* No-op if prefetching isn't compiled in.
*/
-void
+PrefetchBufferResult
PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum)
{
+ PrefetchBufferResult result = { InvalidBuffer, false };
+
#ifdef USE_PREFETCH
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
@@ -81,12 +83,17 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
if (hresult)
{
/* Yes, so nothing to do */
- return;
+ result.buffer = -hresult->id - 1;
+ }
+ else
+ {
+ /* Not in buffers, so initiate prefetch */
+ smgrprefetch(smgr, forkNum, blockNum);
+ result.initiated_io = true;
}
-
- /* Not in buffers, so initiate prefetch */
- smgrprefetch(smgr, forkNum, blockNum);
#endif /* USE_PREFETCH */
+
+ return result;
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index c5b771c531..ba12fc2077 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -525,14 +525,17 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
/*
* mdprefetch() -- Initiate asynchronous read of the specified block of a relation
*/
-void
+bool
mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
{
#ifdef USE_PREFETCH
off_t seekpos;
MdfdVec *v;
- v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
+ if (v == NULL)
+ return false;
seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -540,6 +543,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
#endif /* USE_PREFETCH */
+
+ return true;
}
/*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..c39dd533e6 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -49,7 +49,7 @@ typedef struct f_smgr
bool isRedo);
void (*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
- void (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
+ bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
@@ -489,11 +489,15 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
/*
* smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
+ *
+ * In recovery only, this can return false to indicate that a file
+ * doesn't exist (presumably it has been dropped by a later WAL
+ * record).
*/
-void
+bool
smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
{
- smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
+ return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
}
/*
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 166fe334c7..e57f84ee9c 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -327,8 +327,9 @@ extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
/* localbuf.c */
-extern void PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
- BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
+ ForkNumber forkNum,
+ BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum, bool *foundPtr);
extern void MarkLocalBufferDirty(Buffer buffer);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index e00dd3ffb7..64b643569f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -46,6 +46,15 @@ typedef enum
* replay; otherwise same as RBM_NORMAL */
} ReadBufferMode;
+/*
+ * Type returned by PrefetchBuffer().
+ */
+typedef struct PrefetchBufferResult
+{
+ Buffer buffer; /* If valid, a hit (recheck needed!) */
+ bool initiated_io; /* If true, a miss resulting in async I/O */
+} PrefetchBufferResult;
+
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;
@@ -162,11 +171,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
/*
* prototypes for functions in bufmgr.c
*/
-extern void PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
- ForkNumber forkNum,
- BlockNumber blockNum);
-extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
- BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+ ForkNumber forkNum,
+ BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchBuffer(Relation reln, ForkNumber forkNum,
+ BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index ec7630ce3b..07fd1bb7d0 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -28,7 +28,7 @@ extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
extern void mdextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 243822137c..dc740443e2 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,7 +92,7 @@ extern void smgrdounlink(SMgrRelation reln, bool isRedo);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
--
2.20.1
0005-Prefetch-referenced-blocks-during-recovery-v5.patch
From 42ba0a89260d46230ac0df791fae18bfdca0092f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 18 Mar 2020 16:35:27 +1300
Subject: [PATCH 5/5] Prefetch referenced blocks during recovery.
Introduce a new GUC max_wal_prefetch_distance. If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks. The
goal is to avoid I/O stalls and benefit from concurrent I/O. The number
of concurrent asynchronous reads is capped by the existing
maintenance_io_concurrency GUC. The feature is disabled by default.
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
doc/src/sgml/config.sgml | 38 ++
doc/src/sgml/monitoring.sgml | 69 ++
doc/src/sgml/wal.sgml | 12 +
src/backend/access/transam/Makefile | 1 +
src/backend/access/transam/xlog.c | 64 ++
src/backend/access/transam/xlogprefetcher.c | 663 ++++++++++++++++++++
src/backend/access/transam/xlogutils.c | 23 +-
src/backend/catalog/system_views.sql | 11 +
src/backend/replication/logical/logical.c | 2 +-
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/utils/misc/guc.c | 38 +-
src/include/access/xlog.h | 4 +
src/include/access/xlogprefetcher.h | 28 +
src/include/access/xlogutils.h | 20 +
src/include/catalog/pg_proc.dat | 8 +
src/include/utils/guc.h | 2 +
src/test/regress/expected/rules.out | 8 +
18 files changed, 992 insertions(+), 4 deletions(-)
create mode 100644 src/backend/access/transam/xlogprefetcher.c
create mode 100644 src/include/access/xlogprefetcher.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 672bf6f1ee..8249ec0139 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3102,6 +3102,44 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-max-wal-prefetch-distance" xreflabel="max_wal_prefetch_distance">
+ <term><varname>max_wal_prefetch_distance</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>max_wal_prefetch_distance</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ The maximum distance to look ahead in the WAL during recovery, to find
+ blocks to prefetch. Prefetching blocks that will soon be needed can
+ reduce I/O wait times. The number of concurrent prefetches is limited
+ by this setting as well as <xref linkend="guc-maintenance-io-concurrency"/>.
+ If this value is specified without units, it is taken as bytes.
+ The default is -1, meaning that WAL prefetching is disabled.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw">
+ <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to prefetch blocks with full page images during recovery.
+ Usually this doesn't help, since such blocks will not be read. However,
+ on file systems with a block size larger than
+ <productname>PostgreSQL</productname>'s, prefetching can avoid a costly
+ read-before-write when those blocks are later written.
+ This setting has no effect unless
+ <xref linkend="guc-max-wal-prefetch-distance"/> is set to a positive number.
+ The default is off.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
<sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 987580d6df..df4291092b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_wal_prefetcher</structname><indexterm><primary>pg_stat_wal_prefetcher</primary></indexterm></entry>
+ <entry>Only one row, showing statistics about blocks prefetched during recovery.
+ See <xref linkend="pg-stat-wal-prefetcher-view"/> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
<entry>At least one row per subscription, showing information about
@@ -2192,6 +2199,68 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
connected server.
</para>
+ <table id="pg-stat-wal-prefetcher-view" xreflabel="pg_stat_wal_prefetcher">
+ <title><structname>pg_stat_wal_prefetcher</structname> View</title>
+ <tgroup cols="3">
+ <thead>
+ <row>
+ <entry>Column</entry>
+ <entry>Type</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><structfield>prefetch</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_hit</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_new</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_fpw</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-wal-prefetch-fpw"/> was set to <literal>off</literal></entry>
+ </row>
+ <row>
+ <entry><structfield>skip_seq</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+ </row>
+ <row>
+ <entry><structfield>distance</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry>
+ </row>
+ <row>
+ <entry><structfield>queue_depth</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ The <structname>pg_stat_wal_prefetcher</structname> view will contain only
+ one row. It is filled with nulls if recovery is not running or WAL
+ prefetching is not enabled. See <xref linkend="guc-max-wal-prefetch-distance"/>
+ for more information. The counters in this view are reset whenever the
+ <xref linkend="guc-max-wal-prefetch-distance"/>,
+ <xref linkend="guc-wal-prefetch-fpw"/> or
+ <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+ the server configuration is reloaded.
+ </para>
+
<table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
<title><structname>pg_stat_subscription</structname> View</title>
<tgroup cols="3">
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index bd9fae544c..9e956ad2a1 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -719,6 +719,18 @@
<acronym>WAL</acronym> call being logged to the server log. This
option might be replaced by a more general mechanism in the future.
</para>
+
+ <para>
+ The <xref linkend="guc-max-wal-prefetch-distance"/> parameter can be
+ used to improve I/O performance during recovery by instructing
+ <productname>PostgreSQL</productname> to initiate reads
+ of disk blocks that will soon be needed, in combination with the
+ <xref linkend="guc-maintenance-io-concurrency"/> parameter. The
+ prefetching mechanism is most likely to be effective on systems
+ with <varname>full_page_writes</varname> set to
+ <varname>off</varname> (where that is safe), and where the working
+ set is larger than RAM. By default, WAL prefetching is disabled.
+ </para>
</sect1>
<sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..20e044c7c8 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
xlogarchive.o \
xlogfuncs.o \
xloginsert.o \
+ xlogprefetcher.o \
xlogreader.o \
xlogutils.o
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index abb227ce66..85f36ef6f4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -34,6 +34,7 @@
#include "access/xact.h"
#include "access/xlog_internal.h"
#include "access/xloginsert.h"
+#include "access/xlogprefetcher.h"
#include "access/xlogreader.h"
#include "access/xlogutils.h"
#include "catalog/catversion.h"
@@ -105,6 +106,8 @@ int wal_level = WAL_LEVEL_MINIMAL;
int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
int wal_retrieve_retry_interval = 5000;
+int max_wal_prefetch_distance = -1;
+bool wal_prefetch_fpw = false;
#ifdef WAL_DEBUG
bool XLOG_DEBUG = false;
@@ -806,6 +809,7 @@ static XLogSource readSource = XLOG_FROM_ANY;
*/
static XLogSource currentSource = XLOG_FROM_ANY;
static bool lastSourceFailed = false;
+static bool reset_wal_prefetcher = false;
typedef struct XLogPageReadPrivate
{
@@ -6213,6 +6217,7 @@ CheckRequiredParameterValues(void)
}
}
+
/*
* This must be called ONCE during postmaster or standalone-backend startup
*/
@@ -7069,6 +7074,7 @@ StartupXLOG(void)
{
ErrorContextCallback errcallback;
TimestampTz xtime;
+ XLogPrefetcher *prefetcher = NULL;
InRedo = true;
@@ -7076,6 +7082,9 @@ StartupXLOG(void)
(errmsg("redo starts at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
+ /* the first time through, see if we need to enable prefetching */
+ ResetWalPrefetcher();
+
/*
* main redo apply loop
*/
@@ -7105,6 +7114,31 @@ StartupXLOG(void)
/* Handle interrupt signals of startup process */
HandleStartupProcInterrupts();
+ /*
+ * The first time through, or if any relevant setting or the
+ * WAL source changes, we'll restart the prefetching machinery
+ * as appropriate. This is simpler than trying to handle
+ * various complicated state changes.
+ */
+ if (unlikely(reset_wal_prefetcher))
+ {
+ /* If we had one already, destroy it. */
+ if (prefetcher)
+ {
+ XLogPrefetcherFree(prefetcher);
+ prefetcher = NULL;
+ }
+ /* If we want one, create it. */
+ if (max_wal_prefetch_distance > 0)
+ prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr,
+ currentSource == XLOG_FROM_STREAM);
+ reset_wal_prefetcher = false;
+ }
+
+ /* Perform WAL prefetching, if enabled. */
+ if (prefetcher)
+ XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr);
+
/*
* Pause WAL replay, if requested by a hot-standby session via
* SetRecoveryPause().
@@ -7292,6 +7326,8 @@ StartupXLOG(void)
/*
* end of main redo apply loop
*/
+ if (prefetcher)
+ XLogPrefetcherFree(prefetcher);
if (reachedRecoveryTarget)
{
@@ -10155,6 +10191,24 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
}
}
+void
+assign_max_wal_prefetch_distance(int new_value, void *extra)
+{
+ /* Reset the WAL prefetcher, because a setting it depends on changed. */
+ max_wal_prefetch_distance = new_value;
+ if (AmStartupProcess())
+ ResetWalPrefetcher();
+}
+
+void
+assign_wal_prefetch_fpw(bool new_value, void *extra)
+{
+ /* Reset the WAL prefetcher, because a setting it depends on changed. */
+ wal_prefetch_fpw = new_value;
+ if (AmStartupProcess())
+ ResetWalPrefetcher();
+}
+
/*
* Issue appropriate kind of fsync (if any) for an XLOG output file.
@@ -11961,6 +12015,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* and move on to the next state.
*/
currentSource = XLOG_FROM_STREAM;
+ ResetWalPrefetcher();
break;
case XLOG_FROM_STREAM:
@@ -12390,3 +12445,12 @@ XLogRequestWalReceiverReply(void)
{
doRequestWalReceiverReply = true;
}
+
+/*
+ * Schedule a WAL prefetcher reset, on change of relevant settings.
+ */
+void
+ResetWalPrefetcher(void)
+{
+ reset_wal_prefetcher = true;
+}
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
new file mode 100644
index 0000000000..715552b428
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -0,0 +1,663 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ * Prefetching support for PostgreSQL write-ahead log manager
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogprefetcher.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetcher.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log message
+ * that appears at the end of crash recovery.
+ */
+#define XLOGPREFETCHER_MONITORING_SAMPLE_STEP 32768
+
+/*
+ * Internal state used for book-keeping.
+ */
+struct XLogPrefetcher
+{
+ /* Reader and current reading state. */
+ XLogReaderState *reader;
+ XLogReadLocalOptions options;
+ bool have_record;
+ bool shutdown;
+ int next_block_id;
+
+ /* Book-keeping required to avoid accessing non-existing blocks. */
+ HTAB *filter_table;
+ dlist_head filter_queue;
+
+ /* Book-keeping required to limit concurrent prefetches. */
+ XLogRecPtr *prefetch_queue;
+ int prefetch_queue_size;
+ int prefetch_head;
+ int prefetch_tail;
+
+ /* Details of last prefetch to skip repeats and seq scans. */
+ SMgrRelation last_reln;
+ RelFileNode last_rnode;
+ BlockNumber last_blkno;
+
+ /* Counters used to compute avg_queue_depth and avg_distance. */
+ double samples;
+ double queue_depth_sum;
+ double distance_sum;
+ XLogRecPtr next_sample_lsn;
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+ RelFileNode rnode;
+ XLogRecPtr filter_until_replayed;
+ BlockNumber filter_from_block;
+ dlist_node link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory just for the benefit of monitoring
+ * functions.
+ */
+typedef struct XLogPrefetcherMonitoringStats
+{
+ pg_atomic_uint64 prefetch; /* Prefetches initiated. */
+ pg_atomic_uint64 skip_hit; /* Blocks already buffered. */
+ pg_atomic_uint64 skip_new; /* New/missing blocks filtered. */
+ pg_atomic_uint64 skip_fpw; /* FPWs skipped. */
+ pg_atomic_uint64 skip_seq; /* Sequential/repeat blocks skipped. */
+ int distance; /* Number of bytes ahead in the WAL. */
+ int queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetcherMonitoringStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+ RelFileNode rnode,
+ BlockNumber blockno,
+ XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+ RelFileNode rnode,
+ BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+
+/*
+ * On modern systems this is really just *counter++. On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically. The counters will only be written to by one process, and there
+ * is no ordering requirement, so there's no point in using higher overhead
+ * pg_atomic_fetch_add_u64().
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)
+{
+ pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
+
+static XLogPrefetcherMonitoringStats *MonitoringStats;
+
+size_t
+XLogPrefetcherShmemSize(void)
+{
+ return sizeof(XLogPrefetcherMonitoringStats);
+}
+
+static void
+XLogPrefetcherResetMonitoringStats(void)
+{
+ pg_atomic_init_u64(&MonitoringStats->prefetch, 0);
+ pg_atomic_init_u64(&MonitoringStats->skip_hit, 0);
+ pg_atomic_init_u64(&MonitoringStats->skip_new, 0);
+ pg_atomic_init_u64(&MonitoringStats->skip_fpw, 0);
+ pg_atomic_init_u64(&MonitoringStats->skip_seq, 0);
+ MonitoringStats->distance = -1;
+ MonitoringStats->queue_depth = 0;
+}
+
+void
+XLogPrefetcherShmemInit(void)
+{
+ bool found;
+
+ MonitoringStats = (XLogPrefetcherMonitoringStats *)
+ ShmemInitStruct("XLogPrefetcherMonitoringStats",
+ sizeof(XLogPrefetcherMonitoringStats),
+ &found);
+ if (!found)
+ XLogPrefetcherResetMonitoringStats();
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
+{
+ static HASHCTL hash_table_ctl = {
+ .keysize = sizeof(RelFileNode),
+ .entrysize = sizeof(XLogPrefetcherFilter)
+ };
+ XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher));
+
+ prefetcher->options.nowait = true;
+ if (streaming)
+ {
+ /*
+ * We're only allowed to read as far as the WAL receiver has written.
+ * We don't have to wait for it to be flushed, though, as recovery
+ * does, so that gives us a chance to get a bit further ahead.
+ */
+ prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+ }
+ else
+ {
+ /* We're allowed to read as far as we can. */
+ prefetcher->options.read_upto_policy = XLRO_LSN;
+ prefetcher->options.lsn = (XLogRecPtr) -1;
+ }
+ prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+ NULL,
+ read_local_xlog_page,
+ &prefetcher->options);
+ prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024,
+ &hash_table_ctl,
+ HASH_ELEM | HASH_BLOBS);
+ dlist_init(&prefetcher->filter_queue);
+
+ /*
+ * The size of the queue is based on the maintenance_io_concurrency
+ * setting. In theory we might have a separate queue for each tablespace,
+ * but it's not clear how that should work, so for now we'll just use the
+ * general GUC to rate-limit all prefetching.
+ */
+ prefetcher->prefetch_queue_size = maintenance_io_concurrency;
+ prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);
+ prefetcher->prefetch_head = prefetcher->prefetch_tail = 0;
+
+ /* Prepare to read at the given LSN. */
+ ereport(LOG,
+ (errmsg("WAL prefetch started at %X/%X",
(uint32) (lsn >> 32), (uint32) lsn)));
+ XLogBeginRead(prefetcher->reader, lsn);
+
+ XLogPrefetcherResetMonitoringStats();
+
+ return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+ double avg_distance = 0;
+ double avg_queue_depth = 0;
+
+ /* Log final statistics. */
+ if (prefetcher->samples > 0)
+ {
+ avg_distance = prefetcher->distance_sum / prefetcher->samples;
+ avg_queue_depth = prefetcher->queue_depth_sum / prefetcher->samples;
+ }
+ ereport(LOG,
+ (errmsg("WAL prefetch finished at %X/%X; "
+ "prefetch = " UINT64_FORMAT ", "
+ "skip_hit = " UINT64_FORMAT ", "
+ "skip_new = " UINT64_FORMAT ", "
+ "skip_fpw = " UINT64_FORMAT ", "
+ "skip_seq = " UINT64_FORMAT ", "
+ "avg_distance = %f, "
+ "avg_queue_depth = %f",
(uint32) (prefetcher->reader->EndRecPtr >> 32),
+ (uint32) (prefetcher->reader->EndRecPtr),
+ pg_atomic_read_u64(&MonitoringStats->prefetch),
+ pg_atomic_read_u64(&MonitoringStats->skip_hit),
+ pg_atomic_read_u64(&MonitoringStats->skip_new),
+ pg_atomic_read_u64(&MonitoringStats->skip_fpw),
+ pg_atomic_read_u64(&MonitoringStats->skip_seq),
+ avg_distance,
+ avg_queue_depth)));
+ XLogReaderFree(prefetcher->reader);
+ hash_destroy(prefetcher->filter_table);
+ pfree(prefetcher->prefetch_queue);
+ pfree(prefetcher);
+
+ XLogPrefetcherResetMonitoringStats();
+}
+
+/*
+ * Read ahead in the WAL, as far as we can within the limits set by the user.
+ * Begin fetching any referenced blocks that are not already in the buffer
+ * pool.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ /* If an error has occurred or we've hit the end of the WAL, do nothing. */
+ if (prefetcher->shutdown)
+ return;
+
+ /*
+ * Have any in-flight prefetches definitely completed, judging by the LSN
+ * that is currently being replayed?
+ */
+ XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+ /*
+ * Do we already have the maximum permitted number of I/Os running
+ * (according to the information we have)? If so, we have to wait for at
+ * least one to complete, so give up early.
+ */
+ if (XLogPrefetcherSaturated(prefetcher))
+ return;
+
+ /* Can we drop any filters yet, due to problem records begin replayed? */
+ XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+ /* Main prefetch loop. */
+ for (;;)
+ {
+ XLogReaderState *reader = prefetcher->reader;
+ char *error;
+ int64 distance;
+
+ /* If we don't already have a record, then try to read one. */
+ if (!prefetcher->have_record)
+ {
+ if (!XLogReadRecord(reader, &error))
+ {
+ /* If we got an error, log it and give up. */
+ if (error)
+ {
+ ereport(LOG, (errmsg("WAL prefetch error: %s", error)));
+ prefetcher->shutdown = true;
+ }
+ /* Otherwise, we'll try again later when more data is here. */
+ return;
+ }
+ prefetcher->have_record = true;
+ prefetcher->next_block_id = 0;
+ }
+
+ /* How far ahead of replay are we now? */
+ distance = prefetcher->reader->ReadRecPtr - replaying_lsn;
+
+ /* Update distance shown in shm. */
+ MonitoringStats->distance = distance;
+
+ /* Sample the averages so we can log them at end of recovery. */
+ if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+ {
+ prefetcher->distance_sum += MonitoringStats->distance;
+ prefetcher->queue_depth_sum += MonitoringStats->queue_depth;
+ prefetcher->samples += 1.0;
+ prefetcher->next_sample_lsn =
+ replaying_lsn + XLOGPREFETCHER_MONITORING_SAMPLE_STEP;
+ }
+
+ /* Are we too far ahead of replay? */
+ if (distance >= max_wal_prefetch_distance)
+ break;
+
+ /*
+ * If this is a record that creates a new SMGR relation, we'll avoid
+ * prefetching anything from that rnode until it has been replayed.
+ */
+ if (replaying_lsn < reader->ReadRecPtr &&
+ XLogRecGetRmid(reader) == RM_SMGR_ID &&
+ (XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+ XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+ reader->ReadRecPtr);
+ }
+
+ /*
+ * Scan the record for block references. We might already have been
+ * partway through processing this record when we hit maximum I/O
+ * concurrency, so start where we left off.
+ */
+ for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i)
+ {
+ PrefetchBufferResult prefetch;
+ DecodedBkpBlock *block = &reader->blocks[i];
+ SMgrRelation reln;
+
+ /* Ignore everything but the main fork for now. */
+ if (block->forknum != MAIN_FORKNUM)
+ continue;
+
+ /*
+ * If there is a full page image attached, we won't be reading the
+ * page, so you might think we should skip it. However, if the
+ * underlying filesystem uses larger logical blocks than us, it
+ * might still need to perform a read-before-write some time later.
+ * Therefore, only prefetch if configured to do so.
+ */
+ if (block->has_image && !wal_prefetch_fpw)
+ {
+ inc_counter(&MonitoringStats->skip_fpw);
+ continue;
+ }
+
+ /*
+ * If this block will initialize a new page then it's probably an
+ * extension. Since it might create a new segment, we can't try
+ * to prefetch this block until the record has been replayed, or we
+ * might try to open a file that doesn't exist yet.
+ */
+ if (block->flags & BKPBLOCK_WILL_INIT)
+ {
+ XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+ reader->ReadRecPtr);
+ inc_counter(&MonitoringStats->skip_new);
+ continue;
+ }
+
+ /* Should we skip this block due to a filter? */
+ if (XLogPrefetcherIsFiltered(prefetcher, block->rnode,
+ block->blkno))
+ {
+ inc_counter(&MonitoringStats->skip_new);
+ continue;
+ }
+
+ /* Fast path for repeated references to the same relation. */
+ if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+ {
+ /*
+ * If this is a repeat or sequential access, then skip it. We
+ * expect the kernel to detect sequential access on its own
+ * and do a better job than we could.
+ */
+ if (block->blkno == prefetcher->last_blkno ||
+ block->blkno == prefetcher->last_blkno + 1)
+ {
+ prefetcher->last_blkno = block->blkno;
+ inc_counter(&MonitoringStats->skip_seq);
+ continue;
+ }
+
+ /* We can avoid calling smgropen(). */
+ reln = prefetcher->last_reln;
+ }
+ else
+ {
+ /* Otherwise we have to open it. */
+ reln = smgropen(block->rnode, InvalidBackendId);
+ prefetcher->last_rnode = block->rnode;
+ prefetcher->last_reln = reln;
+ }
+ prefetcher->last_blkno = block->blkno;
+
+ /* Try to prefetch this block! */
+ prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+ if (BufferIsValid(prefetch.buffer))
+ {
+ /*
+ * It was already cached, so do nothing. Perhaps in future we
+ * could remember the buffer so that recovery doesn't have to
+ * look it up again.
+ */
+ inc_counter(&MonitoringStats->skip_hit);
+ }
+ else if (prefetch.initiated_io)
+ {
+ /*
+ * I/O has possibly been initiated (though we don't know if it
+ * was already cached by the kernel, so we just have to assume
+ * that it has due to lack of better information). Record
+ * this as an I/O in progress until eventually we replay this
+ * LSN.
+ */
+ inc_counter(&MonitoringStats->prefetch);
+ XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+ /*
+ * If the queue is now full, we'll have to wait before
+ * processing any more blocks from this record.
+ */
+ if (XLogPrefetcherSaturated(prefetcher))
+ {
+ prefetcher->next_block_id = i + 1;
+ return;
+ }
+ }
+ else
+ {
+ /*
+ * Neither cached nor initiated. The underlying segment file
+ * doesn't exist. Presumably it will be unlinked by a later
+ * WAL record. When recovery reads this block, it will use the
+ * EXTENSION_CREATE_RECOVERY flag. We certainly don't want to
+ * do that sort of thing while merely prefetching, so let's
+ * just ignore references to this relation until this record is
+ * replayed, and let recovery create the dummy file or complain
+ * if something is wrong.
+ */
+ XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+ reader->ReadRecPtr);
+ inc_counter(&MonitoringStats->skip_new);
+ }
+ }
+
+ /* Advance to the next record. */
+ prefetcher->have_record = false;
+ }
+}
+
+/*
+ * Expose statistics about WAL prefetching.
+ */
+Datum
+pg_stat_get_wal_prefetcher(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_WAL_PREFETCHER_COLS 7
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ Datum values[PG_STAT_GET_WAL_PREFETCHER_COLS];
+ bool nulls[PG_STAT_GET_WAL_PREFETCHER_COLS];
+
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("materialize mode required, but it is not allowed in this context")));
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ if (MonitoringStats->distance < 0)
+ {
+ for (int i = 0; i < PG_STAT_GET_WAL_PREFETCHER_COLS; ++i)
+ nulls[i] = true;
+ }
+ else
+ {
+ for (int i = 0; i < PG_STAT_GET_WAL_PREFETCHER_COLS; ++i)
+ nulls[i] = false;
+ values[0] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->prefetch));
+ values[1] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_hit));
+ values[2] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_new));
+ values[3] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_fpw));
+ values[4] = Int64GetDatum(pg_atomic_read_u64(&MonitoringStats->skip_seq));
+ values[5] = Int32GetDatum(MonitoringStats->distance);
+ values[6] = Int32GetDatum(MonitoringStats->queue_depth);
+ }
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+ BlockNumber blockno, XLogRecPtr lsn)
+{
+ XLogPrefetcherFilter *filter;
+ bool found;
+
+ filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+ if (!found)
+ {
+ /*
+ * Don't allow any prefetching of this block or higher until replayed.
+ */
+ filter->filter_until_replayed = lsn;
+ filter->filter_from_block = blockno;
+ dlist_push_head(&prefetcher->filter_queue, &filter->link);
+ }
+ else
+ {
+ /*
+ * We were already filtering this rnode. Extend the filter's lifetime
+ * to cover this WAL record, but leave the (presumably lower) block
+ * number there because we don't want to have to track individual
+ * blocks.
+ */
+ filter->filter_until_replayed = lsn;
+ dlist_delete(&filter->link);
+ dlist_push_head(&prefetcher->filter_queue, &filter->link);
+ }
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range? That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+ {
+ XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+ link,
+ &prefetcher->filter_queue);
+
+ if (filter->filter_until_replayed >= replaying_lsn)
+ break;
+ dlist_delete(&filter->link);
+ hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+ }
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+ BlockNumber blockno)
+{
+ /*
+ * Test for empty queue first, because we expect it to be empty most of the
+ * time and we can avoid the hash table lookup in that case.
+ */
+ if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+ {
+ XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+ HASH_FIND, NULL);
+
+ if (filter && filter->filter_from_block <= blockno)
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Insert an LSN into the queue. The queue must not be full already. This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr prefetching_lsn)
+{
+ Assert(!XLogPrefetcherSaturated(prefetcher));
+ prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+ prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+ MonitoringStats->queue_depth++;
+ Assert(MonitoringStats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet? That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches. For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+ prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+ {
+ prefetcher->prefetch_tail++;
+ prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+ MonitoringStats->queue_depth--;
+ Assert(MonitoringStats->queue_depth >= 0);
+ }
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+ return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+ prefetcher->prefetch_tail;
+}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b217ffa52f..fad2acb514 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walreceiver.h"
#include "storage/smgr.h"
#include "utils/guc.h"
#include "utils/hsearch.h"
@@ -827,6 +828,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
TimeLineID tli;
int count;
WALReadError errinfo;
+ XLogReadLocalOptions *options = (XLogReadLocalOptions *) state->private_data;
loc = targetPagePtr + reqLen;
@@ -841,7 +843,23 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
* notices recovery finishes, so we only have to maintain it for the
* local process until recovery ends.
*/
- if (!RecoveryInProgress())
+ if (options)
+ {
+ switch (options->read_upto_policy)
+ {
+ case XLRO_WALRCV_WRITTEN:
+ read_upto = GetWalRcvWriteRecPtr();
+ break;
+ case XLRO_LSN:
+ read_upto = options->lsn;
+ break;
+ default:
+ read_upto = 0;
+ elog(ERROR, "unknown read_upto_policy value");
+ break;
+ }
+ }
+ else if (!RecoveryInProgress())
read_upto = GetFlushRecPtr();
else
read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
@@ -879,6 +897,9 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
if (loc <= read_upto)
break;
+ if (options && options->nowait)
+ break;
+
CHECK_FOR_INTERRUPTS();
pg_usleep(1000L);
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b8a3f46912..7b27ac4805 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -811,6 +811,17 @@ CREATE VIEW pg_stat_wal_receiver AS
FROM pg_stat_get_wal_receiver() s
WHERE s.pid IS NOT NULL;
+CREATE VIEW pg_stat_wal_prefetcher AS
+ SELECT
+ s.prefetch,
+ s.skip_hit,
+ s.skip_new,
+ s.skip_fpw,
+ s.skip_seq,
+ s.distance,
+ s.queue_depth
+ FROM pg_stat_get_wal_prefetcher() s;
+
CREATE VIEW pg_stat_subscription AS
SELECT
su.oid AS subid,
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5adf253583..792d90ef4c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -169,7 +169,7 @@ StartupDecodingContext(List *output_plugin_options,
ctx->slot = slot;
- ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, ctx);
+ ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, NULL);
if (!ctx->reader)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4ceb40a856..4fc391a6e4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -572,7 +572,7 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
}
#else
- PrefetchBuffer result = { InvalidBuffer, false };
+ PrefetchBufferResult result = { InvalidBuffer, false };
return result;
#endif /* USE_PREFETCH */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..5ca98b8886 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
#include "access/nbtree.h"
#include "access/subtrans.h"
#include "access/twophase.h"
+#include "access/xlogprefetcher.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -124,6 +125,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, PredicateLockShmemSize());
size = add_size(size, ProcGlobalShmemSize());
size = add_size(size, XLOGShmemSize());
+ size = add_size(size, XLogPrefetcherShmemSize());
size = add_size(size, CLOGShmemSize());
size = add_size(size, CommitTsShmemSize());
size = add_size(size, SUBTRANSShmemSize());
@@ -212,6 +214,7 @@ CreateSharedMemoryAndSemaphores(void)
* Set up xlog, clog, and buffers
*/
XLOGShmemInit();
+ XLogPrefetcherShmemInit();
CLOGShmemInit();
CommitTsShmemInit();
SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 68082315ac..a2a9f62160 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -197,6 +197,7 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
static void assign_pgstat_temp_directory(const char *newval, void *extra);
static bool check_application_name(char **newval, void **extra, GucSource source);
static void assign_application_name(const char *newval, void *extra);
@@ -1241,6 +1242,18 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"wal_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Prefetch blocks that have full page images in the WAL"),
+ gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+ "entirely overwritten, but if the logical page size of the filesystem is "
+ "larger than PostgreSQL's, this can be beneficial. This option has no "
+ "effect unless max_wal_prefetch_distance is set to a positive number.")
+ },
+ &wal_prefetch_fpw,
+ false,
+ NULL, assign_wal_prefetch_fpw, NULL
+ },
{
{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2627,6 +2640,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"max_wal_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+ gettext_noop("Maximum number of bytes to read ahead in the WAL to prefetch referenced blocks."),
+ gettext_noop("Set to -1 to disable WAL prefetching."),
+ GUC_UNIT_BYTE
+ },
+ &max_wal_prefetch_distance,
+ -1, -1, INT_MAX,
+ NULL, assign_max_wal_prefetch_distance, NULL
+ },
+
{
{"wal_keep_segments", PGC_SIGHUP, REPLICATION_SENDING,
gettext_noop("Sets the number of WAL files held for standby servers."),
@@ -2900,7 +2924,8 @@ static struct config_int ConfigureNamesInt[] =
0,
#endif
0, MAX_IO_CONCURRENCY,
- check_maintenance_io_concurrency, NULL, NULL
+ check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+ NULL
},
{
@@ -11498,6 +11523,17 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
return true;
}
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+ /* Reset the WAL prefetcher, because a setting it depends on changed. */
+ maintenance_io_concurrency = newval;
+ if (AmStartupProcess())
+ ResetWalPrefetcher();
+#endif
+}
+
static void
assign_pgstat_temp_directory(const char *newval, void *extra)
{
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033fc20..82829d7854 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -111,6 +111,8 @@ extern int wal_keep_segments;
extern int XLOGbuffers;
extern int XLogArchiveTimeout;
extern int wal_retrieve_retry_interval;
+extern int max_wal_prefetch_distance;
+extern bool wal_prefetch_fpw;
extern char *XLogArchiveCommand;
extern bool EnableHotStandby;
extern bool fullPageWrites;
@@ -319,6 +321,8 @@ extern void SetWalWriterSleeping(bool sleeping);
extern void XLogRequestWalReceiverReply(void);
+extern void ResetWalPrefetcher(void);
+
extern void assign_max_wal_size(int newval, void *extra);
extern void assign_checkpoint_completion_target(double newval, void *extra);
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
new file mode 100644
index 0000000000..585f5564a3
--- /dev/null
+++ b/src/include/access/xlogprefetcher.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.h
+ * Declarations for the XLog prefetching facility
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/xlogprefetcher.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCHER_H
+#define XLOGPREFETCHER_H
+
+#include "access/xlogdefs.h"
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch, XLogRecPtr replaying_lsn);
+
+extern size_t XLogPrefetcherShmemSize(void);
+extern void XLogPrefetcherShmemInit(void);
+
+#endif
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5181a077d9..1c8e67d74a 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,6 +47,26 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
extern void FreeFakeRelcacheEntry(Relation fakerel);
+/*
+ * A pointer to an XLogReadLocalOptions struct can be supplied as the private
+ * data for an xlog reader, causing read_local_xlog_page to modify its
+ * behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+ /* Don't block waiting for new WAL to arrive. */
+ bool nowait;
+
+ /* How far to read. */
+ enum {
+ XLRO_WALRCV_WRITTEN,
+ XLRO_LSN
+ } read_upto_policy;
+
+ /* If read_upto_policy is XLRO_LSN, the LSN. */
+ XLogRecPtr lsn;
+} XLogReadLocalOptions;
+
extern int read_local_xlog_page(XLogReaderState *state,
XLogRecPtr targetPagePtr, int reqLen,
XLogRecPtr targetRecPtr, char *cur_page);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7fb574f9dc..742741afa1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6082,6 +6082,14 @@
prorettype => 'bool', proargtypes => '',
prosrc => 'pg_is_wal_replay_paused' },
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+ proname => 'pg_stat_get_wal_prefetcher', prorows => '1', provolatile => 'v',
+ proretset => 't', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,int8,int8,int8,int8,int4,int4}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth}',
+ prosrc => 'pg_stat_get_wal_prefetcher' },
+
{ oid => '2621', descr => 'reload configuration files',
proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ce93ace76c..7d076a9743 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -438,5 +438,7 @@ extern void assign_search_path(const char *newval, void *extra);
/* in access/transam/xlog.c */
extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+extern void assign_max_wal_prefetch_distance(int new_value, void *extra);
+extern void assign_wal_prefetch_fpw(bool new_value, void *extra);
#endif /* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c7304611c3..63bbb796fc 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2102,6 +2102,14 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.autoanalyze_count
FROM pg_stat_all_tables
WHERE ((pg_stat_all_tables.schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (pg_stat_all_tables.schemaname !~ '^pg_toast'::text));
+pg_stat_wal_prefetcher| SELECT s.prefetch,
+ s.skip_hit,
+ s.skip_new,
+ s.skip_fpw,
+ s.skip_seq,
+ s.distance,
+ s.queue_depth
+ FROM pg_stat_get_wal_prefetcher() s(prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth);
pg_stat_wal_receiver| SELECT s.pid,
s.status,
s.receive_start_lsn,
--
2.20.1
Hi,
On 2020-03-18 18:18:44 +1300, Thomas Munro wrote:
From 1b03eb5ada24c3b23ab8ca6db50e0c5d90d38259 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).

A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk. To do that, it
needs to be able to see the write pointer advancing in shared memory.

The function formerly bearing that name was recently renamed to
WalRcvGetFlushRecPtr(), which better described what it does.
Hm. I'm a bit wary of reusing the name with a different meaning. If
there's any external references, this'll hide that they need to
adapt. Perhaps, even if it's a bit clunky, name it GetUnflushedRecPtr?
From c62fde23f70ff06833d743a1c85716e15f3c813c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 17:26:41 +1300
Subject: [PATCH 4/5] Allow PrefetchBuffer() to report what happened.

Report whether a prefetch was actually initiated due to a cache miss, so
that callers can limit the number of concurrent I/Os they try to issue,
without counting the prefetch calls that did nothing because the page
was already in our buffers.

If the requested block was already cached, return a valid buffer. This
might enable future code to avoid a buffer mapping lookup, though it
will need to recheck the buffer before using it because it's not pinned
so could be reclaimed at any time.

Report neither hit nor miss when a relation's backing file is missing,
to prepare for use during recovery. This will be used to handle cases
of relations that are referenced in the WAL but have been unlinked
already due to actions covered by WAL records that haven't been replayed
yet, after a crash.
We probably should take this into account in nodeBitmapHeapscan.c
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d30aed6fd9..4ceb40a856 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -469,11 +469,13 @@ static int ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
 /*
  * Implementation of PrefetchBuffer() for shared buffers.
  */
-void
+PrefetchBufferResult
 PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
                      ForkNumber forkNum,
                      BlockNumber blockNum)
 {
+    PrefetchBufferResult result = { InvalidBuffer, false };
+
 #ifdef USE_PREFETCH
     BufferTag   newTag;      /* identity of requested block */
     uint32      newHash;     /* hash value for newTag */
@@ -497,7 +499,23 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
     /* If not in buffers, initiate prefetch */
     if (buf_id < 0)
-        smgrprefetch(smgr_reln, forkNum, blockNum);
+    {
+        /*
+         * Try to initiate an asynchronous read. This returns false in
+         * recovery if the relation file doesn't exist.
+         */
+        if (smgrprefetch(smgr_reln, forkNum, blockNum))
+            result.initiated_io = true;
+    }
+    else
+    {
+        /*
+         * Report the buffer it was in at that time. The caller may be able
+         * to avoid a buffer table lookup, but it's not pinned and it must be
+         * rechecked!
+         */
+        result.buffer = buf_id + 1;
Perhaps it'd be better to name this "last_buffer" or such, to make it
clearer that it may be outdated?
-void
+PrefetchBufferResult
PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
{
#ifdef USE_PREFETCH
@@ -540,13 +564,17 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
                 errmsg("cannot access temporary tables of other sessions")));

        /* pass it off to localbuf.c */
-       PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
+       return PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
    }
    else
    {
        /* pass it to the shared buffer version */
-       PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
+       return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
    }
+#else
+   PrefetchBuffer result = { InvalidBuffer, false };
+
+   return result;
 #endif                          /* USE_PREFETCH */
 }
Hm. Now that results are returned indicating whether the buffer is in
s_b - shouldn't the return value be accurate regardless of USE_PREFETCH?
+/*
+ * Type returned by PrefetchBuffer().
+ */
+typedef struct PrefetchBufferResult
+{
+   Buffer      buffer;         /* If valid, a hit (recheck needed!) */
I assume there's no user of this yet? Even if there's not, I wonder if
it still is worth adding and referencing a helper to do so correctly?
From 42ba0a89260d46230ac0df791fae18bfdca0092f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 18 Mar 2020 16:35:27 +1300
Subject: [PATCH 5/5] Prefetch referenced blocks during recovery.

Introduce a new GUC max_wal_prefetch_distance. If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks. The
goal is to avoid I/O stalls and benefit from concurrent I/O. The number
of concurrency asynchronous reads is capped by the existing
maintenance_io_concurrency GUC. The feature is disabled by default.

Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion:
/messages/by-id/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
Why is it disabled by default? Just for "risk management"?
+     <varlistentry id="guc-max-wal-prefetch-distance" xreflabel="max_wal_prefetch_distance">
+      <term><varname>max_wal_prefetch_distance</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>max_wal_prefetch_distance</varname> configuration parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        The maximum distance to look ahead in the WAL during recovery, to find
+        blocks to prefetch.  Prefetching blocks that will soon be needed can
+        reduce I/O wait times.  The number of concurrent prefetches is limited
+        by this setting as well as <xref linkend="guc-maintenance-io-concurrency"/>.
+        If this value is specified without units, it is taken as bytes.
+        The default is -1, meaning that WAL prefetching is disabled.
+       </para>
+      </listitem>
+     </varlistentry>
Is it worth noting that a too large distance could hurt, because the
buffers might get evicted again?
+     <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw">
+      <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to prefetch blocks with full page images during recovery.
+        Usually this doesn't help, since such blocks will not be read.  However,
+        on file systems with a block size larger than
+        <productname>PostgreSQL</productname>'s, prefetching can avoid a costly
+        read-before-write when such blocks are later written.
+        This setting has no effect unless
+        <xref linkend="guc-max-wal-prefetch-distance"/> is set to a positive number.
+        The default is off.
+       </para>
+      </listitem>
+     </varlistentry>
Hm. I think this needs more details - it's not clear enough what this
actually controls. I assume it's about prefetching for WAL records that
contain the FPW, but it also could be read to be about not prefetching
any pages that had FPWs before, or such?
     </variablelist>
    </sect2>
    <sect2 id="runtime-config-wal-archiving">

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 987580d6df..df4291092b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres  27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
      </entry>
     </row>

+    <row>
+     <entry><structname>pg_stat_wal_prefetcher</structname><indexterm><primary>pg_stat_wal_prefetcher</primary></indexterm></entry>
+     <entry>Only one row, showing statistics about blocks prefetched during recovery.
+      See <xref linkend="pg-stat-wal-prefetcher-view"/> for details.
+     </entry>
+    </row>
+
'prefetcher' somehow sounds odd to me. I also suspect that we'll want to
have additional prefetching stat tables going forward. Perhaps
'pg_stat_prefetch_wal'?
+    <row>
+     <entry><structfield>distance</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry>
+    </row>
+    <row>
+     <entry><structfield>queue_depth</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+    </row>
+   </tbody>
+  </tgroup>
+ </table>
Is there a way we could have a "historical" version of at least some of
these? An average queue depth, or such?
It'd be useful to somewhere track the time spent initiating prefetch
requests. Otherwise it's quite hard to evaluate whether the queue is too
deep (and just blocks in the OS).
I think it'd be good to have a 'reset time' column.
+  <para>
+   The <structname>pg_stat_wal_prefetcher</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-max-wal-prefetch-distance"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-max-wal-prefetch-distance"/>,
+   <xref linkend="guc-wal-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+
So pg_stat_reset_shared() cannot be used? If so, why?
It sounds like the counters aren't persisted via the stats system - if
so, why?
@@ -7105,6 +7114,31 @@ StartupXLOG(void)
/* Handle interrupt signals of startup process */
        HandleStartupProcInterrupts();

+       /*
+        * The first time through, or if any relevant settings or the
+        * WAL source changes, we'll restart the prefetching machinery
+        * as appropriate.  This is simpler than trying to handle
+        * various complicated state changes.
+        */
+       if (unlikely(reset_wal_prefetcher))
+       {
+           /* If we had one already, destroy it. */
+           if (prefetcher)
+           {
+               XLogPrefetcherFree(prefetcher);
+               prefetcher = NULL;
+           }
+           /* If we want one, create it. */
+           if (max_wal_prefetch_distance > 0)
+               prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr,
+                                                   currentSource == XLOG_FROM_STREAM);
+           reset_wal_prefetcher = false;
+       }
Do we really need all of this code in StartupXLOG() itself? Could it be
in HandleStartupProcInterrupts() or at least a helper routine called
here?
+       /* Peform WAL prefetching, if enabled. */
+       if (prefetcher)
+           XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr);
+
        /*
         * Pause WAL replay, if requested by a hot-standby session via
         * SetRecoveryPause().
Personally, I'd rather have the if () be in
XLogPrefetcherReadAhead(). With an inline wrapper doing the check, if
the call bothers you (but I don't think it needs to).
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ *     Prefetching support for PostgreSQL write-ahead log manager
+ *
An architectural overview here would be good.
+struct XLogPrefetcher
+{
+   /* Reader and current reading state. */
+   XLogReaderState *reader;
+   XLogReadLocalOptions options;
+   bool        have_record;
+   bool        shutdown;
+   int         next_block_id;
+
+   /* Book-keeping required to avoid accessing non-existing blocks. */
+   HTAB       *filter_table;
+   dlist_head  filter_queue;
+
+   /* Book-keeping required to limit concurrent prefetches. */
+   XLogRecPtr *prefetch_queue;
+   int         prefetch_queue_size;
+   int         prefetch_head;
+   int         prefetch_tail;
+
+   /* Details of last prefetch to skip repeats and seq scans. */
+   SMgrRelation last_reln;
+   RelFileNode last_rnode;
+   BlockNumber last_blkno;
Do you have a comment somewhere explaining why you want to avoid
seqscans (I assume it's about avoiding regressions in linux, but only
because I recall chatting with you about it).
+/*
+ * On modern systems this is really just *counter++.  On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.  The counters will only be written to by one process, and there
+ * is no ordering requirement, so there's no point in using higher overhead
+ * pg_atomic_fetch_add_u64().
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)
+{
+   pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}
Could be worthwhile to add to the atomics infrastructure itself - on the
platforms where this needs spinlocks this will lead to two acquisitions,
rather than one.
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
+{
+   static HASHCTL hash_table_ctl = {
+       .keysize = sizeof(RelFileNode),
+       .entrysize = sizeof(XLogPrefetcherFilter)
+   };
+   XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher));
+
+   prefetcher->options.nowait = true;
+   if (streaming)
+   {
+       /*
+        * We're only allowed to read as far as the WAL receiver has written.
+        * We don't have to wait for it to be flushed, though, as recovery
+        * does, so that gives us a chance to get a bit further ahead.
+        */
+       prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+   }
+   else
+   {
+       /* We're allowed to read as far as we can. */
+       prefetcher->options.read_upto_policy = XLRO_LSN;
+       prefetcher->options.lsn = (XLogRecPtr) -1;
+   }
+   prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+                                           NULL,
+                                           read_local_xlog_page,
+                                           &prefetcher->options);
+   prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024,
+                                          &hash_table_ctl,
+                                          HASH_ELEM | HASH_BLOBS);
+   dlist_init(&prefetcher->filter_queue);
+
+   /*
+    * The size of the queue is based on the maintenance_io_concurrency
+    * setting.  In theory we might have a separate queue for each tablespace,
+    * but it's not clear how that should work, so for now we'll just use the
+    * general GUC to rate-limit all prefetching.
+    */
+   prefetcher->prefetch_queue_size = maintenance_io_concurrency;
+   prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);
+   prefetcher->prefetch_head = prefetcher->prefetch_tail = 0;
+
+   /* Prepare to read at the given LSN. */
+   ereport(LOG,
+           (errmsg("WAL prefetch started at %X/%X",
+                   (uint32) (lsn << 32), (uint32) lsn)));
+   XLogBeginRead(prefetcher->reader, lsn);
+
+   XLogPrefetcherResetMonitoringStats();
+
+   return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+   double      avg_distance = 0;
+   double      avg_queue_depth = 0;
+
+   /* Log final statistics. */
+   if (prefetcher->samples > 0)
+   {
+       avg_distance = prefetcher->distance_sum / prefetcher->samples;
+       avg_queue_depth = prefetcher->queue_depth_sum / prefetcher->samples;
+   }
+   ereport(LOG,
+           (errmsg("WAL prefetch finished at %X/%X; "
+                   "prefetch = " UINT64_FORMAT ", "
+                   "skip_hit = " UINT64_FORMAT ", "
+                   "skip_new = " UINT64_FORMAT ", "
+                   "skip_fpw = " UINT64_FORMAT ", "
+                   "skip_seq = " UINT64_FORMAT ", "
+                   "avg_distance = %f, "
+                   "avg_queue_depth = %f",
+                   (uint32) (prefetcher->reader->EndRecPtr << 32),
+                   (uint32) (prefetcher->reader->EndRecPtr),
+                   pg_atomic_read_u64(&MonitoringStats->prefetch),
+                   pg_atomic_read_u64(&MonitoringStats->skip_hit),
+                   pg_atomic_read_u64(&MonitoringStats->skip_new),
+                   pg_atomic_read_u64(&MonitoringStats->skip_fpw),
+                   pg_atomic_read_u64(&MonitoringStats->skip_seq),
+                   avg_distance,
+                   avg_queue_depth)));
+   XLogReaderFree(prefetcher->reader);
+   hash_destroy(prefetcher->filter_table);
+   pfree(prefetcher->prefetch_queue);
+   pfree(prefetcher);
+
+   XLogPrefetcherResetMonitoringStats();
+}
It's possibly overkill, but I think it'd be a good idea to do all the
allocations within a prefetch specific memory context. That makes
detecting potential leaks or such easier.
+ /* Can we drop any filters yet, due to problem records begin replayed? */
Odd grammar.
+ XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
Hm, why isn't this part of the loop below?
+   /* Main prefetch loop. */
+   for (;;)
+   {
This kind of looks like a separate process' main loop. The name
indicates similar. And there's no architecture documentation
disinclining one from that view...
The loop body is quite long. I think it should be split into a number of
helper functions. Perhaps one to ensure a block is read, one to maintain
stats, and then one to process block references?
+       /*
+        * Scan the record for block references.  We might already have been
+        * partway through processing this record when we hit maximum I/O
+        * concurrency, so start where we left off.
+        */
+       for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i)
+       {
Super pointless nitpickery: For a loop-body this big I'd rather name 'i'
'blockid' or such.
Greetings,
Andres Freund
Hi,
Thanks for all that feedback. It's been a strange couple of weeks,
but I finally have a new version that addresses most of that feedback
(but punts on a couple of suggestions for later development, due to
lack of time).
It also fixes a couple of other problems I found with the previous version:
1. While streaming, whenever it hit the end of available data (ie LSN
written by WAL receiver), it would close and then reopen the WAL
segment. Fixed by the machinery in 0007 which allows for "would
block" as distinct from other errors.
2. During crash recovery, there were some edge cases where it would
try to read the next WAL segment when there isn't one. Also fixed by
0007.
3. It was maxing out at maintenance_io_concurrency - 1 due to a silly
circular buffer fence post bug.
Note that 0006 is just for illustration, it's not proposed for commit.
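To spell out the fence post in point 3 for anyone following along: with the
usual "(head + 1) % size == tail" full test, a ring sized at exactly
maintenance_io_concurrency can only ever hold maintenance_io_concurrency - 1
in-flight LSNs.  The minimal fix is just to leave room for a spare slot,
something like this (shown only to illustrate the bug, the new patches may
arrange it differently):

    /*
     * Illustration only: one spare slot, so the ring can actually hold
     * maintenance_io_concurrency entries despite the (head + 1) == tail
     * "full" test.
     */
    prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
    prefetcher->prefetch_queue =
        palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);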
On Wed, Mar 25, 2020 at 11:31 AM Andres Freund <andres@anarazel.de> wrote:
On 2020-03-18 18:18:44 +1300, Thomas Munro wrote:
From 1b03eb5ada24c3b23ab8ca6db50e0c5d90d38259 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).

A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk. To do that, it
needs to be able to see the write pointer advancing in shared memory.

The function formerly bearing that name was recently renamed to
WalRcvGetFlushRecPtr(), which better described what it does.

Hm. I'm a bit wary of reusing the name with a different meaning. If
there's any external references, this'll hide that they need to
adapt. Perhaps, even if it's a bit clunky, name it GetUnflushedRecPtr?
Well, at least external code won't compile due to the change in arguments:
extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart,
TimeLineID *receiveTLI);
extern XLogRecPtr GetWalRcvWriteRecPtr(void);
Anyone who is using that for some kind of data integrity purposes
should hopefully be triggered to investigate, no? I tried to think of
a better naming scheme but...
From c62fde23f70ff06833d743a1c85716e15f3c813c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 17:26:41 +1300
Subject: [PATCH 4/5] Allow PrefetchBuffer() to report what happened.

Report whether a prefetch was actually initiated due to a cache miss, so
that callers can limit the number of concurrent I/Os they try to issue,
without counting the prefetch calls that did nothing because the page
was already in our buffers.

If the requested block was already cached, return a valid buffer. This
might enable future code to avoid a buffer mapping lookup, though it
will need to recheck the buffer before using it because it's not pinned
so could be reclaimed at any time.

Report neither hit nor miss when a relation's backing file is missing,
to prepare for use during recovery. This will be used to handle cases
of relations that are referenced in the WAL but have been unlinked
already due to actions covered by WAL records that haven't been replayed
yet, after a crash.

We probably should take this into account in nodeBitmapHeapscan.c
Indeed. The naive version would be something like:
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 726d3a2d9a..3cd644d0ac 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -484,13 +484,11 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
                    node->prefetch_iterator = NULL;
                    break;
                }
-               node->prefetch_pages++;

                /*
                 * If we expect not to have to actually read this heap page,
                 * skip this prefetch call, but continue to run the prefetch
-                * logic normally.  (Would it be better not to increment
-                * prefetch_pages?)
+                * logic normally.
                 *
                 * This depends on the assumption that the index AM will
                 * report the same recheck flag for this future heap page as
@@ -504,7 +502,13 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
                                             &node->pvmbuffer));

                if (!skip_fetch)
-                   PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+               {
+                   PrefetchBufferResult prefetch;
+
+                   prefetch = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
+                   if (prefetch.initiated_io)
+                       node->prefetch_pages++;
+               }
            }
        }
... but that might get arbitrarily far ahead, so it probably needs
some kind of cap, and the parallel version is a bit more complicated.
Something for later, along with more prefetching opportunities.
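To illustrate the kind of cap I have in mind, something along these lines
could bound the look-ahead (just a sketch: prefetch_inflight is a made-up
counter, and decrementing it when the pages are actually read runs into the
modularity problem I mention further down):

    if (!skip_fetch)
    {
        PrefetchBufferResult prefetch;

        prefetch = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
        if (prefetch.initiated_io)
        {
            node->prefetch_pages++;
            node->prefetch_inflight++;  /* hypothetical counter, not in the patch */
        }
    }

    /* Don't look any further ahead while enough I/Os seem to be in flight. */
    if (node->prefetch_inflight >= node->prefetch_maximum)
        break;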
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d30aed6fd9..4ceb40a856 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -469,11 +469,13 @@ static int ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
 /*
  * Implementation of PrefetchBuffer() for shared buffers.
  */
-void
+PrefetchBufferResult
 PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
                      ForkNumber forkNum,
                      BlockNumber blockNum)
 {
+    PrefetchBufferResult result = { InvalidBuffer, false };
+
 #ifdef USE_PREFETCH
     BufferTag   newTag;      /* identity of requested block */
     uint32      newHash;     /* hash value for newTag */
@@ -497,7 +499,23 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
     /* If not in buffers, initiate prefetch */
     if (buf_id < 0)
-        smgrprefetch(smgr_reln, forkNum, blockNum);
+    {
+        /*
+         * Try to initiate an asynchronous read. This returns false in
+         * recovery if the relation file doesn't exist.
+         */
+        if (smgrprefetch(smgr_reln, forkNum, blockNum))
+            result.initiated_io = true;
+    }
+    else
+    {
+        /*
+         * Report the buffer it was in at that time. The caller may be able
+         * to avoid a buffer table lookup, but it's not pinned and it must be
+         * rechecked!
+         */
+        result.buffer = buf_id + 1;

Perhaps it'd be better to name this "last_buffer" or such, to make it
clearer that it may be outdated?
OK. Renamed to "recent_buffer".
-void
+PrefetchBufferResult
PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
{
#ifdef USE_PREFETCH
@@ -540,13 +564,17 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
                 errmsg("cannot access temporary tables of other sessions")));

        /* pass it off to localbuf.c */
-       PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
+       return PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
    }
    else
    {
        /* pass it to the shared buffer version */
-       PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
+       return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
    }
+#else
+   PrefetchBuffer result = { InvalidBuffer, false };
+
+   return result;
 #endif                          /* USE_PREFETCH */
 }

Hm. Now that results are returned indicating whether the buffer is in
s_b - shouldn't the return value be accurate regardless of USE_PREFETCH?
Yeah. Done.
+/*
+ * Type returned by PrefetchBuffer().
+ */
+typedef struct PrefetchBufferResult
+{
+   Buffer      buffer;         /* If valid, a hit (recheck needed!) */

I assume there's no user of this yet? Even if there's not, I wonder if
it still is worth adding and referencing a helper to do so correctly?
It *is* used, but only to see if it's valid. 0006 is a not-for-commit
patch to show how you might use it later to read a buffer. To
actually use this for something like bitmap heap scan, you'd first
need to fix the modularity violations in that code (I mean we have
PrefetchBuffer() in nodeBitmapHeapscan.c, but the corresponding
[ReleaseAnd]ReadBuffer() in heapam.c, and you'd need to get these into
the same module and/or to communicate in some graceful way).
From 42ba0a89260d46230ac0df791fae18bfdca0092f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 18 Mar 2020 16:35:27 +1300
Subject: [PATCH 5/5] Prefetch referenced blocks during recovery.

Introduce a new GUC max_wal_prefetch_distance. If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks. The
goal is to avoid I/O stalls and benefit from concurrent I/O. The number
of concurrent asynchronous reads is capped by the existing
maintenance_io_concurrency GUC.  The feature is disabled by default.

Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion:
/messages/by-id/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com

Why is it disabled by default? Just for "risk management"?
Well, it's not free, and might not help you, so not everyone would
want it on. I think the overheads can be mostly removed with more
work in a later release. Perhaps we could commit it enabled by
default, and then discuss it before release after looking at some more
data? On that basis I have now made it default to on, with
max_wal_prefetch_distance = 256kB, if your build has USE_PREFETCH.
Obviously this number can be discussed.
+ <varlistentry id="guc-max-wal-prefetch-distance" xreflabel="max_wal_prefetch_distance"> + <term><varname>max_wal_prefetch_distance</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_wal_prefetch_distance</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + The maximum distance to look ahead in the WAL during recovery, to find + blocks to prefetch. Prefetching blocks that will soon be needed can + reduce I/O wait times. The number of concurrent prefetches is limited + by this setting as well as <xref linkend="guc-maintenance-io-concurrency"/>. + If this value is specified without units, it is taken as bytes. + The default is -1, meaning that WAL prefetching is disabled. + </para> + </listitem> + </varlistentry>Is it worth noting that a too large distance could hurt, because the
buffers might get evicted again?
OK, I tried to explain that.
+ <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw"> + <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>) + <indexterm> + <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Whether to prefetch blocks with full page images during recovery. + Usually this doesn't help, since such blocks will not be read. However, + on file systems with a block size larger than + <productname>PostgreSQL</productname>'s, prefetching can avoid a costly + read-before-write when a blocks are later written. + This setting has no effect unless + <xref linkend="guc-max-wal-prefetch-distance"/> is set to a positive number. + The default is off. + </para> + </listitem> + </varlistentry>Hm. I think this needs more details - it's not clear enough what this
actually controls. I assume it's about prefetching for WAL records that
contain the FPW, but it also could be read to be about not prefetching
any pages that had FPWs before, or such?
Ok, I have elaborated.
     </variablelist>
     </sect2>

     <sect2 id="runtime-config-wal-archiving">

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 987580d6df..df4291092b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres  27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>

+     <row>
+      <entry><structname>pg_stat_wal_prefetcher</structname><indexterm><primary>pg_stat_wal_prefetcher</primary></indexterm></entry>
+      <entry>Only one row, showing statistics about blocks prefetched during recovery.
+       See <xref linkend="pg-stat-wal-prefetcher-view"/> for details.
+      </entry>
+     </row>
+

'prefetcher' somehow sounds odd to me. I also suspect that we'll want to
have additional prefetching stat tables going forward. Perhaps
'pg_stat_prefetch_wal'?
Works for me, though while thinking about this I realised that the
"WAL" part was bothering me. It sounds like we're prefetching WAL
itself, which would be a different thing. So I renamed this view to
pg_stat_prefetch_recovery.
Then I renamed the main GUCs that control this thing to:
max_recovery_prefetch_distance
recovery_prefetch_fpw
+     <row>
+      <entry><structfield>distance</structfield></entry>
+      <entry><type>integer</type></entry>
+      <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry>
+     </row>
+     <row>
+      <entry><structfield>queue_depth</structfield></entry>
+      <entry><type>integer</type></entry>
+      <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>

Is there a way we could have a "historical" version of at least some of
these? An average queue depth, or such?
Ok, I added simple online averages for distance and queue depth that
take a sample every time recovery advances by 256kB.
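For the record, the "online average" is just the usual incremental mean,
maintained roughly like this each time a sample is taken (this mirrors the
code in XLogPrefetcherScanRecords() in the attached patch):

	/* Standard incremental mean, one sample per ~256kB of replayed WAL. */
	samples++;
	if (samples == 1)
	{
		avg_distance = distance;
		avg_queue_depth = queue_depth;
	}
	else
	{
		avg_distance += (distance - avg_distance) / samples;
		avg_queue_depth += (queue_depth - avg_queue_depth) / samples;
	}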
It'd be useful to somewhere track the time spent initiating prefetch
requests. Otherwise it's quite hard to evaluate whether the queue is too
deep (and just blocks in the OS).
I agree that that sounds useful, and I thought about various ways to
do that, all involving new views, until I eventually found myself
wondering: why isn't recovery's I/O already tracked via the existing
stats views? For example, why can't I see blks_read, blks_hit,
blk_read_time etc moving in pg_stat_database due to recovery activity?
It seems like if you made that work first, or created a new
pgstatio view for that, then you could add prefetching counters and
timing (if track_io_timing is on) to the existing machinery so that
bufmgr.c would automatically capture it, and then not only recovery
but also stuff like bitmap heap scan could also be measured the same
way.
However, time is short, so I'm not attempting to do anything like that
now. You can measure the posix_fadvise() times with OS facilities in
the meantime.
I think it'd be good to have a 'reset time' column.
Done, as stats_reset following other examples.
+  <para>
+   The <structname>pg_stat_wal_prefetcher</structname> view will contain only
+   one row.  It is filled with nulls if recovery is not running or WAL
+   prefetching is not enabled.  See <xref linkend="guc-max-wal-prefetch-distance"/>
+   for more information.  The counters in this view are reset whenever the
+   <xref linkend="guc-max-wal-prefetch-distance"/>,
+   <xref linkend="guc-wal-prefetch-fpw"/> or
+   <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+   the server configuration is reloaded.
+  </para>
+

So pg_stat_reset_shared() cannot be used? If so, why?
Hmm. OK, I made pg_stat_reset_shared('prefetch_recovery') work.
It sounds like the counters aren't persisted via the stats system - if
so, why?
Ok, I made it persist the simple counters by sending them to the stats
collector periodically. The view still shows data straight out of
shmem though, not out of the stats file. Now I'm wondering if I
should have the view show it from the stats file, more like other
things, now that I understand that a bit better... hmm.
@@ -7105,6 +7114,31 @@ StartupXLOG(void)
/* Handle interrupt signals of startup process */
 			HandleStartupProcInterrupts();

+			/*
+			 * The first time through, or if any relevant settings or the
+			 * WAL source changes, we'll restart the prefetching machinery
+			 * as appropriate.  This is simpler than trying to handle
+			 * various complicated state changes.
+			 */
+			if (unlikely(reset_wal_prefetcher))
+			{
+				/* If we had one already, destroy it. */
+				if (prefetcher)
+				{
+					XLogPrefetcherFree(prefetcher);
+					prefetcher = NULL;
+				}
+				/* If we want one, create it. */
+				if (max_wal_prefetch_distance > 0)
+					prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr,
+														currentSource == XLOG_FROM_STREAM);
+				reset_wal_prefetcher = false;
+			}

Do we really need all of this code in StartupXLOG() itself? Could it be
in HandleStartupProcInterrupts() or at least a helper routine called
here?
It's now done differently, so that StartupXLOG() only has three new
lines: XLogPrefetchBegin() before the loop, XLogPrefetch() in the
loop, and XLogPrefetchEnd() after the loop.
+			/* Peform WAL prefetching, if enabled. */
+			if (prefetcher)
+				XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr);
+
 			/*
 			 * Pause WAL replay, if requested by a hot-standby session via
 			 * SetRecoveryPause().

Personally, I'd rather have the if () be in
XLogPrefetcherReadAhead(). With an inline wrapper doing the check, if
the call bothers you (but I don't think it needs to).
Done.
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetcher.c
+ *		Prefetching support for PostgreSQL write-ahead log manager
+ *

An architectural overview here would be good.
OK, added.
+struct XLogPrefetcher
+{
+	/* Reader and current reading state. */
+	XLogReaderState *reader;
+	XLogReadLocalOptions options;
+	bool		have_record;
+	bool		shutdown;
+	int			next_block_id;
+
+	/* Book-keeping required to avoid accessing non-existing blocks. */
+	HTAB	   *filter_table;
+	dlist_head	filter_queue;
+
+	/* Book-keeping required to limit concurrent prefetches. */
+	XLogRecPtr *prefetch_queue;
+	int			prefetch_queue_size;
+	int			prefetch_head;
+	int			prefetch_tail;
+
+	/* Details of last prefetch to skip repeats and seq scans. */
+	SMgrRelation last_reln;
+	RelFileNode last_rnode;
+	BlockNumber last_blkno;

Do you have a comment somewhere explaining why you want to avoid
seqscans (I assume it's about avoiding regressions in linux, but only
because I recall chatting with you about it).
I've added a note to the new architectural comments.
+/*
+ * On modern systems this is really just *counter++.  On some older systems
+ * there might be more to it, due to inability to read and write 64 bit values
+ * atomically.  The counters will only be written to by one process, and there
+ * is no ordering requirement, so there's no point in using higher overhead
+ * pg_atomic_fetch_add_u64().
+ */
+static inline void inc_counter(pg_atomic_uint64 *counter)
+{
+	pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
+}

Could be worthwhile to add to the atomics infrastructure itself - on the
platforms where this needs spinlocks this will lead to two acquisitions,
rather than one.
Ok, I added pg_atomic_unlocked_add_fetch_XXX(). (Could also be
"fetch_add", I don't care, I don't use the result).
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
+{
+	static HASHCTL hash_table_ctl = {
+		.keysize = sizeof(RelFileNode),
+		.entrysize = sizeof(XLogPrefetcherFilter)
+	};
+	XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher));
+
+	prefetcher->options.nowait = true;
+	if (streaming)
+	{
+		/*
+		 * We're only allowed to read as far as the WAL receiver has written.
+		 * We don't have to wait for it to be flushed, though, as recovery
+		 * does, so that gives us a chance to get a bit further ahead.
+		 */
+		prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+	}
+	else
+	{
+		/* We're allowed to read as far as we can. */
+		prefetcher->options.read_upto_policy = XLRO_LSN;
+		prefetcher->options.lsn = (XLogRecPtr) -1;
+	}
+	prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+											NULL,
+											read_local_xlog_page,
+											&prefetcher->options);
+	prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024,
+										   &hash_table_ctl,
+										   HASH_ELEM | HASH_BLOBS);
+	dlist_init(&prefetcher->filter_queue);
+
+	/*
+	 * The size of the queue is based on the maintenance_io_concurrency
+	 * setting.  In theory we might have a separate queue for each tablespace,
+	 * but it's not clear how that should work, so for now we'll just use the
+	 * general GUC to rate-limit all prefetching.
+	 */
+	prefetcher->prefetch_queue_size = maintenance_io_concurrency;
+	prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size);
+	prefetcher->prefetch_head = prefetcher->prefetch_tail = 0;
+
+	/* Prepare to read at the given LSN. */
+	ereport(LOG,
+			(errmsg("WAL prefetch started at %X/%X",
+					(uint32) (lsn << 32), (uint32) lsn)));
+	XLogBeginRead(prefetcher->reader, lsn);
+
+	XLogPrefetcherResetMonitoringStats();
+
+	return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+	double		avg_distance = 0;
+	double		avg_queue_depth = 0;
+
+	/* Log final statistics. */
+	if (prefetcher->samples > 0)
+	{
+		avg_distance = prefetcher->distance_sum / prefetcher->samples;
+		avg_queue_depth = prefetcher->queue_depth_sum / prefetcher->samples;
+	}
+	ereport(LOG,
+			(errmsg("WAL prefetch finished at %X/%X; "
+					"prefetch = " UINT64_FORMAT ", "
+					"skip_hit = " UINT64_FORMAT ", "
+					"skip_new = " UINT64_FORMAT ", "
+					"skip_fpw = " UINT64_FORMAT ", "
+					"skip_seq = " UINT64_FORMAT ", "
+					"avg_distance = %f, "
+					"avg_queue_depth = %f",
+					(uint32) (prefetcher->reader->EndRecPtr << 32),
+					(uint32) (prefetcher->reader->EndRecPtr),
+					pg_atomic_read_u64(&MonitoringStats->prefetch),
+					pg_atomic_read_u64(&MonitoringStats->skip_hit),
+					pg_atomic_read_u64(&MonitoringStats->skip_new),
+					pg_atomic_read_u64(&MonitoringStats->skip_fpw),
+					pg_atomic_read_u64(&MonitoringStats->skip_seq),
+					avg_distance,
+					avg_queue_depth)));
+	XLogReaderFree(prefetcher->reader);
+	hash_destroy(prefetcher->filter_table);
+	pfree(prefetcher->prefetch_queue);
+	pfree(prefetcher);
+
+	XLogPrefetcherResetMonitoringStats();
+}

It's possibly overkill, but I think it'd be a good idea to do all the
allocations within a prefetch specific memory context. That makes
detecting potential leaks or such easier.
I looked into that, but in fact it's already pretty clear how much
memory this thing is using, if you call
MemoryContextStats(TopMemoryContext), because it's almost all in a
named hash table:
TopMemoryContext: 155776 total in 6 blocks; 18552 free (8 chunks); 137224 used
XLogPrefetcherFilterTable: 16384 total in 2 blocks; 4520 free (3
chunks); 11864 used
SP-GiST temporary context: 8192 total in 1 blocks; 7928 free (0
chunks); 264 used
GiST temporary context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
GIN recovery temporary context: 8192 total in 1 blocks; 7928 free (0
chunks); 264 used
Btree recovery temporary context: 8192 total in 1 blocks; 7928 free
(0 chunks); 264 used
RecoveryLockLists: 8192 total in 1 blocks; 2584 free (0 chunks); 5608 used
PrivateRefCount: 8192 total in 1 blocks; 2584 free (0 chunks); 5608 used
MdSmgr: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
Pending ops context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
LOCALLOCK hash: 8192 total in 1 blocks; 512 free (0 chunks); 7680 used
Timezones: 104128 total in 2 blocks; 2584 free (0 chunks); 101544 used
ErrorContext: 8192 total in 1 blocks; 7928 free (4 chunks); 264 used
Grand total: 358208 bytes in 20 blocks; 86832 free (15 chunks); 271376 used
The XLogPrefetcher struct itself is not measured separately, but I
don't think that's a problem, it's small and there's only ever one at
a time. It's that XLogPrefetcherFilterTable that is of variable size
(though it's often empty). While thinking about this, I made
prefetch_queue into a flexible array rather than a pointer to palloc'd
memory, which seemed a bit tidier.
+ /* Can we drop any filters yet, due to problem records begin replayed? */
Odd grammar.
Rewritten.
+ XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
Hm, why isn't this part of the loop below?
It only needs to run when replaying_lsn has advanced (ie when records
have been replayed). I hope the new comment makes that clearer.
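To make that concrete, dropping completed filters is just a matter of
popping expired entries off the LSN-ordered queue, approximately like this
(a sketch, not the exact patch code):

/*
 * Sketch: drop any filters whose guarding record has now been replayed.
 * Entries are kept in filter_queue in LSN order, oldest at the tail.
 */
static inline void
XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
{
	while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
	{
		XLogPrefetcherFilter *filter =
			dlist_tail_element(XLogPrefetcherFilter, link,
							   &prefetcher->filter_queue);

		if (filter->filter_until_replayed >= replaying_lsn)
			break;
		dlist_delete(&filter->link);
		hash_search(prefetcher->filter_table, &filter->rnode,
					HASH_REMOVE, NULL);
	}
}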
+	/* Main prefetch loop. */
+	for (;;)
+	{

This kind of looks like a separate process' main loop. The name
indicates similar. And there's no architecture documentation
disinclining one from that view...
OK, I have updated the comment.
The loop body is quite long. I think it should be split into a number of
helper functions. Perhaps one to ensure a block is read, one to maintain
stats, and then one to process block references?
I've broken the function up. It's now:
StartupXLOG()
-> XLogPrefetch()
-> XLogPrefetcherReadAhead()
-> XLogPrefetcherScanRecords()
-> XLogPrefetcherScanBlocks()
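To keep the redo loop cheap, the XLogPrefetch() step at the top of that
chain is intended to be a trivial inline wrapper, roughly like this (a
sketch; the real definition is in xlogprefetch.h in the attached patch and
may differ in detail):

/*
 * Sketch of the inline wrapper called from the redo loop.  It rebuilds the
 * prefetcher when relevant GUCs or the WAL source change, and otherwise
 * hands off to XLogPrefetcherReadAhead() if prefetching is configured.
 */
static inline void
XLogPrefetch(XLogPrefetchState *state, XLogRecPtr replaying_lsn, bool from_stream)
{
	if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
	{
		/* Settings or WAL source changed: rebuild the prefetcher. */
		if (state->prefetcher)
			XLogPrefetcherFree(state->prefetcher);
		if (max_recovery_prefetch_distance > 0)
			state->prefetcher = XLogPrefetcherAllocate(replaying_lsn, from_stream);
		else
			state->prefetcher = NULL;
		state->reconfigure_count = XLogPrefetchReconfigureCount;
	}

	if (state->prefetcher)
		XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
}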
+		/*
+		 * Scan the record for block references.  We might already have been
+		 * partway through processing this record when we hit maximum I/O
+		 * concurrency, so start where we left off.
+		 */
+		for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i)
+		{

Super pointless nitpickery: For a loop-body this big I'd rather name 'i'
'blockid' or such.
Done.
Attachments:
v6-0007-Allow-XLogReadRecord-to-be-non-blocking.patch (text/x-patch)
From 664ece95655bfba9ed565c77e17a1ca73b5fe11c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 7 Apr 2020 22:56:27 +1200
Subject: [PATCH v6 7/8] Allow XLogReadRecord() to be non-blocking.
Extend read_local_xlog_page() to support non-blocking modes:
1. Reading as far as the WAL receiver has written so far.
2. Reading all the way to the end, when the end LSN is unknown.
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/access/transam/xlogreader.c | 37 +++++++++++----
src/backend/access/transam/xlogutils.c | 61 +++++++++++++++++++++++--
src/backend/replication/walsender.c | 2 +-
src/include/access/xlogreader.h | 4 ++
src/include/access/xlogutils.h | 16 +++++++
5 files changed, 107 insertions(+), 13 deletions(-)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f3fea5132f..e2f2998911 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -254,6 +254,9 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
* If the reading fails for some other reason, NULL is also returned, and
* *errormsg is set to a string with details of the failure.
*
+ * If the read_page callback is one that returns XLOGPAGEREAD_WOULDBLOCK rather
+ * than waiting for WAL to arrive, NULL is also returned in that case.
+ *
* The returned pointer (or *errormsg) points to an internal buffer that's
* valid until the next call to XLogReadRecord.
*/
@@ -543,10 +546,11 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
err:
/*
- * Invalidate the read state. We might read from a different source after
- * failure.
+ * Invalidate the read state, if this was an error. We might read from a
+ * different source after failure.
*/
- XLogReaderInvalReadState(state);
+ if (readOff != XLOGPAGEREAD_WOULDBLOCK)
+ XLogReaderInvalReadState(state);
if (state->errormsg_buf[0] != '\0')
*errormsg = state->errormsg_buf;
@@ -558,8 +562,9 @@ err:
* Read a single xlog page including at least [pageptr, reqLen] of valid data
* via the read_page() callback.
*
- * Returns -1 if the required page cannot be read for some reason; errormsg_buf
- * is set in that case (unless the error occurs in the read_page callback).
+ * Returns XLOGPAGEREAD_ERROR or XLOGPAGEREAD_WOULDBLOCK if the required page
+ * cannot be read for some reason; errormsg_buf is set in the former case
+ * (unless the error occurs in the read_page callback).
*
* We fetch the page from a reader-local cache if we know we have the required
* data and if there hasn't been any error since caching the data.
@@ -656,8 +661,11 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
return readLen;
err:
+ if (readLen == XLOGPAGEREAD_WOULDBLOCK)
+ return XLOGPAGEREAD_WOULDBLOCK;
+
XLogReaderInvalReadState(state);
- return -1;
+ return XLOGPAGEREAD_ERROR;
}
/*
@@ -936,6 +944,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
XLogRecPtr found = InvalidXLogRecPtr;
XLogPageHeader header;
char *errormsg;
+ int readLen;
Assert(!XLogRecPtrIsInvalid(RecPtr));
@@ -949,7 +958,6 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
XLogRecPtr targetPagePtr;
int targetRecOff;
uint32 pageHeaderSize;
- int readLen;
/*
* Compute targetRecOff. It should typically be equal or greater than
@@ -1030,7 +1038,8 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
}
err:
- XLogReaderInvalReadState(state);
+ if (readLen != XLOGPAGEREAD_WOULDBLOCK)
+ XLogReaderInvalReadState(state);
return InvalidXLogRecPtr;
}
@@ -1081,13 +1090,23 @@ WALRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
tli != seg->ws_tli)
{
XLogSegNo nextSegNo;
-
if (seg->ws_file >= 0)
close(seg->ws_file);
XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize);
seg->ws_file = openSegment(nextSegNo, segcxt, &tli);
+ /* callback reported that there was no such file */
+ if (seg->ws_file < 0)
+ {
+ errinfo->wre_errno = errno;
+ errinfo->wre_req = segbytes;
+ errinfo->wre_read = readbytes;
+ errinfo->wre_off = startoff;
+ errinfo->wre_seg = *seg;
+ return false;
+ }
+
/* Update the current segment info. */
seg->ws_tli = tli;
seg->ws_segno = nextSegNo;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 6cb143e161..5031877e7c 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walreceiver.h"
#include "storage/smgr.h"
#include "utils/guc.h"
#include "utils/hsearch.h"
@@ -783,6 +784,30 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
}
}
+/* openSegment callback for WALRead */
+static int
+wal_segment_try_open(XLogSegNo nextSegNo,
+ WALSegmentContext *segcxt,
+ TimeLineID *tli_p)
+{
+ TimeLineID tli = *tli_p;
+ char path[MAXPGPATH];
+ int fd;
+
+ XLogFilePath(path, tli, nextSegNo, segcxt->ws_segsize);
+ fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+ if (fd >= 0)
+ return fd;
+
+ if (errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+ path)));
+
+ return -1; /* keep compiler quiet */
+}
+
/* openSegment callback for WALRead */
static int
wal_segment_open(XLogSegNo nextSegNo, WALSegmentContext * segcxt,
@@ -831,6 +856,8 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
TimeLineID tli;
int count;
WALReadError errinfo;
+ XLogReadLocalOptions *options = (XLogReadLocalOptions *) state->private_data;
+ bool try_read = false;
loc = targetPagePtr + reqLen;
@@ -845,7 +872,24 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
* notices recovery finishes, so we only have to maintain it for the
* local process until recovery ends.
*/
- if (!RecoveryInProgress())
+ if (options)
+ {
+ switch (options->read_upto_policy)
+ {
+ case XLRO_WALRCV_WRITTEN:
+ read_upto = GetWalRcvWriteRecPtr();
+ break;
+ case XLRO_END:
+ read_upto = (XLogRecPtr) -1;
+ try_read = true;
+ break;
+ default:
+ read_upto = 0;
+ elog(ERROR, "unknown read_upto_policy value");
+ break;
+ }
+ }
+ else if (!RecoveryInProgress())
read_upto = GetFlushRecPtr();
else
read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
@@ -883,6 +927,10 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
if (loc <= read_upto)
break;
+ /* not enough data there, but we were asked not to wait */
+ if (options && options->nowait)
+ return XLOGPAGEREAD_WOULDBLOCK;
+
CHECK_FOR_INTERRUPTS();
pg_usleep(1000L);
}
@@ -924,7 +972,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
else if (targetPagePtr + reqLen > read_upto)
{
/* not enough data there */
- return -1;
+ return XLOGPAGEREAD_ERROR;
}
else
{
@@ -938,8 +986,15 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
* zero-padded up to the page boundary if it's incomplete.
*/
if (!WALRead(cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &state->seg,
- &state->segcxt, wal_segment_open, &errinfo))
+ &state->segcxt,
+ try_read ? wal_segment_try_open : wal_segment_open,
+ &errinfo))
+ {
+ /* Caller asked for XLRO_END, so there may be no file at all. */
+ if (try_read)
+ return XLOGPAGEREAD_ERROR;
WALReadRaiseError(&errinfo);
+ }
/* number of valid bytes in the buffer */
return count;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 414cf67d3d..37ec3ddc7b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -818,7 +818,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
/* fail if not (implies we are going to shut down) */
if (flushptr < targetPagePtr + reqLen)
- return -1;
+ return XLOGPAGEREAD_ERROR;
if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
count = XLOG_BLCKSZ; /* more than one block available */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196e18..dc99d02b60 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -50,6 +50,10 @@ typedef struct WALSegmentContext
typedef struct XLogReaderState XLogReaderState;
+/* Special negative return values for XLogPageReadCB functions */
+#define XLOGPAGEREAD_ERROR -1
+#define XLOGPAGEREAD_WOULDBLOCK -2
+
/* Function type definition for the read_page callback */
typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
XLogRecPtr targetPagePtr,
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5181a077d9..440dffac1a 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,6 +47,22 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
extern void FreeFakeRelcacheEntry(Relation fakerel);
+/*
+ * A pointer to an XLogReadLocalOptions struct can be supplied as the private data
+ * for an XLogReader, causing read_local_xlog_page() to modify its behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+ /* Don't block waiting for new WAL to arrive. */
+ bool nowait;
+
+ /* How far to read. */
+ enum {
+ XLRO_WALRCV_WRITTEN,
+ XLRO_END
+ } read_upto_policy;
+} XLogReadLocalOptions;
+
extern int read_local_xlog_page(XLogReaderState *state,
XLogRecPtr targetPagePtr, int reqLen,
XLogRecPtr targetRecPtr, char *cur_page);
--
2.20.1
v6-0008-Prefetch-referenced-blocks-during-recovery.patch (text/x-patch)
From 29fe16d08d3da4bdb6d950f02ba71ae784562663 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 7 Apr 2020 22:56:27 +1200
Subject: [PATCH v6 8/8] Prefetch referenced blocks during recovery.
Introduce a new GUC max_recovery_prefetch_distance. If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks. The
goal is to avoid I/O stalls and benefit from concurrent I/O. The number
of concurrent asynchronous reads is capped by the existing
maintenance_io_concurrency GUC. The feature is enabled by default for
now, but we might reconsider that before release.
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
doc/src/sgml/config.sgml | 45 +
doc/src/sgml/monitoring.sgml | 71 ++
doc/src/sgml/wal.sgml | 13 +
src/backend/access/transam/Makefile | 1 +
src/backend/access/transam/xlog.c | 11 +
src/backend/access/transam/xlogprefetch.c | 900 ++++++++++++++++++
src/backend/catalog/system_views.sql | 14 +
src/backend/postmaster/pgstat.c | 96 +-
src/backend/replication/logical/logical.c | 2 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/utils/misc/guc.c | 45 +-
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/include/access/xlogprefetch.h | 81 ++
src/include/catalog/pg_proc.dat | 8 +
src/include/pgstat.h | 28 +-
src/include/utils/guc.h | 4 +
src/test/regress/expected/rules.out | 11 +
17 files changed, 1334 insertions(+), 4 deletions(-)
create mode 100644 src/backend/access/transam/xlogprefetch.c
create mode 100644 src/include/access/xlogprefetch.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f68c992213..3e60f306ff 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3121,6 +3121,51 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-max-recovery-prefetch-distance" xreflabel="max_recovery_prefetch_distance">
+ <term><varname>max_recovery_prefetch_distance</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>max_recovery_prefetch_distance</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ The maximum distance to look ahead in the WAL during recovery, to find
+ blocks to prefetch. Prefetching blocks that will soon be needed can
+ reduce I/O wait times. The number of concurrent prefetches is limited
+ by this setting as well as
+ <xref linkend="guc-maintenance-io-concurrency"/>. Setting it too high
+ might be counterproductive, if it means that data falls out of the
+ kernel cache before it is needed. If this value is specified without
+ units, it is taken as bytes. A setting of -1 disables prefetching
+ during recovery.
+ The default is 256kB on systems that support
+ <function>posix_fadvise</function>, and otherwise -1.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+ <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to prefetch blocks that were logged with full page images,
+ during recovery. Often this doesn't help, since such blocks will not
+ be read the first time they are needed and might remain in the buffer
+ pool after that. However, on file systems with a block size larger
+ than
+ <productname>PostgreSQL</productname>'s, prefetching can avoid a
+      costly read-before-write when blocks are later written.  This
+ setting has no effect unless
+ <xref linkend="guc-max-recovery-prefetch-distance"/> is set to a positive
+ number. The default is off.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
<sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c50b72137f..1229a28675 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+ <entry>Only one row, showing statistics about blocks prefetched during recovery.
+ See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
<entry>At least one row per subscription, showing information about
@@ -2223,6 +2230,68 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
connected server.
</para>
+ <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+ <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+ <tgroup cols="3">
+ <thead>
+ <row>
+ <entry>Column</entry>
+ <entry>Type</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><structfield>prefetch</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_hit</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_new</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_fpw</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+ </row>
+ <row>
+ <entry><structfield>skip_seq</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+ </row>
+ <row>
+ <entry><structfield>distance</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry>
+ </row>
+ <row>
+ <entry><structfield>queue_depth</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+ one row. It is filled with nulls if recovery is not running or WAL
+ prefetching is not enabled. See <xref linkend="guc-max-recovery-prefetch-distance"/>
+ for more information. The counters in this view are reset whenever the
+ <xref linkend="guc-max-recovery-prefetch-distance"/>,
+ <xref linkend="guc-recovery-prefetch-fpw"/> or
+ <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+ the server configuration is reloaded.
+ </para>
+
<table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
<title><structname>pg_stat_subscription</structname> View</title>
<tgroup cols="3">
@@ -3446,6 +3515,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
counters shown in the <structname>pg_stat_bgwriter</structname> view.
Calling <literal>pg_stat_reset_shared('archiver')</literal> will zero all the
counters shown in the <structname>pg_stat_archiver</structname> view.
+ Calling <literal>pg_stat_reset_shared('prefetch_recovery')</literal> will zero all the
+ counters shown in the <structname>pg_stat_prefetch_recovery</structname> view.
</entry>
</row>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index bd9fae544c..38fc8149a8 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -719,6 +719,19 @@
<acronym>WAL</acronym> call being logged to the server log. This
option might be replaced by a more general mechanism in the future.
</para>
+
+ <para>
+ The <xref linkend="guc-max-recovery-prefetch-distance"/> parameter can
+ be used to improve I/O performance during recovery by instructing
+ <productname>PostgreSQL</productname> to initiate reads
+ of disk blocks that will soon be needed, in combination with the
+ <xref linkend="guc-maintenance-io-concurrency"/> parameter. The
+ prefetching mechanism is most likely to be effective on systems
+ with <varname>full_page_writes</varname> set to
+ <varname>off</varname> (where that is safe), and where the working
+ set is larger than RAM. By default, prefetching in recovery is enabled,
+ but it can be disabled by setting the distance to -1.
+ </para>
</sect1>
<sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..39f9d4e77d 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
xlogarchive.o \
xlogfuncs.o \
xloginsert.o \
+ xlogprefetch.o \
xlogreader.o \
xlogutils.o
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 658af40816..4b7f902462 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
#include "access/xlog_internal.h"
#include "access/xlogarchive.h"
#include "access/xloginsert.h"
+#include "access/xlogprefetch.h"
#include "access/xlogreader.h"
#include "access/xlogutils.h"
#include "catalog/catversion.h"
@@ -7116,6 +7117,7 @@ StartupXLOG(void)
{
ErrorContextCallback errcallback;
TimestampTz xtime;
+ XLogPrefetchState prefetch;
InRedo = true;
@@ -7123,6 +7125,9 @@ StartupXLOG(void)
(errmsg("redo starts at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
+ /* Prepare to prefetch, if configured. */
+ XLogPrefetchBegin(&prefetch);
+
/*
* main redo apply loop
*/
@@ -7152,6 +7157,10 @@ StartupXLOG(void)
/* Handle interrupt signals of startup process */
HandleStartupProcInterrupts();
+ /* Perform WAL prefetching, if enabled. */
+ XLogPrefetch(&prefetch, xlogreader->ReadRecPtr,
+ currentSource == XLOG_FROM_STREAM);
+
/*
* Pause WAL replay, if requested by a hot-standby session via
* SetRecoveryPause().
@@ -7339,6 +7348,7 @@ StartupXLOG(void)
/*
* end of main redo apply loop
*/
+ XLogPrefetchEnd(&prefetch);
if (reachedRecoveryTarget)
{
@@ -11970,6 +11980,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
*/
currentSource = XLOG_FROM_STREAM;
startWalReceiver = true;
+ XLogPrefetchReconfigure();
break;
case XLOG_FROM_STREAM:
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
new file mode 100644
index 0000000000..c190ffb6bd
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -0,0 +1,900 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.c
+ * Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogprefetch.c
+ *
+ * The goal of this module is to read future WAL records and issue
+ * PrefetchSharedBuffer() calls for referenced blocks, so that we avoid I/O
+ * stalls in the main recovery loop. Currently, this is achieved by using a
+ * separate XLogReader to read ahead. In future, we should find a way to
+ * avoid reading and decoding each record twice.
+ *
+ * When examining a WAL record from the future, we need to consider that a
+ * referenced block or segment file might not exist on disk until this record
+ * or some earlier record has been replayed. After a crash, a file might also
+ * be missing because it was dropped by a later WAL record; in that case, it
+ * will be recreated when this record is replayed. These cases are handled by
+ * recognizing them and adding a "filter" that prevents all prefetching of a
+ * certain block range until the present WAL record has been replayed. Blocks
+ * skipped for these reasons are counted as "skip_new" (that is, cases where we
+ * didn't try to prefetch "new" blocks).
+ *
+ * There is some evidence that it's better to let the operating system detect
+ * sequential access and do its own prefetching. Explicit prefetching is
+ * therefore skipped for sequential blocks, counted with "skip_seq".
+ *
+ * The only way we currently have to know that an I/O initiated with
+ * PrefetchSharedBuffer() has completed is to call ReadBuffer(). Therefore,
+ * we track the number of potentially in-flight I/Os by using a circular
+ * buffer of LSNs. When it's full, we have to wait for recovery to replay
+ * records so that the queue depth can be reduced, before we can do any more
+ * prefetching. Ideally, this keeps us the right distance ahead to respect
+ * maintenance_io_concurrency.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetch.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log
+ * message that appears at the end of crash recovery. It's also used to send
+ * messages periodically to the stats collector, to save the counters on disk.
+ */
+#define XLOGPREFETCHER_SAMPLE_DISTANCE 0x40000
+
+/* GUCs */
+int max_recovery_prefetch_distance = -1;
+bool recovery_prefetch_fpw = false;
+
+int XLogPrefetchReconfigureCount;
+
+/*
+ * A prefetcher object. There is at most one of these in existence at a time,
+ * recreated whenever there is a configuration change.
+ */
+struct XLogPrefetcher
+{
+ /* Reader and current reading state. */
+ XLogReaderState *reader;
+ XLogReadLocalOptions options;
+ bool have_record;
+ bool shutdown;
+ int next_block_id;
+
+ /* Details of last prefetch to skip repeats and seq scans. */
+ SMgrRelation last_reln;
+ RelFileNode last_rnode;
+ BlockNumber last_blkno;
+
+ /* Online averages. */
+ uint64 samples;
+ double avg_queue_depth;
+ double avg_distance;
+ XLogRecPtr next_sample_lsn;
+
+ /* Book-keeping required to avoid accessing non-existing blocks. */
+ HTAB *filter_table;
+ dlist_head filter_queue;
+
+ /* Book-keeping required to limit concurrent prefetches. */
+ int prefetch_head;
+ int prefetch_tail;
+ int prefetch_queue_size;
+ XLogRecPtr prefetch_queue[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+ RelFileNode rnode;
+ XLogRecPtr filter_until_replayed;
+ BlockNumber filter_from_block;
+ dlist_node link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+ pg_atomic_uint64 reset_time; /* Time of last reset. */
+ pg_atomic_uint64 prefetch; /* Prefetches initiated. */
+ pg_atomic_uint64 skip_hit; /* Blocks already buffered. */
+ pg_atomic_uint64 skip_new; /* New/missing blocks filtered. */
+ pg_atomic_uint64 skip_fpw; /* FPWs skipped. */
+ pg_atomic_uint64 skip_seq; /* Sequential/repeat blocks skipped. */
+ float avg_distance;
+ float avg_queue_depth;
+
+ /* Reset counters */
+ pg_atomic_uint32 reset_request;
+ uint32 reset_handled;
+
+ /* Dynamic values */
+ int distance; /* Number of bytes ahead in the WAL. */
+ int queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+ RelFileNode rnode,
+ BlockNumber blockno,
+ XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+ RelFileNode rnode,
+ BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+static void XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static bool XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher);
+static void XLogPrefetchSaveStats(void);
+static void XLogPrefetchRestoreStats(void);
+
+static XLogPrefetchStats *Stats;
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+ return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+ pg_atomic_write_u64(&Stats->reset_time, GetCurrentTimestamp());
+ pg_atomic_write_u64(&Stats->prefetch, 0);
+ pg_atomic_write_u64(&Stats->skip_hit, 0);
+ pg_atomic_write_u64(&Stats->skip_new, 0);
+ pg_atomic_write_u64(&Stats->skip_fpw, 0);
+ pg_atomic_write_u64(&Stats->skip_seq, 0);
+ Stats->avg_distance = 0;
+ Stats->avg_queue_depth = 0;
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+ bool found;
+
+ Stats = (XLogPrefetchStats *)
+ ShmemInitStruct("XLogPrefetchStats",
+ sizeof(XLogPrefetchStats),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u32(&Stats->reset_request, 0);
+ Stats->reset_handled = 0;
+ pg_atomic_init_u64(&Stats->reset_time, GetCurrentTimestamp());
+ pg_atomic_init_u64(&Stats->prefetch, 0);
+ pg_atomic_init_u64(&Stats->skip_hit, 0);
+ pg_atomic_init_u64(&Stats->skip_new, 0);
+ pg_atomic_init_u64(&Stats->skip_fpw, 0);
+ pg_atomic_init_u64(&Stats->skip_seq, 0);
+ Stats->avg_distance = 0;
+ Stats->avg_queue_depth = 0;
+ Stats->distance = 0;
+ Stats->queue_depth = 0;
+ }
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+ XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+ pg_atomic_fetch_add_u32(&Stats->reset_request, 1);
+}
+
+/*
+ * Tell the stats collector to serialize the shared memory counters into the
+ * stats file.
+ */
+static void
+XLogPrefetchSaveStats(void)
+{
+ PgStat_RecoveryPrefetchStats serialized = {
+ .prefetch = pg_atomic_read_u64(&Stats->prefetch),
+ .skip_hit = pg_atomic_read_u64(&Stats->skip_hit),
+ .skip_new = pg_atomic_read_u64(&Stats->skip_new),
+ .skip_fpw = pg_atomic_read_u64(&Stats->skip_fpw),
+ .skip_seq = pg_atomic_read_u64(&Stats->skip_seq),
+ .stat_reset_timestamp = pg_atomic_read_u64(&Stats->reset_time)
+ };
+
+ pgstat_send_recoveryprefetch(&serialized);
+}
+
+/*
+ * Try to restore the shared memory counters from the stats file.
+ */
+static void
+XLogPrefetchRestoreStats(void)
+{
+ PgStat_RecoveryPrefetchStats *serialized = pgstat_fetch_recoveryprefetch();
+
+ if (serialized->stat_reset_timestamp != 0)
+ {
+ pg_atomic_write_u64(&Stats->prefetch, serialized->prefetch);
+ pg_atomic_write_u64(&Stats->skip_hit, serialized->skip_hit);
+ pg_atomic_write_u64(&Stats->skip_new, serialized->skip_new);
+ pg_atomic_write_u64(&Stats->skip_fpw, serialized->skip_fpw);
+ pg_atomic_write_u64(&Stats->skip_seq, serialized->skip_seq);
+ pg_atomic_write_u64(&Stats->reset_time, serialized->stat_reset_timestamp);
+ }
+}
+
+/*
+ * Initialize an XLogPrefetchState object and restore the last saved
+ * statistics from disk.
+ */
+void
+XLogPrefetchBegin(XLogPrefetchState *state)
+{
+ XLogPrefetchRestoreStats();
+
+ /* We'll reconfigure on the first call to XLogPrefetch(). */
+ state->prefetcher = NULL;
+ state->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+}
+
+/*
+ * Shut down the prefetching infrastructure, if configured.
+ */
+void
+XLogPrefetchEnd(XLogPrefetchState *state)
+{
+ XLogPrefetchSaveStats();
+
+ if (state->prefetcher)
+ XLogPrefetcherFree(state->prefetcher);
+ state->prefetcher = NULL;
+
+ Stats->queue_depth = 0;
+ Stats->distance = 0;
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming)
+{
+ XLogPrefetcher *prefetcher;
+ static HASHCTL hash_table_ctl = {
+ .keysize = sizeof(RelFileNode),
+ .entrysize = sizeof(XLogPrefetcherFilter)
+ };
+
+ /*
+ * The size of the queue is based on the maintenance_io_concurrency
+ * setting. In theory we might have a separate queue for each tablespace,
+ * but it's not clear how that should work, so for now we'll just use the
+ * general GUC to rate-limit all prefetching. We add one to the size
+ * because our circular buffer has a gap between head and tail when full.
+ */
+ prefetcher = palloc0(offsetof(XLogPrefetcher, prefetch_queue) +
+ sizeof(XLogRecPtr) * (maintenance_io_concurrency + 1));
+ prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
+ prefetcher->options.nowait = true;
+ if (streaming)
+ {
+ /*
+ * We're only allowed to read as far as the WAL receiver has written.
+ * We don't have to wait for it to be flushed, though, as recovery
+ * does, so that gives us a chance to get a bit further ahead.
+ */
+ prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+ }
+ else
+ {
+ /* Read as far as we can. */
+ prefetcher->options.read_upto_policy = XLRO_END;
+ }
+ prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+ NULL,
+ read_local_xlog_page,
+ &prefetcher->options);
+ prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+ &hash_table_ctl,
+ HASH_ELEM | HASH_BLOBS);
+ dlist_init(&prefetcher->filter_queue);
+
+ /* Prepare to read at the given LSN. */
+ ereport(LOG,
+ (errmsg("recovery started prefetching at %X/%X",
+ (uint32) (lsn >> 32), (uint32) lsn)));
+ XLogBeginRead(prefetcher->reader, lsn);
+
+ Stats->queue_depth = 0;
+ Stats->distance = 0;
+
+ return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+ /* Log final statistics. */
+ ereport(LOG,
+ (errmsg("recovery finished prefetching at %X/%X; "
+ "prefetch = " UINT64_FORMAT ", "
+ "skip_hit = " UINT64_FORMAT ", "
+ "skip_new = " UINT64_FORMAT ", "
+ "skip_fpw = " UINT64_FORMAT ", "
+ "skip_seq = " UINT64_FORMAT ", "
+ "avg_distance = %f, "
+ "avg_queue_depth = %f",
+ (uint32) (prefetcher->reader->EndRecPtr >> 32),
+ (uint32) (prefetcher->reader->EndRecPtr),
+ pg_atomic_read_u64(&Stats->prefetch),
+ pg_atomic_read_u64(&Stats->skip_hit),
+ pg_atomic_read_u64(&Stats->skip_new),
+ pg_atomic_read_u64(&Stats->skip_fpw),
+ pg_atomic_read_u64(&Stats->skip_seq),
+ Stats->avg_distance,
+ Stats->avg_queue_depth)));
+ XLogReaderFree(prefetcher->reader);
+ hash_destroy(prefetcher->filter_table);
+ pfree(prefetcher);
+}
+
+/*
+ * Called when recovery is replaying a new LSN, to check if we can read ahead.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ uint32 reset_request;
+
+ /* If an error has occurred or we've hit the end of the WAL, do nothing. */
+ if (prefetcher->shutdown)
+ return;
+
+ /*
+ * Have any in-flight prefetches definitely completed, judging by the LSN
+ * that is currently being replayed?
+ */
+ XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+ /*
+ * Do we already have the maximum permitted number of I/Os running
+ * (according to the information we have)? If so, we have to wait for at
+ * least one to complete, so give up early and let recovery catch up.
+ */
+ if (XLogPrefetcherSaturated(prefetcher))
+ return;
+
+ /*
+ * Can we drop any filters yet? This happens when the LSN that is
+ * currently being replayed has moved past a record that prevents
+ * prefetching of a block range, such as relation extension.
+ */
+ XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+ /*
+ * Have we been asked to reset our stats counters? This is checked with
+ * an unsynchronized memory read, but we'll see it eventually and we'll be
+ * accessing that cache line anyway.
+ */
+ reset_request = pg_atomic_read_u32(&Stats->reset_request);
+ if (reset_request != Stats->reset_handled)
+ {
+ XLogPrefetchResetStats();
+ Stats->reset_handled = reset_request;
+ prefetcher->avg_distance = 0;
+ prefetcher->avg_queue_depth = 0;
+ prefetcher->samples = 0;
+ }
+
+ /* OK, we can now try reading ahead. */
+ XLogPrefetcherScanRecords(prefetcher, replaying_lsn);
+}
+
+/*
+ * Read ahead as far as we are allowed to, considering the LSN that recovery
+ * is currently replaying.
+ */
+static void
+XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ XLogReaderState *reader = prefetcher->reader;
+
+ Assert(!XLogPrefetcherSaturated(prefetcher));
+
+ for (;;)
+ {
+ char *error;
+ int64 distance;
+
+ /* If we don't already have a record, then try to read one. */
+ if (!prefetcher->have_record)
+ {
+ if (!XLogReadRecord(reader, &error))
+ {
+ /* If we got an error, log it and give up. */
+ if (error)
+ {
+ ereport(LOG, (errmsg("recovery no longer prefetching: %s", error)));
+ prefetcher->shutdown = true;
+ Stats->queue_depth = 0;
+ Stats->distance = 0;
+ }
+ /* Otherwise, we'll try again later when more data is here. */
+ return;
+ }
+ prefetcher->have_record = true;
+ prefetcher->next_block_id = 0;
+ }
+
+ /* How far ahead of replay are we now? */
+ distance = prefetcher->reader->ReadRecPtr - replaying_lsn;
+
+ /* Update distance shown in shm. */
+ Stats->distance = distance;
+
+ /* Periodically recompute some statistics. */
+ if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+ {
+ /* Compute online averages. */
+ prefetcher->samples++;
+ if (prefetcher->samples == 1)
+ {
+ prefetcher->avg_distance = Stats->distance;
+ prefetcher->avg_queue_depth = Stats->queue_depth;
+ }
+ else
+ {
+ prefetcher->avg_distance +=
+ (Stats->distance - prefetcher->avg_distance) /
+ prefetcher->samples;
+ prefetcher->avg_queue_depth +=
+ (Stats->queue_depth - prefetcher->avg_queue_depth) /
+ prefetcher->samples;
+ }
+
+ /* Expose it in shared memory. */
+ Stats->avg_distance = prefetcher->avg_distance;
+ Stats->avg_queue_depth = prefetcher->avg_queue_depth;
+
+ /* Also periodically save the simple counters. */
+ XLogPrefetchSaveStats();
+
+ prefetcher->next_sample_lsn =
+ replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
+ }
+
+ /* Are we too far ahead of replay? */
+ if (distance >= max_recovery_prefetch_distance)
+ break;
+
+ /* Are we not far enough ahead? */
+ if (distance <= 0)
+ {
+ prefetcher->have_record = false; /* skip this record */
+ continue;
+ }
+
+ /*
+ * If this is a record that creates a new SMGR relation, we'll avoid
+ * prefetching anything from that rnode until it has been replayed.
+ */
+ if (replaying_lsn < reader->ReadRecPtr &&
+ XLogRecGetRmid(reader) == RM_SMGR_ID &&
+ (XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+ XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+ reader->ReadRecPtr);
+ }
+
+ /* Scan the record's block references. */
+ if (!XLogPrefetcherScanBlocks(prefetcher))
+ return;
+
+ /* Advance to the next record. */
+ prefetcher->have_record = false;
+ }
+}
+
+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+ XLogReaderState *reader = prefetcher->reader;
+
+ Assert(!XLogPrefetcherSaturated(prefetcher));
+
+ /*
+ * We might already have been partway through processing this record when
+ * our queue became saturated, so we need to start where we left off.
+ */
+ for (int block_id = prefetcher->next_block_id;
+ block_id <= reader->max_block_id;
+ ++block_id)
+ {
+ PrefetchBufferResult prefetch;
+ DecodedBkpBlock *block = &reader->blocks[block_id];
+ SMgrRelation reln;
+
+ /* Ignore everything but the main fork for now. */
+ if (block->forknum != MAIN_FORKNUM)
+ continue;
+
+ /*
+ * If there is a full page image attached, we won't be reading the
+ * page, so you might think we should skip it. However, if the
+ * underlying filesystem uses larger logical blocks than us, it
+ * might still need to perform a read-before-write some time later.
+ * Therefore, only prefetch if configured to do so.
+ */
+ if (block->has_image && !recovery_prefetch_fpw)
+ {
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
+ continue;
+ }
+
+ /*
+ * If this block will initialize a new page then it's probably an
+ * extension. Since it might create a new segment, we can't try
+ * to prefetch this block until the record has been replayed, or we
+ * might try to open a file that doesn't exist yet.
+ */
+ if (block->flags & BKPBLOCK_WILL_INIT)
+ {
+ XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+ reader->ReadRecPtr);
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+ continue;
+ }
+
+ /* Should we skip this block due to a filter? */
+ if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+ {
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+ continue;
+ }
+
+ /* Fast path for repeated references to the same relation. */
+ if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+ {
+ /*
+ * If this is a repeat or sequential access, then skip it. We
+ * expect the kernel to detect sequential access on its own and do
+ * a better job than we could.
+ */
+ if (block->blkno == prefetcher->last_blkno ||
+ block->blkno == prefetcher->last_blkno + 1)
+ {
+ prefetcher->last_blkno = block->blkno;
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
+ continue;
+ }
+
+ /* We can avoid calling smgropen(). */
+ reln = prefetcher->last_reln;
+ }
+ else
+ {
+ /* Otherwise we have to open it. */
+ reln = smgropen(block->rnode, InvalidBackendId);
+ prefetcher->last_rnode = block->rnode;
+ prefetcher->last_reln = reln;
+ }
+ prefetcher->last_blkno = block->blkno;
+
+ /* Try to prefetch this block! */
+ prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+ if (BufferIsValid(prefetch.recent_buffer))
+ {
+ /*
+ * It was already cached, so do nothing. Perhaps in future we
+ * could remember the buffer so that recovery doesn't have to look
+ * it up again.
+ */
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
+ }
+ else if (prefetch.initiated_io)
+ {
+ /*
+ * I/O has possibly been initiated (we can't tell whether the kernel
+ * already had the page cached, so for lack of better information we
+ * assume it has been). Record this as an I/O in progress until we
+ * eventually replay this LSN.
+ */
+ pg_atomic_unlocked_add_fetch_u64(&Stats->prefetch, 1);
+ XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+ /*
+ * If the queue is now full, we'll have to wait before processing
+ * any more blocks from this record, or move to a new record if
+ * that was the last block.
+ */
+ if (XLogPrefetcherSaturated(prefetcher))
+ {
+ prefetcher->next_block_id = block_id + 1;
+ return false;
+ }
+ }
+ else
+ {
+ /*
+ * Neither cached nor initiated. The underlying segment file
+ * doesn't exist. Presumably it will be unlinked by a later WAL
+ * record. When recovery reads this block, it will use the
+ * EXTENSION_CREATE_RECOVERY flag. We certainly don't want to do
+ * that sort of thing while merely prefetching, so let's just
+ * ignore references to this relation until this record is
+ * replayed, and let recovery create the dummy file or complain if
+ * something is wrong.
+ */
+ XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+ reader->ReadRecPtr);
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+ }
+ }
+
+ return true;
+}
+
+/*
+ * Expose statistics about WAL prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ Datum values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+ bool nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mod required, but it is not allowed in this context")));
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ if (pg_atomic_read_u32(&Stats->reset_request) != Stats->reset_handled)
+ {
+ /* There's an unhandled reset request, so just show NULLs */
+ for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+ nulls[i] = true;
+ }
+ else
+ {
+ for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+ nulls[i] = false;
+ }
+
+ values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&Stats->reset_time));
+ values[1] = Int64GetDatum(pg_atomic_read_u64(&Stats->prefetch));
+ values[2] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_hit));
+ values[3] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_new));
+ values[4] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_fpw));
+ values[5] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_seq));
+ values[6] = Int32GetDatum(Stats->distance);
+ values[7] = Int32GetDatum(Stats->queue_depth);
+ values[8] = Float4GetDatum(Stats->avg_distance);
+ values[9] = Float4GetDatum(Stats->avg_queue_depth);
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+ BlockNumber blockno, XLogRecPtr lsn)
+{
+ XLogPrefetcherFilter *filter;
+ bool found;
+
+ filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+ if (!found)
+ {
+ /*
+ * Don't allow any prefetching of this block or higher until replayed.
+ */
+ filter->filter_until_replayed = lsn;
+ filter->filter_from_block = blockno;
+ dlist_push_head(&prefetcher->filter_queue, &filter->link);
+ }
+ else
+ {
+ /*
+ * We were already filtering this rnode. Extend the filter's lifetime
+ * to cover this WAL record, but leave the (presumably lower) block
+ * number there because we don't want to have to track individual
+ * blocks.
+ */
+ filter->filter_until_replayed = lsn;
+ dlist_delete(&filter->link);
+ dlist_push_head(&prefetcher->filter_queue, &filter->link);
+ }
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range? That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+ {
+ XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+ link,
+ &prefetcher->filter_queue);
+
+ if (filter->filter_until_replayed >= replaying_lsn)
+ break;
+ dlist_delete(&filter->link);
+ hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+ }
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+ BlockNumber blockno)
+{
+ /*
+ * Test for empty queue first, because we expect it to be empty most of the
+ * time and we can avoid the hash table lookup in that case.
+ */
+ if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+ {
+ XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+ HASH_FIND, NULL);
+
+ if (filter && filter->filter_from_block <= blockno)
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Insert an LSN into the queue. The queue must not be full already. This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr prefetching_lsn)
+{
+ Assert(!XLogPrefetcherSaturated(prefetcher));
+ prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+ prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+ Stats->queue_depth++;
+ Assert(Stats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet? That means that they're definitely finished, so we can can
+ * forget about them and allow ourselves to initiate more prefetches. For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+ prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+ {
+ prefetcher->prefetch_tail++;
+ prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+ Stats->queue_depth--;
+ Assert(Stats->queue_depth >= 0);
+ }
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+ return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+ prefetcher->prefetch_tail;
+}
+
+void
+assign_max_recovery_prefetch_distance(int new_value, void *extra)
+{
+ /* Reconfigure prefetching, because a setting it depends on changed. */
+ max_recovery_prefetch_distance = new_value;
+ if (AmStartupProcess())
+ XLogPrefetchReconfigure();
+}
+
+void
+assign_recovery_prefetch_fpw(bool new_value, void *extra)
+{
+ /* Reconfigure prefetching, because a setting it depends on changed. */
+ recovery_prefetch_fpw = new_value;
+ if (AmStartupProcess())
+ XLogPrefetchReconfigure();
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 813ea8bfc3..3d5afb633e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -825,6 +825,20 @@ CREATE VIEW pg_stat_wal_receiver AS
FROM pg_stat_get_wal_receiver() s
WHERE s.pid IS NOT NULL;
+CREATE VIEW pg_stat_prefetch_recovery AS
+ SELECT
+ s.stats_reset,
+ s.prefetch,
+ s.skip_hit,
+ s.skip_new,
+ s.skip_fpw,
+ s.skip_seq,
+ s.distance,
+ s.queue_depth,
+ s.avg_distance,
+ s.avg_queue_depth
+ FROM pg_stat_get_prefetch_recovery() s;
+
CREATE VIEW pg_stat_subscription AS
SELECT
su.oid AS subid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 9ebde47dea..c0f7333808 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
#include "access/transam.h"
#include "access/twophase_rmgr.h"
#include "access/xact.h"
+#include "access/xlogprefetch.h"
#include "catalog/pg_database.h"
#include "catalog/pg_proc.h"
#include "common/ip.h"
@@ -276,6 +277,7 @@ static int localNumBackends = 0;
static PgStat_ArchiverStats archiverStats;
static PgStat_GlobalStats globalStats;
static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
+static PgStat_RecoveryPrefetchStats recoveryPrefetchStats;
/*
* List of OIDs of databases we need to write out. If an entry is InvalidOid,
@@ -348,6 +350,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
+static void pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1364,11 +1367,20 @@ pgstat_reset_shared_counters(const char *target)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
+ else if (strcmp(target, "prefetch_recovery") == 0)
+ {
+ /*
+ * We can't ask the stats collector to do this for us as it is not
+ * attached to shared memory.
+ */
+ XLogPrefetchRequestResetStats();
+ return;
+ }
else
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\" or \"bgwriter\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\" or \"prefetch_recovery\".")));
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
pgstat_send(&msg, sizeof(msg));
@@ -2690,6 +2702,22 @@ pgstat_fetch_slru(void)
}
+/*
+ * ---------
+ * pgstat_fetch_recoveryprefetch() -
+ *
+ * Support function for restoring the counters managed by xlogprefetch.c.
+ * ---------
+ */
+PgStat_RecoveryPrefetchStats *
+pgstat_fetch_recoveryprefetch(void)
+{
+ backend_read_statsfile();
+
+ return &recoveryPrefetchStats;
+}
+
+
/* ------------------------------------------------------------
* Functions for management of the shared-memory PgBackendStatus array
* ------------------------------------------------------------
@@ -4440,6 +4468,23 @@ pgstat_send_slru(void)
}
+/* ----------
+ * pgstat_send_recoveryprefetch() -
+ *
+ * Send recovery prefetch statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats)
+{
+ PgStat_MsgRecoveryPrefetch msg;
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYPREFETCH);
+ msg.m_stats = *stats;
+ pgstat_send(&msg, sizeof(msg));
+}
+
+
/* ----------
* PgstatCollectorMain() -
*
@@ -4636,6 +4681,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_slru(&msg.msg_slru, len);
break;
+ case PGSTAT_MTYPE_RECOVERYPREFETCH:
+ pgstat_recv_recoveryprefetch(&msg.msg_recoveryprefetch, len);
+ break;
+
case PGSTAT_MTYPE_FUNCSTAT:
pgstat_recv_funcstat(&msg.msg_funcstat, len);
break;
@@ -4911,6 +4960,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
(void) rc; /* we'll check for error with ferror */
+ /*
+ * Write recovery prefetch stats struct
+ */
+ rc = fwrite(&recoveryPrefetchStats, sizeof(recoveryPrefetchStats), 1,
+ fpout);
+ (void) rc; /* we'll check for error with ferror */
+
/*
* Walk through the database table.
*/
@@ -5170,6 +5226,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
memset(&globalStats, 0, sizeof(globalStats));
memset(&archiverStats, 0, sizeof(archiverStats));
memset(&slruStats, 0, sizeof(slruStats));
+ memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
/*
* Set the current timestamp (will be kept only in case we can't load an
@@ -5257,6 +5314,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
goto done;
}
+ /*
+ * Read recoveryPrefetchStats struct
+ */
+ if (fread(&recoveryPrefetchStats, 1, sizeof(recoveryPrefetchStats),
+ fpin) != sizeof(recoveryPrefetchStats))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", statfile)));
+ memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
+ goto done;
+ }
+
/*
* We found an existing collector stats file. Read it and put all the
* hashtable entries into place.
@@ -5556,6 +5625,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
PgStat_GlobalStats myGlobalStats;
PgStat_ArchiverStats myArchiverStats;
PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
+ PgStat_RecoveryPrefetchStats myRecoveryPrefetchStats;
FILE *fpin;
int32 format_id;
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -5621,6 +5691,18 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
return false;
}
+ /*
+ * Read recovery prefetch stats struct
+ */
+ if (fread(&myRecoveryPrefetchStats, 1, sizeof(myRecoveryPrefetchStats),
+ fpin) != sizeof(myRecoveryPrefetchStats))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", statfile)));
+ FreeFile(fpin);
+ return false;
+ }
+
/* By default, we're going to return the timestamp of the global file. */
*ts = myGlobalStats.stats_timestamp;
@@ -6422,6 +6504,18 @@ pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
slruStats[msg->m_index].truncate += msg->m_truncate;
}
+/* ----------
+ * pgstat_recv_recoveryprefetch() -
+ *
+ * Process a recovery prefetch message.
+ * ----------
+ */
+static void
+pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len)
+{
+ recoveryPrefetchStats = msg->m_stats;
+}
+
/* ----------
* pgstat_recv_recoveryconflict() -
*
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 5adf253583..792d90ef4c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -169,7 +169,7 @@ StartupDecodingContext(List *output_plugin_options,
ctx->slot = slot;
- ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, ctx);
+ ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, read_page, NULL);
if (!ctx->reader)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..221081bddc 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
#include "access/nbtree.h"
#include "access/subtrans.h"
#include "access/twophase.h"
+#include "access/xlogprefetch.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -124,6 +125,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, PredicateLockShmemSize());
size = add_size(size, ProcGlobalShmemSize());
size = add_size(size, XLOGShmemSize());
+ size = add_size(size, XLogPrefetchShmemSize());
size = add_size(size, CLOGShmemSize());
size = add_size(size, CommitTsShmemSize());
size = add_size(size, SUBTRANSShmemSize());
@@ -212,6 +214,7 @@ CreateSharedMemoryAndSemaphores(void)
* Set up xlog, clog, and buffers
*/
XLOGShmemInit();
+ XLogPrefetchShmemInit();
CLOGShmemInit();
CommitTsShmemInit();
SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 03a22d71ac..6fc9ceb196 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -34,6 +34,7 @@
#include "access/twophase.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
+#include "access/xlogprefetch.h"
#include "catalog/namespace.h"
#include "catalog/pg_authid.h"
#include "catalog/storage.h"
@@ -198,6 +199,7 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
static void assign_pgstat_temp_directory(const char *newval, void *extra);
static bool check_application_name(char **newval, void **extra, GucSource source);
static void assign_application_name(const char *newval, void *extra);
@@ -1272,6 +1274,18 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"recovery_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Prefetch blocks that have full page images in the WAL"),
+ gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+ "entirely overwritten, but if the logical page size of the filesystem is "
+ "larger than PostgreSQL's, this can be beneficial. This option has no "
+ "effect unless max_recovery_prefetch_distance is set to a positive number.")
+ },
+ &recovery_prefetch_fpw,
+ false,
+ NULL, assign_recovery_prefetch_fpw, NULL
+ },
{
{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2649,6 +2663,22 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"max_recovery_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+ gettext_noop("Maximum number of bytes to read ahead in the WAL to prefetch referenced blocks."),
+ gettext_noop("Set to -1 to disable prefetching during recovery."),
+ GUC_UNIT_BYTE
+ },
+ &max_recovery_prefetch_distance,
+#ifdef USE_PREFETCH
+ 256 * 1024,
+#else
+ -1,
+#endif
+ -1, INT_MAX,
+ NULL, assign_max_recovery_prefetch_distance, NULL
+ },
+
{
{"wal_keep_segments", PGC_SIGHUP, REPLICATION_SENDING,
gettext_noop("Sets the number of WAL files held for standby servers."),
@@ -2955,7 +2985,8 @@ static struct config_int ConfigureNamesInt[] =
0,
#endif
0, MAX_IO_CONCURRENCY,
- check_maintenance_io_concurrency, NULL, NULL
+ check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+ NULL
},
{
@@ -11573,6 +11604,18 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
return true;
}
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+ /*
+ * Reconfigure WAL prefetching, because a setting it depends on changed.
+ */
+ maintenance_io_concurrency = newval;
+ if (AmStartupProcess())
+ XLogPrefetchReconfigure();
+#endif
+}
+
static void
assign_pgstat_temp_directory(const char *newval, void *extra)
{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 1ae8b77306..fd7406b399 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -230,6 +230,11 @@
#checkpoint_flush_after = 0 # measured in pages, 0 disables
#checkpoint_warning = 30s # 0 disables
+# - Prefetching during recovery -
+
+#max_recovery_prefetch_distance = 256kB # -1 disables prefetching
+#recovery_prefetch_fpw = off # whether to prefetch pages logged with FPW
+
# - Archiving -
#archive_mode = off # enables archiving; off, on, or always
diff --git a/src/include/access/xlogprefetch.h b/src/include/access/xlogprefetch.h
new file mode 100644
index 0000000000..afd807c408
--- /dev/null
+++ b/src/include/access/xlogprefetch.h
@@ -0,0 +1,81 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.h
+ * Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/xlogprefetch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCH_H
+#define XLOGPREFETCH_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern int max_recovery_prefetch_distance;
+extern bool recovery_prefetch_fpw;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern int XLogPrefetchReconfigureCount;
+
+typedef struct XLogPrefetchState
+{
+ XLogPrefetcher *prefetcher;
+ int reconfigure_count;
+} XLogPrefetchState;
+
+/* Functions exposed only for use by the static inline wrappers below. */
+extern XLogPrefetcher *XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch,
+ XLogRecPtr replaying_lsn);
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchReconfigure(void);
+extern void XLogPrefetchRequestResetStats(void);
+
+extern void XLogPrefetchBegin(XLogPrefetchState *state);
+extern void XLogPrefetchEnd(XLogPrefetchState *state);
+
+/*
+ * Tell the prefetching module that we are now replaying a given LSN, so that
+ * it can decide how far ahead to read in the WAL, if configured.
+ */
+static inline void
+XLogPrefetch(XLogPrefetchState *state,
+ XLogRecPtr replaying_lsn,
+ bool from_stream)
+{
+ /*
+ * Handle any configuration changes. Rather than trying to deal with
+ * various parameter changes, we just tear down and set up a new
+ * prefetcher if anything we depend on changes.
+ */
+ if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
+ {
+ /* If we had a prefetcher, tear it down. */
+ if (state->prefetcher)
+ {
+ XLogPrefetcherFree(state->prefetcher);
+ state->prefetcher = NULL;
+ }
+ /* If we want a prefetcher, set it up. */
+ if (max_recovery_prefetch_distance > 0)
+ state->prefetcher = XLogPrefetcherAllocate(replaying_lsn,
+ from_stream);
+ state->reconfigure_count = XLogPrefetchReconfigureCount;
+ }
+
+ if (state->prefetcher)
+ XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
+}
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2d1862a9d8..a0dabe2d18 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6138,6 +6138,14 @@
prorettype => 'bool', proargtypes => '',
prosrc => 'pg_is_wal_replay_paused' },
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+ proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+ proretset => 't', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,float4,float4}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{stats_reset,prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth,avg_distance,avg_queue_depth}',
+ prosrc => 'pg_stat_get_prefetch_recovery' },
+
{ oid => '2621', descr => 'reload configuration files',
proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9988..701eeaeb01 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -63,6 +63,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_SLRU,
+ PGSTAT_MTYPE_RECOVERYPREFETCH,
PGSTAT_MTYPE_FUNCSTAT,
PGSTAT_MTYPE_FUNCPURGE,
PGSTAT_MTYPE_RECOVERYCONFLICT,
@@ -183,6 +184,19 @@ typedef struct PgStat_TableXactStatus
struct PgStat_TableXactStatus *next; /* next of same subxact */
} PgStat_TableXactStatus;
+/*
+ * Recovery prefetching statistics persisted on disk by pgstat.c, but kept in
+ * shared memory by xlogprefetch.c.
+ */
+typedef struct PgStat_RecoveryPrefetchStats
+{
+ PgStat_Counter prefetch;
+ PgStat_Counter skip_hit;
+ PgStat_Counter skip_new;
+ PgStat_Counter skip_fpw;
+ PgStat_Counter skip_seq;
+ TimestampTz stat_reset_timestamp;
+} PgStat_RecoveryPrefetchStats;
/* ------------------------------------------------------------
* Message formats follow
@@ -454,6 +468,16 @@ typedef struct PgStat_MsgSLRU
PgStat_Counter m_truncate;
} PgStat_MsgSLRU;
+/* ----------
+ * PgStat_MsgRecoveryPrefetch Sent by XLogPrefetch to save statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgRecoveryPrefetch
+{
+ PgStat_MsgHdr m_hdr;
+ PgStat_RecoveryPrefetchStats m_stats;
+} PgStat_MsgRecoveryPrefetch;
+
/* ----------
* PgStat_MsgRecoveryConflict Sent by the backend upon recovery conflict
* ----------
@@ -598,6 +622,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgSLRU msg_slru;
+ PgStat_MsgRecoveryPrefetch msg_recoveryprefetch;
PgStat_MsgFuncstat msg_funcstat;
PgStat_MsgFuncpurge msg_funcpurge;
PgStat_MsgRecoveryConflict msg_recoveryconflict;
@@ -761,7 +786,6 @@ typedef struct PgStat_SLRUStats
TimestampTz stat_reset_timestamp;
} PgStat_SLRUStats;
-
/* ----------
* Backend states
* ----------
@@ -1464,6 +1488,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
/* ----------
* Support functions for the SQL-callable functions to
@@ -1479,6 +1504,7 @@ extern int pgstat_fetch_stat_numbackends(void);
extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
extern PgStat_GlobalStats *pgstat_fetch_global(void);
extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_RecoveryPrefetchStats *pgstat_fetch_recoveryprefetch(void);
extern void pgstat_count_slru_page_zeroed(SlruCtl ctl);
extern void pgstat_count_slru_page_hit(SlruCtl ctl);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 2819282181..976cf8b116 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -440,4 +440,8 @@ extern void assign_search_path(const char *newval, void *extra);
extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+/* in access/transam/xlogprefetch.c */
+extern void assign_max_recovery_prefetch_distance(int new_value, void *extra);
+extern void assign_recovery_prefetch_fpw(bool new_value, void *extra);
+
#endif /* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6eec8ec568..9eda632b3c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1855,6 +1855,17 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+ s.prefetch,
+ s.skip_hit,
+ s.skip_new,
+ s.skip_fpw,
+ s.skip_seq,
+ s.distance,
+ s.queue_depth,
+ s.avg_distance,
+ s.avg_queue_depth
+ FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth, avg_distance, avg_queue_depth);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
--
2.20.1
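For illustration only (this is not part of the patch set): a rough sketch of how the startup process's redo loop is expected to drive the prefetcher through the XLogPrefetchState wrapper declared in xlogprefetch.h above. Record fetching and replay are elided, and the function name here is made up.

#include "postgres.h"

#include "access/xlogprefetch.h"
#include "access/xlogreader.h"

/* Sketch only: 'reader' stands in for the main recovery XLogReader. */
static void
redo_loop_sketch(XLogReaderState *reader, bool from_stream)
{
	XLogPrefetchState prefetch;

	XLogPrefetchBegin(&prefetch);
	for (;;)
	{
		/* ... read the next record into 'reader', break at end of WAL ... */

		/*
		 * Report the LSN we are about to replay; the prefetcher looks ahead
		 * up to max_recovery_prefetch_distance and calls
		 * PrefetchSharedBuffer() for referenced blocks.
		 */
		XLogPrefetch(&prefetch, reader->ReadRecPtr, from_stream);

		/* ... apply the record ... */
	}
	XLogPrefetchEnd(&prefetch);
}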
Attachment: v6-0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRela.patch (text/x-patch)
From 956224dfcd9dff3327751323bbb03fdb098dc0a0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 15:25:55 +1300
Subject: [PATCH v6 1/8] Allow PrefetchBuffer() to be called with a
SMgrRelation.
Previously a Relation was required, but it's annoying to have to create
a "fake" one in recovery. A new function PrefetchSharedBuffer() is
provided that works with SMgrRelation, and LocalPrefetchBuffer() is
renamed to PrefetchLocalBuffer() to fit with that more natural naming
scheme.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/storage/buffer/bufmgr.c | 84 ++++++++++++++++-----------
src/backend/storage/buffer/localbuf.c | 4 +-
src/include/storage/buf_internals.h | 2 +-
src/include/storage/bufmgr.h | 3 +
4 files changed, 56 insertions(+), 37 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7317ac8a2c..22087a1c3c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -480,6 +480,53 @@ static int ckpt_buforder_comparator(const void *pa, const void *pb);
static int ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
+/*
+ * Implementation of PrefetchBuffer() for shared buffers.
+ */
+void
+PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+ ForkNumber forkNum,
+ BlockNumber blockNum)
+{
+#ifdef USE_PREFETCH
+ BufferTag newTag; /* identity of requested block */
+ uint32 newHash; /* hash value for newTag */
+ LWLock *newPartitionLock; /* buffer partition lock for it */
+ int buf_id;
+
+ Assert(BlockNumberIsValid(blockNum));
+
+ /* create a tag so we can lookup the buffer */
+ INIT_BUFFERTAG(newTag, smgr_reln->smgr_rnode.node,
+ forkNum, blockNum);
+
+ /* determine its hash code and partition lock ID */
+ newHash = BufTableHashCode(&newTag);
+ newPartitionLock = BufMappingPartitionLock(newHash);
+
+ /* see if the block is in the buffer pool already */
+ LWLockAcquire(newPartitionLock, LW_SHARED);
+ buf_id = BufTableLookup(&newTag, newHash);
+ LWLockRelease(newPartitionLock);
+
+ /* If not in buffers, initiate prefetch */
+ if (buf_id < 0)
+ smgrprefetch(smgr_reln, forkNum, blockNum);
+
+ /*
+ * If the block *is* in buffers, we do nothing. This is not really ideal:
+ * the block might be just about to be evicted, which would be stupid
+ * since we know we are going to need it soon. But the only easy answer
+ * is to bump the usage_count, which does not seem like a great solution:
+ * when the caller does ultimately touch the block, usage_count would get
+ * bumped again, resulting in too much favoritism for blocks that are
+ * involved in a prefetch sequence. A real fix would involve some
+ * additional per-buffer state, and it's not clear that there's enough of
+ * a problem to justify that.
+ */
+#endif /* USE_PREFETCH */
+}
+
/*
* PrefetchBuffer -- initiate asynchronous read of a block of a relation
*
@@ -507,43 +554,12 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
errmsg("cannot access temporary tables of other sessions")));
/* pass it off to localbuf.c */
- LocalPrefetchBuffer(reln->rd_smgr, forkNum, blockNum);
+ PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
}
else
{
- BufferTag newTag; /* identity of requested block */
- uint32 newHash; /* hash value for newTag */
- LWLock *newPartitionLock; /* buffer partition lock for it */
- int buf_id;
-
- /* create a tag so we can lookup the buffer */
- INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
- forkNum, blockNum);
-
- /* determine its hash code and partition lock ID */
- newHash = BufTableHashCode(&newTag);
- newPartitionLock = BufMappingPartitionLock(newHash);
-
- /* see if the block is in the buffer pool already */
- LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
- LWLockRelease(newPartitionLock);
-
- /* If not in buffers, initiate prefetch */
- if (buf_id < 0)
- smgrprefetch(reln->rd_smgr, forkNum, blockNum);
-
- /*
- * If the block *is* in buffers, we do nothing. This is not really
- * ideal: the block might be just about to be evicted, which would be
- * stupid since we know we are going to need it soon. But the only
- * easy answer is to bump the usage_count, which does not seem like a
- * great solution: when the caller does ultimately touch the block,
- * usage_count would get bumped again, resulting in too much
- * favoritism for blocks that are involved in a prefetch sequence. A
- * real fix would involve some additional per-buffer state, and it's
- * not clear that there's enough of a problem to justify that.
- */
+ /* pass it to the shared buffer version */
+ PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
}
#endif /* USE_PREFETCH */
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index cac08e1b1a..b528bc9553 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -54,14 +54,14 @@ static Block GetLocalBufferStorage(void);
/*
- * LocalPrefetchBuffer -
+ * PrefetchLocalBuffer -
* initiate asynchronous read of a block of a relation
*
* Do PrefetchBuffer's work for temporary relations.
* No-op if prefetching isn't compiled in.
*/
void
-LocalPrefetchBuffer(SMgrRelation smgr, ForkNumber forkNum,
+PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum)
{
#ifdef USE_PREFETCH
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index bf3b8ad340..166fe334c7 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -327,7 +327,7 @@ extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
/* localbuf.c */
-extern void LocalPrefetchBuffer(SMgrRelation smgr, ForkNumber forkNum,
+extern void PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum, bool *foundPtr);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index bf3b12a2de..39660aacba 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -162,6 +162,9 @@ extern PGDLLIMPORT int32 *LocalRefCount;
/*
* prototypes for functions in bufmgr.c
*/
+extern void PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+ ForkNumber forkNum,
+ BlockNumber blockNum);
extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
--
2.20.1
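To make the point of 0001 concrete, here's a minimal sketch (again not from the patch) of prefetching a block given only a RelFileNode, which is all recovery has; no fake Relation is needed any more.

#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/smgr.h"

/* Sketch only: 'rnode' and 'blkno' would come from a decoded WAL record. */
static void
prefetch_block_sketch(RelFileNode rnode, BlockNumber blkno)
{
	SMgrRelation reln = smgropen(rnode, InvalidBackendId);

	PrefetchSharedBuffer(reln, MAIN_FORKNUM, blkno);
}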
Attachment: v6-0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecP.patch (text/x-patch)
From 260563b1400f32a94ee4cc5e4552d17fecfb4ea3 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 15:28:08 +1300
Subject: [PATCH v6 2/8] Rename GetWalRcvWriteRecPtr() to
GetWalRcvFlushRecPtr().
The new name better reflects the fact that the value it returns is
updated only when received data has been flushed to disk. Also rename a
couple of variables relating to this value.
An upcoming patch will make use of the latest data that was written
without waiting for it to be flushed, so let's use more precise function
names.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/access/transam/xlog.c | 20 ++++++++++----------
src/backend/access/transam/xlogfuncs.c | 2 +-
src/backend/replication/README | 2 +-
src/backend/replication/walreceiver.c | 10 +++++-----
src/backend/replication/walreceiverfuncs.c | 12 ++++++------
src/backend/replication/walsender.c | 2 +-
src/include/replication/walreceiver.h | 8 ++++----
7 files changed, 28 insertions(+), 28 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index abf954ba39..658af40816 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -207,8 +207,8 @@ HotStandbyState standbyState = STANDBY_DISABLED;
static XLogRecPtr LastRec;
-/* Local copy of WalRcv->receivedUpto */
-static XLogRecPtr receivedUpto = 0;
+/* Local copy of WalRcv->flushedUpto */
+static XLogRecPtr flushedUpto = 0;
static TimeLineID receiveTLI = 0;
/*
@@ -9335,7 +9335,7 @@ CreateRestartPoint(int flags)
* Retreat _logSegNo using the current end of xlog replayed or received,
* whichever is later.
*/
- receivePtr = GetWalRcvWriteRecPtr(NULL, NULL);
+ receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
replayPtr = GetXLogReplayRecPtr(&replayTLI);
endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
KeepLogSeg(endptr, &_logSegNo);
@@ -11732,7 +11732,7 @@ retry:
/* See if we need to retrieve more data */
if (readFile < 0 ||
(readSource == XLOG_FROM_STREAM &&
- receivedUpto < targetPagePtr + reqLen))
+ flushedUpto < targetPagePtr + reqLen))
{
if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
private->randAccess,
@@ -11763,10 +11763,10 @@ retry:
*/
if (readSource == XLOG_FROM_STREAM)
{
- if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
+ if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
readLen = XLOG_BLCKSZ;
else
- readLen = XLogSegmentOffset(receivedUpto, wal_segment_size) -
+ readLen = XLogSegmentOffset(flushedUpto, wal_segment_size) -
targetPageOff;
}
else
@@ -12181,7 +12181,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
PrimarySlotName,
wal_receiver_create_temp_slot);
- receivedUpto = 0;
+ flushedUpto = 0;
}
/*
@@ -12205,14 +12205,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* XLogReceiptTime will not advance, so the grace time
* allotted to conflicting queries will decrease.
*/
- if (RecPtr < receivedUpto)
+ if (RecPtr < flushedUpto)
havedata = true;
else
{
XLogRecPtr latestChunkStart;
- receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI);
- if (RecPtr < receivedUpto && receiveTLI == curFileTLI)
+ flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
+ if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
{
havedata = true;
if (latestChunkStart <= RecPtr)
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b84ba57259..00e1b33ed5 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -398,7 +398,7 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
{
XLogRecPtr recptr;
- recptr = GetWalRcvWriteRecPtr(NULL, NULL);
+ recptr = GetWalRcvFlushRecPtr(NULL, NULL);
if (recptr == 0)
PG_RETURN_NULL();
diff --git a/src/backend/replication/README b/src/backend/replication/README
index 0cbb990613..8ccdd86e74 100644
--- a/src/backend/replication/README
+++ b/src/backend/replication/README
@@ -54,7 +54,7 @@ and WalRcvData->slotname, and initializes the starting point in
WalRcvData->receiveStart.
As walreceiver receives WAL from the master server, and writes and flushes
-it to disk (in pg_wal), it updates WalRcvData->receivedUpto and signals
+it to disk (in pg_wal), it updates WalRcvData->flushedUpto and signals
the startup process to know how far WAL replay can advance.
Walreceiver sends information about replication progress to the master server
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index aee67c61aa..1363c3facc 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -12,7 +12,7 @@
* in the primary server), and then keeps receiving XLOG records and
* writing them to the disk as long as the connection is alive. As XLOG
* records are received and flushed to disk, it updates the
- * WalRcv->receivedUpto variable in shared memory, to inform the startup
+ * WalRcv->flushedUpto variable in shared memory, to inform the startup
* process of how far it can proceed with XLOG replay.
*
* A WAL receiver cannot directly load GUC parameters used when establishing
@@ -1005,10 +1005,10 @@ XLogWalRcvFlush(bool dying)
/* Update shared-memory status */
SpinLockAcquire(&walrcv->mutex);
- if (walrcv->receivedUpto < LogstreamResult.Flush)
+ if (walrcv->flushedUpto < LogstreamResult.Flush)
{
- walrcv->latestChunkStart = walrcv->receivedUpto;
- walrcv->receivedUpto = LogstreamResult.Flush;
+ walrcv->latestChunkStart = walrcv->flushedUpto;
+ walrcv->flushedUpto = LogstreamResult.Flush;
walrcv->receivedTLI = ThisTimeLineID;
}
SpinLockRelease(&walrcv->mutex);
@@ -1361,7 +1361,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
state = WalRcv->walRcvState;
receive_start_lsn = WalRcv->receiveStart;
receive_start_tli = WalRcv->receiveStartTLI;
- received_lsn = WalRcv->receivedUpto;
+ received_lsn = WalRcv->flushedUpto;
received_tli = WalRcv->receivedTLI;
last_send_time = WalRcv->lastMsgSendTime;
last_receipt_time = WalRcv->lastMsgReceiptTime;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 21d1823607..32260c2236 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -282,11 +282,11 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
/*
* If this is the first startup of walreceiver (on this timeline),
- * initialize receivedUpto and latestChunkStart to the starting point.
+ * initialize flushedUpto and latestChunkStart to the starting point.
*/
if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
{
- walrcv->receivedUpto = recptr;
+ walrcv->flushedUpto = recptr;
walrcv->receivedTLI = tli;
walrcv->latestChunkStart = recptr;
}
@@ -304,7 +304,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
}
/*
- * Returns the last+1 byte position that walreceiver has written.
+ * Returns the last+1 byte position that walreceiver has flushed.
*
* Optionally, returns the previous chunk start, that is the first byte
* written in the most recent walreceiver flush cycle. Callers not
@@ -312,13 +312,13 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
* receiveTLI.
*/
XLogRecPtr
-GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
{
WalRcvData *walrcv = WalRcv;
XLogRecPtr recptr;
SpinLockAcquire(&walrcv->mutex);
- recptr = walrcv->receivedUpto;
+ recptr = walrcv->flushedUpto;
if (latestChunkStart)
*latestChunkStart = walrcv->latestChunkStart;
if (receiveTLI)
@@ -345,7 +345,7 @@ GetReplicationApplyDelay(void)
TimestampTz chunkReplayStartTime;
SpinLockAcquire(&walrcv->mutex);
- receivePtr = walrcv->receivedUpto;
+ receivePtr = walrcv->flushedUpto;
SpinLockRelease(&walrcv->mutex);
replayPtr = GetXLogReplayRecPtr(NULL);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9e5611574c..414cf67d3d 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2949,7 +2949,7 @@ GetStandbyFlushRecPtr(void)
* has streamed, but hasn't been replayed yet.
*/
- receivePtr = GetWalRcvWriteRecPtr(NULL, &receiveTLI);
+ receivePtr = GetWalRcvFlushRecPtr(NULL, &receiveTLI);
replayPtr = GetXLogReplayRecPtr(&replayTLI);
ThisTimeLineID = replayTLI;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index cf3e43128c..6298ca07be 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -73,19 +73,19 @@ typedef struct
TimeLineID receiveStartTLI;
/*
- * receivedUpto-1 is the last byte position that has already been
+ * flushedUpto-1 is the last byte position that has already been
* received, and receivedTLI is the timeline it came from. At the first
* startup of walreceiver, these are set to receiveStart and
* receiveStartTLI. After that, walreceiver updates these whenever it
* flushes the received WAL to disk.
*/
- XLogRecPtr receivedUpto;
+ XLogRecPtr flushedUpto;
TimeLineID receivedTLI;
/*
* latestChunkStart is the starting byte position of the current "batch"
* of received WAL. It's actually the same as the previous value of
- * receivedUpto before the last flush to disk. Startup process can use
+ * flushedUpto before the last flush to disk. Startup process can use
* this to detect whether it's keeping up or not.
*/
XLogRecPtr latestChunkStart;
@@ -322,7 +322,7 @@ extern bool WalRcvRunning(void);
extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
const char *conninfo, const char *slotname,
bool create_temp_slot);
-extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(void);
--
2.20.1
Attachment: v6-0003-Add-GetWalRcvWriteRecPtr-new-definition.patch (text/x-patch)
From 1ae59996301aefd91efd10fcbee50b2c3d7c140e Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 9 Dec 2019 17:22:07 +1300
Subject: [PATCH v6 3/8] Add GetWalRcvWriteRecPtr() (new definition).
A later patch will read received WAL to prefetch referenced blocks,
without waiting for the data to be flushed to disk. To do that, it
needs to be able to see the write pointer advancing in shared memory.
The function formerly bearing name was recently renamed to
GetWalRcvFlushRecPtr(), which better described what it does.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/replication/walreceiver.c | 5 +++++
src/backend/replication/walreceiverfuncs.c | 12 ++++++++++++
src/include/replication/walreceiver.h | 10 ++++++++++
3 files changed, 27 insertions(+)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 1363c3facc..d69fb90132 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -261,6 +261,8 @@ WalReceiverMain(void)
SpinLockRelease(&walrcv->mutex);
+ pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
+
/* Arrange to clean up at walreceiver exit */
on_shmem_exit(WalRcvDie, 0);
@@ -984,6 +986,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
LogstreamResult.Write = recptr;
}
+
+ /* Update shared-memory status */
+ pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
}
/*
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 32260c2236..4afad83539 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -328,6 +328,18 @@ GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
return recptr;
}
+/*
+ * Returns the last+1 byte position that walreceiver has written.
+ * This returns a recently written value without taking a lock.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtr(void)
+{
+ WalRcvData *walrcv = WalRcv;
+
+ return pg_atomic_read_u64(&walrcv->writtenUpto);
+}
+
/*
* Returns the replication apply delay in ms or -1
* if the apply delay info is not available
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 6298ca07be..f1aa6e9977 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -16,6 +16,7 @@
#include "access/xlogdefs.h"
#include "getaddrinfo.h" /* for NI_MAXHOST */
#include "pgtime.h"
+#include "port/atomics.h"
#include "replication/logicalproto.h"
#include "replication/walsender.h"
#include "storage/latch.h"
@@ -141,6 +142,14 @@ typedef struct
slock_t mutex; /* locks shared variables shown above */
+ /*
+ * Like flushedUpto, but advanced after writing and before flushing,
+ * without the need to acquire the spin lock. Data can be read by another
+ * process up to this point, but shouldn't be used for data integrity
+ * purposes.
+ */
+ pg_atomic_uint64 writtenUpto;
+
/*
* force walreceiver reply? This doesn't need to be locked; memory
* barriers for ordering are sufficient. But we do need atomic fetch and
@@ -323,6 +332,7 @@ extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
const char *conninfo, const char *slotname,
bool create_temp_slot);
extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(void);
extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(void);
--
2.20.1
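As a small illustration of why 0002 and 0003 are split out (not part of the patches): once both are applied there are two distinct pointers with different guarantees, which a read-ahead-only consumer can exploit.

#include "postgres.h"

#include "replication/walreceiver.h"

static void
walrcv_pointers_sketch(void)
{
	/* advanced after write(), readable without any lock */
	XLogRecPtr	written = GetWalRcvWriteRecPtr();

	/* advanced only once the data is flushed, read under the spinlock */
	XLogRecPtr	flushed = GetWalRcvFlushRecPtr(NULL, NULL);

	/*
	 * Prefetching may look at WAL up to 'written'; anything with data
	 * integrity implications keeps using 'flushed', as before.
	 */
	(void) written;
	(void) flushed;
}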
Attachment: v6-0004-Add-pg_atomic_unlocked_add_fetch_XXX.patch (text/x-patch)
From 62556d7ecaed1e225afdf4c8a7b51e66d9affab4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 28 Mar 2020 11:42:59 +1300
Subject: [PATCH v6 4/8] Add pg_atomic_unlocked_add_fetch_XXX().
Add a variant of pg_atomic_add_fetch_XXX with no barrier semantics, for
cases where you only want to avoid the possibility that a concurrent
pg_atomic_read_XXX() sees a torn/partial value.
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/include/port/atomics.h | 24 ++++++++++++++++++++++
src/include/port/atomics/generic.h | 33 ++++++++++++++++++++++++++++++
2 files changed, 57 insertions(+)
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 4956ec55cb..2abb852893 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -389,6 +389,21 @@ pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
return pg_atomic_add_fetch_u32_impl(ptr, add_);
}
+/*
+ * pg_atomic_unlocked_add_fetch_u32 - add to variable, without full atomicity
+ *
+ * Like pg_atomic_unlocked_write_u32, guarantees only that partial values
+ * cannot be observed.
+ *
+ * No barrier semantics.
+ */
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+ AssertPointerAlignment(ptr, 4);
+ return pg_atomic_unlocked_add_fetch_u32_impl(ptr, add_);
+}
+
/*
* pg_atomic_sub_fetch_u32 - atomically subtract from variable
*
@@ -519,6 +534,15 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
}
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+ AssertPointerAlignment(ptr, 8);
+#endif
+ return pg_atomic_unlocked_add_fetch_u64_impl(ptr, add_);
+}
+
#undef INSIDE_ATOMICS_H
#endif /* ATOMICS_H */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index d3ba89a58f..1683653ca6 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -234,6 +234,16 @@ pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
}
#endif
+#if !defined(PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32)
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+ ptr->value += add_;
+ return ptr->value;
+}
+#endif
+
#if !defined(PG_HAVE_ATOMIC_SUB_FETCH_U32) && defined(PG_HAVE_ATOMIC_FETCH_SUB_U32)
#define PG_HAVE_ATOMIC_SUB_FETCH_U32
static inline uint32
@@ -399,3 +409,26 @@ pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
}
#endif
+
+#if defined(PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY) && \
+ !defined(PG_HAVE_ATOMIC_U64_SIMULATION)
+
+#ifndef PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+ ptr->value += val;
+ return ptr->value;
+}
+#endif
+
+#else
+
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+ return pg_atomic_add_fetch_u64_impl(ptr, val);
+}
+
+#endif
--
2.20.1
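And a tiny sketch (not part of 0004) of the intended usage pattern for the new unlocked add: a single writer bumping a counter in shared memory, with concurrent readers that only need protection against torn values.

#include "postgres.h"

#include "port/atomics.h"

typedef struct SketchStats
{
	pg_atomic_uint64 prefetch;	/* e.g. blocks prefetched */
} SketchStats;

/* single writer (here, the startup process): no lock, no barrier */
static void
sketch_count_prefetch(SketchStats *stats)
{
	pg_atomic_unlocked_add_fetch_u64(&stats->prefetch, 1);
}

/* any other backend: plain atomic read, never sees a torn value */
static uint64
sketch_read_prefetch(SketchStats *stats)
{
	return pg_atomic_read_u64(&stats->prefetch);
}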
Attachment: v6-0005-Allow-PrefetchBuffer-to-report-what-happened.patch (text/x-patch)
From 6fffe00e39ec837cb08afb57bce413b8fad456ed Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 17:26:41 +1300
Subject: [PATCH v6 5/8] Allow PrefetchBuffer() to report what happened.
Report whether a prefetch was actually initiated due to a cache miss, so
that callers can limit the number of concurrent I/Os they try to issue,
without counting the prefetch calls that did nothing because the page
was already in our buffers.
If the requested block was already cached, return a valid buffer. This
might enable future code to avoid a buffer mapping lookup, though it
will need to recheck the buffer before using it because it's not pinned
so could be reclaimed at any time.
Report neither hit nor miss when a relation's backing file is missing,
to prepare for use during recovery. This will be used to handle cases
of relations that are referenced in the WAL but have been unlinked
already due to actions covered by WAL records that haven't been replayed
yet, after a crash.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/storage/buffer/bufmgr.c | 57 +++++++++++++++++++++------
src/backend/storage/buffer/localbuf.c | 18 ++++++---
src/backend/storage/smgr/md.c | 9 ++++-
src/backend/storage/smgr/smgr.c | 10 +++--
src/include/storage/buf_internals.h | 5 ++-
src/include/storage/bufmgr.h | 19 ++++++---
src/include/storage/md.h | 2 +-
src/include/storage/smgr.h | 2 +-
8 files changed, 90 insertions(+), 32 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 22087a1c3c..23f269ae74 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -483,14 +483,14 @@ static int ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
/*
* Implementation of PrefetchBuffer() for shared buffers.
*/
-void
+PrefetchBufferResult
PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
ForkNumber forkNum,
BlockNumber blockNum)
{
-#ifdef USE_PREFETCH
- BufferTag newTag; /* identity of requested block */
- uint32 newHash; /* hash value for newTag */
+ PrefetchBufferResult result = {InvalidBuffer, false};
+ BufferTag newTag; /* identity of requested block */
+ uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
int buf_id;
@@ -511,7 +511,25 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
/* If not in buffers, initiate prefetch */
if (buf_id < 0)
- smgrprefetch(smgr_reln, forkNum, blockNum);
+ {
+#ifdef USE_PREFETCH
+ /*
+ * Try to initiate an asynchronous read. This returns false in
+ * recovery if the relation file doesn't exist.
+ */
+ if (smgrprefetch(smgr_reln, forkNum, blockNum))
+ result.initiated_io = true;
+#endif /* USE_PREFETCH */
+ }
+ else
+ {
+ /*
+ * Report the buffer it was in at that time. The caller may be able
+ * to avoid a buffer table lookup, but it's not pinned and it must be
+ * rechecked!
+ */
+ result.recent_buffer = buf_id + 1;
+ }
/*
* If the block *is* in buffers, we do nothing. This is not really ideal:
@@ -524,7 +542,8 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
* additional per-buffer state, and it's not clear that there's enough of
* a problem to justify that.
*/
-#endif /* USE_PREFETCH */
+
+ return result;
}
/*
@@ -533,12 +552,27 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
* This is named by analogy to ReadBuffer but doesn't actually allocate a
* buffer. Instead it tries to ensure that a future ReadBuffer for the given
* block will not be delayed by the I/O. Prefetching is optional.
- * No-op if prefetching isn't compiled in.
+ *
+ * There are three possible outcomes:
+ *
+ * 1. If the block is already cached, the result includes a valid buffer that
+ * could be used by the caller to avoid the need for a later buffer lookup, but
+ * it's not pinned, so the caller must recheck it.
+ *
+ * 2. If the kernel has been asked to initiate I/O, the initiated_io member is
+ * true. Currently there is no way to know if the data was already cached by
+ * the kernel and therefore didn't really initiate I/O, and no way to know when
+ * the I/O completes other than using synchronous ReadBuffer().
+ *
+ * 3. Otherwise, the buffer wasn't already cached by PostgreSQL, and either
+ * USE_PREFETCH is not defined (this build doesn't support prefetching due to
+ * lack of a kernel facility), or the underlying relation file wasn't found and we
+ * are in recovery. (If the relation file isn't found and we are not in
+ * recovery, an error is raised).
*/
-void
+PrefetchBufferResult
PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
{
-#ifdef USE_PREFETCH
Assert(RelationIsValid(reln));
Assert(BlockNumberIsValid(blockNum));
@@ -554,14 +588,13 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
errmsg("cannot access temporary tables of other sessions")));
/* pass it off to localbuf.c */
- PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
+ return PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum);
}
else
{
/* pass it to the shared buffer version */
- PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
+ return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum);
}
-#endif /* USE_PREFETCH */
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index b528bc9553..1614ca03ea 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -60,11 +60,11 @@ static Block GetLocalBufferStorage(void);
* Do PrefetchBuffer's work for temporary relations.
* No-op if prefetching isn't compiled in.
*/
-void
+PrefetchBufferResult
PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum)
{
-#ifdef USE_PREFETCH
+ PrefetchBufferResult result = { InvalidBuffer, false };
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
@@ -81,12 +81,18 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
if (hresult)
{
/* Yes, so nothing to do */
- return;
+ result.recent_buffer = -hresult->id - 1;
}
-
- /* Not in buffers, so initiate prefetch */
- smgrprefetch(smgr, forkNum, blockNum);
+ else
+ {
+#ifdef USE_PREFETCH
+ /* Not in buffers, so initiate prefetch */
+ smgrprefetch(smgr, forkNum, blockNum);
+ result.initiated_io = true;
#endif /* USE_PREFETCH */
+ }
+
+ return result;
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index ee9822c6e1..e0b020da11 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -524,14 +524,17 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
/*
* mdprefetch() -- Initiate asynchronous read of the specified block of a relation
*/
-void
+bool
mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
{
#ifdef USE_PREFETCH
off_t seekpos;
MdfdVec *v;
- v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+ v = _mdfd_getseg(reln, forknum, blocknum, false,
+ InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
+ if (v == NULL)
+ return false;
seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
@@ -539,6 +542,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
#endif /* USE_PREFETCH */
+
+ return true;
}
/*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 72c9696ad1..b053a4dc76 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -49,7 +49,7 @@ typedef struct f_smgr
bool isRedo);
void (*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
- void (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
+ bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
@@ -524,11 +524,15 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
/*
* smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
+ *
+ * In recovery only, this can return false to indicate that a file
+ * doesn't exist (presumably it has been dropped by a later WAL
+ * record).
*/
-void
+bool
smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
{
- smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
+ return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
}
/*
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 166fe334c7..e57f84ee9c 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -327,8 +327,9 @@ extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
/* localbuf.c */
-extern void PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
- BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
+ ForkNumber forkNum,
+ BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum, bool *foundPtr);
extern void MarkLocalBufferDirty(Buffer buffer);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 39660aacba..ee91b8fa26 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -46,6 +46,15 @@ typedef enum
* replay; otherwise same as RBM_NORMAL */
} ReadBufferMode;
+/*
+ * Type returned by PrefetchBuffer().
+ */
+typedef struct PrefetchBufferResult
+{
+ Buffer recent_buffer; /* If valid, a hit (recheck needed!) */
+ bool initiated_io; /* If true, a miss resulting in async I/O */
+} PrefetchBufferResult;
+
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;
@@ -162,11 +171,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
/*
* prototypes for functions in bufmgr.c
*/
-extern void PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
- ForkNumber forkNum,
- BlockNumber blockNum);
-extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
- BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
+ ForkNumber forkNum,
+ BlockNumber blockNum);
+extern PrefetchBufferResult PrefetchBuffer(Relation reln, ForkNumber forkNum,
+ BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index ec7630ce3b..07fd1bb7d0 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -28,7 +28,7 @@ extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
extern void mdextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 79dfe0e373..bb8428f27f 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -93,7 +93,7 @@ extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
+extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer);
--
2.20.1
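For illustration only (not part of the patch above), here is a minimal sketch of
how a caller might classify the three outcomes described in its commit message;
the wrapper function and counter arguments are hypothetical names of mine:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/smgr.h"

/*
 * Hedged sketch: classify the result of a prefetch attempt.  The counters
 * are illustrative only.
 */
static void
note_prefetch_outcome(SMgrRelation smgr, BlockNumber blkno,
                      uint64 *hits, uint64 *misses, uint64 *filemissing)
{
    PrefetchBufferResult p = PrefetchSharedBuffer(smgr, MAIN_FORKNUM, blkno);

    if (BufferIsValid(p.recent_buffer))
        ++*hits;          /* already cached; buffer is not pinned, recheck before use */
    else if (p.initiated_io)
        ++*misses;        /* asked the kernel to start reading the block */
    else
        ++*filemissing;   /* no USE_PREFETCH, or file missing during recovery */
}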
v6-0006-Add-ReadBufferPrefetched-POC-only.patch
From a00a528be26b06d7de40d59381b7ee864f06f3a9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 26 Mar 2020 22:34:29 +1300
Subject: [PATCH v6 6/8] Add ReadBufferPrefetched() (POC only)
Provide a potentially faster version of ReadBuffer(), for cases where you
have a PrefetchBufferResult. We might be able to avoid an extra buffer
mapping table lookup.
NOT FOR COMMIT -- PROOF OF CONCEPT ONLY
---
src/backend/storage/buffer/bufmgr.c | 49 +++++++++++++++++++++++++++++
src/include/storage/bufmgr.h | 3 ++
2 files changed, 52 insertions(+)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 23f269ae74..f00c837f5a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -597,6 +597,55 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
}
}
+/*
+ * ReadBufferPrefetched -- read a buffer for which a prefetch was issued
+ *
+ * Like ReadBuffer(), but try to use the result of a recent PrefetchBuffer()
+ * call to avoid a buffer mapping table lookup.
+ */
+Buffer
+ReadBufferPrefetched(PrefetchBufferResult *prefetch,
+ Relation reln,
+ BlockNumber blockNum)
+{
+ /*
+ * If PrefetchBuffer() found this block in a buffer recently, try to pin it
+ * and then double check that it still holds the block we want.
+ */
+ if (BufferIsValid(prefetch->recent_buffer))
+ {
+ BufferDesc *bufHdr;
+ BufferTag tag;
+
+ if (BufferIsLocal(prefetch->recent_buffer))
+ {
+ bufHdr = GetBufferDescriptor(-prefetch->recent_buffer - 1);
+ }
+ else
+ {
+ bufHdr = GetBufferDescriptor(prefetch->recent_buffer - 1);
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+ if (!PinBuffer(bufHdr, NULL))
+ {
+ /* not valid, forget about it */
+ UnpinBuffer(bufHdr, true);
+ bufHdr = NULL;
+ }
+ }
+
+ /* If we managed to pin it or it's local, check tag. */
+ if (bufHdr)
+ {
+ RelationOpenSmgr(reln);
+ INIT_BUFFERTAG(tag, reln->rd_smgr->smgr_rnode.node, MAIN_FORKNUM,
+ blockNum);
+ if (BUFFERTAGS_EQUAL(tag, bufHdr->tag))
+ return BufferDescriptorGetBuffer(bufHdr);
+ }
+ }
+
+ return ReadBuffer(reln, blockNum);
+}
/*
* ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..8f6b19e6ac 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -183,6 +183,9 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy);
+extern Buffer ReadBufferPrefetched(PrefetchBufferResult *prefetch,
+ Relation reln,
+ BlockNumber blockNum);
extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern void MarkBufferDirty(Buffer buffer);
--
2.20.1
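A hedged sketch of the calling pattern the POC patch above seems to aim at (the
wrapper function is mine, not from the patch): remember the PrefetchBuffer()
result and hand it back later, so the second buffer mapping lookup can
sometimes be skipped.  It only applies to MAIN_FORKNUM, matching the POC.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

static Buffer
read_with_prefetch_hint(Relation rel, BlockNumber blkno)
{
    /* issue the prefetch and keep the hint it returns */
    PrefetchBufferResult hint = PrefetchBuffer(rel, MAIN_FORKNUM, blkno);

    /* ... other work here gives the kernel time to complete the read ... */

    /* reuse the hint; falls back to a normal ReadBuffer() if it's stale */
    return ReadBufferPrefetched(&hint, rel, blkno);
}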
On Wed, Apr 8, 2020 at 4:24 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Thanks for all that feedback. It's been a strange couple of weeks,
but I finally have a new version that addresses most of that feedback
(but punts on a couple of suggestions for later development, due to
lack of time).
Here's an executive summary of an off-list chat with Andres:
* he withdrew his objection to the new definition of
GetWalRcvWriteRecPtr() based on my argument that any external code
will fail to compile anyway
* he doesn't like the naive code that detects sequential access and
skips prefetching; I agreed to rip it out for now and revisit if/when
we have better evidence that that's worth bothering with; the code
path that does that and the pg_stat_recovery_prefetch.skip_seq counter
will remain, but be used only to skip prefetching of repeated access
to the *same* block for now
* he gave some feedback on the read_local_xlog_page() modifications: I
probably need to reconsider the change to logical.c that passes NULL
instead of cxt to the read_page callback; and the switch statement in
read_local_xlog_page() probably should have a case for the preexisting
mode
* he +1s the plan to commit with the feature enabled, and revisit before release
* he thinks the idea of a variant of ReadBuffer() that takes a
PrefetchBufferResult (as sketched by the v6 0006 patch) broadly makes
sense as a stepping stone towards his asynchronous I/O proposal, but
there's no point in committing something like 0006 without a user
I'm going to go and commit the first few patches in this series, and
come back in a bit with a new version of the main patch to fix the
above and a compiler warning reported by cfbot.
On Wed, Apr 8, 2020 at 12:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
* he gave some feedback on the read_local_xlog_page() modifications: I
probably need to reconsider the change to logical.c that passes NULL
instead of cxt to the read_page callback; and the switch statement in
read_local_xlog_page() probably should have a case for the preexisting
mode
So... logical.c wants to give its LogicalDecodingContext to any
XLogPageReadCB you give it, via "private_data"; that is, it really
only accepts XLogPageReadCB implementations that understand that (or
ignore it). What I want to do is give every XLogPageReadCB the chance
to have its own state that it is in control of (to receive settings
specific to the implementation, or whatever), that you supply along
with it. We can't do both kinds of things with private_data, so I
have added a second member read_page_data to XLogReaderState. If you
pass in read_local_xlog_page as read_page, then you can optionally
install a pointer to XLogReadLocalOptions as reader->read_page_data,
to activate the new behaviours I added for prefetching purposes.
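To make that distinction concrete, here is a minimal sketch (the wrapper
function is mine): private_data stays available for the callback's client,
while the new read_page_data member carries the XLogReadLocalOptions used by
read_local_xlog_page().

#include "postgres.h"
#include "access/xlogreader.h"
#include "access/xlogutils.h"

static void
configure_readahead(XLogReaderState *reader, XLogReadLocalOptions *options,
                    TimeLineID tli)
{
    options->nowait = true;                          /* report WOULDBLOCK instead of sleeping */
    options->read_upto_policy = XLRO_WALRCV_WRITTEN; /* stop at the walreceiver's write pointer */
    options->tli = tli;                              /* read only this timeline */

    reader->read_page_data = options;                /* new member; private_data is untouched */
}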
While working on that, I realised the readahead XLogReader was
breaking a rule expressed in XLogReadDetermineTimeLine(). Timelines
are really confusing and there were probably several subtle or not so
subtle bugs there. So I added an option to skip all of that logic,
and just say "I command you to read only from TLI X". It reads the
same TLI as recovery is reading, until it hits the end of readable
data and that causes prefetching to shut down. Then the main recovery
loop resets the prefetching module when it sees a TLI switch, so then
it starts up again. This seems to work reliably, but I've obviously
had limited time to test. Does this scheme sound sane?
I think this is basically committable (though of course I wish I had
more time to test and review). Ugh. Feature freeze in half an hour.
Attachments:
v7-0001-Rationalize-GetWalRcv-Write-Flush-RecPtr.patch
From 8654ea7f2ed61de7ab3f0b305e37d190932ad97c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 17 Mar 2020 15:28:08 +1300
Subject: [PATCH v7 1/4] Rationalize GetWalRcv{Write,Flush}RecPtr().
GetWalRcvWriteRecPtr() previously reported the latest *flushed*
location. Adopt the conventional terminology used elsewhere in the tree
by renaming it to GetWalRcvFlushRecPtr(), and likewise for some related
variables that used the term "received".
Add a new definition of GetWalRcvWriteRecPtr(), which returns the latest
*written* value. This will allow later patches to use the value for
non-data-integrity purposes, without having to wait for the flush
pointer to advance.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/access/transam/xlog.c | 20 +++++++++---------
src/backend/access/transam/xlogfuncs.c | 2 +-
src/backend/replication/README | 2 +-
src/backend/replication/walreceiver.c | 15 +++++++++-----
src/backend/replication/walreceiverfuncs.c | 24 ++++++++++++++++------
src/backend/replication/walsender.c | 2 +-
src/include/replication/walreceiver.h | 18 ++++++++++++----
7 files changed, 55 insertions(+), 28 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8a4c1743e5..c60842ea03 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -209,8 +209,8 @@ HotStandbyState standbyState = STANDBY_DISABLED;
static XLogRecPtr LastRec;
-/* Local copy of WalRcv->receivedUpto */
-static XLogRecPtr receivedUpto = 0;
+/* Local copy of WalRcv->flushedUpto */
+static XLogRecPtr flushedUpto = 0;
static TimeLineID receiveTLI = 0;
/*
@@ -9376,7 +9376,7 @@ CreateRestartPoint(int flags)
* Retreat _logSegNo using the current end of xlog replayed or received,
* whichever is later.
*/
- receivePtr = GetWalRcvWriteRecPtr(NULL, NULL);
+ receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
replayPtr = GetXLogReplayRecPtr(&replayTLI);
endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
KeepLogSeg(endptr, &_logSegNo);
@@ -11869,7 +11869,7 @@ retry:
/* See if we need to retrieve more data */
if (readFile < 0 ||
(readSource == XLOG_FROM_STREAM &&
- receivedUpto < targetPagePtr + reqLen))
+ flushedUpto < targetPagePtr + reqLen))
{
if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
private->randAccess,
@@ -11900,10 +11900,10 @@ retry:
*/
if (readSource == XLOG_FROM_STREAM)
{
- if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
+ if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
readLen = XLOG_BLCKSZ;
else
- readLen = XLogSegmentOffset(receivedUpto, wal_segment_size) -
+ readLen = XLogSegmentOffset(flushedUpto, wal_segment_size) -
targetPageOff;
}
else
@@ -12318,7 +12318,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
PrimarySlotName,
wal_receiver_create_temp_slot);
- receivedUpto = 0;
+ flushedUpto = 0;
}
/*
@@ -12342,14 +12342,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* XLogReceiptTime will not advance, so the grace time
* allotted to conflicting queries will decrease.
*/
- if (RecPtr < receivedUpto)
+ if (RecPtr < flushedUpto)
havedata = true;
else
{
XLogRecPtr latestChunkStart;
- receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI);
- if (RecPtr < receivedUpto && receiveTLI == curFileTLI)
+ flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
+ if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
{
havedata = true;
if (latestChunkStart <= RecPtr)
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b84ba57259..00e1b33ed5 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -398,7 +398,7 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
{
XLogRecPtr recptr;
- recptr = GetWalRcvWriteRecPtr(NULL, NULL);
+ recptr = GetWalRcvFlushRecPtr(NULL, NULL);
if (recptr == 0)
PG_RETURN_NULL();
diff --git a/src/backend/replication/README b/src/backend/replication/README
index 0cbb990613..8ccdd86e74 100644
--- a/src/backend/replication/README
+++ b/src/backend/replication/README
@@ -54,7 +54,7 @@ and WalRcvData->slotname, and initializes the starting point in
WalRcvData->receiveStart.
As walreceiver receives WAL from the master server, and writes and flushes
-it to disk (in pg_wal), it updates WalRcvData->receivedUpto and signals
+it to disk (in pg_wal), it updates WalRcvData->flushedUpto and signals
the startup process to know how far WAL replay can advance.
Walreceiver sends information about replication progress to the master server
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index aee67c61aa..d69fb90132 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -12,7 +12,7 @@
* in the primary server), and then keeps receiving XLOG records and
* writing them to the disk as long as the connection is alive. As XLOG
* records are received and flushed to disk, it updates the
- * WalRcv->receivedUpto variable in shared memory, to inform the startup
+ * WalRcv->flushedUpto variable in shared memory, to inform the startup
* process of how far it can proceed with XLOG replay.
*
* A WAL receiver cannot directly load GUC parameters used when establishing
@@ -261,6 +261,8 @@ WalReceiverMain(void)
SpinLockRelease(&walrcv->mutex);
+ pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
+
/* Arrange to clean up at walreceiver exit */
on_shmem_exit(WalRcvDie, 0);
@@ -984,6 +986,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
LogstreamResult.Write = recptr;
}
+
+ /* Update shared-memory status */
+ pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
}
/*
@@ -1005,10 +1010,10 @@ XLogWalRcvFlush(bool dying)
/* Update shared-memory status */
SpinLockAcquire(&walrcv->mutex);
- if (walrcv->receivedUpto < LogstreamResult.Flush)
+ if (walrcv->flushedUpto < LogstreamResult.Flush)
{
- walrcv->latestChunkStart = walrcv->receivedUpto;
- walrcv->receivedUpto = LogstreamResult.Flush;
+ walrcv->latestChunkStart = walrcv->flushedUpto;
+ walrcv->flushedUpto = LogstreamResult.Flush;
walrcv->receivedTLI = ThisTimeLineID;
}
SpinLockRelease(&walrcv->mutex);
@@ -1361,7 +1366,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
state = WalRcv->walRcvState;
receive_start_lsn = WalRcv->receiveStart;
receive_start_tli = WalRcv->receiveStartTLI;
- received_lsn = WalRcv->receivedUpto;
+ received_lsn = WalRcv->flushedUpto;
received_tli = WalRcv->receivedTLI;
last_send_time = WalRcv->lastMsgSendTime;
last_receipt_time = WalRcv->lastMsgReceiptTime;
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 21d1823607..4afad83539 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -282,11 +282,11 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
/*
* If this is the first startup of walreceiver (on this timeline),
- * initialize receivedUpto and latestChunkStart to the starting point.
+ * initialize flushedUpto and latestChunkStart to the starting point.
*/
if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
{
- walrcv->receivedUpto = recptr;
+ walrcv->flushedUpto = recptr;
walrcv->receivedTLI = tli;
walrcv->latestChunkStart = recptr;
}
@@ -304,7 +304,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
}
/*
- * Returns the last+1 byte position that walreceiver has written.
+ * Returns the last+1 byte position that walreceiver has flushed.
*
* Optionally, returns the previous chunk start, that is the first byte
* written in the most recent walreceiver flush cycle. Callers not
@@ -312,13 +312,13 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
* receiveTLI.
*/
XLogRecPtr
-GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
+GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
{
WalRcvData *walrcv = WalRcv;
XLogRecPtr recptr;
SpinLockAcquire(&walrcv->mutex);
- recptr = walrcv->receivedUpto;
+ recptr = walrcv->flushedUpto;
if (latestChunkStart)
*latestChunkStart = walrcv->latestChunkStart;
if (receiveTLI)
@@ -328,6 +328,18 @@ GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI)
return recptr;
}
+/*
+ * Returns the last+1 byte position that walreceiver has written.
+ * This returns a recently written value without taking a lock.
+ */
+XLogRecPtr
+GetWalRcvWriteRecPtr(void)
+{
+ WalRcvData *walrcv = WalRcv;
+
+ return pg_atomic_read_u64(&walrcv->writtenUpto);
+}
+
/*
* Returns the replication apply delay in ms or -1
* if the apply delay info is not available
@@ -345,7 +357,7 @@ GetReplicationApplyDelay(void)
TimestampTz chunkReplayStartTime;
SpinLockAcquire(&walrcv->mutex);
- receivePtr = walrcv->receivedUpto;
+ receivePtr = walrcv->flushedUpto;
SpinLockRelease(&walrcv->mutex);
replayPtr = GetXLogReplayRecPtr(NULL);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 06e8b79036..122d884f3e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2949,7 +2949,7 @@ GetStandbyFlushRecPtr(void)
* has streamed, but hasn't been replayed yet.
*/
- receivePtr = GetWalRcvWriteRecPtr(NULL, &receiveTLI);
+ receivePtr = GetWalRcvFlushRecPtr(NULL, &receiveTLI);
replayPtr = GetXLogReplayRecPtr(&replayTLI);
ThisTimeLineID = replayTLI;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index cf3e43128c..f1aa6e9977 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -16,6 +16,7 @@
#include "access/xlogdefs.h"
#include "getaddrinfo.h" /* for NI_MAXHOST */
#include "pgtime.h"
+#include "port/atomics.h"
#include "replication/logicalproto.h"
#include "replication/walsender.h"
#include "storage/latch.h"
@@ -73,19 +74,19 @@ typedef struct
TimeLineID receiveStartTLI;
/*
- * receivedUpto-1 is the last byte position that has already been
+ * flushedUpto-1 is the last byte position that has already been
* received, and receivedTLI is the timeline it came from. At the first
* startup of walreceiver, these are set to receiveStart and
* receiveStartTLI. After that, walreceiver updates these whenever it
* flushes the received WAL to disk.
*/
- XLogRecPtr receivedUpto;
+ XLogRecPtr flushedUpto;
TimeLineID receivedTLI;
/*
* latestChunkStart is the starting byte position of the current "batch"
* of received WAL. It's actually the same as the previous value of
- * receivedUpto before the last flush to disk. Startup process can use
+ * flushedUpto before the last flush to disk. Startup process can use
* this to detect whether it's keeping up or not.
*/
XLogRecPtr latestChunkStart;
@@ -141,6 +142,14 @@ typedef struct
slock_t mutex; /* locks shared variables shown above */
+ /*
+ * Like flushedUpto, but advanced after writing and before flushing,
+ * without the need to acquire the spin lock. Data can be read by another
+ * process up to this point, but shouldn't be used for data integrity
+ * purposes.
+ */
+ pg_atomic_uint64 writtenUpto;
+
/*
* force walreceiver reply? This doesn't need to be locked; memory
* barriers for ordering are sufficient. But we do need atomic fetch and
@@ -322,7 +331,8 @@ extern bool WalRcvRunning(void);
extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
const char *conninfo, const char *slotname,
bool create_temp_slot);
-extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
+extern XLogRecPtr GetWalRcvWriteRecPtr(void);
extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(void);
--
2.20.1
v7-0002-Add-pg_atomic_unlocked_add_fetch_XXX.patch
From d0a1b60cbe589a4023b94db35ce3b830f5cbde04 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 28 Mar 2020 11:42:59 +1300
Subject: [PATCH v7 2/4] Add pg_atomic_unlocked_add_fetch_XXX().
Add a variant of pg_atomic_add_fetch_XXX with no barrier semantics, for
cases where you only want to avoid the possibility that a concurrent
pg_atomic_read_XXX() sees a torn/partial value.
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/include/port/atomics.h | 24 ++++++++++++++++++++++
src/include/port/atomics/generic.h | 33 ++++++++++++++++++++++++++++++
2 files changed, 57 insertions(+)
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 4956ec55cb..2abb852893 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -389,6 +389,21 @@ pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
return pg_atomic_add_fetch_u32_impl(ptr, add_);
}
+/*
+ * pg_atomic_unlocked_add_fetch_u32 - atomically add to variable
+ *
+ * Like pg_atomic_unlocked_write_u32, guarantees only that partial values
+ * cannot be observed.
+ *
+ * No barrier semantics.
+ */
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+ AssertPointerAlignment(ptr, 4);
+ return pg_atomic_unlocked_add_fetch_u32_impl(ptr, add_);
+}
+
/*
* pg_atomic_sub_fetch_u32 - atomically subtract from variable
*
@@ -519,6 +534,15 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
}
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+ AssertPointerAlignment(ptr, 8);
+#endif
+ return pg_atomic_unlocked_add_fetch_u64_impl(ptr, add_);
+}
+
#undef INSIDE_ATOMICS_H
#endif /* ATOMICS_H */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index d3ba89a58f..1683653ca6 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -234,6 +234,16 @@ pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
}
#endif
+#if !defined(PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32)
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+ ptr->value += add_;
+ return ptr->value;
+}
+#endif
+
#if !defined(PG_HAVE_ATOMIC_SUB_FETCH_U32) && defined(PG_HAVE_ATOMIC_FETCH_SUB_U32)
#define PG_HAVE_ATOMIC_SUB_FETCH_U32
static inline uint32
@@ -399,3 +409,26 @@ pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
}
#endif
+
+#if defined(PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY) && \
+ !defined(PG_HAVE_ATOMIC_U64_SIMULATION)
+
+#ifndef PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+ ptr->value += val;
+ return ptr->value;
+}
+#endif
+
+#else
+
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+ return pg_atomic_add_fetch_u64_impl(ptr, val);
+}
+
+#endif
--
2.20.1
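As a hedged usage sketch for the patch above (not code from it), the intended
pattern is a single-writer counter in shared memory: only the owning process
updates it, other processes read it with pg_atomic_read_u64() and are
guaranteed only that they won't observe a torn value.  The struct and function
names are illustrative.

#include "postgres.h"
#include "port/atomics.h"

typedef struct SingleWriterCounter
{
    pg_atomic_uint64 value;
} SingleWriterCounter;

static inline void
bump_counter(SingleWriterCounter *counter, int64 n)
{
    /* no lock, no barrier; cheap because there is only one writer */
    pg_atomic_unlocked_add_fetch_u64(&counter->value, n);
}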
v7-0003-Allow-XLogReadRecord-to-be-non-blocking.patch
From dea9a3c46d35b12bbea8469e44223f73e4004d22 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 7 Apr 2020 22:56:27 +1200
Subject: [PATCH v7 3/4] Allow XLogReadRecord() to be non-blocking.
Extend read_local_xlog_page() to support non-blocking modes:
1. Reading as far as the WAL receiver has written so far.
2. Reading all the way to the end, when the end LSN is unknown.
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/access/transam/xlogreader.c | 37 ++++--
src/backend/access/transam/xlogutils.c | 151 +++++++++++++++++-------
src/backend/replication/walsender.c | 2 +-
src/include/access/xlogreader.h | 20 +++-
src/include/access/xlogutils.h | 23 ++++
5 files changed, 178 insertions(+), 55 deletions(-)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 79ff976474..554b2029da 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -257,6 +257,9 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
* If the reading fails for some other reason, NULL is also returned, and
* *errormsg is set to a string with details of the failure.
*
+ * If the read_page callback is one that returns XLOGPAGEREAD_WOULDBLOCK rather
+ * than waiting for WAL to arrive, NULL is also returned in that case.
+ *
* The returned pointer (or *errormsg) points to an internal buffer that's
* valid until the next call to XLogReadRecord.
*/
@@ -546,10 +549,11 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
err:
/*
- * Invalidate the read state. We might read from a different source after
- * failure.
+ * Invalidate the read state, if this was an error. We might read from a
+ * different source after failure.
*/
- XLogReaderInvalReadState(state);
+ if (readOff != XLOGPAGEREAD_WOULDBLOCK)
+ XLogReaderInvalReadState(state);
if (state->errormsg_buf[0] != '\0')
*errormsg = state->errormsg_buf;
@@ -561,8 +565,9 @@ err:
* Read a single xlog page including at least [pageptr, reqLen] of valid data
* via the read_page() callback.
*
- * Returns -1 if the required page cannot be read for some reason; errormsg_buf
- * is set in that case (unless the error occurs in the read_page callback).
+ * Returns XLOGPAGEREAD_ERROR or XLOGPAGEREAD_WOULDBLOCK if the required page
+ * cannot be read for some reason; errormsg_buf is set in the former case
+ * (unless the error occurs in the read_page callback).
*
* We fetch the page from a reader-local cache if we know we have the required
* data and if there hasn't been any error since caching the data.
@@ -659,8 +664,11 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
return readLen;
err:
+ if (readLen == XLOGPAGEREAD_WOULDBLOCK)
+ return XLOGPAGEREAD_WOULDBLOCK;
+
XLogReaderInvalReadState(state);
- return -1;
+ return XLOGPAGEREAD_ERROR;
}
/*
@@ -939,6 +947,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
XLogRecPtr found = InvalidXLogRecPtr;
XLogPageHeader header;
char *errormsg;
+ int readLen;
Assert(!XLogRecPtrIsInvalid(RecPtr));
@@ -952,7 +961,6 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
XLogRecPtr targetPagePtr;
int targetRecOff;
uint32 pageHeaderSize;
- int readLen;
/*
* Compute targetRecOff. It should typically be equal or greater than
@@ -1033,7 +1041,8 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
}
err:
- XLogReaderInvalReadState(state);
+ if (readLen != XLOGPAGEREAD_WOULDBLOCK)
+ XLogReaderInvalReadState(state);
return InvalidXLogRecPtr;
}
@@ -1084,13 +1093,23 @@ WALRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
tli != seg->ws_tli)
{
XLogSegNo nextSegNo;
-
if (seg->ws_file >= 0)
close(seg->ws_file);
XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize);
seg->ws_file = openSegment(nextSegNo, segcxt, &tli);
+ /* callback reported that there was no such file */
+ if (seg->ws_file < 0)
+ {
+ errinfo->wre_errno = errno;
+ errinfo->wre_req = 0;
+ errinfo->wre_read = 0;
+ errinfo->wre_off = startoff;
+ errinfo->wre_seg = *seg;
+ return false;
+ }
+
/* Update the current segment info. */
seg->ws_tli = tli;
seg->ws_segno = nextSegNo;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 6cb143e161..2d702437dd 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walreceiver.h"
#include "storage/smgr.h"
#include "utils/guc.h"
#include "utils/hsearch.h"
@@ -783,6 +784,30 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
}
}
+/* openSegment callback for WALRead */
+static int
+wal_segment_try_open(XLogSegNo nextSegNo,
+ WALSegmentContext *segcxt,
+ TimeLineID *tli_p)
+{
+ TimeLineID tli = *tli_p;
+ char path[MAXPGPATH];
+ int fd;
+
+ XLogFilePath(path, tli, nextSegNo, segcxt->ws_segsize);
+ fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+ if (fd >= 0)
+ return fd;
+
+ if (errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+ path)));
+
+ return -1; /* keep compiler quiet */
+}
+
/* openSegment callback for WALRead */
static int
wal_segment_open(XLogSegNo nextSegNo, WALSegmentContext * segcxt,
@@ -831,58 +856,92 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
TimeLineID tli;
int count;
WALReadError errinfo;
+ bool try_read = false;
+ XLogReadLocalOptions *options =
+ (XLogReadLocalOptions *) state->read_page_data;
loc = targetPagePtr + reqLen;
/* Loop waiting for xlog to be available if necessary */
while (1)
{
- /*
- * Determine the limit of xlog we can currently read to, and what the
- * most recent timeline is.
- *
- * RecoveryInProgress() will update ThisTimeLineID when it first
- * notices recovery finishes, so we only have to maintain it for the
- * local process until recovery ends.
- */
- if (!RecoveryInProgress())
- read_upto = GetFlushRecPtr();
- else
- read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
- tli = ThisTimeLineID;
+ switch (options ? options->read_upto_policy : -1)
+ {
+ case XLRO_WALRCV_WRITTEN:
+ /*
+ * We'll try to read as far as has been written by the WAL
+ * receiver, on the requested timeline. When we run out of valid
+ * data, we'll return an error. This is used by xlogprefetch.c
+ * while streaming.
+ */
+ read_upto = GetWalRcvWriteRecPtr();
+ try_read = true;
+ state->currTLI = tli = options->tli;
+ break;
- /*
- * Check which timeline to get the record from.
- *
- * We have to do it each time through the loop because if we're in
- * recovery as a cascading standby, the current timeline might've
- * become historical. We can't rely on RecoveryInProgress() because in
- * a standby configuration like
- *
- * A => B => C
- *
- * if we're a logical decoding session on C, and B gets promoted, our
- * timeline will change while we remain in recovery.
- *
- * We can't just keep reading from the old timeline as the last WAL
- * archive in the timeline will get renamed to .partial by
- * StartupXLOG().
- *
- * If that happens after our caller updated ThisTimeLineID but before
- * we actually read the xlog page, we might still try to read from the
- * old (now renamed) segment and fail. There's not much we can do
- * about this, but it can only happen when we're a leaf of a cascading
- * standby whose master gets promoted while we're decoding, so a
- * one-off ERROR isn't too bad.
- */
- XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+ case XLRO_END:
+ /*
+ * We'll try to read as far as we can on one timeline. This is
+ * used by xlogprefetch.c for crash recovery.
+ */
+ read_upto = (XLogRecPtr) -1;
+ try_read = true;
+ state->currTLI = tli = options->tli;
+ break;
+
+ default:
+ /*
+ * Determine the limit of xlog we can currently read to, and what the
+ * most recent timeline is.
+ *
+ * RecoveryInProgress() will update ThisTimeLineID when it first
+ * notices recovery finishes, so we only have to maintain it for
+ * the local process until recovery ends.
+ */
+ if (!RecoveryInProgress())
+ read_upto = GetFlushRecPtr();
+ else
+ read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
+ tli = ThisTimeLineID;
+
+ /*
+ * Check which timeline to get the record from.
+ *
+ * We have to do it each time through the loop because if we're in
+ * recovery as a cascading standby, the current timeline might've
+ * become historical. We can't rely on RecoveryInProgress()
+ * because in a standby configuration like
+ *
+ * A => B => C
+ *
+ * if we're a logical decoding session on C, and B gets promoted,
+ * our timeline will change while we remain in recovery.
+ *
+ * We can't just keep reading from the old timeline as the last
+ * WAL archive in the timeline will get renamed to .partial by
+ * StartupXLOG().
+ *
+ * If that happens after our caller updated ThisTimeLineID but
+ * before we actually read the xlog page, we might still try to
+ * read from the old (now renamed) segment and fail. There's not
+ * much we can do about this, but it can only happen when we're a
+ * leaf of a cascading standby whose master gets promoted while
+ * we're decoding, so a one-off ERROR isn't too bad.
+ */
+ XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+ break;
+ }
- if (state->currTLI == ThisTimeLineID)
+ if (state->currTLI == tli)
{
if (loc <= read_upto)
break;
+ /* not enough data there, but we were asked not to wait */
+ if (options && options->nowait)
+ return XLOGPAGEREAD_WOULDBLOCK;
+
CHECK_FOR_INTERRUPTS();
pg_usleep(1000L);
}
@@ -924,7 +983,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
else if (targetPagePtr + reqLen > read_upto)
{
/* not enough data there */
- return -1;
+ return XLOGPAGEREAD_ERROR;
}
else
{
@@ -938,8 +997,18 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
* zero-padded up to the page boundary if it's incomplete.
*/
if (!WALRead(cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &state->seg,
- &state->segcxt, wal_segment_open, &errinfo))
+ &state->segcxt,
+ try_read ? wal_segment_try_open : wal_segment_open,
+ &errinfo))
+ {
+ /*
+ * When on one single timeline, we may read past the end of available
+ * segments. Report lack of file as an error.
+ */
+ if (try_read)
+ return XLOGPAGEREAD_ERROR;
WALReadRaiseError(&errinfo);
+ }
/* number of valid bytes in the buffer */
return count;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 122d884f3e..15ff3d35e4 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -818,7 +818,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
/* fail if not (implies we are going to shut down) */
if (flushptr < targetPagePtr + reqLen)
- return -1;
+ return XLOGPAGEREAD_ERROR;
if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
count = XLOG_BLCKSZ; /* more than one block available */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196e18..a3ac7f414b 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -50,6 +50,10 @@ typedef struct WALSegmentContext
typedef struct XLogReaderState XLogReaderState;
+/* Special negative return values for XLogPageReadCB functions */
+#define XLOGPAGEREAD_ERROR -1
+#define XLOGPAGEREAD_WOULDBLOCK -2
+
/* Function type definition for the read_page callback */
typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
XLogRecPtr targetPagePtr,
@@ -99,10 +103,13 @@ struct XLogReaderState
* This callback shall read at least reqLen valid bytes of the xlog page
* starting at targetPagePtr, and store them in readBuf. The callback
* shall return the number of bytes read (never more than XLOG_BLCKSZ), or
- * -1 on failure. The callback shall sleep, if necessary, to wait for the
- * requested bytes to become available. The callback will not be invoked
- * again for the same page unless more than the returned number of bytes
- * are needed.
+ * XLOGPAGEREAD_ERROR on failure. The callback may either sleep or return
+ * XLOGPAGEREAD_WOULDBLOCK, if necessary, to wait for the requested bytes
+ * to become available. If a callback that can return
+ * XLOGPAGEREAD_WOULDBLOCK is installed, the reader client must expect to
+ * fail to read when there is not enough data. The callback will not be
+ * invoked again for the same page unless more than the returned number of
+ * bytes are needed.
*
* targetRecPtr is the position of the WAL record we're reading. Usually
* it is equal to targetPagePtr + reqLen, but sometimes xlogreader needs
@@ -126,6 +133,11 @@ struct XLogReaderState
*/
void *private_data;
+ /*
+ * Opaque data for callbacks to use. Not used by XLogReader.
+ */
+ void *read_page_data;
+
/*
* Start and end point of last record read. EndRecPtr is also used as the
* position to read next. Calling XLogBeginRead() sets EndRecPtr to the
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5181a077d9..89c9ce90f8 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,6 +47,29 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
extern void FreeFakeRelcacheEntry(Relation fakerel);
+/*
+ * A pointer to an XLogReadLocalOptions struct can be supplied as the read_page_data
+ * for an XLogReader, causing read_local_xlog_page() to modify its behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+ /* Don't block waiting for new WAL to arrive. */
+ bool nowait;
+
+ /*
+ * For XLRO_WALRCV_WRITTEN and XLRO_END modes, the timeline ID must be
+ * provided.
+ */
+ TimeLineID tli;
+
+ /* How far to read. */
+ enum {
+ XLRO_STANDARD,
+ XLRO_WALRCV_WRITTEN,
+ XLRO_END
+ } read_upto_policy;
+} XLogReadLocalOptions;
+
extern int read_local_xlog_page(XLogReaderState *state,
XLogRecPtr targetPagePtr, int reqLen,
XLogRecPtr targetRecPtr, char *cur_page);
--
2.20.1
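My reading of the patch above, as a hedged sketch rather than code from it:
with a non-blocking read_page callback installed, XLogReadRecord() can return
NULL without setting an error message, which the caller treats as "no more WAL
available yet" and simply retries later.

#include "postgres.h"
#include "access/xlogreader.h"

static XLogRecord *
try_read_one_record(XLogReaderState *reader)
{
    char       *errormsg = NULL;
    XLogRecord *record = XLogReadRecord(reader, &errormsg);

    if (record == NULL && errormsg != NULL)
        elog(LOG, "read-ahead stopped: %s", errormsg);
    /* record == NULL with no message: would block; retry after more WAL arrives */

    return record;
}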
v7-0004-Prefetch-referenced-blocks-during-recovery.patch
From 85c2ea245c03c6a859e652cc2d9df3b2ca323bb4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Apr 2020 23:04:51 +1200
Subject: [PATCH v7 4/4] Prefetch referenced blocks during recovery.
Introduce a new GUC max_recovery_prefetch_distance. If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks. The
goal is to avoid I/O stalls and benefit from concurrent I/O. The number
of concurrent asynchronous reads is capped by the existing
maintenance_io_concurrency GUC. The feature is enabled by default for
now, but we might reconsider that before release.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
doc/src/sgml/config.sgml | 45 +
doc/src/sgml/monitoring.sgml | 81 ++
doc/src/sgml/wal.sgml | 13 +
src/backend/access/transam/Makefile | 1 +
src/backend/access/transam/xlog.c | 16 +
src/backend/access/transam/xlogprefetch.c | 905 ++++++++++++++++++
src/backend/catalog/system_views.sql | 14 +
src/backend/postmaster/pgstat.c | 96 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/utils/misc/guc.c | 47 +-
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/include/access/xlogprefetch.h | 85 ++
src/include/catalog/pg_proc.dat | 8 +
src/include/pgstat.h | 27 +
src/include/utils/guc.h | 4 +
src/test/regress/expected/rules.out | 11 +
16 files changed, 1359 insertions(+), 2 deletions(-)
create mode 100644 src/backend/access/transam/xlogprefetch.c
create mode 100644 src/include/access/xlogprefetch.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a0da4aabac..18979d0496 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3121,6 +3121,51 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-max-recovery-prefetch-distance" xreflabel="max_recovery_prefetch_distance">
+ <term><varname>max_recovery_prefetch_distance</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>max_recovery_prefetch_distance</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ The maximum distance to look ahead in the WAL during recovery, to find
+ blocks to prefetch. Prefetching blocks that will soon be needed can
+ reduce I/O wait times. The number of concurrent prefetches is limited
+ by this setting as well as
+ <xref linkend="guc-maintenance-io-concurrency"/>. Setting it too high
+ might be counterproductive, if it means that data falls out of the
+ kernel cache before it is needed. If this value is specified without
+ units, it is taken as bytes. A setting of -1 disables prefetching
+ during recovery.
+ The default is 256kB on systems that support
+ <function>posix_fadvise</function>, and otherwise -1.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+ <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to prefetch blocks that were logged with full page images,
+ during recovery. Often this doesn't help, since such blocks will not
+ be read the first time they are needed and might remain in the buffer
+ pool after that. However, on file systems with a block size larger
+ than
+ <productname>PostgreSQL</productname>'s, prefetching can avoid a
+ costly read-before-write when a block is later written. This
+ setting has no effect unless
+ <xref linkend="guc-max-recovery-prefetch-distance"/> is set to a positive
+ number. The default is off.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
<sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c50b72137f..ddf2ee1f96 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+ <entry>Only one row, showing statistics about blocks prefetched during recovery.
+ See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
<entry>At least one row per subscription, showing information about
@@ -2223,6 +2230,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
connected server.
</para>
+ <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+ <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+ <tgroup cols="3">
+ <thead>
+ <row>
+ <entry>Column</entry>
+ <entry>Type</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><structfield>prefetch</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_hit</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_new</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_fpw</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+ </row>
+ <row>
+ <entry><structfield>skip_seq</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+ </row>
+ <row>
+ <entry><structfield>distance</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>How far ahead of recovery the prefetcher is currently reading, in bytes</entry>
+ </row>
+ <row>
+ <entry><structfield>queue_depth</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+ </row>
+ <row>
+ <entry><structfield>avg_distance</structfield></entry>
+ <entry><type>float4</type></entry>
+ <entry>How far ahead of recovery the prefetcher is on average, while recovery is not idle</entry>
+ </row>
+ <row>
+ <entry><structfield>avg_queue_depth</structfield></entry>
+ <entry><type>float4</type></entry>
+ <entry>Average number of prefetches in flight while recovery is not idle</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+ one row. It is filled with nulls if recovery is not running or WAL
+ prefetching is not enabled. See <xref linkend="guc-max-recovery-prefetch-distance"/>
+ for more information. The counters in this view are reset whenever the
+ <xref linkend="guc-max-recovery-prefetch-distance"/>,
+ <xref linkend="guc-recovery-prefetch-fpw"/> or
+ <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+ the server configuration is reloaded.
+ </para>
+
<table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
<title><structname>pg_stat_subscription</structname> View</title>
<tgroup cols="3">
@@ -3446,6 +3525,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
counters shown in the <structname>pg_stat_bgwriter</structname> view.
Calling <literal>pg_stat_reset_shared('archiver')</literal> will zero all the
counters shown in the <structname>pg_stat_archiver</structname> view.
+ Calling <literal>pg_stat_reset_shared('prefetch_recovery')</literal> will zero all the
+ counters shown in the <structname>pg_stat_prefetch_recovery</structname> view.
</entry>
</row>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index bd9fae544c..38fc8149a8 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -719,6 +719,19 @@
<acronym>WAL</acronym> call being logged to the server log. This
option might be replaced by a more general mechanism in the future.
</para>
+
+ <para>
+ The <xref linkend="guc-max-recovery-prefetch-distance"/> parameter can
+ be used to improve I/O performance during recovery by instructing
+ <productname>PostgreSQL</productname> to initiate reads
+ of disk blocks that will soon be needed, in combination with the
+ <xref linkend="guc-maintenance-io-concurrency"/> parameter. The
+ prefetching mechanism is most likely to be effective on systems
+ with <varname>full_page_writes</varname> set to
+ <varname>off</varname> (where that is safe), and where the working
+ set is larger than RAM. By default, prefetching in recovery is enabled,
+ but it can be disabled by setting the distance to -1.
+ </para>
</sect1>
<sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..39f9d4e77d 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
xlogarchive.o \
xlogfuncs.o \
xloginsert.o \
+ xlogprefetch.o \
xlogreader.o \
xlogutils.o
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c60842ea03..6b2e95c06c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
#include "access/xlog_internal.h"
#include "access/xlogarchive.h"
#include "access/xloginsert.h"
+#include "access/xlogprefetch.h"
#include "access/xlogreader.h"
#include "access/xlogutils.h"
#include "catalog/catversion.h"
@@ -7144,6 +7145,7 @@ StartupXLOG(void)
{
ErrorContextCallback errcallback;
TimestampTz xtime;
+ XLogPrefetchState prefetch;
InRedo = true;
@@ -7151,6 +7153,9 @@ StartupXLOG(void)
(errmsg("redo starts at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
+ /* Prepare to prefetch, if configured. */
+ XLogPrefetchBegin(&prefetch);
+
/*
* main redo apply loop
*/
@@ -7181,6 +7186,12 @@ StartupXLOG(void)
/* Handle interrupt signals of startup process */
HandleStartupProcInterrupts();
+ /* Perform WAL prefetching, if enabled. */
+ XLogPrefetch(&prefetch,
+ ThisTimeLineID,
+ xlogreader->ReadRecPtr,
+ currentSource == XLOG_FROM_STREAM);
+
/*
* Pause WAL replay, if requested by a hot-standby session via
* SetRecoveryPause().
@@ -7352,6 +7363,9 @@ StartupXLOG(void)
*/
if (switchedTLI && AllowCascadeReplication())
WalSndWakeup();
+
+ /* Reset the prefetcher. */
+ XLogPrefetchReconfigure();
}
/* Exit loop if we reached inclusive recovery target */
@@ -7379,6 +7393,7 @@ StartupXLOG(void)
/*
* end of main redo apply loop
*/
+ XLogPrefetchEnd(&prefetch);
if (reachedRecoveryTarget)
{
@@ -12107,6 +12122,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
*/
currentSource = XLOG_FROM_STREAM;
startWalReceiver = true;
+ XLogPrefetchReconfigure();
break;
case XLOG_FROM_STREAM:
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
new file mode 100644
index 0000000000..7d3aea53f7
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -0,0 +1,905 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.c
+ * Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogprefetch.c
+ *
+ * The goal of this module is to read future WAL records and issue
+ * PrefetchSharedBuffer() calls for referenced blocks, so that we avoid I/O
+ * stalls in the main recovery loop. Currently, this is achieved by using a
+ * separate XLogReader to read ahead. In future, we should find a way to
+ * avoid reading and decoding each record twice.
+ *
+ * When examining a WAL record from the future, we need to consider that a
+ * referenced block or segment file might not exist on disk until this record
+ * or some earlier record has been replayed. After a crash, a file might also
+ * be missing because it was dropped by a later WAL record; in that case, it
+ * will be recreated when this record is replayed. These cases are handled by
+ * recognizing them and adding a "filter" that prevents all prefetching of a
+ * certain block range until the present WAL record has been replayed. Blocks
+ * skipped for these reasons are counted as "skip_new" (that is, cases where we
+ * didn't try to prefetch "new" blocks).
+ *
+ * Blocks found in the buffer pool already are counted as "skip_hit".
+ * Repeated access to the same buffer is detected and skipped, and this is
+ * counted with "skip_seq". Blocks that were logged with FPWs are skipped if
+ * recovery_prefetch_fpw is off, since on most systems there will be no I/O
+ * stall; this is counted with "skip_fpw".
+ *
+ * The only way we currently have to know that an I/O initiated with
+ * PrefetchSharedBuffer() has completed is to call ReadBuffer(). Therefore,
+ * we track the number of potentially in-flight I/Os by using a circular
+ * buffer of LSNs. When it's full, we have to wait for recovery to replay
+ * records so that the queue depth can be reduced, before we can do any more
+ * prefetching. Ideally, this keeps us the right distance ahead to respect
+ * maintenance_io_concurrency.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetch.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log
+ * message that appears at the end of crash recovery. It's also used to send
+ * messages periodically to the stats collector, to save the counters on disk.
+ */
+#define XLOGPREFETCHER_SAMPLE_DISTANCE 0x40000
+
+/* GUCs */
+int max_recovery_prefetch_distance = -1;
+bool recovery_prefetch_fpw = false;
+
+int XLogPrefetchReconfigureCount;
+
+/*
+ * A prefetcher object. There is at most one of these in existence at a time,
+ * recreated whenever there is a configuration change.
+ */
+struct XLogPrefetcher
+{
+ /* Reader and current reading state. */
+ XLogReaderState *reader;
+ XLogReadLocalOptions options;
+ bool have_record;
+ bool shutdown;
+ int next_block_id;
+
+ /* Details of last prefetch to skip repeats and seq scans. */
+ SMgrRelation last_reln;
+ RelFileNode last_rnode;
+ BlockNumber last_blkno;
+
+ /* Online averages. */
+ uint64 samples;
+ double avg_queue_depth;
+ double avg_distance;
+ XLogRecPtr next_sample_lsn;
+
+ /* Book-keeping required to avoid accessing non-existing blocks. */
+ HTAB *filter_table;
+ dlist_head filter_queue;
+
+ /* Book-keeping required to limit concurrent prefetches. */
+ int prefetch_head;
+ int prefetch_tail;
+ int prefetch_queue_size;
+ XLogRecPtr prefetch_queue[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+ RelFileNode rnode;
+ XLogRecPtr filter_until_replayed;
+ BlockNumber filter_from_block;
+ dlist_node link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+ pg_atomic_uint64 reset_time; /* Time of last reset. */
+ pg_atomic_uint64 prefetch; /* Prefetches initiated. */
+ pg_atomic_uint64 skip_hit; /* Blocks already buffered. */
+ pg_atomic_uint64 skip_new; /* New/missing blocks filtered. */
+ pg_atomic_uint64 skip_fpw; /* FPWs skipped. */
+ pg_atomic_uint64 skip_seq; /* Repeat blocks skipped. */
+ float avg_distance;
+ float avg_queue_depth;
+
+ /* Reset counters */
+ pg_atomic_uint32 reset_request;
+ uint32 reset_handled;
+
+ /* Dynamic values */
+ int distance; /* Number of bytes ahead in the WAL. */
+ int queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+ RelFileNode rnode,
+ BlockNumber blockno,
+ XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+ RelFileNode rnode,
+ BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+static void XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static bool XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher);
+static void XLogPrefetchSaveStats(void);
+static void XLogPrefetchRestoreStats(void);
+
+static XLogPrefetchStats *Stats;
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+ return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+ pg_atomic_write_u64(&Stats->reset_time, GetCurrentTimestamp());
+ pg_atomic_write_u64(&Stats->prefetch, 0);
+ pg_atomic_write_u64(&Stats->skip_hit, 0);
+ pg_atomic_write_u64(&Stats->skip_new, 0);
+ pg_atomic_write_u64(&Stats->skip_fpw, 0);
+ pg_atomic_write_u64(&Stats->skip_seq, 0);
+ Stats->avg_distance = 0;
+ Stats->avg_queue_depth = 0;
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+ bool found;
+
+ Stats = (XLogPrefetchStats *)
+ ShmemInitStruct("XLogPrefetchStats",
+ sizeof(XLogPrefetchStats),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u32(&Stats->reset_request, 0);
+ Stats->reset_handled = 0;
+ pg_atomic_init_u64(&Stats->reset_time, GetCurrentTimestamp());
+ pg_atomic_init_u64(&Stats->prefetch, 0);
+ pg_atomic_init_u64(&Stats->skip_hit, 0);
+ pg_atomic_init_u64(&Stats->skip_new, 0);
+ pg_atomic_init_u64(&Stats->skip_fpw, 0);
+ pg_atomic_init_u64(&Stats->skip_seq, 0);
+ Stats->avg_distance = 0;
+ Stats->avg_queue_depth = 0;
+ Stats->distance = 0;
+ Stats->queue_depth = 0;
+ }
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+ XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+ pg_atomic_fetch_add_u32(&Stats->reset_request, 1);
+}
+
+/*
+ * Tell the stats collector to serialize the shared memory counters into the
+ * stats file.
+ */
+static void
+XLogPrefetchSaveStats(void)
+{
+ PgStat_RecoveryPrefetchStats serialized = {
+ .prefetch = pg_atomic_read_u64(&Stats->prefetch),
+ .skip_hit = pg_atomic_read_u64(&Stats->skip_hit),
+ .skip_new = pg_atomic_read_u64(&Stats->skip_new),
+ .skip_fpw = pg_atomic_read_u64(&Stats->skip_fpw),
+ .skip_seq = pg_atomic_read_u64(&Stats->skip_seq),
+ .stat_reset_timestamp = pg_atomic_read_u64(&Stats->reset_time)
+ };
+
+ pgstat_send_recoveryprefetch(&serialized);
+}
+
+/*
+ * Try to restore the shared memory counters from the stats file.
+ */
+static void
+XLogPrefetchRestoreStats(void)
+{
+ PgStat_RecoveryPrefetchStats *serialized = pgstat_fetch_recoveryprefetch();
+
+ if (serialized->stat_reset_timestamp != 0)
+ {
+ pg_atomic_write_u64(&Stats->prefetch, serialized->prefetch);
+ pg_atomic_write_u64(&Stats->skip_hit, serialized->skip_hit);
+ pg_atomic_write_u64(&Stats->skip_new, serialized->skip_new);
+ pg_atomic_write_u64(&Stats->skip_fpw, serialized->skip_fpw);
+ pg_atomic_write_u64(&Stats->skip_seq, serialized->skip_seq);
+ pg_atomic_write_u64(&Stats->reset_time, serialized->stat_reset_timestamp);
+ }
+}
+
+/*
+ * Initialize an XLogPrefetchState object and restore the last saved
+ * statistics from disk.
+ */
+void
+XLogPrefetchBegin(XLogPrefetchState *state)
+{
+ XLogPrefetchRestoreStats();
+
+ /* We'll reconfigure on the first call to XLogPrefetch(). */
+ state->prefetcher = NULL;
+ state->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+}
+
+/*
+ * Shut down the prefetching infrastructure, if configured.
+ */
+void
+XLogPrefetchEnd(XLogPrefetchState *state)
+{
+ XLogPrefetchSaveStats();
+
+ if (state->prefetcher)
+ XLogPrefetcherFree(state->prefetcher);
+ state->prefetcher = NULL;
+
+ Stats->queue_depth = 0;
+ Stats->distance = 0;
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(TimeLineID tli, XLogRecPtr lsn, bool streaming)
+{
+ XLogPrefetcher *prefetcher;
+ static HASHCTL hash_table_ctl = {
+ .keysize = sizeof(RelFileNode),
+ .entrysize = sizeof(XLogPrefetcherFilter)
+ };
+
+ /*
+ * The size of the queue is based on the maintenance_io_concurrency
+ * setting. In theory we might have a separate queue for each tablespace,
+ * but it's not clear how that should work, so for now we'll just use the
+ * general GUC to rate-limit all prefetching. We add one to the size
+ * because our circular buffer has a gap between head and tail when full.
+ */
+ prefetcher = palloc0(offsetof(XLogPrefetcher, prefetch_queue) +
+ sizeof(XLogRecPtr) * (maintenance_io_concurrency + 1));
+ prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
+ prefetcher->options.tli = tli;
+ prefetcher->options.nowait = true;
+ if (streaming)
+ {
+ /*
+ * We're only allowed to read as far as the WAL receiver has written.
+ * We don't have to wait for it to be flushed, though, as recovery
+ * does, so that gives us a chance to get a bit further ahead.
+ */
+ prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+ }
+ else
+ {
+ /* Read as far as we can. */
+ prefetcher->options.read_upto_policy = XLRO_END;
+ }
+ prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+ NULL,
+ read_local_xlog_page,
+ NULL);
+ prefetcher->reader->read_page_data = &prefetcher->options;
+ prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+ &hash_table_ctl,
+ HASH_ELEM | HASH_BLOBS);
+ dlist_init(&prefetcher->filter_queue);
+
+ /* Prepare to read at the given LSN. */
+ ereport(LOG,
+ (errmsg("recovery started prefetching on timeline %u at %X/%X",
+ tli,
+ (uint32) (lsn >> 32), (uint32) lsn)));
+ XLogBeginRead(prefetcher->reader, lsn);
+
+ Stats->queue_depth = 0;
+ Stats->distance = 0;
+
+ return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+ /* Log final statistics. */
+ ereport(LOG,
+ (errmsg("recovery finished prefetching at %X/%X; "
+ "prefetch = " UINT64_FORMAT ", "
+ "skip_hit = " UINT64_FORMAT ", "
+ "skip_new = " UINT64_FORMAT ", "
+ "skip_fpw = " UINT64_FORMAT ", "
+ "skip_seq = " UINT64_FORMAT ", "
+ "avg_distance = %f, "
+ "avg_queue_depth = %f",
+ (uint32) (prefetcher->reader->EndRecPtr >> 32),
+ (uint32) (prefetcher->reader->EndRecPtr),
+ pg_atomic_read_u64(&Stats->prefetch),
+ pg_atomic_read_u64(&Stats->skip_hit),
+ pg_atomic_read_u64(&Stats->skip_new),
+ pg_atomic_read_u64(&Stats->skip_fpw),
+ pg_atomic_read_u64(&Stats->skip_seq),
+ Stats->avg_distance,
+ Stats->avg_queue_depth)));
+ XLogReaderFree(prefetcher->reader);
+ hash_destroy(prefetcher->filter_table);
+ pfree(prefetcher);
+}
+
+/*
+ * Called when recovery is replaying a new LSN, to check if we can read ahead.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ uint32 reset_request;
+
+ /* If an error has occurred or we've hit the end of the WAL, do nothing. */
+ if (prefetcher->shutdown)
+ return;
+
+ /*
+ * Have any in-flight prefetches definitely completed, judging by the LSN
+ * that is currently being replayed?
+ */
+ XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+ /*
+ * Do we already have the maximum permitted number of I/Os running
+ * (according to the information we have)? If so, we have to wait for at
+ * least one to complete, so give up early and let recovery catch up.
+ */
+ if (XLogPrefetcherSaturated(prefetcher))
+ return;
+
+ /*
+ * Can we drop any filters yet? This happens when the LSN that is
+ * currently being replayed has moved past a record that prevents
+ * prefetching of a block range, such as relation extension.
+ */
+ XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+ /*
+ * Have we been asked to reset our stats counters? This is checked with
+ * an unsynchronized memory read, but we'll see it eventually and we'll be
+ * accessing that cache line anyway.
+ */
+ reset_request = pg_atomic_read_u32(&Stats->reset_request);
+ if (reset_request != Stats->reset_handled)
+ {
+ XLogPrefetchResetStats();
+ Stats->reset_handled = reset_request;
+ prefetcher->avg_distance = 0;
+ prefetcher->avg_queue_depth = 0;
+ prefetcher->samples = 0;
+ }
+
+ /* OK, we can now try reading ahead. */
+ XLogPrefetcherScanRecords(prefetcher, replaying_lsn);
+}
+
+/*
+ * Read ahead as far as we are allowed to, considering the LSN that recovery
+ * is currently replaying.
+ */
+static void
+XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ XLogReaderState *reader = prefetcher->reader;
+
+ Assert(!XLogPrefetcherSaturated(prefetcher));
+
+ for (;;)
+ {
+ char *error;
+ int64 distance;
+
+ /* If we don't already have a record, then try to read one. */
+ if (!prefetcher->have_record)
+ {
+ if (!XLogReadRecord(reader, &error))
+ {
+ /* If we got an error, log it and give up. */
+ if (error)
+ {
+ ereport(LOG, (errmsg("recovery no longer prefetching: %s", error)));
+ prefetcher->shutdown = true;
+ Stats->queue_depth = 0;
+ Stats->distance = 0;
+ }
+ /* Otherwise, we'll try again later when more data is here. */
+ return;
+ }
+ prefetcher->have_record = true;
+ prefetcher->next_block_id = 0;
+ }
+
+ /* How far ahead of replay are we now? */
+ distance = prefetcher->reader->ReadRecPtr - replaying_lsn;
+
+ /* Update distance shown in shm. */
+ Stats->distance = distance;
+
+ /* Periodically recompute some statistics. */
+ if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+ {
+ /* Compute online averages. */
+ prefetcher->samples++;
+ if (prefetcher->samples == 1)
+ {
+ prefetcher->avg_distance = Stats->distance;
+ prefetcher->avg_queue_depth = Stats->queue_depth;
+ }
+ else
+ {
+ prefetcher->avg_distance +=
+ (Stats->distance - prefetcher->avg_distance) /
+ prefetcher->samples;
+ prefetcher->avg_queue_depth +=
+ (Stats->queue_depth - prefetcher->avg_queue_depth) /
+ prefetcher->samples;
+ }
+
+ /* Expose it in shared memory. */
+ Stats->avg_distance = prefetcher->avg_distance;
+ Stats->avg_queue_depth = prefetcher->avg_queue_depth;
+
+ /* Also periodically save the simple counters. */
+ XLogPrefetchSaveStats();
+
+ prefetcher->next_sample_lsn =
+ replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
+ }
+
+ /* Are we too far ahead of replay? */
+ if (distance >= max_recovery_prefetch_distance)
+ break;
+
+ /* Are we not far enough ahead? */
+ if (distance <= 0)
+ {
+ prefetcher->have_record = false; /* skip this record */
+ continue;
+ }
+
+ /*
+ * If this is a record that creates a new SMGR relation, we'll avoid
+ * prefetching anything from that rnode until it has been replayed.
+ */
+ if (replaying_lsn < reader->ReadRecPtr &&
+ XLogRecGetRmid(reader) == RM_SMGR_ID &&
+ (XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+ XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+ reader->ReadRecPtr);
+ }
+
+ /* Scan the record's block references. */
+ if (!XLogPrefetcherScanBlocks(prefetcher))
+ return;
+
+ /* Advance to the next record. */
+ prefetcher->have_record = false;
+ }
+}
+
+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+ XLogReaderState *reader = prefetcher->reader;
+
+ Assert(!XLogPrefetcherSaturated(prefetcher));
+
+ /*
+ * We might already have been partway through processing this record when
+ * our queue became saturated, so we need to start where we left off.
+ */
+ for (int block_id = prefetcher->next_block_id;
+ block_id <= reader->max_block_id;
+ ++block_id)
+ {
+ PrefetchBufferResult prefetch;
+ DecodedBkpBlock *block = &reader->blocks[block_id];
+ SMgrRelation reln;
+
+ /* Ignore everything but the main fork for now. */
+ if (block->forknum != MAIN_FORKNUM)
+ continue;
+
+ /*
+ * If there is a full page image attached, we won't be reading the
+ * page, so you might think we should skip it. However, if the
+ * underlying filesystem uses larger logical blocks than us, it
+ * might still need to perform a read-before-write some time later.
+ * Therefore, only prefetch if configured to do so.
+ */
+ if (block->has_image && !recovery_prefetch_fpw)
+ {
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
+ continue;
+ }
+
+ /*
+ * If this block will initialize a new page then it's probably an
+ * extension. Since it might create a new segment, we can't try
+ * to prefetch this block until the record has been replayed, or we
+ * might try to open a file that doesn't exist yet.
+ */
+ if (block->flags & BKPBLOCK_WILL_INIT)
+ {
+ XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+ reader->ReadRecPtr);
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+ continue;
+ }
+
+ /* Should we skip this block due to a filter? */
+ if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+ {
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+ continue;
+ }
+
+ /* Fast path for repeated references to the same relation. */
+ if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+ {
+ /*
+ * If this is a repeat access to the same block, then skip it.
+ *
+ * XXX We could also check for last_blkno + 1 too, and also update
+ * last_blkno; it's not clear if the kernel would do a better job
+ * of sequential prefetching.
+ */
+ if (block->blkno == prefetcher->last_blkno)
+ {
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
+ continue;
+ }
+
+ /* We can avoid calling smgropen(). */
+ reln = prefetcher->last_reln;
+ }
+ else
+ {
+ /* Otherwise we have to open it. */
+ reln = smgropen(block->rnode, InvalidBackendId);
+ prefetcher->last_rnode = block->rnode;
+ prefetcher->last_reln = reln;
+ }
+ prefetcher->last_blkno = block->blkno;
+
+ /* Try to prefetch this block! */
+ prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+ if (BufferIsValid(prefetch.recent_buffer))
+ {
+ /*
+ * It was already cached, so do nothing. Perhaps in future we
+ * could remember the buffer so that recovery doesn't have to look
+ * it up again.
+ */
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
+ }
+ else if (prefetch.initiated_io)
+ {
+ /*
+ * I/O has possibly been initiated (though we don't know if it
+ * was already cached by the kernel, so we just have to assume
+ * that it has due to lack of better information). Record
+ * this as an I/O in progress until eventually we replay this
+ * LSN.
+ */
+ pg_atomic_unlocked_add_fetch_u64(&Stats->prefetch, 1);
+ XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+ /*
+ * If the queue is now full, we'll have to wait before processing
+ * any more blocks from this record, or move to a new record if
+ * that was the last block.
+ */
+ if (XLogPrefetcherSaturated(prefetcher))
+ {
+ prefetcher->next_block_id = block_id + 1;
+ return false;
+ }
+ }
+ else
+ {
+ /*
+ * Neither cached nor initiated. The underlying segment file
+ * doesn't exist. Presumably it will be unlinked by a later WAL
+ * record. When recovery reads this block, it will use the
+ * EXTENSION_CREATE_RECOVERY flag. We certainly don't want to do
+ * that sort of thing while merely prefetching, so let's just
+ * ignore references to this relation until this record is
+ * replayed, and let recovery create the dummy file or complain if
+ * something is wrong.
+ */
+ XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+ reader->ReadRecPtr);
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+ }
+ }
+
+ return true;
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ Datum values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+ bool nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mod required, but it is not allowed in this context")));
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ if (pg_atomic_read_u32(&Stats->reset_request) != Stats->reset_handled)
+ {
+ /* There's an unhandled reset request, so just show NULLs */
+ for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+ nulls[i] = true;
+ }
+ else
+ {
+ for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+ nulls[i] = false;
+ }
+
+ values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&Stats->reset_time));
+ values[1] = Int64GetDatum(pg_atomic_read_u64(&Stats->prefetch));
+ values[2] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_hit));
+ values[3] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_new));
+ values[4] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_fpw));
+ values[5] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_seq));
+ values[6] = Int32GetDatum(Stats->distance);
+ values[7] = Int32GetDatum(Stats->queue_depth);
+ values[8] = Float4GetDatum(Stats->avg_distance);
+ values[9] = Float4GetDatum(Stats->avg_queue_depth);
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+ BlockNumber blockno, XLogRecPtr lsn)
+{
+ XLogPrefetcherFilter *filter;
+ bool found;
+
+ filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+ if (!found)
+ {
+ /*
+ * Don't allow any prefetching of this block or higher until replayed.
+ */
+ filter->filter_until_replayed = lsn;
+ filter->filter_from_block = blockno;
+ dlist_push_head(&prefetcher->filter_queue, &filter->link);
+ }
+ else
+ {
+ /*
+ * We were already filtering this rnode. Extend the filter's lifetime
+ * to cover this WAL record, but leave the (presumably lower) block
+ * number there because we don't want to have to track individual
+ * blocks.
+ */
+ filter->filter_until_replayed = lsn;
+ dlist_delete(&filter->link);
+ dlist_push_head(&prefetcher->filter_queue, &filter->link);
+ }
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range? That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+ {
+ XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+ link,
+ &prefetcher->filter_queue);
+
+ if (filter->filter_until_replayed >= replaying_lsn)
+ break;
+ dlist_delete(&filter->link);
+ hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+ }
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+ BlockNumber blockno)
+{
+ /*
+ * Test for empty queue first, because we expect it to be empty most of the
+ * time and we can avoid the hash table lookup in that case.
+ */
+ if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+ {
+ XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+ HASH_FIND, NULL);
+
+ if (filter && filter->filter_from_block <= blockno)
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Insert an LSN into the queue. The queue must not be full already. This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr prefetching_lsn)
+{
+ Assert(!XLogPrefetcherSaturated(prefetcher));
+ prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+ prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+ Stats->queue_depth++;
+ Assert(Stats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet? That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches. For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+ prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+ {
+ prefetcher->prefetch_tail++;
+ prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+ Stats->queue_depth--;
+ Assert(Stats->queue_depth >= 0);
+ }
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+ return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+ prefetcher->prefetch_tail;
+}
+
+void
+assign_max_recovery_prefetch_distance(int new_value, void *extra)
+{
+ /* Reconfigure prefetching, because a setting it depends on changed. */
+ max_recovery_prefetch_distance = new_value;
+ if (AmStartupProcess())
+ XLogPrefetchReconfigure();
+}
+
+void
+assign_recovery_prefetch_fpw(bool new_value, void *extra)
+{
+ /* Reconfigure prefetching, because a setting it depends on changed. */
+ recovery_prefetch_fpw = new_value;
+ if (AmStartupProcess())
+ XLogPrefetchReconfigure();
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d406ea8118..3b15f5ef8e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -825,6 +825,20 @@ CREATE VIEW pg_stat_wal_receiver AS
FROM pg_stat_get_wal_receiver() s
WHERE s.pid IS NOT NULL;
+CREATE VIEW pg_stat_prefetch_recovery AS
+ SELECT
+ s.stats_reset,
+ s.prefetch,
+ s.skip_hit,
+ s.skip_new,
+ s.skip_fpw,
+ s.skip_seq,
+ s.distance,
+ s.queue_depth,
+ s.avg_distance,
+ s.avg_queue_depth
+ FROM pg_stat_get_prefetch_recovery() s;
+
CREATE VIEW pg_stat_subscription AS
SELECT
su.oid AS subid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 50eea2e8a8..6c9ac5b29b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
#include "access/transam.h"
#include "access/twophase_rmgr.h"
#include "access/xact.h"
+#include "access/xlogprefetch.h"
#include "catalog/pg_database.h"
#include "catalog/pg_proc.h"
#include "common/ip.h"
@@ -276,6 +277,7 @@ static int localNumBackends = 0;
static PgStat_ArchiverStats archiverStats;
static PgStat_GlobalStats globalStats;
static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
+static PgStat_RecoveryPrefetchStats recoveryPrefetchStats;
/*
* List of OIDs of databases we need to write out. If an entry is InvalidOid,
@@ -348,6 +350,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
+static void pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1364,11 +1367,20 @@ pgstat_reset_shared_counters(const char *target)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
+ else if (strcmp(target, "prefetch_recovery") == 0)
+ {
+ /*
+ * We can't ask the stats collector to do this for us as it is not
+ * attached to shared memory.
+ */
+ XLogPrefetchRequestResetStats();
+ return;
+ }
else
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\" or \"bgwriter\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\" or \"prefetch_recovery\".")));
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
pgstat_send(&msg, sizeof(msg));
@@ -2690,6 +2702,22 @@ pgstat_fetch_slru(void)
}
+/*
+ * ---------
+ * pgstat_fetch_recoveryprefetch() -
+ *
+ * Support function for restoring the counters managed by xlogprefetch.c.
+ * ---------
+ */
+PgStat_RecoveryPrefetchStats *
+pgstat_fetch_recoveryprefetch(void)
+{
+ backend_read_statsfile();
+
+ return &recoveryPrefetchStats;
+}
+
+
/* ------------------------------------------------------------
* Functions for management of the shared-memory PgBackendStatus array
* ------------------------------------------------------------
@@ -4440,6 +4468,23 @@ pgstat_send_slru(void)
}
+/* ----------
+ * pgstat_send_recoveryprefetch() -
+ *
+ * Send recovery prefetch statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats)
+{
+ PgStat_MsgRecoveryPrefetch msg;
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYPREFETCH);
+ msg.m_stats = *stats;
+ pgstat_send(&msg, sizeof(msg));
+}
+
+
/* ----------
* PgstatCollectorMain() -
*
@@ -4636,6 +4681,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_slru(&msg.msg_slru, len);
break;
+ case PGSTAT_MTYPE_RECOVERYPREFETCH:
+ pgstat_recv_recoveryprefetch(&msg.msg_recoveryprefetch, len);
+ break;
+
case PGSTAT_MTYPE_FUNCSTAT:
pgstat_recv_funcstat(&msg.msg_funcstat, len);
break;
@@ -4911,6 +4960,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
(void) rc; /* we'll check for error with ferror */
+ /*
+ * Write recovery prefetch stats struct
+ */
+ rc = fwrite(&recoveryPrefetchStats, sizeof(recoveryPrefetchStats), 1,
+ fpout);
+ (void) rc; /* we'll check for error with ferror */
+
/*
* Walk through the database table.
*/
@@ -5170,6 +5226,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
memset(&globalStats, 0, sizeof(globalStats));
memset(&archiverStats, 0, sizeof(archiverStats));
memset(&slruStats, 0, sizeof(slruStats));
+ memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
/*
* Set the current timestamp (will be kept only in case we can't load an
@@ -5257,6 +5314,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
goto done;
}
+ /*
+ * Read recoveryPrefetchStats struct
+ */
+ if (fread(&recoveryPrefetchStats, 1, sizeof(recoveryPrefetchStats),
+ fpin) != sizeof(recoveryPrefetchStats))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", statfile)));
+ memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
+ goto done;
+ }
+
/*
* We found an existing collector stats file. Read it and put all the
* hashtable entries into place.
@@ -5556,6 +5625,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
PgStat_GlobalStats myGlobalStats;
PgStat_ArchiverStats myArchiverStats;
PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
+ PgStat_RecoveryPrefetchStats myRecoveryPrefetchStats;
FILE *fpin;
int32 format_id;
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -5621,6 +5691,18 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
return false;
}
+ /*
+ * Read recovery prefetch stats struct
+ */
+ if (fread(&myRecoveryPrefetchStats, 1, sizeof(myRecoveryPrefetchStats),
+ fpin) != sizeof(myRecoveryPrefetchStats))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", statfile)));
+ FreeFile(fpin);
+ return false;
+ }
+
/* By default, we're going to return the timestamp of the global file. */
*ts = myGlobalStats.stats_timestamp;
@@ -6420,6 +6502,18 @@ pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
slruStats[msg->m_index].truncate += msg->m_truncate;
}
+/* ----------
+ * pgstat_recv_recoveryprefetch() -
+ *
+ * Process a recovery prefetch message.
+ * ----------
+ */
+static void
+pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len)
+{
+ recoveryPrefetchStats = msg->m_stats;
+}
+
/* ----------
* pgstat_recv_recoveryconflict() -
*
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 417840a8f1..a965ab9d35 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
#include "access/nbtree.h"
#include "access/subtrans.h"
#include "access/twophase.h"
+#include "access/xlogprefetch.h"
#include "commands/async.h"
#include "commands/wait.h"
#include "miscadmin.h"
@@ -125,6 +126,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, PredicateLockShmemSize());
size = add_size(size, ProcGlobalShmemSize());
size = add_size(size, XLOGShmemSize());
+ size = add_size(size, XLogPrefetchShmemSize());
size = add_size(size, CLOGShmemSize());
size = add_size(size, CommitTsShmemSize());
size = add_size(size, SUBTRANSShmemSize());
@@ -214,6 +216,7 @@ CreateSharedMemoryAndSemaphores(void)
* Set up xlog, clog, and buffers
*/
XLOGShmemInit();
+ XLogPrefetchShmemInit();
CLOGShmemInit();
CommitTsShmemInit();
SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5bdc02fce2..5ed7ed13e8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -34,6 +34,7 @@
#include "access/twophase.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
+#include "access/xlogprefetch.h"
#include "catalog/namespace.h"
#include "catalog/pg_authid.h"
#include "catalog/storage.h"
@@ -198,6 +199,7 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
static void assign_pgstat_temp_directory(const char *newval, void *extra);
static bool check_application_name(char **newval, void **extra, GucSource source);
static void assign_application_name(const char *newval, void *extra);
@@ -1272,6 +1274,18 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"recovery_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Prefetch blocks that have full page images in the WAL"),
+ gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+ "entirely overwritten, but if the logical page size of the filesystem is "
+ "larger than PostgreSQL's, this can be beneficial. This option has no "
+ "effect unless max_recovery_prefetch_distance is set to a positive number.")
+ },
+ &recovery_prefetch_fpw,
+ false,
+ NULL, assign_recovery_prefetch_fpw, NULL
+ },
{
{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2649,6 +2663,22 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"max_recovery_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+ gettext_noop("Maximum number of bytes to read ahead in the WAL to prefetch referenced blocks."),
+ gettext_noop("Set to -1 to disable prefetching during recovery."),
+ GUC_UNIT_BYTE
+ },
+ &max_recovery_prefetch_distance,
+#ifdef USE_PREFETCH
+ 256 * 1024,
+#else
+ -1,
+#endif
+ -1, INT_MAX,
+ NULL, assign_max_recovery_prefetch_distance, NULL
+ },
+
{
{"wal_keep_segments", PGC_SIGHUP, REPLICATION_SENDING,
gettext_noop("Sets the number of WAL files held for standby servers."),
@@ -2968,7 +2998,8 @@ static struct config_int ConfigureNamesInt[] =
0,
#endif
0, MAX_IO_CONCURRENCY,
- check_maintenance_io_concurrency, NULL, NULL
+ check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+ NULL
},
{
@@ -11586,6 +11617,20 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
return true;
}
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+ /*
+ * Reconfigure recovery prefetching, because a setting it depends on
+ * changed.
+ */
+ maintenance_io_concurrency = newval;
+ if (AmStartupProcess())
+ XLogPrefetchReconfigure();
+#endif
+}
+
static void
assign_pgstat_temp_directory(const char *newval, void *extra)
{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 995b6ca155..55cce90763 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -230,6 +230,11 @@
#checkpoint_flush_after = 0 # measured in pages, 0 disables
#checkpoint_warning = 30s # 0 disables
+# - Prefetching during recovery -
+
+#max_recovery_prefetch_distance = 256kB # -1 disables prefetching
+#recovery_prefetch_fpw = off # whether to prefetch pages logged with FPW
+
# - Archiving -
#archive_mode = off # enables archiving; off, on, or always
diff --git a/src/include/access/xlogprefetch.h b/src/include/access/xlogprefetch.h
new file mode 100644
index 0000000000..d8e2e1ca50
--- /dev/null
+++ b/src/include/access/xlogprefetch.h
@@ -0,0 +1,85 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.h
+ * Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/xlogprefetch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCH_H
+#define XLOGPREFETCH_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern int max_recovery_prefetch_distance;
+extern bool recovery_prefetch_fpw;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern int XLogPrefetchReconfigureCount;
+
+typedef struct XLogPrefetchState
+{
+ XLogPrefetcher *prefetcher;
+ int reconfigure_count;
+} XLogPrefetchState;
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchReconfigure(void);
+extern void XLogPrefetchRequestResetStats(void);
+
+extern void XLogPrefetchBegin(XLogPrefetchState *state);
+extern void XLogPrefetchEnd(XLogPrefetchState *state);
+
+/* Functions exposed only for the use of XLogPrefetch(). */
+extern XLogPrefetcher *XLogPrefetcherAllocate(TimeLineID tli,
+ XLogRecPtr lsn,
+ bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch,
+ XLogRecPtr replaying_lsn);
+
+/*
+ * Tell the prefetching module that we are now replaying a given LSN, so that
+ * it can decide how far ahead to read in the WAL, if configured.
+ */
+static inline void
+XLogPrefetch(XLogPrefetchState *state,
+ TimeLineID replaying_tli,
+ XLogRecPtr replaying_lsn,
+ bool from_stream)
+{
+ /*
+ * Handle any configuration changes. Rather than trying to deal with
+ * various parameter changes, we just tear down and set up a new
+ * prefetcher if anything we depend on changes.
+ */
+ if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
+ {
+ /* If we had a prefetcher, tear it down. */
+ if (state->prefetcher)
+ {
+ XLogPrefetcherFree(state->prefetcher);
+ state->prefetcher = NULL;
+ }
+ /* If we want a prefetcher, set it up. */
+ if (max_recovery_prefetch_distance > 0)
+ state->prefetcher = XLogPrefetcherAllocate(replaying_tli,
+ replaying_lsn,
+ from_stream);
+ state->reconfigure_count = XLogPrefetchReconfigureCount;
+ }
+
+ if (state->prefetcher)
+ XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
+}
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4bce3ad8de..9f5f0ed4c8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6138,6 +6138,14 @@
prorettype => 'bool', proargtypes => '',
prosrc => 'pg_is_wal_replay_paused' },
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+ proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+ proretset => 't', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,float4,float4}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{stats_reset,prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth,avg_distance,avg_queue_depth}',
+ prosrc => 'pg_stat_get_prefetch_recovery' },
+
{ oid => '2621', descr => 'reload configuration files',
proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9988..105c2e77d2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -63,6 +63,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_SLRU,
+ PGSTAT_MTYPE_RECOVERYPREFETCH,
PGSTAT_MTYPE_FUNCSTAT,
PGSTAT_MTYPE_FUNCPURGE,
PGSTAT_MTYPE_RECOVERYCONFLICT,
@@ -183,6 +184,19 @@ typedef struct PgStat_TableXactStatus
struct PgStat_TableXactStatus *next; /* next of same subxact */
} PgStat_TableXactStatus;
+/*
+ * Recovery prefetching statistics persisted on disk by pgstat.c, but kept in
+ * shared memory by xlogprefetch.c.
+ */
+typedef struct PgStat_RecoveryPrefetchStats
+{
+ PgStat_Counter prefetch;
+ PgStat_Counter skip_hit;
+ PgStat_Counter skip_new;
+ PgStat_Counter skip_fpw;
+ PgStat_Counter skip_seq;
+ TimestampTz stat_reset_timestamp;
+} PgStat_RecoveryPrefetchStats;
/* ------------------------------------------------------------
* Message formats follow
@@ -454,6 +468,16 @@ typedef struct PgStat_MsgSLRU
PgStat_Counter m_truncate;
} PgStat_MsgSLRU;
+/* ----------
+ * PgStat_MsgRecoveryPrefetch Sent by XLogPrefetch to save statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgRecoveryPrefetch
+{
+ PgStat_MsgHdr m_hdr;
+ PgStat_RecoveryPrefetchStats m_stats;
+} PgStat_MsgRecoveryPrefetch;
+
/* ----------
* PgStat_MsgRecoveryConflict Sent by the backend upon recovery conflict
* ----------
@@ -598,6 +622,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgSLRU msg_slru;
+ PgStat_MsgRecoveryPrefetch msg_recoveryprefetch;
PgStat_MsgFuncstat msg_funcstat;
PgStat_MsgFuncpurge msg_funcpurge;
PgStat_MsgRecoveryConflict msg_recoveryconflict;
@@ -1464,6 +1489,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
/* ----------
* Support functions for the SQL-callable functions to
@@ -1479,6 +1505,7 @@ extern int pgstat_fetch_stat_numbackends(void);
extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
extern PgStat_GlobalStats *pgstat_fetch_global(void);
extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_RecoveryPrefetchStats *pgstat_fetch_recoveryprefetch(void);
extern void pgstat_count_slru_page_zeroed(SlruCtl ctl);
extern void pgstat_count_slru_page_hit(SlruCtl ctl);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 2819282181..976cf8b116 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -440,4 +440,8 @@ extern void assign_search_path(const char *newval, void *extra);
extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+/* in access/transam/xlogprefetch.c */
+extern void assign_max_recovery_prefetch_distance(int new_value, void *extra);
+extern void assign_recovery_prefetch_fpw(bool new_value, void *extra);
+
#endif /* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac31840739..942a07ffee 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1857,6 +1857,17 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+ s.prefetch,
+ s.skip_hit,
+ s.skip_new,
+ s.skip_fpw,
+ s.skip_seq,
+ s.distance,
+ s.queue_depth,
+ s.avg_distance,
+ s.avg_queue_depth
+ FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth, avg_distance, avg_queue_depth);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
--
2.20.1
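In case it helps anyone testing this: the view and reset target added by the
patch above can be exercised on a standby like this (just a sketch; the GUC
values you'd pair with it, such as max_recovery_prefetch_distance = 256kB and
recovery_prefetch_fpw = on, are only illustrative, and all of them are SIGHUP
so a reload is enough):

-- While recovery is running, watch the prefetcher:
SELECT prefetch, skip_hit, skip_new, skip_fpw, skip_seq,
       distance, queue_depth, avg_distance, avg_queue_depth
FROM pg_stat_prefetch_recovery;

-- Reset the counters (handled by the startup process, not the collector):
SELECT pg_stat_reset_shared('prefetch_recovery');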
On Wed, Apr 8, 2020 at 11:27 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Apr 8, 2020 at 12:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
* he gave some feedback on the read_local_xlog_page() modifications: I
probably need to reconsider the change to logical.c that passes NULL
instead of cxt to the read_page callback; and the switch statement in
read_local_xlog_page() probably should have a case for the preexisting
mode.

So... logical.c wants to give its LogicalDecodingContext to any
XLogPageReadCB you give it, via "private_data"; that is, it really
only accepts XLogPageReadCB implementations that understand that (or
ignore it). What I want to do is give every XLogPageReadCB the chance
to have its own state that it is control of (to receive settings
specific to the implementation, or whatever), that you supply along
with it. We can't do both kinds of things with private_data, so I
have added a second member read_page_data to XLogReaderState. If you
pass in read_local_xlog_page as read_page, then you can optionally
install a pointer to XLogReadLocalOptions as reader->read_page_data,
to activate the new behaviours I added for prefetching purposes.

While working on that, I realised the readahead XLogReader was
breaking a rule expressed in XLogReadDetermineTimeLine(). Timelines
are really confusing and there were probably several subtle or not so
subtle bugs there. So I added an option to skip all of that logic,
and just say "I command you to read only from TLI X". It reads the
same TLI as recovery is reading, until it hits the end of readable
data and that causes prefetching to shut down. Then the main recovery
loop resets the prefetching module when it sees a TLI switch, so then
it starts up again. This seems to work reliably, but I've obviously
had limited time to test. Does this scheme sound sane?

I think this is basically committable (though of course I wish I had
more time to test and review). Ugh. Feature freeze in half an hour.
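To make that a bit more concrete, here is roughly what a caller of the new
interface looks like, condensed from the prefetcher code above (a sketch, not
the final API; "streaming" and "start_lsn" stand in for whatever state the
real caller has):

/*
 * Sketch: attach XLogReadLocalOptions to an XLogReader via the new
 * read_page_data member, and see how "would block" shows up.
 */
XLogReadLocalOptions options = {0};
XLogReaderState *reader;
XLogRecord *record;
char	   *error;

options.tli = ThisTimeLineID;			/* only read this timeline */
options.nowait = true;					/* never sleep waiting for WAL */
options.read_upto_policy =
	streaming ? XLRO_WALRCV_WRITTEN : XLRO_END;

reader = XLogReaderAllocate(wal_segment_size, NULL,
							read_local_xlog_page, NULL);
reader->read_page_data = &options;		/* private to the read_page callback */

XLogBeginRead(reader, start_lsn);
record = XLogReadRecord(reader, &error);
if (record == NULL && error == NULL)
{
	/* XLOGPAGEREAD_WOULDBLOCK case: no more WAL is readable yet */
}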
Ok, so the following parts of this work have been committed:
b09ff536: Simplify the effective_io_concurrency setting.
fc34b0d9: Introduce a maintenance_io_concurrency setting.
3985b600: Support PrefetchBuffer() in recovery.
d140f2f3: Rationalize GetWalRcv{Write,Flush}RecPtr().
However, I didn't want to push the main patch into the tree at
(literally) the last minute after doing so much work on it in the
last few days, without more review from recovery code experts and some
independent testing. Judging by the comments made in this thread and
elsewhere, I think the feature is in demand so I hope there is a way
we could get it into 13 in the next couple of days, but I totally
accept the release management team's prerogative on that.
On 4/8/20 8:12 AM, Thomas Munro wrote:
Ok, so the following parts of this work have been committed:
b09ff536: Simplify the effective_io_concurrency setting.
fc34b0d9: Introduce a maintenance_io_concurrency setting.
3985b600: Support PrefetchBuffer() in recovery.
d140f2f3: Rationalize GetWalRcv{Write,Flush}RecPtr().

However, I didn't want to push the main patch into the tree at
(literally) the last minute after doing so much work on it in the
last few days, without more review from recovery code experts and some
independent testing.
I definitely think that was the right call.
Judging by the comments made in this thread and
elsewhere, I think the feature is in demand so I hope there is a way
we could get it into 13 in the next couple of days, but I totally
accept the release management team's prerogative on that.
That's up to the RMT, of course, but we did already have an extra week.
Might be best to just get this in at the beginning of the PG14 cycle.
FWIW, I do think the feature is really valuable.
Looks like you'll need to rebase, so I'll move this to the next CF in
WoA state.
Regards,
--
-David
david@pgmasters.net
On Thu, Apr 9, 2020 at 12:27 AM David Steele <david@pgmasters.net> wrote:
On 4/8/20 8:12 AM, Thomas Munro wrote:
Judging by the comments made in this thread and
elsewhere, I think the feature is in demand so I hope there is a way
we could get it into 13 in the next couple of days, but I totally
accept the release management team's prerogative on that.

That's up to the RMT, of course, but we did already have an extra week.
Might be best to just get this in at the beginning of the PG14 cycle.
FWIW, I do think the feature is really valuable.

Looks like you'll need to rebase, so I'll move this to the next CF in
WoA state.
Thanks. Here's a rebase.
Attachments:
v8-0001-Add-pg_atomic_unlocked_add_fetch_XXX.patch (text/x-patch)
From db0d2774ac0faf9284e14ad243fefb940e1bc173 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 28 Mar 2020 11:42:59 +1300
Subject: [PATCH v8 1/3] Add pg_atomic_unlocked_add_fetch_XXX().
Add a variant of pg_atomic_add_fetch_XXX with no barrier semantics, for
cases where you only want to avoid the possibility that a concurrent
pg_atomic_read_XXX() sees a torn/partial value.
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/include/port/atomics.h | 24 ++++++++++++++++++++++
src/include/port/atomics/generic.h | 33 ++++++++++++++++++++++++++++++
2 files changed, 57 insertions(+)
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 4956ec55cb..2abb852893 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -389,6 +389,21 @@ pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
return pg_atomic_add_fetch_u32_impl(ptr, add_);
}
+/*
+ * pg_atomic_unlocked_add_fetch_u32 - atomically add to variable
+ *
+ * Like pg_atomic_unlocked_write_u32, guarantees only that partial values
+ * cannot be observed.
+ *
+ * No barrier semantics.
+ */
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+ AssertPointerAlignment(ptr, 4);
+ return pg_atomic_unlocked_add_fetch_u32_impl(ptr, add_);
+}
+
/*
* pg_atomic_sub_fetch_u32 - atomically subtract from variable
*
@@ -519,6 +534,15 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
}
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+ AssertPointerAlignment(ptr, 8);
+#endif
+ return pg_atomic_unlocked_add_fetch_u64_impl(ptr, add_);
+}
+
#undef INSIDE_ATOMICS_H
#endif /* ATOMICS_H */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index d3ba89a58f..1683653ca6 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -234,6 +234,16 @@ pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
}
#endif
+#if !defined(PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32)
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+ ptr->value += add_;
+ return ptr->value;
+}
+#endif
+
#if !defined(PG_HAVE_ATOMIC_SUB_FETCH_U32) && defined(PG_HAVE_ATOMIC_FETCH_SUB_U32)
#define PG_HAVE_ATOMIC_SUB_FETCH_U32
static inline uint32
@@ -399,3 +409,26 @@ pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
}
#endif
+
+#if defined(PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY) && \
+ !defined(PG_HAVE_ATOMIC_U64_SIMULATION)
+
+#ifndef PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+ ptr->value += val;
+ return ptr->value;
+}
+#endif
+
+#else
+
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+ return pg_atomic_add_fetch_u64_impl(ptr, val);
+}
+
+#endif
--
2.20.1
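To show the usage pattern the new unlocked add is aimed at (not part of the
patch, just a sketch): a counter with a single writer and any number of
readers, where all we need is that readers never observe a torn value.

/* Sketch only: a single-writer statistics counter. */
#include "postgres.h"
#include "port/atomics.h"

typedef struct MyCounters
{
	pg_atomic_uint64 prefetch_count;
} MyCounters;

static MyCounters *counters;	/* lives in shared memory (not shown) */

/*
 * Writer side: only one process ever calls this, so no lock or barrier is
 * needed; the unlocked variant still guarantees readers can't see a torn
 * partial value.
 */
static void
bump_prefetch_count(void)
{
	pg_atomic_unlocked_add_fetch_u64(&counters->prefetch_count, 1);
}

/* Reader side: any backend. */
static uint64
read_prefetch_count(void)
{
	return pg_atomic_read_u64(&counters->prefetch_count);
}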
v8-0002-Allow-XLogReadRecord-to-be-non-blocking.patch (text/x-patch)
From 743e11495e81af1f96ca304baf130b20dba056e5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 7 Apr 2020 22:56:27 +1200
Subject: [PATCH v8 2/3] Allow XLogReadRecord() to be non-blocking.
Extend read_local_xlog_page() to support non-blocking modes:
1. Reading as far as the WAL receiver has written so far.
2. Reading all the way to the end, when the end LSN is unknown.
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/access/transam/xlogreader.c | 37 ++++--
src/backend/access/transam/xlogutils.c | 151 +++++++++++++++++-------
src/backend/replication/walsender.c | 2 +-
src/include/access/xlogreader.h | 20 +++-
src/include/access/xlogutils.h | 23 ++++
5 files changed, 178 insertions(+), 55 deletions(-)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 79ff976474..554b2029da 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -257,6 +257,9 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
* If the reading fails for some other reason, NULL is also returned, and
* *errormsg is set to a string with details of the failure.
*
+ * If the read_page callback is one that returns XLOGPAGEREAD_WOULDBLOCK rather
+ * than waiting for WAL to arrive, NULL is also returned in that case.
+ *
* The returned pointer (or *errormsg) points to an internal buffer that's
* valid until the next call to XLogReadRecord.
*/
@@ -546,10 +549,11 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
err:
/*
- * Invalidate the read state. We might read from a different source after
- * failure.
+ * Invalidate the read state, if this was an error. We might read from a
+ * different source after failure.
*/
- XLogReaderInvalReadState(state);
+ if (readOff != XLOGPAGEREAD_WOULDBLOCK)
+ XLogReaderInvalReadState(state);
if (state->errormsg_buf[0] != '\0')
*errormsg = state->errormsg_buf;
@@ -561,8 +565,9 @@ err:
* Read a single xlog page including at least [pageptr, reqLen] of valid data
* via the read_page() callback.
*
- * Returns -1 if the required page cannot be read for some reason; errormsg_buf
- * is set in that case (unless the error occurs in the read_page callback).
+ * Returns XLOGPAGEREAD_ERROR or XLOGPAGEREAD_WOULDBLOCK if the required page
+ * cannot be read for some reason; errormsg_buf is set in the former case
+ * (unless the error occurs in the read_page callback).
*
* We fetch the page from a reader-local cache if we know we have the required
* data and if there hasn't been any error since caching the data.
@@ -659,8 +664,11 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
return readLen;
err:
+ if (readLen == XLOGPAGEREAD_WOULDBLOCK)
+ return XLOGPAGEREAD_WOULDBLOCK;
+
XLogReaderInvalReadState(state);
- return -1;
+ return XLOGPAGEREAD_ERROR;
}
/*
@@ -939,6 +947,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
XLogRecPtr found = InvalidXLogRecPtr;
XLogPageHeader header;
char *errormsg;
+ int readLen;
Assert(!XLogRecPtrIsInvalid(RecPtr));
@@ -952,7 +961,6 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
XLogRecPtr targetPagePtr;
int targetRecOff;
uint32 pageHeaderSize;
- int readLen;
/*
* Compute targetRecOff. It should typically be equal or greater than
@@ -1033,7 +1041,8 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
}
err:
- XLogReaderInvalReadState(state);
+ if (readLen != XLOGPAGEREAD_WOULDBLOCK)
+ XLogReaderInvalReadState(state);
return InvalidXLogRecPtr;
}
@@ -1084,13 +1093,23 @@ WALRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
tli != seg->ws_tli)
{
XLogSegNo nextSegNo;
-
if (seg->ws_file >= 0)
close(seg->ws_file);
XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize);
seg->ws_file = openSegment(nextSegNo, segcxt, &tli);
+ /* callback reported that there was no such file */
+ if (seg->ws_file < 0)
+ {
+ errinfo->wre_errno = errno;
+ errinfo->wre_req = 0;
+ errinfo->wre_read = 0;
+ errinfo->wre_off = startoff;
+ errinfo->wre_seg = *seg;
+ return false;
+ }
+
/* Update the current segment info. */
seg->ws_tli = tli;
seg->ws_segno = nextSegNo;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 6cb143e161..2d702437dd 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walreceiver.h"
#include "storage/smgr.h"
#include "utils/guc.h"
#include "utils/hsearch.h"
@@ -783,6 +784,30 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
}
}
+/*
+ * openSegment callback for WALRead that returns -1 (with errno set) if the
+ * requested segment file does not exist, instead of raising an error.
+ */
+static int
+wal_segment_try_open(XLogSegNo nextSegNo,
+ WALSegmentContext *segcxt,
+ TimeLineID *tli_p)
+{
+ TimeLineID tli = *tli_p;
+ char path[MAXPGPATH];
+ int fd;
+
+ XLogFilePath(path, tli, nextSegNo, segcxt->ws_segsize);
+ fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+ if (fd >= 0)
+ return fd;
+
+ if (errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+ path)));
+
+ return -1; /* ENOENT: tell the caller the segment doesn't exist */
+}
+
/* openSegment callback for WALRead */
static int
wal_segment_open(XLogSegNo nextSegNo, WALSegmentContext * segcxt,
@@ -831,58 +856,92 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
TimeLineID tli;
int count;
WALReadError errinfo;
+ bool try_read = false;
+ XLogReadLocalOptions *options =
+ (XLogReadLocalOptions *) state->read_page_data;
loc = targetPagePtr + reqLen;
/* Loop waiting for xlog to be available if necessary */
while (1)
{
- /*
- * Determine the limit of xlog we can currently read to, and what the
- * most recent timeline is.
- *
- * RecoveryInProgress() will update ThisTimeLineID when it first
- * notices recovery finishes, so we only have to maintain it for the
- * local process until recovery ends.
- */
- if (!RecoveryInProgress())
- read_upto = GetFlushRecPtr();
- else
- read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
- tli = ThisTimeLineID;
+ switch (options ? options->read_upto_policy : -1)
+ {
+ case XLRO_WALRCV_WRITTEN:
+ /*
+ * We'll try to read as far as has been written by the WAL
+ * receiver, on the requested timeline. When we run out of valid
+ * data, we'll return an error. This is used by xlogprefetch.c
+ * while streaming.
+ */
+ read_upto = GetWalRcvWriteRecPtr();
+ try_read = true;
+ state->currTLI = tli = options->tli;
+ break;
- /*
- * Check which timeline to get the record from.
- *
- * We have to do it each time through the loop because if we're in
- * recovery as a cascading standby, the current timeline might've
- * become historical. We can't rely on RecoveryInProgress() because in
- * a standby configuration like
- *
- * A => B => C
- *
- * if we're a logical decoding session on C, and B gets promoted, our
- * timeline will change while we remain in recovery.
- *
- * We can't just keep reading from the old timeline as the last WAL
- * archive in the timeline will get renamed to .partial by
- * StartupXLOG().
- *
- * If that happens after our caller updated ThisTimeLineID but before
- * we actually read the xlog page, we might still try to read from the
- * old (now renamed) segment and fail. There's not much we can do
- * about this, but it can only happen when we're a leaf of a cascading
- * standby whose master gets promoted while we're decoding, so a
- * one-off ERROR isn't too bad.
- */
- XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+ case XLRO_END:
+ /*
+ * We'll try to read as far as we can on one timeline. This is
+ * used by xlogprefetch.c for crash recovery.
+ */
+ read_upto = (XLogRecPtr) -1;
+ try_read = true;
+ state->currTLI = tli = options->tli;
+ break;
+
+ default:
+ /*
+ * Determine the limit of xlog we can currently read to, and what the
+ * most recent timeline is.
+ *
+ * RecoveryInProgress() will update ThisTimeLineID when it first
+ * notices recovery finishes, so we only have to maintain it for
+ * the local process until recovery ends.
+ */
+ if (!RecoveryInProgress())
+ read_upto = GetFlushRecPtr();
+ else
+ read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
+ tli = ThisTimeLineID;
+
+ /*
+ * Check which timeline to get the record from.
+ *
+ * We have to do it each time through the loop because if we're in
+ * recovery as a cascading standby, the current timeline might've
+ * become historical. We can't rely on RecoveryInProgress()
+ * because in a standby configuration like
+ *
+ * A => B => C
+ *
+ * if we're a logical decoding session on C, and B gets promoted,
+ * our timeline will change while we remain in recovery.
+ *
+ * We can't just keep reading from the old timeline as the last
+ * WAL archive in the timeline will get renamed to .partial by
+ * StartupXLOG().
+ *
+ * If that happens after our caller updated ThisTimeLineID but
+ * before we actually read the xlog page, we might still try to
+ * read from the old (now renamed) segment and fail. There's not
+ * much we can do about this, but it can only happen when we're a
+ * leaf of a cascading standby whose master gets promoted while
+ * we're decoding, so a one-off ERROR isn't too bad.
+ */
+ XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+ break;
+ }
- if (state->currTLI == ThisTimeLineID)
+ if (state->currTLI == tli)
{
if (loc <= read_upto)
break;
+ /* not enough data there, but we were asked not to wait */
+ if (options && options->nowait)
+ return XLOGPAGEREAD_WOULDBLOCK;
+
CHECK_FOR_INTERRUPTS();
pg_usleep(1000L);
}
@@ -924,7 +983,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
else if (targetPagePtr + reqLen > read_upto)
{
/* not enough data there */
- return -1;
+ return XLOGPAGEREAD_ERROR;
}
else
{
@@ -938,8 +997,18 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
* zero-padded up to the page boundary if it's incomplete.
*/
if (!WALRead(cur_page, targetPagePtr, XLOG_BLCKSZ, tli, &state->seg,
- &state->segcxt, wal_segment_open, &errinfo))
+ &state->segcxt,
+ try_read ? wal_segment_try_open : wal_segment_open,
+ &errinfo))
+ {
+ /*
+ * When reading on a single timeline (the try_read modes), we may run
+ * past the end of the available segments. Report the missing file as
+ * an error.
+ */
+ if (try_read)
+ return XLOGPAGEREAD_ERROR;
WALReadRaiseError(&errinfo);
+ }
/* number of valid bytes in the buffer */
return count;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 122d884f3e..15ff3d35e4 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -818,7 +818,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
/* fail if not (implies we are going to shut down) */
if (flushptr < targetPagePtr + reqLen)
- return -1;
+ return XLOGPAGEREAD_ERROR;
if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
count = XLOG_BLCKSZ; /* more than one block available */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 4582196e18..a3ac7f414b 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -50,6 +50,10 @@ typedef struct WALSegmentContext
typedef struct XLogReaderState XLogReaderState;
+/* Special negative return values for XLogPageReadCB functions */
+#define XLOGPAGEREAD_ERROR -1
+#define XLOGPAGEREAD_WOULDBLOCK -2
+
/* Function type definition for the read_page callback */
typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
XLogRecPtr targetPagePtr,
@@ -99,10 +103,13 @@ struct XLogReaderState
* This callback shall read at least reqLen valid bytes of the xlog page
* starting at targetPagePtr, and store them in readBuf. The callback
* shall return the number of bytes read (never more than XLOG_BLCKSZ), or
- * -1 on failure. The callback shall sleep, if necessary, to wait for the
- * requested bytes to become available. The callback will not be invoked
- * again for the same page unless more than the returned number of bytes
- * are needed.
+ * XLOGPAGEREAD_ERROR on failure. The callback may either sleep or return
+ * XLOGPAGEREAD_WOULDBLOCK, if necessary, to wait for the requested bytes
+ * to become available. If a callback that can return
+ * XLOGPAGEREAD_WOULDBLOCK is installed, the reader client must expect to
+ * fail to read when there is not enough data. The callback will not be
+ * invoked again for the same page unless more than the returned number of
+ * bytes are needed.
*
* targetRecPtr is the position of the WAL record we're reading. Usually
* it is equal to targetPagePtr + reqLen, but sometimes xlogreader needs
@@ -126,6 +133,11 @@ struct XLogReaderState
*/
void *private_data;
+ /*
+ * Opaque data for callbacks to use. Not used by XLogReader.
+ */
+ void *read_page_data;
+
/*
* Start and end point of last record read. EndRecPtr is also used as the
* position to read next. Calling XLogBeginRead() sets EndRecPtr to the
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5181a077d9..89c9ce90f8 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,6 +47,29 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
extern void FreeFakeRelcacheEntry(Relation fakerel);
+/*
+ * A pointer to an XLogReadLocalOptions struct can be supplied as the
+ * read_page_data for an XLogReader, causing read_local_xlog_page() to modify
+ * its behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+ /* Don't block waiting for new WAL to arrive. */
+ bool nowait;
+
+ /*
+ * For XLRO_WALRCV_WRITTEN and XLRO_END modes, the timeline ID must be
+ * provided.
+ */
+ TimeLineID tli;
+
+ /* How far to read. */
+ enum {
+ XLRO_STANDARD,
+ XLRO_WALRCV_WRITTEN,
+ XLRO_END
+ } read_upto_policy;
+} XLogReadLocalOptions;
+
extern int read_local_xlog_page(XLogReaderState *state,
XLogRecPtr targetPagePtr, int reqLen,
XLogRecPtr targetRecPtr, char *cur_page);
--
2.20.1
v8-0003-Prefetch-referenced-blocks-during-recovery.patch (text/x-patch)
From 4e5ac5a6dbaa3ff519fbd2d8acf9b7d9756ad2cb Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Apr 2020 23:04:51 +1200
Subject: [PATCH v8 3/3] Prefetch referenced blocks during recovery.
Introduce a new GUC max_recovery_prefetch_distance. If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks. The
goal is to avoid I/O stalls and benefit from concurrent I/O. The number
of concurrent asynchronous reads is capped by the existing
maintenance_io_concurrency GUC. The feature is enabled by default for
now, but we might reconsider that before release.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
doc/src/sgml/config.sgml | 45 +
doc/src/sgml/monitoring.sgml | 81 ++
doc/src/sgml/wal.sgml | 13 +
src/backend/access/transam/Makefile | 1 +
src/backend/access/transam/xlog.c | 16 +
src/backend/access/transam/xlogprefetch.c | 905 ++++++++++++++++++
src/backend/catalog/system_views.sql | 14 +
src/backend/postmaster/pgstat.c | 96 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/utils/misc/guc.c | 47 +-
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/include/access/xlogprefetch.h | 85 ++
src/include/catalog/pg_proc.dat | 8 +
src/include/pgstat.h | 27 +
src/include/utils/guc.h | 4 +
src/test/regress/expected/rules.out | 11 +
16 files changed, 1359 insertions(+), 2 deletions(-)
create mode 100644 src/backend/access/transam/xlogprefetch.c
create mode 100644 src/include/access/xlogprefetch.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a0da4aabac..18979d0496 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3121,6 +3121,51 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-max-recovery-prefetch-distance" xreflabel="max_recovery_prefetch_distance">
+ <term><varname>max_recovery_prefetch_distance</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>max_recovery_prefetch_distance</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ The maximum distance to look ahead in the WAL during recovery, to find
+ blocks to prefetch. Prefetching blocks that will soon be needed can
+ reduce I/O wait times. The number of concurrent prefetches is limited
+ by this setting as well as
+ <xref linkend="guc-maintenance-io-concurrency"/>. Setting it too high
+ might be counterproductive, if it means that data falls out of the
+ kernel cache before it is needed. If this value is specified without
+ units, it is taken as bytes. A setting of -1 disables prefetching
+ during recovery.
+ The default is 256kB on systems that support
+ <function>posix_fadvise</function>, and otherwise -1.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+ <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to prefetch blocks that were logged with full page images
+ during recovery. Often this doesn't help, since such blocks will not
+ be read the first time they are needed and might remain in the buffer
+ pool after that. However, on file systems with a block size larger
+ than
+ <productname>PostgreSQL</productname>'s, prefetching can avoid a
+ costly read-before-write when blocks are later written. This
+ setting has no effect unless
+ <xref linkend="guc-max-recovery-prefetch-distance"/> is set to a positive
+ number. The default is off.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
<sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c50b72137f..ddf2ee1f96 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+ <entry>Only one row, showing statistics about blocks prefetched during recovery.
+ See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
<entry>At least one row per subscription, showing information about
@@ -2223,6 +2230,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
connected server.
</para>
+ <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+ <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+ <tgroup cols="3">
+ <thead>
+ <row>
+ <entry>Column</entry>
+ <entry>Type</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><structfield>prefetch</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_hit</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_new</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_fpw</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+ </row>
+ <row>
+ <entry><structfield>skip_seq</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+ </row>
+ <row>
+ <entry><structfield>distance</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>How far ahead of recovery the prefetcher is currently reading, in bytes</entry>
+ </row>
+ <row>
+ <entry><structfield>queue_depth</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+ </row>
+ <row>
+ <entry><structfield>avg_distance</structfield></entry>
+ <entry><type>float4</type></entry>
+ <entry>How far ahead of recovery the prefetcher is on average, while recovery is not idle</entry>
+ </row>
+ <row>
+ <entry><structfield>avg_queue_depth</structfield></entry>
+ <entry><type>float4</type></entry>
+ <entry>Average number of prefetches in flight while recovery is not idle</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+ one row. It is filled with nulls if recovery is not running or WAL
+ prefetching is not enabled. See <xref linkend="guc-max-recovery-prefetch-distance"/>
+ for more information. The counters in this view are reset whenever the
+ <xref linkend="guc-max-recovery-prefetch-distance"/>,
+ <xref linkend="guc-recovery-prefetch-fpw"/> or
+ <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+ the server configuration is reloaded.
+ </para>
+
<table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
<title><structname>pg_stat_subscription</structname> View</title>
<tgroup cols="3">
@@ -3446,6 +3525,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
counters shown in the <structname>pg_stat_bgwriter</structname> view.
Calling <literal>pg_stat_reset_shared('archiver')</literal> will zero all the
counters shown in the <structname>pg_stat_archiver</structname> view.
+ Calling <literal>pg_stat_reset_shared('prefetch_recovery')</literal> will zero all the
+ counters shown in the <structname>pg_stat_prefetch_recovery</structname> view.
</entry>
</row>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index bd9fae544c..38fc8149a8 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -719,6 +719,19 @@
<acronym>WAL</acronym> call being logged to the server log. This
option might be replaced by a more general mechanism in the future.
</para>
+
+ <para>
+ The <xref linkend="guc-max-recovery-prefetch-distance"/> parameter can
+ be used to improve I/O performance during recovery by instructing
+ <productname>PostgreSQL</productname> to initiate reads
+ of disk blocks that will soon be needed, in combination with the
+ <xref linkend="guc-maintenance-io-concurrency"/> parameter. The
+ prefetching mechanism is most likely to be effective on systems
+ with <varname>full_page_writes</varname> set to
+ <varname>off</varname> (where that is safe), and where the working
+ set is larger than RAM. By default, prefetching in recovery is enabled,
+ but it can be disabled by setting the distance to -1.
+ </para>
</sect1>
<sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..39f9d4e77d 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
xlogarchive.o \
xlogfuncs.o \
xloginsert.o \
+ xlogprefetch.o \
xlogreader.o \
xlogutils.o
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c38bc1412d..05a1c0ded8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
#include "access/xlog_internal.h"
#include "access/xlogarchive.h"
#include "access/xloginsert.h"
+#include "access/xlogprefetch.h"
#include "access/xlogreader.h"
#include "access/xlogutils.h"
#include "catalog/catversion.h"
@@ -7143,6 +7144,7 @@ StartupXLOG(void)
{
ErrorContextCallback errcallback;
TimestampTz xtime;
+ XLogPrefetchState prefetch;
InRedo = true;
@@ -7150,6 +7152,9 @@ StartupXLOG(void)
(errmsg("redo starts at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
+ /* Prepare to prefetch, if configured. */
+ XLogPrefetchBegin(&prefetch);
+
/*
* main redo apply loop
*/
@@ -7179,6 +7184,12 @@ StartupXLOG(void)
/* Handle interrupt signals of startup process */
HandleStartupProcInterrupts();
+ /* Perform WAL prefetching, if enabled. */
+ XLogPrefetch(&prefetch,
+ ThisTimeLineID,
+ xlogreader->ReadRecPtr,
+ currentSource == XLOG_FROM_STREAM);
+
/*
* Pause WAL replay, if requested by a hot-standby session via
* SetRecoveryPause().
@@ -7350,6 +7361,9 @@ StartupXLOG(void)
*/
if (switchedTLI && AllowCascadeReplication())
WalSndWakeup();
+
+ /* Reset the prefetcher. */
+ XLogPrefetchReconfigure();
}
/* Exit loop if we reached inclusive recovery target */
@@ -7366,6 +7380,7 @@ StartupXLOG(void)
/*
* end of main redo apply loop
*/
+ XLogPrefetchEnd(&prefetch);
if (reachedRecoveryTarget)
{
@@ -12094,6 +12109,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
*/
currentSource = XLOG_FROM_STREAM;
startWalReceiver = true;
+ XLogPrefetchReconfigure();
break;
case XLOG_FROM_STREAM:
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
new file mode 100644
index 0000000000..7d3aea53f7
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -0,0 +1,905 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.c
+ * Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogprefetch.c
+ *
+ * The goal of this module is to read future WAL records and issue
+ * PrefetchSharedBuffer() calls for referenced blocks, so that we avoid I/O
+ * stalls in the main recovery loop. Currently, this is achieved by using a
+ * separate XLogReader to read ahead. In future, we should find a way to
+ * avoid reading and decoding each record twice.
+ *
+ * When examining a WAL record from the future, we need to consider that a
+ * referenced block or segment file might not exist on disk until this record
+ * or some earlier record has been replayed. After a crash, a file might also
+ * be missing because it was dropped by a later WAL record; in that case, it
+ * will be recreated when this record is replayed. These cases are handled by
+ * recognizing them and adding a "filter" that prevents all prefetching of a
+ * certain block range until the present WAL record has been replayed. Blocks
+ * skipped for these reasons are counted as "skip_new" (that is, cases where we
+ * didn't try to prefetch "new" blocks).
+ *
+ * Blocks found in the buffer pool already are counted as "skip_hit".
+ * Repeated access to the same buffer is detected and skipped, and this is
+ * counted with "skip_seq". Blocks that were logged with FPWs are skipped if
+ * recovery_prefetch_fpw is off, since on most systems there will be no I/O
+ * stall; this is counted with "skip_fpw".
+ *
+ * The only way we currently have to know that an I/O initiated with
+ * PrefetchSharedBuffer() has completed is to call ReadBuffer(). Therefore,
+ * we track the number of potentially in-flight I/Os by using a circular
+ * buffer of LSNs. When it's full, we have to wait for recovery to replay
+ * records so that the queue depth can be reduced, before we can do any more
+ * prefetching. Ideally, this keeps us the right distance ahead to respect
+ * maintenance_io_concurrency.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetch.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log
+ * message that appears at the end of crash recovery. It's also used to send
+ * messages periodically to the stats collector, to save the counters on disk.
+ */
+#define XLOGPREFETCHER_SAMPLE_DISTANCE 0x40000
+
+/* GUCs */
+int max_recovery_prefetch_distance = -1;
+bool recovery_prefetch_fpw = false;
+
+int XLogPrefetchReconfigureCount;
+
+/*
+ * A prefetcher object. There is at most one of these in existence at a time,
+ * recreated whenever there is a configuration change.
+ */
+struct XLogPrefetcher
+{
+ /* Reader and current reading state. */
+ XLogReaderState *reader;
+ XLogReadLocalOptions options;
+ bool have_record;
+ bool shutdown;
+ int next_block_id;
+
+ /* Details of last prefetch to skip repeats and seq scans. */
+ SMgrRelation last_reln;
+ RelFileNode last_rnode;
+ BlockNumber last_blkno;
+
+ /* Online averages. */
+ uint64 samples;
+ double avg_queue_depth;
+ double avg_distance;
+ XLogRecPtr next_sample_lsn;
+
+ /* Book-keeping required to avoid accessing non-existing blocks. */
+ HTAB *filter_table;
+ dlist_head filter_queue;
+
+ /* Book-keeping required to limit concurrent prefetches. */
+ int prefetch_head;
+ int prefetch_tail;
+ int prefetch_queue_size;
+ XLogRecPtr prefetch_queue[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+ RelFileNode rnode;
+ XLogRecPtr filter_until_replayed;
+ BlockNumber filter_from_block;
+ dlist_node link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+ pg_atomic_uint64 reset_time; /* Time of last reset. */
+ pg_atomic_uint64 prefetch; /* Prefetches initiated. */
+ pg_atomic_uint64 skip_hit; /* Blocks already buffered. */
+ pg_atomic_uint64 skip_new; /* New/missing blocks filtered. */
+ pg_atomic_uint64 skip_fpw; /* FPWs skipped. */
+ pg_atomic_uint64 skip_seq; /* Repeat blocks skipped. */
+ float avg_distance;
+ float avg_queue_depth;
+
+ /* Reset counters */
+ pg_atomic_uint32 reset_request;
+ uint32 reset_handled;
+
+ /* Dynamic values */
+ int distance; /* Number of bytes ahead in the WAL. */
+ int queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+ RelFileNode rnode,
+ BlockNumber blockno,
+ XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+ RelFileNode rnode,
+ BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+static void XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static bool XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher);
+static void XLogPrefetchSaveStats(void);
+static void XLogPrefetchRestoreStats(void);
+
+static XLogPrefetchStats *Stats;
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+ return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+ pg_atomic_write_u64(&Stats->reset_time, GetCurrentTimestamp());
+ pg_atomic_write_u64(&Stats->prefetch, 0);
+ pg_atomic_write_u64(&Stats->skip_hit, 0);
+ pg_atomic_write_u64(&Stats->skip_new, 0);
+ pg_atomic_write_u64(&Stats->skip_fpw, 0);
+ pg_atomic_write_u64(&Stats->skip_seq, 0);
+ Stats->avg_distance = 0;
+ Stats->avg_queue_depth = 0;
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+ bool found;
+
+ Stats = (XLogPrefetchStats *)
+ ShmemInitStruct("XLogPrefetchStats",
+ sizeof(XLogPrefetchStats),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u32(&Stats->reset_request, 0);
+ Stats->reset_handled = 0;
+ pg_atomic_init_u64(&Stats->reset_time, GetCurrentTimestamp());
+ pg_atomic_init_u64(&Stats->prefetch, 0);
+ pg_atomic_init_u64(&Stats->skip_hit, 0);
+ pg_atomic_init_u64(&Stats->skip_new, 0);
+ pg_atomic_init_u64(&Stats->skip_fpw, 0);
+ pg_atomic_init_u64(&Stats->skip_seq, 0);
+ Stats->avg_distance = 0;
+ Stats->avg_queue_depth = 0;
+ Stats->distance = 0;
+ Stats->queue_depth = 0;
+ }
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+ XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+ pg_atomic_fetch_add_u32(&Stats->reset_request, 1);
+}
+
+/*
+ * Tell the stats collector to serialize the shared memory counters into the
+ * stats file.
+ */
+static void
+XLogPrefetchSaveStats(void)
+{
+ PgStat_RecoveryPrefetchStats serialized = {
+ .prefetch = pg_atomic_read_u64(&Stats->prefetch),
+ .skip_hit = pg_atomic_read_u64(&Stats->skip_hit),
+ .skip_new = pg_atomic_read_u64(&Stats->skip_new),
+ .skip_fpw = pg_atomic_read_u64(&Stats->skip_fpw),
+ .skip_seq = pg_atomic_read_u64(&Stats->skip_seq),
+ .stat_reset_timestamp = pg_atomic_read_u64(&Stats->reset_time)
+ };
+
+ pgstat_send_recoveryprefetch(&serialized);
+}
+
+/*
+ * Try to restore the shared memory counters from the stats file.
+ */
+static void
+XLogPrefetchRestoreStats(void)
+{
+ PgStat_RecoveryPrefetchStats *serialized = pgstat_fetch_recoveryprefetch();
+
+ if (serialized->stat_reset_timestamp != 0)
+ {
+ pg_atomic_write_u64(&Stats->prefetch, serialized->prefetch);
+ pg_atomic_write_u64(&Stats->skip_hit, serialized->skip_hit);
+ pg_atomic_write_u64(&Stats->skip_new, serialized->skip_new);
+ pg_atomic_write_u64(&Stats->skip_fpw, serialized->skip_fpw);
+ pg_atomic_write_u64(&Stats->skip_seq, serialized->skip_seq);
+ pg_atomic_write_u64(&Stats->reset_time, serialized->stat_reset_timestamp);
+ }
+}
+
+/*
+ * Initialize an XLogPrefetchState object and restore the last saved
+ * statistics from disk.
+ */
+void
+XLogPrefetchBegin(XLogPrefetchState *state)
+{
+ XLogPrefetchRestoreStats();
+
+ /* We'll reconfigure on the first call to XLogPrefetch(). */
+ state->prefetcher = NULL;
+ state->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+}
+
+/*
+ * Shut down the prefetching infrastructure, if configured.
+ */
+void
+XLogPrefetchEnd(XLogPrefetchState *state)
+{
+ XLogPrefetchSaveStats();
+
+ if (state->prefetcher)
+ XLogPrefetcherFree(state->prefetcher);
+ state->prefetcher = NULL;
+
+ Stats->queue_depth = 0;
+ Stats->distance = 0;
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(TimeLineID tli, XLogRecPtr lsn, bool streaming)
+{
+ XLogPrefetcher *prefetcher;
+ static HASHCTL hash_table_ctl = {
+ .keysize = sizeof(RelFileNode),
+ .entrysize = sizeof(XLogPrefetcherFilter)
+ };
+
+ /*
+ * The size of the queue is based on the maintenance_io_concurrency
+ * setting. In theory we might have a separate queue for each tablespace,
+ * but it's not clear how that should work, so for now we'll just use the
+ * general GUC to rate-limit all prefetching. We add one to the size
+ * because our circular buffer has a gap between head and tail when full.
+ */
+ prefetcher = palloc0(offsetof(XLogPrefetcher, prefetch_queue) +
+ sizeof(XLogRecPtr) * (maintenance_io_concurrency + 1));
+ prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
+ prefetcher->options.tli = tli;
+ prefetcher->options.nowait = true;
+ if (streaming)
+ {
+ /*
+ * We're only allowed to read as far as the WAL receiver has written.
+ * We don't have to wait for it to be flushed, though, as recovery
+ * does, so that gives us a chance to get a bit further ahead.
+ */
+ prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+ }
+ else
+ {
+ /* Read as far as we can. */
+ prefetcher->options.read_upto_policy = XLRO_END;
+ }
+ prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+ NULL,
+ read_local_xlog_page,
+ NULL);
+ prefetcher->reader->read_page_data = &prefetcher->options;
+ prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+ &hash_table_ctl,
+ HASH_ELEM | HASH_BLOBS);
+ dlist_init(&prefetcher->filter_queue);
+
+ /* Prepare to read at the given LSN. */
+ ereport(LOG,
+ (errmsg("recovery started prefetching on timeline %u at %X/%X",
+ tli,
+ (uint32) (lsn >> 32), (uint32) lsn)));
+ XLogBeginRead(prefetcher->reader, lsn);
+
+ Stats->queue_depth = 0;
+ Stats->distance = 0;
+
+ return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+ /* Log final statistics. */
+ ereport(LOG,
+ (errmsg("recovery finished prefetching at %X/%X; "
+ "prefetch = " UINT64_FORMAT ", "
+ "skip_hit = " UINT64_FORMAT ", "
+ "skip_new = " UINT64_FORMAT ", "
+ "skip_fpw = " UINT64_FORMAT ", "
+ "skip_seq = " UINT64_FORMAT ", "
+ "avg_distance = %f, "
+ "avg_queue_depth = %f",
+ (uint32) (prefetcher->reader->EndRecPtr >> 32),
+ (uint32) (prefetcher->reader->EndRecPtr),
+ pg_atomic_read_u64(&Stats->prefetch),
+ pg_atomic_read_u64(&Stats->skip_hit),
+ pg_atomic_read_u64(&Stats->skip_new),
+ pg_atomic_read_u64(&Stats->skip_fpw),
+ pg_atomic_read_u64(&Stats->skip_seq),
+ Stats->avg_distance,
+ Stats->avg_queue_depth)));
+ XLogReaderFree(prefetcher->reader);
+ hash_destroy(prefetcher->filter_table);
+ pfree(prefetcher);
+}
+
+/*
+ * Called when recovery is replaying a new LSN, to check if we can read ahead.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ uint32 reset_request;
+
+ /* If an error has occurred or we've hit the end of the WAL, do nothing. */
+ if (prefetcher->shutdown)
+ return;
+
+ /*
+ * Have any in-flight prefetches definitely completed, judging by the LSN
+ * that is currently being replayed?
+ */
+ XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+ /*
+ * Do we already have the maximum permitted number of I/Os running
+ * (according to the information we have)? If so, we have to wait for at
+ * least one to complete, so give up early and let recovery catch up.
+ */
+ if (XLogPrefetcherSaturated(prefetcher))
+ return;
+
+ /*
+ * Can we drop any filters yet? This happens when the LSN that is
+ * currently being replayed has moved past a record that prevents
+ * prefetching of a block range, such as relation extension.
+ */
+ XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+ /*
+ * Have we been asked to reset our stats counters? This is checked with
+ * an unsynchronized memory read, but we'll see it eventually and we'll be
+ * accessing that cache line anyway.
+ */
+ reset_request = pg_atomic_read_u32(&Stats->reset_request);
+ if (reset_request != Stats->reset_handled)
+ {
+ XLogPrefetchResetStats();
+ Stats->reset_handled = reset_request;
+ prefetcher->avg_distance = 0;
+ prefetcher->avg_queue_depth = 0;
+ prefetcher->samples = 0;
+ }
+
+ /* OK, we can now try reading ahead. */
+ XLogPrefetcherScanRecords(prefetcher, replaying_lsn);
+}
+
+/*
+ * Read ahead as far as we are allowed to, considering the LSN that recovery
+ * is currently replaying.
+ */
+static void
+XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ XLogReaderState *reader = prefetcher->reader;
+
+ Assert(!XLogPrefetcherSaturated(prefetcher));
+
+ for (;;)
+ {
+ char *error;
+ int64 distance;
+
+ /* If we don't already have a record, then try to read one. */
+ if (!prefetcher->have_record)
+ {
+ if (!XLogReadRecord(reader, &error))
+ {
+ /* If we got an error, log it and give up. */
+ if (error)
+ {
+ ereport(LOG, (errmsg("recovery no longer prefetching: %s", error)));
+ prefetcher->shutdown = true;
+ Stats->queue_depth = 0;
+ Stats->distance = 0;
+ }
+ /* Otherwise, we'll try again later when more data is here. */
+ return;
+ }
+ prefetcher->have_record = true;
+ prefetcher->next_block_id = 0;
+ }
+
+ /* How far ahead of replay are we now? */
+ distance = prefetcher->reader->ReadRecPtr - replaying_lsn;
+
+ /* Update distance shown in shm. */
+ Stats->distance = distance;
+
+ /* Periodically recompute some statistics. */
+ if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+ {
+ /* Compute online averages. */
+ prefetcher->samples++;
+ if (prefetcher->samples == 1)
+ {
+ prefetcher->avg_distance = Stats->distance;
+ prefetcher->avg_queue_depth = Stats->queue_depth;
+ }
+ else
+ {
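+ /* Standard running-mean update: avg += (sample - avg) / n. */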
+ prefetcher->avg_distance +=
+ (Stats->distance - prefetcher->avg_distance) /
+ prefetcher->samples;
+ prefetcher->avg_queue_depth +=
+ (Stats->queue_depth - prefetcher->avg_queue_depth) /
+ prefetcher->samples;
+ }
+
+ /* Expose it in shared memory. */
+ Stats->avg_distance = prefetcher->avg_distance;
+ Stats->avg_queue_depth = prefetcher->avg_queue_depth;
+
+ /* Also periodically save the simple counters. */
+ XLogPrefetchSaveStats();
+
+ prefetcher->next_sample_lsn =
+ replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
+ }
+
+ /* Are we too far ahead of replay? */
+ if (distance >= max_recovery_prefetch_distance)
+ break;
+
+ /* Are we not far enough ahead? */
+ if (distance <= 0)
+ {
+ prefetcher->have_record = false; /* skip this record */
+ continue;
+ }
+
+ /*
+ * If this is a record that creates a new SMGR relation, we'll avoid
+ * prefetching anything from that rnode until it has been replayed.
+ */
+ if (replaying_lsn < reader->ReadRecPtr &&
+ XLogRecGetRmid(reader) == RM_SMGR_ID &&
+ (XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+ XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+ reader->ReadRecPtr);
+ }
+
+ /* Scan the record's block references. */
+ if (!XLogPrefetcherScanBlocks(prefetcher))
+ return;
+
+ /* Advance to the next record. */
+ prefetcher->have_record = false;
+ }
+}
+
+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+ XLogReaderState *reader = prefetcher->reader;
+
+ Assert(!XLogPrefetcherSaturated(prefetcher));
+
+ /*
+ * We might already have been partway through processing this record when
+ * our queue became saturated, so we need to start where we left off.
+ */
+ for (int block_id = prefetcher->next_block_id;
+ block_id <= reader->max_block_id;
+ ++block_id)
+ {
+ PrefetchBufferResult prefetch;
+ DecodedBkpBlock *block = &reader->blocks[block_id];
+ SMgrRelation reln;
+
+ /* Ignore everything but the main fork for now. */
+ if (block->forknum != MAIN_FORKNUM)
+ continue;
+
+ /*
+ * If there is a full page image attached, we won't be reading the
+ * page, so you might think we should skip it. However, if the
+ * underlying filesystem uses larger logical blocks than us, it
+ * might still need to perform a read-before-write some time later.
+ * Therefore, only prefetch if configured to do so.
+ */
+ if (block->has_image && !recovery_prefetch_fpw)
+ {
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
+ continue;
+ }
+
+ /*
+ * If this block will initialize a new page then it's probably an
+ * extension. Since it might create a new segment, we can't try
+ * to prefetch this block until the record has been replayed, or we
+ * might try to open a file that doesn't exist yet.
+ */
+ if (block->flags & BKPBLOCK_WILL_INIT)
+ {
+ XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+ reader->ReadRecPtr);
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+ continue;
+ }
+
+ /* Should we skip this block due to a filter? */
+ if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+ {
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+ continue;
+ }
+
+ /* Fast path for repeated references to the same relation. */
+ if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+ {
+ /*
+ * If this is a repeat access to the same block, then skip it.
+ *
+ * XXX We could also check for last_blkno + 1 too, and also update
+ * last_blkno; it's not clear if the kernel would do a better job
+ * of sequential prefetching.
+ */
+ if (block->blkno == prefetcher->last_blkno)
+ {
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
+ continue;
+ }
+
+ /* We can avoid calling smgropen(). */
+ reln = prefetcher->last_reln;
+ }
+ else
+ {
+ /* Otherwise we have to open it. */
+ reln = smgropen(block->rnode, InvalidBackendId);
+ prefetcher->last_rnode = block->rnode;
+ prefetcher->last_reln = reln;
+ }
+ prefetcher->last_blkno = block->blkno;
+
+ /* Try to prefetch this block! */
+ prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+ if (BufferIsValid(prefetch.recent_buffer))
+ {
+ /*
+ * It was already cached, so do nothing. Perhaps in future we
+ * could remember the buffer so that recovery doesn't have to look
+ * it up again.
+ */
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
+ }
+ else if (prefetch.initiated_io)
+ {
+ /*
+ * I/O has possibly been initiated (though we don't know if it
+ * was already cached by the kernel, so we just have to assume
+ * that it has due to lack of better information). Record
+ * this as an I/O in progress until eventually we replay this
+ * LSN.
+ */
+ pg_atomic_unlocked_add_fetch_u64(&Stats->prefetch, 1);
+ XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+ /*
+ * If the queue is now full, we'll have to wait before processing
+ * any more blocks from this record, or move to a new record if
+ * that was the last block.
+ */
+ if (XLogPrefetcherSaturated(prefetcher))
+ {
+ prefetcher->next_block_id = block_id + 1;
+ return false;
+ }
+ }
+ else
+ {
+ /*
+ * Neither cached nor initiated. The underlying segment file
+ * doesn't exist. Presumably it will be unlinked by a later WAL
+ * record. When recovery reads this block, it will use the
+ * EXTENSION_CREATE_RECOVERY flag. We certainly don't want to do
+ * that sort of thing while merely prefetching, so let's just
+ * ignore references to this relation until this record is
+ * replayed, and let recovery create the dummy file or complain if
+ * something is wrong.
+ */
+ XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+ reader->ReadRecPtr);
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+ }
+ }
+
+ return true;
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ Datum values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+ bool nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mod required, but it is not allowed in this context")));
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ if (pg_atomic_read_u32(&Stats->reset_request) != Stats->reset_handled)
+ {
+ /* There's an unhandled reset request, so just show NULLs */
+ for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+ nulls[i] = true;
+ }
+ else
+ {
+ for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+ nulls[i] = false;
+ }
+
+ values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&Stats->reset_time));
+ values[1] = Int64GetDatum(pg_atomic_read_u64(&Stats->prefetch));
+ values[2] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_hit));
+ values[3] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_new));
+ values[4] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_fpw));
+ values[5] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_seq));
+ values[6] = Int32GetDatum(Stats->distance);
+ values[7] = Int32GetDatum(Stats->queue_depth);
+ values[8] = Float4GetDatum(Stats->avg_distance);
+ values[9] = Float4GetDatum(Stats->avg_queue_depth);
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+ BlockNumber blockno, XLogRecPtr lsn)
+{
+ XLogPrefetcherFilter *filter;
+ bool found;
+
+ filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+ if (!found)
+ {
+ /*
+ * Don't allow any prefetching of this block or higher until replayed.
+ */
+ filter->filter_until_replayed = lsn;
+ filter->filter_from_block = blockno;
+ dlist_push_head(&prefetcher->filter_queue, &filter->link);
+ }
+ else
+ {
+ /*
+ * We were already filtering this rnode. Extend the filter's lifetime
+ * to cover this WAL record, but leave the (presumably lower) block
+ * number there because we don't want to have to track individual
+ * blocks.
+ */
+ filter->filter_until_replayed = lsn;
+ dlist_delete(&filter->link);
+ dlist_push_head(&prefetcher->filter_queue, &filter->link);
+ }
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range? That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+ {
+ XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+ link,
+ &prefetcher->filter_queue);
+
+ if (filter->filter_until_replayed >= replaying_lsn)
+ break;
+ dlist_delete(&filter->link);
+ hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+ }
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+ BlockNumber blockno)
+{
+ /*
+ * Test for empty queue first, because we expect it to be empty most of the
+ * time and we can avoid the hash table lookup in that case.
+ */
+ if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+ {
+ XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+ HASH_FIND, NULL);
+
+ if (filter && filter->filter_from_block <= blockno)
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Insert an LSN into the queue. The queue must not be full already. This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr prefetching_lsn)
+{
+ Assert(!XLogPrefetcherSaturated(prefetcher));
+ prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+ prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+ Stats->queue_depth++;
+ Assert(Stats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet? That means that they're definitely finished, so we can can
+ * forget about them and allow ourselves to initiate more prefetches. For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+ prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+ {
+ prefetcher->prefetch_tail++;
+ prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+ Stats->queue_depth--;
+ Assert(Stats->queue_depth >= 0);
+ }
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+ return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+ prefetcher->prefetch_tail;
+}
+
+void
+assign_max_recovery_prefetch_distance(int new_value, void *extra)
+{
+ /* Reconfigure prefetching, because a setting it depends on changed. */
+ max_recovery_prefetch_distance = new_value;
+ if (AmStartupProcess())
+ XLogPrefetchReconfigure();
+}
+
+void
+assign_recovery_prefetch_fpw(bool new_value, void *extra)
+{
+ /* Reconfigure prefetching, because a setting it depends on changed. */
+ recovery_prefetch_fpw = new_value;
+ if (AmStartupProcess())
+ XLogPrefetchReconfigure();
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d406ea8118..3b15f5ef8e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -825,6 +825,20 @@ CREATE VIEW pg_stat_wal_receiver AS
FROM pg_stat_get_wal_receiver() s
WHERE s.pid IS NOT NULL;
+CREATE VIEW pg_stat_prefetch_recovery AS
+ SELECT
+ s.stats_reset,
+ s.prefetch,
+ s.skip_hit,
+ s.skip_new,
+ s.skip_fpw,
+ s.skip_seq,
+ s.distance,
+ s.queue_depth,
+ s.avg_distance,
+ s.avg_queue_depth
+ FROM pg_stat_get_prefetch_recovery() s;
+
CREATE VIEW pg_stat_subscription AS
SELECT
su.oid AS subid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 50eea2e8a8..6c9ac5b29b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
#include "access/transam.h"
#include "access/twophase_rmgr.h"
#include "access/xact.h"
+#include "access/xlogprefetch.h"
#include "catalog/pg_database.h"
#include "catalog/pg_proc.h"
#include "common/ip.h"
@@ -276,6 +277,7 @@ static int localNumBackends = 0;
static PgStat_ArchiverStats archiverStats;
static PgStat_GlobalStats globalStats;
static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
+static PgStat_RecoveryPrefetchStats recoveryPrefetchStats;
/*
* List of OIDs of databases we need to write out. If an entry is InvalidOid,
@@ -348,6 +350,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
+static void pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1364,11 +1367,20 @@ pgstat_reset_shared_counters(const char *target)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
+ else if (strcmp(target, "prefetch_recovery") == 0)
+ {
+ /*
+ * We can't ask the stats collector to do this for us as it is not
+ * attached to shared memory.
+ */
+ XLogPrefetchRequestResetStats();
+ return;
+ }
else
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\" or \"bgwriter\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\" or \"prefetch_recovery\".")));
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
pgstat_send(&msg, sizeof(msg));
@@ -2690,6 +2702,22 @@ pgstat_fetch_slru(void)
}
+/*
+ * ---------
+ * pgstat_fetch_recoveryprefetch() -
+ *
+ * Support function for restoring the counters managed by xlogprefetch.c.
+ * ---------
+ */
+PgStat_RecoveryPrefetchStats *
+pgstat_fetch_recoveryprefetch(void)
+{
+ backend_read_statsfile();
+
+ return &recoveryPrefetchStats;
+}
+
+
/* ------------------------------------------------------------
* Functions for management of the shared-memory PgBackendStatus array
* ------------------------------------------------------------
@@ -4440,6 +4468,23 @@ pgstat_send_slru(void)
}
+/* ----------
+ * pgstat_send_recoveryprefetch() -
+ *
+ * Send recovery prefetch statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats)
+{
+ PgStat_MsgRecoveryPrefetch msg;
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYPREFETCH);
+ msg.m_stats = *stats;
+ pgstat_send(&msg, sizeof(msg));
+}
+
+
/* ----------
* PgstatCollectorMain() -
*
@@ -4636,6 +4681,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_slru(&msg.msg_slru, len);
break;
+ case PGSTAT_MTYPE_RECOVERYPREFETCH:
+ pgstat_recv_recoveryprefetch(&msg.msg_recoveryprefetch, len);
+ break;
+
case PGSTAT_MTYPE_FUNCSTAT:
pgstat_recv_funcstat(&msg.msg_funcstat, len);
break;
@@ -4911,6 +4960,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
(void) rc; /* we'll check for error with ferror */
+ /*
+ * Write recovery prefetch stats struct
+ */
+ rc = fwrite(&recoveryPrefetchStats, sizeof(recoveryPrefetchStats), 1,
+ fpout);
+ (void) rc; /* we'll check for error with ferror */
+
/*
* Walk through the database table.
*/
@@ -5170,6 +5226,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
memset(&globalStats, 0, sizeof(globalStats));
memset(&archiverStats, 0, sizeof(archiverStats));
memset(&slruStats, 0, sizeof(slruStats));
+ memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
/*
* Set the current timestamp (will be kept only in case we can't load an
@@ -5257,6 +5314,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
goto done;
}
+ /*
+ * Read recoveryPrefetchStats struct
+ */
+ if (fread(&recoveryPrefetchStats, 1, sizeof(recoveryPrefetchStats),
+ fpin) != sizeof(recoveryPrefetchStats))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", statfile)));
+ memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
+ goto done;
+ }
+
/*
* We found an existing collector stats file. Read it and put all the
* hashtable entries into place.
@@ -5556,6 +5625,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
PgStat_GlobalStats myGlobalStats;
PgStat_ArchiverStats myArchiverStats;
PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
+ PgStat_RecoveryPrefetchStats myRecoveryPrefetchStats;
FILE *fpin;
int32 format_id;
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -5621,6 +5691,18 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
return false;
}
+ /*
+ * Read recovery prefetch stats struct
+ */
+ if (fread(&myRecoveryPrefetchStats, 1, sizeof(myRecoveryPrefetchStats),
+ fpin) != sizeof(myRecoveryPrefetchStats))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", statfile)));
+ FreeFile(fpin);
+ return false;
+ }
+
/* By default, we're going to return the timestamp of the global file. */
*ts = myGlobalStats.stats_timestamp;
@@ -6420,6 +6502,18 @@ pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
slruStats[msg->m_index].truncate += msg->m_truncate;
}
+/* ----------
+ * pgstat_recv_recoveryprefetch() -
+ *
+ * Process a recovery prefetch message.
+ * ----------
+ */
+static void
+pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len)
+{
+ recoveryPrefetchStats = msg->m_stats;
+}
+
/* ----------
* pgstat_recv_recoveryconflict() -
*
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..221081bddc 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
#include "access/nbtree.h"
#include "access/subtrans.h"
#include "access/twophase.h"
+#include "access/xlogprefetch.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -124,6 +125,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, PredicateLockShmemSize());
size = add_size(size, ProcGlobalShmemSize());
size = add_size(size, XLOGShmemSize());
+ size = add_size(size, XLogPrefetchShmemSize());
size = add_size(size, CLOGShmemSize());
size = add_size(size, CommitTsShmemSize());
size = add_size(size, SUBTRANSShmemSize());
@@ -212,6 +214,7 @@ CreateSharedMemoryAndSemaphores(void)
* Set up xlog, clog, and buffers
*/
XLOGShmemInit();
+ XLogPrefetchShmemInit();
CLOGShmemInit();
CommitTsShmemInit();
SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5bdc02fce2..5ed7ed13e8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -34,6 +34,7 @@
#include "access/twophase.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
+#include "access/xlogprefetch.h"
#include "catalog/namespace.h"
#include "catalog/pg_authid.h"
#include "catalog/storage.h"
@@ -198,6 +199,7 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
static void assign_pgstat_temp_directory(const char *newval, void *extra);
static bool check_application_name(char **newval, void **extra, GucSource source);
static void assign_application_name(const char *newval, void *extra);
@@ -1272,6 +1274,18 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"recovery_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Prefetch blocks that have full page images in the WAL"),
+ gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+ "entirely overwritten, but if the logical page size of the filesystem is "
+ "larger than PostgreSQL's, this can be beneficial. This option has no "
+ "effect unless max_recovery_prefetch_distance is set to a positive number.")
+ },
+ &recovery_prefetch_fpw,
+ false,
+ NULL, assign_recovery_prefetch_fpw, NULL
+ },
{
{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2649,6 +2663,22 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"max_recovery_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+ gettext_noop("Maximum number of bytes to read ahead in the WAL to prefetch referenced blocks."),
+ gettext_noop("Set to -1 to disable prefetching during recovery."),
+ GUC_UNIT_BYTE
+ },
+ &max_recovery_prefetch_distance,
+#ifdef USE_PREFETCH
+ 256 * 1024,
+#else
+ -1,
+#endif
+ -1, INT_MAX,
+ NULL, assign_max_recovery_prefetch_distance, NULL
+ },
+
{
{"wal_keep_segments", PGC_SIGHUP, REPLICATION_SENDING,
gettext_noop("Sets the number of WAL files held for standby servers."),
@@ -2968,7 +2998,8 @@ static struct config_int ConfigureNamesInt[] =
0,
#endif
0, MAX_IO_CONCURRENCY,
- check_maintenance_io_concurrency, NULL, NULL
+ check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+ NULL
},
{
@@ -11586,6 +11617,20 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
return true;
}
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+ /*
+ * Reconfigure recovery prefetching, because a setting it depends on
+ * changed.
+ */
+ maintenance_io_concurrency = newval;
+ if (AmStartupProcess())
+ XLogPrefetchReconfigure();
+#endif
+}
+
static void
assign_pgstat_temp_directory(const char *newval, void *extra)
{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 995b6ca155..55cce90763 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -230,6 +230,11 @@
#checkpoint_flush_after = 0 # measured in pages, 0 disables
#checkpoint_warning = 30s # 0 disables
+# - Prefetching during recovery -
+
+#max_recovery_prefetch_distance = 256kB # -1 disables prefetching
+#recovery_prefetch_fpw = off # whether to prefetch pages logged with FPW
+
# - Archiving -
#archive_mode = off # enables archiving; off, on, or always
diff --git a/src/include/access/xlogprefetch.h b/src/include/access/xlogprefetch.h
new file mode 100644
index 0000000000..d8e2e1ca50
--- /dev/null
+++ b/src/include/access/xlogprefetch.h
@@ -0,0 +1,85 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.h
+ * Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/xlogprefetch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCH_H
+#define XLOGPREFETCH_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern int max_recovery_prefetch_distance;
+extern bool recovery_prefetch_fpw;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern int XLogPrefetchReconfigureCount;
+
+typedef struct XLogPrefetchState
+{
+ XLogPrefetcher *prefetcher;
+ int reconfigure_count;
+} XLogPrefetchState;
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchReconfigure(void);
+extern void XLogPrefetchRequestResetStats(void);
+
+extern void XLogPrefetchBegin(XLogPrefetchState *state);
+extern void XLogPrefetchEnd(XLogPrefetchState *state);
+
+/* Functions exposed only for the use of XLogPrefetch(). */
+extern XLogPrefetcher *XLogPrefetcherAllocate(TimeLineID tli,
+ XLogRecPtr lsn,
+ bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch,
+ XLogRecPtr replaying_lsn);
+
+/*
+ * Tell the prefetching module that we are now replaying a given LSN, so that
+ * it can decide how far ahead to read in the WAL, if configured.
+ */
+static inline void
+XLogPrefetch(XLogPrefetchState *state,
+ TimeLineID replaying_tli,
+ XLogRecPtr replaying_lsn,
+ bool from_stream)
+{
+ /*
+ * Handle any configuration changes. Rather than trying to deal with
+ * various parameter changes, we just tear down and set up a new
+ * prefetcher if anything we depend on changes.
+ */
+ if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
+ {
+ /* If we had a prefetcher, tear it down. */
+ if (state->prefetcher)
+ {
+ XLogPrefetcherFree(state->prefetcher);
+ state->prefetcher = NULL;
+ }
+ /* If we want a prefetcher, set it up. */
+ if (max_recovery_prefetch_distance > 0)
+ state->prefetcher = XLogPrefetcherAllocate(replaying_tli,
+ replaying_lsn,
+ from_stream);
+ state->reconfigure_count = XLogPrefetchReconfigureCount;
+ }
+
+ if (state->prefetcher)
+ XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
+}
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4bce3ad8de..9f5f0ed4c8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6138,6 +6138,14 @@
prorettype => 'bool', proargtypes => '',
prosrc => 'pg_is_wal_replay_paused' },
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+ proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+ proretset => 't', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,float4,float4}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{stats_reset,prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth,avg_distance,avg_queue_depth}',
+ prosrc => 'pg_stat_get_prefetch_recovery' },
+
{ oid => '2621', descr => 'reload configuration files',
proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9988..105c2e77d2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -63,6 +63,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_SLRU,
+ PGSTAT_MTYPE_RECOVERYPREFETCH,
PGSTAT_MTYPE_FUNCSTAT,
PGSTAT_MTYPE_FUNCPURGE,
PGSTAT_MTYPE_RECOVERYCONFLICT,
@@ -183,6 +184,19 @@ typedef struct PgStat_TableXactStatus
struct PgStat_TableXactStatus *next; /* next of same subxact */
} PgStat_TableXactStatus;
+/*
+ * Recovery prefetching statistics persisted on disk by pgstat.c, but kept in
+ * shared memory by xlogprefetch.c.
+ */
+typedef struct PgStat_RecoveryPrefetchStats
+{
+ PgStat_Counter prefetch;
+ PgStat_Counter skip_hit;
+ PgStat_Counter skip_new;
+ PgStat_Counter skip_fpw;
+ PgStat_Counter skip_seq;
+ TimestampTz stat_reset_timestamp;
+} PgStat_RecoveryPrefetchStats;
/* ------------------------------------------------------------
* Message formats follow
@@ -454,6 +468,16 @@ typedef struct PgStat_MsgSLRU
PgStat_Counter m_truncate;
} PgStat_MsgSLRU;
+/* ----------
+ * PgStat_MsgRecoveryPrefetch Sent by XLogPrefetch to save statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgRecoveryPrefetch
+{
+ PgStat_MsgHdr m_hdr;
+ PgStat_RecoveryPrefetchStats m_stats;
+} PgStat_MsgRecoveryPrefetch;
+
/* ----------
* PgStat_MsgRecoveryConflict Sent by the backend upon recovery conflict
* ----------
@@ -598,6 +622,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgSLRU msg_slru;
+ PgStat_MsgRecoveryPrefetch msg_recoveryprefetch;
PgStat_MsgFuncstat msg_funcstat;
PgStat_MsgFuncpurge msg_funcpurge;
PgStat_MsgRecoveryConflict msg_recoveryconflict;
@@ -1464,6 +1489,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
/* ----------
* Support functions for the SQL-callable functions to
@@ -1479,6 +1505,7 @@ extern int pgstat_fetch_stat_numbackends(void);
extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
extern PgStat_GlobalStats *pgstat_fetch_global(void);
extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_RecoveryPrefetchStats *pgstat_fetch_recoveryprefetch(void);
extern void pgstat_count_slru_page_zeroed(SlruCtl ctl);
extern void pgstat_count_slru_page_hit(SlruCtl ctl);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 2819282181..976cf8b116 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -440,4 +440,8 @@ extern void assign_search_path(const char *newval, void *extra);
extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+/* in access/transam/xlogprefetch.c */
+extern void assign_max_recovery_prefetch_distance(int new_value, void *extra);
+extern void assign_recovery_prefetch_fpw(bool new_value, void *extra);
+
#endif /* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac31840739..942a07ffee 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1857,6 +1857,17 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+ s.prefetch,
+ s.skip_hit,
+ s.skip_new,
+ s.skip_fpw,
+ s.skip_seq,
+ s.distance,
+ s.queue_depth,
+ s.avg_distance,
+ s.avg_queue_depth
+ FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth, avg_distance, avg_queue_depth);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
--
2.20.1
On Thu, Apr 09, 2020 at 09:55:25AM +1200, Thomas Munro wrote:
Thanks. Here's a rebase.
Thanks for working on this patch, it seems like a great feature. I'm
probably a bit late to the party, but still want to make a couple of
comments.
The patch indeed looks good, I couldn't find any significant issues so
far, and almost all of the questions I had while reading it were already
answered in this thread. I'm still busy with benchmarking, mostly to see
how prefetching would work with different workload distributions and how
much the kernel will actually prefetch.
In the meantime I have a few questions:
On Wed, Feb 12, 2020 at 07:52:42PM +1300, Thomas Munro wrote:
On Fri, Jan 3, 2020 at 7:10 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
Could we instead specify the number of blocks to prefetch? We'd probably
need to track additional details needed to determine number of blocks to
prefetch (essentially LSN for all prefetch requests).
Here is a new WIP version of the patch set that does that. Changes:
1. It now uses effective_io_concurrency to control how many
concurrent prefetches to allow. It's possible that we should have a
different GUC to control "maintenance" users of concurrency I/O as
discussed elsewhere[1], but I'm staying out of that for now; if we
agree to do that for VACUUM etc, we can change it easily here. Note
that the value is percolated through the ComputeIoConcurrency()
function which I think we should discuss, but again that's off topic,
I just want to use the standard infrastructure here.
This totally makes sense, I believe the question "how much to prefetch"
eventually depends equally on the type of workload (which correlates with
how far ahead in the WAL to read) and on how many resources are available
for prefetching (which correlates with queue depth). But in the
documentation it looks like maintenance-io-concurrency is just an
"unimportant" option, and I'm almost sure it will be overlooked by many
readers:
The maximum distance to look ahead in the WAL during recovery, to find
blocks to prefetch. Prefetching blocks that will soon be needed can
reduce I/O wait times. The number of concurrent prefetches is limited
by this setting as well as
<xref linkend="guc-maintenance-io-concurrency"/>. Setting it too high
might be counterproductive, if it means that data falls out of the
kernel cache before it is needed. If this value is specified without
units, it is taken as bytes. A setting of -1 disables prefetching
during recovery.
Maybe it also makes sense to emphasize that maintenance-io-concurrency
directly affects resource consumption and is a "primary control"?
On Wed, Mar 18, 2020 at 06:18:44PM +1300, Thomas Munro wrote:
Here's a new version that changes that part just a bit more, after a
brief chat with Andres about his async I/O plans. It seems clear that
returning an enum isn't very extensible, so I decided to try making
PrefetchBufferResult a struct whose contents can be extended in the
future. In this patch set it's still just used to distinguish 3 cases
(hit, miss, no file), but it's now expressed as a buffer and a flag to
indicate whether I/O was initiated. You could imagine that the second
thing might be replaced by a pointer to an async I/O handle you can
wait on or some other magical thing from the future.
I like the idea of an extensible PrefetchBufferResult. Just one comment:
if I understand correctly, the way it is being used together with
prefetch_queue assumes one IO operation at a time. This limits potential
extension of the underlying code; e.g. one can't implement some sort of
buffering of requests and submit an iovec to a syscall, because then
prefetch_queue would no longer correctly represent in-flight IO. Also,
taking into account that "we don't have any awareness of when I/O really
completes", maybe in the future it makes sense to reconsider having the
queue in the prefetcher itself, and rather ask for this information from
the underlying code?
On Wed, Apr 08, 2020 at 04:24:21AM +1200, Thomas Munro wrote:
Is there a way we could have a "historical" version of at least some of
these? An average queue depth, or such?
Ok, I added simple online averages for distance and queue depth that
take a sample every time recovery advances by 256kB.
Maybe it was discussed in the past in other threads, but if I understand
correctly, this implementation weights all the samples equally. Since at
the moment it depends directly on replay speed (so a lot of IO is
involved), couldn't a single outlier at the beginning skew this value
and make it less useful? Does it make sense to decay old values?
On Sun, Apr 19, 2020 at 11:46 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Thanks for working on this patch, it seems like a great feature. I'm
probably a bit late to the party, but still want to make a couple of
comments.
Hi Dmitry,
Thanks for your feedback and your interest in this work!
The patch indeed looks good, I couldn't find any significant issues so
far, and almost all of the questions I had while reading it were already
answered in this thread. I'm still busy with benchmarking, mostly to see
how prefetching would work with different workload distributions and how
much the kernel will actually prefetch.
Cool.
One report I heard recently said that if you get rid of I/O stalls,
pread() becomes cheap enough that the much higher frequency lseek()
calls I've complained about elsewhere[1]/messages/by-id/CA+hUKG+NPZeEdLXAcNr+w0YOZVb0Un0_MwTBpgmmVDh7No2jbg@mail.gmail.com become the main thing
recovery is doing, at least on some systems, but I haven't pieced
together the conditions required yet. I'd be interested to know if
you see that.
In the meantime I have a few questions:
1. It now uses effective_io_concurrency to control how many
concurrent prefetches to allow. It's possible that we should have a
different GUC to control "maintenance" users of concurrency I/O as
discussed elsewhere[1], but I'm staying out of that for now; if we
agree to do that for VACUUM etc, we can change it easily here. Note
that the value is percolated through the ComputeIoConcurrency()
function which I think we should discuss, but again that's off topic,
I just want to use the standard infrastructure here.
This totally makes sense, I believe the question "how much to prefetch"
eventually depends equally on the type of workload (which correlates with
how far ahead in the WAL to read) and on how many resources are available
for prefetching (which correlates with queue depth). But in the
documentation it looks like maintenance-io-concurrency is just an
"unimportant" option, and I'm almost sure it will be overlooked by many
readers:
The maximum distance to look ahead in the WAL during recovery, to find
blocks to prefetch. Prefetching blocks that will soon be needed can
reduce I/O wait times. The number of concurrent prefetches is limited
by this setting as well as
<xref linkend="guc-maintenance-io-concurrency"/>. Setting it too high
might be counterproductive, if it means that data falls out of the
kernel cache before it is needed. If this value is specified without
units, it is taken as bytes. A setting of -1 disables prefetching
during recovery.
Maybe it also makes sense to emphasize that maintenance-io-concurrency
directly affects resource consumption and is a "primary control"?
You're right. I will add something in the next version to emphasise that.
On Wed, Mar 18, 2020 at 06:18:44PM +1300, Thomas Munro wrote:
Here's a new version that changes that part just a bit more, after a
brief chat with Andres about his async I/O plans. It seems clear that
returning an enum isn't very extensible, so I decided to try making
PrefetchBufferResult a struct whose contents can be extended in the
future. In this patch set it's still just used to distinguish 3 cases
(hit, miss, no file), but it's now expressed as a buffer and a flag to
indicate whether I/O was initiated. You could imagine that the second
thing might be replaced by a pointer to an async I/O handle you can
wait on or some other magical thing from the future.
I like the idea of an extensible PrefetchBufferResult. Just one comment:
if I understand correctly, the way it is being used together with
prefetch_queue assumes one IO operation at a time. This limits potential
extension of the underlying code; e.g. one can't implement some sort of
buffering of requests and submit an iovec to a syscall, because then
prefetch_queue would no longer correctly represent in-flight IO. Also,
taking into account that "we don't have any awareness of when I/O really
completes", maybe in the future it makes sense to reconsider having the
queue in the prefetcher itself, and rather ask for this information from
the underlying code?
Yeah, you're right that it'd be good to be able to do some kind of
batching up of these requests to reduce system calls. Of course
posix_fadvise() doesn't support that, but clearly in the AIO future[2]https://anarazel.de/talks/2020-01-31-fosdem-aio/aio.pdf
it would indeed make sense to buffer up a few of these and then make a
single call to io_uring_enter() on Linux[3]https://kernel.dk/io_uring.pdf or lio_listio() on a
hypothetical POSIX AIO implementation[4]https://pubs.opengroup.org/onlinepubs/009695399/functions/lio_listio.html. (I'm not sure if there is a
thing like that on Windows; at a glance, ReadFileScatter() is
asynchronous ("overlapped") but works only on a single handle so it's
like a hypothetical POSIX aio_readv(), not like POSIX lio_listio()).
Perhaps there could be an extra call PrefetchBufferSubmit() that you'd
call at appropriate times, but you obviously can't call it too
infrequently.
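Just to illustrate the sort of batching I mean, here is a rough sketch of
the concept using liburing (nothing like this exists in the patch, and the
function and parameter names here are entirely made up):

#include <fcntl.h>
#include <liburing.h>

/*
 * Queue several WILLNEED hints and submit them with a single system
 * call, instead of issuing one posix_fadvise() per referenced block.
 */
static void
prefetch_blocks_batched(struct io_uring *ring, int fd,
						const off_t *offsets, int noffsets, off_t blocksz)
{
	for (int i = 0; i < noffsets; i++)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		if (sqe == NULL)
			break;		/* queue full; real code would submit and retry */
		io_uring_prep_fadvise(sqe, fd, offsets[i], blocksz,
							  POSIX_FADV_WILLNEED);
	}
	io_uring_submit(ring);	/* one io_uring_enter() for the whole batch */
}

That's only the submission side of course; the interesting part would be
deciding when the batch is "full enough" to submit.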
As for how to make the prefetch queue a reusable component, rather
than having a custom thing like that for each part of our system that
wants to support prefetching: that's a really good question. I didn't
see how to do it, but maybe I didn't try hard enough. I looked at the
three users I'm aware of, namely this patch, a btree prefetching patch
I haven't shared yet, and the existing bitmap heap scan code, and they
all needed to have their own custom book keeping for this, and I
couldn't figure out how to share more infrastructure. In the case of
this patch, you currently need to do LSN based book keeping to
simulate "completion", and that doesn't make sense for other users.
Maybe it'll become clearer when we have support for completion
notification?
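To make the LSN-based bookkeeping a little more concrete, the idea is
roughly the following (a simplified, self-contained sketch rather than the
actual xlogprefetch.c code; plain uint64_t stands in for XLogRecPtr and
the names are made up):

#include <stdbool.h>
#include <stdint.h>

/*
 * Remember the LSN of the record that caused each prefetch, and consider
 * the I/O "completed" once replay has advanced past that LSN, because by
 * then recovery itself has read the block.
 */
#define PREFETCH_QUEUE_SIZE 16			/* cf. maintenance_io_concurrency */

typedef struct PrefetchQueue
{
	uint64_t	lsns[PREFETCH_QUEUE_SIZE];
	int			head;					/* next slot to fill */
	int			tail;					/* oldest in-flight entry */
	int			inflight;				/* current "queue depth" */
} PrefetchQueue;

/* Retire prefetches whose record has already been replayed. */
static void
queue_retire(PrefetchQueue *q, uint64_t replaying_lsn)
{
	while (q->inflight > 0 && q->lsns[q->tail] <= replaying_lsn)
	{
		q->tail = (q->tail + 1) % PREFETCH_QUEUE_SIZE;
		q->inflight--;
	}
}

/* Is there room to issue another prefetch without exceeding the limit? */
static bool
queue_has_room(const PrefetchQueue *q)
{
	return q->inflight < PREFETCH_QUEUE_SIZE;
}

/* Record a newly issued prefetch, tagged with its record's LSN. */
static void
queue_insert(PrefetchQueue *q, uint64_t record_lsn)
{
	q->lsns[q->head] = record_lsn;
	q->head = (q->head + 1) % PREFETCH_QUEUE_SIZE;
	q->inflight++;
}

As you can see, "completion" here is entirely simulated from the replay
position, which is exactly the part that doesn't transfer to other users.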
Some related questions are why all these parts of our system that know
how to prefetch are allowed to do so independently without any kind of
shared accounting, and why we don't give each tablespace (= our model
of a device?) its own separate queue. I think it's OK to put these
questions off a bit longer until we have more infrastructure and
experience. Our current non-answer is at least consistent with our
lack of an approach to system-wide memory and CPU accounting... I
personally think that a better XLogReader that can be used for
prefetching AND recovery would be a higher priority than that.
On Wed, Apr 08, 2020 at 04:24:21AM +1200, Thomas Munro wrote:
Is there a way we could have a "historical" version of at least some of
these? An average queue depth, or such?
Ok, I added simple online averages for distance and queue depth that
take a sample every time recovery advances by 256kB.
Maybe it was discussed in the past in other threads, but if I understand
correctly, this implementation weights all the samples equally. Since at
the moment it depends directly on replay speed (so a lot of IO is
involved), couldn't a single outlier at the beginning skew this value
and make it less useful? Does it make sense to decay old values?
Hmm.
I wondered about reporting one or perhaps three exponential moving
averages (like Unix 1/5/15 minute load averages), but I didn't propose
it because: (1) in crash recovery, you can't query it, you just get
the log message at the end, and an unweighted mean seems OK in that case,
no? (you are not more interested in the I/O saturation at the end of
recovery than at the start of recovery, are you?), and (2) on a
streaming replica, if you want to sample the instantaneous depth and
compute an exponential moving average or some more exotic statistical
concoction in your monitoring tool, you're free to do so. I suppose
(2) is an argument for removing the existing average completely from
the stat view; I put it in there at Andres's suggestion, but I'm not
sure I really believe in it. Where is our average replication lag,
and why don't we compute the stddev of X, Y or Z? I think we should
provide primary measurements and let people compute derived statistics
from those.
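For what it's worth, the difference between the two approaches is tiny in
code terms; here's a standalone sketch (not the patch's actual accounting
code, and the names are made up) of an incremental unweighted mean versus
a decaying average:

#include <stdint.h>

/* Unweighted running mean: every sample counts equally, however old. */
static double
update_mean(double mean, uint64_t *nsamples, double sample)
{
	*nsamples += 1;
	return mean + (sample - mean) / (double) *nsamples;
}

/* Exponential moving average: old samples decay with weight (1 - alpha). */
static double
update_ema(double ema, double sample, double alpha)
{
	return alpha * sample + (1.0 - alpha) * ema;
}

So it's not effort I'm worried about, just whether we should be in the
business of picking the decay constant at all.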
I suppose the reason for this request was the analogy with Linux
iostat -x's "aqu-sz", which is the primary way that people understand
device queue depth on that OS. This number is actually computed by
iostat, not the kernel, so by analogy I could argue that a
hypothetical pg_iostat program compute that for you from raw
ingredients. AFAIK iostat computes the *unweighted* average queue
depth during the time between output lines, by observing changes in
the "aveq" ("the sum of how long all requests have spent in flight, in
milliseconds") and "use" ("how many milliseconds there has been at
least one IO in flight") fields of /proc/diskstats. But it's OK that
it's unweighted, because it computes a new value for every line it
output (ie every 5 seconds or whatever you asked for). It's not too
clear how to do something like that here, but all suggestions are
welcome.
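In other words, given two snapshots of that "aveq" counter, the
computation is just something like this (illustrative only, not iostat's
actual source):

#include <stdint.h>

/*
 * Average queue depth over an interval: the extra request-milliseconds
 * accumulated, divided by the elapsed wall-clock milliseconds.
 */
static double
avg_queue_depth(uint64_t aveq_prev, uint64_t aveq_now, uint64_t interval_ms)
{
	if (interval_ms == 0)
		return 0.0;
	return (double) (aveq_now - aveq_prev) / (double) interval_ms;
}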
Or maybe we'll have something more general that makes this more
specific thing irrelevant, in future AIO infrastructure work.
On a more superficial note, one thing I don't like about the last
version of the patch is the difference in the ordering of the words in
the GUC recovery_prefetch_distance and the view
pg_stat_prefetch_recovery. Hrmph.
[1]: /messages/by-id/CA+hUKG+NPZeEdLXAcNr+w0YOZVb0Un0_MwTBpgmmVDh7No2jbg@mail.gmail.com
[2]: https://anarazel.de/talks/2020-01-31-fosdem-aio/aio.pdf
[3]: https://kernel.dk/io_uring.pdf
[4]: https://pubs.opengroup.org/onlinepubs/009695399/functions/lio_listio.html
On Tue, Apr 21, 2020 at 05:26:52PM +1200, Thomas Munro wrote:
One report I heard recently said that if you get rid of I/O stalls,
pread() becomes cheap enough that the much higher frequency lseek()
calls I've complained about elsewhere[1] become the main thing
recovery is doing, at least on some systems, but I haven't pieced
together the conditions required yet. I'd be interested to know if
you see that.
At the moment I've performed a couple of tests of replication in the
case where almost everything is in memory (mostly by mistake, I was
expecting that a postgres replica within a badly memory-limited cgroup
would cause more IO, but it looks like the kernel does not evict pages
anyway). Not sure if that's what you mean by getting rid of IO stalls,
but in these tests profiling shows lseek & pread appearing in a similar
number of samples.
If I understand correctly, eventually one can measure the influence of
prefetching by looking at the execution time of the various redo
functions (assuming that the data they operate on is already prefetched,
they should be faster). I still have to clarify the exact reason, but
even in the situation described above (everything in memory) there is
some visible difference, e.g.
# with prefetch
Function = b'heap2_redo' [8064]
nsecs : count distribution
4096 -> 8191 : 1213 | |
8192 -> 16383 : 66639 |****************************************|
16384 -> 32767 : 27846 |**************** |
32768 -> 65535 : 873 | |
# without prefetch
Function = b'heap2_redo' [17980]
nsecs : count distribution
4096 -> 8191 : 1 | |
8192 -> 16383 : 66997 |****************************************|
16384 -> 32767 : 30966 |****************** |
32768 -> 65535 : 1602 | |
# with prefetch
Function = b'btree_redo' [8064]
nsecs : count distribution
2048 -> 4095 : 0 | |
4096 -> 8191 : 246 |****************************************|
8192 -> 16383 : 5 | |
16384 -> 32767 : 2 | |
# without prefetch
Function = b'btree_redo' [17980]
nsecs : count distribution
2048 -> 4095 : 0 | |
4096 -> 8191 : 82 |******************** |
8192 -> 16383 : 19 |**** |
16384 -> 32767 : 160 |****************************************|
Of course this doesn't take into account the time we spend doing extra
syscalls for prefetching, but it can still give some interesting
information.
I like the idea of an extensible PrefetchBufferResult. Just one comment:
if I understand correctly, the way it is being used together with
prefetch_queue assumes one IO operation at a time. This limits potential
extension of the underlying code; e.g. one can't implement some sort of
buffering of requests and submit an iovec to a syscall, because then
prefetch_queue would no longer correctly represent in-flight IO. Also,
taking into account that "we don't have any awareness of when I/O really
completes", maybe in the future it makes sense to reconsider having the
queue in the prefetcher itself, and rather ask for this information from
the underlying code?
Yeah, you're right that it'd be good to be able to do some kind of
batching up of these requests to reduce system calls. Of course
posix_fadvise() doesn't support that, but clearly in the AIO future[2]
it would indeed make sense to buffer up a few of these and then make a
single call to io_uring_enter() on Linux[3] or lio_listio() on a
hypothetical POSIX AIO implementation[4]. (I'm not sure if there is a
thing like that on Windows; at a glance, ReadFileScatter() is
asynchronous ("overlapped") but works only on a single handle so it's
like a hypothetical POSIX aio_readv(), not like POSIX lio_listio()).
Perhaps there could be an extra call PrefetchBufferSubmit() that you'd
call at appropriate times, but you obviously can't call it too
infrequently.
As for how to make the prefetch queue a reusable component, rather
than having a custom thing like that for each part of our system that
wants to support prefetching: that's a really good question. I didn't
see how to do it, but maybe I didn't try hard enough. I looked at the
three users I'm aware of, namely this patch, a btree prefetching patch
I haven't shared yet, and the existing bitmap heap scan code, and they
all needed to have their own custom book keeping for this, and I
couldn't figure out how to share more infrastructure. In the case of
this patch, you currently need to do LSN based book keeping to
simulate "completion", and that doesn't make sense for other users.
Maybe it'll become clearer when we have support for completion
notification?
Yes, definitely.
Some related questions are why all these parts of our system that know
how to prefetch are allowed to do so independently without any kind of
shared accounting, and why we don't give each tablespace (= our model
of a device?) its own separate queue. I think it's OK to put these
questions off a bit longer until we have more infrastructure and
experience. Our current non-answer is at least consistent with our
lack of an approach to system-wide memory and CPU accounting... I
personally think that a better XLogReader that can be used for
prefetching AND recovery would be a higher priority than that.
Sure, this patch is quite valuable as it is, and the questions I've
mentioned are mostly targeting future development.
Maybe it was discussed in the past in other threads, but if I understand
correctly, this implementation weights all the samples equally. Since at
the moment it depends directly on replay speed (so a lot of IO is
involved), couldn't a single outlier at the beginning skew this value
and make it less useful? Does it make sense to decay old values?
Hmm.
I wondered about reporting one or perhaps three exponential moving
averages (like Unix 1/5/15 minute load averages), but I didn't propose
it because: (1) in crash recovery, you can't query it, you just get
the log message at the end, and an unweighted mean seems OK in that case,
no? (you are not more interested in the I/O saturation at the end of
recovery than at the start of recovery, are you?), and (2) on a
streaming replica, if you want to sample the instantaneous depth and
compute an exponential moving average or some more exotic statistical
concoction in your monitoring tool, you're free to do so. I suppose
(2) is an argument for removing the existing average completely from
the stat view; I put it in there at Andres's suggestion, but I'm not
sure I really believe in it. Where is our average replication lag,
and why don't we compute the stddev of X, Y or Z? I think we should
provide primary measurements and let people compute derived statistics
from those.
For once I disagree, since I believe this very approach, widely applied,
leads to a slightly chaotic situation with monitoring. But of course
you're right, it has nothing to do with the patch itself. I would also
be in favour of removing the existing averages, unless Andres has more
arguments for keeping them.
On Sat, Apr 25, 2020 at 09:19:35PM +0200, Dmitry Dolgov wrote:
On Tue, Apr 21, 2020 at 05:26:52PM +1200, Thomas Munro wrote:
One report I heard recently said that if you get rid of I/O stalls,
pread() becomes cheap enough that the much higher frequency lseek()
calls I've complained about elsewhere[1] become the main thing
recovery is doing, at least on some systems, but I haven't pieced
together the conditions required yet. I'd be interested to know if
you see that.
At the moment I've performed a couple of tests of replication in the
case where almost everything is in memory (mostly by mistake, I was
expecting that a postgres replica within a badly memory-limited cgroup
would cause more IO, but it looks like the kernel does not evict pages
anyway). Not sure if that's what you mean by getting rid of IO stalls,
but in these tests profiling shows lseek & pread appearing in a similar
number of samples.
If I understand correctly, eventually one can measure the influence of
prefetching by looking at the execution time of the various redo
functions (assuming that the data they operate on is already prefetched,
they should be faster). I still have to clarify the exact reason, but
even in the situation described above (everything in memory) there is
some visible difference, e.g.
I've finally performed a couple of tests involving more IO: a
not-that-big dataset of 1.5 GB for the replica, with memory allowing
~1/6 of it to fit, default prefetching parameters, and an update
workload with uniform distribution. Rather a small setup, but it causes
stable reading into the page cache on the replica and makes the
influence of the patch visible (more measurement samples tend to happen
at lower latencies):
# with patch
Function = b'heap_redo' [206]
nsecs : count distribution
1024 -> 2047 : 0 | |
2048 -> 4095 : 32833 |********************** |
4096 -> 8191 : 59476 |****************************************|
8192 -> 16383 : 18617 |************ |
16384 -> 32767 : 3992 |** |
32768 -> 65535 : 425 | |
65536 -> 131071 : 5 | |
131072 -> 262143 : 326 | |
262144 -> 524287 : 6 | |
# without patch
Function = b'heap_redo' [130]
nsecs : count distribution
1024 -> 2047 : 0 | |
2048 -> 4095 : 20062 |*********** |
4096 -> 8191 : 70662 |****************************************|
8192 -> 16383 : 12895 |******* |
16384 -> 32767 : 9123 |***** |
32768 -> 65535 : 560 | |
65536 -> 131071 : 1 | |
131072 -> 262143 : 460 | |
262144 -> 524287 : 3 | |
Not that there were any doubts, but at the same time it was surprising
to me how well Linux readahead works in this situation. The results
above are shown with readahead disabled for the filesystem and device;
without that there was almost no difference, since a lot of IO was
avoided by readahead (which in fact accounted for the majority of all
reads):
# with patch
flags = Read
usecs : count distribution
16 -> 31 : 0 | |
32 -> 63 : 1 |******** |
64 -> 127 : 5 |****************************************|
flags = ReadAhead-Read
usecs : count distribution
32 -> 63 : 0 | |
64 -> 127 : 131 |****************************************|
128 -> 255 : 12 |*** |
256 -> 511 : 6 |* |
# without patch
flags = Read
usecs : count distribution
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 4 |****************************************|
flags = ReadAhead-Read
usecs : count distribution
32 -> 63 : 0 | |
64 -> 127 : 143 |****************************************|
128 -> 255 : 20 |***** |
The numbers of reads in this case were similar with and without the
patch, which means it couldn't be attributed to a situation where a page
was read too early, then evicted and read again later.
On Sun, May 3, 2020 at 3:12 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
I've finally performed a couple of tests involving more IO: a
not-that-big dataset of 1.5 GB for the replica, with memory allowing
~1/6 of it to fit, default prefetching parameters, and an update
workload with uniform distribution. Rather a small setup, but it causes
stable reading into the page cache on the replica and makes the
influence of the patch visible (more measurement samples tend to happen
at lower latencies):
Thanks for these tests Dmitry. You didn't mention the details of the
workload, but one thing I'd recommend for a uniform/random workload
that's generating a lot of misses on the primary server using N
backends is to make sure that maintenance_io_concurrency is set to a
number like N*2 or higher, and to look at the queue depth on both
systems with iostat -x 1. Then you can experiment with ALTER SYSTEM
SET maintenance_io_concurrency = X; SELECT pg_reload_conf(); to try to
understand the way it works; there is a point where you've set it high
enough and the replica is able to handle the same rate of concurrent
I/Os as the primary. The default of 10 is actually pretty low unless
you've only got ~4 backends generating random updates on the primary.
That's with full_page_writes=off; if you leave it on, it takes a while
to get into a scenario where it has much effect.
Here's a rebase, after the recent XLogReader refactoring.
Attachments:
v9-0001-Add-pg_atomic_unlocked_add_fetch_XXX.patch (text/x-patch)
From a7fd3f728d64c3c94387e9e424dba507b166bcab Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 28 Mar 2020 11:42:59 +1300
Subject: [PATCH v9 1/3] Add pg_atomic_unlocked_add_fetch_XXX().
Add a variant of pg_atomic_add_fetch_XXX with no barrier semantics, for
cases where you only want to avoid the possibility that a concurrent
pg_atomic_read_XXX() sees a torn/partial value.
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/include/port/atomics.h | 24 ++++++++++++++++++++++
src/include/port/atomics/generic.h | 33 ++++++++++++++++++++++++++++++
2 files changed, 57 insertions(+)
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 4956ec55cb..2abb852893 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -389,6 +389,21 @@ pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
return pg_atomic_add_fetch_u32_impl(ptr, add_);
}
+/*
+ * pg_atomic_unlocked_add_fetch_u32 - atomically add to variable
+ *
+ * Like pg_atomic_unlocked_write_u32, guarantees only that partial values
+ * cannot be observed.
+ *
+ * No barrier semantics.
+ */
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+ AssertPointerAlignment(ptr, 4);
+ return pg_atomic_unlocked_add_fetch_u32_impl(ptr, add_);
+}
+
/*
* pg_atomic_sub_fetch_u32 - atomically subtract from variable
*
@@ -519,6 +534,15 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
}
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
+ AssertPointerAlignment(ptr, 8);
+#endif
+ return pg_atomic_unlocked_add_fetch_u64_impl(ptr, add_);
+}
+
#undef INSIDE_ATOMICS_H
#endif /* ATOMICS_H */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
index d3ba89a58f..1683653ca6 100644
--- a/src/include/port/atomics/generic.h
+++ b/src/include/port/atomics/generic.h
@@ -234,6 +234,16 @@ pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
}
#endif
+#if !defined(PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32)
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U32
+static inline uint32
+pg_atomic_unlocked_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+ ptr->value += add_;
+ return ptr->value;
+}
+#endif
+
#if !defined(PG_HAVE_ATOMIC_SUB_FETCH_U32) && defined(PG_HAVE_ATOMIC_FETCH_SUB_U32)
#define PG_HAVE_ATOMIC_SUB_FETCH_U32
static inline uint32
@@ -399,3 +409,26 @@ pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
}
#endif
+
+#if defined(PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY) && \
+ !defined(PG_HAVE_ATOMIC_U64_SIMULATION)
+
+#ifndef PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+#define PG_HAVE_ATOMIC_UNLOCKED_ADD_FETCH_U64
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+ ptr->value += val;
+ return ptr->value;
+}
+#endif
+
+#else
+
+static inline uint64
+pg_atomic_unlocked_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+ return pg_atomic_add_fetch_u64_impl(ptr, val);
+}
+
+#endif
--
2.20.1
v9-0002-Allow-XLogReadRecord-to-be-non-blocking.patch (text/x-patch)
From 6ed95fffba6751ddc9607659183c072cb11fa4a8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 7 Apr 2020 22:56:27 +1200
Subject: [PATCH v9 2/3] Allow XLogReadRecord() to be non-blocking.
Extend read_local_xlog_page() to support non-blocking modes:
1. Reading as far as the WAL receiver has written so far.
2. Reading all the way to the end, when the end LSN is unknown.
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
src/backend/access/transam/xlogreader.c | 37 ++++--
src/backend/access/transam/xlogutils.c | 149 +++++++++++++++++-------
src/backend/replication/walsender.c | 2 +-
src/include/access/xlogreader.h | 14 ++-
src/include/access/xlogutils.h | 26 +++++
5 files changed, 173 insertions(+), 55 deletions(-)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5995798b58..897efaf682 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -259,6 +259,9 @@ XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
* If the reading fails for some other reason, NULL is also returned, and
* *errormsg is set to a string with details of the failure.
*
+ * If the read_page callback is one that returns XLOGPAGEREAD_WOULDBLOCK rather
+ * than waiting for WAL to arrive, NULL is also returned in that case.
+ *
* The returned pointer (or *errormsg) points to an internal buffer that's
* valid until the next call to XLogReadRecord.
*/
@@ -548,10 +551,11 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
err:
/*
- * Invalidate the read state. We might read from a different source after
- * failure.
+ * Invalidate the read state, if this was an error. We might read from a
+ * different source after failure.
*/
- XLogReaderInvalReadState(state);
+ if (readOff != XLOGPAGEREAD_WOULDBLOCK)
+ XLogReaderInvalReadState(state);
if (state->errormsg_buf[0] != '\0')
*errormsg = state->errormsg_buf;
@@ -563,8 +567,9 @@ err:
* Read a single xlog page including at least [pageptr, reqLen] of valid data
* via the page_read() callback.
*
- * Returns -1 if the required page cannot be read for some reason; errormsg_buf
- * is set in that case (unless the error occurs in the page_read callback).
+ * Returns XLOGPAGEREAD_ERROR or XLOGPAGEREAD_WOULDBLOCK if the required page
+ * cannot be read for some reason; errormsg_buf is set in the former case
+ * (unless the error occurs in the page_read callback).
*
* We fetch the page from a reader-local cache if we know we have the required
* data and if there hasn't been any error since caching the data.
@@ -661,8 +666,11 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
return readLen;
err:
+ if (readLen == XLOGPAGEREAD_WOULDBLOCK)
+ return XLOGPAGEREAD_WOULDBLOCK;
+
XLogReaderInvalReadState(state);
- return -1;
+ return XLOGPAGEREAD_ERROR;
}
/*
@@ -941,6 +949,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
XLogRecPtr found = InvalidXLogRecPtr;
XLogPageHeader header;
char *errormsg;
+ int readLen;
Assert(!XLogRecPtrIsInvalid(RecPtr));
@@ -954,7 +963,6 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
XLogRecPtr targetPagePtr;
int targetRecOff;
uint32 pageHeaderSize;
- int readLen;
/*
* Compute targetRecOff. It should typically be equal or greater than
@@ -1035,7 +1043,8 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
}
err:
- XLogReaderInvalReadState(state);
+ if (readLen != XLOGPAGEREAD_WOULDBLOCK)
+ XLogReaderInvalReadState(state);
return InvalidXLogRecPtr;
}
@@ -1094,8 +1103,16 @@ WALRead(XLogReaderState *state,
XLByteToSeg(recptr, nextSegNo, state->segcxt.ws_segsize);
state->routine.segment_open(state, nextSegNo, &tli);
- /* This shouldn't happen -- indicates a bug in segment_open */
- Assert(state->seg.ws_file >= 0);
+ /* callback reported that there was no such file */
+ if (state->seg.ws_file < 0)
+ {
+ errinfo->wre_errno = errno;
+ errinfo->wre_req = 0;
+ errinfo->wre_read = 0;
+ errinfo->wre_off = startoff;
+ errinfo->wre_seg = state->seg;
+ return false;
+ }
/* Update the current segment info. */
state->seg.ws_tli = tli;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 322b0e8ff5..18aa499831 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -25,6 +25,7 @@
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walreceiver.h"
#include "storage/smgr.h"
#include "utils/guc.h"
#include "utils/hsearch.h"
@@ -808,6 +809,29 @@ wal_segment_open(XLogReaderState *state, XLogSegNo nextSegNo,
path)));
}
+/*
+ * XLogReaderRoutine->segment_open callback that reports missing files rather
+ * than raising an error.
+ */
+void
+wal_segment_try_open(XLogReaderState *state, XLogSegNo nextSegNo,
+ TimeLineID *tli_p)
+{
+ TimeLineID tli = *tli_p;
+ char path[MAXPGPATH];
+
+ XLogFilePath(path, tli, nextSegNo, state->segcxt.ws_segsize);
+ state->seg.ws_file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+ if (state->seg.ws_file >= 0)
+ return;
+
+ if (errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+ path)));
+}
+
/* stock XLogReaderRoutine->segment_close callback */
void
wal_segment_close(XLogReaderState *state)
@@ -823,6 +847,10 @@ wal_segment_close(XLogReaderState *state)
* Public because it would likely be very helpful for someone writing another
* output method outside walsender, e.g. in a bgworker.
*
+ * A pointer to an XLogReadLocalOptions struct may be passed in as
+ * XLogReaderRoutine->page_read_private to control the behavior of this
+ * function.
+ *
* TODO: The walsender has its own version of this, but it relies on the
* walsender's latch being set whenever WAL is flushed. No such infrastructure
* exists for normal backends, so we have to do a check/sleep/repeat style of
@@ -837,58 +865,89 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
TimeLineID tli;
int count;
WALReadError errinfo;
+ XLogReadLocalOptions *options =
+ (XLogReadLocalOptions *) state->routine.page_read_private;
loc = targetPagePtr + reqLen;
/* Loop waiting for xlog to be available if necessary */
while (1)
{
- /*
- * Determine the limit of xlog we can currently read to, and what the
- * most recent timeline is.
- *
- * RecoveryInProgress() will update ThisTimeLineID when it first
- * notices recovery finishes, so we only have to maintain it for the
- * local process until recovery ends.
- */
- if (!RecoveryInProgress())
- read_upto = GetFlushRecPtr();
- else
- read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
- tli = ThisTimeLineID;
+ switch (options ? options->read_upto_policy : -1)
+ {
+ case XLRO_WALRCV_WRITTEN:
+ /*
+ * We'll try to read as far as has been written by the WAL
+ * receiver, on the requested timeline. When we run out of valid
+ * data, we'll return an error. This is used by xlogprefetch.c
+ * while streaming.
+ */
+ read_upto = GetWalRcvWriteRecPtr();
+ state->currTLI = tli = options->tli;
+ break;
- /*
- * Check which timeline to get the record from.
- *
- * We have to do it each time through the loop because if we're in
- * recovery as a cascading standby, the current timeline might've
- * become historical. We can't rely on RecoveryInProgress() because in
- * a standby configuration like
- *
- * A => B => C
- *
- * if we're a logical decoding session on C, and B gets promoted, our
- * timeline will change while we remain in recovery.
- *
- * We can't just keep reading from the old timeline as the last WAL
- * archive in the timeline will get renamed to .partial by
- * StartupXLOG().
- *
- * If that happens after our caller updated ThisTimeLineID but before
- * we actually read the xlog page, we might still try to read from the
- * old (now renamed) segment and fail. There's not much we can do
- * about this, but it can only happen when we're a leaf of a cascading
- * standby whose master gets promoted while we're decoding, so a
- * one-off ERROR isn't too bad.
- */
- XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+ case XLRO_END:
+ /*
+ * We'll try to read as far as we can on one timeline. This is
+ * used by xlogprefetch.c for crash recovery.
+ */
+ read_upto = (XLogRecPtr) -1;
+ state->currTLI = tli = options->tli;
+ break;
+
+ default:
+ /*
+ * Determine the limit of xlog we can currently read to, and what the
+ * most recent timeline is.
+ *
+ * RecoveryInProgress() will update ThisTimeLineID when it first
+ * notices recovery finishes, so we only have to maintain it for
+ * the local process until recovery ends.
+ */
+ if (!RecoveryInProgress())
+ read_upto = GetFlushRecPtr();
+ else
+ read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
+ tli = ThisTimeLineID;
+
+ /*
+ * Check which timeline to get the record from.
+ *
+ * We have to do it each time through the loop because if we're in
+ * recovery as a cascading standby, the current timeline might've
+ * become historical. We can't rely on RecoveryInProgress()
+ * because in a standby configuration like
+ *
+ * A => B => C
+ *
+ * if we're a logical decoding session on C, and B gets promoted,
+ * our timeline will change while we remain in recovery.
+ *
+ * We can't just keep reading from the old timeline as the last
+ * WAL archive in the timeline will get renamed to .partial by
+ * StartupXLOG().
+ *
+ * If that happens after our caller updated ThisTimeLineID but
+ * before we actually read the xlog page, we might still try to
+ * read from the old (now renamed) segment and fail. There's not
+ * much we can do about this, but it can only happen when we're a
+ * leaf of a cascading standby whose master gets promoted while
+ * we're decoding, so a one-off ERROR isn't too bad.
+ */
+ XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
+ break;
+ }
- if (state->currTLI == ThisTimeLineID)
+ if (state->currTLI == tli)
{
if (loc <= read_upto)
break;
+ /* not enough data there, but we were asked not to wait */
+ if (options && options->nowait)
+ return XLOGPAGEREAD_WOULDBLOCK;
+
CHECK_FOR_INTERRUPTS();
pg_usleep(1000L);
}
@@ -930,7 +989,7 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
else if (targetPagePtr + reqLen > read_upto)
{
/* not enough data there */
- return -1;
+ return XLOGPAGEREAD_ERROR;
}
else
{
@@ -945,7 +1004,17 @@ read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
*/
if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
&errinfo))
+ {
+ /*
+ * When not following timeline changes, we may read past the end of
+ * available segments. Report a missing file as an error return rather
+ * than raising an ERROR.
+ */
+ if (errinfo.wre_errno == ENOENT)
+ return XLOGPAGEREAD_ERROR;
+
WALReadRaiseError(&errinfo);
+ }
/* number of valid bytes in the buffer */
return count;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 86847cbb54..448c83b684 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -835,7 +835,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
/* fail if not (implies we are going to shut down) */
if (flushptr < targetPagePtr + reqLen)
- return -1;
+ return XLOGPAGEREAD_ERROR;
if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
count = XLOG_BLCKSZ; /* more than one block available */
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index d930fe957d..3a5ab4b3ce 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -57,6 +57,10 @@ typedef struct WALSegmentContext
typedef struct XLogReaderState XLogReaderState;
+/* Special negative return values for XLogPageReadCB functions */
+#define XLOGPAGEREAD_ERROR -1
+#define XLOGPAGEREAD_WOULDBLOCK -2
+
/* Function type definitions for various xlogreader interactions */
typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
XLogRecPtr targetPagePtr,
@@ -76,10 +80,11 @@ typedef struct XLogReaderRoutine
* This callback shall read at least reqLen valid bytes of the xlog page
* starting at targetPagePtr, and store them in readBuf. The callback
* shall return the number of bytes read (never more than XLOG_BLCKSZ), or
- * -1 on failure. The callback shall sleep, if necessary, to wait for the
- * requested bytes to become available. The callback will not be invoked
- * again for the same page unless more than the returned number of bytes
- * are needed.
+ * XLOGPAGEREAD_ERROR on failure. The callback shall either sleep, if
+ * necessary, to wait for the requested bytes to become available, or
+ * return XLOGPAGEREAD_WOULDBLOCK. The callback will not be invoked again
+ * for the same page unless more than the returned number of bytes are
+ * needed.
*
* targetRecPtr is the position of the WAL record we're reading. Usually
* it is equal to targetPagePtr + reqLen, but sometimes xlogreader needs
@@ -91,6 +96,7 @@ typedef struct XLogReaderRoutine
* read from.
*/
XLogPageReadCB page_read;
+ void *page_read_private;
/*
* Callback to open the specified WAL segment for reading. ->seg.ws_file
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index e59b6cf3a9..6325c23dc2 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -47,12 +47,38 @@ extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
extern void FreeFakeRelcacheEntry(Relation fakerel);
+/*
+ * A pointer to an XLogReadLocalOptions struct can be supplied as the private data
+ * for an XLogReader, causing read_local_xlog_page() to modify its behavior.
+ */
+typedef struct XLogReadLocalOptions
+{
+ /* Don't block waiting for new WAL to arrive. */
+ bool nowait;
+
+ /*
+ * For XLRO_WALRCV_WRITTEN and XLRO_END modes, the timeline ID must be
+ * provided.
+ */
+ TimeLineID tli;
+
+ /* How far to read. */
+ enum {
+ XLRO_STANDARD,
+ XLRO_WALRCV_WRITTEN,
+ XLRO_END
+ } read_upto_policy;
+} XLogReadLocalOptions;
+
extern int read_local_xlog_page(XLogReaderState *state,
XLogRecPtr targetPagePtr, int reqLen,
XLogRecPtr targetRecPtr, char *cur_page);
extern void wal_segment_open(XLogReaderState *state,
XLogSegNo nextSegNo,
TimeLineID *tli_p);
+extern void wal_segment_try_open(XLogReaderState *state,
+ XLogSegNo nextSegNo,
+ TimeLineID *tli_p);
extern void wal_segment_close(XLogReaderState *state);
extern void XLogReadDetermineTimeline(XLogReaderState *state,
--
2.20.1
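
To illustrate the interface change above from a caller's perspective: a
reader that must not sleep supplies an XLogReadLocalOptions struct through
the new page_read_private member, and treats a NULL result from
XLogReadRecord() with no error message as "not enough WAL yet, try again
later". The following is only a sketch against this patch set -- the
demo_* names are invented here, everything else comes from the patch:

#include "postgres.h"

#include "access/xlog_internal.h"
#include "access/xlogreader.h"
#include "access/xlogutils.h"

/*
 * Must outlive the reader; read_local_xlog_page() finds it through
 * page_read_private.
 */
static XLogReadLocalOptions demo_options;

static XLogReaderState *
demo_open_nowait_reader(TimeLineID tli, XLogRecPtr start_lsn)
{
    XLogReaderRoutine routines = {
        .page_read = read_local_xlog_page,
        .page_read_private = &demo_options,
        .segment_open = wal_segment_try_open,
        .segment_close = wal_segment_close
    };
    XLogReaderState *reader;

    demo_options.nowait = true;         /* never sleep in the callback */
    demo_options.tli = tli;
    demo_options.read_upto_policy = XLRO_END;   /* read to end of local WAL */

    reader = XLogReaderAllocate(wal_segment_size, NULL, &routines, NULL);
    XLogBeginRead(reader, start_lsn);
    return reader;
}

static bool
demo_poll_one_record(XLogReaderState *reader)
{
    char       *errormsg;

    if (XLogReadRecord(reader, &errormsg) == NULL)
    {
        if (errormsg)
            elog(LOG, "stopping: %s", errormsg);    /* real error */
        return false;           /* no error: no more WAL yet, retry later */
    }
    return true;                /* reader->ReadRecPtr now points at a record */
}
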
v9-0003-Prefetch-referenced-blocks-during-recovery.patch (text/x-patch)
From 68cbfa9e553359a57a4806cab8af60b0450f7e5b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Apr 2020 23:04:51 +1200
Subject: [PATCH v9 3/3] Prefetch referenced blocks during recovery.
Introduce a new GUC max_recovery_prefetch_distance. If it is set to a
positive number of bytes, then read ahead in the WAL at most that
distance, and initiate asynchronous reading of referenced blocks. The
goal is to avoid I/O stalls and benefit from concurrent I/O. The number
of concurrency asynchronous reads is capped by the existing
maintenance_io_concurrency GUC. The feature is enabled by default for
now, but we might reconsider that before release.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
---
doc/src/sgml/config.sgml | 45 +
doc/src/sgml/monitoring.sgml | 85 +-
doc/src/sgml/wal.sgml | 13 +
src/backend/access/transam/Makefile | 1 +
src/backend/access/transam/xlog.c | 16 +
src/backend/access/transam/xlogprefetch.c | 910 ++++++++++++++++++
src/backend/catalog/system_views.sql | 14 +
src/backend/postmaster/pgstat.c | 96 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/utils/misc/guc.c | 47 +-
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/include/access/xlogprefetch.h | 85 ++
src/include/catalog/pg_proc.dat | 8 +
src/include/pgstat.h | 27 +
src/include/utils/guc.h | 4 +
src/test/regress/expected/rules.out | 11 +
16 files changed, 1366 insertions(+), 4 deletions(-)
create mode 100644 src/backend/access/transam/xlogprefetch.c
create mode 100644 src/include/access/xlogprefetch.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a2694e548a..0c9842b0f9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3121,6 +3121,51 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-max-recovery-prefetch-distance" xreflabel="max_recovery_prefetch_distance">
+ <term><varname>max_recovery_prefetch_distance</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>max_recovery_prefetch_distance</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ The maximum distance to look ahead in the WAL during recovery, to find
+ blocks to prefetch. Prefetching blocks that will soon be needed can
+ reduce I/O wait times. The number of concurrent prefetches is limited
+ by this setting as well as
+ <xref linkend="guc-maintenance-io-concurrency"/>. Setting it too high
+ might be counterproductive, if it means that data falls out of the
+ kernel cache before it is needed. If this value is specified without
+ units, it is taken as bytes. A setting of -1 disables prefetching
+ during recovery.
+ The default is 256kB on systems that support
+ <function>posix_fadvise</function>, and otherwise -1.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+ <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Whether to prefetch blocks that were logged with full page images,
+ during recovery. Often this doesn't help, since such blocks will not
+ be read the first time they are needed and might remain in the buffer
+ pool after that. However, on file systems with a block size larger
+ than
+ <productname>PostgreSQL</productname>'s, prefetching can avoid a
+ costly read-before-write when blocks are later written. This
+ setting has no effect unless
+ <xref linkend="guc-max-recovery-prefetch-distance"/> is set to a positive
+ number. The default is off.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
<sect2 id="runtime-config-wal-archiving">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 49d4bb13b9..0ab278e087 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -320,6 +320,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+ <entry>Only one row, showing statistics about blocks prefetched during recovery.
+ See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
<entry>At least one row per subscription, showing information about
@@ -2674,6 +2681,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
connected server.
</para>
+ <table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+ <title><structname>pg_stat_prefetch_recovery</structname> View</title>
+ <tgroup cols="3">
+ <thead>
+ <row>
+ <entry>Column</entry>
+ <entry>Type</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry><structfield>prefetch</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_hit</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_new</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+ </row>
+ <row>
+ <entry><structfield>skip_fpw</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+ </row>
+ <row>
+ <entry><structfield>skip_seq</structfield></entry>
+ <entry><type>bigint</type></entry>
+ <entry>Number of blocks not prefetched because of repeated or sequential access</entry>
+ </row>
+ <row>
+ <entry><structfield>distance</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>How far ahead of recovery the prefetcher is currently reading, in bytes</entry>
+ </row>
+ <row>
+ <entry><structfield>queue_depth</structfield></entry>
+ <entry><type>integer</type></entry>
+ <entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+ </row>
+ <row>
+ <entry><structfield>avg_distance</structfield></entry>
+ <entry><type>float4</type></entry>
+ <entry>How far ahead of recovery the prefetcher is on average, while recovery is not idle</entry>
+ </row>
+ <row>
+ <entry><structfield>avg_queue_depth</structfield></entry>
+ <entry><type>float4</type></entry>
+ <entry>Average number of prefetches in flight while recovery is not idle</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+ one row. It is filled with nulls if recovery is not running or WAL
+ prefetching is not enabled. See <xref linkend="guc-max-recovery-prefetch-distance"/>
+ for more information. The counters in this view are reset whenever the
+ <xref linkend="guc-max-recovery-prefetch-distance"/>,
+ <xref linkend="guc-recovery-prefetch-fpw"/> or
+ <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+ the server configuration is reloaded.
+ </para>
+
<table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
<title><structname>pg_stat_subscription</structname> View</title>
<tgroup cols="1">
@@ -4494,8 +4573,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
argument. The argument can be <literal>bgwriter</literal> to reset
all the counters shown in
the <structname>pg_stat_bgwriter</structname>
- view,or <literal>archiver</literal> to reset all the counters shown in
- the <structname>pg_stat_archiver</structname> view.
+ view, <literal>archiver</literal> to reset all the counters shown in
+ the <structname>pg_stat_archiver</structname> view, and
+ <literal>prefetch_recovery</literal> to reset all the counters shown
+ in the <structname>pg_stat_prefetch_recovery</structname> view.
</para>
<para>
This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index bd9fae544c..38fc8149a8 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -719,6 +719,19 @@
<acronym>WAL</acronym> call being logged to the server log. This
option might be replaced by a more general mechanism in the future.
</para>
+
+ <para>
+ The <xref linkend="guc-max-recovery-prefetch-distance"/> parameter can
+ be used to improve I/O performance during recovery by instructing
+ <productname>PostgreSQL</productname> to initiate reads
+ of disk blocks that will soon be needed, in combination with the
+ <xref linkend="guc-maintenance-io-concurrency"/> parameter. The
+ prefetching mechanism is most likely to be effective on systems
+ with <varname>full_page_writes</varname> set to
+ <literal>off</literal> (where that is safe), and where the working
+ set is larger than RAM. By default, prefetching in recovery is enabled,
+ but it can be disabled by setting the distance to -1.
+ </para>
</sect1>
<sect1 id="wal-internals">
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..39f9d4e77d 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -31,6 +31,7 @@ OBJS = \
xlogarchive.o \
xlogfuncs.o \
xloginsert.o \
+ xlogprefetch.o \
xlogreader.o \
xlogutils.o
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ca09d81b08..81147d5f59 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -35,6 +35,7 @@
#include "access/xlog_internal.h"
#include "access/xlogarchive.h"
#include "access/xloginsert.h"
+#include "access/xlogprefetch.h"
#include "access/xlogreader.h"
#include "access/xlogutils.h"
#include "catalog/catversion.h"
@@ -7169,6 +7170,7 @@ StartupXLOG(void)
{
ErrorContextCallback errcallback;
TimestampTz xtime;
+ XLogPrefetchState prefetch;
InRedo = true;
@@ -7176,6 +7178,9 @@ StartupXLOG(void)
(errmsg("redo starts at %X/%X",
(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
+ /* Prepare to prefetch, if configured. */
+ XLogPrefetchBegin(&prefetch);
+
/*
* main redo apply loop
*/
@@ -7205,6 +7210,12 @@ StartupXLOG(void)
/* Handle interrupt signals of startup process */
HandleStartupProcInterrupts();
+ /* Perform WAL prefetching, if enabled. */
+ XLogPrefetch(&prefetch,
+ ThisTimeLineID,
+ xlogreader->ReadRecPtr,
+ currentSource == XLOG_FROM_STREAM);
+
/*
* Pause WAL replay, if requested by a hot-standby session via
* SetRecoveryPause().
@@ -7376,6 +7387,9 @@ StartupXLOG(void)
*/
if (switchedTLI && AllowCascadeReplication())
WalSndWakeup();
+
+ /* Reset the prefetcher. */
+ XLogPrefetchReconfigure();
}
/* Exit loop if we reached inclusive recovery target */
@@ -7392,6 +7406,7 @@ StartupXLOG(void)
/*
* end of main redo apply loop
*/
+ XLogPrefetchEnd(&prefetch);
if (reachedRecoveryTarget)
{
@@ -12138,6 +12153,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
*/
currentSource = XLOG_FROM_STREAM;
startWalReceiver = true;
+ XLogPrefetchReconfigure();
break;
case XLOG_FROM_STREAM:
diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c
new file mode 100644
index 0000000000..6d8cff12c6
--- /dev/null
+++ b/src/backend/access/transam/xlogprefetch.c
@@ -0,0 +1,910 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.c
+ * Prefetching support for recovery.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogprefetch.c
+ *
+ * The goal of this module is to read future WAL records and issue
+ * PrefetchSharedBuffer() calls for referenced blocks, so that we avoid I/O
+ * stalls in the main recovery loop. Currently, this is achieved by using a
+ * separate XLogReader to read ahead. In future, we should find a way to
+ * avoid reading and decoding each record twice.
+ *
+ * When examining a WAL record from the future, we need to consider that a
+ * referenced block or segment file might not exist on disk until this record
+ * or some earlier record has been replayed. After a crash, a file might also
+ * be missing because it was dropped by a later WAL record; in that case, it
+ * will be recreated when this record is replayed. These cases are handled by
+ * recognizing them and adding a "filter" that prevents all prefetching of a
+ * certain block range until the present WAL record has been replayed. Blocks
+ * skipped for these reasons are counted as "skip_new" (that is, cases where we
+ * didn't try to prefetch "new" blocks).
+ *
+ * Blocks found in the buffer pool already are counted as "skip_hit".
+ * Repeated access to the same buffer is detected and skipped, and this is
+ * counted with "skip_seq". Blocks that were logged with FPWs are skipped if
+ * recovery_prefetch_fpw is off, since on most systems there will be no I/O
+ * stall; this is counted with "skip_fpw".
+ *
+ * The only way we currently have to know that an I/O initiated with
+ * PrefetchSharedBuffer() has completed is to call ReadBuffer(). Therefore,
+ * we track the number of potentially in-flight I/Os by using a circular
+ * buffer of LSNs. When it's full, we have to wait for recovery to replay
+ * records so that the queue depth can be reduced, before we can do any more
+ * prefetching. Ideally, this keeps us the right distance ahead to respect
+ * maintenance_io_concurrency.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlogprefetch.h"
+#include "access/xlogreader.h"
+#include "access/xlogutils.h"
+#include "catalog/storage_xlog.h"
+#include "utils/fmgrprotos.h"
+#include "utils/timestamp.h"
+#include "funcapi.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "storage/bufmgr.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "utils/guc.h"
+#include "utils/hsearch.h"
+
+/*
+ * Sample the queue depth and distance every time we replay this much WAL.
+ * This is used to compute avg_queue_depth and avg_distance for the log
+ * message that appears at the end of crash recovery. It's also used to send
+ * messages periodically to the stats collector, to save the counters on disk.
+ */
+#define XLOGPREFETCHER_SAMPLE_DISTANCE 0x40000
+
+/* GUCs */
+int max_recovery_prefetch_distance = -1;
+bool recovery_prefetch_fpw = false;
+
+int XLogPrefetchReconfigureCount;
+
+/*
+ * A prefetcher object. There is at most one of these in existence at a time,
+ * recreated whenever there is a configuration change.
+ */
+struct XLogPrefetcher
+{
+ /* Reader and current reading state. */
+ XLogReaderState *reader;
+ XLogReadLocalOptions options;
+ bool have_record;
+ bool shutdown;
+ int next_block_id;
+
+ /* Details of last prefetch to skip repeats and seq scans. */
+ SMgrRelation last_reln;
+ RelFileNode last_rnode;
+ BlockNumber last_blkno;
+
+ /* Online averages. */
+ uint64 samples;
+ double avg_queue_depth;
+ double avg_distance;
+ XLogRecPtr next_sample_lsn;
+
+ /* Book-keeping required to avoid accessing non-existing blocks. */
+ HTAB *filter_table;
+ dlist_head filter_queue;
+
+ /* Book-keeping required to limit concurrent prefetches. */
+ int prefetch_head;
+ int prefetch_tail;
+ int prefetch_queue_size;
+ XLogRecPtr prefetch_queue[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * A temporary filter used to track block ranges that haven't been created
+ * yet, whole relations that haven't been created yet, and whole relations
+ * that we must assume have already been dropped.
+ */
+typedef struct XLogPrefetcherFilter
+{
+ RelFileNode rnode;
+ XLogRecPtr filter_until_replayed;
+ BlockNumber filter_from_block;
+ dlist_node link;
+} XLogPrefetcherFilter;
+
+/*
+ * Counters exposed in shared memory for pg_stat_prefetch_recovery.
+ */
+typedef struct XLogPrefetchStats
+{
+ pg_atomic_uint64 reset_time; /* Time of last reset. */
+ pg_atomic_uint64 prefetch; /* Prefetches initiated. */
+ pg_atomic_uint64 skip_hit; /* Blocks already buffered. */
+ pg_atomic_uint64 skip_new; /* New/missing blocks filtered. */
+ pg_atomic_uint64 skip_fpw; /* FPWs skipped. */
+ pg_atomic_uint64 skip_seq; /* Repeat blocks skipped. */
+ float avg_distance;
+ float avg_queue_depth;
+
+ /* Reset counters */
+ pg_atomic_uint32 reset_request;
+ uint32 reset_handled;
+
+ /* Dynamic values */
+ int distance; /* Number of bytes ahead in the WAL. */
+ int queue_depth; /* Number of I/Os possibly in progress. */
+} XLogPrefetchStats;
+
+static inline void XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher,
+ RelFileNode rnode,
+ BlockNumber blockno,
+ XLogRecPtr lsn);
+static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
+ RelFileNode rnode,
+ BlockNumber blockno);
+static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static inline void XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr prefetching_lsn);
+static inline void XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static inline bool XLogPrefetcherSaturated(XLogPrefetcher *prefetcher);
+static void XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher,
+ XLogRecPtr replaying_lsn);
+static bool XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher);
+static void XLogPrefetchSaveStats(void);
+static void XLogPrefetchRestoreStats(void);
+
+static XLogPrefetchStats *Stats;
+
+size_t
+XLogPrefetchShmemSize(void)
+{
+ return sizeof(XLogPrefetchStats);
+}
+
+static void
+XLogPrefetchResetStats(void)
+{
+ pg_atomic_write_u64(&Stats->reset_time, GetCurrentTimestamp());
+ pg_atomic_write_u64(&Stats->prefetch, 0);
+ pg_atomic_write_u64(&Stats->skip_hit, 0);
+ pg_atomic_write_u64(&Stats->skip_new, 0);
+ pg_atomic_write_u64(&Stats->skip_fpw, 0);
+ pg_atomic_write_u64(&Stats->skip_seq, 0);
+ Stats->avg_distance = 0;
+ Stats->avg_queue_depth = 0;
+}
+
+void
+XLogPrefetchShmemInit(void)
+{
+ bool found;
+
+ Stats = (XLogPrefetchStats *)
+ ShmemInitStruct("XLogPrefetchStats",
+ sizeof(XLogPrefetchStats),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u32(&Stats->reset_request, 0);
+ Stats->reset_handled = 0;
+ pg_atomic_init_u64(&Stats->reset_time, GetCurrentTimestamp());
+ pg_atomic_init_u64(&Stats->prefetch, 0);
+ pg_atomic_init_u64(&Stats->skip_hit, 0);
+ pg_atomic_init_u64(&Stats->skip_new, 0);
+ pg_atomic_init_u64(&Stats->skip_fpw, 0);
+ pg_atomic_init_u64(&Stats->skip_seq, 0);
+ Stats->avg_distance = 0;
+ Stats->avg_queue_depth = 0;
+ Stats->distance = 0;
+ Stats->queue_depth = 0;
+ }
+}
+
+/*
+ * Called when any GUC is changed that affects prefetching.
+ */
+void
+XLogPrefetchReconfigure(void)
+{
+ XLogPrefetchReconfigureCount++;
+}
+
+/*
+ * Called by any backend to request that the stats be reset.
+ */
+void
+XLogPrefetchRequestResetStats(void)
+{
+ pg_atomic_fetch_add_u32(&Stats->reset_request, 1);
+}
+
+/*
+ * Tell the stats collector to serialize the shared memory counters into the
+ * stats file.
+ */
+static void
+XLogPrefetchSaveStats(void)
+{
+ PgStat_RecoveryPrefetchStats serialized = {
+ .prefetch = pg_atomic_read_u64(&Stats->prefetch),
+ .skip_hit = pg_atomic_read_u64(&Stats->skip_hit),
+ .skip_new = pg_atomic_read_u64(&Stats->skip_new),
+ .skip_fpw = pg_atomic_read_u64(&Stats->skip_fpw),
+ .skip_seq = pg_atomic_read_u64(&Stats->skip_seq),
+ .stat_reset_timestamp = pg_atomic_read_u64(&Stats->reset_time)
+ };
+
+ pgstat_send_recoveryprefetch(&serialized);
+}
+
+/*
+ * Try to restore the shared memory counters from the stats file.
+ */
+static void
+XLogPrefetchRestoreStats(void)
+{
+ PgStat_RecoveryPrefetchStats *serialized = pgstat_fetch_recoveryprefetch();
+
+ if (serialized->stat_reset_timestamp != 0)
+ {
+ pg_atomic_write_u64(&Stats->prefetch, serialized->prefetch);
+ pg_atomic_write_u64(&Stats->skip_hit, serialized->skip_hit);
+ pg_atomic_write_u64(&Stats->skip_new, serialized->skip_new);
+ pg_atomic_write_u64(&Stats->skip_fpw, serialized->skip_fpw);
+ pg_atomic_write_u64(&Stats->skip_seq, serialized->skip_seq);
+ pg_atomic_write_u64(&Stats->reset_time, serialized->stat_reset_timestamp);
+ }
+}
+
+/*
+ * Initialize an XLogPrefetchState object and restore the last saved
+ * statistics from disk.
+ */
+void
+XLogPrefetchBegin(XLogPrefetchState *state)
+{
+ XLogPrefetchRestoreStats();
+
+ /* We'll reconfigure on the first call to XLogPrefetch(). */
+ state->prefetcher = NULL;
+ state->reconfigure_count = XLogPrefetchReconfigureCount - 1;
+}
+
+/*
+ * Shut down the prefetching infrastructure, if configured.
+ */
+void
+XLogPrefetchEnd(XLogPrefetchState *state)
+{
+ XLogPrefetchSaveStats();
+
+ if (state->prefetcher)
+ XLogPrefetcherFree(state->prefetcher);
+ state->prefetcher = NULL;
+
+ Stats->queue_depth = 0;
+ Stats->distance = 0;
+}
+
+/*
+ * Create a prefetcher that is ready to begin prefetching blocks referenced by
+ * WAL that is ahead of the given lsn.
+ */
+XLogPrefetcher *
+XLogPrefetcherAllocate(TimeLineID tli, XLogRecPtr lsn, bool streaming)
+{
+ XLogPrefetcher *prefetcher;
+ static HASHCTL hash_table_ctl = {
+ .keysize = sizeof(RelFileNode),
+ .entrysize = sizeof(XLogPrefetcherFilter)
+ };
+ XLogReaderRoutine reader_routines = {
+ .page_read = read_local_xlog_page,
+ .segment_open = wal_segment_try_open,
+ .segment_close = wal_segment_close
+ };
+
+ /*
+ * The size of the queue is based on the maintenance_io_concurrency
+ * setting. In theory we might have a separate queue for each tablespace,
+ * but it's not clear how that should work, so for now we'll just use the
+ * general GUC to rate-limit all prefetching. We add one to the size
+ * because our circular buffer has a gap between head and tail when full.
+ */
+ prefetcher = palloc0(offsetof(XLogPrefetcher, prefetch_queue) +
+ sizeof(XLogRecPtr) * (maintenance_io_concurrency + 1));
+ prefetcher->prefetch_queue_size = maintenance_io_concurrency + 1;
+ prefetcher->options.tli = tli;
+ prefetcher->options.nowait = true;
+ if (streaming)
+ {
+ /*
+ * We're only allowed to read as far as the WAL receiver has written.
+ * We don't have to wait for it to be flushed, though, as recovery
+ * does, so that gives us a chance to get a bit further ahead.
+ */
+ prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN;
+ }
+ else
+ {
+ /* Read as far as we can. */
+ prefetcher->options.read_upto_policy = XLRO_END;
+ }
+ reader_routines.page_read_private = &prefetcher->options;
+ prefetcher->reader = XLogReaderAllocate(wal_segment_size,
+ NULL,
+ &reader_routines,
+ NULL);
+ prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024,
+ &hash_table_ctl,
+ HASH_ELEM | HASH_BLOBS);
+ dlist_init(&prefetcher->filter_queue);
+
+ /* Prepare to read at the given LSN. */
+ ereport(LOG,
+ (errmsg("recovery started prefetching on timeline %u at %X/%X",
+ tli,
+ (uint32) (lsn >> 32), (uint32) lsn)));
+ XLogBeginRead(prefetcher->reader, lsn);
+
+ Stats->queue_depth = 0;
+ Stats->distance = 0;
+
+ return prefetcher;
+}
+
+/*
+ * Destroy a prefetcher and release all resources.
+ */
+void
+XLogPrefetcherFree(XLogPrefetcher *prefetcher)
+{
+ /* Log final statistics. */
+ ereport(LOG,
+ (errmsg("recovery finished prefetching at %X/%X; "
+ "prefetch = " UINT64_FORMAT ", "
+ "skip_hit = " UINT64_FORMAT ", "
+ "skip_new = " UINT64_FORMAT ", "
+ "skip_fpw = " UINT64_FORMAT ", "
+ "skip_seq = " UINT64_FORMAT ", "
+ "avg_distance = %f, "
+ "avg_queue_depth = %f",
+ (uint32) (prefetcher->reader->EndRecPtr >> 32),
+ (uint32) (prefetcher->reader->EndRecPtr),
+ pg_atomic_read_u64(&Stats->prefetch),
+ pg_atomic_read_u64(&Stats->skip_hit),
+ pg_atomic_read_u64(&Stats->skip_new),
+ pg_atomic_read_u64(&Stats->skip_fpw),
+ pg_atomic_read_u64(&Stats->skip_seq),
+ Stats->avg_distance,
+ Stats->avg_queue_depth)));
+ XLogReaderFree(prefetcher->reader);
+ hash_destroy(prefetcher->filter_table);
+ pfree(prefetcher);
+}
+
+/*
+ * Called when recovery is replaying a new LSN, to check if we can read ahead.
+ */
+void
+XLogPrefetcherReadAhead(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ uint32 reset_request;
+
+ /* If an error has occurred or we've hit the end of the WAL, do nothing. */
+ if (prefetcher->shutdown)
+ return;
+
+ /*
+ * Have any in-flight prefetches definitely completed, judging by the LSN
+ * that is currently being replayed?
+ */
+ XLogPrefetcherCompletedIO(prefetcher, replaying_lsn);
+
+ /*
+ * Do we already have the maximum permitted number of I/Os running
+ * (according to the information we have)? If so, we have to wait for at
+ * least one to complete, so give up early and let recovery catch up.
+ */
+ if (XLogPrefetcherSaturated(prefetcher))
+ return;
+
+ /*
+ * Can we drop any filters yet? This happens when the LSN that is
+ * currently being replayed has moved past a record that prevents
+ * prefetching of a block range, such as relation extension.
+ */
+ XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn);
+
+ /*
+ * Have we been asked to reset our stats counters? This is checked with
+ * an unsynchronized memory read, but we'll see it eventually and we'll be
+ * accessing that cache line anyway.
+ */
+ reset_request = pg_atomic_read_u32(&Stats->reset_request);
+ if (reset_request != Stats->reset_handled)
+ {
+ XLogPrefetchResetStats();
+ Stats->reset_handled = reset_request;
+ prefetcher->avg_distance = 0;
+ prefetcher->avg_queue_depth = 0;
+ prefetcher->samples = 0;
+ }
+
+ /* OK, we can now try reading ahead. */
+ XLogPrefetcherScanRecords(prefetcher, replaying_lsn);
+}
+
+/*
+ * Read ahead as far as we are allowed to, considering the LSN that recovery
+ * is currently replaying.
+ */
+static void
+XLogPrefetcherScanRecords(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ XLogReaderState *reader = prefetcher->reader;
+
+ Assert(!XLogPrefetcherSaturated(prefetcher));
+
+ for (;;)
+ {
+ char *error;
+ int64 distance;
+
+ /* If we don't already have a record, then try to read one. */
+ if (!prefetcher->have_record)
+ {
+ if (!XLogReadRecord(reader, &error))
+ {
+ /* If we got an error, log it and give up. */
+ if (error)
+ {
+ ereport(LOG, (errmsg("recovery no longer prefetching: %s", error)));
+ prefetcher->shutdown = true;
+ Stats->queue_depth = 0;
+ Stats->distance = 0;
+ }
+ /* Otherwise, we'll try again later when more data is here. */
+ return;
+ }
+ prefetcher->have_record = true;
+ prefetcher->next_block_id = 0;
+ }
+
+ /* How far ahead of replay are we now? */
+ distance = prefetcher->reader->ReadRecPtr - replaying_lsn;
+
+ /* Update distance shown in shm. */
+ Stats->distance = distance;
+
+ /* Periodically recompute some statistics. */
+ if (unlikely(replaying_lsn >= prefetcher->next_sample_lsn))
+ {
+ /* Compute online averages. */
+ prefetcher->samples++;
+ if (prefetcher->samples == 1)
+ {
+ prefetcher->avg_distance = Stats->distance;
+ prefetcher->avg_queue_depth = Stats->queue_depth;
+ }
+ else
+ {
+ prefetcher->avg_distance +=
+ (Stats->distance - prefetcher->avg_distance) /
+ prefetcher->samples;
+ prefetcher->avg_queue_depth +=
+ (Stats->queue_depth - prefetcher->avg_queue_depth) /
+ prefetcher->samples;
+ }
+
+ /* Expose it in shared memory. */
+ Stats->avg_distance = prefetcher->avg_distance;
+ Stats->avg_queue_depth = prefetcher->avg_queue_depth;
+
+ /* Also periodically save the simple counters. */
+ XLogPrefetchSaveStats();
+
+ prefetcher->next_sample_lsn =
+ replaying_lsn + XLOGPREFETCHER_SAMPLE_DISTANCE;
+ }
+
+ /* Are we too far ahead of replay? */
+ if (distance >= max_recovery_prefetch_distance)
+ break;
+
+ /* Are we not far enough ahead? */
+ if (distance <= 0)
+ {
+ prefetcher->have_record = false; /* skip this record */
+ continue;
+ }
+
+ /*
+ * If this is a record that creates a new SMGR relation, we'll avoid
+ * prefetching anything from that rnode until it has been replayed.
+ */
+ if (replaying_lsn < reader->ReadRecPtr &&
+ XLogRecGetRmid(reader) == RM_SMGR_ID &&
+ (XLogRecGetInfo(reader) & ~XLR_INFO_MASK) == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(reader);
+
+ XLogPrefetcherAddFilter(prefetcher, xlrec->rnode, 0,
+ reader->ReadRecPtr);
+ }
+
+ /* Scan the record's block references. */
+ if (!XLogPrefetcherScanBlocks(prefetcher))
+ return;
+
+ /* Advance to the next record. */
+ prefetcher->have_record = false;
+ }
+}
+
+/*
+ * Scan the current record for block references, and consider prefetching.
+ *
+ * Return true if we processed the current record to completion and still have
+ * queue space to process a new record, and false if we saturated the I/O
+ * queue and need to wait for recovery to advance before we continue.
+ */
+static bool
+XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
+{
+ XLogReaderState *reader = prefetcher->reader;
+
+ Assert(!XLogPrefetcherSaturated(prefetcher));
+
+ /*
+ * We might already have been partway through processing this record when
+ * our queue became saturated, so we need to start where we left off.
+ */
+ for (int block_id = prefetcher->next_block_id;
+ block_id <= reader->max_block_id;
+ ++block_id)
+ {
+ PrefetchBufferResult prefetch;
+ DecodedBkpBlock *block = &reader->blocks[block_id];
+ SMgrRelation reln;
+
+ /* Ignore everything but the main fork for now. */
+ if (block->forknum != MAIN_FORKNUM)
+ continue;
+
+ /*
+ * If there is a full page image attached, we won't be reading the
+ * page, so you might think we should skip it. However, if the
+ * underlying filesystem uses larger logical blocks than us, it
+ * might still need to perform a read-before-write some time later.
+ * Therefore, only prefetch if configured to do so.
+ */
+ if (block->has_image && !recovery_prefetch_fpw)
+ {
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
+ continue;
+ }
+
+ /*
+ * If this block will initialize a new page then it's probably an
+ * extension. Since it might create a new segment, we can't try
+ * to prefetch this block until the record has been replayed, or we
+ * might try to open a file that doesn't exist yet.
+ */
+ if (block->flags & BKPBLOCK_WILL_INIT)
+ {
+ XLogPrefetcherAddFilter(prefetcher, block->rnode, block->blkno,
+ reader->ReadRecPtr);
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+ continue;
+ }
+
+ /* Should we skip this block due to a filter? */
+ if (XLogPrefetcherIsFiltered(prefetcher, block->rnode, block->blkno))
+ {
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+ continue;
+ }
+
+ /* Fast path for repeated references to the same relation. */
+ if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
+ {
+ /*
+ * If this is a repeat access to the same block, then skip it.
+ *
+ * XXX We could also check for last_blkno + 1 too, and also update
+ * last_blkno; it's not clear if the kernel would do a better job
+ * of sequential prefetching.
+ */
+ if (block->blkno == prefetcher->last_blkno)
+ {
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
+ continue;
+ }
+
+ /* We can avoid calling smgropen(). */
+ reln = prefetcher->last_reln;
+ }
+ else
+ {
+ /* Otherwise we have to open it. */
+ reln = smgropen(block->rnode, InvalidBackendId);
+ prefetcher->last_rnode = block->rnode;
+ prefetcher->last_reln = reln;
+ }
+ prefetcher->last_blkno = block->blkno;
+
+ /* Try to prefetch this block! */
+ prefetch = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
+ if (BufferIsValid(prefetch.recent_buffer))
+ {
+ /*
+ * It was already cached, so do nothing. Perhaps in future we
+ * could remember the buffer so that recovery doesn't have to look
+ * it up again.
+ */
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_hit, 1);
+ }
+ else if (prefetch.initiated_io)
+ {
+ /*
+ * I/O has possibly been initiated (we can't tell whether the kernel
+ * already had the page cached, so for lack of better information we
+ * assume it was). Record this as an I/O in progress until eventually
+ * we replay this LSN.
+ */
+ pg_atomic_unlocked_add_fetch_u64(&Stats->prefetch, 1);
+ XLogPrefetcherInitiatedIO(prefetcher, reader->ReadRecPtr);
+ /*
+ * If the queue is now full, we'll have to wait before processing
+ * any more blocks from this record, or move to a new record if
+ * that was the last block.
+ */
+ if (XLogPrefetcherSaturated(prefetcher))
+ {
+ prefetcher->next_block_id = block_id + 1;
+ return false;
+ }
+ }
+ else
+ {
+ /*
+ * Neither cached nor initiated. The underlying segment file
+ * doesn't exist. Presumably it will be unlinked by a later WAL
+ * record. When recovery reads this block, it will use the
+ * EXTENSION_CREATE_RECOVERY flag. We certainly don't want to do
+ * that sort of thing while merely prefetching, so let's just
+ * ignore references to this relation until this record is
+ * replayed, and let recovery create the dummy file or complain if
+ * something is wrong.
+ */
+ XLogPrefetcherAddFilter(prefetcher, block->rnode, 0,
+ reader->ReadRecPtr);
+ pg_atomic_unlocked_add_fetch_u64(&Stats->skip_new, 1);
+ }
+ }
+
+ return true;
+}
+
+/*
+ * Expose statistics about recovery prefetching.
+ */
+Datum
+pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_PREFETCH_RECOVERY_COLS 10
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ Datum values[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+ bool nulls[PG_STAT_GET_PREFETCH_RECOVERY_COLS];
+
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("materialize mod required, but it is not allowed in this context")));
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ if (pg_atomic_read_u32(&Stats->reset_request) != Stats->reset_handled)
+ {
+ /* There's an unhandled reset request, so just show NULLs */
+ for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+ nulls[i] = true;
+ }
+ else
+ {
+ for (int i = 0; i < PG_STAT_GET_PREFETCH_RECOVERY_COLS; ++i)
+ nulls[i] = false;
+ }
+
+ values[0] = TimestampTzGetDatum(pg_atomic_read_u64(&Stats->reset_time));
+ values[1] = Int64GetDatum(pg_atomic_read_u64(&Stats->prefetch));
+ values[2] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_hit));
+ values[3] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_new));
+ values[4] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_fpw));
+ values[5] = Int64GetDatum(pg_atomic_read_u64(&Stats->skip_seq));
+ values[6] = Int32GetDatum(Stats->distance);
+ values[7] = Int32GetDatum(Stats->queue_depth);
+ values[8] = Float4GetDatum(Stats->avg_distance);
+ values[9] = Float4GetDatum(Stats->avg_queue_depth);
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
+/*
+ * Don't prefetch any blocks >= 'blockno' from a given 'rnode', until 'lsn'
+ * has been replayed.
+ */
+static inline void
+XLogPrefetcherAddFilter(XLogPrefetcher *prefetcher, RelFileNode rnode,
+ BlockNumber blockno, XLogRecPtr lsn)
+{
+ XLogPrefetcherFilter *filter;
+ bool found;
+
+ filter = hash_search(prefetcher->filter_table, &rnode, HASH_ENTER, &found);
+ if (!found)
+ {
+ /*
+ * Don't allow any prefetching of this block or higher until replayed.
+ */
+ filter->filter_until_replayed = lsn;
+ filter->filter_from_block = blockno;
+ dlist_push_head(&prefetcher->filter_queue, &filter->link);
+ }
+ else
+ {
+ /*
+ * We were already filtering this rnode. Extend the filter's lifetime
+ * to cover this WAL record, but leave the (presumably lower) block
+ * number there because we don't want to have to track individual
+ * blocks.
+ */
+ filter->filter_until_replayed = lsn;
+ dlist_delete(&filter->link);
+ dlist_push_head(&prefetcher->filter_queue, &filter->link);
+ }
+}
+
+/*
+ * Have we replayed the records that caused us to begin filtering a block
+ * range? That means that relations should have been created, extended or
+ * dropped as required, so we can drop relevant filters.
+ */
+static inline void
+XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ while (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+ {
+ XLogPrefetcherFilter *filter = dlist_tail_element(XLogPrefetcherFilter,
+ link,
+ &prefetcher->filter_queue);
+
+ if (filter->filter_until_replayed >= replaying_lsn)
+ break;
+ dlist_delete(&filter->link);
+ hash_search(prefetcher->filter_table, filter, HASH_REMOVE, NULL);
+ }
+}
+
+/*
+ * Check if a given block should be skipped due to a filter.
+ */
+static inline bool
+XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher, RelFileNode rnode,
+ BlockNumber blockno)
+{
+ /*
+ * Test for empty queue first, because we expect it to be empty most of the
+ * time and we can avoid the hash table lookup in that case.
+ */
+ if (unlikely(!dlist_is_empty(&prefetcher->filter_queue)))
+ {
+ XLogPrefetcherFilter *filter = hash_search(prefetcher->filter_table, &rnode,
+ HASH_FIND, NULL);
+
+ if (filter && filter->filter_from_block <= blockno)
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Insert an LSN into the queue. The queue must not be full already. This
+ * tracks the fact that we have (to the best of our knowledge) initiated an
+ * I/O, so that we can impose a cap on concurrent prefetching.
+ */
+static inline void
+XLogPrefetcherInitiatedIO(XLogPrefetcher *prefetcher,
+ XLogRecPtr prefetching_lsn)
+{
+ Assert(!XLogPrefetcherSaturated(prefetcher));
+ prefetcher->prefetch_queue[prefetcher->prefetch_head++] = prefetching_lsn;
+ prefetcher->prefetch_head %= prefetcher->prefetch_queue_size;
+ Stats->queue_depth++;
+ Assert(Stats->queue_depth <= prefetcher->prefetch_queue_size);
+}
+
+/*
+ * Have we replayed the records that caused us to initiate the oldest
+ * prefetches yet? That means that they're definitely finished, so we can
+ * forget about them and allow ourselves to initiate more prefetches. For now
+ * we don't have any awareness of when I/O really completes.
+ */
+static inline void
+XLogPrefetcherCompletedIO(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
+{
+ while (prefetcher->prefetch_head != prefetcher->prefetch_tail &&
+ prefetcher->prefetch_queue[prefetcher->prefetch_tail] < replaying_lsn)
+ {
+ prefetcher->prefetch_tail++;
+ prefetcher->prefetch_tail %= prefetcher->prefetch_queue_size;
+ Stats->queue_depth--;
+ Assert(Stats->queue_depth >= 0);
+ }
+}
+
+/*
+ * Check if the maximum allowed number of I/Os is already in flight.
+ */
+static inline bool
+XLogPrefetcherSaturated(XLogPrefetcher *prefetcher)
+{
+ return (prefetcher->prefetch_head + 1) % prefetcher->prefetch_queue_size ==
+ prefetcher->prefetch_tail;
+}
+
+void
+assign_max_recovery_prefetch_distance(int new_value, void *extra)
+{
+ /* Reconfigure prefetching, because a setting it depends on changed. */
+ max_recovery_prefetch_distance = new_value;
+ if (AmStartupProcess())
+ XLogPrefetchReconfigure();
+}
+
+void
+assign_recovery_prefetch_fpw(bool new_value, void *extra)
+{
+ /* Reconfigure prefetching, because a setting it depends on changed. */
+ recovery_prefetch_fpw = new_value;
+ if (AmStartupProcess())
+ XLogPrefetchReconfigure();
+}
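
As an aside, because the ring-buffer arithmetic above is easy to misread in
diff form: prefetch_queue is a ring of maintenance_io_concurrency + 1 slots
that always keeps one empty slot between head and tail, so "head + 1 == tail
(mod size)" means full. Here is a minimal standalone toy with the same
bookkeeping (not part of the patch; the constants are made up):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define QUEUE_SLOTS (4 + 1)     /* pretend maintenance_io_concurrency = 4 */

static uint64_t queue[QUEUE_SLOTS]; /* LSNs of records we prefetched for */
static int head, tail;

static bool
saturated(void)
{
    return (head + 1) % QUEUE_SLOTS == tail;
}

static void
initiated_io(uint64_t prefetching_lsn)
{
    assert(!saturated());
    queue[head] = prefetching_lsn;
    head = (head + 1) % QUEUE_SLOTS;
}

static void
completed_io(uint64_t replaying_lsn)
{
    /* Everything prefetched for records before replaying_lsn is done. */
    while (head != tail && queue[tail] < replaying_lsn)
        tail = (tail + 1) % QUEUE_SLOTS;
}

int
main(void)
{
    for (uint64_t lsn = 1; lsn <= 10; lsn++)
    {
        if (!saturated())
            initiated_io(lsn);
        completed_io(lsn > 2 ? lsn - 2 : 0);    /* replay lags ~2 records */
    }
    printf("in flight: %d\n", (head - tail + QUEUE_SLOTS) % QUEUE_SLOTS);
    return 0;
}
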
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 56420bbc9d..6c39b9ad48 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -826,6 +826,20 @@ CREATE VIEW pg_stat_wal_receiver AS
FROM pg_stat_get_wal_receiver() s
WHERE s.pid IS NOT NULL;
+CREATE VIEW pg_stat_prefetch_recovery AS
+ SELECT
+ s.stats_reset,
+ s.prefetch,
+ s.skip_hit,
+ s.skip_new,
+ s.skip_fpw,
+ s.skip_seq,
+ s.distance,
+ s.queue_depth,
+ s.avg_distance,
+ s.avg_queue_depth
+ FROM pg_stat_get_prefetch_recovery() s;
+
CREATE VIEW pg_stat_subscription AS
SELECT
su.oid AS subid,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d7f99d9944..5ac3fed4c6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
#include "access/transam.h"
#include "access/twophase_rmgr.h"
#include "access/xact.h"
+#include "access/xlogprefetch.h"
#include "catalog/pg_database.h"
#include "catalog/pg_proc.h"
#include "common/ip.h"
@@ -282,6 +283,7 @@ static int localNumBackends = 0;
static PgStat_ArchiverStats archiverStats;
static PgStat_GlobalStats globalStats;
static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
+static PgStat_RecoveryPrefetchStats recoveryPrefetchStats;
/*
* List of OIDs of databases we need to write out. If an entry is InvalidOid,
@@ -354,6 +356,7 @@ static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
+static void pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len);
static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1370,11 +1373,20 @@ pgstat_reset_shared_counters(const char *target)
msg.m_resettarget = RESET_ARCHIVER;
else if (strcmp(target, "bgwriter") == 0)
msg.m_resettarget = RESET_BGWRITER;
+ else if (strcmp(target, "prefetch_recovery") == 0)
+ {
+ /*
+ * We can't ask the stats collector to do this for us as it is not
+ * attached to shared memory.
+ */
+ XLogPrefetchRequestResetStats();
+ return;
+ }
else
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("unrecognized reset target: \"%s\"", target),
- errhint("Target must be \"archiver\" or \"bgwriter\".")));
+ errhint("Target must be \"archiver\", \"bgwriter\" or \"prefetch_recovery\".")));
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
pgstat_send(&msg, sizeof(msg));
@@ -2696,6 +2708,22 @@ pgstat_fetch_slru(void)
}
+/*
+ * ---------
+ * pgstat_fetch_recoveryprefetch() -
+ *
+ * Support function for restoring the counters managed by xlogprefetch.c.
+ * ---------
+ */
+PgStat_RecoveryPrefetchStats *
+pgstat_fetch_recoveryprefetch(void)
+{
+ backend_read_statsfile();
+
+ return &recoveryPrefetchStats;
+}
+
+
/* ------------------------------------------------------------
* Functions for management of the shared-memory PgBackendStatus array
* ------------------------------------------------------------
@@ -4444,6 +4472,23 @@ pgstat_send_slru(void)
}
+/* ----------
+ * pgstat_send_recoveryprefetch() -
+ *
+ * Send recovery prefetch statistics to the collector
+ * ----------
+ */
+void
+pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats)
+{
+ PgStat_MsgRecoveryPrefetch msg;
+
+ pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYPREFETCH);
+ msg.m_stats = *stats;
+ pgstat_send(&msg, sizeof(msg));
+}
+
+
/* ----------
* PgstatCollectorMain() -
*
@@ -4640,6 +4685,10 @@ PgstatCollectorMain(int argc, char *argv[])
pgstat_recv_slru(&msg.msg_slru, len);
break;
+ case PGSTAT_MTYPE_RECOVERYPREFETCH:
+ pgstat_recv_recoveryprefetch(&msg.msg_recoveryprefetch, len);
+ break;
+
case PGSTAT_MTYPE_FUNCSTAT:
pgstat_recv_funcstat(&msg.msg_funcstat, len);
break;
@@ -4915,6 +4964,13 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
(void) rc; /* we'll check for error with ferror */
+ /*
+ * Write recovery prefetch stats struct
+ */
+ rc = fwrite(&recoveryPrefetchStats, sizeof(recoveryPrefetchStats), 1,
+ fpout);
+ (void) rc; /* we'll check for error with ferror */
+
/*
* Walk through the database table.
*/
@@ -5174,6 +5230,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
memset(&globalStats, 0, sizeof(globalStats));
memset(&archiverStats, 0, sizeof(archiverStats));
memset(&slruStats, 0, sizeof(slruStats));
+ memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
/*
* Set the current timestamp (will be kept only in case we can't load an
@@ -5261,6 +5318,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
goto done;
}
+ /*
+ * Read recoveryPrefetchStats struct
+ */
+ if (fread(&recoveryPrefetchStats, 1, sizeof(recoveryPrefetchStats),
+ fpin) != sizeof(recoveryPrefetchStats))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", statfile)));
+ memset(&recoveryPrefetchStats, 0, sizeof(recoveryPrefetchStats));
+ goto done;
+ }
+
/*
* We found an existing collector stats file. Read it and put all the
* hashtable entries into place.
@@ -5560,6 +5629,7 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
PgStat_GlobalStats myGlobalStats;
PgStat_ArchiverStats myArchiverStats;
PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
+ PgStat_RecoveryPrefetchStats myRecoveryPrefetchStats;
FILE *fpin;
int32 format_id;
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
@@ -5625,6 +5695,18 @@ pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
return false;
}
+ /*
+ * Read recovery prefetch stats struct
+ */
+ if (fread(&myRecoveryPrefetchStats, 1, sizeof(myRecoveryPrefetchStats),
+ fpin) != sizeof(myRecoveryPrefetchStats))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", statfile)));
+ FreeFile(fpin);
+ return false;
+ }
+
/* By default, we're going to return the timestamp of the global file. */
*ts = myGlobalStats.stats_timestamp;
@@ -6422,6 +6504,18 @@ pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
slruStats[msg->m_index].truncate += msg->m_truncate;
}
+/* ----------
+ * pgstat_recv_recoveryprefetch() -
+ *
+ * Process a recovery prefetch message.
+ * ----------
+ */
+static void
+pgstat_recv_recoveryprefetch(PgStat_MsgRecoveryPrefetch *msg, int len)
+{
+ recoveryPrefetchStats = msg->m_stats;
+}
+
/* ----------
* pgstat_recv_recoveryconflict() -
*
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..221081bddc 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
#include "access/nbtree.h"
#include "access/subtrans.h"
#include "access/twophase.h"
+#include "access/xlogprefetch.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -124,6 +125,7 @@ CreateSharedMemoryAndSemaphores(void)
size = add_size(size, PredicateLockShmemSize());
size = add_size(size, ProcGlobalShmemSize());
size = add_size(size, XLOGShmemSize());
+ size = add_size(size, XLogPrefetchShmemSize());
size = add_size(size, CLOGShmemSize());
size = add_size(size, CommitTsShmemSize());
size = add_size(size, SUBTRANSShmemSize());
@@ -212,6 +214,7 @@ CreateSharedMemoryAndSemaphores(void)
* Set up xlog, clog, and buffers
*/
XLOGShmemInit();
+ XLogPrefetchShmemInit();
CLOGShmemInit();
CommitTsShmemInit();
SUBTRANSShmemInit();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2f3e0a70e0..2fea5f3dcd 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -34,6 +34,7 @@
#include "access/twophase.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
+#include "access/xlogprefetch.h"
#include "catalog/namespace.h"
#include "catalog/pg_authid.h"
#include "catalog/storage.h"
@@ -198,6 +199,7 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
+static void assign_maintenance_io_concurrency(int newval, void *extra);
static void assign_pgstat_temp_directory(const char *newval, void *extra);
static bool check_application_name(char **newval, void **extra, GucSource source);
static void assign_application_name(const char *newval, void *extra);
@@ -1272,6 +1274,18 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"recovery_prefetch_fpw", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Prefetch blocks that have full page images in the WAL"),
+ gettext_noop("On some systems, there is no benefit to prefetching pages that will be "
+ "entirely overwritten, but if the logical page size of the filesystem is "
+ "larger than PostgreSQL's, this can be beneficial. This option has no "
+ "effect unless max_recovery_prefetch_distance is set to a positive number.")
+ },
+ &recovery_prefetch_fpw,
+ false,
+ NULL, assign_recovery_prefetch_fpw, NULL
+ },
{
{"wal_log_hints", PGC_POSTMASTER, WAL_SETTINGS,
@@ -2649,6 +2663,22 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"max_recovery_prefetch_distance", PGC_SIGHUP, WAL_ARCHIVE_RECOVERY,
+ gettext_noop("Maximum number of bytes to read ahead in the WAL to prefetch referenced blocks."),
+ gettext_noop("Set to -1 to disable prefetching during recovery."),
+ GUC_UNIT_BYTE
+ },
+ &max_recovery_prefetch_distance,
+#ifdef USE_PREFETCH
+ 256 * 1024,
+#else
+ -1,
+#endif
+ -1, INT_MAX,
+ NULL, assign_max_recovery_prefetch_distance, NULL
+ },
+
{
{"wal_keep_segments", PGC_SIGHUP, REPLICATION_SENDING,
gettext_noop("Sets the number of WAL files held for standby servers."),
@@ -2968,7 +2998,8 @@ static struct config_int ConfigureNamesInt[] =
0,
#endif
0, MAX_IO_CONCURRENCY,
- check_maintenance_io_concurrency, NULL, NULL
+ check_maintenance_io_concurrency, assign_maintenance_io_concurrency,
+ NULL
},
{
@@ -11586,6 +11617,20 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
return true;
}
+static void
+assign_maintenance_io_concurrency(int newval, void *extra)
+{
+#ifdef USE_PREFETCH
+ /*
+ * Reconfigure recovery prefetching, because a setting it depends on
+ * changed.
+ */
+ maintenance_io_concurrency = newval;
+ if (AmStartupProcess())
+ XLogPrefetchReconfigure();
+#endif
+}
+
static void
assign_pgstat_temp_directory(const char *newval, void *extra)
{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 81055edde7..38763f88b0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -230,6 +230,11 @@
#checkpoint_flush_after = 0 # measured in pages, 0 disables
#checkpoint_warning = 30s # 0 disables
+# - Prefetching during recovery -
+
+#max_recovery_prefetch_distance = 256kB # -1 disables prefetching
+#recovery_prefetch_fpw = off # whether to prefetch pages logged with FPW
+
# - Archiving -
#archive_mode = off # enables archiving; off, on, or always
diff --git a/src/include/access/xlogprefetch.h b/src/include/access/xlogprefetch.h
new file mode 100644
index 0000000000..d8e2e1ca50
--- /dev/null
+++ b/src/include/access/xlogprefetch.h
@@ -0,0 +1,85 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogprefetch.h
+ * Declarations for the recovery prefetching module.
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/xlogprefetch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGPREFETCH_H
+#define XLOGPREFETCH_H
+
+#include "access/xlogdefs.h"
+
+/* GUCs */
+extern int max_recovery_prefetch_distance;
+extern bool recovery_prefetch_fpw;
+
+struct XLogPrefetcher;
+typedef struct XLogPrefetcher XLogPrefetcher;
+
+extern int XLogPrefetchReconfigureCount;
+
+typedef struct XLogPrefetchState
+{
+ XLogPrefetcher *prefetcher;
+ int reconfigure_count;
+} XLogPrefetchState;
+
+extern size_t XLogPrefetchShmemSize(void);
+extern void XLogPrefetchShmemInit(void);
+
+extern void XLogPrefetchReconfigure(void);
+extern void XLogPrefetchRequestResetStats(void);
+
+extern void XLogPrefetchBegin(XLogPrefetchState *state);
+extern void XLogPrefetchEnd(XLogPrefetchState *state);
+
+/* Functions exposed only for the use of XLogPrefetch(). */
+extern XLogPrefetcher *XLogPrefetcherAllocate(TimeLineID tli,
+ XLogRecPtr lsn,
+ bool streaming);
+extern void XLogPrefetcherFree(XLogPrefetcher *prefetcher);
+extern void XLogPrefetcherReadAhead(XLogPrefetcher *prefetch,
+ XLogRecPtr replaying_lsn);
+
+/*
+ * Tell the prefetching module that we are now replaying a given LSN, so that
+ * it can decide how far ahead to read in the WAL, if configured.
+ */
+static inline void
+XLogPrefetch(XLogPrefetchState *state,
+ TimeLineID replaying_tli,
+ XLogRecPtr replaying_lsn,
+ bool from_stream)
+{
+ /*
+ * Handle any configuration changes. Rather than trying to deal with
+ * various parameter changes, we just tear down and set up a new
+ * prefetcher if anything we depend on changes.
+ */
+ if (unlikely(state->reconfigure_count != XLogPrefetchReconfigureCount))
+ {
+ /* If we had a prefetcher, tear it down. */
+ if (state->prefetcher)
+ {
+ XLogPrefetcherFree(state->prefetcher);
+ state->prefetcher = NULL;
+ }
+ /* If we want a prefetcher, set it up. */
+ if (max_recovery_prefetch_distance > 0)
+ state->prefetcher = XLogPrefetcherAllocate(replaying_tli,
+ replaying_lsn,
+ from_stream);
+ state->reconfigure_count = XLogPrefetchReconfigureCount;
+ }
+
+ if (state->prefetcher)
+ XLogPrefetcherReadAhead(state->prefetcher, replaying_lsn);
+}
+
+#endif
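
For completeness, the intended call pattern from a redo loop, matching the
StartupXLOG() hunk earlier in this patch -- demo_redo_loop and its arguments
are invented for illustration only, and the loop body is schematic:

#include "postgres.h"

#include "access/xlog.h"
#include "access/xlogprefetch.h"
#include "access/xlogreader.h"

static void
demo_redo_loop(XLogReaderState *xlogreader, bool from_stream)
{
    XLogPrefetchState prefetch;

    XLogPrefetchBegin(&prefetch);
    for (;;)
    {
        /* Look ahead in the WAL and start I/O for blocks needed soon. */
        XLogPrefetch(&prefetch, ThisTimeLineID,
                     xlogreader->ReadRecPtr, from_stream);

        /* ... apply the record at xlogreader->ReadRecPtr here ... */

        break;              /* a real loop reads the next record instead */
    }
    XLogPrefetchEnd(&prefetch);
}
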
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 61f2c2f5b4..56b48bf2ad 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6136,6 +6136,14 @@
prorettype => 'bool', proargtypes => '',
prosrc => 'pg_is_wal_replay_paused' },
+{ oid => '9085', descr => 'statistics: information about WAL prefetching',
+ proname => 'pg_stat_get_prefetch_recovery', prorows => '1', provolatile => 'v',
+ proretset => 't', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{timestamptz,int8,int8,int8,int8,int8,int4,int4,float4,float4}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{stats_reset,prefetch,skip_hit,skip_new,skip_fpw,skip_seq,distance,queue_depth,avg_distance,avg_queue_depth}',
+ prosrc => 'pg_stat_get_prefetch_recovery' },
+
{ oid => '2621', descr => 'reload configuration files',
proname => 'pg_reload_conf', provolatile => 'v', prorettype => 'bool',
proargtypes => '', prosrc => 'pg_reload_conf' },
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481c..0dcd3c377a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -62,6 +62,7 @@ typedef enum StatMsgType
PGSTAT_MTYPE_ARCHIVER,
PGSTAT_MTYPE_BGWRITER,
PGSTAT_MTYPE_SLRU,
+ PGSTAT_MTYPE_RECOVERYPREFETCH,
PGSTAT_MTYPE_FUNCSTAT,
PGSTAT_MTYPE_FUNCPURGE,
PGSTAT_MTYPE_RECOVERYCONFLICT,
@@ -182,6 +183,19 @@ typedef struct PgStat_TableXactStatus
struct PgStat_TableXactStatus *next; /* next of same subxact */
} PgStat_TableXactStatus;
+/*
+ * Recovery prefetching statistics persisted on disk by pgstat.c, but kept in
+ * shared memory by xlogprefetch.c.
+ */
+typedef struct PgStat_RecoveryPrefetchStats
+{
+ PgStat_Counter prefetch;
+ PgStat_Counter skip_hit;
+ PgStat_Counter skip_new;
+ PgStat_Counter skip_fpw;
+ PgStat_Counter skip_seq;
+ TimestampTz stat_reset_timestamp;
+} PgStat_RecoveryPrefetchStats;
/* ------------------------------------------------------------
* Message formats follow
@@ -453,6 +467,16 @@ typedef struct PgStat_MsgSLRU
PgStat_Counter m_truncate;
} PgStat_MsgSLRU;
+/* ----------
+ * PgStat_MsgRecoveryPrefetch Sent by XLogPrefetch to save statistics.
+ * ----------
+ */
+typedef struct PgStat_MsgRecoveryPrefetch
+{
+ PgStat_MsgHdr m_hdr;
+ PgStat_RecoveryPrefetchStats m_stats;
+} PgStat_MsgRecoveryPrefetch;
+
/* ----------
* PgStat_MsgRecoveryConflict Sent by the backend upon recovery conflict
* ----------
@@ -597,6 +621,7 @@ typedef union PgStat_Msg
PgStat_MsgArchiver msg_archiver;
PgStat_MsgBgWriter msg_bgwriter;
PgStat_MsgSLRU msg_slru;
+ PgStat_MsgRecoveryPrefetch msg_recoveryprefetch;
PgStat_MsgFuncstat msg_funcstat;
PgStat_MsgFuncpurge msg_funcpurge;
PgStat_MsgRecoveryConflict msg_recoveryconflict;
@@ -1458,6 +1483,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
extern void pgstat_send_archiver(const char *xlog, bool failed);
extern void pgstat_send_bgwriter(void);
+extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
/* ----------
* Support functions for the SQL-callable functions to
@@ -1473,6 +1499,7 @@ extern int pgstat_fetch_stat_numbackends(void);
extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
extern PgStat_GlobalStats *pgstat_fetch_global(void);
extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_RecoveryPrefetchStats *pgstat_fetch_recoveryprefetch(void);
extern void pgstat_count_slru_page_zeroed(int slru_idx);
extern void pgstat_count_slru_page_hit(int slru_idx);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 2819282181..976cf8b116 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -440,4 +440,8 @@ extern void assign_search_path(const char *newval, void *extra);
extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
+/* in access/transam/xlogprefetch.c */
+extern void assign_max_recovery_prefetch_distance(int new_value, void *extra);
+extern void assign_recovery_prefetch_fpw(bool new_value, void *extra);
+
#endif /* GUC_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b813e32215..74dd8c604c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1857,6 +1857,17 @@ pg_stat_gssapi| SELECT s.pid,
s.gss_enc AS encrypted
FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid)
WHERE (s.client_port IS NOT NULL);
+pg_stat_prefetch_recovery| SELECT s.stats_reset,
+ s.prefetch,
+ s.skip_hit,
+ s.skip_new,
+ s.skip_fpw,
+ s.skip_seq,
+ s.distance,
+ s.queue_depth,
+ s.avg_distance,
+ s.avg_queue_depth
+ FROM pg_stat_get_prefetch_recovery() s(stats_reset, prefetch, skip_hit, skip_new, skip_fpw, skip_seq, distance, queue_depth, avg_distance, avg_queue_depth);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
--
2.20.1
Thomas Munro wrote:
@@ -1094,8 +1103,16 @@ WALRead(XLogReaderState *state,
XLByteToSeg(recptr, nextSegNo, state->segcxt.ws_segsize);
	state->routine.segment_open(state, nextSegNo, &tli);
-	/* This shouldn't happen -- indicates a bug in segment_open */
-	Assert(state->seg.ws_file >= 0);
+	/* callback reported that there was no such file */
+	if (state->seg.ws_file < 0)
+	{
+		errinfo->wre_errno = errno;
+		errinfo->wre_req = 0;
+		errinfo->wre_read = 0;
+		errinfo->wre_off = startoff;
+		errinfo->wre_seg = state->seg;
+		return false;
+	}
Ah, this is what Michael was saying ... we need to fix WALRead so that
it doesn't depend on segment_open always returning a good FD. This needs
a fix everywhere, not just here, and the error reporting interface needs
improving too. Maybe it does make sense to get this fixed in pg13 and
avoid a break later.
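To make that concrete, here is a minimal caller-side sketch (not part of
the patch; the helper name and messages are made up) of how the
WALReadError fields filled in above could be turned into a proper report
once WALRead() stops asserting:

    #include "postgres.h"
    #include "access/xlogreader.h"

    /*
     * Hypothetical helper: read WAL into "buf", and if WALRead() fails,
     * report the error using the information it now returns.  wre_req == 0
     * is the new "segment_open could not provide a file" case.
     */
    static void
    read_wal_or_error(XLogReaderState *state, char *buf,
                      XLogRecPtr startptr, Size count, TimeLineID tli)
    {
        WALReadError errinfo;

        if (WALRead(state, buf, startptr, count, tli, &errinfo))
            return;

        errno = errinfo.wre_errno;
        if (errinfo.wre_req == 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not open WAL segment for %X/%X: %m",
                            (uint32) (startptr >> 32), (uint32) startptr)));
        else
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not read WAL at offset %d: read %d of %d: %m",
                            errinfo.wre_off, errinfo.wre_read,
                            errinfo.wre_req)));
    }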
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
I've spent some time testing this, mostly from the performance point of
view. I've done a very simple thing, in order to have a reproducible test:
1) I've initialized pgbench with scale 8000 (so ~120GB on a machine with
only 64GB of RAM)
2) created a physical backup, enabled WAL archiving
3) did a 1h pgbench run with 32 clients
4) disabled full-page writes and did another 1h pgbench run
Once I had this, I did a recovery using the physical backup and WAL
archive, measuring how long it took to apply each WAL segment. First
without any prefetching (current master), then twice with prefetching:
first with the default values (m_io_c=10, distance=256kB), and then with
higher values (100 + 2MB).
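To spell out the abbreviations (assuming "m_io_c" here stands for
maintenance_io_concurrency and "distance" for this patch's
max_recovery_prefetch_distance GUC), the more aggressive "prefetch2"
configuration would correspond to something like this in postgresql.conf:

    maintenance_io_concurrency = 100          # default is 10
    max_recovery_prefetch_distance = 2MB      # patch default is 256kB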
I did this on the two storage systems available in that machine - an NVME
SSD and a SATA RAID (3 x 7.2k drives). So, a fast one and a slow one.
1) NVME
On the NVME, this generates ~26k WAL segments (~400GB), and each of the
pgbench runs generates ~120M transactions (~33k tps). Of course, the vast
majority of the WAL segments (~16k) comes from the first run, because
there are a lot of FPIs due to the random nature of the workload.
I did not expect a significant improvement from the prefetching, as
the NVME is pretty good at handling random I/O. The total durations look
like this:
no prefetch prefetch prefetch2
10618 10385 9403
So the default is a tiny bit faster, and the more aggressive config
makes it about 10% faster. Not bad, considering the expectations.
Attached is a chart comparing the three runs. There are three clearly
visible parts - first the 1h run with f_p_w=on, with two checkpoints.
That's the first ~16k segments. Then there's a bit of a gap before the
second pgbench run was started - I think it's mostly autovacuum etc. And
then at segment ~23k the second pgbench (f_p_w=off) starts.
I think this shows the prefetching starts to help as the number of FPIs
decreases. It's subtle, but it's there.
2) SATA
On SATA it's just ~550 segments (~8.5GB), and the pgbench runs generate
only about 1M transactions. Again, the vast majority of the segments comes
from the first run, due to FPIs.
In this case, I don't have complete results, but after processing 542
segments (out of the ~550) it looks like this:
no prefetch prefetch prefetch2
6644 6635 8282
So the no prefetch and "default" prefetch are roughly on par, but the
"aggressive" prefetch is way slower. I'll get back to this shortly, but
I'd like to point out this is entirely due to the "no FPI" pgbench,
because after the initial ~525 segments it looks like this:
no prefetch prefetch prefetch2
58 65 57
So it goes through the initial segments (with plenty of FPIs) very fast,
and then we get to the "no FPI" segments, where the prefetching either
does not help or makes things slower.
Looking at how long it takes to apply the last few segments, it looks
like this:
no prefetch prefetch prefetch2
280 298 478
which is not particularly great, I guess. There does, however, seem to be
something wrong, because with prefetching I see this in the log:
prefetch:
2020-06-05 02:47:25.970 CEST 1591318045.970 [22961] LOG: recovery no
longer prefetching: unexpected pageaddr 108/E8000000 in log segment
0000000100000108000000FF, offset 0
prefetch2:
2020-06-05 15:29:23.895 CEST 1591363763.895 [26676] LOG: recovery no
longer prefetching: unexpected pageaddr 108/E8000000 in log segment
000000010000010900000001, offset 0
Which seems pretty suspicious, but I have no idea what's wrong. I admit
the archive/restore commands are a bit hacky, but I've only seen this
with prefetching on the SATA storage, while all other cases seem to be
just fine. I haven't seen it on NVME (which processes much more WAL).
And the SATA baseline (no prefetching) also worked fine.
Moreover, the pageaddr value is the same in both cases, but the WAL
segments are different (but just one segment apart). Seems strange.
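For reference, the stats view added by the patch can be queried on the
recovering server (if connections are allowed, e.g. hot_standby) to see
what the prefetcher thinks it is doing; a hypothetical check, which I did
not capture for these runs:

    -- counters and current read-ahead state from this patch's view
    SELECT prefetch, skip_hit, skip_new, skip_fpw, skip_seq,
           distance, queue_depth
      FROM pg_stat_prefetch_recovery;

A distance and queue_depth that stop advancing after those log messages
would confirm that the prefetcher has given up.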
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
nvme-prefetch.png (image/png) - chart comparing the three NVME recovery runs