WAL prefetch
There was a very interesting presentation at PGCon about pg_prefaulter:
http://www.pgcon.org/2018/schedule/events/1204.en.html
But it is implemented in Go and uses pg_waldump.
I tried to do the same, but using the built-in Postgres WAL traversal functions.
I have implemented it as an extension for simplicity of integration.
In principle it could be started as a background worker.
First of all I tried to estimate the effect of preloading data.
I have implemented a prefetch utility which is also attached to this mail.
It performs random reads of blocks of some large file and spawns some
number of prefetch threads (a sketch of the fadvise mode follows the command list below):
Just normal read without prefetch:
./prefetch -n 0 SOME_BIG_FILE
One prefetch thread which uses pread:
./prefetch SOME_BIG_FILE
One prefetch thread which uses posix_fadvise:
./prefetch -f SOME_BIG_FILE
4 prefetch threads which use posix_fadvise:
./prefetch -f -n 4 SOME_BIG_FILE
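To give an idea of what the utility does, here is a minimal sketch of the
posix_fadvise-based mode, assuming the attached utility works roughly like
this (the block size, iteration count and overall structure are illustrative,
not the attached code):

/* Sketch: issue posix_fadvise(WILLNEED) hints for random blocks of a file,
 * roughly what the attached prefetch utility's -f mode does.  The actual
 * reads are then performed by a separate reader thread. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLOCK_SIZE 8192

int main(int argc, char **argv)
{
    struct stat st;
    int fd = open(argv[1], O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0)
    {
        perror(argv[1]);
        return 1;
    }
    off_t nblocks = st.st_size / BLOCK_SIZE;

    for (long i = 0; i < 1000000; i++)
    {
        off_t blkno = random() % nblocks;
        /* Ask the kernel to read this block into the page cache ahead of
         * the reader thread's pread() of the same block. */
        posix_fadvise(fd, blkno * BLOCK_SIZE, BLOCK_SIZE, POSIX_FADV_WILLNEED);
    }
    close(fd);
    return 0;
}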
Based on these experiments (on my desktop), I made the following conclusions:
1. Prefetching on HDD doesn't give any positive effect.
2. Using posix_fadvise speeds up random reads on SSD by up to 2 times.
3. posix_fadvise(WILLNEED) is more efficient than performing normal reads.
4. Calling posix_fadvise from more than one thread makes no sense.
I have tested wal_prefetch on two powerful servers with 24 cores, a 3TB
NVMe RAID 10 storage device and 256GB of RAM, connected using InfiniBand.
The speed of synchronous replication between the two nodes increased from
56k TPS to 60k TPS (on pgbench with scale 1000).
Usage:
1. At master: create extension wal_prefetch
2. At replica: call the pg_wal_prefetch() function; it will not return until
you interrupt it.
The pg_wal_prefetch function will traverse WAL indefinitely and prefetch the
blocks referenced in WAL records
using the posix_fadvise(WILLNEED) system call.
It is possible to explicitly specify a start LSN for pg_wal_prefetch();
otherwise, the WAL redo position will be used as the start LSN.
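For readers who want to see the shape of the loop, here is a simplified
sketch of what pg_wal_prefetch() does, assuming the xlogreader API as of
PostgreSQL 11; the reader setup, error handling and the mapping from a block
reference to a file descriptor and offset (wal_prefetch_open_block below) are
omitted or hypothetical, so treat this as an illustration rather than the
extension's actual code:

/* Sketch only: walk WAL records and hint the kernel about the data blocks
 * they reference.  'reader' is an already initialized XLogReaderState. */
XLogRecord *record;
char       *errmsg;

while ((record = XLogReadRecord(reader, start_lsn, &errmsg)) != NULL)
{
    for (int block_id = 0; block_id <= reader->max_block_id; block_id++)
    {
        RelFileNode rnode;
        ForkNumber  forknum;
        BlockNumber blkno;
        int         fd;
        off_t       offset;

        if (!XLogRecGetBlockTag(reader, block_id, &rnode, &forknum, &blkno))
            continue;

        /* Hypothetical helper: open the relation segment containing blkno
         * and compute the block's offset within it. */
        if (wal_prefetch_open_block(&rnode, forknum, blkno, &fd, &offset))
            posix_fadvise(fd, offset, BLCKSZ, POSIX_FADV_WILLNEED);
    }
    start_lsn = InvalidXLogRecPtr;  /* after the first record, follow the chain */
}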
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Jun 13, 2018 at 6:39 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
There was very interesting presentation at pgconf about pg_prefaulter:
http://www.pgcon.org/2018/schedule/events/1204.en.html
But it is implemented in GO and using pg_waldump.
I tried to do the same but using built-on Postgres WAL traverse functions.
I have implemented it as extension for simplicity of integration.
In principle it can be started as BG worker.
Right, or in other words, it could do something like autoprewarm [1],
which can allow a more user-friendly interface for this utility if we
decide to include it.
First of all I tried to estimate effect of preloading data.
I have implemented prefetch utility with is also attached to this mail.
It performs random reads of blocks of some large file and spawns some number
of prefetch threads:
Just normal read without prefetch:
./prefetch -n 0 SOME_BIG_FILE
One prefetch thread which uses pread:
./prefetch SOME_BIG_FILE
One prefetch thread which uses posix_fadvise:
./prefetch -f SOME_BIG_FILE
4 prefetch thread which uses posix_fadvise:
./prefetch -f -n 4 SOME_BIG_FILE
Based on this experiments (on my desktop), I made the following conclusions:
1. Prefetch at HDD doesn't give any positive effect.
2. Using posix_fadvise allows to speed-up random read speed at SSD up to 2
times.
3. posix_fadvise(WILLNEED) is more efficient than performing normal reads.
4. Calling posix_fadvise in more than one thread has no sense.
I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from 56k
TPS to 60k TPS (on pgbench with scale 1000).
That's a reasonable improvement.
Usage:
1. At master: create extension wal_prefetch
2. At replica: Call pg_wal_prefetch() function: it will not return until you
interrupt it.
I think it is not a very user-friendly interface, but the idea sounds
good to me, and it can help some other workloads. I think this can help
in recovery as well.
[1]: https://www.postgresql.org/docs/devel/static/pgprewarm.html
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.
Hi Konstantin,
Why stop at the page cache... what about shared buffers?
--
Thomas Munro
http://www.enterprisedb.com
On 14.06.2018 09:52, Thomas Munro wrote:
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.
Hi Konstantin,
Why stop at the page cache... what about shared buffers?
It is a good question. I thought a lot about prefetching directly into
shared buffers.
But the current reality with Postgres is that allocating too much
memory for shared buffers is not recommended,
for many different reasons: degradation of the clock replacement
algorithm, "write storms", ...
If your system has 1TB of memory, almost no PostgreSQL
administrator will recommend using all of that 1TB for shared buffers.
Moreover, there are recommendations to choose the shared buffers size based
on the size of the internal cache of the persistent storage device
(so that it is possible to flush changes without doing writes to
physical media). So on such a system with 1TB of RAM, the size of shared
buffers will most likely be set to a few hundred gigabytes.
Also, PostgreSQL does not currently support dynamically changing the shared
buffers size. Without that, the only way of using Postgres in clouds and
other multiuser systems, where the system load is not fully controlled by the
user, is to choose a relatively small shared buffers size and rely on OS
caching.
Yes, access to a shared buffer is about two times faster than reading data
from the file system cache.
But that is still better than the situation where shared buffers are swapped out
and the effect of large shared buffers becomes negative.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Jun 13, 2018 at 11:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from 56k
TPS to 60k TPS (on pgbench with scale 1000).
That's a reasonable improvement.
Somehow I would have expected more. That's only a 7% speedup.
I am also surprised that HDD didn't show any improvement. Since HDDs
are bad at random I/O, I would have expected prefetching to help more
in that case.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 14, 2018 at 6:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jun 13, 2018 at 11:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from 56k
TPS to 60k TPS (on pgbench with scale 1000).
That's a reasonable improvement.
Somehow I would have expected more. That's only a 7% speedup.
It might be that the overhead of synchronous replication is already so
large that it hides a bigger speedup. We
might want to try recovery (PITR) or maybe async replication to see if
we get any better numbers.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 14.06.2018 15:44, Robert Haas wrote:
On Wed, Jun 13, 2018 at 11:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from 56k
TPS to 60k TPS (on pgbench with scale 1000).
That's a reasonable improvement.
Somehow I would have expected more. That's only a 7% speedup.
I am also surprised that HDD didn't show any improvement.
Maybe pgbench is not the best use case for prefetch. It updates more
or less random pages, and if the database is large enough and
full_page_writes is true (the default value),
then most pages will be updated only once since the last checkpoint, and most
of the updates will be represented in WAL by full-page records.
And such records do not require reading any data from disk.
Since HDD's
are bad at random I/O, I would have expected prefetching to help more
in that case.
The speed of random HDD access is limited by the speed of disk head movement.
By running several IO requests in parallel we just increase the probability
of head movement, so parallel access to an HDD may actually decrease
IO speed rather than increase it.
In theory, given several concurrent IO requests, the driver can execute them
in an optimal order, trying to minimize head movement. But if the accesses
really are to random pages,
then the probability that we can win something by such optimization is very
small.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Jun 14, 2018 at 9:23 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
Speed of random HDD access is limited by speed of disk head movement.
By running several IO requests in parallel we just increase probability of
head movement, so actually parallel access to HDD may even decrease IO speed
rather than increase it.
In theory, given several concurrent IO requests, driver can execute them in
optimal order, trying to minimize head movement. But if there are really
access to random pages,
then probability that we can win something by such optimization is very
small.
You might be right, but I feel like I've heard previous reports of
significant speedups from prefetching on HDDs. Perhaps I am
mis-remembering.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 14.06.2018 16:25, Robert Haas wrote:
On Thu, Jun 14, 2018 at 9:23 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
Speed of random HDD access is limited by speed of disk head movement.
By running several IO requests in parallel we just increase probability of
head movement, so actually parallel access to HDD may even decrease IO speed
rather than increase it.
In theory, given several concurrent IO requests, driver can execute them in
optimal order, trying to minimize head movement. But if there are really
access to random pages,
then probability that we can win something by such optimization is very
small.
You might be right, but I feel like I've heard previous reports of
significant speedups from prefetching on HDDs. Perhaps I am
mis-remembering.
It is true for RAIDs of HDDs, which can really win by issuing parallel IO
operations.
But there are so many different factors that I will not be surprised
by any result :)
The last problem I observed with an NVMe device at one of our
customers' systems was huge performance degradation (> 10 times: from
500Mb/sec to 50Mb/sec write speed)
after space exhaustion at the device. There is a 3Tb NVMe RAID device with
a 1.5Gb database. ext4 was mounted without the "discard" option.
After incorrect execution of rsync, space was exhausted. Then I removed
all data and copied the database from the master node.
Then I observed huge lags in asynchronous replication between master and
replica: the wal_receiver was saving received data too slowly, with a write
speed of about ~50Mb/sec vs. 0.5Gb/sec at the master.
All my attempts to use fstrim or e4defrag didn't help. The problem was
solved only after deleting all database files, performing fstrim and
copying the database once again.
After that, the wal_sender was writing data at the normal speed of ~0.5Gb/sec
and there was no lag between master and replica.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Greetings,
* Konstantin Knizhnik (k.knizhnik@postgrespro.ru) wrote:
There was very interesting presentation at pgconf about pg_prefaulter:
I agree and I've chatted a bit w/ Sean further about it.
But it is implemented in GO and using pg_waldump.
Yeah, that's not too good if we want it in core.
I tried to do the same but using built-on Postgres WAL traverse functions.
I have implemented it as extension for simplicity of integration.
In principle it can be started as BG worker.
I don't think this needs to be, or should be, an extension.. If this is
worthwhile (and it certainly appears to be) then we should just do it in
core.
First of all I tried to estimate effect of preloading data.
I have implemented prefetch utility with is also attached to this mail.
It performs random reads of blocks of some large file and spawns some number
of prefetch threads:
Just normal read without prefetch:
./prefetch -n 0 SOME_BIG_FILE
One prefetch thread which uses pread:
./prefetch SOME_BIG_FILE
One prefetch thread which uses posix_fadvise:
./prefetch -f SOME_BIG_FILE
4 prefetch thread which uses posix_fadvise:
./prefetch -f -n 4 SOME_BIG_FILE
Based on this experiments (on my desktop), I made the following conclusions:
1. Prefetch at HDD doesn't give any positive effect.
2. Using posix_fadvise allows to speed-up random read speed at SSD up to 2
times.
3. posix_fadvise(WILLNEED) is more efficient than performing normal reads.
4. Calling posix_fadvise in more than one thread has no sense.
Ok.
I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from 56k
TPS to 60k TPS (on pgbench with scale 1000).
I'm also surprised that it wasn't a larger improvement.
Seems like it would make sense to implement in core using
posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
or nearby.. At least, that's the thinking I had when I was chatting w/
Sean.
Thanks!
Stephen
On Fri, Jun 15, 2018 at 12:16 AM, Stephen Frost <sfrost@snowman.net> wrote:
I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from 56k
TPS to 60k TPS (on pgbench with scale 1000).
I'm also surprised that it wasn't a larger improvement.
Seems like it would make sense to implement in core using
posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
or nearby.. At least, that's the thinking I had when I was chatting w/
Sean.
Doing it in-core certainly has some advantages, such as being able to easily
reuse the existing xlog code rather than making a copy as is currently
done in the patch, but I think it also depends on whether this is
really a win in a number of common cases or just a win in some
limited cases.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 15.06.2018 07:36, Amit Kapila wrote:
On Fri, Jun 15, 2018 at 12:16 AM, Stephen Frost <sfrost@snowman.net> wrote:
I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from 56k
TPS to 60k TPS (on pgbench with scale 1000).
I'm also surprised that it wasn't a larger improvement.
Seems like it would make sense to implement in core using
posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
or nearby.. At least, that's the thinking I had when I was chatting w/
Sean.
Doing in-core certainly has some advantage such as it can easily reuse
the existing xlog code rather trying to make a copy as is currently
done in the patch, but I think it also depends on whether this is
really a win in a number of common cases or is it just a win in some
limited cases.
I am completely agree. It was my mail concern: on which use cases this
prefetch will be efficient.
If "full_page_writes" is on (and it is safe and default value), then
first update of a page since last checkpoint will be written in WAL as
full page and applying it will not require reading any data from disk.
If this pages is updated multiple times in subsequent transactions, then
most likely it will be still present in OS file cache, unless checkpoint
interval exceeds OS cache size (amount of free memory in the system). So
if this conditions are satisfied then looks like prefetch is not needed.
And it seems to be true for most real configurations: checkpoint
interval is rarely set larger than hundred of gigabytes and modern
servers usually have more RAM.
But once this condition is not satisfied and the lag is larger than the size of
the OS cache, prefetch can become inefficient, because prefetched pages
may be thrown away from the OS cache before they are actually accessed by the
redo process. In this case extra synchronization between the prefetch and
replay processes is needed, so that prefetch does not move too far away
from the replayed LSN.
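To illustrate the kind of synchronization I mean, the prefetcher could
throttle itself against the replay position roughly like the sketch below;
GetXLogReplayRecPtr() is the existing function reporting recovery progress,
while max_prefetch_distance and the sleep interval are hypothetical knobs,
so this is only a sketch of the idea:

/* Sketch: keep the prefetcher within a bounded distance of the replay LSN.
 * prefetch_lsn is the LSN of the record we are about to prefetch for. */
static void
throttle_prefetch(XLogRecPtr prefetch_lsn, uint64 max_prefetch_distance)
{
    for (;;)
    {
        /* LSN up to which the startup process has already replayed WAL */
        XLogRecPtr replayed = GetXLogReplayRecPtr(NULL);

        if (prefetch_lsn - replayed <= max_prefetch_distance)
            break;

        /* Too far ahead: let replay catch up before prefetching more. */
        pg_usleep(10000L);      /* 10 ms */
    }
}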
It is not a problem to integrate this code into the Postgres core and run it
in a background worker. I do not think that performing prefetch in the WAL
receiver process itself is a good idea: it may slow down the speed of
receiving changes from the master. And in that case I really could throw away
the cut&pasted code. But it is easier to experiment with an extension rather
than with a patch to the Postgres core.
I have published this extension to make it possible to perform
experiments and check whether it is useful on real workloads.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Jun 15, 2018 at 1:08 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
On 15.06.2018 07:36, Amit Kapila wrote:
On Fri, Jun 15, 2018 at 12:16 AM, Stephen Frost <sfrost@snowman.net>
wrote:
I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb
NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from
56k
TPS to 60k TPS (on pgbench with scale 1000).
I'm also surprised that it wasn't a larger improvement.
Seems like it would make sense to implement in core using
posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
or nearby.. At least, that's the thinking I had when I was chatting w/
Sean.
Doing in-core certainly has some advantage such as it can easily reuse
the existing xlog code rather trying to make a copy as is currently
done in the patch, but I think it also depends on whether this is
really a win in a number of common cases or is it just a win in some
limited cases.
I am completely agree. It was my mail concern: on which use cases this
prefetch will be efficient.
If "full_page_writes" is on (and it is safe and default value), then first
update of a page since last checkpoint will be written in WAL as full page
and applying it will not require reading any data from disk.
What exactly do you mean by the above? AFAIU, it needs to read WAL to apply a
full page image. See the code below:
XLogReadBufferForRedoExtended()
{
..
    /* If it has a full-page image and it should be restored, do it. */
    if (XLogRecBlockImageApply(record, block_id))
    {
        Assert(XLogRecHasBlockImage(record, block_id));
        *buf = XLogReadBufferExtended(rnode, forknum, blkno,
                                      get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
        page = BufferGetPage(*buf);
        if (!RestoreBlockImage(record, block_id, page))
..
}
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 15.06.2018 18:03, Amit Kapila wrote:
On Fri, Jun 15, 2018 at 1:08 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
On 15.06.2018 07:36, Amit Kapila wrote:
On Fri, Jun 15, 2018 at 12:16 AM, Stephen Frost <sfrost@snowman.net>
wrote:
I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb
NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from
56k
TPS to 60k TPS (on pgbench with scale 1000).
I'm also surprised that it wasn't a larger improvement.
Seems like it would make sense to implement in core using
posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
or nearby.. At least, that's the thinking I had when I was chatting w/
Sean.
Doing in-core certainly has some advantage such as it can easily reuse
the existing xlog code rather trying to make a copy as is currently
done in the patch, but I think it also depends on whether this is
really a win in a number of common cases or is it just a win in some
limited cases.
I am completely agree. It was my mail concern: on which use cases this
prefetch will be efficient.
If "full_page_writes" is on (and it is safe and default value), then first
update of a page since last checkpoint will be written in WAL as full page
and applying it will not require reading any data from disk.
What exactly you mean by above? AFAIU, it needs to read WAL to apply
full page image. See below code:
XLogReadBufferForRedoExtended()
{
..
/* If it has a full-page image and it should be restored, do it. */
if (XLogRecBlockImageApply(record, block_id))
{
Assert(XLogRecHasBlockImage(record, block_id));
*buf = XLogReadBufferExtended(rnode, forknum, blkno,
get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
page = BufferGetPage(*buf);
if (!RestoreBlockImage(record, block_id, page))
..
}
Sorry for my confusing statement.
Definitely we need to read the page from WAL.
I mean that in the case of a "full page write" we do not need to read the updated
page from the database.
It can simply be overwritten.
pg_prefaulter and my wal_prefetch do not prefetch the WAL pages themselves.
There is no sense in doing that, because they have just been written by
the wal_receiver and so should be present in the file system cache.
wal_prefetch prefetches the blocks referenced by WAL records. But in the
case of "full page writes" such prefetch is not needed and is even harmful.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
On 14.06.2018 09:52, Thomas Munro wrote:
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.
Hi Konstantin,
Why stop at the page cache... what about shared buffers?
It is good question. I thought a lot about prefetching directly to shared
buffers.
I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.
But the current c'est la vie with Postgres is that allocating too large
memory for shared buffers is not recommended.
Due to many different reasons: degradation of clock replacement algorithm,
"write storm",...
I think a lot of that fear is overplayed. And we've fixed a number of
issues. We don't really generate write storms in the default config
anymore in most scenarios, and if it's an issue you can turn on
backend_flush_after.
If your system has 1Tb of memory, almost none of Postgresql administrators
will recommend to use all this 1Tb for shared buffers.
I've used 1TB successfully.
Also PostgreSQL is not currently supporting dynamic changing of shared
buffers size. Without it, the only way of using Postgres in clouds and
another multiuser systems where system load is not fully controlled by user
is to choose relatively small shared buffer size and rely on OS caching.
That seems largely unrelated to the replay case, because there the data
will be read into shared buffers anyway. And it'll be dirtied therein.
Greetings,
Andres Freund
On Fri, Jun 15, 2018 at 8:45 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
On 15.06.2018 18:03, Amit Kapila wrote:
wal_prefetch is prefetching blocks referenced by WAL records. But in case of
"full page writes" such prefetch is not needed and even is harmful.
Okay, IIUC, the basic idea is to prefetch recently modified data
pages, so that they can be referenced. If so, isn't there some
overlap with autoprewarm functionality which dumps recently modified
blocks and then on recovery, it can prefetch those?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jun 15, 2018 at 11:31 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
On 14.06.2018 09:52, Thomas Munro wrote:
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.
Hi Konstantin,
Why stop at the page cache... what about shared buffers?
It is good question. I thought a lot about prefetching directly to shared
buffers.
I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.
We can think of supporting two modes: (a) read into shared
buffers, or (b) read into the OS page cache.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 16.06.2018 06:30, Amit Kapila wrote:
On Fri, Jun 15, 2018 at 8:45 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
On 15.06.2018 18:03, Amit Kapila wrote:
wal_prefetch is prefetching blocks referenced by WAL records. But in case of
"full page writes" such prefetch is not needed and even is harmful.Okay, IIUC, the basic idea is to prefetch recently modified data
pages, so that they can be referenced. If so, isn't there some
overlap with autoprewarm functionality which dumps recently modified
blocks and then on recovery, it can prefetch those?
Sorry, I do not see any intersection with the autoprewarm functionality:
WAL prefetch is performed at the replica, where the data has not yet been modified;
actually the goal of WAL prefetch is to make this update more efficient.
WAL prefetch could also be done at a standalone server to speed up recovery
after a crash. But that seems to be a much more exotic use case.
On 16.06.2018 06:33, Amit Kapila wrote:
On Fri, Jun 15, 2018 at 11:31 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
On 14.06.2018 09:52, Thomas Munro wrote:
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.
Hi Konstantin,
Why stop at the page cache... what about shared buffers?
It is good question. I thought a lot about prefetching directly to shared
buffers.
I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.
We can think of supporting two modes (a) allows to read into shared
buffers or (b) allows to read into OS page cache.
Unfortunately I am afraid that (a) requires a different approach: unlike
posix_fadvise, reading data into a shared buffer is a blocking operation. If
we do it with one worker, then it will read data at the same speed as the redo
process. So to make prefetch really efficient, in this case we have to
spawn multiple workers performing the prefetch in parallel (as pg_prefaulter
does).
Another concern of mine about prefetching into shared buffers is that it may
flush away from the cache the pages which are most frequently used by read-only
queries at a hot standby replica.
On Sat, Jun 16, 2018 at 10:47 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
On 16.06.2018 06:33, Amit Kapila wrote:
On Fri, Jun 15, 2018 at 11:31 PM, Andres Freund <andres@anarazel.de>
wrote:
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
On 14.06.2018 09:52, Thomas Munro wrote:
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.
Hi Konstantin,
Why stop at the page cache... what about shared buffers?
It is good question. I thought a lot about prefetching directly to shared
buffers.
I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.
We can think of supporting two modes (a) allows to read into shared
buffers or (b) allows to read into OS page cache.
Unfortunately I afraid that a) requires different approach: unlike
posix_fadvise, reading data to shared buffer is blocking operation. If we
do it by one worker, then it will read it with the same speed as redo
process. So to make prefetch really efficient, in this case we have to
spawn multiple workers to perform prefetch in parallel (as pg_prefaulter
does).
Another my concern against prefetching to shared buffers is that it may
flush away from cache pages which are most frequently used by read only
queries at hot standby replica.
Okay, but I am suggesting to make it optional so that it can be
enabled when helpful (say when the user has enough shared buffers to
hold the data).
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 06/15/2018 08:01 PM, Andres Freund wrote:
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
On 14.06.2018 09:52, Thomas Munro wrote:
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.
Hi Konstantin,
Why stop at the page cache... what about shared buffers?
It is good question. I thought a lot about prefetching directly to shared
buffers.
I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.
Could you elaborate why prefetching into s_b is so much better (I'm sure
it has advantages, but I suppose prefetching into page cache would be
much easier to implement).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jun 16, 2018 at 9:38 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
On 06/15/2018 08:01 PM, Andres Freund wrote:
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
On 14.06.2018 09:52, Thomas Munro wrote:
Why stop at the page cache... what about shared buffers?
It is good question. I thought a lot about prefetching directly to shared
buffers.
I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.
Could you elaborate why prefetching into s_b is so much better (I'm sure it has advantages, but I suppose prefetching into page cache would be much easier to implement).
posix_fadvise(POSIX_FADV_WILLNEED) might already get most of the
speed-up available here in the short term for this immediate
application, but in the long term a shared buffers prefetch system is
one of the components we'll need to support direct IO.
--
Thomas Munro
http://www.enterprisedb.com
On 06/16/2018 12:06 PM, Thomas Munro wrote:
On Sat, Jun 16, 2018 at 9:38 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
On 06/15/2018 08:01 PM, Andres Freund wrote:
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
On 14.06.2018 09:52, Thomas Munro wrote:
Why stop at the page cache... what about shared buffers?
It is good question. I thought a lot about prefetching directly to shared
buffers.
I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.
Could you elaborate why prefetching into s_b is so much better (I'm sure it has advantages, but I suppose prefetching into page cache would be much easier to implement).
posix_fadvise(POSIX_FADV_WILLNEED) might already get most of the
speed-up available here in the short term for this immediate
application, but in the long term a shared buffers prefetch system is
one of the components we'll need to support direct IO.
Sure. Assuming the switch to direct I/O will happen (it probably will,
sooner or later), my question is whether this patch should be required
to introduce the prefetching into s_b. Or should we use posix_fadvise
for now, get most of the benefit, and leave the prefetch into s_b as an
improvement for later?
The thing is - we're already doing posix_fadvise prefetching in bitmap
heap scans, so it would not be putting additional burden on the direct I/O
patch (hypothetical, so far).
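For reference, the bitmap heap scan prefetching goes through PrefetchBuffer(),
which skips blocks already in shared buffers and otherwise ends up in
posix_fadvise(POSIX_FADV_WILLNEED); a minimal sketch of that call pattern
(the wrapper function name is illustrative):

/* Sketch: how executor code issues an asynchronous read hint today.
 * PrefetchBuffer() looks the block up in shared buffers and, on a miss,
 * calls smgrprefetch(), which boils down to posix_fadvise(WILLNEED). */
#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

static void
hint_upcoming_block(Relation rel, BlockNumber blkno)
{
    PrefetchBuffer(rel, MAIN_FORKNUM, blkno);
}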
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings,
* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
On 06/16/2018 12:06 PM, Thomas Munro wrote:
On Sat, Jun 16, 2018 at 9:38 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
On 06/15/2018 08:01 PM, Andres Freund wrote:
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
On 14.06.2018 09:52, Thomas Munro wrote:
Why stop at the page cache... what about shared buffers?
It is good question. I thought a lot about prefetching directly to shared
buffers.
I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.
Could you elaborate why prefetching into s_b is so much better (I'm sure it has advantages, but I suppose prefetching into page cache would be much easier to implement).
posix_fadvise(POSIX_FADV_WILLNEED) might already get most of the
speed-up available here in the short term for this immediate
application, but in the long term a shared buffers prefetch system is
one of the components we'll need to support direct IO.
Sure. Assuming the switch to direct I/O will happen (it probably will,
sooner or later), my question is whether this patch should be required to
introduce the prefetching into s_b. Or should we use posix_fadvise for now,
get most of the benefit, and leave the prefetch into s_b as an improvement
for later?
The thing is - we're already doing posix_fadvise prefetching in bitmap heap
scans, it would not be putting additional burden on the direct I/O patch
(hypothetical, so far).
This was my take on it also. Prefetching is something we've come to
accept in other parts of the system and if it's beneficial to add it
here then we should certainly do so and it seems like it'd keep the
patch nice and simple and small.
Thanks!
Stephen
On 2018-06-16 11:38:59 +0200, Tomas Vondra wrote:
On 06/15/2018 08:01 PM, Andres Freund wrote:
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
On 14.06.2018 09:52, Thomas Munro wrote:
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.
Hi Konstantin,
Why stop at the page cache... what about shared buffers?
It is good question. I thought a lot about prefetching directly to shared
buffers.
I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.
Could you elaborate why prefetching into s_b is so much better (I'm sure it
has advantages, but I suppose prefetching into page cache would be much
easier to implement).
I think there's a number of issues with just issuing prefetch requests
via fadvise etc:
- it leads to guaranteed double buffering, in a way that's just about
guaranteed to *never* be useful. Because we'd only prefetch whenever
there's an upcoming write, there's simply no benefit in the page
staying in the page cache - we'll write out the whole page back to the
OS.
- reading from the page cache is far from free - so you add costs to the
replay process that it doesn't need to do.
- you don't have any sort of completion notification, so you basically
just have to guess how far ahead you want to read. If you read a bit
too much you suddenly get into synchronous blocking land.
- The OS page cache is actually not particularly scalable to large amounts of
data either. Nor are the decisions about what to keep cached likely to be
particularly useful.
- We imo need to add support for direct IO before long, and adding more
and more work to reach feature parity strikes me as a bad move.
Greetings,
Andres Freund
Hi,
On 2018-06-13 16:09:45 +0300, Konstantin Knizhnik wrote:
Usage:
1. At master: create extension wal_prefetch
2. At replica: Call pg_wal_prefetch() function: it will not return until you
interrupt it.
FWIW, I think the proper design would rather be a background worker that
does this work. Forcing the user to somehow coordinate starting a
permanently running script whenever the database restarts isn't
great. There's also some issues around snapshots preventing vacuum
(which could be solved, but not nicely).
Greetings,
Andres Freund
On 06/16/2018 09:02 PM, Andres Freund wrote:
On 2018-06-16 11:38:59 +0200, Tomas Vondra wrote:
On 06/15/2018 08:01 PM, Andres Freund wrote:
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
On 14.06.2018 09:52, Thomas Munro wrote:
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.
Hi Konstantin,
Why stop at the page cache... what about shared buffers?
It is good question. I thought a lot about prefetching directly to shared
buffers.
I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.
Could you elaborate why prefetching into s_b is so much better (I'm sure it
has advantages, but I suppose prefetching into page cache would be much
easier to implement).
I think there's a number of issues with just issuing prefetch requests
via fadvise etc:
- it leads to guaranteed double buffering, in a way that's just about
guaranteed to *never* be useful. Because we'd only prefetch whenever
there's an upcoming write, there's simply no benefit in the page
staying in the page cache - we'll write out the whole page back to the
OS.
How does reading directly into shared buffers substantially change the
behavior? The only difference is that we end up with the double
buffering after performing the write, which is expected to happen pretty
quickly after the read request.
- reading from the page cache is far from free - so you add costs to the
replay process that it doesn't need to do.
- you don't have any sort of completion notification, so you basically
just have to guess how far ahead you want to read. If you read a bit
too much you suddenly get into synchronous blocking land.
- The OS page is actually not particularly scalable to large amounts of
data either. Nor are the decisions what to keep cached likley to be
particularly useful.
The posix_fadvise approach is not perfect, no doubt about that. But it
works pretty well for bitmap heap scans, and it's about 13249x better
(rough estimate) than the current solution (no prefetching).
- We imo need to add support for direct IO before long, and adding more
and more work to reach feature parity strikes meas a bad move.
IMHO it's unlikely to happen in PG12, but I might be over-estimating the
invasiveness and complexity of the direct I/O change. While this patch
seems pretty doable, and the improvements are pretty significant.
My point was that I don't think this actually adds a significant amount
of work to the direct IO patch, as we already do prefetch for bitmap
heap scans. So this needs to be written anyway, and I'd expect those two
places to share most of the code. So where's the additional work?
I don't think we should reject patches just because it might add a bit
of work to some not-yet-written future patch ... (which I however don't
think is this case).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2018-06-16 21:34:30 +0200, Tomas Vondra wrote:
- it leads to guaranteed double buffering, in a way that's just about
guaranteed to *never* be useful. Because we'd only prefetch whenever
there's an upcoming write, there's simply no benefit in the page
staying in the page cache - we'll write out the whole page back to the
OS.
How does reading directly into shared buffers substantially change the
behavior? The only difference is that we end up with the double
buffering after performing the write. Which is expected to happen pretty
quick after the read request.
Random reads directly as a response to a read() request can be cached
differently - and we trivially could force that with another fadvise() -
than posix_fadvise(WILLNEED). There's pretty much no other case - so
far - where we know as clearly that we won't re-read the page until
write as here.
- you don't have any sort of completion notification, so you basically
just have to guess how far ahead you want to read. If you read a bit
too much you suddenly get into synchronous blocking land.
- The OS page is actually not particularly scalable to large amounts of
data either. Nor are the decisions what to keep cached likley to be
particularly useful.
The posix_fadvise approach is not perfect, no doubt about that. But it
works pretty well for bitmap heap scans, and it's about 13249x better
(rough estimate) than the current solution (no prefetching).
Sure, but investing in an architecture we know might not live long also
has its cost. Especially if it's not that complicated to do better.
My point was that I don't think this actually adds a significant amount
of work to the direct IO patch, as we already do prefetch for bitmap
heap scans. So this needs to be written anyway, and I'd expect those two
places to share most of the code. So where's the additional work?
I think it's largely entirely separate from what we'd do for bitmap
index scans.
Greetings,
Andres Freund
On 16.06.2018 22:02, Andres Freund wrote:
On 2018-06-16 11:38:59 +0200, Tomas Vondra wrote:
On 06/15/2018 08:01 PM, Andres Freund wrote:
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
On 14.06.2018 09:52, Thomas Munro wrote:
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.
Hi Konstantin,
Why stop at the page cache... what about shared buffers?
It is good question. I thought a lot about prefetching directly to shared
buffers.
I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.
Could you elaborate why prefetching into s_b is so much better (I'm sure it
has advantages, but I suppose prefetching into page cache would be much
easier to implement).
I think there's a number of issues with just issuing prefetch requests
via fadvise etc:
- it leads to guaranteed double buffering, in a way that's just about
guaranteed to *never* be useful. Because we'd only prefetch whenever
there's an upcoming write, there's simply no benefit in the page
staying in the page cache - we'll write out the whole page back to the
OS.
Sorry, I do not completely understand this.
Prefetch is only needed for a partial update of a page - in this case we
need to first read the page from the disk
before being able to perform the update. So before "we'll write out the whole
page back to the OS" we have to read this page.
And if the page is in the OS cache (prefetched) then this can be done much faster.
Please notice that at the moment of prefetch there is no double
buffering. As long as the page has not been accessed before, it is not present in
shared buffers. And once the page is updated, there is really no need to
keep it in shared buffers. We can use a ring buffer (as in the case of
sequential scans or bulk updates) to prevent throwing away useful pages
from shared buffers by the redo process. So once again there will be no double
buffering.
- reading from the page cache is far from free - so you add costs to the
replay process that it doesn't need to do.
- you don't have any sort of completion notification, so you basically
just have to guess how far ahead you want to read. If you read a bit
too much you suddenly get into synchronous blocking land.
- The OS page is actually not particularly scalable to large amounts of
data either. Nor are the decisions what to keep cached likley to be
particularly useful.
- We imo need to add support for direct IO before long, and adding more
and more work to reach feature parity strikes meas a bad move.
I am not so familiar with the current implementation of the full page writes
mechanism in Postgres,
so maybe my idea explained below is stupid or already implemented (but
I failed to find any traces of this).
Prefetch is needed only for WAL records performing a partial update. A full
page write doesn't require prefetch.
A full page write has to be performed when the page is updated for the first time
after a checkpoint.
But what if we slightly extend this rule and perform a full page write also
when the distance from the previous full page write exceeds some delta
(which is somehow related to the size of the OS cache)?
In this case, even if the checkpoint interval is larger than the OS cache size,
we can still expect that updated pages are present in the OS cache.
And no WAL prefetch is needed at all!
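A rough sketch of that rule, purely to illustrate the idea (the tracked
last-FPI LSN and the fpw_distance threshold are hypothetical, not existing
backend state):

/* Hypothetical sketch of the proposed heuristic: force a full-page image not
 * only on the first modification after a checkpoint, but also whenever the
 * last FPI for this page is more than fpw_distance WAL bytes behind. */
static bool
need_full_page_image(XLogRecPtr page_last_fpi_lsn,  /* hypothetical per-page state */
                     XLogRecPtr insert_lsn,         /* current WAL insert position */
                     XLogRecPtr redo_lsn,           /* last checkpoint's REDO pointer */
                     uint64     fpw_distance)       /* e.g. roughly the OS cache size */
{
    if (page_last_fpi_lsn < redo_lsn)
        return true;                 /* existing rule: first touch after checkpoint */
    if (insert_lsn - page_last_fpi_lsn > fpw_distance)
        return true;                 /* proposed rule: previous FPI is too far back */
    return false;
}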
On 16.06.2018 22:23, Andres Freund wrote:
Hi,
On 2018-06-13 16:09:45 +0300, Konstantin Knizhnik wrote:
Usage:
1. At master: create extension wal_prefetch
2. At replica: Call pg_wal_prefetch() function: it will not return until you
interrupt it.
FWIW, I think the proper design would rather be a background worker that
does this work. Forcing the user to somehow coordinate starting a
permanently running script whenever the database restarts isn't
great. There's also some issues around snapshots preventing vacuum
(which could be solved, but not nicely).
As I already wrote, my current approach with an extension and a
pg_wal_prefetch function called by the user should be treated only as a prototype
implementation which can be used to estimate the efficiency of prefetch. But
in the case of prefetching into shared buffers, one background worker will not
be enough. Prefetch can speed up the recovery process only if it performs
reads in parallel or in the background. So more than one background worker
will be needed for prefetch if we read data into Postgres shared buffers
rather than using posix_fadvise to load pages into the OS cache.
Greetings,
Andres Freund
On 2018-06-16 23:25:34 +0300, Konstantin Knizhnik wrote:
On 16.06.2018 22:02, Andres Freund wrote:
On 2018-06-16 11:38:59 +0200, Tomas Vondra wrote:
On 06/15/2018 08:01 PM, Andres Freund wrote:
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
On 14.06.2018 09:52, Thomas Munro wrote:
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.
Hi Konstantin,
Why stop at the page cache... what about shared buffers?
It is good question. I thought a lot about prefetching directly to shared
buffers.
I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.
Could you elaborate why prefetching into s_b is so much better (I'm sure it
has advantages, but I suppose prefetching into page cache would be much
easier to implement).
I think there's a number of issues with just issuing prefetch requests
via fadvise etc:
- it leads to guaranteed double buffering, in a way that's just about
guaranteed to *never* be useful. Because we'd only prefetch whenever
there's an upcoming write, there's simply no benefit in the page
staying in the page cache - we'll write out the whole page back to the
OS.
Sorry, I do not completely understand this.
Prefetch is only needed for partial update of a page - in this case we need
to first read page from the disk
Yes.
before been able to perform update. So before "we'll write out the whole
page back to the OS" we have to read this page.
And if page is in OS cached (prefetched) then is can be done much faster.
Yes.
Please notice that at the moment of prefetch there is no double
buffering.
Sure, but as soon as it's read there is.
As far as page is not accessed before, it is not present in shared buffers.
And once page is updated, there is really no need to keep it in shared
buffers. We can use cyclic buffers (like in case of sequential scan or
bulk update) to prevent throwing away useful pages from shared buffers by
redo process. So once again there will no double buffering.
That's a terrible idea. There's a *lot* of spatial locality of further
WAL records arriving for the same blocks.
I am not so familiar with current implementation of full page writes
mechanism in Postgres.
So may be my idea explained below is stupid or already implemented (but I
failed to find any traces of this).
Prefetch is needed only for WAL records performing partial update. Full page
write doesn't require prefetch.
Full page write has to be performed when the page is update first time after
checkpoint.
But what if slightly extend this rule and perform full page write also when
distance from previous full page write exceeds some delta
(which somehow related with size of OS cache)?
In this case even if checkpoint interval is larger than OS cache size, we
still can expect that updated pages are present in OS cache.
And no WAL prefetch is needed at all!
We could do so, but I suspect the WAL volume penalty would be
prohibitive in many cases. Worthwhile to try though.
Greetings,
Andres Freund
On 2018-06-16 23:31:49 +0300, Konstantin Knizhnik wrote:
On 16.06.2018 22:23, Andres Freund wrote:
Hi,
On 2018-06-13 16:09:45 +0300, Konstantin Knizhnik wrote:
Usage:
1. At master: create extension wal_prefetch
2. At replica: Call pg_wal_prefetch() function: it will not return until you
interrupt it.
FWIW, I think the proper design would rather be a background worker that
does this work. Forcing the user to somehow coordinate starting a
permanently running script whenever the database restarts isn't
great. There's also some issues around snapshots preventing vacuum
(which could be solved, but not nicely).
As I already wrote, the current my approach with extension and
pg_wal_prefetch function called by user can be treated only as prototype
implementation which can be used to estimate efficiency of prefetch. But in
case of prefetching in shared buffers, one background worker will not be
enough. Prefetch can can speedup recovery process if it performs reads in
parallel or background. So more than once background worker will be needed
for prefetch if we read data to Postgres shared buffers rather then using
posix_prefetch to load page in OS cache.
Sure, we'd need more than one to get the full benefit, but that's not
really hard. You'd see benefit even with a single process, because WAL
replay often has a lot of other bottlenecks too. But no reason to not
have multiple ones.
Greetings,
Andres Freund
On 17.06.2018 03:00, Andres Freund wrote:
On 2018-06-16 23:25:34 +0300, Konstantin Knizhnik wrote:
On 16.06.2018 22:02, Andres Freund wrote:
On 2018-06-16 11:38:59 +0200, Tomas Vondra wrote:
On 06/15/2018 08:01 PM, Andres Freund wrote:
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
On 14.06.2018 09:52, Thomas Munro wrote:
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.
Hi Konstantin,
Why stop at the page cache... what about shared buffers?
It is good question. I thought a lot about prefetching directly to shared
buffers.
I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.
Could you elaborate why prefetching into s_b is so much better (I'm sure it
has advantages, but I suppose prefetching into page cache would be much
easier to implement).
I think there's a number of issues with just issuing prefetch requests
via fadvise etc:
- it leads to guaranteed double buffering, in a way that's just about
guaranteed to *never* be useful. Because we'd only prefetch whenever
there's an upcoming write, there's simply no benefit in the page
staying in the page cache - we'll write out the whole page back to the
OS.
Sorry, I do not completely understand this.
Prefetch is only needed for partial update of a page - in this case we need
to first read page from the disk
Yes.
before been able to perform update. So before "we'll write out the whole
page back to the OS" we have to read this page.
And if page is in OS cached (prefetched) then is can be done much faster.
Yes.
Please notice that at the moment of prefetch there is no double
buffering.
Sure, but as soon as it's read there is.
As far as page is not accessed before, it is not present in shared buffers.
And once page is updated, there is really no need to keep it in shared
buffers. We can use cyclic buffers (like in case of sequential scan or
bulk update) to prevent throwing away useful pages from shared buffers by
redo process. So once again there will no double buffering.
That's a terrible idea. There's a *lot* of spatial locality of further
WAL records arriving for the same blocks.
In some cases it is true, in some cases it is not. In a typical OLTP system, if a
record is updated, then there is a high probability that
it will be accessed again soon. So if in such a system we perform write requests
on the master and read-only queries at the replicas,
keeping updated pages in shared buffers at the replica can be very helpful.
But if the replica is used for running mostly analytic queries while the master
performs some updates, then
it is more useful to keep the indexes and the most
frequently accessed pages in the replica's cache, rather than recent updates from the master.
So at least it seems reasonable to have such a parameter and let the
DBA choose the caching policy at the replicas.
I am not so familiar with current implementation of full page writes
mechanism in Postgres.
So may be my idea explained below is stupid or already implemented (but I
failed to find any traces of this).
Prefetch is needed only for WAL records performing partial update. Full page
write doesn't require prefetch.
Full page write has to be performed when the page is update first time after
checkpoint.
But what if slightly extend this rule and perform full page write also when
distance from previous full page write exceeds some delta
(which somehow related with size of OS cache)?
In this case even if checkpoint interval is larger than OS cache size, we
still can expect that updated pages are present in OS cache.
And no WAL prefetch is needed at all!
We could do so, but I suspect the WAL volume penalty would be
prohibitive in many cases. Worthwhile to try though.
Well, the typical size of a server's memory is now several hundreds of
gigabytes.
Certainly some of this memory is used for shared buffers, backend work
memory, ...
But still there are hundreds of gigabytes of free memory which can be
used by the OS for caching.
Let's assume that the full page write threshold is 100GB: that is one extra 8kB
page per 100GB of WAL!
Certainly this is an estimate for one page only, and it is more realistic to
expect that we would have to force full page writes for most of the updated
pages. But still I do not believe that it will cause significant growth
of the log size.
Another question is why we would choose such a large checkpoint interval: more
than a hundred gigabytes.
Certainly frequent checkpoints have a negative impact on performance. But
100GB is not "too frequent" in any case...
On Sat, Jun 16, 2018 at 3:41 PM, Andres Freund <andres@anarazel.de> wrote:
The posix_fadvise approach is not perfect, no doubt about that. But it
works pretty well for bitmap heap scans, and it's about 13249x better
(rough estimate) than the current solution (no prefetching).
Sure, but investing in an architecture we know might not live long also
has it's cost. Especially if it's not that complicated to do better.
My guesses are:
- Using OS prefetching is a very small patch.
- Prefetching into shared buffers is a much bigger patch.
- It'll be five years before we have direct I/O.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-06-18 16:44:09 -0400, Robert Haas wrote:
On Sat, Jun 16, 2018 at 3:41 PM, Andres Freund <andres@anarazel.de> wrote:
The posix_fadvise approach is not perfect, no doubt about that. But it
works pretty well for bitmap heap scans, and it's about 13249x better
(rough estimate) than the current solution (no prefetching).Sure, but investing in an architecture we know might not live long also
has it's cost. Especially if it's not that complicated to do better.My guesses are:
- Using OS prefetching is a very small patch.
- Prefetching into shared buffers is a much bigger patch.
Why? The majority of the work is standing up a bgworker that does
prefetching (i.e. reads WAL, figures out reads not in s_b, does
prefetch). Allowing a configurable number + some synchronization between
them isn't that much more work.
- It'll be five years before we have direct I/O.
I think we'll have lost a significant market share by then if that's the
case. Deservedly so.
Greetings,
Andres Freund
On 18.06.2018 23:47, Andres Freund wrote:
On 2018-06-18 16:44:09 -0400, Robert Haas wrote:
On Sat, Jun 16, 2018 at 3:41 PM, Andres Freund <andres@anarazel.de> wrote:
The posix_fadvise approach is not perfect, no doubt about that. But it
works pretty well for bitmap heap scans, and it's about 13249x better
(rough estimate) than the current solution (no prefetching).
Sure, but investing in an architecture we know might not live long also
has it's cost. Especially if it's not that complicated to do better.
My guesses are:
- Using OS prefetching is a very small patch.
- Prefetching into shared buffers is a much bigger patch.
Why? The majority of the work is standing up a bgworker that does
prefetching (i.e. reads WAL, figures out reads not in s_b, does
prefetch). Allowing a configurable number + some synchronization between
them isn't that much more work.
I do not think that prefetching into shared buffers requires much more
effort or makes the patch more invasive...
It even somehow simplifies it, because there is no need to maintain our own
cache of prefetched pages...
But it will definitely have much more impact on Postgres performance:
contention for buffer locks, throwing away pages accessed by read-only
queries, ...
Also there are two points which make prefetching into shared buffers
more complex:
1. The need to spawn multiple workers to perform prefetch in parallel and
somehow distribute the work between them.
2. Synchronizing the recovery process with prefetch, to prevent prefetch
from running too far ahead and doing useless work (a sketch of such a
check follows below).
The same problem exists for prefetch into the OS cache, but there the risk
of a false prefetch is less critical.
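To illustrate the second point: the synchronization could be as simple as not letting the prefetcher run more than some configured distance ahead of the replay position (a sketch with made-up names, not part of the patch):

/*
 * Skip prefetching blocks referenced by WAL that is too far ahead of the
 * replay position: by the time recovery reaches them they may have been
 * evicted again, so prefetching them now would be useless work.
 */
static bool
prefetch_allowed(XLogRecPtr prefetch_lsn, XLogRecPtr replay_lsn,
                 uint64 max_lead_bytes)
{
	return prefetch_lsn < replay_lsn + max_lead_bytes;
}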
- It'll be five years before we have direct I/O.
I think we'll have lost a significant market share by then if that's the
case. Deservedly so.
I have implemented a number of DBMS engines (GigaBASE, GOODS, FastDB,
...) and have supported direct IO (as an option) in most of them.
But with most workloads I did not get any significant improvement in
performance.
Certainly, it may be some problem with my implementations... and the Linux
kernel has changed significantly since that time.
But there is one "axiom" which complicates the usage of direct IO: only the OS
knows at each moment of time how much free memory it has.
So only the OS can efficiently schedule memory so that all system RAM is
used. It is very hard, if at all possible, to do this at the application level.
As a result you have to be very conservative in choosing the size of
shared buffers to fit in RAM and avoid swapping.
It may be possible if you have complete control over the server and there
is just one Postgres instance running on it.
But now there is a trend towards virtualization and clouds, and such an
assumption is not true in most cases. So double buffering
(or even triple, if we take into account on-device internal caches) is
definitely an issue. But direct IO does not seem to be a silver bullet for
solving it...
Concerning WAL prefetch I still have a serious doubt whether it is needed at
all:
if the checkpoint interval is less than the amount of free memory in the system,
then the redo process should not read much.
And if the checkpoint interval is much larger than the OS cache (are there cases
when it is really needed?), then a quite small patch (as it seems to me now)
forcing a full page write when the distance between the page LSN and the current
WAL insertion point exceeds some threshold should eliminate random reads
in this case as well.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 06/19/2018 11:08 AM, Konstantin Knizhnik wrote:
On 18.06.2018 23:47, Andres Freund wrote:
On 2018-06-18 16:44:09 -0400, Robert Haas wrote:
On Sat, Jun 16, 2018 at 3:41 PM, Andres Freund <andres@anarazel.de>
wrote:
The posix_fadvise approach is not perfect, no doubt about that. But it
works pretty well for bitmap heap scans, and it's about 13249x better
(rough estimate) than the current solution (no prefetching).
Sure, but investing in an architecture we know might not live long also
has it's cost. Especially if it's not that complicated to do better.
My guesses are:
- Using OS prefetching is a very small patch.
- Prefetching into shared buffers is a much bigger patch.
Why? The majority of the work is standing up a bgworker that does
prefetching (i.e. reads WAL, figures out reads not in s_b, does
prefetch). Allowing a configurable number + some synchronization between
them isn't that much more work.
I do not think that prefetching in shared buffers requires much more
efforts and make patch more envasive...
It even somehow simplify it, because there is no to maintain own cache
of prefetched pages...
But it will definitely have much more impact on Postgres performance:
contention for buffer locks, throwing away pages accessed by read-only
queries,...
Also there are two points which makes prefetching into shared buffers
more complex:
1. Need to spawn multiple workers to make prefetch in parallel and
somehow distribute work between them.
2. Synchronize work of recovery process with prefetch to prevent
prefetch to go too far and doing useless job.
The same problem exists for prefetch in OS cache, but here risk of false
prefetch is less critical.
I think the main challenge here is that all buffer reads are currently
synchronous (correct me if I'm wrong), while posix_fadvise() allows
us to prefetch the buffers asynchronously.
I don't think simply spawning a couple of bgworkers to prefetch buffers
is going to be equal to async prefetch, unless we support some sort of
async I/O. Maybe something has changed recently, but every time I looked
for good portable async I/O API/library I got burned.
Now, maybe a couple of bgworkers prefetching buffers synchronously would
be good enough for WAL prefetching - after all, we only need to prefetch
data fast enough for the recovery not to wait. But I doubt it's going to
be good enough for bitmap heap scans, for example.
We need a prefetch that allows filling the I/O queues with hundreds of
requests, and I don't think sync prefetch from a handful of bgworkers
can achieve that.
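For illustration, the difference in plain C (not Postgres code): posix_fadvise() only queues a read-ahead hint and returns immediately, so a single process can keep many requests in flight, while each synchronous read keeps at most one request in flight per worker.

#include <sys/types.h>
#include <fcntl.h>

/*
 * Queue read-ahead hints for n blocks: cheap and non-blocking, so the kernel
 * can keep dozens of requests in flight per device.  A worker issuing
 * synchronous reads instead would drive a queue depth of one.
 */
static void
hint_blocks(int fd, const off_t *offsets, int n, size_t blcksz)
{
	int		i;

	for (i = 0; i < n; i++)
		(void) posix_fadvise(fd, offsets[i], blcksz, POSIX_FADV_WILLNEED);
}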
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 19.06.2018 14:03, Tomas Vondra wrote:
On 06/19/2018 11:08 AM, Konstantin Knizhnik wrote:
On 18.06.2018 23:47, Andres Freund wrote:
On 2018-06-18 16:44:09 -0400, Robert Haas wrote:
On Sat, Jun 16, 2018 at 3:41 PM, Andres Freund <andres@anarazel.de>
wrote:
The posix_fadvise approach is not perfect, no doubt about that. But it
works pretty well for bitmap heap scans, and it's about 13249x better
(rough estimate) than the current solution (no prefetching).
Sure, but investing in an architecture we know might not live long also
has it's cost. Especially if it's not that complicated to do better.
My guesses are:
- Using OS prefetching is a very small patch.
- Prefetching into shared buffers is a much bigger patch.
Why? The majority of the work is standing up a bgworker that does
prefetching (i.e. reads WAL, figures out reads not in s_b, does
prefetch). Allowing a configurable number + some synchronization between
them isn't that much more work.
I do not think that prefetching in shared buffers requires much more
efforts and make patch more envasive...
It even somehow simplify it, because there is no to maintain own
cache of prefetched pages...
But it will definitely have much more impact on Postgres performance:
contention for buffer locks, throwing away pages accessed by
read-only queries,...
Also there are two points which makes prefetching into shared buffers
more complex:
1. Need to spawn multiple workers to make prefetch in parallel and
somehow distribute work between them.
2. Synchronize work of recovery process with prefetch to prevent
prefetch to go too far and doing useless job.
The same problem exists for prefetch in OS cache, but here risk of
false prefetch is less critical.
I think the main challenge here is that all buffer reads are currently
synchronous (correct me if I'm wrong), while the posix_fadvise()
allows a to prefetch the buffers asynchronously.
Yes, this is why we have to spawn several concurrent background workers
to perform prefetch.
I don't think simply spawning a couple of bgworkers to prefetch
buffers is going to be equal to async prefetch, unless we support some
sort of async I/O. Maybe something has changed recently, but every
time I looked for good portable async I/O API/library I got burned.
Now, maybe a couple of bgworkers prefetching buffers synchronously
would be good enough for WAL prefetching - after all, we only need to
prefetch data fast enough for the recovery not to wait. But I doubt
it's going to be good enough for bitmap heap scans, for example.
We need a prefetch that allows filling the I/O queues with hundreds of
requests, and I don't think sync prefetch from a handful of bgworkers
can achieve that.
regards
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 06/19/2018 02:33 PM, Konstantin Knizhnik wrote:
On 19.06.2018 14:03, Tomas Vondra wrote:
On 06/19/2018 11:08 AM, Konstantin Knizhnik wrote:
...
Also there are two points which makes prefetching into shared buffers
more complex:
1. Need to spawn multiple workers to make prefetch in parallel and
somehow distribute work between them.
2. Synchronize work of recovery process with prefetch to prevent
prefetch to go too far and doing useless job.
The same problem exists for prefetch in OS cache, but here risk of
false prefetch is less critical.
I think the main challenge here is that all buffer reads are currently
synchronous (correct me if I'm wrong), while the posix_fadvise()
allows a to prefetch the buffers asynchronously.
Yes, this is why we have to spawn several concurrent background workers
to perform prefetch.
Right. My point is that while spawning bgworkers probably helps, I don't
expect it to be enough to fill the I/O queues on modern storage systems.
Even if you start say 16 prefetch bgworkers, that's not going to be
enough for large arrays or SSDs. Those typically need way more than 16
requests in the queue.
Consider for example [1] from 2014 where Merlin reported how S3500
(Intel SATA SSD) behaves with different effective_io_concurrency values:
[1] /messages/by-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azsTdQ@mail.gmail.com
Clearly, you need to prefetch 32/64 blocks or so. Consider you may have
multiple such devices in a single RAID array, and that this device is
from 2014 (and newer flash devices likely need even deeper queues).
ISTM a small number of bgworkers is not going to be sufficient. It might
be enough for WAL prefetching (where we may easily run into the
redo-is-single-threaded bottleneck), but it's hardly a solution for
bitmap heap scans, for example. We'll need to invent something else for
that.
OTOH my guess is that whatever solution we'll end up implementing for
bitmap heap scans, it will be applicable for WAL prefetching too. Which
is why I'm suggesting simply using posix_fadvise is not going to make
the direct I/O patch significantly more complicated.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jun 19, 2018 at 4:04 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
Right. My point is that while spawning bgworkers probably helps, I don't
expect it to be enough to fill the I/O queues on modern storage systems.
Even if you start say 16 prefetch bgworkers, that's not going to be
enough for large arrays or SSDs. Those typically need way more than 16
requests in the queue.
Consider for example [1] from 2014 where Merlin reported how S3500
(Intel SATA SSD) behaves with different effective_io_concurrency values:
[1] /messages/by-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azsTdQ@mail.gmail.com
Clearly, you need to prefetch 32/64 blocks or so. Consider you may have
multiple such devices in a single RAID array, and that this device is
from 2014 (and newer flash devices likely need even deeper queues).
For reference, a typical datacenter SSD needs a queue depth of 128 to
saturate a single device. [1] Multiply that appropriately for RAID arrays.
Regards,
Ants Aasma
[1] https://www.anandtech.com/show/12435/the-intel-ssd-dc-p4510-ssd-review-part-1-virtual-raid-on-cpu-vroc-scalability/3
On 19.06.2018 16:57, Ants Aasma wrote:
On Tue, Jun 19, 2018 at 4:04 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>>
wrote:
Right. My point is that while spawning bgworkers probably helps, I
don't
expect it to be enough to fill the I/O queues on modern storage
systems.
Even if you start say 16 prefetch bgworkers, that's not going to be
enough for large arrays or SSDs. Those typically need way more
than 16
requests in the queue.
Consider for example [1] from 2014 where Merlin reported how S3500
(Intel SATA SSD) behaves with different effective_io_concurrency
values:
[1] /messages/by-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azsTdQ@mail.gmail.com
Clearly, you need to prefetch 32/64 blocks or so. Consider you may
have
multiple such devices in a single RAID array, and that this device is
from 2014 (and newer flash devices likely need even deeper queues).
For reference, a typical datacenter SSD needs a queue depth of 128 to
saturate a single device. [1] Multiply that appropriately for RAID
arrays.
So how is it related to the results for S3500, where there is almost now
performance improvement for effective_io_concurrency >8?
Starting 128 or more workers to perform prefetch is definitely not
acceptable...
Regards,
Ants Aasma
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 06/19/2018 04:50 PM, Konstantin Knizhnik wrote:
On 19.06.2018 16:57, Ants Aasma wrote:
On Tue, Jun 19, 2018 at 4:04 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>>
wrote:
Right. My point is that while spawning bgworkers probably helps, I
don't
expect it to be enough to fill the I/O queues on modern storage
systems.
Even if you start say 16 prefetch bgworkers, that's not going to be
enough for large arrays or SSDs. Those typically need way more
than 16
requests in the queue.
Consider for example [1] from 2014 where Merlin reported how S3500
(Intel SATA SSD) behaves with different effective_io_concurrency
values:
[1] /messages/by-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azsTdQ@mail.gmail.com
Clearly, you need to prefetch 32/64 blocks or so. Consider you may
have
multiple such devices in a single RAID array, and that this device is
from 2014 (and newer flash devices likely need even deeper queues).
For reference, a typical datacenter SSD needs a queue depth of 128 to
saturate a single device. [1] Multiply that appropriately for RAID
arrays.
So how is it related to the results for S3500, where there is almost now
performance improvement for effective_io_concurrency >8?
Starting 128 or more workers to perform prefetch is definitely not
acceptable...
I'm not sure what you mean by "almost now performance improvement", but
I guess you meant "almost no performance improvement" instead?
If that's the case, it's not quite true - increasing the queue depth
above 8 further improved the throughput by about ~10-20% (both by
duration and peak throughput measured by iotop).
But more importantly, this is just a single device - you typically have
multiple of them in a larger arrays, to get better capacity, performance
and/or reliability. So if you have 16 such drives, and you want to send
at least 8 requests to each, suddenly you need at least 128 requests.
And as pointed out before, the S3500 is about a 5-year-old device (it was
introduced in Q2/2013). On newer devices the difference is usually way
more significant / the required queue depth is much higher.
Obviously, this is a somewhat simplified view, ignoring various details
(e.g. that there may be multiple concurrent queries, each sending I/O
requests - what matters is the combined number of requests, of course).
But I don't think this makes a huge difference.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-06-19 12:08:27 +0300, Konstantin Knizhnik wrote:
I do not think that prefetching in shared buffers requires much more efforts
and make patch more envasive...
It even somehow simplify it, because there is no to maintain own cache of
prefetched pages...
But it will definitely have much more impact on Postgres performance:
contention for buffer locks, throwing away pages accessed by read-only
queries,...
These arguments seem bogus to me. Otherwise the startup process is going
to do that work.
Also there are two points which makes prefetching into shared buffers more
complex:
1. Need to spawn multiple workers to make prefetch in parallel and somehow
distribute work between them.
I'm not even convinced that's true. It doesn't seem insane to have a
queue of, say, 128 requests that are done with posix_fadvise WILLNEED,
where the oldest request is read into shared buffers by the
prefetcher, and then discarded from the page cache with WONTNEED. I
think we're going to want a queue that's sorted in the prefetch process
anyway, because there's a high likelihood that we'll otherwise issue
prefetch requests for the same pages over and over again.
That gets rid of most of the disadvantages: We have backpressure
(because the read into shared buffers will block if not yet ready),
we'll prevent double buffering, we'll prevent the startup process from
doing the victim buffer search.
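A rough sketch of that scheme, written against plain file descriptors instead of shared buffers and without the sorting/deduplication (illustration only, not the actual proposal; note the flag is spelled POSIX_FADV_DONTNEED):

#include <fcntl.h>
#include <unistd.h>

#define PREFETCH_QUEUE_SIZE 128
#define PREFETCH_BLCKSZ     8192

typedef struct { int fd; off_t off; } PrefetchRequest;

static PrefetchRequest prefetch_queue[PREFETCH_QUEUE_SIZE];
static int	prefetch_head;
static int	prefetch_count;

/*
 * Hint a new block with WILLNEED; once the queue is full, consume the oldest
 * entry by actually reading it (into shared buffers in the real proposal) and
 * dropping the OS cache copy.  The blocking read is what provides the
 * backpressure mentioned above.
 */
static void
prefetch_block(int fd, off_t off)
{
	char		page[PREFETCH_BLCKSZ];
	int			slot;

	if (prefetch_count == PREFETCH_QUEUE_SIZE)
	{
		PrefetchRequest *old = &prefetch_queue[prefetch_head];

		(void) pread(old->fd, page, sizeof(page), old->off);
		(void) posix_fadvise(old->fd, old->off, sizeof(page), POSIX_FADV_DONTNEED);
		prefetch_head = (prefetch_head + 1) % PREFETCH_QUEUE_SIZE;
		prefetch_count--;
	}
	(void) posix_fadvise(fd, off, PREFETCH_BLCKSZ, POSIX_FADV_WILLNEED);
	slot = (prefetch_head + prefetch_count) % PREFETCH_QUEUE_SIZE;
	prefetch_queue[slot].fd = fd;
	prefetch_queue[slot].off = off;
	prefetch_count++;
}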
Concerning WAL perfetch I still have a serious doubt if it is needed at all:
if checkpoint interval is less than size of free memory at the system, then
redo process should not read much.
I'm confused. Didn't you propose this? FWIW, there's a significant
number of installations where people have observed this problem in
practice.
And if checkpoint interval is much larger than OS cache (are there cases
when it is really needed?)
Yes, there are. Percentage of FPWs can cause serious problems, as do
repeated writouts by the checkpointer.
then quite small patch (as it seems to me now) forcing full page write
when distance between page LSN and current WAL insertion point exceeds
some threshold should eliminate random reads also in this case.
I'm pretty sure that that'll hurt a significant number of installations,
that set the timeout high, just so they can avoid FPWs.
Greetings,
Andres Freund
On 19.06.2018 18:50, Andres Freund wrote:
On 2018-06-19 12:08:27 +0300, Konstantin Knizhnik wrote:
I do not think that prefetching in shared buffers requires much more efforts
and make patch more envasive...
It even somehow simplify it, because there is no to maintain own cache of
prefetched pages...
But it will definitely have much more impact on Postgres performance:
contention for buffer locks, throwing away pages accessed by read-only
queries,...
These arguments seem bogus to me. Otherwise the startup process is going
to do that work.
There is just one process replaying WAL. Certainly it has some impact on
hot standby query execution.
But if there are several prefetch workers (128???) then this impact
will increase dramatically.
Also there are two points which makes prefetching into shared buffers more
complex:
1. Need to spawn multiple workers to make prefetch in parallel and somehow
distribute work between them.
I'm not even convinced that's true. It doesn't seem insane to have a
queue of, say, 128 requests that are done with posix_fadvise WILLNEED,
where the oldest requests is read into shared buffers by the
prefetcher. And then discarded from the page cache with WONTNEED. I
think we're going to want a queue that's sorted in the prefetch process
anyway, because there's a high likelihood that we'll otherwise issue
prfetch requets for the same pages over and over again.
That gets rid of most of the disadvantages: We have backpressure
(because the read into shared buffers will block if not yet ready),
we'll prevent double buffering, we'll prevent the startup process from
doing the victim buffer search.
Concerning WAL perfetch I still have a serious doubt if it is needed at all:
if checkpoint interval is less than size of free memory at the system, then
redo process should not read much.
I'm confused. Didn't you propose this? FWIW, there's a significant
number of installations where people have observed this problem in
practice.
Well, originally it was proposed by Sean - the author of pg_prefaulter.
I just ported it from Go to C using the standard PostgreSQL WAL iterator.
Then I performed some measurements and didn't find any dramatic
improvement in performance (in the case of synchronous replication) or
reduction of replication lag (for asynchronous replication), neither at my
desktop (SSD, 16Gb RAM, local replication within the same computer, pgbench
scale 1000), nor at a pair of powerful servers connected by
InfiniBand with 3Tb NVMe (pgbench with scale 100000).
Also I noticed that the read rate at the replica is almost zero.
What does it mean:
1. I am doing something wrong.
2. posix_fadvise is not so efficient.
3. pgbench is not the right workload to demonstrate the effect of prefetch.
4. The hardware I am using is not typical.
So it makes me wonder when such prefetch may be needed... And it raised
new questions:
I wonder how frequently the checkpoint interval is much larger than the OS cache?
If we enforce full page writes (let's say after each 1Gb of WAL), how does it
affect WAL size and performance?
It looks difficult to answer the second question without
implementing some prototype.
Maybe I will try to do it.
And if checkpoint interval is much larger than OS cache (are there cases
when it is really needed?)
Yes, there are. Percentage of FPWs can cause serious problems, as do
repeated writouts by the checkpointer.
One more consideration: data is written to the disk as blocks in any
case. If you update just a few bytes on a page, the whole page still
has to be written to the database file.
So avoiding full page writes reduces the WAL size and the amount of
data written to the WAL, but not the amount of data written to the database
itself.
It means that if we completely eliminate FPWs and transactions are
updating random pages, then disk traffic is reduced by less than a factor
of two (the heap page write remains; only the extra page image in the WAL
goes away)...
then quite small patch (as it seems to me now) forcing full page write
when distance between page LSN and current WAL insertion point exceeds
some threshold should eliminate random reads also in this case.
I'm pretty sure that that'll hurt a significant number of installations,
that set the timeout high, just so they can avoid FPWs.
Maybe, but I am not so sure. That is why I will try to investigate it further.
Greetings,
Andres Freund
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 06/19/2018 05:50 PM, Andres Freund wrote:
On 2018-06-19 12:08:27 +0300, Konstantin Knizhnik wrote:
I do not think that prefetching in shared buffers requires much more efforts
and make patch more envasive...
It even somehow simplify it, because there is no to maintain own cache of
prefetched pages...
But it will definitely have much more impact on Postgres performance:
contention for buffer locks, throwing away pages accessed by read-only
queries,...
These arguments seem bogus to me. Otherwise the startup process is going
to do that work.
Also there are two points which makes prefetching into shared buffers more
complex:
1. Need to spawn multiple workers to make prefetch in parallel and somehow
distribute work between them.
I'm not even convinced that's true. It doesn't seem insane to have a
queue of, say, 128 requests that are done with posix_fadvise WILLNEED,
where the oldest requests is read into shared buffers by the
prefetcher. And then discarded from the page cache with WONTNEED. I
think we're going to want a queue that's sorted in the prefetch process
anyway, because there's a high likelihood that we'll otherwise issue
prfetch requets for the same pages over and over again.
That gets rid of most of the disadvantages: We have backpressure
(because the read into shared buffers will block if not yet ready),
we'll prevent double buffering, we'll prevent the startup process from
doing the victim buffer search.
I'm confused. I thought you wanted to prefetch directly to shared
buffers, so that it also works with direct I/O in the future. But now
you suggest to use posix_fadvise() to work around the synchronous buffer
read limitation. I don't follow ...
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-06-19 19:34:22 +0300, Konstantin Knizhnik wrote:
On 19.06.2018 18:50, Andres Freund wrote:
On 2018-06-19 12:08:27 +0300, Konstantin Knizhnik wrote:
I do not think that prefetching in shared buffers requires much more efforts
and make patch more envasive...
It even somehow simplify it, because there is no to maintain own cache of
prefetched pages...
But it will definitely have much more impact on Postgres performance:
contention for buffer locks, throwing away pages accessed by read-only
queries,...
These arguments seem bogus to me. Otherwise the startup process is going
to do that work.
There is just one process replaying WAL. Certainly it has some impact on hot
standby query execution.
But if there will be several prefetch workers (128???) then this impact will
be dramatically increased.
Hence me suggesting how you can do that with one process (re locking). I
still entirely fail to see how "throwing away pages accessed by
read-only queries" is meaningful here - the startup process is going to
read the data anyway, and we *do not* want to use a ringbuffer as that'd
make the situation dramatically worse.
Well, originally it was proposed by Sean - the author of pg-prefaulter. I
just ported it from GO to C using standard PostgreSQL WAL iterator.
Then I performed some measurements and didn't find some dramatic improvement
in performance (in case of synchronous replication) or reducing replication
lag for asynchronous replication neither at my desktop (SSD, 16Gb RAM, local
replication within same computer, pgbench scale 1000), neither at pair of
two powerful servers connected by
InfiniBand and 3Tb NVME (pgbench with scale 100000).
Also I noticed that read rate at replica is almost zero.
What does it mean:
1. I am doing something wrong.
2. posix_prefetch is not so efficient.
3. pgbench is not right workload to demonstrate effect of prefetch.
4. Hardware which I am using is not typical.
I think it's probably largely a mix of 3 and 4. pgbench with random
distribution probably indeed is a bad testcase, because either
everything is in cache or just about every write ends up as a full page
write because of the scale. You might want to try a) turning off full page
writes, b) using a less random distribution.
So it make me think when such prefetch may be needed... And it caused new
questions:
I wonder how frequently checkpoint interval is much larger than OS
cache?
Extremely common.
If we enforce full pages writes (let's say each after each 1Gb), how it
affect wal size and performance?
Extremely badly. If you look at stats of production servers (using
pg_waldump) you can see that large percentage of the total WAL volume is
FPWs, that FPWs are a storage / bandwidth / write issue, and that higher
FPW rates after a checkpoint correlate strongly negatively with performance.
Greetings,
Andres Freund
Hi,
On 2018-06-19 18:41:24 +0200, Tomas Vondra wrote:
I'm confused. I thought you wanted to prefetch directly to shared buffers,
so that it also works with direct I/O in the future. But now you suggest to
use posix_fadvise() to work around the synchronous buffer read limitation. I
don't follow ...
Well, I have multiple goals. For one I think using prefetching without
any sort of backpressure and mechanism to see which have completed will
result in hard to monitor and random performance. For another I'm
concerned with wasting a significant amount of memory for the OS cache
of all the read data that's guaranteed to never be needed (as we'll
*always* write to the relevant page shortly down the road). For those
reasons alone I think prefetching just into the OS cache is a bad idea,
and should be rejected.
I also would want something that's more compatible with DIO. But people
pushed back on that, so... As long as we build something that looks
like a request queue (which my proposal does), it's also something that
can later with some reduced effort be ported onto asynchronous io.
Greetings,
Andres Freund
On 06/19/2018 06:34 PM, Konstantin Knizhnik wrote:
On 19.06.2018 18:50, Andres Freund wrote:
On 2018-06-19 12:08:27 +0300, Konstantin Knizhnik wrote:
I do not think that prefetching in shared buffers requires much more
efforts
and make patch more envasive...
It even somehow simplify it, because there is no to maintain own
cache of
prefetched pages...
But it will definitely have much more impact on Postgres performance:
contention for buffer locks, throwing away pages accessed by read-only
queries,...
These arguments seem bogus to me. Otherwise the startup process is going
to do that work.
There is just one process replaying WAL. Certainly it has some impact on
hot standby query execution.
But if there will be several prefetch workers (128???) then this impact
will be dramatically increased.
The goal of prefetching is better saturation of the storage. Which means
less bandwidth remaining for other processes (that have to compete for
the same storage). I don't think "startup process is going to do that
work" is entirely true - it'd do that work, but likely over longer
period of time.
But I don't think this is an issue - I'd expect having some GUC defining
how many records to prefetch (just like effective_io_concurrency).
Concerning WAL perfetch I still have a serious doubt if it is needed
at all:
if checkpoint interval is less than size of free memory at the
system, then
redo process should not read much.
I'm confused. Didn't you propose this? FWIW, there's a significant
number of installations where people have observed this problem in
practice.
Well, originally it was proposed by Sean - the author of pg-prefaulter.
I just ported it from GO to C using standard PostgreSQL WAL iterator.
Then I performed some measurements and didn't find some dramatic
improvement in performance (in case of synchronous replication) or
reducing replication lag for asynchronous replication neither at my
desktop (SSD, 16Gb RAM, local replication within same computer, pgbench
scale 1000), neither at pair of two powerful servers connected by
InfiniBand and 3Tb NVME (pgbench with scale 100000).
Also I noticed that read rate at replica is almost zero.
What does it mean:
1. I am doing something wrong.
2. posix_prefetch is not so efficient.
3. pgbench is not right workload to demonstrate effect of prefetch.
4. Hardware which I am using is not typical.
pgbench is a perfectly sufficient workload to demonstrate the issue, all
you need to do is use sufficiently large scale factor (say 2*RAM) and
large number of clients to generate writes on the primary (to actually
saturate the storage). Then the redo on replica won't be able to keep
up, because the redo only fetches one page at a time.
So it make me think when such prefetch may be needed... And it caused
new questions:
I wonder how frequently checkpoint interval is much larger than OS cache?
Pretty often. Furthermore, replicas may also run queries (often large
ones), pushing pages related to redo from RAM.
If we enforce full pages writes (let's say each after each 1Gb), how it
affect wal size and performance?
It would improve redo performance, of course, exactly because the page
would not need to be loaded from disk. But the amount of WAL can
increase tremendously, causing issues for network bandwidth
(particularly between different data centers).
Looks like it is difficult to answer the second question without
implementing some prototype.
May be I will try to do it.
Perhaps you should prepare some examples of workloads demonstrating the
issue, before trying implementing a solution.
And if checkpoint interval is much larger than OS cache (are there cases
when it is really needed?)
Yes, there are. Percentage of FPWs can cause serious problems, as do
repeated writouts by the checkpointer.
One more consideration: data is written to the disk as blocks in any
case. If you updated just few bytes on a page, then still the whole page
has to be written in database file.
So avoiding full page writes allows to reduce WAL size and amount of
data written to the WAL, but not amount of data written to the database
itself.
It means that if we completely eliminate FPW and transactions are
updating random pages, then disk traffic is reduced less than two times...
I don't follow. What do you mean by "less than two times"? Surely the
difference can be anything between 0 and infinity, depending on how
often you write a single page.
The other problem with just doing FPI all the time is backups. To do
physical backups / WAL archival, you need to store all the WAL segments.
If the amount of WAL increases 10x you're going to be unhappy.
then quite small patch (as it seems to me now) forcing full page write
when distance between page LSN and current WAL insertion point exceeds
some threshold should eliminate random reads also in this case.
I'm pretty sure that that'll hurt a significant number of installations,
that set the timeout high, just so they can avoid FPWs.
May be, but I am not so sure. This is why I will try to investigate it
more.
I'd say checkpoints already do act as such timeout (not only, but people
are setting it high to get rid of FPIs).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I continue my experiments with WAL prefetch.
I have embedded prefetch in Postgres: a walprefetcher process is now started
together with the startup process and is able to help it speed up recovery.
The patch is attached.
Unfortunately the result is negative (at least at my desktop: SSD, 16Gb
RAM). Recovery with prefetch is 3 times slower than without it.
What I am doing:
Configuration:
max_wal_size=min_wal_size=10Gb,
shared_buffers = 1Gb
Database:
pgbench -i -s 1000
Test:
pgbench -c 10 -M prepared -N -T 100 -P 1
pkill postgres
echo 3 > /proc/sys/vm/drop_caches
time pg_ctl -t 1000 -D pgsql -l logfile start
Without prefetch it takes 19 seconds (recovering about 4Gb of WAL); with
prefetch it takes about one minute. About 400k blocks are prefetched.
CPU usage is small (<20%), and both processes are in "Ds" state.
vmstat without prefetch shows the following output:
procs -----------memory---------- ---swap-- -----io---- -system--
------cpu-----
r b swpd free buff cache si so bi bo in cs us sy
id wa st
0 2 2667964 11465832 7892 2515588 0 0 344272 2 6129 22290
8 4 84 5 0
3 1 2667960 10013900 9516 3963056 6 0 355606 8772 7412 25228
12 6 74 8 0
1 0 2667960 8526228 11036 5440192 0 0 366910 242 6123 19476
8 5 83 3 0
1 1 2667960 7824816 11060 6141920 0 0 166860 171638 9581
24746 4 4 79 13 0
0 4 2667960 7822824 11072 6143788 0 0 264 376836 19292
49973 1 3 69 27 0
1 0 2667960 7033140 11220 6932400 0 0 188810 168070 14610
41390 5 4 72 19 0
1 1 2667960 5739616 11384 8226148 0 0 254492 57884 6733 19263
8 5 84 4 0
0 3 2667960 5024380 11400 8941532 0 0 8 398198 18164
45782 2 5 70 23 0
0 0 2667960 5020152 11428 8946000 0 0 168 69128 3918 10370
2 1 91 6 0
with prefetch:
procs -----------memory---------- ---swap-- -----io---- -system--
------cpu-----
r b swpd free buff cache si so bi bo in cs us sy
id wa st
0 2 2651816 12340648 11148 1564420 0 0 178980 96 4411 14237
5 2 90 3 0
2 0 2651816 11771612 11712 2132180 0 0 169572 0 6388
18244 5 3 72 20 0
2 0 2651816 11199248 12008 2701960 0 0 168966 162 6677
18816 5 3 72 20 0
1 3 2651816 10660512 12028 3241604 0 0 162666 16 7065
21668 6 5 69 20 0
0 2 2651816 10247180 12052 3653888 0 0 131564 18112 7376
22023 6 3 69 23 0
0 2 2651816 9850424 12096 4064980 0 0 133158 238 6398 17557
4 2 71 22 0
2 0 2651816 9456616 12108 4459456 0 0 134702 44 6219 16665
3 2 73 22 0
0 2 2651816 9161336 12160 4753868 0 0 111168 74408 8038 20440
3 3 69 25 0
3 0 2651816 8810336 12172 5106068 0 0 134694 0 6251 16978
4 2 73 22 0
0 2 2651816 8451924 12192 5463692 0 0 137546 80 6264 16930
3 2 73 22 0
1 1 2651816 8108000 12596 5805856 0 0 135212 10 6218 16827
4 2 72 22 0
1 3 2651816 7793992 12612 6120376 0 0 135072 0 6233 16736
3 2 73 22 0
0 2 2651816 7507644 12632 6406512 0 0 134830 90 6267 16910
3 2 73 22 0
0 2 2651816 7246696 12776 6667804 0 0 122656 51820 7419 19384
3 3 71 23 0
1 2 2651816 6990080 12784 6924352 0 0 121248 55284 7527 19794
3 3 71 23 0
0 3 2651816 6913648 12804 7000376 0 0 36078 295140 14852
37925 2 3 67 29 0
0 2 2651816 6873112 12804 7040852 0 0 19180 291330 16167
41711 1 3 68 28 0
5 1 2651816 6641848 12812 7271736 0 0 107696 68 5760 15301
3 2 73 22 0
3 1 2651816 6426356 12820 7490636 0 0 103412 0 5942 15994
3 2 72 22 0
0 2 2651816 6195288 12824 7720720 0 0 104446 0 5605 14757
3 2 73 22 0
0 2 2651816 5946876 12980 7970912 0 0 113340 74 5980 15678
3 2 71 24 0
1 2 2651816 5655768 12984 8262880 0 0 137290 0 6235 16412
3 2 73 21 0
2 0 2651816 5359548 13120 8557072 0 0 137608 86 6309 16658
3 2 73 21 0
2 0 2651816 5068268 13124 8849136 0 0 137386 0 6225 16589
3 2 73 21 0
2 0 2651816 4816812 13124 9100600 0 0 120116 53284 7273 18776
3 2 72 23 0
0 2 2651816 4563152 13132 9353232 0 0 117972 54352 7423 19375
3 2 73 22 0
1 2 2651816 4367108 13144 9549712 0 0 51994 239498 10846
25987 3 5 73 19 0
0 0 2651816 4366356 13164 9549892 0 0 168 294196 14981
39432 1 3 79 17 0
So as you can see, the read speed with prefetch is lower: < 130Mb/sec,
while without prefetch it reaches up to 366Mb/sec.
My hypothesis is that prefetch flushes dirty pages from the cache and, as a
result, more data has to be written and backends are more frequently
blocked on write.
In any case - a very upsetting result.
Any comments are welcome.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
walprefetch.patch (text/x-patch)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 1b000a2..9730b42 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -879,13 +879,6 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
return true;
}
-#ifdef FRONTEND
-/*
- * Functions that are currently not needed in the backend, but are better
- * implemented inside xlogreader.c because of the internal facilities available
- * here.
- */
-
/*
* Find the first record with an lsn >= RecPtr.
*
@@ -1004,9 +997,6 @@ out:
return found;
}
-#endif /* FRONTEND */
-
-
/* ----------------------------------------
* Functions for decoding the data and block references in a record.
* ----------------------------------------
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 52fe55e..7847311 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -652,7 +652,7 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
* in walsender.c but for small differences (such as lack of elog() in
* frontend). Probably these should be merged at some point.
*/
-static void
+void
XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
Size count)
{
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 7e34bee..e492715 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -32,6 +32,7 @@
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
+#include "postmaster/walprefetcher.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
@@ -335,6 +336,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
case WalReceiverProcess:
statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
break;
+ case WalPrefetcherProcess:
+ statmsg = pgstat_get_backend_desc(B_WAL_PREFETCHER);
+ break;
default:
statmsg = "??? process";
break;
@@ -462,6 +466,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
WalReceiverMain();
proc_exit(1); /* should never return */
+ case WalPrefetcherProcess:
+ /* don't set signals, walprefetcher has its own agenda */
+ WalPrefetcherMain();
+ proc_exit(1); /* should never return */
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..13e5066 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
- pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+ pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o walprefetcher.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 084573e..7195578 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2870,6 +2870,9 @@ pgstat_bestart(void)
case WalReceiverProcess:
beentry->st_backendType = B_WAL_RECEIVER;
break;
+ case WalPrefetcherProcess:
+ beentry->st_backendType = B_WAL_PREFETCHER;
+ break;
default:
elog(FATAL, "unrecognized process type: %d",
(int) MyAuxProcType);
@@ -3519,6 +3522,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_WAL_WRITER_MAIN:
event_name = "WalWriterMain";
break;
+ case WAIT_EVENT_WAL_PREFETCHER_MAIN:
+ event_name = "WalPrefetcherMain";
+ break;
/* no default case, so that compiler will warn */
}
@@ -4126,6 +4132,9 @@ pgstat_get_backend_desc(BackendType backendType)
case B_WAL_RECEIVER:
backendDesc = "walreceiver";
break;
+ case B_WAL_PREFETCHER:
+ backendDesc = "walprefetcher";
+ break;
case B_WAL_SENDER:
backendDesc = "walsender";
break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a4b53b3..a5b54cd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -254,7 +254,9 @@ static pid_t StartupPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
PgStatPID = 0,
- SysLoggerPID = 0;
+ SysLoggerPID = 0,
+ WalPrefetcherPID = 0
+;
/* Startup process's status */
typedef enum
@@ -362,6 +364,9 @@ static volatile bool avlauncher_needs_signal = false;
/* received START_WALRECEIVER signal */
static volatile sig_atomic_t WalReceiverRequested = false;
+/* received START_WALPREFETCHER signal */
+static volatile sig_atomic_t WalPrefetcherRequested = false;
+
/* set when there's a worker that needs to be started up */
static volatile bool StartWorkerNeeded = true;
static volatile bool HaveCrashedWorker = false;
@@ -549,6 +554,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartupDataBase() StartChildProcess(StartupProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartWalPrefetcher() StartChildProcess(WalPrefetcherProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1373,6 +1379,9 @@ PostmasterMain(int argc, char *argv[])
StartupStatus = STARTUP_RUNNING;
pmState = PM_STARTUP;
+ /* Start Wal prefetcher now because it may speed-up WAL redo */
+ WalPrefetcherPID = StartWalPrefetcher();
+
/* Some workers may be scheduled to start now */
maybe_start_bgworkers();
@@ -2535,8 +2544,11 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (WalPrefetcherPID != 0)
+ signal_child(WalPrefetcherPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
+ signal_child(BgWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
if (AutoVacPID != 0)
@@ -2685,6 +2697,8 @@ pmdie(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalPrefetcherPID != 0)
+ signal_child(WalPrefetcherPID, SIGTERM);
if (pmState == PM_RECOVERY)
{
SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER);
@@ -2864,6 +2878,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (WalPrefetcherPID == 0)
+ WalPrefetcherPID = StartWalPrefetcher();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -2967,6 +2983,20 @@ reaper(SIGNAL_ARGS)
}
/*
+ * Was it the wal prefetcher? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalPrefetcherPID)
+ {
+ WalPrefetcherPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL prefetcher process"));
+ continue;
+ }
+
+ /*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
* necessary. Any other exit condition is treated as a crash.
@@ -3451,6 +3481,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(WalWriterPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the walprefetcher too */
+ if (pid == WalPrefetcherPID)
+ WalPrefetcherPID = 0;
+ else if (WalPrefetcherPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) WalPrefetcherPID)));
+ signal_child(WalPrefetcherPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
/* Take care of the walreceiver too */
if (pid == WalReceiverPID)
WalReceiverPID = 0;
@@ -3657,6 +3699,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_WORKER) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalPrefetcherPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3757,6 +3800,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(WalPrefetcherPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -3946,6 +3990,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalPrefetcherPID != 0)
+ signal_child(WalPrefetcherPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5041,6 +5087,10 @@ sigusr1_handler(SIGNAL_ARGS)
Assert(BgWriterPID == 0);
BgWriterPID = StartBackgroundWriter();
+ /* WAL prefetcher is expected to be started earlier but if not, try to start it now */
+ if (WalPrefetcherPID == 0)
+ WalPrefetcherPID = StartWalPrefetcher();
+
/*
* Start the archiver if we're responsible for (re-)archiving received
* files.
@@ -5361,6 +5411,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalPrefetcherProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL prefetcher process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
diff --git a/src/backend/postmaster/walprefetcher.c b/src/backend/postmaster/walprefetcher.c
new file mode 100644
index 0000000..eba4143
--- /dev/null
+++ b/src/backend/postmaster/walprefetcher.c
@@ -0,0 +1,569 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprefetcher.c
+ *
+ * Replaying WAL is done by a single process, which may cause slow recovery
+ * and lag between master and replica.
+ *
+ * The prefetcher tries to preload into the OS file cache the blocks referenced
+ * by WAL records, to speed up recovery.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walprefetcher.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <signal.h>
+
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "access/xlogreader.h"
+#include "access/xlogrecord.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/walprefetcher.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/buf_internals.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/memutils.h"
+
+
+/*
+ * GUC parameters
+ */
+int WalPrefetchLead = 0;
+int WalPrefetchPollInterval = 1000;
+bool WalPrefetchEnabled = false;
+
+/*
+ * Flags set by interrupt handlers for later service in the main loop.
+ */
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t shutdown_requested = false;
+
+/* Signal handlers */
+static void WpfQuickDie(SIGNAL_ARGS);
+static void WpfSigHupHandler(SIGNAL_ARGS);
+static void WpfShutdownHandler(SIGNAL_ARGS);
+static void WpfSigusr1Handler(SIGNAL_ARGS);
+
+/*
+ * Main entry point for walprefetcher background worker
+ */
+void
+WalPrefetcherMain()
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext walprefetcher_context;
+ int rc;
+
+ pqsignal(SIGHUP, WpfSigHupHandler); /* set flag to read config file */
+ pqsignal(SIGINT, WpfShutdownHandler); /* request shutdown */
+ pqsignal(SIGTERM, WpfShutdownHandler); /* request shutdown */
+ pqsignal(SIGQUIT, WpfQuickDie); /* hard crash time */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, WpfSigusr1Handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+ pqsignal(SIGTTIN, SIG_DFL);
+ pqsignal(SIGTTOU, SIG_DFL);
+ pqsignal(SIGCONT, SIG_DFL);
+ pqsignal(SIGWINCH, SIG_DFL);
+
+ /* We allow SIGQUIT (quickdie) at all times */
+ sigdelset(&BlockSig, SIGQUIT);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks. Formerly this code just ran in
+ * TopMemoryContext, but resetting that would be a really bad idea.
+ */
+ walprefetcher_context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Prefetcher",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(walprefetcher_context);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ *
+ * This code is heavily based on bgwriter.c, q.v.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ pgstat_report_wait_end();
+ AtEOXact_Files(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(walprefetcher_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(walprefetcher_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ /*
+ * Process any requests or signals received recently.
+ */
+ if (got_SIGHUP)
+ {
+ got_SIGHUP = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+ if (shutdown_requested)
+ {
+ /* Normal exit from the walprefetcher is here */
+ proc_exit(0); /* done */
+ }
+
+ if (WalPrefetchEnabled)
+ WalPrefetch(InvalidXLogRecPtr);
+
+ /*
+ * Sleep until we are signaled
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_POSTMASTER_DEATH,
+ -1,
+ WAIT_EVENT_WAL_PREFETCHER_MAIN);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ exit(1);
+ }
+}
+
+
+/* --------------------------------
+ * signal handler routines
+ * --------------------------------
+ */
+
+/*
+ * WpfQuickDie() occurs when signalled SIGQUIT by the postmaster.
+ *
+ * Some backend has bought the farm,
+ * so we need to stop what we're doing and exit.
+ */
+static void
+WpfQuickDie(SIGNAL_ARGS)
+{
+ PG_SETMASK(&BlockSig);
+
+ /*
+ * We DO NOT want to run proc_exit() callbacks -- we're here because
+ * shared memory may be corrupted, so we don't want to try to clean up our
+ * transaction. Just nail the windows shut and get out of town. Now that
+ * there's an atexit callback to prevent third-party code from breaking
+ * things by calling exit() directly, we have to reset the callbacks
+ * explicitly to make this work as intended.
+ */
+ on_exit_reset();
+
+ /*
+ * Note we do exit(2) not exit(0). This is to force the postmaster into a
+ * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+ * backend. This is necessary precisely because we don't clean up our
+ * shared memory state. (The "dead man switch" mechanism in pmsignal.c
+ * should ensure the postmaster sees this as a crash, too, but no harm in
+ * being doubly sure.)
+ */
+ exit(2);
+}
+
+/* SIGHUP: set flag to re-read config file at next convenient time */
+static void
+WpfSigHupHandler(SIGNAL_ARGS)
+{
+ got_SIGHUP = true;
+ SetLatch(MyLatch);
+}
+
+/* SIGTERM: set flag to exit normally */
+static void
+WpfShutdownHandler(SIGNAL_ARGS)
+{
+ shutdown_requested = true;
+ SetLatch(MyLatch);
+}
+
+/* SIGUSR1: used for latch wakeups */
+static void
+WpfSigusr1Handler(SIGNAL_ARGS)
+{
+ latch_sigusr1_handler();
+}
+
+/*
+ * Now wal prefetch code itself.
+ */
+static int
+WalReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr,
+ int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
+ TimeLineID *pageTLI);
+
+#define BLOCK_HASH_SIZE 4001 /* Size of the block LRU cache; maybe it would be better to use information about free RAM instead of a hardcoded constant */
+#define FILE_HASH_SIZE 64 /* Size of opened files hash */
+#define STAT_REFRESH_PERIOD 1024 /* Refresh backend status rate */
+
+/*
+ * Block LRU hash table is used to keep information about most recently prefetched blocks.
+ */
+typedef struct BlockHashEntry
+{
+ struct BlockHashEntry* next;
+ struct BlockHashEntry* prev;
+ struct BlockHashEntry* collision;
+ BufferTag tag;
+ uint32 hash;
+} BlockHashEntry;
+
+static BlockHashEntry* block_hash_table[BLOCK_HASH_SIZE];
+static size_t block_hash_size;
+static BlockHashEntry lru = {&lru, &lru};
+static TimeLineID replay_timeline;
+
+/*
+ * Yet another L2-list implementation
+ */
+static void
+unlink_block(BlockHashEntry* entry)
+{
+ entry->next->prev = entry->prev;
+ entry->prev->next = entry->next;
+}
+
+static void
+link_block_after(BlockHashEntry* head, BlockHashEntry* entry)
+{
+ entry->next = head->next;
+ entry->prev = head;
+ head->next->prev = entry;
+ head->next = entry;
+}
+
+/*
+ * Put block in LRU hash or link it to the head of LRU list. Returns true if block was not present in hash, false otherwise.
+ */
+static bool
+put_block_in_cache(BufferTag* tag)
+{
+ uint32 hash;
+ BlockHashEntry* entry;
+
+ hash = BufTableHashCode(tag) % BLOCK_HASH_SIZE;
+ for (entry = block_hash_table[hash]; entry != NULL; entry = entry->collision)
+ {
+ if (BUFFERTAGS_EQUAL(entry->tag, *tag))
+ {
+ unlink_block(entry);
+ link_block_after(&lru, entry);
+ return false;
+ }
+ }
+ if (block_hash_size == BLOCK_HASH_SIZE)
+ {
+ BlockHashEntry* victim = lru.prev;
+ BlockHashEntry** epp = &block_hash_table[victim->hash];
+ while (*epp != victim)
+ epp = &(*epp)->collision;
+ *epp = (*epp)->collision;
+ unlink_block(victim);
+ entry = victim;
+ }
+ else
+ {
+ entry = (BlockHashEntry*)palloc(sizeof(BlockHashEntry));
+ block_hash_size += 1;
+ }
+ entry->tag = *tag;
+ entry->hash = hash;
+ entry->collision = block_hash_table[hash];
+ block_hash_table[hash] = entry;
+ link_block_after(&lru, entry);
+
+ return true;
+}
+
+/*
+ * Hash of opened files. It seems simpler to maintain our own cache than to provide an SMgrRelation for the smgr functions.
+ */
+typedef struct FileHashEntry
+{
+ BufferTag tag;
+ File file;
+} FileHashEntry;
+
+static FileHashEntry file_hash_table[FILE_HASH_SIZE];
+
+static File
+WalOpenFile(BufferTag* tag)
+{
+ BufferTag segment_tag = *tag;
+ uint32 hash;
+ char* path;
+ File file;
+
+ /* Transform block number into segment number */
+ segment_tag.blockNum /= RELSEG_SIZE;
+ hash = BufTableHashCode(&segment_tag) % FILE_HASH_SIZE;
+
+ if (BUFFERTAGS_EQUAL(file_hash_table[hash].tag, segment_tag))
+ return file_hash_table[hash].file;
+
+ path = relpathperm(tag->rnode, tag->forkNum);
+ if (segment_tag.blockNum > 0)
+ {
+ char* fullpath = psprintf("%s.%d", path, segment_tag.blockNum);
+ pfree(path);
+ path = fullpath;
+ }
+ file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+ pfree(path);
+
+ if (file >= 0)
+ {
+ if (file_hash_table[hash].tag.rnode.dbNode != 0)
+ FileClose(file_hash_table[hash].file);
+
+ file_hash_table[hash].file = file;
+ file_hash_table[hash].tag = segment_tag;
+ }
+ return file;
+}
+
+/*
+ * Our backend doesn't receive any notifications about WAL progress, so we have to use sleep
+ * to wait until requested information is available
+ */
+static void
+WalWaitWAL(void)
+{
+ int rc;
+ CHECK_FOR_INTERRUPTS();
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ WalPrefetchPollInterval,
+ WAIT_EVENT_WAL_PREFETCHER_MAIN);
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ exit(1);
+
+}
+
+/*
+ * Main function: perform prefetch of blocks referenced by WAL records starting from given LSN or from WAL replay position if lsn=0
+ */
+void
+WalPrefetch(XLogRecPtr lsn)
+{
+ XLogReaderState *xlogreader;
+ long n_prefetched = 0;
+ bool startup_recovery = true;
+
+ /* Dirty hack: prevent recovery conflict */
+ MyPgXact->xmin = InvalidTransactionId;
+
+ memset(file_hash_table, 0, sizeof file_hash_table);
+ memset(block_hash_table, 0, sizeof block_hash_table);
+
+ xlogreader = XLogReaderAllocate(wal_segment_size, &WalReadPage, NULL);
+
+ if (!xlogreader)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ if (lsn == InvalidXLogRecPtr)
+ lsn = GetXLogReplayRecPtr(NULL); /* Start with replay LSN */
+
+ while (true)
+ {
+ char *errormsg;
+ int block_id;
+ XLogRecPtr replay_lsn = GetXLogReplayRecPtr(&replay_timeline);
+ XLogRecord *record;
+
+ /*
+ * If the current position is behind the current replay LSN, then move it forward: we do not want to do useless work and prefetch
+ * blocks for already-processed WAL records
+ */
+ if (lsn != InvalidXLogRecPtr || replay_lsn >= xlogreader->EndRecPtr)
+ {
+ XLogRecPtr prefetch_lsn = replay_lsn != InvalidXLogRecPtr
+ ? XLogFindNextRecord(xlogreader, Max(lsn, replay_lsn) + WalPrefetchLead) : InvalidXLogRecPtr;
+ if (prefetch_lsn == InvalidXLogRecPtr)
+ {
+ WalWaitWAL();
+ continue;
+ }
+ lsn = prefetch_lsn;
+ }
+
+ record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+ if (record != NULL)
+ {
+ lsn = InvalidXLogRecPtr; /* continue with next record */
+
+ /* Loop through blocks referenced by this WAL record */
+ for (block_id = 0; block_id <= xlogreader->max_block_id; block_id++)
+ {
+ BufferTag tag;
+ File file;
+
+ /* Do not prefetch full pages */
+ if (!XLogRecGetBlockTag(xlogreader, block_id, &tag.rnode, &tag.forkNum, &tag.blockNum)
+ || xlogreader->blocks[block_id].has_image)
+ continue;
+
+ /* Check if block already prefetched */
+ if (!put_block_in_cache(&tag))
+ continue;
+
+ file = WalOpenFile(&tag);
+ if (file >= 0)
+ {
+ off_t offs = (off_t) BLCKSZ * (tag.blockNum % ((BlockNumber) RELSEG_SIZE));
+ int rc = FilePrefetch(file, offs, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
+ if (rc != 0)
+ elog(ERROR, "Failed to prefetch file: %m");
+ else if (++n_prefetched % STAT_REFRESH_PERIOD == 0)
+ {
+ char buf[1024];
+ sprintf(buf, "Prefetch %ld blocks at LSN %lx, replay LSN %lx", n_prefetched, xlogreader->EndRecPtr, replay_lsn);
+ pgstat_report_activity(STATE_RUNNING, buf);
+ elog(DEBUG1, "%s", buf);
+ }
+ }
+ else
+ elog(LOG, "File segment doesn't exists");
+ }
+ }
+ else
+ WalWaitWAL();
+ }
+}
+
+/*
+ * Almost a copy of read_local_xlog_page from xlogutils.c, but it reads up to the flush position of the WAL receiver, rather than the replay position.
+ */
+static int
+WalReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr,
+ int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
+ TimeLineID *pageTLI)
+{
+ XLogRecPtr read_upto,
+ loc;
+ int count;
+
+ loc = targetPagePtr + reqLen;
+
+ /* Loop waiting for xlog to be available if necessary */
+ while (1)
+ {
+ read_upto = WalRcv->walRcvState == WALRCV_STOPPED ? (XLogRecPtr)-1 : WalRcv->receivedUpto;
+ *pageTLI = replay_timeline;
+
+ if (loc <= read_upto)
+ break;
+
+ WalWaitWAL();
+ CHECK_FOR_INTERRUPTS();
+ pg_usleep(1000L);
+ }
+
+ if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have caller
+ * come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ }
+ else if (targetPagePtr + reqLen > read_upto)
+ {
+ /* not enough data there */
+ return -1;
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = read_upto - targetPagePtr;
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ XLogRead(cur_page, state->wal_segment_size, *pageTLI, targetPagePtr, XLOG_BLCKSZ);
+
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e47ddca..6319ed5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -252,7 +252,7 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static void WalSndRead(char *buf, XLogRecPtr startptr, Size count);
/* Initialize walsender process before entering the main command loop */
@@ -771,7 +771,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
count = flushptr - targetPagePtr; /* part of the page available */
/* now actually read the data, we know it's there */
- XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+ WalSndRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
return count;
}
@@ -2314,7 +2314,7 @@ WalSndKill(int code, Datum arg)
* more than one.
*/
static void
-XLogRead(char *buf, XLogRecPtr startptr, Size count)
+WalSndRead(char *buf, XLogRecPtr startptr, Size count)
{
char *p;
XLogRecPtr recptr;
@@ -2710,7 +2710,7 @@ XLogSendPhysical(void)
* calls.
*/
enlargeStringInfo(&output_message, nbytes);
- XLogRead(&output_message.data[output_message.len], startptr, nbytes);
+ WalSndRead(&output_message.data[output_message.len], startptr, nbytes);
output_message.len += nbytes;
output_message.data[output_message.len] = '\0';
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fa3c8a7..3c55f51 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -61,6 +61,7 @@
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walprefetcher.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -1823,6 +1824,17 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"wal_prefetch_enabled", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Allow prefetch of blocks referenced by WAL records."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &WalPrefetchEnabled,
+ false,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -2487,6 +2499,27 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"wal_prefetcher_lead", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Lead before WAL replay LSN and prefetch LSNr."),
+ NULL
+ },
+ &WalPrefetchLead,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_prefetch_poll_interval", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Interval of polling WAl by WAL prefetcher."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &WalPrefetchPollInterval,
+ 1000, 1, 10000,
+ NULL, NULL, NULL
+ },
+
+ {
{"wal_writer_delay", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Time between WAL flushes performed in the WAL writer."),
NULL,
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f307b63..70eac88 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -212,9 +212,7 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
/* Invalidate read state */
extern void XLogReaderInvalReadState(XLogReaderState *state);
-#ifdef FRONTEND
extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
-#endif /* FRONTEND */
/* Functions for decoding an XLogRecord */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index c406699..c2b8a6f 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -55,4 +55,7 @@ extern int read_local_xlog_page(XLogReaderState *state,
extern void XLogReadDetermineTimeline(XLogReaderState *state,
XLogRecPtr wantPage, uint32 wantLength);
+extern void XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
+ Size count);
+
#endif
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e167ee8..5f8b67d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -400,6 +400,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalPrefetcherProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -412,6 +413,7 @@ extern AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalPrefetcherProcess() (MyAuxProcType == WalPrefetcherProcess)
/*****************************************************************************
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f592..20ab699 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -710,7 +710,8 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
- B_WAL_WRITER
+ B_WAL_WRITER,
+ B_WAL_PREFETCHER
} BackendType;
@@ -767,7 +768,8 @@ typedef enum
WAIT_EVENT_SYSLOGGER_MAIN,
WAIT_EVENT_WAL_RECEIVER_MAIN,
WAIT_EVENT_WAL_SENDER_MAIN,
- WAIT_EVENT_WAL_WRITER_MAIN
+ WAIT_EVENT_WAL_WRITER_MAIN,
+ WAIT_EVENT_WAL_PREFETCHER_MAIN
} WaitEventActivity;
/* ----------
diff --git a/src/include/postmaster/walprefetcher.h b/src/include/postmaster/walprefetcher.h
new file mode 100644
index 0000000..82a6010
--- /dev/null
+++ b/src/include/postmaster/walprefetcher.h
@@ -0,0 +1,23 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprefetcher.h
+ * Exports from postmaster/walprefetcher.c.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/walprefetcher.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _WALPREFETCHER_H
+#define _WALPREFETCHER_H
+
+/* GUC options */
+extern int WalPrefetchLead;
+extern int WalPrefetchPollInterval;
+extern bool WalPrefetchEnabled;
+
+extern void WalPrefetcherMain(void) pg_attribute_noreturn();
+extern void WalPrefetch(XLogRecPtr lsn);
+
+#endif /* _WALPREFETCHER_H */
On 06/21/2018 04:01 PM, Konstantin Knizhnik wrote:
I continue my experiments with WAL prefetch.
I have embedded prefetch in Postgres: now the walprefetcher is started
together with the startup process and is able to help it speed up recovery.
The patch is attached.

Unfortunately the result is negative (at least at my desktop: SSD, 16Gb
RAM). Recovery with prefetch is 3 times slower than without it.

What I am doing:

Configuration:
max_wal_size=min_wal_size=10Gb,
shared_buffers = 1Gb
Database:
pgbench -i -s 1000
Test:
pgbench -c 10 -M prepared -N -T 100 -P 1
pkill postgres
echo 3 > /proc/sys/vm/drop_caches
time pg_ctl -t 1000 -D pgsql -l logfile start

Without prefetch it is 19 seconds (recovered about 4Gb of WAL), with
prefetch it is about one minute. About 400k blocks are prefetched.
CPU usage is small (<20%), and both processes are in "Ds" state.
Based on a quick test, my guess is that the patch is broken in several
ways. Firstly, with the patch attached (and wal_prefetch_enabled=on,
which I think is needed to enable the prefetch) I can't even restart the
server, because pg_ctl restart just hangs (the walprefetcher process
gets stuck in WaitForWAL, IIRC).
I have added an elog(LOG,...) to walprefetcher.c, right before the
FilePrefetch call, and (a) I don't see any actual prefetch calls during
recovery but (b) I do see the prefetch happening during the pgbench.
That seems a bit ... wrong?
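A debug line of that kind, placed just before the FilePrefetch() call in
WalPrefetch(), might look like the following sketch; the message text here
is only illustrative, not the exact line that was used:

    elog(LOG, "walprefetcher: prefetching block %u of relation %u, fork %d",
         tag.blockNum, tag.rnode.relNode, tag.forkNum);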
Furthermore, you've added an extra
signal_child(BgWriterPID, SIGHUP);
to SIGHUP_handler, which seems like a bug too. I don't have time to
investigate/debug this further.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 21.06.2018 19:57, Tomas Vondra wrote:
[...]
Sorry, updated version of the patch is attached.
Please also notice that you can check the number of prefetched pages using
pg_stat_activity - it is reported for the walprefetcher process.
Concerning the fact that you did not see prefetches at recovery time:
please check that min_wal_size and max_wal_size are large enough and that
pgbench (or whatever else) committed large enough changes, so that recovery
will take some time.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
walprefetch-2.patch (text/x-patch)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 1b000a2..9730b42 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -879,13 +879,6 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
return true;
}
-#ifdef FRONTEND
-/*
- * Functions that are currently not needed in the backend, but are better
- * implemented inside xlogreader.c because of the internal facilities available
- * here.
- */
-
/*
* Find the first record with an lsn >= RecPtr.
*
@@ -1004,9 +997,6 @@ out:
return found;
}
-#endif /* FRONTEND */
-
-
/* ----------------------------------------
* Functions for decoding the data and block references in a record.
* ----------------------------------------
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 52fe55e..7847311 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -652,7 +652,7 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
* in walsender.c but for small differences (such as lack of elog() in
* frontend). Probably these should be merged at some point.
*/
-static void
+void
XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
Size count)
{
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 7e34bee..e492715 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -32,6 +32,7 @@
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
+#include "postmaster/walprefetcher.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
@@ -335,6 +336,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
case WalReceiverProcess:
statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
break;
+ case WalPrefetcherProcess:
+ statmsg = pgstat_get_backend_desc(B_WAL_PREFETCHER);
+ break;
default:
statmsg = "??? process";
break;
@@ -462,6 +466,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
WalReceiverMain();
proc_exit(1); /* should never return */
+ case WalPrefetcherProcess:
+ /* don't set signals, walprefetcher has its own agenda */
+ WalPrefetcherMain();
+ proc_exit(1); /* should never return */
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..13e5066 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
- pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+ pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o walprefetcher.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 084573e..7195578 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2870,6 +2870,9 @@ pgstat_bestart(void)
case WalReceiverProcess:
beentry->st_backendType = B_WAL_RECEIVER;
break;
+ case WalPrefetcherProcess:
+ beentry->st_backendType = B_WAL_PREFETCHER;
+ break;
default:
elog(FATAL, "unrecognized process type: %d",
(int) MyAuxProcType);
@@ -3519,6 +3522,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_WAL_WRITER_MAIN:
event_name = "WalWriterMain";
break;
+ case WAIT_EVENT_WAL_PREFETCHER_MAIN:
+ event_name = "WalPrefetcherMain";
+ break;
/* no default case, so that compiler will warn */
}
@@ -4126,6 +4132,9 @@ pgstat_get_backend_desc(BackendType backendType)
case B_WAL_RECEIVER:
backendDesc = "walreceiver";
break;
+ case B_WAL_PREFETCHER:
+ backendDesc = "walprefetcher";
+ break;
case B_WAL_SENDER:
backendDesc = "walsender";
break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a4b53b3..1f3598d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -254,7 +254,9 @@ static pid_t StartupPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
PgStatPID = 0,
- SysLoggerPID = 0;
+ SysLoggerPID = 0,
+ WalPrefetcherPID = 0
+;
/* Startup process's status */
typedef enum
@@ -362,6 +364,9 @@ static volatile bool avlauncher_needs_signal = false;
/* received START_WALRECEIVER signal */
static volatile sig_atomic_t WalReceiverRequested = false;
+/* received START_WALPREFETCHER signal */
+static volatile sig_atomic_t WalPrefetcherRequested = false;
+
/* set when there's a worker that needs to be started up */
static volatile bool StartWorkerNeeded = true;
static volatile bool HaveCrashedWorker = false;
@@ -549,6 +554,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartupDataBase() StartChildProcess(StartupProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartWalPrefetcher() StartChildProcess(WalPrefetcherProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1373,6 +1379,9 @@ PostmasterMain(int argc, char *argv[])
StartupStatus = STARTUP_RUNNING;
pmState = PM_STARTUP;
+ /* Start the WAL prefetcher now because it may speed up WAL redo */
+ WalPrefetcherPID = StartWalPrefetcher();
+
/* Some workers may be scheduled to start now */
maybe_start_bgworkers();
@@ -2535,6 +2544,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (WalPrefetcherPID != 0)
+ signal_child(WalPrefetcherPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -2685,6 +2696,8 @@ pmdie(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalPrefetcherPID != 0)
+ signal_child(WalPrefetcherPID, SIGTERM);
if (pmState == PM_RECOVERY)
{
SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER);
@@ -2864,6 +2877,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (WalPrefetcherPID == 0)
+ WalPrefetcherPID = StartWalPrefetcher();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -2967,6 +2982,20 @@ reaper(SIGNAL_ARGS)
}
/*
+ * Was it the wal prefetcher? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalPrefetcherPID)
+ {
+ WalPrefetcherPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL prefetcher process"));
+ continue;
+ }
+
+ /*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
* necessary. Any other exit condition is treated as a crash.
@@ -3451,6 +3480,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(WalWriterPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the walprefetcher too */
+ if (pid == WalPrefetcherPID)
+ WalPrefetcherPID = 0;
+ else if (WalPrefetcherPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) WalPrefetcherPID)));
+ signal_child(WalPrefetcherPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
/* Take care of the walreceiver too */
if (pid == WalReceiverPID)
WalReceiverPID = 0;
@@ -3657,6 +3698,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_WORKER) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalPrefetcherPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3757,6 +3799,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(WalPrefetcherPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -3946,6 +3989,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalPrefetcherPID != 0)
+ signal_child(WalPrefetcherPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5041,6 +5086,10 @@ sigusr1_handler(SIGNAL_ARGS)
Assert(BgWriterPID == 0);
BgWriterPID = StartBackgroundWriter();
+ /* WAL prefetcher is expected to be started earlier but if not, try to start it now */
+ if (WalPrefetcherPID == 0)
+ WalPrefetcherPID = StartWalPrefetcher();
+
/*
* Start the archiver if we're responsible for (re-)archiving received
* files.
@@ -5361,6 +5410,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalPrefetcherProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL prefetcher process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
diff --git a/src/backend/postmaster/walprefetcher.c b/src/backend/postmaster/walprefetcher.c
new file mode 100644
index 0000000..1606e60
--- /dev/null
+++ b/src/backend/postmaster/walprefetcher.c
@@ -0,0 +1,576 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprefetcher.c
+ *
+ * WAL replay is done by a single process, which may cause slow recovery
+ * and lag between master and replica.
+ *
+ * The prefetcher tries to preload into the OS file cache the blocks
+ * referenced by WAL records, to speed up recovery.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walprefetcher.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <signal.h>
+
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "access/xlogreader.h"
+#include "access/xlogrecord.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/walprefetcher.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/buf_internals.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/memutils.h"
+
+
+/*
+ * GUC parameters
+ */
+int WalPrefetchLead = 0;
+int WalPrefetchPollInterval = 1000;
+bool WalPrefetchEnabled = false;
+
+/*
+ * Flags set by interrupt handlers for later service in the main loop.
+ */
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t shutdown_requested = false;
+
+/* Signal handlers */
+static void WpfQuickDie(SIGNAL_ARGS);
+static void WpfSigHupHandler(SIGNAL_ARGS);
+static void WpfShutdownHandler(SIGNAL_ARGS);
+static void WpfSigusr1Handler(SIGNAL_ARGS);
+
+/*
+ * Main entry point for walprefetcher background worker
+ */
+void
+WalPrefetcherMain()
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext walprefetcher_context;
+ int rc;
+
+ pqsignal(SIGHUP, WpfSigHupHandler); /* set flag to read config file */
+ pqsignal(SIGINT, WpfShutdownHandler); /* request shutdown */
+ pqsignal(SIGTERM, WpfShutdownHandler); /* request shutdown */
+ pqsignal(SIGQUIT, WpfQuickDie); /* hard crash time */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, WpfSigusr1Handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+ pqsignal(SIGTTIN, SIG_DFL);
+ pqsignal(SIGTTOU, SIG_DFL);
+ pqsignal(SIGCONT, SIG_DFL);
+ pqsignal(SIGWINCH, SIG_DFL);
+
+ /* We allow SIGQUIT (quickdie) at all times */
+ sigdelset(&BlockSig, SIGQUIT);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks. Formerly this code just ran in
+ * TopMemoryContext, but resetting that would be a really bad idea.
+ */
+ walprefetcher_context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Prefetcher",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(walprefetcher_context);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ *
+ * This code is heavily based on bgwriter.c, q.v.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ pgstat_report_wait_end();
+ AtEOXact_Files(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(walprefetcher_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(walprefetcher_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ /*
+ * Process any requests or signals received recently.
+ */
+ if (got_SIGHUP)
+ {
+ got_SIGHUP = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+ if (shutdown_requested)
+ {
+ /* Normal exit from the walprefetcher is here */
+ proc_exit(0); /* done */
+ }
+
+ if (WalPrefetchEnabled)
+ WalPrefetch(InvalidXLogRecPtr);
+
+ /*
+ * Sleep until we are signaled
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_POSTMASTER_DEATH,
+ -1,
+ WAIT_EVENT_WAL_PREFETCHER_MAIN);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ exit(1);
+ }
+}
+
+
+/* --------------------------------
+ * signal handler routines
+ * --------------------------------
+ */
+
+/*
+ * WpfQuickDie() occurs when signalled SIGQUIT by the postmaster.
+ *
+ * Some backend has bought the farm,
+ * so we need to stop what we're doing and exit.
+ */
+static void
+WpfQuickDie(SIGNAL_ARGS)
+{
+ PG_SETMASK(&BlockSig);
+
+ /*
+ * We DO NOT want to run proc_exit() callbacks -- we're here because
+ * shared memory may be corrupted, so we don't want to try to clean up our
+ * transaction. Just nail the windows shut and get out of town. Now that
+ * there's an atexit callback to prevent third-party code from breaking
+ * things by calling exit() directly, we have to reset the callbacks
+ * explicitly to make this work as intended.
+ */
+ on_exit_reset();
+
+ /*
+ * Note we do exit(2) not exit(0). This is to force the postmaster into a
+ * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+ * backend. This is necessary precisely because we don't clean up our
+ * shared memory state. (The "dead man switch" mechanism in pmsignal.c
+ * should ensure the postmaster sees this as a crash, too, but no harm in
+ * being doubly sure.)
+ */
+ exit(2);
+}
+
+/* SIGHUP: set flag to re-read config file at next convenient time */
+static void
+WpfSigHupHandler(SIGNAL_ARGS)
+{
+ got_SIGHUP = true;
+ SetLatch(MyLatch);
+}
+
+/* SIGTERM: set flag to exit normally */
+static void
+WpfShutdownHandler(SIGNAL_ARGS)
+{
+ shutdown_requested = true;
+ SetLatch(MyLatch);
+}
+
+/* SIGUSR1: used for latch wakeups */
+static void
+WpfSigusr1Handler(SIGNAL_ARGS)
+{
+ latch_sigusr1_handler();
+}
+
+/*
+ * Now the WAL prefetch code itself.
+ */
+static int
+WalReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr,
+ int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
+ TimeLineID *pageTLI);
+
+#define BLOCK_HASH_SIZE 4001 /* Size of block LRU cache; maybe it would be better to use information about free RAM instead of a hardcoded constant */
+#define FILE_HASH_SIZE 64 /* Size of opened files hash */
+#define STAT_REFRESH_PERIOD 1024 /* Refresh backend status rate */
+
+/*
+ * Block LRU hash table is used to keep information about most recently prefetched blocks.
+ */
+typedef struct BlockHashEntry
+{
+ struct BlockHashEntry* next;
+ struct BlockHashEntry* prev;
+ struct BlockHashEntry* collision;
+ BufferTag tag;
+ uint32 hash;
+} BlockHashEntry;
+
+static BlockHashEntry* block_hash_table[BLOCK_HASH_SIZE];
+static size_t block_hash_size;
+static BlockHashEntry lru = {&lru, &lru};
+static TimeLineID replay_timeline;
+
+/*
+ * Yet another L2-list implementation
+ */
+static void
+unlink_block(BlockHashEntry* entry)
+{
+ entry->next->prev = entry->prev;
+ entry->prev->next = entry->next;
+}
+
+static void
+link_block_after(BlockHashEntry* head, BlockHashEntry* entry)
+{
+ entry->next = head->next;
+ entry->prev = head;
+ head->next->prev = entry;
+ head->next = entry;
+}
+
+/*
+ * Put block in LRU hash or link it to the head of LRU list. Returns true if block was not present in hash, false otherwise.
+ */
+static bool
+put_block_in_cache(BufferTag* tag)
+{
+ uint32 hash;
+ BlockHashEntry* entry;
+
+ hash = BufTableHashCode(tag) % BLOCK_HASH_SIZE;
+ for (entry = block_hash_table[hash]; entry != NULL; entry = entry->collision)
+ {
+ if (BUFFERTAGS_EQUAL(entry->tag, *tag))
+ {
+ unlink_block(entry);
+ link_block_after(&lru, entry);
+ return false;
+ }
+ }
+ if (block_hash_size == BLOCK_HASH_SIZE)
+ {
+ BlockHashEntry* victim = lru.prev;
+ BlockHashEntry** epp = &block_hash_table[victim->hash];
+ while (*epp != victim)
+ epp = &(*epp)->collision;
+ *epp = (*epp)->collision;
+ unlink_block(victim);
+ entry = victim;
+ }
+ else
+ {
+ entry = (BlockHashEntry*)palloc(sizeof(BlockHashEntry));
+ block_hash_size += 1;
+ }
+ entry->tag = *tag;
+ entry->hash = hash;
+ entry->collision = block_hash_table[hash];
+ block_hash_table[hash] = entry;
+ link_block_after(&lru, entry);
+
+ return true;
+}
+
+/*
+ * Hash of opened files. It seems simpler to maintain our own cache than to provide an SMgrRelation for the smgr functions.
+ */
+typedef struct FileHashEntry
+{
+ BufferTag tag;
+ File file;
+} FileHashEntry;
+
+static FileHashEntry file_hash_table[FILE_HASH_SIZE];
+
+static File
+WalOpenFile(BufferTag* tag)
+{
+ BufferTag segment_tag = *tag;
+ uint32 hash;
+ char* path;
+ File file;
+
+ /* Transform block number into segment number */
+ segment_tag.blockNum /= RELSEG_SIZE;
+ hash = BufTableHashCode(&segment_tag) % FILE_HASH_SIZE;
+
+ if (BUFFERTAGS_EQUAL(file_hash_table[hash].tag, segment_tag))
+ return file_hash_table[hash].file;
+
+ path = relpathperm(tag->rnode, tag->forkNum);
+ if (segment_tag.blockNum > 0)
+ {
+ char* fullpath = psprintf("%s.%d", path, segment_tag.blockNum);
+ pfree(path);
+ path = fullpath;
+ }
+ file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+ pfree(path);
+
+ if (file >= 0)
+ {
+ if (file_hash_table[hash].tag.rnode.dbNode != 0)
+ FileClose(file_hash_table[hash].file);
+
+ file_hash_table[hash].file = file;
+ file_hash_table[hash].tag = segment_tag;
+ }
+ return file;
+}
+
+/*
+ * Our backend doesn't receive any notifications about WAL progress, so we have to use sleep
+ * to wait until requested information is available
+ */
+static void
+WalWaitWAL(void)
+{
+ int rc;
+ CHECK_FOR_INTERRUPTS();
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ WalPrefetchPollInterval,
+ WAIT_EVENT_WAL_PREFETCHER_MAIN);
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ exit(1);
+
+}
+
+/*
+ * Main function: perform prefetch of blocks referenced by WAL records starting from given LSN or from WAL replay position if lsn=0
+ */
+void
+WalPrefetch(XLogRecPtr lsn)
+{
+ XLogReaderState *xlogreader;
+ long n_prefetched = 0;
+
+ /* Dirty hack: prevent recovery conflict */
+ MyPgXact->xmin = InvalidTransactionId;
+
+ memset(file_hash_table, 0, sizeof file_hash_table);
+ memset(block_hash_table, 0, sizeof block_hash_table);
+
+ xlogreader = XLogReaderAllocate(wal_segment_size, &WalReadPage, NULL);
+
+ if (!xlogreader)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ if (lsn == InvalidXLogRecPtr)
+ lsn = GetXLogReplayRecPtr(NULL); /* Start with replay LSN */
+
+ while (!shutdown_requested)
+ {
+ char *errormsg;
+ int block_id;
+ XLogRecPtr replay_lsn = GetXLogReplayRecPtr(&replay_timeline);
+ XLogRecord *record;
+
+ /*
+ * If the current position is behind the current replay LSN, then move it forward: we do not want to do useless work and prefetch
+ * blocks for already-processed WAL records
+ */
+ if (lsn != InvalidXLogRecPtr || replay_lsn >= xlogreader->EndRecPtr)
+ {
+ XLogRecPtr prefetch_lsn = replay_lsn != InvalidXLogRecPtr
+ ? XLogFindNextRecord(xlogreader, Max(lsn, replay_lsn) + WalPrefetchLead) : InvalidXLogRecPtr;
+ if (prefetch_lsn == InvalidXLogRecPtr)
+ {
+ WalWaitWAL();
+ continue;
+ }
+ lsn = prefetch_lsn;
+ }
+
+ record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+ if (record != NULL)
+ {
+ lsn = InvalidXLogRecPtr; /* continue with next record */
+
+ /* Loop through blocks referenced by this WAL record */
+ for (block_id = 0; block_id <= xlogreader->max_block_id; block_id++)
+ {
+ BufferTag tag;
+ File file;
+
+ /* Do not prefetch full pages */
+ if (!XLogRecGetBlockTag(xlogreader, block_id, &tag.rnode, &tag.forkNum, &tag.blockNum)
+ || xlogreader->blocks[block_id].has_image)
+ continue;
+
+ /* Check if block already prefetched */
+ if (!put_block_in_cache(&tag))
+ continue;
+
+ file = WalOpenFile(&tag);
+ if (file >= 0)
+ {
+ off_t offs = (off_t) BLCKSZ * (tag.blockNum % ((BlockNumber) RELSEG_SIZE));
+ int rc = FilePrefetch(file, offs, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
+ if (rc != 0)
+ elog(ERROR, "Failed to prefetch file: %m");
+ else if (++n_prefetched % STAT_REFRESH_PERIOD == 0)
+ {
+ char buf[1024];
+ sprintf(buf, "Prefetch %ld blocks at LSN %lx, replay LSN %lx", n_prefetched, xlogreader->EndRecPtr, replay_lsn);
+ pgstat_report_activity(STATE_RUNNING, buf);
+ elog(DEBUG1, "%s", buf);
+ }
+ }
+ else
+ elog(LOG, "File segment doesn't exists");
+ }
+ }
+ else
+ WalWaitWAL();
+ }
+}
+
+/*
+ * Almost a copy of read_local_xlog_page from xlogutils.c, but it reads up to the flush position of the WAL receiver, rather than the replay position.
+ */
+static int
+WalReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr,
+ int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
+ TimeLineID *pageTLI)
+{
+ XLogRecPtr read_upto,
+ loc;
+ int count;
+
+ loc = targetPagePtr + reqLen;
+
+ /* Loop waiting for xlog to be available if necessary */
+ while (1)
+ {
+ /*
+ * If we perform recovery at startup, read until the end of WAL;
+ * otherwise, if there is an active WAL receiver at the replica, read until
+ * the end of the received data; if there is no active WAL receiver, just sleep.
+ */
+ read_upto = WalRcv->walRcvState == WALRCV_STOPPED
+ ? RecoveryInProgress() ? (XLogRecPtr)-1 : InvalidXLogRecPtr
+ : WalRcv->receivedUpto;
+ *pageTLI = replay_timeline;
+
+ if (loc <= read_upto)
+ break;
+
+ WalWaitWAL();
+ CHECK_FOR_INTERRUPTS();
+ if (shutdown_requested)
+ return -1;
+ }
+
+ if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have caller
+ * come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ }
+ else if (targetPagePtr + reqLen > read_upto)
+ {
+ /* not enough data there */
+ return -1;
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = read_upto - targetPagePtr;
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ XLogRead(cur_page, state->wal_segment_size, *pageTLI, targetPagePtr, XLOG_BLCKSZ);
+
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e47ddca..6319ed5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -252,7 +252,7 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static void WalSndRead(char *buf, XLogRecPtr startptr, Size count);
/* Initialize walsender process before entering the main command loop */
@@ -771,7 +771,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
count = flushptr - targetPagePtr; /* part of the page available */
/* now actually read the data, we know it's there */
- XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+ WalSndRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
return count;
}
@@ -2314,7 +2314,7 @@ WalSndKill(int code, Datum arg)
* more than one.
*/
static void
-XLogRead(char *buf, XLogRecPtr startptr, Size count)
+WalSndRead(char *buf, XLogRecPtr startptr, Size count)
{
char *p;
XLogRecPtr recptr;
@@ -2710,7 +2710,7 @@ XLogSendPhysical(void)
* calls.
*/
enlargeStringInfo(&output_message, nbytes);
- XLogRead(&output_message.data[output_message.len], startptr, nbytes);
+ WalSndRead(&output_message.data[output_message.len], startptr, nbytes);
output_message.len += nbytes;
output_message.data[output_message.len] = '\0';
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fa3c8a7..3c55f51 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -61,6 +61,7 @@
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walprefetcher.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -1823,6 +1824,17 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"wal_prefetch_enabled", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Allow prefetch of blocks referenced by WAL records."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &WalPrefetchEnabled,
+ false,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -2487,6 +2499,27 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"wal_prefetcher_lead", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Lead before WAL replay LSN and prefetch LSNr."),
+ NULL
+ },
+ &WalPrefetchLead,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_prefetch_poll_interval", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Interval of polling WAl by WAL prefetcher."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &WalPrefetchPollInterval,
+ 1000, 1, 10000,
+ NULL, NULL, NULL
+ },
+
+ {
{"wal_writer_delay", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Time between WAL flushes performed in the WAL writer."),
NULL,
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f307b63..70eac88 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -212,9 +212,7 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
/* Invalidate read state */
extern void XLogReaderInvalReadState(XLogReaderState *state);
-#ifdef FRONTEND
extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
-#endif /* FRONTEND */
/* Functions for decoding an XLogRecord */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index c406699..c2b8a6f 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -55,4 +55,7 @@ extern int read_local_xlog_page(XLogReaderState *state,
extern void XLogReadDetermineTimeline(XLogReaderState *state,
XLogRecPtr wantPage, uint32 wantLength);
+extern void XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
+ Size count);
+
#endif
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e167ee8..5f8b67d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -400,6 +400,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalPrefetcherProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -412,6 +413,7 @@ extern AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalPrefetcherProcess() (MyAuxProcType == WalPrefetcherProcess)
/*****************************************************************************
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f592..20ab699 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -710,7 +710,8 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
- B_WAL_WRITER
+ B_WAL_WRITER,
+ B_WAL_PREFETCHER
} BackendType;
@@ -767,7 +768,8 @@ typedef enum
WAIT_EVENT_SYSLOGGER_MAIN,
WAIT_EVENT_WAL_RECEIVER_MAIN,
WAIT_EVENT_WAL_SENDER_MAIN,
- WAIT_EVENT_WAL_WRITER_MAIN
+ WAIT_EVENT_WAL_WRITER_MAIN,
+ WAIT_EVENT_WAL_PREFETCHER_MAIN
} WaitEventActivity;
/* ----------
diff --git a/src/include/postmaster/walprefetcher.h b/src/include/postmaster/walprefetcher.h
new file mode 100644
index 0000000..82a6010
--- /dev/null
+++ b/src/include/postmaster/walprefetcher.h
@@ -0,0 +1,23 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprefetcher.h
+ * Exports from postmaster/walprefetcher.c.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/walprefetcher.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _WALPREFETCHER_H
+#define _WALPREFETCHER_H
+
+/* GUC options */
+extern int WalPrefetchLead;
+extern int WalPrefetchPollInterval;
+extern bool WalPrefetchEnabled;
+
+extern void WalPrefetcherMain(void) pg_attribute_noreturn();
+extern void WalPrefetch(XLogRecPtr lsn);
+
+#endif /* _WALPREFETCHER_H */
On 22.06.2018 11:35, Konstantin Knizhnik wrote:
[...]
I have improved my WAL prefetch patch. The main reason recovery slowed down
with prefetch enabled was that the prefetcher did not take initialized pages
(XLOG_HEAP_INIT_PAGE) into account and did not remember (cache) full page
writes.
The main differences in the new version of the patch:
1. Use effective_cache_size as the size of the cache of prefetched blocks.
2. Do not prefetch blocks that are already present in shared buffers.
3. Do not prefetch blocks of RM_HEAP_ID records with the XLOG_HEAP_INIT_PAGE
bit set (see the sketch after this list).
4. Remember new/FPW pages in the prefetch cache, to avoid prefetching them
for subsequent WAL records.
5. Add min/max prefetch lead parameters to make it possible to synchronize
the speed of prefetch with the speed of replay.
6. Increase the size of the open file cache to avoid redundant open/close
operations.
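For illustration, the filters for points 2 and 3 could look roughly like the
following inside the per-record block loop of WalPrefetch(). This is only a
sketch using the standard heapam_xlog.h and buf_internals.h APIs; the exact
code in walprefetch-3.patch may differ:

    /* 3. Skip heap records that (re)initialize the page: the old page
     * content will not be read by redo, so there is nothing to prefetch. */
    if (XLogRecGetRmid(xlogreader) == RM_HEAP_ID &&
        (XLogRecGetInfo(xlogreader) & XLOG_HEAP_INIT_PAGE) != 0)
        continue;

    /* 2. Skip blocks that are already present in shared buffers. */
    {
        uint32      hashcode = BufTableHashCode(&tag);
        LWLock     *partitionLock = BufMappingPartitionLock(hashcode);
        int         buf_id;

        LWLockAcquire(partitionLock, LW_SHARED);
        buf_id = BufTableLookup(&tag, hashcode);
        LWLockRelease(partitionLock);

        if (buf_id >= 0)
            continue;           /* already cached, no need to prefetch */
    }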
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
walprefetch-3.patch (text/x-patch)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 1b000a2..9730b42 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -879,13 +879,6 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
return true;
}
-#ifdef FRONTEND
-/*
- * Functions that are currently not needed in the backend, but are better
- * implemented inside xlogreader.c because of the internal facilities available
- * here.
- */
-
/*
* Find the first record with an lsn >= RecPtr.
*
@@ -1004,9 +997,6 @@ out:
return found;
}
-#endif /* FRONTEND */
-
-
/* ----------------------------------------
* Functions for decoding the data and block references in a record.
* ----------------------------------------
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 52fe55e..7847311 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -652,7 +652,7 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
* in walsender.c but for small differences (such as lack of elog() in
* frontend). Probably these should be merged at some point.
*/
-static void
+void
XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
Size count)
{
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 7e34bee..e492715 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -32,6 +32,7 @@
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
+#include "postmaster/walprefetcher.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
@@ -335,6 +336,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
case WalReceiverProcess:
statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
break;
+ case WalPrefetcherProcess:
+ statmsg = pgstat_get_backend_desc(B_WAL_PREFETCHER);
+ break;
default:
statmsg = "??? process";
break;
@@ -462,6 +466,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
WalReceiverMain();
proc_exit(1); /* should never return */
+ case WalPrefetcherProcess:
+ /* don't set signals, walprefetcher has its own agenda */
+ WalPrefetcherMain();
+ proc_exit(1); /* should never return */
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..13e5066 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
- pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+ pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o walprefetcher.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 084573e..7195578 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2870,6 +2870,9 @@ pgstat_bestart(void)
case WalReceiverProcess:
beentry->st_backendType = B_WAL_RECEIVER;
break;
+ case WalPrefetcherProcess:
+ beentry->st_backendType = B_WAL_PREFETCHER;
+ break;
default:
elog(FATAL, "unrecognized process type: %d",
(int) MyAuxProcType);
@@ -3519,6 +3522,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_WAL_WRITER_MAIN:
event_name = "WalWriterMain";
break;
+ case WAIT_EVENT_WAL_PREFETCHER_MAIN:
+ event_name = "WalPrefetcherMain";
+ break;
/* no default case, so that compiler will warn */
}
@@ -4126,6 +4132,9 @@ pgstat_get_backend_desc(BackendType backendType)
case B_WAL_RECEIVER:
backendDesc = "walreceiver";
break;
+ case B_WAL_PREFETCHER:
+ backendDesc = "walprefetcher";
+ break;
case B_WAL_SENDER:
backendDesc = "walsender";
break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a4b53b3..1f3598d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -254,7 +254,9 @@ static pid_t StartupPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
PgStatPID = 0,
- SysLoggerPID = 0;
+ SysLoggerPID = 0,
+ WalPrefetcherPID = 0
+;
/* Startup process's status */
typedef enum
@@ -362,6 +364,9 @@ static volatile bool avlauncher_needs_signal = false;
/* received START_WALRECEIVER signal */
static volatile sig_atomic_t WalReceiverRequested = false;
+/* received START_WALPREFETCHER signal */
+static volatile sig_atomic_t WalPrefetcherRequested = false;
+
/* set when there's a worker that needs to be started up */
static volatile bool StartWorkerNeeded = true;
static volatile bool HaveCrashedWorker = false;
@@ -549,6 +554,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartupDataBase() StartChildProcess(StartupProcess)
#define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
+#define StartWalPrefetcher() StartChildProcess(WalPrefetcherProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
@@ -1373,6 +1379,9 @@ PostmasterMain(int argc, char *argv[])
StartupStatus = STARTUP_RUNNING;
pmState = PM_STARTUP;
+ /* Start the WAL prefetcher now because it may speed up WAL redo */
+ WalPrefetcherPID = StartWalPrefetcher();
+
/* Some workers may be scheduled to start now */
maybe_start_bgworkers();
@@ -2535,6 +2544,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGHUP);
if (CheckpointerPID != 0)
signal_child(CheckpointerPID, SIGHUP);
+ if (WalPrefetcherPID != 0)
+ signal_child(WalPrefetcherPID, SIGHUP);
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
@@ -2685,6 +2696,8 @@ pmdie(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalPrefetcherPID != 0)
+ signal_child(WalPrefetcherPID, SIGTERM);
if (pmState == PM_RECOVERY)
{
SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER);
@@ -2864,6 +2877,8 @@ reaper(SIGNAL_ARGS)
*/
if (CheckpointerPID == 0)
CheckpointerPID = StartCheckpointer();
+ if (WalPrefetcherPID == 0)
+ WalPrefetcherPID = StartWalPrefetcher();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
@@ -2967,6 +2982,20 @@ reaper(SIGNAL_ARGS)
}
/*
+ * Was it the wal prefetcher? Normal exit can be ignored; we'll start a
+ * new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalPrefetcherPID)
+ {
+ WalPrefetcherPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL prefetcher process"));
+ continue;
+ }
+
+ /*
* Was it the wal writer? Normal exit can be ignored; we'll start a
* new one at the next iteration of the postmaster's main loop, if
* necessary. Any other exit condition is treated as a crash.
@@ -3451,6 +3480,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(WalWriterPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the walprefetcher too */
+ if (pid == WalPrefetcherPID)
+ WalPrefetcherPID = 0;
+ else if (WalPrefetcherPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) WalPrefetcherPID)));
+ signal_child(WalPrefetcherPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
/* Take care of the walreceiver too */
if (pid == WalReceiverPID)
WalReceiverPID = 0;
@@ -3657,6 +3698,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_WORKER) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalPrefetcherPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3757,6 +3799,7 @@ PostmasterStateMachine(void)
Assert(WalReceiverPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
+ Assert(WalPrefetcherPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
/* syslogger is not considered here */
@@ -3946,6 +3989,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalPrefetcherPID != 0)
+ signal_child(WalPrefetcherPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5041,6 +5086,10 @@ sigusr1_handler(SIGNAL_ARGS)
Assert(BgWriterPID == 0);
BgWriterPID = StartBackgroundWriter();
+ /* The WAL prefetcher is normally started earlier, but if it is not running, try to start it now */
+ if (WalPrefetcherPID == 0)
+ WalPrefetcherPID = StartWalPrefetcher();
+
/*
* Start the archiver if we're responsible for (re-)archiving received
* files.
@@ -5361,6 +5410,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalPrefetcherProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL prefetcher process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
diff --git a/src/backend/postmaster/walprefetcher.c b/src/backend/postmaster/walprefetcher.c
new file mode 100644
index 0000000..3c8beba
--- /dev/null
+++ b/src/backend/postmaster/walprefetcher.c
@@ -0,0 +1,648 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprefetcher.c
+ *
+ * WAL replay is performed by a single process, which can make recovery slow
+ * and cause the replica to lag behind the master.
+ *
+ * The prefetcher tries to preload the blocks referenced by WAL records into
+ * the OS file cache in order to speed up recovery.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walprefetcher.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <signal.h>
+
+#include "access/heapam_xlog.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "access/xlogreader.h"
+#include "access/xlogrecord.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "optimizer/cost.h"
+#include "pgstat.h"
+#include "portability/instr_time.h"
+#include "postmaster/walprefetcher.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/buf_internals.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/memutils.h"
+
+#define KB (1024LL)
+/* #define DEBUG_PREFETCH 1 */
+
+#if DEBUG_PREFETCH
+#define LOG_LEVEL LOG
+#else
+#define LOG_LEVEL DEBUG1
+#endif
+
+/*
+ * GUC parameters
+ */
+int WalPrefetchMinLead = 0;
+int WalPrefetchMaxLead = 0;
+int WalPrefetchPollInterval = 1000;
+bool WalPrefetchEnabled = false;
+
+/*
+ * Flags set by interrupt handlers for later service in the main loop.
+ */
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t shutdown_requested = false;
+
+/* Signal handlers */
+static void WpfQuickDie(SIGNAL_ARGS);
+static void WpfSigHupHandler(SIGNAL_ARGS);
+static void WpfShutdownHandler(SIGNAL_ARGS);
+static void WpfSigusr1Handler(SIGNAL_ARGS);
+
+/*
+ * Main entry point for walprefetcher background worker
+ */
+void
+WalPrefetcherMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext walprefetcher_context;
+ int rc;
+
+ pqsignal(SIGHUP, WpfSigHupHandler); /* set flag to read config file */
+ pqsignal(SIGINT, WpfShutdownHandler); /* request shutdown */
+ pqsignal(SIGTERM, WpfShutdownHandler); /* request shutdown */
+ pqsignal(SIGQUIT, WpfQuickDie); /* hard crash time */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, WpfSigusr1Handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+ pqsignal(SIGTTIN, SIG_DFL);
+ pqsignal(SIGTTOU, SIG_DFL);
+ pqsignal(SIGCONT, SIG_DFL);
+ pqsignal(SIGWINCH, SIG_DFL);
+
+ /* We allow SIGQUIT (quickdie) at all times */
+ sigdelset(&BlockSig, SIGQUIT);
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks. Formerly this code just ran in
+ * TopMemoryContext, but resetting that would be a really bad idea.
+ */
+ walprefetcher_context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Prefetcher",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(walprefetcher_context);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ *
+ * This code is heavily based on bgwriter.c, q.v.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ pgstat_report_wait_end();
+ AtEOXact_Files(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(walprefetcher_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(walprefetcher_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep at least 1 second after any error. A write error is likely
+ * to be repeated, and we don't want to be filling the error logs as
+ * fast as we can.
+ */
+ pg_usleep(1000000L);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ /* Clear any already-pending wakeups */
+ ResetLatch(MyLatch);
+
+ /*
+ * Process any requests or signals received recently.
+ */
+ if (got_SIGHUP)
+ {
+ got_SIGHUP = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+ if (shutdown_requested)
+ {
+ /* Normal exit from the walprefetcher is here */
+ proc_exit(0); /* done */
+ }
+
+ if (WalPrefetchEnabled)
+ WalPrefetch(InvalidXLogRecPtr);
+
+ /*
+ * Sleep until we are signaled
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_POSTMASTER_DEATH,
+ -1,
+ WAIT_EVENT_WAL_PREFETCHER_MAIN);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ exit(1);
+ }
+}
+
+
+/* --------------------------------
+ * signal handler routines
+ * --------------------------------
+ */
+
+/*
+ * WpfQuickDie() occurs when signalled SIGQUIT by the postmaster.
+ *
+ * Some backend has bought the farm,
+ * so we need to stop what we're doing and exit.
+ */
+static void
+WpfQuickDie(SIGNAL_ARGS)
+{
+ PG_SETMASK(&BlockSig);
+
+ /*
+ * We DO NOT want to run proc_exit() callbacks -- we're here because
+ * shared memory may be corrupted, so we don't want to try to clean up our
+ * transaction. Just nail the windows shut and get out of town. Now that
+ * there's an atexit callback to prevent third-party code from breaking
+ * things by calling exit() directly, we have to reset the callbacks
+ * explicitly to make this work as intended.
+ */
+ on_exit_reset();
+
+ /*
+ * Note we do exit(2) not exit(0). This is to force the postmaster into a
+ * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+ * backend. This is necessary precisely because we don't clean up our
+ * shared memory state. (The "dead man switch" mechanism in pmsignal.c
+ * should ensure the postmaster sees this as a crash, too, but no harm in
+ * being doubly sure.)
+ */
+ exit(2);
+}
+
+/* SIGHUP: set flag to re-read config file at next convenient time */
+static void
+WpfSigHupHandler(SIGNAL_ARGS)
+{
+ got_SIGHUP = true;
+ SetLatch(MyLatch);
+}
+
+/* SIGTERM: set flag to exit normally */
+static void
+WpfShutdownHandler(SIGNAL_ARGS)
+{
+ shutdown_requested = true;
+ SetLatch(MyLatch);
+}
+
+/* SIGUSR1: used for latch wakeups */
+static void
+WpfSigusr1Handler(SIGNAL_ARGS)
+{
+ latch_sigusr1_handler();
+}
+
+/*
+ * Now wal prefetch code itself.
+ */
+static int
+WalReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr,
+ int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
+ TimeLineID *pageTLI);
+
+#define FILE_HASH_SIZE 1009 /* Size of opened files hash */
+#define STAT_REFRESH_PERIOD 1024 /* Refresh backend status rate */
+
+/*
+ * Block LRU hash table is used to keep information about most recently prefetched blocks.
+ */
+typedef struct BlockHashEntry
+{
+ struct BlockHashEntry* next;
+ struct BlockHashEntry* prev;
+ struct BlockHashEntry* collision;
+ BufferTag tag;
+ uint32 hash;
+} BlockHashEntry;
+
+static BlockHashEntry** block_hash_table;
+static size_t block_hash_size;
+static size_t block_hash_used;
+static BlockHashEntry lru = {&lru, &lru};
+static TimeLineID replay_timeline;
+
+/*
+ * Yet another L2-list implementation
+ */
+static void
+unlink_block(BlockHashEntry* entry)
+{
+ entry->next->prev = entry->prev;
+ entry->prev->next = entry->next;
+}
+
+static void
+link_block_after(BlockHashEntry* head, BlockHashEntry* entry)
+{
+ entry->next = head->next;
+ entry->prev = head;
+ head->next->prev = entry;
+ head->next = entry;
+}
+
+/*
+ * Put the block into the LRU hash, or move it to the head of the LRU list if it is already present. Returns true if the block was not in the hash, false otherwise.
+ */
+static bool
+put_block_in_cache(BufferTag* tag)
+{
+ uint32 hash;
+ BlockHashEntry* entry;
+
+ hash = BufTableHashCode(tag) % block_hash_size;
+ for (entry = block_hash_table[hash]; entry != NULL; entry = entry->collision)
+ {
+ if (BUFFERTAGS_EQUAL(entry->tag, *tag))
+ {
+ unlink_block(entry);
+ link_block_after(&lru, entry);
+ return false;
+ }
+ }
+ if (block_hash_size == block_hash_used)
+ {
+ BlockHashEntry* victim = lru.prev;
+ BlockHashEntry** epp = &block_hash_table[victim->hash];
+ while (*epp != victim)
+ epp = &(*epp)->collision;
+ *epp = (*epp)->collision;
+ unlink_block(victim);
+ entry = victim;
+ }
+ else
+ {
+ entry = (BlockHashEntry*)palloc(sizeof(BlockHashEntry));
+ block_hash_used += 1;
+ }
+ entry->tag = *tag;
+ entry->hash = hash;
+ entry->collision = block_hash_table[hash];
+ block_hash_table[hash] = entry;
+ link_block_after(&lru, entry);
+
+ return true;
+}
+
+/*
+ * Hash of opened files. It seems simpler to maintain our own cache than to provide an SMgrRelation for the smgr functions.
+ */
+typedef struct FileHashEntry
+{
+ BufferTag tag;
+ File file;
+} FileHashEntry;
+
+static FileHashEntry file_hash_table[FILE_HASH_SIZE];
+
+static File
+WalOpenFile(BufferTag* tag)
+{
+ BufferTag segment_tag = *tag;
+ uint32 hash;
+ char* path;
+ File file;
+
+ /* Transform block number into segment number */
+ segment_tag.blockNum /= RELSEG_SIZE;
+ hash = BufTableHashCode(&segment_tag) % FILE_HASH_SIZE;
+
+ if (BUFFERTAGS_EQUAL(file_hash_table[hash].tag, segment_tag))
+ return file_hash_table[hash].file;
+
+ path = relpathperm(tag->rnode, tag->forkNum);
+ if (segment_tag.blockNum > 0)
+ {
+ char* fullpath = psprintf("%s.%d", path, segment_tag.blockNum);
+ pfree(path);
+ path = fullpath;
+ }
+ file = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+
+ if (file >= 0)
+ {
+ elog(LOG_LEVEL, "WAL_PREFETCH: open file %s", path);
+ if (file_hash_table[hash].tag.rnode.dbNode != 0)
+ FileClose(file_hash_table[hash].file);
+
+ file_hash_table[hash].file = file;
+ file_hash_table[hash].tag = segment_tag;
+ }
+ pfree(path);
+ return file;
+}
+
+/*
+ * Our backend doesn't receive any notifications about WAL progress, so we have to sleep
+ * and poll until the requested information becomes available.
+ */
+static void
+WalWaitWAL(void)
+{
+ int rc;
+ CHECK_FOR_INTERRUPTS();
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ WalPrefetchPollInterval,
+ WAIT_EVENT_WAL_PREFETCHER_MAIN);
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ exit(1);
+
+}
+
+/*
+ * Main function: prefetch blocks referenced by WAL records, starting from the given LSN or from the WAL replay position if lsn is InvalidXLogRecPtr
+ */
+void
+WalPrefetch(XLogRecPtr lsn)
+{
+ XLogReaderState *xlogreader;
+ long n_prefetched = 0;
+ long n_fpw = 0;
+ long n_cached= 0;
+ long n_initialized = 0;
+
+ /* Dirty hack: prevent recovery conflict */
+ MyPgXact->xmin = InvalidTransactionId;
+
+ memset(file_hash_table, 0, sizeof file_hash_table);
+
+ free(block_hash_table);
+ block_hash_size = effective_cache_size;
+ block_hash_table = (BlockHashEntry**)calloc(block_hash_size, sizeof(BlockHashEntry*));
+ block_hash_used = 0;
+
+ xlogreader = XLogReaderAllocate(wal_segment_size, &WalReadPage, NULL);
+
+ if (!xlogreader)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ if (lsn == InvalidXLogRecPtr)
+ lsn = GetXLogReplayRecPtr(NULL); /* Start with replay LSN */
+
+ while (!shutdown_requested)
+ {
+ char *errormsg;
+ int block_id;
+ XLogRecPtr replay_lsn = GetXLogReplayRecPtr(&replay_timeline);
+ XLogRecord *record;
+
+ /*
+ * If the current position is behind the current replay LSN, move it forward: there is no point
+ * in prefetching blocks for WAL records that have already been replayed.
+ */
+ if (lsn != InvalidXLogRecPtr || replay_lsn + WalPrefetchMinLead*KB >= xlogreader->EndRecPtr)
+ {
+ XLogRecPtr prefetch_lsn = replay_lsn != InvalidXLogRecPtr
+ ? XLogFindNextRecord(xlogreader, Max(lsn, replay_lsn) + WalPrefetchMinLead*KB) : InvalidXLogRecPtr;
+ if (prefetch_lsn == InvalidXLogRecPtr)
+ {
+ elog(LOG_LEVEL, "WAL_PREFETCH: wait for new WAL records at LSN %llx: replay lsn %llx, prefetched %ld, cached %ld, fpw %ld, initialized %ld",
+ (long long)xlogreader->EndRecPtr, (long long)replay_lsn, n_prefetched, n_cached, n_fpw, n_initialized);
+ WalWaitWAL();
+ continue;
+ }
+ lsn = prefetch_lsn;
+ }
+ /*
+ * Now the opposite check: if prefetch gets too far ahead of the replay position, suspend it for a while
+ */
+ if (WalPrefetchMaxLead != 0 && replay_lsn + WalPrefetchMaxLead*KB < xlogreader->EndRecPtr)
+ {
+ elog(LOG_LEVEL, "WAL_PREFETCH: wait for recovery at LSN %llx, replay LSN %llx",
+ (long long)xlogreader->EndRecPtr, (long long)replay_lsn);
+ WalWaitWAL();
+ continue;
+ }
+
+ record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+ if (record != NULL)
+ {
+ lsn = InvalidXLogRecPtr; /* continue with next record */
+
+ /* Loop through blocks referenced by this WAL record */
+ for (block_id = 0; block_id <= xlogreader->max_block_id; block_id++)
+ {
+ BufferTag tag;
+ File file;
+
+ if (!XLogRecGetBlockTag(xlogreader, block_id, &tag.rnode, &tag.forkNum, &tag.blockNum))
+ continue;
+
+ /* Check if block already prefetched */
+ if (!put_block_in_cache(&tag))
+ continue;
+
+ /* Check if block is cached in shared buffers */
+ if (IsBlockCached(&tag))
+ {
+ n_cached += 1;
+ continue;
+ }
+
+ /* Do not prefetch full pages */
+ if (XLogRecHasBlockImage(xlogreader, block_id))
+ {
+ n_fpw += 1;
+ continue;
+ }
+
+ /* Ignore initialized pages */
+ if (XLogRecGetRmid(xlogreader) == RM_HEAP_ID
+ && (XLogRecGetInfo(xlogreader) & XLOG_HEAP_INIT_PAGE))
+ {
+ n_initialized += 1;
+ continue;
+ }
+
+ file = WalOpenFile(&tag);
+ if (file >= 0)
+ {
+ off_t offs = (off_t) BLCKSZ * (tag.blockNum % ((BlockNumber) RELSEG_SIZE));
+ int rc;
+#if DEBUG_PREFETCH
+ instr_time start, stop;
+ INSTR_TIME_SET_CURRENT(start);
+#endif
+ rc = FilePrefetch(file, offs, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
+ if (rc != 0)
+ elog(ERROR, "WAL_PREFETCH: failed to prefetch file: %m");
+ else if (++n_prefetched % STAT_REFRESH_PERIOD == 0)
+ {
+ char buf[1024];
+ sprintf(buf, "Prefetch %ld blocks at LSN %llx, replay LSN %llx",
+ n_prefetched, (long long)xlogreader->ReadRecPtr, (long long)replay_lsn);
+ pgstat_report_activity(STATE_RUNNING, buf);
+ elog(DEBUG1, "%s", buf);
+ }
+#if DEBUG_PREFETCH
+ INSTR_TIME_SET_CURRENT(stop);
+ INSTR_TIME_SUBTRACT(stop,start);
+ elog(LOG, "WAL_PREFETCH: %x/%x prefetch block %d fork %d of relation %d at LSN %llx, replay LSN %llx (%u usec), %ld prefetched, %ld cached, %ld fpw, %ld initialized",
+ XLogRecGetRmid(xlogreader), XLogRecGetInfo(xlogreader),
+ tag.blockNum, tag.forkNum, tag.rnode.relNode, (long long)xlogreader->ReadRecPtr, (long long)replay_lsn,
+ (int)INSTR_TIME_GET_MICROSEC(stop), n_prefetched, n_cached, n_fpw, n_initialized);
+#endif
+ }
+ else
+ elog(LOG, "WAL_PREFETCH: file segment doesn't exists");
+ }
+ }
+ else
+ {
+ elog(LOG, "WAL_PREFETCH: wait for valid record at LSN %llx, replay_lsn %llx: %s",
+ (long long)xlogreader->EndRecPtr, (long long)replay_lsn, errormsg);
+ WalWaitWAL();
+ }
+ }
+}
+
+/*
+ * Almost a copy of read_local_xlog_page from xlogutils.c, but it reads up to the WAL receiver's flush position rather than the replay position.
+ */
+static int
+WalReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr,
+ int reqLen, XLogRecPtr targetRecPtr, char *cur_page,
+ TimeLineID *pageTLI)
+{
+ XLogRecPtr read_upto,
+ loc;
+ int count;
+
+ loc = targetPagePtr + reqLen;
+
+ /* Loop waiting for xlog to be available if necessary */
+ while (1)
+ {
+ /*
+ * If we are performing crash recovery at startup, read until the end of WAL;
+ * otherwise, if there is an active WAL receiver on the replica, read until the end of the received data;
+ * if there is no active WAL receiver, just sleep.
+ */
+ read_upto = WalRcv->walRcvState == WALRCV_STOPPED
+ ? RecoveryInProgress() ? (XLogRecPtr)-1 : InvalidXLogRecPtr
+ : WalRcv->receivedUpto;
+ *pageTLI = replay_timeline;
+
+ if (loc <= read_upto)
+ break;
+
+ elog(LOG_LEVEL, "WAL_PREFETCH: wait for new WAL records at LSN %llx, read up to lsn %llx",
+ (long long)loc, (long long)read_upto);
+ WalWaitWAL();
+ CHECK_FOR_INTERRUPTS();
+ if (shutdown_requested)
+ return -1;
+ }
+
+ if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have caller
+ * come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ }
+ else if (targetPagePtr + reqLen > read_upto)
+ {
+ /* not enough data there */
+ return -1;
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = read_upto - targetPagePtr;
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ XLogRead(cur_page, state->wal_segment_size, *pageTLI, targetPagePtr, XLOG_BLCKSZ);
+
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e47ddca..6319ed5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -252,7 +252,7 @@ static void LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time);
static TimeOffset LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now);
static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static void WalSndRead(char *buf, XLogRecPtr startptr, Size count);
/* Initialize walsender process before entering the main command loop */
@@ -771,7 +771,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
count = flushptr - targetPagePtr; /* part of the page available */
/* now actually read the data, we know it's there */
- XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+ WalSndRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
return count;
}
@@ -2314,7 +2314,7 @@ WalSndKill(int code, Datum arg)
* more than one.
*/
static void
-XLogRead(char *buf, XLogRecPtr startptr, Size count)
+WalSndRead(char *buf, XLogRecPtr startptr, Size count)
{
char *p;
XLogRecPtr recptr;
@@ -2710,7 +2710,7 @@ XLogSendPhysical(void)
* calls.
*/
enlargeStringInfo(&output_message, nbytes);
- XLogRead(&output_message.data[output_message.len], startptr, nbytes);
+ WalSndRead(&output_message.data[output_message.len], startptr, nbytes);
output_message.len += nbytes;
output_message.data[output_message.len] = '\0';
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5..82849dd 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -585,6 +585,19 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
#endif /* USE_PREFETCH */
}
+bool
+IsBlockCached(BufferTag* tag)
+{
+ uint32 hash = BufTableHashCode(tag);
+ LWLock *plock = BufMappingPartitionLock(hash);
+ int bufid;
+
+ LWLockAcquire(plock, LW_SHARED);
+ bufid = BufTableLookup(tag, hash);
+ LWLockRelease(plock);
+
+ return bufid >= 0;
+}
/*
* ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2ec103e..4d337b9 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -43,6 +43,8 @@
#define FSYNCS_PER_ABSORB 10
#define UNLINKS_PER_ABSORB 10
+/* #define DEBUG_PREFETCH 1 */
+
/*
* Special values for the segno arg to RememberFsyncRequest.
*
@@ -733,7 +735,10 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
off_t seekpos;
int nbytes;
MdfdVec *v;
-
+#if DEBUG_PREFETCH
+ instr_time start, stop;
+ INSTR_TIME_SET_CURRENT(start);
+#endif
TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
reln->smgr_rnode.node.dbNode,
@@ -788,6 +793,11 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
blocknum, FilePathName(v->mdfd_vfd),
nbytes, BLCKSZ)));
}
+#if DEBUG_PREFETCH
+ INSTR_TIME_SET_CURRENT(stop);
+ INSTR_TIME_SUBTRACT(stop,start);
+ elog(LOG, "Read block %d fork %d of relation %d (%d usec)", blocknum, forknum, reln->smgr_rnode.node.relNode, (int)INSTR_TIME_GET_MICROSEC(stop));
+#endif
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fa3c8a7..2e627b2 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -61,6 +61,7 @@
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walprefetcher.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -1823,6 +1824,17 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"wal_prefetch_enabled", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Allow prefetch of blocks referenced by WAL records."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &WalPrefetchEnabled,
+ false,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
@@ -2487,6 +2499,39 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"wal_prefetch_min_lead", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Minimal lead (kb) before WAL replay LSN and prefetched LSN."),
+ NULL,
+ GUC_UNIT_KB
+ },
+ &WalPrefetchMinLead,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_prefetch_max_lead", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Maximal lead (kb) before WAL replay LSN and prefetched LSN."),
+ NULL,
+ GUC_UNIT_KB
+ },
+ &WalPrefetchMaxLead,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_prefetch_poll_interval", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Interval of polling WAL by WAL prefetcher."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &WalPrefetchPollInterval,
+ 100, 1, 10000,
+ NULL, NULL, NULL
+ },
+
+ {
{"wal_writer_delay", PGC_SIGHUP, WAL_SETTINGS,
gettext_noop("Time between WAL flushes performed in the WAL writer."),
NULL,
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f307b63..70eac88 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -212,9 +212,7 @@ extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
/* Invalidate read state */
extern void XLogReaderInvalReadState(XLogReaderState *state);
-#ifdef FRONTEND
extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
-#endif /* FRONTEND */
/* Functions for decoding an XLogRecord */
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index c406699..c2b8a6f 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -55,4 +55,7 @@ extern int read_local_xlog_page(XLogReaderState *state,
extern void XLogReadDetermineTimeline(XLogReaderState *state,
XLogRecPtr wantPage, uint32 wantLength);
+extern void XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
+ Size count);
+
#endif
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e167ee8..5f8b67d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -400,6 +400,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalPrefetcherProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -412,6 +413,7 @@ extern AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalPrefetcherProcess() (MyAuxProcType == WalPrefetcherProcess)
/*****************************************************************************
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f592..20ab699 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -710,7 +710,8 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
- B_WAL_WRITER
+ B_WAL_WRITER,
+ B_WAL_PREFETCHER
} BackendType;
@@ -767,7 +768,8 @@ typedef enum
WAIT_EVENT_SYSLOGGER_MAIN,
WAIT_EVENT_WAL_RECEIVER_MAIN,
WAIT_EVENT_WAL_SENDER_MAIN,
- WAIT_EVENT_WAL_WRITER_MAIN
+ WAIT_EVENT_WAL_WRITER_MAIN,
+ WAIT_EVENT_WAL_PREFETCHER_MAIN
} WaitEventActivity;
/* ----------
diff --git a/src/include/postmaster/walprefetcher.h b/src/include/postmaster/walprefetcher.h
new file mode 100644
index 0000000..de156a9
--- /dev/null
+++ b/src/include/postmaster/walprefetcher.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * walprefetcher.h
+ * Exports from postmaster/walprefetcher.c.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/walprefetcher.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _WALPREFETCHER_H
+#define _WALPREFETCHER_H
+
+/* GUC options */
+extern int WalPrefetchMinLead;
+extern int WalPrefetchMaxLead;
+extern int WalPrefetchPollInterval;
+extern bool WalPrefetchEnabled;
+
+extern void WalPrefetcherMain(void) pg_attribute_noreturn();
+extern void WalPrefetch(XLogRecPtr lsn);
+
+#endif /* _WALPREFETCHER_H */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 5370035..aa1ca73 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -337,5 +337,6 @@ extern void DropRelFileNodeLocalBuffers(RelFileNode rnode, ForkNumber forkNum,
BlockNumber firstDelBlock);
extern void DropRelFileNodeAllLocalBuffers(RelFileNode rnode);
extern void AtEOXact_LocalBuffers(bool isCommit);
+extern bool IsBlockCached(BufferTag* tag);
#endif /* BUFMGR_INTERNALS_H */
On 06/27/2018 11:44 AM, Konstantin Knizhnik wrote:
...
I have improved my WAL prefetch patch. The main reason recovery was slower
with prefetch enabled was that the prefetcher did not take initialized pages
(XLOG_HEAP_INIT_PAGE) into account and did not remember (cache) full page
writes.
The main differences in the new version of the patch:
1. Use effective_cache_size as the size of the cache of prefetched blocks.
2. Do not prefetch blocks that are already present in shared buffers.
3. Do not prefetch blocks of RM_HEAP_ID records with the XLOG_HEAP_INIT_PAGE bit set.
4. Remember new/FPW pages in the prefetch cache, to avoid prefetching them for
subsequent WAL records.
5. Add min/max prefetch lead parameters to make it possible to synchronize the
speed of prefetch with the speed of replay (see the example settings below).
6. Increase the size of the open file cache to avoid redundant open/close
operations.
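Just to illustrate how the new GUCs fit together, enabling prefetch on a
replica might look roughly like this in postgresql.conf (the parameter names
come from the patch above; the values are purely illustrative, not tuned
recommendations):

wal_prefetch_enabled = on            # off by default
wal_prefetch_min_lead = 1MB          # skip blocks that replay is about to touch anyway
wal_prefetch_max_lead = 1GB          # pause prefetch if it runs too far ahead of replay
wal_prefetch_poll_interval = 100ms   # how often to poll for new WAL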
Thanks. I plan to look at it and do some testing, but I won't have time
until the end of next week (probably).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
I've done a bit of testing on the current patch, mostly to see how much
the prefetching can help (if at all). While the patch is still in early
WIP stages (at least that's my assessment, YMMV), the improvements are
already quite significant.
I had also planned to compare it to the pg_prefaulter [1] which kinda
started this all, but I've been unable to get it working with my very
limited knowledge of golang. I've fixed the simple stuff (references to
renamed PostgreSQL functions etc.) but then it does not do anything :-(
I wonder if it's working on FreeBSD only, or something like that ...
So this compares only master with and without WAL prefetching.
Instead of killing the server and measuring local recovery (which is
what Konstantin did before), I've decided to use replication. That is,
set up a replica, run pgbench on the master and see how much apply lag we
end up with over time. I find this much easier to reproduce, monitor
over time, do longer runs, ...
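(For reference, the apply lag itself can be sampled on the replica with plain
SQL, nothing patch-specific, e.g.:

  SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                         pg_last_wal_replay_lsn()) AS apply_lag_bytes;

which is the difference between the receive and replay positions in bytes.)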
master
------
* 32 cores (2x E5-2620v4)
* 32GB of RAM
* Intel Optane SSD 280GB
* shared_buffers=4GB
* max_wal_size=128GB
* checkpoint_timeout=30min
replica
-------
* 4 cores (i5-2500k)
* 8GB RAM
* 6x Intel S3700 SSD (RAID0)
* shared_buffers=512MB
* effective_cache_size=256MB
I've also decided to use pgbench scale 1000 (~16GB) which fits into RAM
on the master but not the replica. This may seem like a bit strange
choice, but I think it's not entirely crazy, for a couple of reasons:
* It's not entirely uncommon to have replicas with different hardware
configuration. For HA it's a bad idea, but there are valid use cases.
* Even with the same hardware config, you may have very different
workload on the replica, accessing very different subset of the data.
Consider master doing OLTP on small active set, while replica runs BI
queries on almost all data, pushing everything else from RAM.
* It amplifies the effect of prefetching, which is nice for testing.
* I don't have two machines with exactly the same config anyway ;-)
The pgbench test is then executed on master like this:
pgbench -c 32 -j 8 -T 3600 -l --aggregate-interval=1 test
The replica is unlikely to keep up with the master, so the question is
how much apply lag we end up with at the end.
Without prefetching, it's ~70GB of WAL. With prefetching, it's only
about 30GB. Considering the 1-hour test generates about 90GB of WAL,
this means the replay speed grew from 20GB/h to almost 60GB/h. That's
a rather measurable improvement ;-)
The attached replication-lag.png chart shows how the lag grows over
time. The "bumps" after ~30 minutes coincide with a checkpoint,
triggering FPIs for a short while. The record-size.png and fpi-size.png
come from pg_waldump and show what part of WAL consists of regular
records and FPIs.
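(The record/FPI breakdown is the kind of output pg_waldump's --stats option
produces, along the lines of

  pg_waldump --stats=record <first segment> <last segment>

the exact invocation doesn't matter much here.)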
Note: I've done two runs with each configuration, so there are four data
series on all charts.
With prefetching the lag drops down a bit after a while (by about the
same amount of WAL), while without prefetch it does not. My explanation
is that the replay is so slow it does not get to the FPIs until after
the test - so it happens, but we don't see it here.
Now, how does this look on system metrics? Without prefetching we see
low CPU usage, because the process is waiting for I/O. And the I/O is
under-utilized, because we only issue one request at a time (which means
short I/O queues, low utilization of individual devices in the RAID).
In this case I see that without prefetching, the replay process uses
about 20% of a CPU. With prefetching this increases to ~60%, which is nice.
At the storage level, the utilization for each device in the RAID0 array
is ~20%, and with prefetching enabled this jumps up to ~40%. If you look
at IOPS instead, that jumps from ~2000 to ~6500, so about 3x. How is
this possible when the utilization grew only ~2x? We're generating
longer I/O queues (20 requests instead of 3), and the devices can
optimize it quite a bit.
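(The per-device utilization, queue length and IOPS figures above are ordinary
iostat-style metrics; something like "iostat -x 1" on the replica during
replay shows this kind of data.)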
I think there's room for additional improvement. We probably can't get
the CPU usage to 100%, but 60% is still quite low. The storage can
certainly handle more requests, the devices are doing something only
about 40% of the time.
But overall it looks quite nice, and I think it's worth continuing to work
on it.
BTW to make this work, I had to tweak NUM_AUXILIARY_PROCS (increase it
from 4 to 5), otherwise InitAuxiliaryProcess() fails because there's no
room for an additional process. I assume it works with local recovery, but
once you need to start walreceiver it fails.
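For anyone else trying this, the tweak is presumably just the constant in
src/include/storage/proc.h, i.e. something like

  #define NUM_AUXILIARY_PROCS 5   /* was 4; make room for the WAL prefetcher */

so that the prefetcher gets an auxiliary PGPROC slot alongside the walreceiver.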
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
wal-record-size.png (image/png)