WAL prefetch

Started by Konstantin Knizhnik · almost 8 years ago · 59 messages · pgsql-hackers
#1Konstantin Knizhnik
k.knizhnik@postgrespro.ru

There was a very interesting presentation at PGCon 2018 about pg_prefaulter:

http://www.pgcon.org/2018/schedule/events/1204.en.html

But it is implemented in Go and uses pg_waldump.
I tried to do the same using the built-in Postgres WAL traversal
functions. I have implemented it as an extension for simplicity of
integration; in principle it could be started as a background worker.

First of all, I tried to estimate the effect of preloading data.
I have implemented a prefetch utility, which is also attached to this
mail. It performs random reads of blocks of a large file and spawns a
given number of prefetch threads:

Just normal read without prefetch:
./prefetch -n 0 SOME_BIG_FILE

One prefetch thread which uses pread:
./prefetch SOME_BIG_FILE

One prefetch thread which uses posix_fadvise:
./prefetch -f SOME_BIG_FILE

Four prefetch threads which use posix_fadvise:
./prefetch -f -n 4 SOME_BIG_FILE

Based on these experiments (on my desktop), I reached the following
conclusions:

1. Prefetching on HDD doesn't give any positive effect.
2. Using posix_fadvise can speed up random reads on SSD by up to
2 times.
3. posix_fadvise(WILLNEED) is more efficient than performing normal reads.
4. Calling posix_fadvise from more than one thread gives no benefit.

I have tested wal_prefetch on two powerful servers with 24 cores, a 3Tb
NVMe RAID 10 storage device and 256Gb of RAM, connected using InfiniBand.
The speed of synchronous replication between the two nodes increased
from 56k TPS to 60k TPS (on pgbench with scale 1000).

Usage:
1. On the master: create extension wal_prefetch
2. On the replica: call the pg_wal_prefetch() function; it will not
return until you interrupt it.

The pg_wal_prefetch function will traverse the WAL indefinitely,
prefetching the block references in WAL records using the
posix_fadvise(WILLNEED) system call.

It is possible to explicitly specify a start LSN for the
pg_wal_prefetch() function; otherwise, the WAL redo position is used as
the start LSN.

--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

prefetch.c (text/x-csrc)
wal_prefetch.tgz (application/x-compressed-tar)
#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Konstantin Knizhnik (#1)
Re: WAL prefetch

On Wed, Jun 13, 2018 at 6:39 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

There was very interesting presentation at pgconf about pg_prefaulter:

http://www.pgcon.org/2018/schedule/events/1204.en.html

But it is implemented in GO and using pg_waldump.
I tried to do the same but using built-on Postgres WAL traverse functions.
I have implemented it as extension for simplicity of integration.
In principle it can be started as BG worker.

Right, or in other words, it could do something like autoprewarm [1],
which could provide a more user-friendly interface for this utility if
we decide to include it.

First of all I tried to estimate effect of preloading data.
I have implemented prefetch utility with is also attached to this mail.
It performs random reads of blocks of some large file and spawns some number
of prefetch threads:

Just normal read without prefetch:
./prefetch -n 0 SOME_BIG_FILE

One prefetch thread which uses pread:
./prefetch SOME_BIG_FILE

One prefetch thread which uses posix_fadvise:
./prefetch -f SOME_BIG_FILE

4 prefetch thread which uses posix_fadvise:
./prefetch -f -n 4 SOME_BIG_FILE

Based on this experiments (on my desktop), I made the following conclusions:

1. Prefetch at HDD doesn't give any positive effect.
2. Using posix_fadvise allows to speed-up random read speed at SSD up to 2
times.
3. posix_fadvise(WILLNEED) is more efficient than performing normal reads.
4. Calling posix_fadvise in more than one thread has no sense.

I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from 56k
TPS to 60k TPS (on pgbench with scale 1000).

That's a reasonable improvement.

Usage:
1. At master: create extension wal_prefetch
2. At replica: Call pg_wal_prefetch() function: it will not return until you
interrupt it.

I think it is not a very user-friendly interface, but the idea sounds
good to me; it can help some other workloads. I think this can help
in recovery as well.

[1]: https://www.postgresql.org/docs/devel/static/pgprewarm.html

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#3Thomas Munro
thomas.munro@gmail.com
In reply to: Konstantin Knizhnik (#1)
Re: WAL prefetch

On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.

Hi Konstantin,

Why stop at the page cache... what about shared buffers?

--
Thomas Munro
http://www.enterprisedb.com

#4Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Thomas Munro (#3)
Re: WAL prefetch

On 14.06.2018 09:52, Thomas Munro wrote:

On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.

Hi Konstantin,

Why stop at the page cache... what about shared buffers?

It is a good question. I thought a lot about prefetching directly into
shared buffers.
But the current reality with Postgres is that allocating too much
memory for shared buffers is not recommended, for many different
reasons: degradation of the clock replacement algorithm, write
storms, ...

If your system has 1Tb of memory, almost no Postgresql administrator
will recommend using all of this 1Tb for shared buffers.
Moreover, there are recommendations to choose the shared buffers size
based on the size of the internal cache of the persistent storage device
(so that it is possible to flush changes without doing writes to
physical media). So on such a system with 1Tb of RAM, the size of shared
buffers will most likely be set to a few hundred gigabytes.

Also, PostgreSQL does not currently support dynamic resizing of shared
buffers. Without it, the only way to use Postgres in clouds and other
multiuser systems, where the system load is not fully controlled by the
user, is to choose a relatively small shared buffers size and rely on OS
caching.

Yes, access to a shared buffer is about two times faster than reading
data from the file system cache.
But relying on the OS cache is still better than the situation where
shared buffers are swapped out and the effect of large shared buffers
becomes negative.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#5Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#2)
Re: WAL prefetch

On Wed, Jun 13, 2018 at 11:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from 56k
TPS to 60k TPS (on pgbench with scale 1000).

That's a reasonable improvement.

Somehow I would have expected more. That's only a 7% speedup.

I am also surprised that HDD didn't show any improvement. Since HDDs
are bad at random I/O, I would have expected prefetching to help more
in that case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#5)
Re: WAL prefetch

On Thu, Jun 14, 2018 at 6:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 13, 2018 at 11:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from 56k
TPS to 60k TPS (on pgbench with scale 1000).

That's a reasonable improvement.

Somehow I would have expected more. That's only a 7% speedup.

It might be due to the fact that the synchronous mode of replication
already carries a big overhead, which masks a larger speedup. We
might want to try recovery (PITR) or maybe async replication to see if
we get better numbers.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#7Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Robert Haas (#5)
Re: WAL prefetch

On 14.06.2018 15:44, Robert Haas wrote:

On Wed, Jun 13, 2018 at 11:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from 56k
TPS to 60k TPS (on pgbench with scale 1000).

That's a reasonable improvement.

Somehow I would have expected more. That's only a 7% speedup.

I am also surprised that HDD didn't show any improvement.

Maybe pgbench is not the best use case for prefetch. It updates more
or less random pages, and if the database is large enough and
full_page_writes is true (the default value), then most pages will be
updated only once since the last checkpoint, so most updates will be
represented in the WAL by full-page records.
Such records do not require reading any data from disk.

Since HDD's
are bad at random I/O, I would have expected prefetching to help more
in that case.

The speed of random HDD access is limited by the speed of disk head
movement. By running several I/O requests in parallel we just increase
the probability of head movement, so parallel access to an HDD may
actually decrease I/O speed rather than increase it.
In theory, given several concurrent I/O requests, the driver can execute
them in an optimal order, trying to minimize head movement. But if the
accesses really are to random pages, then the probability that such an
optimization wins anything is very small.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#8Robert Haas
robertmhaas@gmail.com
In reply to: Konstantin Knizhnik (#7)
Re: WAL prefetch

On Thu, Jun 14, 2018 at 9:23 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

Speed of random HDD access is limited by speed of disk head movement.
By running several IO requests in parallel we just increase probability of
head movement, so actually parallel access to HDD may even decrease IO speed
rather than increase it.
In theory, given several concurrent IO requests, driver can execute them in
optimal order, trying to minimize head movement. But if there are really
access to random pages,
then probability that we can win something by such optimization is very
small.

You might be right, but I feel like I've heard previous reports of
significant speedups from prefetching on HDDs. Perhaps I am
mis-remembering.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Robert Haas (#8)
Re: WAL prefetch

On 14.06.2018 16:25, Robert Haas wrote:

On Thu, Jun 14, 2018 at 9:23 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

Speed of random HDD access is limited by speed of disk head movement.
By running several IO requests in parallel we just increase probability of
head movement, so actually parallel access to HDD may even decrease IO speed
rather than increase it.
In theory, given several concurrent IO requests, driver can execute them in
optimal order, trying to minimize head movement. But if there are really
access to random pages,
then probability that we can win something by such optimization is very
small.

You might be right, but I feel like I've heard previous reports of
significant speedups from prefetching on HDDs. Perhaps I am
mis-remembering.

It is true for RAID arrays of HDDs, which really can win by issuing
parallel I/O operations.

But there are so many different factors that I will not be surprised
by any result :)

The last problem I observed with an NVMe device on one customer's
system was huge performance degradation (> 10 times: from 500Mb/sec to
50Mb/sec write speed) after space exhaustion on the device. There is a
3Tb NVMe RAID device with a 1.5Gb database; ext4 was mounted without the
"discard" option.
After an incorrect execution of rsync, space was exhausted. Then I
removed all data and copied the database from the master node.
Then I observed huge lags in async replication between master and
replica: wal_receiver was saving received data too slowly, with a write
speed of about ~50Mb/sec vs. 0.5Gb/sec at the master.
All my attempts to use fstrim or e4defrag didn't help. The problem was
solved only after deleting all database files, performing fstrim and
copying the database once again.
After that, wal_receiver was writing data at the normal speed of
~0.5Gb/sec and there was no lag between master and replica.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#10Stephen Frost
sfrost@snowman.net
In reply to: Konstantin Knizhnik (#1)
Re: WAL prefetch

Greetings,

* Konstantin Knizhnik (k.knizhnik@postgrespro.ru) wrote:

There was very interesting presentation at pgconf about pg_prefaulter:

http://www.pgcon.org/2018/schedule/events/1204.en.html

I agree and I've chatted a bit w/ Sean further about it.

But it is implemented in GO and using pg_waldump.

Yeah, that's not too good if we want it in core.

I tried to do the same but using built-on Postgres WAL traverse functions.
I have implemented it as extension for simplicity of integration.
In principle it can be started as BG worker.

I don't think this needs to be, or should be, an extension. If this is
worthwhile (and it certainly appears to be) then we should just do it in
core.

First of all I tried to estimate effect of preloading data.
I have implemented prefetch utility with is also attached to this mail.
It performs random reads of blocks of some large file and spawns some number
of prefetch threads:

Just normal read without prefetch:
./prefetch -n 0 SOME_BIG_FILE

One prefetch thread which uses pread:
./prefetch SOME_BIG_FILE

One prefetch thread which uses posix_fadvise:
./prefetch -f SOME_BIG_FILE

4 prefetch thread which uses posix_fadvise:
./prefetch -f -n 4 SOME_BIG_FILE

Based on this experiments (on my desktop), I made the following conclusions:

1. Prefetch at HDD doesn't give any positive effect.
2. Using posix_fadvise allows to speed-up random read speed at SSD up to 2
times.
3. posix_fadvise(WILLNEED) is more efficient than performing normal reads.
4. Calling posix_fadvise in more than one thread has no sense.

Ok.

I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from 56k
TPS to 60k TPS (on pgbench with scale 1000).

I'm also surprised that it wasn't a larger improvement.

Seems like it would make sense to implement this in core using
posix_fadvise(), perhaps in the WAL receiver and in RestoreArchivedFile
or nearby. At least, that's the thinking I had when I was chatting w/
Sean.

Thanks!

Stephen

#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Stephen Frost (#10)
Re: WAL prefetch

On Fri, Jun 15, 2018 at 12:16 AM, Stephen Frost <sfrost@snowman.net> wrote:

I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from 56k
TPS to 60k TPS (on pgbench with scale 1000).

I'm also surprised that it wasn't a larger improvement.

Seems like it would make sense to implement in core using
posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
or nearby.. At least, that's the thinking I had when I was chatting w/
Sean.

Doing it in core certainly has some advantages, such as being able to
reuse the existing xlog code rather than making a copy as the patch
currently does, but I think it also depends on whether this is really a
win in a number of common cases or just a win in some limited cases.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#12Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Amit Kapila (#11)
Re: WAL prefetch

On 15.06.2018 07:36, Amit Kapila wrote:

On Fri, Jun 15, 2018 at 12:16 AM, Stephen Frost <sfrost@snowman.net> wrote:

I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from 56k
TPS to 60k TPS (on pgbench with scale 1000).

I'm also surprised that it wasn't a larger improvement.

Seems like it would make sense to implement in core using
posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
or nearby.. At least, that's the thinking I had when I was chatting w/
Sean.

Doing in-core certainly has some advantage such as it can easily reuse
the existing xlog code rather trying to make a copy as is currently
done in the patch, but I think it also depends on whether this is
really a win in a number of common cases or is it just a win in some
limited cases.

I completely agree. That was my main concern: in which use cases will
this prefetch be efficient?
If full_page_writes is on (the safe and default value), then the first
update of a page since the last checkpoint will be written to the WAL as
a full page, and applying it will not require reading any data from
disk.
If the page is updated multiple times in subsequent transactions, then
most likely it will still be present in the OS file cache, unless the
checkpoint interval exceeds the OS cache size (the amount of free memory
in the system). If these conditions are satisfied, prefetch is not
needed. And they seem to hold for most real configurations: the
checkpoint interval is rarely set larger than a hundred gigabytes, and
modern servers usually have more RAM than that.

But once this condition is not satisfied and the replication lag is
larger than the OS cache size, prefetch can become inefficient, because
prefetched pages may be evicted from the OS cache before they are
actually accessed by the redo process. In this case, extra
synchronization between the prefetch and replay processes is needed so
that prefetch does not move too far ahead of the replayed LSN.

It is not a problem to integrate this code into the Postgres core and
run it in a background worker. I do not think that performing prefetch
in the wal receiver process itself is a good idea: it may slow down
receiving changes from the master. And in that case I really can throw
away the cut-and-pasted code. But it is easier to experiment with an
extension than with a patch to the Postgres core,
and I have published this extension to make it possible to perform
experiments and check whether it is useful on real workloads.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Konstantin Knizhnik (#12)
Re: WAL prefetch

On Fri, Jun 15, 2018 at 1:08 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

On 15.06.2018 07:36, Amit Kapila wrote:

On Fri, Jun 15, 2018 at 12:16 AM, Stephen Frost <sfrost@snowman.net>
wrote:

I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb
NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from
56k
TPS to 60k TPS (on pgbench with scale 1000).

I'm also surprised that it wasn't a larger improvement.

Seems like it would make sense to implement in core using
posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
or nearby.. At least, that's the thinking I had when I was chatting w/
Sean.

Doing in-core certainly has some advantage such as it can easily reuse
the existing xlog code rather trying to make a copy as is currently
done in the patch, but I think it also depends on whether this is
really a win in a number of common cases or is it just a win in some
limited cases.

I am completely agree. It was my mail concern: on which use cases this
prefetch will be efficient.
If "full_page_writes" is on (and it is safe and default value), then first
update of a page since last checkpoint will be written in WAL as full page
and applying it will not require reading any data from disk.

What exactly do you mean by the above? AFAIU, it needs to read the WAL
to apply a full-page image. See the code below:

XLogReadBufferForRedoExtended()
{
	..
	/* If it has a full-page image and it should be restored, do it. */
	if (XLogRecBlockImageApply(record, block_id))
	{
		Assert(XLogRecHasBlockImage(record, block_id));
		*buf = XLogReadBufferExtended(rnode, forknum, blkno,
			get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
		page = BufferGetPage(*buf);
		if (!RestoreBlockImage(record, block_id, page))
	..
}

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#14Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Amit Kapila (#13)
Re: WAL prefetch

On 15.06.2018 18:03, Amit Kapila wrote:

On Fri, Jun 15, 2018 at 1:08 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

On 15.06.2018 07:36, Amit Kapila wrote:

On Fri, Jun 15, 2018 at 12:16 AM, Stephen Frost <sfrost@snowman.net>
wrote:

I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb
NVME
RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
The speed of synchronous replication between two nodes is increased from
56k
TPS to 60k TPS (on pgbench with scale 1000).

I'm also surprised that it wasn't a larger improvement.

Seems like it would make sense to implement in core using
posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
or nearby.. At least, that's the thinking I had when I was chatting w/
Sean.

Doing in-core certainly has some advantage such as it can easily reuse
the existing xlog code rather trying to make a copy as is currently
done in the patch, but I think it also depends on whether this is
really a win in a number of common cases or is it just a win in some
limited cases.

I am completely agree. It was my mail concern: on which use cases this
prefetch will be efficient.
If "full_page_writes" is on (and it is safe and default value), then first
update of a page since last checkpoint will be written in WAL as full page
and applying it will not require reading any data from disk.

What exactly you mean by above? AFAIU, it needs to read WAL to apply
full page image. See below code:

XLogReadBufferForRedoExtended()
{
..
/* If it has a full-page image and it should be restored, do it. */
if (XLogRecBlockImageApply(record, block_id))
{
Assert(XLogRecHasBlockImage(record, block_id));
*buf = XLogReadBufferExtended(rnode, forknum, blkno,
get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
page = BufferGetPage(*buf);
if (!RestoreBlockImage(record, block_id, page))
..
}

Sorry for my confusing statement.
Definitely we need to read the page from the WAL.
I meant that in the case of a full-page write we do not need to read the
updated page from the database; it can simply be overwritten.

pg_prefaulter and my wal_prefetch do not prefetch the WAL pages
themselves. There is no sense in doing that, because they have just been
written by the wal_receiver and so should be present in the file system
cache.
wal_prefetch prefetches the blocks referenced by WAL records. But in the
case of full page writes such a prefetch is not needed and is even
harmful.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#15Andres Freund
andres@anarazel.de
In reply to: Konstantin Knizhnik (#4)
Re: WAL prefetch

On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:

On 14.06.2018 09:52, Thomas Munro wrote:

On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.

Hi Konstantin,

Why stop at the page cache... what about shared buffers?

It is good question. I thought a lot about prefetching directly to shared
buffers.

I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.

But the current c'est la vie with Postgres is that allocating too large
memory for shared buffers is not recommended.
Due to many different reasons: degradation of clock replacement algorithm,
"write storm",...

I think a lot of that fear is overplayed. And we've fixed a number of
issues. We don't really generate write storms in the default config
anymore in most scenarios, and if it's an issue you can turn on
backend_flush_after.

If your system has 1Tb of memory, almost none of Postgresql administrators
will recommend to use all this 1Tb for shared buffers.

I've used 1TB successfully.

Also PostgreSQL is not currently supporting dynamic changing of shared
buffers size. Without it, the only way of using Postgres in clouds and
another multiuser systems where system load is not fully controlled by user
is to choose relatively small shared buffer size and rely on OS caching.

That seems largely unrelated to the replay case, because there the data
will be read into shared buffers anyway. And it'll be dirtied therein.

Greetings,

Andres Freund

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Konstantin Knizhnik (#14)
Re: WAL prefetch

On Fri, Jun 15, 2018 at 8:45 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

On 15.06.2018 18:03, Amit Kapila wrote:

wal_prefetch is prefetching blocks referenced by WAL records. But in case of
"full page writes" such prefetch is not needed and even is harmful.

Okay, IIUC, the basic idea is to prefetch recently modified data
pages so that they can be referenced. If so, isn't there some overlap
with the autoprewarm functionality, which dumps recently modified
blocks and can then prefetch them on recovery?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#17Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#15)
Re: WAL prefetch

On Fri, Jun 15, 2018 at 11:31 PM, Andres Freund <andres@anarazel.de> wrote:

On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:

On 14.06.2018 09:52, Thomas Munro wrote:

On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.

Hi Konstantin,

Why stop at the page cache... what about shared buffers?

It is good question. I thought a lot about prefetching directly to shared
buffers.

I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.

We can think of supporting two modes: (a) read into shared buffers, or
(b) read into the OS page cache.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#18Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Amit Kapila (#16)
Re: WAL prefetch

On 16.06.2018 06:30, Amit Kapila wrote:

On Fri, Jun 15, 2018 at 8:45 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

On 15.06.2018 18:03, Amit Kapila wrote:

wal_prefetch is prefetching blocks referenced by WAL records. But in case of
"full page writes" such prefetch is not needed and even is harmful.

Okay, IIUC, the basic idea is to prefetch recently modified data
pages, so that they can be referenced. If so, isn't there some
overlap with autoprewarm functionality which dumps recently modified
blocks and then on recovery, it can prefetch those?

Sorry, I do not see any intersection with the autoprewarm
functionality: WAL prefetch is performed at the replica, where the data
has not yet been modified; actually, the goal of WAL prefetch is to make
that update more efficient.
WAL prefetch can also be done at a standalone server to speed up
recovery after a crash, but that seems to be a much more exotic use
case.

#19Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Amit Kapila (#17)
Re: WAL prefetch

On 16.06.2018 06:33, Amit Kapila wrote:

On Fri, Jun 15, 2018 at 11:31 PM, Andres Freund <andres@anarazel.de> wrote:

On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:

On 14.06.2018 09:52, Thomas Munro wrote:

On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

pg_wal_prefetch function will infinitely traverse WAL and prefetch block
references in WAL records
using posix_fadvise(WILLNEED) system call.

Hi Konstantin,

Why stop at the page cache... what about shared buffers?

It is good question. I thought a lot about prefetching directly to shared
buffers.

I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.

We can think of supporting two modes (a) allows to read into shared
buffers or (b) allows to read into OS page cache.

Unfortunately, I am afraid that (a) requires a different approach:
unlike posix_fadvise, reading data into a shared buffer is a blocking
operation. If we do it with one worker, then it will read with the same
speed as the redo process. So to make prefetch really efficient, in this
case we have to spawn multiple workers performing prefetch in parallel
(as pg_prefaulter does).

Another concern of mine about prefetching into shared buffers is that it
may flush from the cache pages which are most frequently used by
read-only queries at the hot standby replica.

#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Konstantin Knizhnik (#19)
Re: WAL prefetch

On Sat, Jun 16, 2018 at 10:47 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

On 16.06.2018 06:33, Amit Kapila wrote:

On Fri, Jun 15, 2018 at 11:31 PM, Andres Freund <andres@anarazel.de>
wrote:

On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:

On 14.06.2018 09:52, Thomas Munro wrote:

On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

pg_wal_prefetch function will infinitely traverse WAL and prefetch
block
references in WAL records
using posix_fadvise(WILLNEED) system call.

Hi Konstantin,

Why stop at the page cache... what about shared buffers?

It is good question. I thought a lot about prefetching directly to
shared
buffers.

I think that's definitely how this should work. I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.

We can think of supporting two modes (a) allows to read into shared
buffers or (b) allows to read into OS page cache.

Unfortunately I afraid that a) requires different approach: unlike
posix_fadvise, reading data to shared buffer is blocking operation. If we
do it by one worker, then it will read it with the same speed as redo
process. So to make prefetch really efficient, in this case we have to
spawn multiple workers to perform prefetch in parallel (as pg_prefaulter
does).

Another my concern against prefetching to shared buffers is that it may
flush away from cache pages which are most frequently used by read only
queries at hot standby replica.

Okay, but I am suggesting making it optional so that it can be
enabled when helpful (say, when the user has enough shared buffers to
hold the data).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#21Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#15)
#22Thomas Munro
thomas.munro@gmail.com
In reply to: Tomas Vondra (#21)
#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Thomas Munro (#22)
#24Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#23)
#25Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#21)
#26Andres Freund
andres@anarazel.de
In reply to: Konstantin Knizhnik (#1)
#27Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#25)
#28Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#27)
#29Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Andres Freund (#25)
#30Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Andres Freund (#26)
#31Andres Freund
andres@anarazel.de
In reply to: Konstantin Knizhnik (#29)
#32Andres Freund
andres@anarazel.de
In reply to: Konstantin Knizhnik (#30)
#33Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Andres Freund (#31)
#34Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#28)
#35Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#34)
#36Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Andres Freund (#35)
#37Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Konstantin Knizhnik (#36)
#38Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Tomas Vondra (#37)
#39Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Konstantin Knizhnik (#38)
#40Ants Aasma
ants.aasma@cybertec.at
In reply to: Tomas Vondra (#39)
#41Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Ants Aasma (#40)
#42Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Konstantin Knizhnik (#41)
#43Andres Freund
andres@anarazel.de
In reply to: Konstantin Knizhnik (#36)
#44Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Andres Freund (#43)
#45Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#43)
#46Andres Freund
andres@anarazel.de
In reply to: Konstantin Knizhnik (#44)
#47Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#45)
#48Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Konstantin Knizhnik (#44)
#49Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Tomas Vondra (#48)
#50Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Konstantin Knizhnik (#49)
#51Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Tomas Vondra (#50)
#52Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Konstantin Knizhnik (#51)
#53Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Konstantin Knizhnik (#52)
#54Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#53)
#55Sean Chittenden
sean@chittenden.org
In reply to: Tomas Vondra (#54)
#56Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Sean Chittenden (#55)
#57Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#56)
#58Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Tomas Vondra (#54)
#59Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Andres Freund (#57)