Use fadvise in wal replay

Started by Kirill Reshke · 35 messages · pgsql-hackers
#1Kirill Reshke
reshke@double.cloud

Hi hackers!

Recently we faced a problem with one of our production clusters. We use a
cascade replication setup in this cluster, that is: master, standby (r1),
and cascade standby (r2). From time to time, the replication lag on r1 used
to grow, while on r2 it did not. Analysis showed that the r1 startup process
was spending a lot of time reading WAL from disk. Increasing
/sys/block/md2/queue/read_ahead_kb to 16384 (from 0) helped in this case.
Maybe we can add an fadvise call to the PostgreSQL startup process, so it
would not be necessary to change settings on the hypervisor?
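For illustration only (a hypothetical Python sketch of the idea, not the attached C patch; `replay_segment` and its block-by-block loop are my own stand-ins), the core of the proposal is to issue a POSIX_FADV_WILLNEED hint just ahead of the read position, so the kernel starts fetching the next WAL block even when device-level readahead is disabled:

```python
# Illustrative sketch: while reading a WAL segment block by block, hint the
# kernel to prefetch the next block so replay does not stall on disk reads
# even with read_ahead_kb = 0.
import os
import tempfile

XLOG_BLCKSZ = 8192  # PostgreSQL's default WAL block size

def replay_segment(path):
    """Read a segment block by block, fadvising one block ahead."""
    blocks_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        read_off = 0
        while True:
            if hasattr(os, "posix_fadvise"):  # POSIX-only hint, harmless no-op past EOF
                os.posix_fadvise(fd, read_off + XLOG_BLCKSZ, XLOG_BLCKSZ,
                                 os.POSIX_FADV_WILLNEED)
            buf = os.read(fd, XLOG_BLCKSZ)
            if not buf:
                break
            blocks_read += 1          # a real server would decode records here
            read_off += XLOG_BLCKSZ
    finally:
        os.close(fd)
    return blocks_read

# Demo on a throwaway 4-block "segment"
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (4 * XLOG_BLCKSZ))
    path = f.name
print(replay_segment(path))  # -> 4
os.unlink(path)
```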

Attachments:

v1-0001-Use-fadvise-to-prefect-wal-in-xlogrecovery.patch (application/octet-stream, +7 -1)
#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Kirill Reshke (#1)
Re: Use fadvise in wal replay

On Tue, Jun 21, 2022 at 1:07 PM Kirill Reshke <reshke@double.cloud> wrote:

Recently we faced a problem with one of our production clusters. We use a cascade replication setup in this cluster, that is: master, standby (r1), and cascade standby (r2). From time to time, the replication lag on r1 used to grow, while on r2 it did not. Analysis showed that the r1 startup process was spending a lot of time reading WAL from disk. Increasing /sys/block/md2/queue/read_ahead_kb to 16384 (from 0) helped in this case. Maybe we can add an fadvise call to the PostgreSQL startup process, so it would not be necessary to change settings on the hypervisor?

I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
help your case?

[1]: https://www.postgresql.org/docs/devel/runtime-config-wal.html#RUNTIME-CONFIG-WAL-RECOVERY

--
With Regards,
Amit Kapila.

#3Andrey Borodin
amborodin@acm.org
In reply to: Amit Kapila (#2)
Re: Use fadvise in wal replay

On 21 Jun 2022, at 12:35, Amit Kapila <amit.kapila16@gmail.com> wrote:

I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
help your case?

AFAICS recovery_prefetch tries to prefetch the main fork, but does not try to prefetch the WAL itself before reading it. Kirill is trying to solve the problem of reading WAL segments that are out of the OS page cache.

Best regards, Andrey Borodin.

#4Jakub Wartak
Jakub.Wartak@tomtom.com
In reply to: Andrey Borodin (#3)
RE: Use fadvise in wal replay

On 21 Jun 2022, at 12:35, Amit Kapila <amit.kapila16@gmail.com> wrote:

I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
help your case?

AFAICS recovery_prefetch tries to prefetch the main fork, but does not try to
prefetch the WAL itself before reading it. Kirill is trying to solve the problem of
reading WAL segments that are out of the OS page cache.

It seems that it is always set to 128 kB by default; another thing is that the (default) 16 MB WAL segments might also hinder the readahead heuristics compared to a configured bigger WAL segment size.

Maybe the important question is why the readahead mechanism would be disabled in the first place via /sys | blockdev?

-J.

#5Andrey Borodin
amborodin@acm.org
In reply to: Jakub Wartak (#4)
Re: Use fadvise in wal replay

On 21 Jun 2022, at 13:20, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:

Maybe the important question is why the readahead mechanism would be disabled in the first place via /sys | blockdev?

Because the database should know better than the OS which data needs to be prefetched and which does not. A big OS readahead setting hurts index scan performance.

Best regards, Andrey Borodin.

#6Jakub Wartak
Jakub.Wartak@tomtom.com
In reply to: Andrey Borodin (#5)
RE: Use fadvise in wal replay

Maybe the important question is why the readahead mechanism would be disabled in the first place via /sys | blockdev?

Because the database should know better than the OS which data needs to be prefetched and which does not. A big OS readahead setting hurts index scan performance.

OK, fair point; however, the patch here adds 1 syscall per XLOG_BLCKSZ, which is not cheap either. The code is already hot, and there is an example from the past where syscalls were limiting performance [1]. Maybe it could prefetch in larger batches (128 kB? 1 MB? 16 MB?)?

-J.

[1]: https://commitfest.postgresql.org/28/2606/

#7Thomas Munro
thomas.munro@gmail.com
In reply to: Jakub Wartak (#6)
Re: Use fadvise in wal replay

On Tue, Jun 21, 2022 at 10:33 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:

Maybe the important question is why the readahead mechanism would be disabled in the first place via /sys | blockdev?

Because the database should know better than the OS which data needs to be prefetched and which does not. A big OS readahead setting hurts index scan performance.

OK, fair point; however, the patch here adds 1 syscall per XLOG_BLCKSZ, which is not cheap either. The code is already hot, and there is an example from the past where syscalls were limiting performance [1]. Maybe it could prefetch in larger batches (128 kB? 1 MB? 16 MB?)?

I've always thought we'd want to tell it about the *next* segment
file, to smooth the transition from one file to the next, something
like the attached (not tested).
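To make Thomas's "tell it about the next segment" idea concrete, here is a hypothetical Python sketch (not the attached patch; `next_segment_name` and `prefetch_next_segment` are my own illustrative helpers): compute the file name of the segment after the current one from the standard 24-hex-digit WAL naming scheme and, if that file already exists, fadvise the whole of it.

```python
# Sketch: when a WAL segment is opened for replay, advise the kernel to
# start reading the *next* segment file, smoothing the file-to-file switch.
import os

WAL_SEG_SIZE = 16 * 1024 * 1024                # default wal_segment_size
SEGS_PER_XLOGID = 0x100000000 // WAL_SEG_SIZE  # 256 segments per "log" id

def next_segment_name(fname):
    """Return the name of the WAL segment after fname (same timeline).

    WAL file names are TTTTTTTTXXXXXXXXYYYYYYYY: timeline, log id, and
    segment-within-log, each 8 hex digits."""
    tli = int(fname[0:8], 16)
    segno = int(fname[8:16], 16) * SEGS_PER_XLOGID + int(fname[16:24], 16)
    segno += 1
    return "%08X%08X%08X" % (tli, segno // SEGS_PER_XLOGID,
                             segno % SEGS_PER_XLOGID)

def prefetch_next_segment(wal_dir, fname):
    """Hint the kernel to read the whole next segment, if it exists."""
    path = os.path.join(wal_dir, next_segment_name(fname))
    try:
        fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:
        return False                           # not streamed/restored yet
    try:
        if hasattr(os, "posix_fadvise"):       # POSIX-only hint
            os.posix_fadvise(fd, 0, WAL_SEG_SIZE, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
    return True

print(next_segment_name("0000000100000000000000FF"))  # -> 000000010000000100000000
```

Note the carry from segment 0xFF into the next log id, which is where a per-file prefetch would matter most: the switch happens exactly when no bytes of the new file have been touched yet.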

Attachments:

prefetch-wal-segments.patchtext/x-patch; charset=US-ASCII; name=prefetch-wal-segments.patchDownload+30-0
#8Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Thomas Munro (#7)
Re: Use fadvise in wal replay

On Tue, Jun 21, 2022 at 4:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Jun 21, 2022 at 10:33 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:

Maybe the important question is why the readahead mechanism would be disabled in the first place via /sys | blockdev?

Because the database should know better than the OS which data needs to be prefetched and which does not. A big OS readahead setting hurts index scan performance.

OK, fair point; however, the patch here adds 1 syscall per XLOG_BLCKSZ, which is not cheap either. The code is already hot, and there is an example from the past where syscalls were limiting performance [1]. Maybe it could prefetch in larger batches (128 kB? 1 MB? 16 MB?)?

I've always thought we'd want to tell it about the *next* segment
file, to smooth the transition from one file to the next, something
like the attached (not tested).

Yes, it makes sense to prefetch the "future" WAL files that "may be"
needed for recovery (crash recovery, archive/PITR recovery, standby
recovery), not the current WAL file. Having said that, it's not a
great idea (IMO) to make the WAL readers do the prefetching; instead, WAL
prefetching can be delegated to a new background worker, or to the existing
bgwriter or checkpointer, which gets started during recovery.

Also, it's a good idea to measure the benefits with and without WAL
prefetching for all recovery types - crash recovery/archive or PITR
recovery/standby recovery. For standby recovery, the WAL files may be
in OS cache if there wasn't a huge apply lag.

Regards,
Bharath Rupireddy.

#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Andrey Borodin (#3)
Re: Use fadvise in wal replay

On Tue, Jun 21, 2022 at 3:18 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 21 Jun 2022, at 12:35, Amit Kapila <amit.kapila16@gmail.com> wrote:

I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
help your case?

AFAICS recovery_prefetch tries to prefetch the main fork, but does not try to prefetch the WAL itself before reading it. Kirill is trying to solve the problem of reading WAL segments that are out of the OS page cache.

Okay, but normally the WAL written by walreceiver is read by the
startup process soon after it's written, as indicated in the code comments
(get_sync_bit()). So, what is causing the delay here that makes the
startup process perform physical reads?

--
With Regards,
Amit Kapila.

#10Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Amit Kapila (#9)
Re: Use fadvise in wal replay

On Tue, Jun 21, 2022 at 4:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 21, 2022 at 3:18 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 21 Jun 2022, at 12:35, Amit Kapila <amit.kapila16@gmail.com> wrote:

I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
help your case?

AFAICS recovery_prefetch tries to prefetch the main fork, but does not try to prefetch the WAL itself before reading it. Kirill is trying to solve the problem of reading WAL segments that are out of the OS page cache.

Okay, but normally the WAL written by walreceiver is read by the
startup process soon after it's written, as indicated in the code comments
(get_sync_bit()). So, what is causing the delay here that makes the
startup process perform physical reads?

That's not always true. If there's a huge apply lag, restartpoints are
infrequent/frequent, or there are many reads on the standby - in all of
these cases the OS can evict the WAL from its cache, causing the startup
process to hit the disk for WAL reads.

Regards,
Bharath Rupireddy.

#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Bharath Rupireddy (#10)
Re: Use fadvise in wal replay

On Tue, Jun 21, 2022 at 5:41 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Tue, Jun 21, 2022 at 4:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 21, 2022 at 3:18 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 21 Jun 2022, at 12:35, Amit Kapila <amit.kapila16@gmail.com> wrote:

I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
help your case?

AFAICS recovery_prefetch tries to prefetch the main fork, but does not try to prefetch the WAL itself before reading it. Kirill is trying to solve the problem of reading WAL segments that are out of the OS page cache.

Okay, but normally the WAL written by walreceiver is read by the
startup process soon after it's written, as indicated in the code comments
(get_sync_bit()). So, what is causing the delay here that makes the
startup process perform physical reads?

That's not always true. If there's a huge apply lag, restartpoints are
infrequent/frequent, or there are many reads on the standby - in all of
these cases the OS can evict the WAL from its cache, causing the startup
process to hit the disk for WAL reads.

It is possible that due to one or more of these reasons the startup process
has to physically read the WAL. I think it is better to find out what
is going on for the OP. AFAICS, there is no mention of any other kind
of reads on the problematic standby. As per the analysis shared in the
initial email, the replication lag is due to disk reads, but there
doesn't seem to be a very clear theory as to why the OP is seeing them.

--
With Regards,
Amit Kapila.

#12Jakub Wartak
Jakub.Wartak@tomtom.com
In reply to: Thomas Munro (#7)
RE: Use fadvise in wal replay

On Tue, Jun 21, 2022 at 10:33 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:

Maybe the important question is why the readahead mechanism would be disabled in the first place via /sys | blockdev?

Because the database should know better than the OS which data needs to be prefetched and which does not. A big OS readahead setting hurts index scan performance.

OK, fair point; however, the patch here adds 1 syscall per XLOG_BLCKSZ, which is not cheap either. The code is already hot, and there is an example from the past where syscalls were limiting performance [1]. Maybe it could prefetch in larger batches (128 kB? 1 MB? 16 MB?)?

I've always thought we'd want to tell it about the *next* segment file, to
smooth the transition from one file to the next, something like the attached (not
tested).

Hey Thomas!

Apparently that theory is false. Redo-bench [1] results (1st column is total recovery time in seconds; 3.1 GB pgdata, of which 2.6 GB is pg_wal/166 files). Redo-bench was slightly hacked to always drop fs caches after copying, so that there is nothing in the fs cache (neither pgdata nor pg_wal; shared fs). maintenance_io_concurrency is at default (10), recovery_prefetch likewise (try; on by default):

master, default Linux readahead (128kb):
33.979, 0.478
35.137, 0.504
34.649, 0.518

master, blockdev --setra 0 /dev/nvme0n1:
53.151, 0.603
58.329, 0.525
52.435, 0.536

master, with your patch (readahead disabled) -- double-checked, the calls to fadvise64(offset=0 len=0) were there:
58.063, 0.593
51.369, 0.574
51.716, 0.59

master, with Kirill's original patch (readaheads disabled)
38.25, 1.134
36.563, 0.582
37.711, 0.584

I've also noted that in both cases POSIX_FADV_SEQUENTIAL is being used instead of WILLNEED (?).
I haven't quantified the tradeoff of master vs. Kirill's patch with readahead enabled, but I think that 1 additional syscall is not going to be cheap just for non-standard OS configurations (?)

-J.

[1]: https://github.com/macdice/redo-bench

#13Andrey Borodin
amborodin@acm.org
In reply to: Jakub Wartak (#12)
Re: Use fadvise in wal replay

On 21 Jun 2022, at 16:59, Jakub Wartak <jakub.wartak@tomtom.com> wrote:

Oh, wow, your benchmarks show really impressive improvement.

I think that 1 additional syscall is not going to be cheap just for non-standard OS configurations

Also, we can reduce the number of syscalls with something like:

#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
	if ((readOff % (8 * XLOG_BLCKSZ)) == 0)
		posix_fadvise(readFile, readOff + XLOG_BLCKSZ,
					  XLOG_BLCKSZ * 8, POSIX_FADV_WILLNEED);
#endif

and maybe define or reuse some GUC to control the number of pages prefetched at once.
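As a worked check of the batching arithmetic above (a standalone Python illustration with assumed default sizes, not PostgreSQL code): issuing one WILLNEED hint per 8 blocks cuts the syscall count per 16 MB segment from 2048 (one per block) to 256.

```python
# How many fadvise calls per segment does the "% (8 * XLOG_BLCKSZ) == 0"
# batching condition produce, versus one call per block?
XLOG_BLCKSZ = 8192                 # WAL block size (default)
SEG_SIZE = 16 * 1024 * 1024        # wal_segment_size (default)

block_offsets = range(0, SEG_SIZE, XLOG_BLCKSZ)
hints = [off for off in block_offsets if off % (8 * XLOG_BLCKSZ) == 0]

print(len(list(block_offsets)))    # -> 2048 reads per segment
print(len(hints))                  # -> 256 fadvise calls per segment
```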

Best regards, Andrey Borodin.

#14Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Andrey Borodin (#13)
Re: Use fadvise in wal replay

On 21 Jun 2022, at 16:59, Jakub Wartak <jakub.wartak@tomtom.com> wrote:

Oh, wow, your benchmarks show really impressive improvement.

FWIW I was trying to speed up long sequential file reads in Postgres using
fadvise hints. I've found no detectable improvements.
Then I've written a 1 MB - 1 GB sequential read test with both fadvise
POSIX_FADV_WILLNEED and POSIX_FADV_SEQUENTIAL in Linux. The only
improvements I've found were:

1. when the size of the read was around several MB and the fadvise len was
also around several MB;
2. when there was a delay between the fadvise and the first read (which was
supposedly used by the OS for reading into the prefetch buffer);
3. if I read sequential blocks, I saw a speedup only on the first ones; the
overall read speed of, say, a 1 GB file remained unchanged no matter what.

I became convinced that if I read something long, the OS does the necessary
speedups automatically (which is also in agreement with the fadvise
manual/code comments).
Could you please elaborate on how you got results with that big a
difference? (Though I'm not against fadvise usage; at worst it is expected
to be useless.)

--
Best regards,
Pavel Borisov

Postgres Professional: http://www.postgrespro.com

#15Andrey Borodin
amborodin@acm.org
In reply to: Pavel Borisov (#14)
Re: Use fadvise in wal replay

On 21 Jun 2022, at 20:52, Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On 21 Jun 2022, at 16:59, Jakub Wartak <jakub.wartak@tomtom.com> wrote:

Oh, wow, your benchmarks show really impressive improvement.

FWIW I was trying to speed up long sequential file reads in Postgres using fadvise hints. I've found no detectable improvements.
Then I've written a 1 MB - 1 GB sequential read test with both fadvise POSIX_FADV_WILLNEED and POSIX_FADV_SEQUENTIAL in Linux.

Did you drop caches?

The only improvement I've found was

1. when the size of the read was around several MB and the fadvise len was also around several MB;
2. when there was a delay between the fadvise and the first read (which was supposedly used by the OS for reading into the prefetch buffer).

That's the case for the startup process: you read an xlog page, then redo the records from this page.

3. If I read sequential blocks, I saw a speedup only on the first ones. The overall read speed of, say, a 1 GB file remained unchanged no matter what.

I became convinced that if I read something long, the OS does the necessary speedups automatically (which is also in agreement with the fadvise manual/code comments).
Could you please elaborate on how you got results with that big a difference? (Though I'm not against fadvise usage; at worst it is expected to be useless.)

FWIW, Kirill and I observed drastically reduced lag on a production server when running the patched version. Fadvise surely works :) The question is how to use it optimally.

Best regards, Andrey Borodin.

#16Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Andrey Borodin (#15)
Re: Use fadvise in wal replay

On Wed, Jun 22, 2022 at 2:07 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 21 Jun 2022, at 20:52, Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On 21 Jun 2022, at 16:59, Jakub Wartak <jakub.wartak@tomtom.com> wrote:

Oh, wow, your benchmarks show really impressive improvement.

FWIW I was trying to speed up long sequential file reads in Postgres using fadvise hints. I've found no detectable improvements.
Then I've written a 1 MB - 1 GB sequential read test with both fadvise POSIX_FADV_WILLNEED and POSIX_FADV_SEQUENTIAL in Linux.

Did you drop caches?

Yes. I saw that nothing changed the read speed of a long (50 MB+) file.

The only improvement I've found was

1. when the size of the read was around several MB and the fadvise len was also around several MB;
2. when there was a delay between the fadvise and the first read (which was supposedly used by the OS for reading into the prefetch buffer).

That's the case for the startup process: you read an xlog page, then redo
the records from this page.

Then I'd guess that your speedup comes from speeding up the first several
MB of each of the many files opened (and the delay needed for kernel
prefetch happens for some other reason). That differs from the case I
tried to measure, and it could be the cause of the speedup in your case.

--
Best regards,
Pavel Borisov

Postgres Professional: http://www.postgrespro.com

#17Andrey Borodin
amborodin@acm.org
In reply to: Pavel Borisov (#16)
Re: Use fadvise in wal replay

On 22 Jun 2022, at 13:26, Pavel Borisov <pashkin.elfe@gmail.com> wrote:

Then I'd guess that your speedup comes from speeding up the first several MB of each of the many files opened

I think in this case Thomas's approach of prefetching the next WAL segment would do better. But Jakub observed the opposite results.

Best regards, Andrey Borodin.

#18Jakub Wartak
Jakub.Wartak@tomtom.com
In reply to: Andrey Borodin (#13)
RE: Use fadvise in wal replay

On 21 Jun 2022, at 16:59, Jakub Wartak <jakub.wartak@tomtom.com> wrote:

Oh, wow, your benchmarks show really impressive improvement.

I think that 1 additional syscall is not going to be cheap just for
non-standard OS configurations

Also, we can reduce the number of syscalls with something like:

#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
	if ((readOff % (8 * XLOG_BLCKSZ)) == 0)
		posix_fadvise(readFile, readOff + XLOG_BLCKSZ,
					  XLOG_BLCKSZ * 8, POSIX_FADV_WILLNEED);
#endif

and maybe define or reuse some GUC to control the number of pages prefetched
at once.

Hi, I was thinking the same, so I got the patch (attached) to the point where it achieves identical performance with and without readahead enabled:

baseline, master, default Linux readahead (128kb):
33.979, 0.478
35.137, 0.504
34.649, 0.518

master+patched, readahead disabled:
34.338, 0.528
34.568, 0.575
34.007, 1.136

master+patched, readahead enabled (as default):
33.935, 0.523
34.109, 0.501
33.408, 0.557

Thoughts?

Notes:
- no GUC, as the default/identical value seems to be the best
- POSIX_FADV_SEQUENTIAL is apparently much slower and doesn't seem to have any effect from xlogreader.c at all, while _WILLNEED does (testing once again contradicts "common wisdom"?)

-J.

Attachments:

0001-Use-fadvise-to-prefetch-WAL-in-xlogrecovery.patch (application/octet-stream, +16 -1)
#19Andrey Borodin
amborodin@acm.org
In reply to: Jakub Wartak (#18)
Re: Use fadvise in wal replay

On 23 June 2022, at 13:50, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:

Thoughts?

The patch leaves the 1st 128 KB chunk unprefetched. Is it worth adding an extra branch to prefetch the 120 KB after the 1st block when readOff == 0?
Or maybe do
+		posix_fadvise(readFile, readOff + XLOG_BLCKSZ, RACHUNK, POSIX_FADV_WILLNEED);
instead of
+		posix_fadvise(readFile, readOff + RACHUNK, RACHUNK, POSIX_FADV_WILLNEED);
?

Notes:
- no GUC, as the default/identical value seems to be the best

I think this performance boost on most systems is definitely worth 1 syscall per 16 pages. And I believe 128 KB to be optimal for most storage. And having no GUCs sounds great.

But storage systems might be different, far beyond our benchmarks.
All in all, I don't have a strong opinion on having 1 or 0 GUCs to configure this.

I've added the patch to the CF.

Thanks!

Best regards, Andrey Borodin.

#20Jakub Wartak
Jakub.Wartak@tomtom.com
In reply to: Andrey Borodin (#19)
RE: Use fadvise in wal replay

Hey Andrey,

On 23 June 2022, at 13:50, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:

Thoughts?

The patch leaves the 1st 128 KB chunk unprefetched. Is it worth adding an extra branch to prefetch the 120 KB after the 1st block when readOff == 0?
Or maybe do
+		posix_fadvise(readFile, readOff + XLOG_BLCKSZ, RACHUNK, POSIX_FADV_WILLNEED);
instead of
+		posix_fadvise(readFile, readOff + RACHUNK, RACHUNK, POSIX_FADV_WILLNEED);
?

Notes:
- no GUC, as the default/identical value seems to be the best

I think this performance boost on most systems is definitely worth 1 syscall per 16 pages. And I believe 128 KB to be optimal for most storage. And having no GUCs sounds great.

But storage systems might be different, far beyond our benchmarks.
All in all, I don't have a strong opinion on having 1 or 0 GUCs to configure this.

I've added the patch to the CF.

Cool. As for a GUC, I'm afraid there's going to be resistance to adding yet another one (to avoid too many knobs). Ideally it would be nice if we had some advanced/deep/hidden parameters, but there is no such thing.
Maybe another option would be to use (N * maintenance_io_concurrency * XLOG_BLCKSZ), so with N=1 that's 80 kB and with N=2, 160 kB (pretty close to the default value, and still tweakable by the end user). Let's wait and see what others say?
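The arithmetic behind that sizing suggestion, spelled out (assuming the defaults: maintenance_io_concurrency = 10, XLOG_BLCKSZ = 8 kB):

```python
# Prefetch chunk size as N * maintenance_io_concurrency * XLOG_BLCKSZ,
# using PostgreSQL's default values for both parameters.
XLOG_BLCKSZ = 8192                 # bytes
maintenance_io_concurrency = 10    # default GUC value

for n in (1, 2):
    chunk_kb = n * maintenance_io_concurrency * XLOG_BLCKSZ // 1024
    print(f"N={n}: {chunk_kb} kB")  # N=1: 80 kB, N=2: 160 kB
```

So N=2 lands near the 128 kB value the earlier benchmarks used, while still scaling with the user-tunable maintenance_io_concurrency.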

-J.

#21Justin Pryzby
pryzby@telsasoft.com
In reply to: Jakub Wartak (#20)
#22Andrey Borodin
amborodin@acm.org
In reply to: Jakub Wartak (#18)
#23Robert Haas
robertmhaas@gmail.com
In reply to: Jakub Wartak (#20)
#24Andrey Borodin
amborodin@acm.org
In reply to: Robert Haas (#23)
#25Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andrey Borodin (#24)
#26Andrey Borodin
amborodin@acm.org
In reply to: Bharath Rupireddy (#25)
#27Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andrey Borodin (#26)
#28Andrey Borodin
amborodin@acm.org
In reply to: Bharath Rupireddy (#27)
#29Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#28)
#30Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Andrey Borodin (#29)
#31Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#30)
#32Andrey Borodin
amborodin@acm.org
In reply to: Pavel Borisov (#30)
#33Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andrey Borodin (#32)
#34Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#33)
#35Gregory Stark (as CFM)
stark.cfm@gmail.com
In reply to: Andres Freund (#34)