Patch to allow disabling of WAL recycling
Hello All,
Attached is a patch to provide an option to disable WAL recycling. We have
found that this can help performance by eliminating read-modify-write
behavior on old WAL files that are no longer resident in the filesystem
cache. There is a lot more detail on the background and motivation for
this in the following thread.
/messages/by-id/CACukRjO7DJvub8e2AijOayj8BfKK3XXBTwu3KKARiTr67M3E3w@mail.gmail.com
A similar change has been tested against our 9.6 branch that we're
currently running, but the attached patch is against master.
Thanks,
Jerry
Attachment: 0001-option-to-disable-WAL-recycling.patch
On 26.06.18 15:35, Jerry Jelinek wrote:
Attached is a patch to provide an option to disable WAL recycling. We
have found that this can help performance by eliminating
read-modify-write behavior on old WAL files that are no longer resident
in the filesystem cache. There is a lot more detail on the background and
motivation for this in the following thread.
Your patch describes this feature as a performance feature. We would
need to see more measurements about what this would do on other
platforms and file systems than your particular one. Also, we need to
be careful with user options that trade off reliability for performance
and describe them in much more detail.
If the problem is specifically the file system caching behavior, then we
could also consider using the dreaded posix_fadvise().
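For illustration, the posix_fadvise idea could look roughly like the sketch below. This uses Python's wrapper around the same syscall purely for brevity (PostgreSQL itself would do this in C), and the helper name is hypothetical:

```python
# Sketch only: asking the kernel to prefetch a WAL segment before it is
# reused, so a later overwrite is not a synchronous read-modify-write.
# The helper name is hypothetical; PostgreSQL's real code is C.
import os

def prefetch_segment(path: str) -> None:
    """Hint that the whole file will be accessed soon, so the kernel
    may start reading it into the page cache asynchronously."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, len=0 means "from start to EOF"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```

Whether this actually avoids the stall depends on the hint being issued early enough for the prefetch to complete before the segment is reused.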
Then again, I can understand that turning off WAL recycling is sensible
on ZFS, since there is no point in preallocating space that will never
be used. But then we should also turn off all other preallocation of
WAL files, including the creation of new (non-recycled) ones.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter,
Thanks for taking a look at this. I have a few responses in line. I am not a
PG expert, so if there is something here that I've misunderstood, please
let me know.
On Sun, Jul 1, 2018 at 6:54 AM, Peter Eisentraut <
peter.eisentraut@2ndquadrant.com> wrote:
On 26.06.18 15:35, Jerry Jelinek wrote:
Attached is a patch to provide an option to disable WAL recycling. We
have found that this can help performance by eliminating
read-modify-write behavior on old WAL files that are no longer resident
in the filesystem cache. There is a lot more detail on the background and
motivation for this in the following thread.

Your patch describes this feature as a performance feature. We would
need to see more measurements about what this would do on other
platforms and file systems than your particular one. Also, we need to
be careful with user options that trade off reliability for performance
and describe them in much more detail.
I don't think this change really impacts the reliability of PG, since PG
doesn't actually preallocate all of the WAL files. I think PG will allocate
WAL files as it runs, up to the wal_keep_segments limit, at which point it
would start recycling. If the filesystem fills up before that limit is
reached, PG would have to handle the filesystem being full when attempting
to allocate a new WAL file (as it would with my change if WAL recycling is
disabled). Of course once all of the WAL files have finally been allocated,
then PG won't need additional space on a non-COW filesystem. I'd be happy
to add more details to the man page change describing this new option and
the implications if the underlying filesystem fills up.
If the problem is specifically the file system caching behavior, then we
could also consider using the dreaded posix_fadvise().
I'm not sure that solves the problem for non-cached files, which is where
we've observed the performance impact of recycling, where what should be a
write intensive workload turns into a read-modify-write workload because
we're now reading an old WAL file that is many hours, or even days, old and
has thus fallen out of the memory-cached data for the filesystem. The disk
reads still have to happen.
Then again, I can understand that turning off WAL recycling is sensible
on ZFS, since there is no point in preallocating space that will never
be used. But then we should also turn off all other preallocation of
WAL files, including the creation of new (non-recycled) ones.
I don't think we'd see any benefit from that (since the newly allocated
file is certainly cached), and the change would be much more intrusive, so
I opted for the trivial change in the patch I proposed.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thanks again,
Jerry
On 05.07.18 17:37, Jerry Jelinek wrote:
Your patch describes this feature as a performance feature. We would
need to see more measurements about what this would do on other
platforms and file systems than your particular one. Also, we need to
be careful with user options that trade off reliability for performance
and describe them in much more detail.

I don't think this change really impacts the reliability of PG, since PG
doesn't actually preallocate all of the WAL files. I think PG will
allocate WAL files as it runs, up to the wal_keep_segments limit, at
which point it would start recycling. If the filesystem fills up before
that limit is reached, PG would have to handle the filesystem being full
when attempting to allocate a new WAL file (as it would with my change
if WAL recycling is disabled). Of course once all of the WAL files have
finally been allocated, then PG won't need additional space on a non-COW
filesystem. I'd be happy to add more details to the man page change
describing this new option and the implications if the underlying
filesystem fills up.
The point is, the WAL recycling has a purpose, perhaps several. If it
didn't have one, we wouldn't do it. So if we add an option to turn it
off to get performance gains, we have to do some homework.
If the problem is specifically the file system caching behavior, then we
could also consider using the dreaded posix_fadvise().

I'm not sure that solves the problem for non-cached files, which is
where we've observed the performance impact of recycling, where what
should be a write intensive workload turns into a read-modify-write
workload because we're now reading an old WAL file that is many hours,
or even days, old and has thus fallen out of the memory-cached data for
the filesystem. The disk reads still have to happen.
But they could happen ahead of time.
Then again, I can understand that turning off WAL recycling is sensible
on ZFS, since there is no point in preallocating space that will never
be used. But then we should also turn off all other preallocation of
WAL files, including the creation of new (non-recycled) ones.

I don't think we'd see any benefit from that (since the newly allocated
file is certainly cached), and the change would be much more intrusive,
so I opted for the trivial change in the patch I proposed.
The change would be more invasive, but I think it would ultimately make
the code more clear and maintainable and the user interfaces more
understandable in the long run. I think that would be better than a
slightly ad hoc knob that fixed one particular workload once upon a time.
But we're probably not there yet. We should start with a more detailed
performance analysis of the originally proposed patch.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2018-06-26 07:35:57 -0600, Jerry Jelinek wrote:
+     <varlistentry id="guc-wal-recycle" xreflabel="wal_recycle">
+      <term><varname>wal_recycle</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>wal_recycle</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When this parameter is <literal>on</literal>, past log file segments
+        in the <filename>pg_wal</filename> directory are recycled for future
+        use.
+       </para>
+
+       <para>
+        Turning this parameter off causes past log file segments to be deleted
+        when no longer needed. This can improve performance by eliminating
+        read-modify-write operations on old files which are no longer in the
+        filesystem cache.
+       </para>
+      </listitem>
+     </varlistentry>
This is formulated *WAY* too positively. It'll have a dramatic *NEGATIVE*
performance impact on non-COW filesystems, and very likely even negative
impacts in a number of COW scenarios (when there's enough memory to
keep all WAL files in memory).
I still think that fixing this another way would be preferable. This'll
be too much of a magic knob that depends on the fs, hardware and
workload.
Greetings,
Andres Freund
On Fri, Jul 6, 2018 at 3:37 AM, Jerry Jelinek <jerry.jelinek@joyent.com>
wrote:
If the problem is specifically the file system caching behavior, then we
could also consider using the dreaded posix_fadvise().

I'm not sure that solves the problem for non-cached files, which is where
we've observed the performance impact of recycling, where what should be a
write intensive workload turns into a read-modify-write workload because
we're now reading an old WAL file that is many hours, or even days, old
and
has thus fallen out of the memory-cached data for the filesystem. The disk
reads still have to happen.
What ZFS record size are you using? PostgreSQL's XLOG_BLCKSZ is usually
8192 bytes. When XLogWrite() calls write(some multiple of XLOG_BLCKSZ), on
a traditional filesystem the kernel will say 'oh, that's overwriting whole
pages exactly, so I have no need to read it from disk' (for example in
FreeBSD ffs_vnops.c ffs_write() see the comment "We must perform a
read-before-write if the transfer size does not cover the entire buffer").
I assume ZFS has a similar optimisation, but it uses much larger records
than the traditional 4096 byte pages, defaulting to 128KB. Is that the
reason for this?
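The whole-record-overwrite condition described above can be sketched as a simple calculation (the helper is hypothetical; the real check lives inside the filesystem, e.g. in ffs_write):

```python
# Sketch of the read-before-write condition: a filesystem can skip reading
# a record from disk only when a write covers the record entirely.
# Helper name is hypothetical; real logic is in the kernel.

def needs_read_before_write(offset: int, length: int, record_size: int) -> bool:
    """True if a write of `length` bytes at `offset` only partially covers
    some filesystem record, forcing a read-modify-write of that record."""
    return offset % record_size != 0 or length % record_size != 0

XLOG_BLCKSZ = 8192

# With an 8 KB recordsize, PostgreSQL's 8 KB-aligned WAL writes cover
# whole records, so no read is required:
print(needs_read_before_write(0, XLOG_BLCKSZ, 8192))        # False

# With the 128 KB default recordsize, the same write covers only part of
# a record, so the old record must be read in first:
print(needs_read_before_write(0, XLOG_BLCKSZ, 128 * 1024))  # True
```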
--
Thomas Munro
http://www.enterprisedb.com
Thomas,
We're using a zfs recordsize of 8k to match the PG blocksize of 8k, so what
you're describing is not the issue here.
Thanks,
Jerry
Thanks to everyone who took the time to look at the patch and send me
feedback. I'm happy to work on improving the documentation of this new
tunable to clarify when it should be used and the implications. I'm trying
to understand more specifically what else needs to be done next. To
summarize, I think the following general concerns were brought up.
1) Disabling WAL recycling could have a negative performance impact on a
COW filesystem if all WAL files could be kept in the filesystem cache.
2) Disabling WAL recycling reduces reliability, even on COW filesystems.
3) Using something like posix_fadvise to reload recycled WAL files into the
filesystem cache is better even for a COW filesystem.
4) There are "several" other purposes for WAL recycling which this tunable
would impact.
5) A WAL recycling tunable is too specific and a more general solution is
needed.
6) Need more performance data.
For #1, #2 and #3, I don't understand these concerns. It would be helpful
if these could be made more specific.
For #4, can anybody enumerate these other purposes for WAL recycling?
For #5, perhaps I am making an incorrect assumption about what the original
response was requesting, but I understand that WAL recycling is just one
aspect of WAL file creation/allocation. However, the creation of a new WAL
file is not a problem we've ever observed. In general, any modern
filesystem should do a good job of caching recently accessed files. We've
never observed a problem with the allocation of a new WAL file slightly
before it is needed. The problem we have observed is specifically around
WAL file recycling when we have to access old files that are long gone from
the filesystem cache. The semantics around recycling seem pretty crisp as
compared to some other tunable which would completely change how WAL files
are created. Given that a change like that is also much more intrusive, it
seems better to provide a tunable to disable WAL recycling vs. some other
kind of tunable for which we can't articulate any improvement except in the
recycling scenario.
For #6, there is no feasible way for us to recreate our workload on other
operating systems or filesystems. Can anyone expand on what performance
data is needed?
I'd like to restate the original problem we observed.
When PostgreSQL decides to reuse an old WAL file whose contents have been
evicted from the cache (because they haven't been used in hours), this
turns what should be a workload bottlenecked by synchronous write
performance (that can be well-optimized with an SSD log device) into a
random read workload (that's much more expensive for any system). What's
significantly worse is that we saw this on synchronous standbys. When that
happened, the WAL receiver was blocked on a random read from disk, and
since it's single-threaded, all write queries on the primary stop until the
random read finishes. This is particularly bad for us when the sync is
doing other I/O (e.g., for an autovacuum or a database backup) that causes
disk reads to take hundreds of milliseconds.
To summarize, recycling old WAL files seems like an optimization designed
for certain filesystems that allocate disk blocks up front. Given that the
existing behavior is already filesystem-specific, are there specific reasons
why we can't provide a tunable to disable this behavior for filesystems
which don't behave that way?
Thanks again,
Jerry
On 07/10/2018 01:15 PM, Jerry Jelinek wrote:
Thanks to everyone who took the time to look at the patch and send me
feedback. I'm happy to work on improving the documentation of this
new tunable to clarify when it should be used and the implications.
I'm trying to understand more specifically what else needs to be done
next. To summarize, I think the following general concerns were
brought up.

For #6, there is no feasible way for us to recreate our workload on
other operating systems or filesystems. Can anyone expand on what
performance data is needed?
I think a simple way to prove this would be to run BenchmarkSQL against
PostgreSQL in a default configuration with pg_xlog/pg_wal on a
filesystem that is COW (zfs) and then run another test where
pg_xlog/pg_wal is patched with your patch and new behavior and then run
the test again. BenchmarkSQL is a more thorough benchmarking tool than
something like pgbench and is very easy to set up.
The reason you would use a default configuration is because it will
cause a huge amount of wal churn, although a test with a proper wal
configuration would also be good.
Thanks,
JD
--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
*** A fault and talent of mine is to tell it exactly how it is. ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
***** Unless otherwise stated, opinions are my own. *****
On 2018-Jul-10, Jerry Jelinek wrote:
2) Disabling WAL recycling reduces reliability, even on COW filesystems.
I think the problem here is that WAL recycling in normal filesystems
helps protect the case where filesystem gets full. If you remove it,
that protection goes out the window. You can claim that people need to
make sure to have available disk space, but this does become a problem
in practice. I think the thing to do is verify what happens with
recycling off when the disk gets full; is it possible to recover
afterwards? Is there any corrupt data? What happens if the disk gets
full just as the new WAL file is being created -- is there a Postgres
PANIC or something? As I understand, with recycling on it is easy (?)
to recover, there is no PANIC crash, and no data corruption results.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 11, 2018 at 8:25 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
On 07/10/2018 01:15 PM, Jerry Jelinek wrote:
Thanks to everyone who took the time to look at the patch and send me
feedback. I'm happy to work on improving the documentation of this new
tunable to clarify when it should be used and the implications. I'm trying
to understand more specifically what else needs to be done next. To
summarize, I think the following general concerns were brought up.

For #6, there is no feasible way for us to recreate our workload on other
operating systems or filesystems. Can anyone expand on what performance data
is needed?

I think a simple way to prove this would be to run BenchmarkSQL against
PostgreSQL in a default configuration with pg_xlog/pg_wal on a filesystem
that is COW (zfs) and then run another test where pg_xlog/pg_wal is patched
with your patch and new behavior and then run the test again. BenchmarkSQL
is a more thorough benchmarking tool than something like pgbench and is
very easy to set up.
I have a lowly but trusty HP Microserver running FreeBSD 11.2 with ZFS
on spinning rust. It occurred to me that such an anaemic machine
might show this effect easily because its cold reads are as slow as a
Lada full of elephants going uphill. Let's see...
# os setup
sysctl vfs.zfs.arc_min=134217728
sysctl vfs.zfs.arc_max=134217728
zfs create zroot/data/test
zfs set mountpoint=/data/test zroot/data/test
zfs set compression=off zroot/data/test
zfs set recordsize=8192 zroot/data/test
# initdb into /data/test/pgdata, then set postgresql.conf up like this:
fsync=off
max_wal_size = 600MB
min_wal_size = 600MB
# small scale test, we're only interested in producing WAL, not db size
pgbench -i -s 100 postgres
# do this a few times first, to make sure we have lots of WAL segments
pgbench -M prepared -c 4 -j 4 -T 60 postgres
# now test...
With wal_recycle=on I reliably get around 1100TPS and vmstat -w 10
shows numbers like this:
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr ad0 ad1 in sy cs us sy id
3 0 3 1.2G 3.1G 4496 0 0 0 52 76 144 138 607 84107 29713 55 17 28
4 0 3 1.2G 3.1G 2955 0 0 0 84 77 134 130 609 82942 34324 61 17 22
4 0 3 1.2G 3.1G 2327 0 0 0 0 77 114 125 454 83157 29638 68 15 18
5 0 3 1.2G 3.1G 1966 0 0 0 82 77 86 81 335 84480 25077 74 13 12
3 0 3 1.2G 3.1G 1793 0 0 0 533 74 72 68 310 127890 31370 77 16 7
4 0 3 1.2G 3.1G 1113 0 0 0 151 73 95 94 363 128302 29827 74 18 8
With wal_recycle=off I reliably get around 1600TPS and vmstat -w 10
shows numbers like this:
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr ad0 ad1 in sy cs us sy id
0 0 3 1.2G 3.1G 148 0 0 0 402 71 38 38 153 16668 5656 10 3 87
5 0 3 1.2G 3.1G 4527 0 0 0 50 73 28 27 123 123986 23373 68 15 17
5 0 3 1.2G 3.1G 3036 0 0 0 151 73 47 49 181 148014 29412 83 16 0
4 0 3 1.2G 3.1G 2063 0 0 0 233 73 56 54 200 143018 28699 81 17 2
4 0 3 1.2G 3.1G 1202 0 0 0 95 73 48 49 189 147276 29196 81 18 1
4 0 3 1.2G 3.1G 732 0 0 0 0 73 56 55 207 146805 29265 82 17 1
I don't have time to investigate further for now and my knowledge of
ZFS is superficial, but the patch seems to have a clear beneficial
effect, reducing disk IOs and page faults on my little storage box.
Obviously this isn't representative of a proper server environment, or
some other OS, but it's a clue. That surprised me... I was quietly
hoping it was going to be 'oh, if you turn off
compression and use 8kb it doesn't happen because the pages line up'.
But nope.
--
Thomas Munro
http://www.enterprisedb.com
Alvaro,
I'll perform several test runs with various combinations and post the
results.
Thanks,
Jerry
On Tue, Jul 10, 2018 at 1:34 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
On 2018-Jul-10, Jerry Jelinek wrote:
2) Disabling WAL recycling reduces reliability, even on COW filesystems.
I think the problem here is that WAL recycling in normal filesystems
helps protect the case where filesystem gets full. If you remove it,
that protection goes out the window. You can claim that people need to
make sure to have available disk space, but this does become a problem
in practice. I think the thing to do is verify what happens with
recycling off when the disk gets full; is it possible to recover
afterwards? Is there any corrupt data? What happens if the disk gets
full just as the new WAL file is being created -- is there a Postgres
PANIC or something? As I understand, with recycling on it is easy (?)
to recover, there is no PANIC crash, and no data corruption results.
If the result of hitting ENOSPC when creating or writing to a WAL file was
that the database could become corrupted, then wouldn't that risk already
be present (a) on any system, for the whole period from database init until
the maximum number of WAL files was created, and (b) all the time on any
copy-on-write filesystem?
Thanks,
Dave
Hi,
On 2018-07-10 14:15:30 -0600, Jerry Jelinek wrote:
Thanks to everyone who took the time to look at the patch and send me
feedback. I'm happy to work on improving the documentation of this new
tunable to clarify when it should be used and the implications. I'm trying
to understand more specifically what else needs to be done next. To
summarize, I think the following general concerns were brought up.

1) Disabling WAL recycling could have a negative performance impact on a
COW filesystem if all WAL files could be kept in the filesystem cache.
For #1, #2 and #3, I don't understand these concerns. It would be helpful
if these could be made more specific.
We perform more writes (new files are zeroed, which needs to be
fsynced), and increase metadata traffic (creation of files), when not
recycling.
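As a rough sketch of the cost difference being described here (the names and the 16 MB segment size are illustrative; PostgreSQL's actual logic is the C code around XLogFileInit), creating a fresh segment means writing and fsyncing a full segment of zeroes plus file-creation metadata, while recycling is just a rename:

```python
# Illustrative sketch only: cost of creating a new WAL segment vs.
# recycling an old one. Not PostgreSQL's actual implementation (that is C).
import os

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # common default segment size

def create_new_segment(path: str) -> None:
    """Without recycling: zero-fill a brand-new file and fsync it,
    paying full write bandwidth plus metadata (file creation) traffic."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        block = b"\0" * 8192
        for _ in range(WAL_SEGMENT_SIZE // 8192):
            os.write(fd, block)
        os.fsync(fd)
    finally:
        os.close(fd)

def recycle_segment(old_path: str, new_path: str) -> None:
    """With recycling: rename an already-allocated old segment into place;
    no data is written, only a cheap metadata operation."""
    os.rename(old_path, new_path)
```

On a non-COW filesystem the rename path clearly avoids a lot of I/O; the question in this thread is whether that still holds on COW filesystems where the old segment's blocks cannot be overwritten in place anyway.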
Regards,
Andres
Hi Thomas,
Thanks for testing! It's validating that you saw the same results.
-- Dave
On 07/12/2018 02:25 AM, David Pacheco wrote:
On Tue, Jul 10, 2018 at 1:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

On 2018-Jul-10, Jerry Jelinek wrote:
2) Disabling WAL recycling reduces reliability, even on COW filesystems.
I think the problem here is that WAL recycling in normal filesystems
helps protect the case where filesystem gets full. If you remove it,
that protection goes out the window. You can claim that people need to
make sure to have available disk space, but this does become a problem
in practice. I think the thing to do is verify what happens with
recycling off when the disk gets full; is it possible to recover
afterwards? Is there any corrupt data? What happens if the disk gets
full just as the new WAL file is being created -- is there a Postgres
PANIC or something? As I understand, with recycling on it is easy (?)
to recover, there is no PANIC crash, and no data corruption results.

If the result of hitting ENOSPC when creating or writing to a WAL file
was that the database could become corrupted, then wouldn't that risk
already be present (a) on any system, for the whole period from database
init until the maximum number of WAL files was created, and (b) all the
time on any copy-on-write filesystem?
I don't follow Alvaro's reasoning, TBH. There's a couple of things that
confuse me ...
I don't quite see how reusing WAL segments actually protects against a
full filesystem. On "traditional" filesystems I would not expect any
difference between "unlink+create" and reusing an existing file. On CoW
filesystems (like ZFS or btrfs) the space management works very
differently and reusing an existing file is unlikely to save anything.
But even if it reduces the likelihood of ENOSPC, it does not eliminate
it entirely. max_wal_size is not a hard limit, and the disk may be
filled by something else (when WAL is not on a separate device, when
there is thin provisioning, etc.). So it's not a protection against
data corruption we could rely on. (And as was discussed in the recent
fsync thread, ENOSPC is a likely source of past data corruption issues
on NFS and possibly other filesystems.)
I might be missing something, of course.
AFAICS the original reason for reusing WAL segments was the belief that
overwriting an existing file is faster than writing a new file. That
might have been true in the past, but the question is if it's still true
on current filesystems. The results posted here suggest it's not true on
ZFS, at least.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I was asked to perform two different tests:
1) A benchmarksql run with WAL recycling on and then off, for comparison
2) A test when the filesystem fills up
For #1, I did two 15 minute benchmarksql runs and here are the results.
wal_recycle=on
--------------
Term-00, Running Average tpmTOTAL: 299.84 Current tpmTOTAL: 29412
14:49:02,470 [Thread-1] INFO jTPCC : Term-00,
14:49:02,470 [Thread-1] INFO jTPCC : Term-00,
14:49:02,471 [Thread-1] INFO jTPCC : Term-00, Measured tpmC (NewOrders) =
136.49
14:49:02,471 [Thread-1] INFO jTPCC : Term-00, Measured tpmTOTAL = 299.78
14:49:02,471 [Thread-1] INFO jTPCC : Term-00, Session Start =
2018-07-12 14:34:02
14:49:02,471 [Thread-1] INFO jTPCC : Term-00, Session End =
2018-07-12 14:49:02
14:49:02,471 [Thread-1] INFO jTPCC : Term-00, Transaction Count = 4497
wal_recycle=off
---------------
Term-00, Running Average tpmTOTAL: 299.85 Current tpmTOTAL: 29520
15:10:15,712 [Thread-1] INFO jTPCC : Term-00,
15:10:15,712 [Thread-1] INFO jTPCC : Term-00,
15:10:15,713 [Thread-1] INFO jTPCC : Term-00, Measured tpmC (NewOrders) =
135.89
15:10:15,713 [Thread-1] INFO jTPCC : Term-00, Measured tpmTOTAL = 299.79
15:10:15,713 [Thread-1] INFO jTPCC : Term-00, Session Start =
2018-07-12 14:55:15
15:10:15,713 [Thread-1] INFO jTPCC : Term-00, Session End =
2018-07-12 15:10:15
15:10:15,713 [Thread-1] INFO jTPCC : Term-00, Transaction Count = 4497
As can be seen, disabling WAL recycling did not cause any performance
regression in this test.
For #2, I ran the test with WAL recycling on (the current behavior as well
as the default with this patch) since the behavior of postgres is
orthogonal to WAL recycling when the filesystem fills up.
I capped the filesystem with 32MB of free space. I set up a configuration
with wal_keep_segments=50 and started a long benchmarksql run. I had 4 WAL
files already in existence when the run started.
As the filesystem fills up, the performance of postgres gets slower and
slower, as would be expected. This is due to the COW nature of the
filesystem and the fact that all writes need to find space.
When a new WAL file is created, it consumes essentially no space: the file
is zero-filled, so only a little filesystem metadata is allocated for it.
However, as writes occur to the WAL file, space is consumed, and eventually
all space in the filesystem is used up. I could not tell whether this
occurred during a write to an existing WAL file or a write to the database
itself. As other people have observed,
WAL file creation in a COW filesystem is not the problematic operation when
the filesystem fills up. It is the writes to existing files that will fail.
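As an aside, the "a new WAL file consumes almost no space" behavior can be seen with a small sketch (illustrative only, not PostgreSQL code; here ftruncate stands in for zero-filling, since on ZFS with compression a run of zeros likewise occupies almost no blocks): the file's apparent size is 16 MB immediately, but blocks are only allocated as real data is written into it, which is why it is the later writes, not the creation, that hit ENOSPC.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    # 16 MB apparent size, like a WAL segment, but sparse: no data written yet.
    os.ftruncate(fd, 16 * 1024 * 1024)
    before = os.fstat(fd).st_blocks      # blocks actually allocated so far

    # Now write 1 MB of real data into the segment and force it to disk.
    os.pwrite(fd, b"x" * (1024 * 1024), 0)
    os.fsync(fd)
    after = os.fstat(fd).st_blocks       # space is consumed only now

    print(f"blocks before write: {before}, after: {after}")
finally:
    os.close(fd)
    os.unlink(path)
```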
When postgres core dumped there were 6 WAL files in the pg_wal directory
(well short of the 50 configured).
When the filesystem filled up, postgres core dumped and benchmarksql
emitted a bunch of java debug information which I could provide if anyone
is interested.
Here is some information for the postgres core dump. It looks like postgres
aborted itself, but since the filesystem is full, there is nothing in the
log file.
::status
debugging core file of postgres (64-bit) from
f6c22f98-38aa-eb51-80d2-811ed25bed6b
file: /zones/f6c22f98-38aa-eb51-80d2-811ed25bed6b/local/pgsql/bin/postgres
initial argv: /usr/local/pgsql/bin/postgres -D /home/postgres/data
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=76019 uid=1001 code=-1
$C
fffff9ffffdfa4b0 libc.so.1`_lwp_kill+0xa()
fffff9ffffdfa4e0 libc.so.1`raise+0x20(6)
fffff9ffffdfa530 libc.so.1`abort+0x98()
fffff9ffffdfa560 errfinish+0x230()
fffff9ffffdfa5e0 XLogWrite+0x294()
fffff9ffffdfa610 XLogBackgroundFlush+0x18d()
fffff9ffffdfaa50 WalWriterMain+0x1a8()
fffff9ffffdfaab0 AuxiliaryProcessMain+0x3ff()
fffff9ffffdfab40 0x7b5566()
fffff9ffffdfab90 reaper+0x60a()
fffff9ffffdfaba0 libc.so.1`__sighndlr+6()
fffff9ffffdfac30 libc.so.1`call_user_handler+0x1db(12, 0, fffff9ffffdfaca0)
fffff9ffffdfac80 libc.so.1`sigacthandler+0x116(12, 0, fffff9ffffdfaca0)
fffff9ffffdfb0f0 libc.so.1`__pollsys+0xa()
fffff9ffffdfb220 libc.so.1`pselect+0x26b(7, fffff9ffffdfdad0, 0, 0,
fffff9ffffdfb230, 0)
fffff9ffffdfb270 libc.so.1`select+0x5a(7, fffff9ffffdfdad0, 0, 0,
fffff9ffffdfb6c0)
fffff9ffffdffb00 ServerLoop+0x289()
fffff9ffffdffb70 PostmasterMain+0xcfa()
fffff9ffffdffba0 main+0x3cd()
fffff9ffffdffbd0 _start_crt+0x83()
fffff9ffffdffbe0 _start+0x18()
Let me know if there is any other information I could provide.
Thanks,
Jerry
On Tue, Jun 26, 2018 at 7:35 AM, Jerry Jelinek <jerry.jelinek@joyent.com>
wrote:
On Thu, Jul 12, 2018 at 10:52 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I don't follow Alvaro's reasoning, TBH. There's a couple of things that
confuse me ...

I don't quite see how reusing WAL segments actually protects against full
filesystem? On "traditional" filesystems I would not expect any difference
between "unlink+create" and reusing an existing file. On CoW filesystems
(like ZFS or btrfs) the space management works very differently and reusing
an existing file is unlikely to save anything.
Yeah, I had the same thoughts.
But even if it reduces the likelihood of ENOSPC, it does not eliminate it
entirely. max_wal_size is not a hard limit, and the disk may be filled by
something else (when WAL is not on a separate device, when there is thin
provisioning, etc.). So it's not a protection against data corruption we
could rely on. (And as was discussed in the recent fsync thread, ENOSPC is a
likely source of past data corruption issues on NFS and possibly other
filesystems.)
Right. That ENOSPC discussion was about checkpointing though, not
WAL. IIUC the hypothesis was that there may be stacks (possibly
involving NFS or thin provisioning, or perhaps historical versions of
certain local filesystems that had reservation accounting bugs, on a
certain kernel) that could let you write() a buffer, and then later
when the checkpointer calls fsync() the filesystem says ENOSPC, the
kernel reports that and throws away the dirty page, and then at next
checkpoint fsync() succeeds but the checkpoint is a lie and the data
is smoke.
We already PANIC on any errno except EINTR in XLogWrite(), as seen
in Jerry's nearby stack trace, so that failure mode seems to be
covered already for WAL, no?
AFAICS the original reason for reusing WAL segments was the belief that
overwriting an existing file is faster than writing a new file. That might
have been true in the past, but the question is if it's still true on
current filesystems. The results posted here suggest it's not true on ZFS,
at least.
Yeah.
The wal_recycle=on|off patch seems reasonable to me (modulo Andres's
comments about the documentation; we should make sure that the 'off'
setting isn't accidentally recommended to the wrong audience) and I
vote we take it.
Just by the way, if I'm not mistaken ZFS does avoid faulting when
overwriting whole blocks, just like other filesystems:

https://github.com/freebsd/freebsd/blob/master/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L1034
So then where are those faults coming from? Perhaps the tree page
that holds the block pointer, of which there must be many when the
recordsize is small?
--
Thomas Munro
http://www.enterprisedb.com
Thanks to everyone who has taken the time to look at this patch and provide
all of the feedback.
I'm going to wait another day to see if there are any more comments. If
not, then first thing next week, I will send out a revised patch with
improvements to the man page change as requested. If anyone has specific
things they want to be sure are covered, please just let me know.
Thanks again,
Jerry
On Thu, Jul 5, 2018 at 4:39 PM, Andres Freund <andres@anarazel.de> wrote:
This is formulated *WAY* too positive. It'll have a dramatic *NEGATIVE*
performance impact on non-COW filesystems, and very likely even negative
impacts in a number of COWed scenarios (when there's enough memory to
keep all WAL files in memory).

I still think that fixing this another way would be preferable. This'll
be too much of a magic knob that depends on the fs, hardware and
workload.
I tend to agree with you, but unless we have a pretty good idea what
that other way would be, I think we should probably accept the patch.
Could we somehow make this self-tuning? On any given
filesystem/hardware/workload, either creating a new 16MB file is
faster, or recycling an old file is faster. If the old file is still
cached, recycling it figures to win on almost any hardware. If not,
it seems like something of a toss-up. I suppose we could try to keep
a running average of how long it is taking us to recycle WAL files and
how long it is taking us to create new ones; if we do each one of
those things at least sometimes, then we'll eventually get an idea of
which one is quicker. But it's not clear to me that such data would
be very reliable unless we tried to make sure that we tried both
things fairly regularly under circumstances where we could have chosen
to do the other one.
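The running-average idea above could be sketched like this (a hypothetical illustration only; the class name, parameters, and the exploration probability are all made up, and nothing like this exists in the patch): keep an exponential moving average of how long each path takes, usually pick the cheaper one, but occasionally try the other so both estimates stay fresh, which addresses the "only reliable if we try both regularly" concern.

```python
import random

class WalSourceChooser:
    """Hypothetical self-tuning chooser between recycling an old WAL
    segment and creating a new one, based on observed timings."""

    def __init__(self, alpha: float = 0.2, explore: float = 0.1):
        self.alpha = alpha        # EWMA smoothing factor for timings
        self.explore = explore    # probability of sampling the non-preferred path
        self.avg = {"recycle": None, "create": None}

    def choose(self) -> str:
        r, c = self.avg["recycle"], self.avg["create"]
        # Until both paths have been measured, try each at least once.
        if r is None:
            return "recycle"
        if c is None:
            return "create"
        best = "recycle" if r <= c else "create"
        other = "create" if best == "recycle" else "recycle"
        # Mostly exploit the cheaper path, occasionally re-measure the other.
        return other if random.random() < self.explore else best

    def record(self, method: str, seconds: float) -> None:
        # Fold the new observation into the exponential moving average.
        prev = self.avg[method]
        self.avg[method] = seconds if prev is None else (
            self.alpha * seconds + (1 - self.alpha) * prev)
```

Whether the bookkeeping would be worth it, given how cheap both operations usually are, is exactly the open question.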
I think part of the problem here is that whether a WAL segment is
likely to be cached depends on a host of factors which we don't track
very carefully, like whether it's been streamed or decoded recently.
If we knew that a particular WAL segment hadn't been accessed for
any purpose in 10+ minutes, it would probably be fairly safe to guess
that it's no longer in cache; if we knew that it had been accessed <15
seconds ago, it is probably still in cache. But we have no idea.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company