fallocate / posix_fallocate for new WAL file creation (etc...)

Started by Jon Nelson almost 13 years ago, 91 messages, hackers
#1Jon Nelson
jnelson+pgsql@jamponi.net

Pertinent to another thread titled
[HACKERS] corrupt pages detected by enabling checksums
I hope to explore the possibility of using fallocate (or
posix_fallocate) for new WAL file creation.

Most modern Linux filesystems support fast fallocate/posix_fallocate,
reducing extent fragmentation (where extents are used) and frequently
offering a pretty significant speed improvement. In my tests, using
posix_fallocate (followed by pg_fsync) is at least 28 times quicker
than using the current method (which writes zeroes followed by
pg_fsync).
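
For readers who want to try the comparison themselves, here is a minimal sketch of the two approaches described above. It is not the patch: the function names, the 16 MiB segment size, and the 8 KiB write granularity are assumptions made for illustration, and it calls fsync() directly rather than pg_fsync.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define WAL_SEGMENT_SIZE (16 * 1024 * 1024)  /* assumed segment size */
#define ZERO_BLOCK_SIZE  8192                /* assumed write granularity */

/*
 * Traditional approach: write zeroes block by block, then fsync.
 * Returns 0 on success or an errno-style code on failure.
 */
int
create_by_zero_fill(int fd, off_t size)
{
    char   zbuf[ZERO_BLOCK_SIZE];
    off_t  done = 0;

    memset(zbuf, 0, sizeof(zbuf));
    while (done < size)
    {
        off_t   left = size - done;
        size_t  chunk = left < (off_t) sizeof(zbuf) ? (size_t) left : sizeof(zbuf);
        ssize_t n = write(fd, zbuf, chunk);

        if (n < 0)
            return errno;
        done += n;
    }
    return fsync(fd) == 0 ? 0 : errno;
}

/*
 * Proposed approach: ask the filesystem to reserve the space, then fsync.
 * posix_fallocate() returns the error number directly; it does not set errno.
 */
int
create_by_fallocate(int fd, off_t size)
{
    int rc = posix_fallocate(fd, 0, size);

    if (rc != 0)
        return rc;
    return fsync(fd) == 0 ? 0 : errno;
}
```

Timing each function on a freshly created file (for example with clock_gettime() around the call) would give a rough microbenchmark; the ratio will depend heavily on the filesystem and storage.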

I have written up a patch to use posix_fallocate in new WAL file
creation, including configuration by way of a GUC variable, but I've
not contributed to the PostgreSQL project before. Therefore, I'm
fairly certain the patch is not formatted properly and does not conform
to the appropriate style guides. Currently, the patch is based on 9.2, and is
quite small in size - 3.6KiB.

Advice on how to proceed is appreciated.

--
Jon

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2David Fetter
david@fetter.org
In reply to: Jon Nelson (#1)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

On Mon, May 13, 2013 at 08:54:39PM -0500, Jon Nelson wrote:

Pertinent to another thread titled
[HACKERS] corrupt pages detected by enabling checksums
I hope to explore the possibility of using fallocate (or
posix_fallocate) for new WAL file creation.

Most modern Linux filesystems support fast fallocate/posix_fallocate,
reducing extent fragmentation (where extents are used) and frequently
offering a pretty significant speed improvement. In my tests, using
posix_fallocate (followed by pg_fsync) is at least 28 times quicker
than using the current method (which writes zeroes followed by
pg_fsync).

I have written up a patch to use posix_fallocate in new WAL file
creation, including configuration by way of a GUC variable, but I've
not contributed to the PostgreSQL project before. Therefore, I'm
fairly certain the patch is not formatted properly or conforms to the
appropriate style guides. Currently, the patch is based on 9.2, and is
quite small in size - 3.6KiB.

Advice on how to proceed is appreciated.

Thanks for hopping in!

Please re-base the patch vs. git master, as new features like this go
there. Please also send along the tests you're doing so others can
riff. Tests that find any weak points are also good.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


#3Robert Haas
robertmhaas@gmail.com
In reply to: Jon Nelson (#1)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

On Mon, May 13, 2013 at 9:54 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:

Pertinent to another thread titled
[HACKERS] corrupt pages detected by enabling checksums
I hope to explore the possibility of using fallocate (or
posix_fallocate) for new WAL file creation.

Most modern Linux filesystems support fast fallocate/posix_fallocate,
reducing extent fragmentation (where extents are used) and frequently
offering a pretty significant speed improvement. In my tests, using
posix_fallocate (followed by pg_fsync) is at least 28 times quicker
than using the current method (which writes zeroes followed by
pg_fsync).

I have written up a patch to use posix_fallocate in new WAL file
creation, including configuration by way of a GUC variable, but I've
not contributed to the PostgreSQL project before. Therefore, I'm
fairly certain the patch is not formatted properly or conforms to the
appropriate style guides. Currently, the patch is based on 9.2, and is
quite small in size - 3.6KiB.

Advice on how to proceed is appreciated.

Make sure to list it here:

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#4Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Robert Haas (#3)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

On Tue, May 14, 2013 at 9:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, May 13, 2013 at 9:54 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:

Pertinent to another thread titled
[HACKERS] corrupt pages detected by enabling checksums
I hope to explore the possibility of using fallocate (or
posix_fallocate) for new WAL file creation.

Most modern Linux filesystems support fast fallocate/posix_fallocate,
reducing extent fragmentation (where extents are used) and frequently
offering a pretty significant speed improvement. In my tests, using
posix_fallocate (followed by pg_fsync) is at least 28 times quicker
than using the current method (which writes zeroes followed by
pg_fsync).

I have written up a patch to use posix_fallocate in new WAL file
creation, including configuration by way of a GUC variable, but I've
not contributed to the PostgreSQL project before. Therefore, I'm
fairly certain the patch is not formatted properly or conforms to the
appropriate style guides. Currently, the patch is based on 9.2, and is
quite small in size - 3.6KiB.

I have re-based and reformatted the code, and basic testing shows a
reduction in WAL-file creation time of a fairly significant amount.
I ran 'make test' and did additional local testing without issue.
Therefore, I am attaching the patch. I will try to add it to the
commitfest page.

--
Jon

Attachments:

0001-enhance-GUC-and-xlog-with-wal_use_fallocate-boolean-.patch (application/octet-stream, +63 -36)

#5Andres Freund
andres@anarazel.de
In reply to: Jon Nelson (#4)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

Hi,

On 2013-05-15 16:26:15 -0500, Jon Nelson wrote:

I have written up a patch to use posix_fallocate in new WAL file
creation, including configuration by way of a GUC variable, but I've
not contributed to the PostgreSQL project before. Therefore, I'm
fairly certain the patch is not formatted properly or conforms to the
appropriate style guides. Currently, the patch is based on 9.2, and is
quite small in size - 3.6KiB.

I have re-based and reformatted the code, and basic testing shows a
reduction in WAL-file creation time of a fairly significant amount.
I ran 'make test' and did additional local testing without issue.
Therefore, I am attaching the patch. I will try to add it to the
commitfest page.

Some very quick comments, without thinking about this much:

* needs a configure check for posix_fallocate. The current version will
e.g. fail to compile on Windows or many other non-Linux systems. Check
how it's done for posix_fadvise.
* Is wal file creation performance actually relevant? Is the performance
of a system running on fallocate()d wal files any different?
* According to the man page posix_fallocate doesn't set errno but rather
returns the error code.
* I wonder whether we ever want to actually disable this? Afair the libc
contains emulation for posix_fallocate if the filesystem doesn't support
it.
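
To illustrate the configure-check point in the list above, here is a rough sketch of how a compile-time fallback could look. HAVE_POSIX_FALLOCATE is assumed to be the macro a configure test would define (by analogy with how such checks are usually named), and reserve_file_space is a hypothetical helper, not something from the patch.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reserve "size" bytes for fd, starting at offset 0.
 * Returns 0 on success or an errno-style code on failure.
 *
 * HAVE_POSIX_FALLOCATE is an assumed configure-generated macro; build with
 * -DHAVE_POSIX_FALLOCATE to exercise the first branch by hand.
 */
int
reserve_file_space(int fd, off_t size)
{
#ifdef HAVE_POSIX_FALLOCATE
    /* Platform has posix_fallocate(): it returns the error code directly. */
    return posix_fallocate(fd, 0, size);
#else
    /* Portable fallback: fill the range with zeroes. */
    char   zbuf[8192];
    off_t  done = 0;

    memset(zbuf, 0, sizeof(zbuf));
    while (done < size)
    {
        off_t   left = size - done;
        size_t  chunk = left < (off_t) sizeof(zbuf) ? (size_t) left : sizeof(zbuf);
        ssize_t n = pwrite(fd, zbuf, chunk, done);

        if (n < 0)
            return errno;
        done += n;
    }
    return 0;
#endif
}
```

Either branch leaves the caller with the same contract, so code that uses the helper would not need its own #ifdef.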

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Andres Freund (#5)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

On Wed, May 15, 2013 at 4:34 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Hi,

On 2013-05-15 16:26:15 -0500, Jon Nelson wrote:

I have written up a patch to use posix_fallocate in new WAL file
creation, including configuration by way of a GUC variable, but I've
not contributed to the PostgreSQL project before. Therefore, I'm
fairly certain the patch is not formatted properly or conforms to the
appropriate style guides. Currently, the patch is based on 9.2, and is
quite small in size - 3.6KiB.

I have re-based and reformatted the code, and basic testing shows a
reduction in WAL-file creation time of a fairly significant amount.
I ran 'make test' and did additional local testing without issue.
Therefore, I am attaching the patch. I will try to add it to the
commitfest page.

Some where quick comments, without thinking about this:

Thank you for the kind feedback.

* needs a configure check for posix_fallocate. The current version will
e.g. fail to compile on windows or many other non linux systems. Check
how its done for posix_fadvise.

I will address as soon as I am able.

* Is wal file creation performance actually relevant? Is the performance
of a system running on fallocate()d wal files any different?

In my limited testing, I noticed a drop of approx. 100ms per WAL file.
I do not have a good idea for how to really stress the WAL-file
creation area without calling pg_start_backup and pg_stop_backup over
and over (with archiving enabled).

However, a file allocated with fallocate is (supposed to be) less
fragmented than one created by the traditional means.

* According to the man page posix_fallocate doesn't set errno but rather
returns the error code.

That's true. I originally wrote the patch using fallocate(2). What
would be appropriate here? Should I switch on the return value and the
six (6) or so relevant error codes?

* I wonder whether we ever want to actually disable this? Afair the libc
contains emulation for posix_fallocate if the filesystem doesn't support
it.

I know that glibc does, but I don't know about other libc implementations.

--
Jon


#7Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Jon Nelson (#6)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

On Wed, May 15, 2013 at 4:46 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:

On Wed, May 15, 2013 at 4:34 PM, Andres Freund <andres@2ndquadrant.com> wrote:

..

Some where quick comments, without thinking about this:

Thank you for the kind feedback.

* needs a configure check for posix_fallocate. The current version will
e.g. fail to compile on windows or many other non linux systems. Check
how its done for posix_fadvise.

The following patch includes the changes to configure.in.
I had to make other changes (not included here) because my local
system uses autoconf 2.69, but I did test this successfully.

That's true. I originally wrote the patch using fallocate(2). What
would be appropriate here? Should I switch on the return value and the
six (6) or so relevant error codes?

I addressed this, hopefully in a reasonable way.

--
Jon

Attachments:

fallocate.patch-v2 (application/octet-stream, +101 -35)

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jon Nelson (#7)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

Jon Nelson escribió:

On Wed, May 15, 2013 at 4:46 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:

That's true. I originally wrote the patch using fallocate(2). What
would be appropriate here? Should I switch on the return value and the
six (6) or so relevant error codes?

I addressed this, hopefully in a reasonable way.

Would it work to just assign the value you got from posix_fallocate (if
nonzero) to errno and then use %m in the errmsg() call in ereport()?
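
A small sketch of that idea, outside of the PostgreSQL tree: the helper names are hypothetical, and fprintf with strerror(errno) stands in for what would be an ereport() whose errmsg() format string ends in %m.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * posix_fallocate() reports failure through its return value and leaves
 * errno alone, so copy the code into errno to let errno-based reporting
 * (such as %m) work unchanged. Returns 0 on success, -1 on failure.
 */
int
fallocate_setting_errno(int fd, off_t size)
{
    int rc = posix_fallocate(fd, 0, size);

    if (rc != 0)
    {
        errno = rc;
        return -1;
    }
    return 0;
}

/*
 * Stand-in for the error report; in the backend this would presumably be an
 * ereport() call using %m instead of strerror(errno).
 */
void
report_allocation_failure(const char *path)
{
    fprintf(stderr, "could not allocate space for file \"%s\": %s\n",
            path, strerror(errno));
}
```

This avoids a switch over the individual error codes, since the message text comes from the error number itself.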

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#9Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Alvaro Herrera (#8)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

On Wed, May 15, 2013 at 10:17 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Jon Nelson escribió:

On Wed, May 15, 2013 at 4:46 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:

That's true. I originally wrote the patch using fallocate(2). What
would be appropriate here? Should I switch on the return value and the
six (6) or so relevant error codes?

I addressed this, hopefully in a reasonable way.

Would it work to just assign the value you got from posix_fallocate (if
nonzero) to errno and then use %m in the errmsg() call in ereport()?

That strikes me as a better way. I'll work something up soon.
Thanks!

--
Jon


#10Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Jon Nelson (#9)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

On Wed, May 15, 2013 at 10:36 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:

On Wed, May 15, 2013 at 10:17 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Jon Nelson escribió:

On Wed, May 15, 2013 at 4:46 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:

That's true. I originally wrote the patch using fallocate(2). What
would be appropriate here? Should I switch on the return value and the
six (6) or so relevant error codes?

I addressed this, hopefully in a reasonable way.

Would it work to just assign the value you got from posix_fallocate (if
nonzero) to errno and then use %m in the errmsg() call in ereport()?

That strikes me as a better way. I'll work something up soon.
Thanks!

Please find attached version 3.
Am I doing this the right way? Should I be posting the full patch each
time, or incremental patches?

--
Jon

Attachments:

fallocate-v3.patch (application/octet-stream, +81 -35)

#11Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jon Nelson (#10)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

Jon Nelson escribió:

Am I doing this the right way? Should I be posting the full patch each
time, or incremental patches?

Full patch each time is okay. Context-format patch is even better.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#12Greg Smith
gsmith@gregsmith.com
In reply to: Jon Nelson (#10)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

On 5/16/13 9:16 AM, Jon Nelson wrote:

Am I doing this the right way? Should I be posting the full patch each
time, or incremental patches?

There are guidelines for getting your patch in the right format at
https://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git
that would improve this one. You have some formatting issues with tab
spacing at lines 120 through 133 in your v3 patch. And it looks like
there was a formatting change on line 146 that is making the diff larger
than it needs to be.

The biggest thing missing from this submission is information about what
performance testing you did. Ideally performance patches are submitted
with enough information for a reviewer to duplicate the same test the
author did, as well as hard before/after performance numbers from your
test system. It often turns out to be tricky to duplicate a performance gain, and
being able to run the same test used for initial development eliminates
a lot of the problems.

Second bit of nitpicking. There are already some GUC values that appear
or disappear based on compile time options. They're all debugging
related things though. I would prefer not to see this one go away when
its implementation isn't available. That's going to break any scripts
that SHOW the setting to see if it's turned on or not as a first
problem. I think the right model to follow here is the IFDEF setup used
for effective_io_concurrency. I wouldn't worry about this too much
though. Having a wal_use_fallocate GUC is good for testing. But if it
works out well, when it's ready for commit I don't see why anyone would
want it turned off on platforms where it works. There are already too
many performance tweaking GUCs. Something has to be very likely to be
changed from the default before it's worth adding one for it.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


#13Andres Freund
andres@anarazel.de
In reply to: Jon Nelson (#6)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

On 2013-05-15 16:46:33 -0500, Jon Nelson wrote:

* Is wal file creation performance actually relevant? Is the performance
of a system running on fallocate()d wal files any different?

In my limited testing, I noticed a drop of approx. 100ms per WAL file.
I do not have a good idea for how to really stress the WAL-file
creation area without calling pg_start_backup and pg_stop_backup over
and over (with archiving enabled).

My point is that wal file creation usually isn't all that performance
sensitive. Once the cluster has enough WAL files it will usually recycle
them and thus never allocate new ones. So for this to be really
beneficial it would be interesting to show different performance during
normal running. You could also check how many extents a WAL file
is made of with fallocate in comparison to the old-style method
(filefrag will give you that for most filesystems).
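
For anyone who would rather script the extent check than parse filefrag output, the following is a Linux-only sketch using the FIEMAP ioctl, which is the interface filefrag relies on where available. count_extents is a hypothetical helper; it returns -1 when the file cannot be opened or the filesystem does not support FIEMAP, and the count may differ slightly from what filefrag prints.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#if defined(__linux__)
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>
#endif

/*
 * Return the number of extents the kernel reports for "path",
 * or -1 if the count is unavailable.
 */
long
count_extents(const char *path)
{
#if defined(__linux__)
    struct fiemap fm;
    long          result = -1;
    int           fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    memset(&fm, 0, sizeof(fm));
    fm.fm_start = 0;
    fm.fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
    fm.fm_flags = FIEMAP_FLAG_SYNC;     /* flush dirty data before mapping */
    fm.fm_extent_count = 0;             /* 0 = only report how many exist */

    if (ioctl(fd, FS_IOC_FIEMAP, &fm) == 0)
        result = (long) fm.fm_mapped_extents;

    close(fd);
    return result;
#else
    (void) path;                        /* FIEMAP is Linux-specific */
    return -1;
#endif
}
```

Running it over the WAL directory for segments created both ways would give the comparison suggested above.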

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#14Merlin Moncure
mmoncure@gmail.com
In reply to: Andres Freund (#13)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

On Fri, May 17, 2013 at 4:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-05-15 16:46:33 -0500, Jon Nelson wrote:

* Is wal file creation performance actually relevant? Is the performance
of a system running on fallocate()d wal files any different?

In my limited testing, I noticed a drop of approx. 100ms per WAL file.
I do not have a good idea for how to really stress the WAL-file
creation area without calling pg_start_backup and pg_stop_backup over
and over (with archiving enabled).

My point is that wal file creation usually isn't all that performance
sensitive. Once the cluster has enough WAL files it will usually recycle
them and thus never allocate new ones. So for this to be really
beneficial it would be interesting to show different performance during
normal running. You could also check out of how many extents a wal file
is made out of with fallocate in comparison to the old style method
(filefrag will give you that for most filesystems).

But why does it have to be *really* beneficial? We're already making
optional posix_fxxx calls and fallocate seems to do exactly what we
would want in this context. Even if the 100ms drop doesn't show up
all that often, I'd still take it just for the defragmentation
benefits and the patch is fairly tiny.

merlin


#15Merlin Moncure
mmoncure@gmail.com
In reply to: Merlin Moncure (#14)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

On Fri, May 17, 2013 at 8:29 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Fri, May 17, 2013 at 4:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-05-15 16:46:33 -0500, Jon Nelson wrote:

* Is wal file creation performance actually relevant? Is the performance
of a system running on fallocate()d wal files any different?

In my limited testing, I noticed a drop of approx. 100ms per WAL file.
I do not have a good idea for how to really stress the WAL-file
creation area without calling pg_start_backup and pg_stop_backup over
and over (with archiving enabled).

My point is that wal file creation usually isn't all that performance
sensitive. Once the cluster has enough WAL files it will usually recycle
them and thus never allocate new ones. So for this to be really
beneficial it would be interesting to show different performance during
normal running. You could also check out of how many extents a wal file
is made out of with fallocate in comparison to the old style method
(filefrag will give you that for most filesystems).

But why does it have to be *really* beneficial? We're already making
optional posix_fxxx calls and fallocate seems to do exactly what we
would want in this context. Even if the 100ms drop doesn't show up
all that often, I'd still take it just for the defragmentation
benefits and the patch is fairly tiny.

Here is sample output of filefrag on a somewhat busy database from our
testing environment that exactly duplicates our production workloads.
It does a lot of batch processing at night and a mix of 80% OLTP and 20%
OLAP during the day. This is on ext3. Interestingly, on ext4 servers
I never saw more than 2 extents per file (but those servers are mostly
not as busy).

[root@rpisatysw001 pg_xlog]# filefrag *
00000001000006D200000064: 490 extents found, perfection would be 1 extent
00000001000006D200000065: 33 extents found, perfection would be 1 extent
00000001000006D200000066: 43 extents found, perfection would be 1 extent
00000001000006D200000067: 71 extents found, perfection would be 1 extent
00000001000006D200000068: 43 extents found, perfection would be 1 extent
00000001000006D200000069: 156 extents found, perfection would be 1 extent
00000001000006D20000006A: 52 extents found, perfection would be 1 extent
00000001000006D20000006B: 108 extents found, perfection would be 1 extent

merlin


#16Andres Freund
andres@anarazel.de
In reply to: Merlin Moncure (#15)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

On 2013-05-17 15:48:38 -0500, Merlin Moncure wrote:

On Fri, May 17, 2013 at 8:29 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Fri, May 17, 2013 at 4:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-05-15 16:46:33 -0500, Jon Nelson wrote:

* Is wal file creation performance actually relevant? Is the performance
of a system running on fallocate()d wal files any different?

In my limited testing, I noticed a drop of approx. 100ms per WAL file.
I do not have a good idea for how to really stress the WAL-file
creation area without calling pg_start_backup and pg_stop_backup over
and over (with archiving enabled).

My point is that wal file creation usually isn't all that performance
sensitive. Once the cluster has enough WAL files it will usually recycle
them and thus never allocate new ones. So for this to be really
beneficial it would be interesting to show different performance during
normal running. You could also check out of how many extents a wal file
is made out of with fallocate in comparison to the old style method
(filefrag will give you that for most filesystems).

But why does it have to be *really* beneficial? We're already making
optional posix_fxxx calls and fallocate seems to do exactly what we
would want in this context. Even if the 100ms drop doesn't show up
all that often, I'd still take it just for the defragmentation
benefits and the patch is fairly tiny.

Well, it needs to be tested et al. And it's a fairly critical code
path. I seem to remember that there were older glibc versions that
didn't do such a great job at emulating fallocate for example.

Here is sample output of filefrag on a somewhat busy database from our
testing environment that exactly duplicates our production workloads..
It does a lot of batch processing at night and a mix of 80%oltp 20%
olap during the day. This is on ext3. Interestingly, on ext4 servers
I never saw more than 2 extents per file (but those servers are mostly
not as busy).

Ok, that's pretty bad. 490 extents in one file? Really? I'd consider
shutting down the cluster and copying the WAL files at a moment when there
is enough free space. Just don't forget to sync afterwards.
Ext4 is notably better at allocating space in growing files than ext3
due to delayed allocation (and other things), so it wouldn't surprise me
to see similar differences in fragmentation even if the load were comparable.

Ext3 doesn't have fallocate btw, so it wouldn't benefit from such a
patch anyway.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#17Merlin Moncure
mmoncure@gmail.com
In reply to: Andres Freund (#16)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

On Fri, May 17, 2013 at 4:18 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-05-17 15:48:38 -0500, Merlin Moncure wrote:

On Fri, May 17, 2013 at 8:29 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Fri, May 17, 2013 at 4:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-05-15 16:46:33 -0500, Jon Nelson wrote:

* Is wal file creation performance actually relevant? Is the performance
of a system running on fallocate()d wal files any different?

In my limited testing, I noticed a drop of approx. 100ms per WAL file.
I do not have a good idea for how to really stress the WAL-file
creation area without calling pg_start_backup and pg_stop_backup over
and over (with archiving enabled).

My point is that wal file creation usually isn't all that performance
sensitive. Once the cluster has enough WAL files it will usually recycle
them and thus never allocate new ones. So for this to be really
beneficial it would be interesting to show different performance during
normal running. You could also check out of how many extents a wal file
is made out of with fallocate in comparison to the old style method
(filefrag will give you that for most filesystems).

But why does it have to be *really* beneficial? We're already making
optional posix_fxxx calls and fallocate seems to do exactly what we
would want in this context. Even if the 100ms drop doesn't show up
all that often, I'd still take it just for the defragmentation
benefits and the patch is fairly tiny.

Well, it needs to be tested et al. And its a fairly critical code
path. I seem to remember that there were older glibc versions that
didn't do such a great job at emulating fallocate for example.

Here is sample output of filefrag on a somewhat busy database from our
testing environment that exactly duplicates our production workloads..
It does a lot of batch processing at night and a mix of 80%oltp 20%
olap during the day. This is on ext3. Interestingly, on ext4 servers
I never saw more than 2 extents per file (but those servers are mostly
not as busy).

Ok, that's pretty bad. 490 extents in one file? Really? I'd consider
shutting down the cluster, copying the wal files in a moment where there
is enough free space. Just don't forget to sync afterwards.
EXT4 is notably better at allocating space in growing files than ext3
due to delayed allocation (and other things), so it wouldn't surprise me
similar differences in fragmentation even if the load were comparable.

Ext3 doesn't have fallocate btw, so it wouldn't benefit from such a
patch anyway.

yeah -- I see your point. The object lesson isn't so much 'improve
postgres' as it is to 'use a modern filesystem'.

merlin


#18Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Greg Smith (#12)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

On Thu, May 16, 2013 at 7:05 PM, Greg Smith <greg@2ndquadrant.com> wrote:

On 5/16/13 9:16 AM, Jon Nelson wrote:

Am I doing this the right way? Should I be posting the full patch each
time, or incremental patches?

There are guidelines for getting your patch in the right format at
https://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git
that would improve this one. You have some formatting issues with tab
spacing at lines 120 through 133 in your v3 patch. And it looks like there
was a formatting change on line 146 that is making the diff larger than it
needs to be.

I've corrected the formatting change (end-of-line whitespace was
stripped) on line 146.
The other whitespace changes are - I think - due to newly-indented
code due to a new code block.
Included please find a v4 patch which uses context diffs per the above url.

The biggest thing missing from this submission is information about what
performance testing you did. Ideally performance patches are submitted with
enough information for a reviewer to duplicate the same test the author did,
as well as hard before/after performance numbers from your test system. It
often turns tricky to duplicate a performance gain, and being able to run
the same test used for initial development eliminates a lot of the problems.

This has been a bit of a struggle. While it's true that WAL file
creation doesn't happen with great frequency, and while it's also true
that - with strace and other tests - it can be proven that
fallocate(16MB) is much quicker than writing zeroes by hand,
proving that in the larger context of a running install has been
challenging.

Attached you'll find a small test script (t.sh) which creates a new
cluster in 'foo', changes some config values, starts the cluster, and
then times how long it takes pgbench to prepare a database. I've used
"wal_level = hot_standby" in the hopes that this generates the largest
number of WAL files (and I set the number of such files to 1024). The
hardware is an AMD 9150e with a 2-disk software RAID1 (SATA disks) on
kernel 3.9.2 and ext4 (x86_64, openSUSE 12.3). The test results are
not that surprising. The longer the test (the larger the scale factor)
the less of a difference using posix_fallocate makes. With a scale
factor of 100, I see an average of 10-11% reduction in the time taken
to initialize the database. With 300, it's about 5.5% and with 900,
it's between 0 and 1.2%. I will be doing more testing but this is what
I started with. I'm very open to suggestions.

Second bit of nitpicking. There are already some GUC values that appear or
disappear based on compile time options. They're all debugging related
things though. I would prefer not to see this one go away when it's
implementation isn't available. That's going to break any scripts that SHOW
the setting to see if it's turned on or not as a first problem. I think the
right model to follow here is the IFDEF setup used for
effective_io_concurrency. I wouldn't worry about this too much though.
Having a wal_use_fallocate GUC is good for testing. But if it works out
well, when it's ready for commit I don't see why anyone would want it turned
off on platforms where it works. There are already too many performance
tweaking GUCs. Something has to be very likely to be changed from the
default before it's worth adding one for it.

Ack. I've revised the patch to always have the GUC (for now), default
to false, and if configure can't find posix_fallocate (or the user
disables it by way of pg_config_manual.h) then it remains a GUC that
simply can't be changed.
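
The behaviour described above can be sketched roughly as follows. This is an illustration in Python, not PostgreSQL's GUC machinery: the class BoolSetting and the flag HAVE_POSIX_FALLOCATE are hypothetical names, and the hasattr check merely stands in for whatever configure would detect at build time. The point is only that the setting always exists and can always be shown, but cannot be switched on where the facility is missing.

```python
import os

# Stand-in for the build-time detection configure would perform (assumption).
HAVE_POSIX_FALLOCATE = hasattr(os, "posix_fallocate")


class BoolSetting:
    """Hypothetical boolean setting that always exists but may be locked off."""

    def __init__(self, name, supported, default=False):
        self.name = name
        self.supported = supported
        # An unsupported feature can never start out enabled.
        self.value = bool(default and supported)

    def show(self):
        # Always answers, so scripts that inspect the setting keep working.
        return "on" if self.value else "off"

    def set(self, new_value):
        # Refuse to enable a feature this build does not provide.
        if new_value and not self.supported:
            raise ValueError(f"{self.name} is not supported on this platform")
        self.value = bool(new_value)


if __name__ == "__main__":
    guc = BoolSetting("wal_use_fallocate", HAVE_POSIX_FALLOCATE)
    print(guc.name, "=", guc.show())
```

With this shape, a script that inspects the setting gets "off" on a platform without posix_fallocate instead of an unknown-parameter error.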

I'll also be re-running the tests.

--
Jon

Attachments:

plot.png (image/png)
fallocate-v4.patch (application/octet-stream, +108 -87)
t.sh (application/x-sh)

#19Robert Haas
robertmhaas@gmail.com
In reply to: Jon Nelson (#18)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

On Sat, May 25, 2013 at 2:55 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:

The biggest thing missing from this submission is information about what
performance testing you did. Ideally performance patches are submitted with
enough information for a reviewer to duplicate the same test the author did,
as well as hard before/after performance numbers from your test system. It
often turns out to be tricky to duplicate a performance gain, and being able to run
the same test used for initial development eliminates a lot of the problems.

This has been a bit of a struggle. While it's true that WAL file
creation doesn't happen with great frequency, and while it's also true
that - with strace and other tests - it can be proven that
fallocate(16MB) is much quicker than writing zeroes by hand,
proving that in the larger context of a running install has been
challenging.

It's nice to be able to test things in the context of a running
install, but sometimes a microbenchmark is just as good. I mean, if
posix_fallocate() is faster, then it's just faster, right? It's
likely to be pretty hard to get reproducible numbers for how much this
actually helps in the real world because write tests are inherently
pretty variable depending on a lot of factors we don't control, so
even if Jon has got the best possible test, the numbers may bounce
around so much that you can't really measure the (probably small) gain
from this approach. But that doesn't seem like a reason not to adopt
the approach and take whatever gain there is. At least, not that I
can see.
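
A microbenchmark of the kind mentioned here might look like the sketch below. It is not the test used in this thread: the function names, the 8 kB write size, and the number of rounds are assumptions made for illustration. It creates a 16 MB file either by writing zeroes or by calling posix_fallocate, follows each with fsync, and reports the mean time per file.

```python
import os
import tempfile
import time

SEGMENT_SIZE = 16 * 1024 * 1024  # 16 MB, the file size discussed in the thread
BLOCK_SIZE = 8192                # write size for the zero-fill path (assumption)


def create_zero_filled(path, size=SEGMENT_SIZE):
    """Create the file by writing zeroes block by block, then fsync."""
    zeros = bytes(BLOCK_SIZE)
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        written = 0
        while written < size:
            # os.write returns the number of bytes actually written.
            written += os.write(fd, zeros[: min(BLOCK_SIZE, size - written)])
        os.fsync(fd)
    finally:
        os.close(fd)


def create_fallocated(path, size=SEGMENT_SIZE):
    """Create the file with posix_fallocate, then fsync."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)
        os.fsync(fd)
    finally:
        os.close(fd)


def time_creation(create, directory, rounds=5):
    """Return the mean seconds per file over `rounds` creations."""
    total = 0.0
    for i in range(rounds):
        path = os.path.join(directory, f"seg{i}")
        start = time.perf_counter()
        create(path)
        total += time.perf_counter() - start
        os.unlink(path)
    return total / rounds


if __name__ == "__main__":
    # Run from a directory on the filesystem you want to measure.
    with tempfile.TemporaryDirectory(dir=".") as d:
        print("zero-fill :", time_creation(create_zero_filled, d))
        print("fallocate :", time_creation(create_fallocated, d))
```

The numbers depend heavily on the filesystem and storage, so results from one machine should not be read as general.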

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#20Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#19)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)

On 2013-05-28 10:03:58 -0400, Robert Haas wrote:

On Sat, May 25, 2013 at 2:55 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:

The biggest thing missing from this submission is information about what
performance testing you did. Ideally performance patches are submitted with
enough information for a reviewer to duplicate the same test the author did,
as well as hard before/after performance numbers from your test system. It
often turns out to be tricky to duplicate a performance gain, and being able to run
the same test used for initial development eliminates a lot of the problems.

This has been a bit of a struggle. While it's true that WAL file
creation doesn't happen with great frequency, and while it's also true
that - with strace and other tests - it can be proven that
fallocate(16MB) is much quicker than writing zeroes by hand,
proving that in the larger context of a running install has been
challenging.

It's nice to be able to test things in the context of a running
install, but sometimes a microbenchmark is just as good. I mean, if
posix_fallocate() is faster, then it's just faster, right?

Well, it's a bit more complex than that. Fallocate doesn't actually
initialize the disk space in most filesystems; it just marks it as
allocated and zeroed, which is one of the reasons it can be noticeably
faster. But that can make the runtime overhead of writing to those pages
higher.
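
One way to look at that second cost is to time the first overwrite of a preallocated file rather than its creation. The sketch below is only an illustration under stated assumptions: the helper names, the 8 kB write size, and the fsync interval are all hypothetical choices, not anything taken from the patch. It prepares a file either by zero-filling or by posix_fallocate, then measures how long it takes to overwrite it with non-zero data.

```python
import os
import tempfile
import time

SIZE = 16 * 1024 * 1024  # one 16 MB file; must be a multiple of BLOCK
BLOCK = 8192             # bytes per write call (assumption)
SYNC_EVERY = 16          # fsync after this many blocks (assumption)


def write_all(fd, data):
    """Write all of `data`, retrying after short writes."""
    view = memoryview(data)
    while view:
        view = view[os.write(fd, view):]


def preallocate(path, size, use_fallocate):
    """Create `path` at `size` bytes, by posix_fallocate or by zero-filling."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        if use_fallocate:
            os.posix_fallocate(fd, 0, size)
        else:
            for _ in range(size // BLOCK):
                write_all(fd, bytes(BLOCK))
        os.fsync(fd)
    finally:
        os.close(fd)


def time_overwrite(path, size):
    """Overwrite the existing file with non-zero data; return elapsed seconds."""
    payload = b"\xab" * BLOCK
    fd = os.open(path, os.O_WRONLY)
    try:
        start = time.perf_counter()
        for i in range(size // BLOCK):
            write_all(fd, payload)
            if (i + 1) % SYNC_EVERY == 0:
                os.fsync(fd)
        os.fsync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Run from a directory on the filesystem you want to measure.
    with tempfile.TemporaryDirectory(dir=".") as d:
        for label, flag in (("zero-filled", False), ("fallocated", True)):
            path = os.path.join(d, label)
            preallocate(path, SIZE, flag)
            print(label, time_overwrite(path, SIZE))
```

Whether the fallocated file is slower to overwrite, and by how much, depends on the filesystem; the sketch only makes the comparison possible.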

I wonder whether it would be worth noticeably upping checkpoint_segments
and then testing:
a) a COPY into a large table
b) a pgbench on a previously initialized table.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#21Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#20)
#22Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Robert Haas (#21)
#23Andres Freund
andres@anarazel.de
In reply to: Jon Nelson (#22)
#24Greg Smith
gsmith@gregsmith.com
In reply to: Jon Nelson (#22)
#25Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Greg Smith (#24)
#26Greg Smith
gsmith@gregsmith.com
In reply to: Jon Nelson (#25)
#27Peter Eisentraut
peter_e@gmx.net
In reply to: Greg Smith (#24)
#28Stephen Frost
sfrost@snowman.net
In reply to: Peter Eisentraut (#27)
#29Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#28)
#30Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#29)
#31Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#29)
#32Greg Smith
gsmith@gregsmith.com
In reply to: Robert Haas (#31)
#33Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#31)
#34Andres Freund
andres@anarazel.de
In reply to: Greg Smith (#32)
#35Greg Smith
gsmith@gregsmith.com
In reply to: Andres Freund (#34)
#36Andres Freund
andres@anarazel.de
In reply to: Greg Smith (#35)
#37Greg Smith
gsmith@gregsmith.com
In reply to: Andres Freund (#36)
#38Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#33)
#39Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#38)
#40Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#33)
#41Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#38)
#42Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#41)
#43Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#40)
#44Stephen Frost
sfrost@snowman.net
In reply to: Peter Eisentraut (#40)
#45Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#42)
#46Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#43)
#47Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#46)
#48Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#47)
#49Greg Smith
gsmith@gregsmith.com
In reply to: Stephen Frost (#45)
#50Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#39)
#51Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Greg Smith (#49)
#52Greg Smith
gsmith@gregsmith.com
In reply to: Alvaro Herrera (#51)
#53Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#47)
#54Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Noah Misch (#53)
#55Stefan Drees
stefan@drees.name
In reply to: Jon Nelson (#54)
#56Merlin Moncure
mmoncure@gmail.com
In reply to: Stefan Drees (#55)
#57Stephen Frost
sfrost@snowman.net
In reply to: Merlin Moncure (#56)
#58Greg Smith
gsmith@gregsmith.com
In reply to: Merlin Moncure (#56)
#59Stefan Drees
stefan@drees.name
In reply to: Greg Smith (#58)
#60Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Stefan Drees (#59)
#61Merlin Moncure
mmoncure@gmail.com
In reply to: Greg Smith (#58)
#62Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#57)
#63Jeff Davis
pgsql@j-davis.com
In reply to: Jon Nelson (#18)
#64Greg Smith
gsmith@gregsmith.com
In reply to: Jeff Davis (#63)
#65Jeff Davis
pgsql@j-davis.com
In reply to: Stephen Frost (#57)
#66Jeff Davis
pgsql@j-davis.com
In reply to: Greg Smith (#64)
#67Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Jeff Davis (#63)
#68Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Jon Nelson (#67)
#69Jeff Davis
pgsql@j-davis.com
In reply to: Jon Nelson (#68)
#70Josh Berkus
josh@agliodbs.com
In reply to: Jon Nelson (#1)
#71Jeff Davis
pgsql@j-davis.com
In reply to: Greg Smith (#26)
#72Jeff Davis
pgsql@j-davis.com
In reply to: Josh Berkus (#70)
#73Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#72)
#74Andrew Dunstan
andrew@dunslane.net
In reply to: Jeff Davis (#73)
#75Greg Smith
gsmith@gregsmith.com
In reply to: Jeff Davis (#71)
#76Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Greg Smith (#75)
#77Greg Smith
gsmith@gregsmith.com
In reply to: Jon Nelson (#25)
#78Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Greg Smith (#77)
#79Greg Smith
gsmith@gregsmith.com
In reply to: Jon Nelson (#78)
#80Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Greg Smith (#79)
#81Jeff Davis
pgsql@j-davis.com
In reply to: Andrew Dunstan (#74)
#82Jeff Davis
pgsql@j-davis.com
In reply to: Greg Smith (#75)
#83Fujii Masao
masao.fujii@gmail.com
In reply to: Jeff Davis (#82)
#84Jeff Davis
pgsql@j-davis.com
In reply to: Fujii Masao (#83)
#85Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#84)
#86Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#81)
#87Greg Smith
gsmith@gregsmith.com
In reply to: Robert Haas (#85)
#88Jeff Davis
pgsql@j-davis.com
In reply to: Greg Smith (#79)
#89Greg Smith
gsmith@gregsmith.com
In reply to: Jeff Davis (#88)
#90Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Greg Smith (#89)
#91Jeff Davis
pgsql@j-davis.com
In reply to: Jon Nelson (#90)