9.4 regression

Started by Thom Brown over 12 years ago · 49 messages · pgsql-hackers
#1 Thom Brown
thom@linux.com

Hi all,

I recently tried a simple benchmark to see how far 9.4 had come since
8.4, but I discovered that I couldn't get 9.4 to even touch 8.4 for
performance. After checking 9.2 and 9.3 (as per Kevin Grittner's
suggestion), I found that those were fine, so the issue must be in
9.4devel. I used identical configurations for each test, and used
9.1's pgbench to ensure changes in pgbench didn't affect the outcome.
The common config changes were:

max_connections = 500
shared_buffers = 4GB
effective_cache_size = 12GB
random_page_cost = 2.0
cpu_tuple_cost = 0.03
wal_buffers = 32MB
work_mem = 100MB
maintenance_work_mem = 512MB
checkpoint_segments = 32
checkpoint_timeout = 15min
checkpoint_completion_target = 0.8
commit_delay = 50
commit_siblings = 15

System info:
8GB DDR3 RAM (yes, I know the config isn't optimal here)
64-bit Linux Mint 15 (3.8.0 kernel)
ext4

The only build option used was --enable-depend. I did have
--enable-cassert for the shorter 5-minute benchmarks, but it was removed
for the 30-minute tests.

Here are the results:

pgbench -j 80 -c 80 -T 300:

8.4 - 535.990042
9.2 - 820.798141
9.3 - 828.395498
9.4 - 197.851720

pgbench -j 20 -c 20 -T 300:

8.4 - 496.529343
9.2 - 569.626729
9.3 - 575.831264
9.4 - 385.658893

pgbench -j 20 -c 400 -T 300:

8.4 - 580.186868
9.2 - 824.590252
9.3 - 784.638848
9.4 - 524.493308

These were only run for 5 minutes each, but subsequent ones were run
for 30 mins. All tests were run with -s 20.

Given a few key commits Robert Haas directed me to, I narrowed down
the regression to a time period, so I benchmarked a few select commits.

pgbench -j 80 -c 80 -T 1800:

8.4: 812.482108
9.4 HEAD: 355.397658
9.4 e5592c (9th July): 356.485625
9.4 537227 (7th July): 365.992518
9.4 9b2543 (7th July): 362.587339
9.4 269e78 (5th July): 359.439143
9.4 8800d8 (5th July): 821.933082
9.4 568d41 (2nd July): 822.991160

269e78 was the commit immediately after 8800d8, so it appears that
introduced the regression.

"Use posix_fallocate() for new WAL files, where available."

Ironically, that was intended to be a performance improvement.

Regards

Thom

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Andres Freund
andres@anarazel.de
In reply to: Thom Brown (#1)
Re: 9.4 regression

On 2013-08-07 17:21:01 +0100, Thom Brown wrote:

The only build option used was --enable-depend. I did have
--enable-cassert for the shorter 5-minute benchmarks, but it was removed
for the 30-minute tests.

pgbench -j 80 -c 80 -T 300:

8.4 - 535.990042
9.2 - 820.798141
9.3 - 828.395498
9.4 - 197.851720


pgbench -j 80 -c 80 -T 1800:

8.4: 812.482108
9.4 HEAD: 355.397658
9.4 e5592c (9th July): 356.485625
9.4 537227 (7th July): 365.992518
9.4 9b2543 (7th July): 362.587339
9.4 269e78 (5th July): 359.439143
9.4 8800d8 (5th July): 821.933082
9.4 568d41 (2nd July): 822.991160

The differences between those runs look too small for enable/disable
cassert to me. Are you sure you properly rebuilt for that?

269e78 was the commit immediately after 8800d8, so it appears that
introduced the regression.

"Use posix_fallocate() for new WAL files, where available."

This is curious. Could you either run a longer test before/after the
commit or reduce checkpoint_timeout to something like 3min?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#3 Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Thom Brown (#1)
Re: 9.4 regression

On Wed, Aug 7, 2013 at 11:21 AM, Thom Brown <thom@linux.com> wrote:

Hi all,

I recently tried a simple benchmark to see how far 9.4 had come since
8.4, but I discovered that I couldn't get 9.4 to even touch 8.4 for
performance. After checking 9.2 and 9.3 (as per Kevin Grittner's
suggestion), I found that those were fine, so the issue must be in
9.4devel. I used identical configurations for each test, and used
9.1's pgbench to ensure changes in pgbench didn't affect the outcome.
The common config changes were:

...

8.4: 812.482108
9.4 HEAD: 355.397658
9.4 e5592c (9th July): 356.485625
9.4 537227 (7th July): 365.992518
9.4 9b2543 (7th July): 362.587339
9.4 269e78 (5th July): 359.439143
9.4 8800d8 (5th July): 821.933082
9.4 568d41 (2nd July): 822.991160

269e78 was the commit immediately after 8800d8, so it appears that
introduced the regression.

"Use posix_fallocate() for new WAL files, where available."

Ironically, that was intended to be a performance improvement.

Would it be possible for you to download, compile, and run the test
program as described and located in this email:

/messages/by-id/CAKuK5J1AcML-1cGBhnRzED-vh4oG+8HkmFhy2ECa-8JBJ-6qbQ@mail.gmail.com

I also wonder if there is a problem with the 3.8.0 kernel specifically.

--
Jon


#4 Thom Brown
thom@linux.com
In reply to: Andres Freund (#2)
Re: 9.4 regression

On 7 August 2013 17:49, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-08-07 17:21:01 +0100, Thom Brown wrote:

The only build option used was --enable-depend. I did have
--enable-cassert for the shorter 5-minute benchmarks, but it was removed
for the 30-minute tests.

pgbench -j 80 -c 80 -T 300:

8.4 - 535.990042
9.2 - 820.798141
9.3 - 828.395498
9.4 - 197.851720


pgbench -j 80 -c 80 -T 1800:

8.4: 812.482108
9.4 HEAD: 355.397658
9.4 e5592c (9th July): 356.485625
9.4 537227 (7th July): 365.992518
9.4 9b2543 (7th July): 362.587339
9.4 269e78 (5th July): 359.439143
9.4 8800d8 (5th July): 821.933082
9.4 568d41 (2nd July): 822.991160

The differences between those runs look too small for enable/disable
cassert to me. Are you sure you properly rebuilt for that?

269e78 was the commit immediately after 8800d8, so it appears that
introduced the regression.

"Use posix_fallocate() for new WAL files, where available."

This is curious. Could you either run a longer test before/after the
commit or reduce checkpoint_timeout to something like 3min?

Okay, I'll rerun the test for both those commits at 1 hour each with
checkpoint_timeout set to 3 mins, but with all other configuration
settings the same.

--
Thom


#5 Thom Brown
thom@linux.com
In reply to: Jon Nelson (#3)
Re: 9.4 regression

On 7 August 2013 17:54, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:

On Wed, Aug 7, 2013 at 11:21 AM, Thom Brown <thom@linux.com> wrote:

Hi all,

I recently tried a simple benchmark to see how far 9.4 had come since
8.4, but I discovered that I couldn't get 9.4 to even touch 8.4 for
performance. After checking 9.2 and 9.3 (as per Kevin Grittner's
suggestion), I found that those were fine, so the issue must be in
9.4devel. I used identical configurations for each test, and used
9.1's pgbench to ensure changes in pgbench didn't affect the outcome.
The common config changes were:

...

8.4: 812.482108
9.4 HEAD: 355.397658
9.4 e5592c (9th July): 356.485625
9.4 537227 (7th July): 365.992518
9.4 9b2543 (7th July): 362.587339
9.4 269e78 (5th July): 359.439143
9.4 8800d8 (5th July): 821.933082
9.4 568d41 (2nd July): 822.991160

269e78 was the commit immediately after 8800d8, so it appears that
introduced the regression.

"Use posix_fallocate() for new WAL files, where available."

Ironically, that was intended to be a performance improvement.

Would it be possible for you to download, compile, and run the test
program as described and located in this email:

/messages/by-id/CAKuK5J1AcML-1cGBhnRzED-vh4oG+8HkmFhy2ECa-8JBJ-6qbQ@mail.gmail.com

I shall do so after the 2 x 1-hour benchmarks are complete.

I also wonder if there is a problem with the 3.8.0 kernel specifically.

Well, my laptop has the same kernel (and also 64-bit Linux Mint), so I
took 3 quick sample benchmarks on those two commits, and I get the
following (all without --enable-cassert):

269e78: 1162.593665 / 1158.644302 / 1153.955611
8800d8: 2446.087618 / 2443.797252 / 2321.876431

And running your test program gives the following (again, just on my laptop):

for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
method: classic. 1 open/close iterations, 1 rewrite in 0.6380s
method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3204s
method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6274s
method: classic. 2 open/close iterations, 1 rewrite in 1.2908s
method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6596s
method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.2666s
method: classic. 5 open/close iterations, 1 rewrite in 3.1419s
method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.5930s
method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1516s
method: classic. 10 open/close iterations, 1 rewrite in 6.2912s
method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2626s
method: glibc emulation. 10 open/close iterations, 1 rewrite in 6.3667s
method: classic. 100 open/close iterations, 1 rewrite in 67.4174s
method: posix_fallocate. 100 open/close iterations, 1 rewrite in 37.8788s
method: glibc emulation. 100 open/close iterations, 1 rewrite in 55.0714s

--
Thom


#6 Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Thom Brown (#5)
Re: 9.4 regression

On Wed, Aug 7, 2013 at 12:42 PM, Thom Brown <thom@linux.com> wrote:

On 7 August 2013 17:54, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:

On Wed, Aug 7, 2013 at 11:21 AM, Thom Brown <thom@linux.com> wrote:

Hi all,

I recently tried a simple benchmark to see how far 9.4 had come since
8.4, but I discovered that I couldn't get 9.4 to even touch 8.4 for
performance. After checking 9.2 and 9.3 (as per Kevin Grittner's
suggestion), I found that those were fine, so the issue must be in
9.4devel. I used identical configurations for each test, and used
9.1's pgbench to ensure changes in pgbench didn't affect the outcome.
The common config changes were:

...

8.4: 812.482108
9.4 HEAD: 355.397658
9.4 e5592c (9th July): 356.485625
9.4 537227 (7th July): 365.992518
9.4 9b2543 (7th July): 362.587339
9.4 269e78 (5th July): 359.439143
9.4 8800d8 (5th July): 821.933082
9.4 568d41 (2nd July): 822.991160

269e78 was the commit immediately after 8800d8, so it appears that
introduced the regression.

"Use posix_fallocate() for new WAL files, where available."

Ironically, that was intended to be a performance improvement.

Would it be possible for you to download, compile, and run the test
program as described and located in this email:

/messages/by-id/CAKuK5J1AcML-1cGBhnRzED-vh4oG+8HkmFhy2ECa-8JBJ-6qbQ@mail.gmail.com

I shall do so after the 2 x 1-hour benchmarks are complete.

I also wonder if there is a problem with the 3.8.0 kernel specifically.

Well, my laptop has the same kernel (and also 64-bit Linux Mint), so I
took 3 quick sample benchmarks on those two commits, and I get the
following (all without --enable-cassert):

269e78: 1162.593665 / 1158.644302 / 1153.955611
8800d8: 2446.087618 / 2443.797252 / 2321.876431

And running your test program gives the following (again, just on my laptop):

for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
method: classic. 1 open/close iterations, 1 rewrite in 0.6380s
method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3204s
method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6274s
method: classic. 2 open/close iterations, 1 rewrite in 1.2908s
method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6596s
method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.2666s
method: classic. 5 open/close iterations, 1 rewrite in 3.1419s
method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.5930s
method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1516s
method: classic. 10 open/close iterations, 1 rewrite in 6.2912s
method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2626s
method: glibc emulation. 10 open/close iterations, 1 rewrite in 6.3667s
method: classic. 100 open/close iterations, 1 rewrite in 67.4174s
method: posix_fallocate. 100 open/close iterations, 1 rewrite in 37.8788s
method: glibc emulation. 100 open/close iterations, 1 rewrite in 55.0714s

OK. That's interesting, and a good data point.

One thing you could try is manually disabling the use of posix_fallocate
in 269e78. After running ./configure --stuff-here, edit
src/include/pg_config.h and comment out the following line (on or around
line 374):
#define HAVE_POSIX_FALLOCATE 1

*then* build postgresql and see if the performance hit is still there.

--
Jon


#7 Thom Brown
thom@linux.com
In reply to: Jon Nelson (#6)
Re: 9.4 regression

On 7 August 2013 18:49, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:

On Wed, Aug 7, 2013 at 12:42 PM, Thom Brown <thom@linux.com> wrote:

for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
method: classic. 1 open/close iterations, 1 rewrite in 0.6380s
method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3204s
method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6274s
method: classic. 2 open/close iterations, 1 rewrite in 1.2908s
method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6596s
method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.2666s
method: classic. 5 open/close iterations, 1 rewrite in 3.1419s
method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.5930s
method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1516s
method: classic. 10 open/close iterations, 1 rewrite in 6.2912s
method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2626s
method: glibc emulation. 10 open/close iterations, 1 rewrite in 6.3667s
method: classic. 100 open/close iterations, 1 rewrite in 67.4174s
method: posix_fallocate. 100 open/close iterations, 1 rewrite in 37.8788s
method: glibc emulation. 100 open/close iterations, 1 rewrite in 55.0714s

OK. That's interesting, and a good data point.

One thing you could try is manually disabling the use of posix_fallocate
in 269e78. After running ./configure --stuff-here, edit
src/include/pg_config.h and comment out the following line (on or around
line 374):
#define HAVE_POSIX_FALLOCATE 1

*then* build postgresql and see if the performance hit is still there.

Okay, done that. The TPS increases again:

2308.807568 / 2554.264572 / 2563.190204

And I did run ./configure... before removing the line, and built it
after the change.

--
Thom


#8 Thom Brown
thom@linux.com
In reply to: Thom Brown (#4)
Re: 9.4 regression

On 7 August 2013 17:54, Thom Brown <thom@linux.com> wrote:

On 7 August 2013 17:49, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-08-07 17:21:01 +0100, Thom Brown wrote:

The only build option used was --enable-depend. I did have
--enable-cassert for the shorter 5-minute benchmarks, but it was removed
for the 30-minute tests.

pgbench -j 80 -c 80 -T 300:

8.4 - 535.990042
9.2 - 820.798141
9.3 - 828.395498
9.4 - 197.851720


pgbench -j 80 -c 80 -T 1800:

8.4: 812.482108
9.4 HEAD: 355.397658
9.4 e5592c (9th July): 356.485625
9.4 537227 (7th July): 365.992518
9.4 9b2543 (7th July): 362.587339
9.4 269e78 (5th July): 359.439143
9.4 8800d8 (5th July): 821.933082
9.4 568d41 (2nd July): 822.991160

The differences between those runs look too small for enable/disable
cassert to me. Are you sure you properly rebuilt for that?

269e78 was the commit immediately after 8800d8, so it appears that
introduced the regression.

"Use posix_fallocate() for new WAL files, where available."

This is curious. Could you either run a longer test before/after the
commit or reduce checkpoint_timeout to something like 3min?

Okay, I'll rerun the test for both those commits at 1 hour each with
checkpoint_timeout set to 3 mins, but with all other configuration
settings the same.

Results
(checkpoint_timeout = 3min)

pgbench -j 80 -c 80 -T 3600

269e78: 606.268013
8800d8: 779.583129

Jon, here are the test results you asked for:

$ for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
method: classic. 1 open/close iterations, 1 rewrite in 0.9157s
method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.5314s
method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6018s
method: classic. 2 open/close iterations, 1 rewrite in 1.1417s
method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6505s
method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.8900s
method: classic. 5 open/close iterations, 1 rewrite in 3.6490s
method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.9841s
method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1053s
method: classic. 10 open/close iterations, 1 rewrite in 5.7562s
method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2015s
method: glibc emulation. 10 open/close iterations, 1 rewrite in 7.1426s
method: classic. 100 open/close iterations, 1 rewrite in 64.9483s
method: posix_fallocate. 100 open/close iterations, 1 rewrite in 36.3748s
method: glibc emulation. 100 open/close iterations, 1 rewrite in 64.8386s

--
Thom


#9 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Thom Brown (#8)
Re: 9.4 regression

Thom Brown <thom@linux.com> wrote:

pgbench -j 80 -c 80 -T 3600

269e78: 606.268013
8800d8: 779.583129

As another data point I duplicated Thom's original tests:

max_connections = 500
shared_buffers = 4GB
effective_cache_size = 12GB
random_page_cost = 2.0
cpu_tuple_cost = 0.03
wal_buffers = 32MB
work_mem = 100MB
maintenance_work_mem = 512MB
checkpoint_segments = 32
checkpoint_timeout = 15min
checkpoint_completion_target = 0.8
commit_delay = 50
commit_siblings = 15

pgbench -i -s 20 pgbench
pgbench -j 80 -c 80 -T 1800 pgbench
(note the high contention pgbench setup)

On this system:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
  (hyperthreading enabled)
2 x 15k RPM drives in software RAID 1, ext4, write cache disabled
3.5.0-37-generic #58-Ubuntu SMP Mon Jul 8 22:07:55 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

commit 8800d8061dd151d6556f5f8d58f8211fd830169f
number of transactions actually processed: 1298864
tps = 721.384612 (including connections establishing)
tps = 721.388938 (excluding connections establishing)

commit 269e780822abb2e44189afaccd6b0ee7aefa7ddd
number of transactions actually processed: 875900
tps = 486.524741 (including connections establishing)
tps = 486.527891 (excluding connections establishing)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#10 Greg Stark
stark@mit.edu
In reply to: Thom Brown (#8)
Re: 9.4 regression

Did you report information about the system affected? What filesystem is it
on? If it's ext4 does it have extents enabled?

I think on ext3 or ext4 without extents it won't have any benefit but it
shouldn't really be any slower either since the libc implementation is very
similar to what we used to do.

I wouldn't expect the benefit from it to have a huge effect either. The
main benefit is actually the reduced fragmentation, so it would be pretty
dependent on the filesystem it's on anyway. Avoiding stalls when creating
new WAL files is also nice, but that shouldn't happen often enough in your
benchmarks to affect the average numbers.

#11 Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Kevin Grittner (#9)
Re: 9.4 regression

On Wed, Aug 7, 2013 at 4:18 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Thom Brown <thom@linux.com> wrote:

pgbench -j 80 -c 80 -T 3600

269e78: 606.268013
8800d8: 779.583129

I have also been running some tests and - as yet - they are
inconclusive. What I can say about them so far is that - at times -
they agree with these results and do show a performance hit.

I took the settings as posted and adjusted them slightly for my much
lower-powered personal laptop, changing checkpoint_completion_target
to 1.0 and checkpoint_timeout to 1min.

I am testing with the latest git HEAD, turning fallocate support on
and off by editing xlog.c directly. Furthermore, before each run I
would try to reduce the number of existing WAL files by making a "pre"
run with checkpoint_segments = 3 before changing it to 32.

For reasons I don't entirely understand, when WAL files are being
created (rather than recycled) I found a performance hit, but when
they were being recycled I got a slight performance gain (this may be
due to reduced fragmentation), though the gain was never more than 10%
and frequently less than that.

I can't explain - yet (if ever!) - why my initial tests (and many of
those done subsequently) showed improvement - it may very well have
something to do with how the code interacts with these settings.

--
Jon


#12 Thom Brown
thom@linux.com
In reply to: Bruce Momjian (#10)
Re: 9.4 regression

On 7 August 2013 23:40, Greg Stark <stark@mit.edu> wrote:

Did you report information about the system affected? What filesystem is it
on? If it's ext4 does it have extents enabled?

Yes, ext4. It's using whatever the default options are, but running
lsattr on my data dir shows that extents are being used.
--
Thom


#13 Thom Brown
thom@linux.com
In reply to: Thom Brown (#12)
Re: 9.4 regression

On 8 August 2013 00:04, Thom Brown <thom@linux.com> wrote:

On 7 August 2013 23:40, Greg Stark <stark@mit.edu> wrote:

Did you report information about the system affected? What filesystem is it
on? If it's ext4 does it have extents enabled?

Yes, ext4. It's using whatever the default options are, but running
lsattr on my data dir shows that extents are being used.

I've just created an ext2 partition (so no extents), and did some
quick benchmarks of those 2 commits. This is on the original system
(not my laptop), with the same config settings as before. All tests
initialised with -s 20 and tested with -j 80 -c 80.

30 seconds:

8800d8: 1327.983018 / 1317.848514
269e78: 1388.009819 / 1285.929100

5 minutes:

8800d8: 1744.446592 / 1729.602706
269e78: 1677.561277 / 1677.967873

10 minutes:

8800d8: 1753.147609 / 1806.600335
269e78: 1815.862842 / 1804.906769

So this appears to be okay.

--
Thom


#14 Josh Berkus
josh@agliodbs.com
In reply to: Thom Brown (#13)
Re: 9.4 regression

On 08/07/2013 05:40 PM, Thom Brown wrote:

On 8 August 2013 00:04, Thom Brown <thom@linux.com> wrote:

On 7 August 2013 23:40, Greg Stark <stark@mit.edu> wrote:

Did you report information about the system affected? What filesystem is it
on? If it's ext4 does it have extents enabled?

Yes, ext4. It's using whatever the default options are, but running
lsattr on my data dir shows that extents are being used.

Would it be worth testing XFS? If so, any particular options I should use?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#15 Andres Freund
andres@anarazel.de
In reply to: Thom Brown (#8)
Re: 9.4 regression

On 2013-08-07 20:23:55 +0100, Thom Brown wrote:

269e78 was the commit immediately after 8800d8, so it appears that
introduced the regression.

"Use posix_fallocate() for new WAL files, where available."

This is curious. Could you either run a longer test before/after the
commit or reduce checkpoint_timeout to something like 3min?

Okay, I'll rerun the test for both those commits at 1 hour each with
checkpoint_timeout set to 3 mins, but with all other configuration
settings the same.

Results
(checkpoint_timeout = 3min)

pgbench -j 80 -c 80 -T 3600

269e78: 606.268013
8800d8: 779.583129

Ok, so the performance difference is lower. But it still seems too
high to be explainable by performance differences during
initialization/fallocate.
In a very quick test, I don't see the same on my laptop. I am currently
travelling and I don't have convenient access to anything else.

Could you:
- run filefrag on a couple of wal segments of both clusters after the
test and post the results here?
- enable log_checkpoints, post the lines spat out together with the results
- run pgbench without reinitializing the cluster/pgbench tables in
between?

Basically, I'd like to know whether it's the initialization that's slow
(measurable because we're regularly creating new files because of a too
low checkpoint_segments) or whether it's the actual writes.

Jon, here are the test results you asked for:

$ for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
method: classic. 1 open/close iterations, 1 rewrite in 0.9157s
method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.5314s
method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6018s
method: classic. 2 open/close iterations, 1 rewrite in 1.1417s
method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6505s
method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.8900s
method: classic. 5 open/close iterations, 1 rewrite in 3.6490s
method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.9841s
method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1053s
method: classic. 10 open/close iterations, 1 rewrite in 5.7562s
method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2015s
method: glibc emulation. 10 open/close iterations, 1 rewrite in 7.1426s
method: classic. 100 open/close iterations, 1 rewrite in 64.9483s
method: posix_fallocate. 100 open/close iterations, 1 rewrite in 36.3748s
method: glibc emulation. 100 open/close iterations, 1 rewrite in 64.8386s

Ok, this seems to indicate that fallocate works nicely. Jon, wasn't
there a version of the test that rewrote the file afterwards?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#16 Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Andres Freund (#15)
Re: 9.4 regression

On Wed, Aug 7, 2013 at 10:05 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-08-07 20:23:55 +0100, Thom Brown wrote:

269e78 was the commit immediately after 8800d8, so it appears that
introduced the regression.

"Use posix_fallocate() for new WAL files, where available."

This is curious. Could you either run a longer test before/after the
commit or reduce checkpoint_timeout to something like 3min?

Okay, I'll rerun the test for both those commits at 1 hour each with
checkpoint_timeout set to 3 mins, but with all other configuration
settings the same.

Results
(checkpoint_timeout = 3min)

pgbench -j 80 -c 80 -T 3600

269e78: 606.268013
8800d8: 779.583129

Ok, so the performance difference is lower. But it still seems too
high to be explainable by performance differences during
initialization/fallocate.
In a very quick test, I don't see the same on my laptop. I am currently
travelling and I don't have convenient access to anything else.

Could you:
- run filefrag on a couple of wal segments of both clusters after the
test and post the results here?

For me, there is no meaningful difference, but I have a relatively
fresh filesystem (ext4) with lots of free space.

- enable log_checkpoints, post the lines spat out together with the results
- run pgbench without reinitializing the cluster/pgbench tables in
between?

When I do this (as I note below), the performance difference (for me)
disappears.

Basically, I'd like to know whether it's the initialization that's slow
(measurable because we're regularly creating new files because of a too
low checkpoint_segments) or whether it's the actual writes.

What I've found so far is very confusing.
I start by using initdb to initialize a temporary cluster, copy in a
postgresql.conf (with the modifications from Thom Brown tweaked for my
small laptop), start the cluster, create a test database, initialize
it with pgbench, and then run. I'm also only running for two minutes
at this time.

Every time I test, the non-fallocate version is faster. I modified
xlog.c to use posix_fallocate (or not) based on an environment
variable.
Once the WAL files have been rewritten at least once, then it doesn't
seem to matter which method is used to allocate them (although
posix_fallocate seems to have a slight edge). I even went so far as to
modify the code to posix_fallocate the file *and then zero it out
anyway*, and it was almost as fast as zeroing alone, and faster than
using posix_fallocate alone.

Jon, here are the test results you asked for:

$ for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
method: classic. 1 open/close iterations, 1 rewrite in 0.9157s
method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.5314s
method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6018s
method: classic. 2 open/close iterations, 1 rewrite in 1.1417s
method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6505s
method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.8900s
method: classic. 5 open/close iterations, 1 rewrite in 3.6490s
method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.9841s
method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1053s
method: classic. 10 open/close iterations, 1 rewrite in 5.7562s
method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2015s
method: glibc emulation. 10 open/close iterations, 1 rewrite in 7.1426s
method: classic. 100 open/close iterations, 1 rewrite in 64.9483s
method: posix_fallocate. 100 open/close iterations, 1 rewrite in 36.3748s
method: glibc emulation. 100 open/close iterations, 1 rewrite in 64.8386s

Ok, this seems to indicate that fallocate works nicely. Jon, wasn't
there a version of the test that rewrote the file afterwards?

Yes. If you use a different number besides '1' as the third argument
in the command line above, it will rewrite the file that many times.

--
Jon


#17 Andres Freund
andres@anarazel.de
In reply to: Jon Nelson (#16)
Re: 9.4 regression

On 2013-08-07 23:55:22 -0500, Jon Nelson wrote:

- enable log_checkpoints, post the lines spat out together with the results
- run pgbench without reinitializing the cluster/pgbench tables in
between?

When I do this (as I note below), the performance difference (for me)
disappears.

Ok. In this case the files have been written to and fsync()ed before the
test is run.

Basically, I'd like to know whether it's the initialization that's slow
(measurable because we're regularly creating new files because of a too
low checkpoint_segments) or whether it's the actual writes.

What I've found so far is very confusing.
I start by using initdb to initialize a temporary cluster, copy in a
postgresql.conf (with the modifications from Thom Brown tweaked for my
small laptop), start the cluster, create a test database, initialize
it with pgbench, and then run. I'm also only running for two minutes
at this time.

Hm. In a 2-minute run there won't be any checkpoints...

Every time I test, the non-fallocate version is faster. I modified
xlog.c to use posix_fallocate (or not) based on an environment
variable.
Once the WAL files have been rewritten at least once, then it doesn't
seem to matter which method is used to allocate them (although
posix_fallocate seems to have a slight edge). I even went so far as to
modify the code to posix_fallocate the file *and then zero it out
anyway*, and it was almost as fast as zeroing alone, and faster than
using posix_fallocate alone.

Ok, so it seems there are performance issues with allocating the data even
though the allocation supposedly has already been made. Access by
several backends may play a role here.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#18Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Andres Freund (#17)
Re: 9.4 regression

A follow-up.
I found this thread that seems to explain some things:

http://comments.gmane.org/gmane.comp.file-systems.ext4/33024

Short version: if we are writing into the "middle" of the
newly-fallocated file on ext4 (with extents) the extent tree can
fragment badly, causing poor performance due to extent merging.

I also ran some tests with xfs, and the results were even stranger:

xfs performed very slightly better with fallocate vs. without.
xfs (fallocate) 216 tps vs. (no fallocate) 206
ext4 (no fallocate) 605 vs (fallocate) 134.

I made an ext4 filesystem without extents using the same block device
as above - a real, single hard drive with nothing else on it.
On this filesystem, the performance remained the same (or perhaps
improved very slightly): 633 tps (with fallocate) vs. 607.

Using the following additions to postgresql.conf:

max_connections = 500
shared_buffers = 1GB
effective_cache_size = 1GB
random_page_cost = 2.0
cpu_tuple_cost = 0.03
wal_buffers = 32MB
work_mem = 100MB
maintenance_work_mem = 512MB
commit_delay = 50
commit_siblings = 15
checkpoint_segments = 256
log_checkpoints = on

and this partial script for testing:

pg_ctl initdb -D tt -w
cp ~/postgresql.conf tt
pg_ctl -D tt -w start
createdb -p 54320 pgb
pgbench -s 20 -p 54320 -d pgb -i
pgbench -j 80 -c 80 -T 120 -p 54320 pgb
pg_ctl -D tt -w stop

--
Jon


#19Thom Brown
thom@linux.com
In reply to: Andres Freund (#15)
Re: 9.4 regression

On 8 August 2013 04:05, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-08-07 20:23:55 +0100, Thom Brown wrote:

269e78 was the commit immediately after 8800d8, so it appears that
introduced the regression.

"Use posix_fallocate() for new WAL files, where available."

This is curious. Could you either run a longer test before/after the
commit or reduce checkpoint_timeout to something like 3min?

Okay, I'll rerun the test for both those commits at 1 hour each with
checkpoint_timeout set at 3mins, but with all other configuration
settings the same.

Results
(checkpoint_timeout = 3min)

pgbench -j 80 -c 80 -T 3600

269e78: 606.268013
8800d8: 779.583129

Ok, so the performance difference is lower. But it still seems too high
to be explainable by performance differences during
initialization/fallocate.
In a very quick test, I don't see the same on my laptop. I am currently
travelling and I don't have convenient access to anything else.

Could you:
- run filefrag on a couple of wal segments of both clusters after the
test and post the results here?

I ran a benchmark for 10 minutes. I ran filefrag against all files in
pg_xlog and they all say "1 extent found"

- enable log_checkpoints, post the lines spat out together with the

results

LOG: database system was shut down at 2013-08-08 16:38:28 BST
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: checkpoint starting: immediate force wait
LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=0.041 s, sync=0.034 s,
total=0.132 s; sync files=3, longest=0.017 s, average=0.011 s
LOG: checkpoint starting: immediate force wait
LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=0.015 s, sync=0.000 s,
total=0.071 s; sync files=0, longest=0.000 s, average=0.000 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 39137 buffers (7.5%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=143.757 s, sync=5.659 s,
total=149.620 s; sync files=44, longest=3.814 s, average=0.128 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 21141 buffers (4.0%); 0 transaction log
file(s) added, 0 removed, 15 recycled; write=143.082 s, sync=13.144 s,
total=156.421 s; sync files=13, longest=10.336 s, average=1.011 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 27702 buffers (5.3%); 0 transaction log
file(s) added, 0 removed, 15 recycled; write=143.739 s, sync=4.470 s,
total=148.337 s; sync files=12, longest=2.545 s, average=0.372 s
LOG: checkpoint starting: time

- run pgbench without reinitializing the cluster/pgbench tables
in between?

Set 1 (10 mins each):
First run after initialisation: 289.410795
Re-run without re-initialisation: 519.694545

These additional lines were added to the log for the re-run:

LOG: checkpoint complete: wrote 14793 buffers (2.8%); 0 transaction log
file(s) added, 0 removed, 16 recycled; write=143.299 s, sync=2.318 s,
total=145.691 s; sync files=13, longest=1.910 s, average=0.178 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 37236 buffers (7.1%); 0 transaction log
file(s) added, 0 removed, 13 recycled; write=143.410 s, sync=7.363 s,
total=150.981 s; sync files=16, longest=4.871 s, average=0.460 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 28271 buffers (5.4%); 0 transaction log
file(s) added, 0 removed, 20 recycled; write=143.597 s, sync=11.761 s,
total=155.472 s; sync files=13, longest=8.494 s, average=0.904 s
LOG: checkpoint starting: time

Set 2 (10 mins each):
First run after initialisation: 295.925985
Re-run without re-initialisation: 503.239516

Here's the log for set 2 after the first run:

LOG: database system was shut down at 2013-08-08 17:06:41 BST
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: checkpoint starting: immediate force wait
LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=0.041 s, sync=0.034 s,
total=0.153 s; sync files=3, longest=0.017 s, average=0.011 s
LOG: checkpoint starting: immediate force wait
LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=0.014 s, sync=0.000 s,
total=0.057 s; sync files=0, longest=0.000 s, average=0.000 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 39148 buffers (7.5%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=143.322 s, sync=5.519 s,
total=149.256 s; sync files=44, longest=3.691 s, average=0.125 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 21447 buffers (4.1%); 0 transaction log
file(s) added, 0 removed, 15 recycled; write=143.161 s, sync=13.459 s,
total=156.843 s; sync files=13, longest=10.915 s, average=1.035 s
LOG: checkpoint starting: time

After the re-run in set 2, these lines were added:

LOG: checkpoint complete: wrote 26609 buffers (5.1%); 0 transaction log
file(s) added, 0 removed, 15 recycled; write=143.253 s, sync=6.588 s,
total=150.036 s; sync files=15, longest=2.606 s, average=0.439 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 16931 buffers (3.2%); 0 transaction log
file(s) added, 0 removed, 16 recycled; write=143.171 s, sync=16.489 s,
total=160.014 s; sync files=14, longest=12.660 s, average=1.177 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 35345 buffers (6.7%); 0 transaction log
file(s) added, 0 removed, 14 recycled; write=143.170 s, sync=10.596 s,
total=153.977 s; sync files=16, longest=7.996 s, average=0.662 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 27907 buffers (5.3%); 0 transaction log
file(s) added, 0 removed, 19 recycled; write=143.184 s, sync=12.562 s,
total=155.969 s; sync files=14, longest=7.844 s, average=0.897 s
LOG: checkpoint starting: time

I rechecked WAL file fragmentation after each run, and it always says "1
extent found".

--
Thom

#20Hannu Krosing
hannu@tm.ee
In reply to: Jon Nelson (#18)
Re: 9.4 regression

On 08/08/2013 05:28 PM, Jon Nelson wrote:

A follow-up.
I found this thread that seems to explain some things:

http://comments.gmane.org/gmane.comp.file-systems.ext4/33024

Short version: if we are writing into the "middle" of the
newly-fallocated file on ext4 (with extents) the extent tree can
fragment badly, causing poor performance due to extent merging.

I also ran some tests with xfs, and the results were even stranger:

xfs performed very slightly better with fallocate vs. without.
xfs (fallocate) 216 tps vs. (no fallocate) 206
ext4 (no fallocate) 605 vs (fallocate) 134.

I made an ext4 filesystem without extents using the same block device
as above - a real, single hard drive with nothing else on it.
On this filesystem, the performance remained the same (or perhaps
improved very slightly): 633 tps (with fallocate) vs. 607.

Using the following additions to postgresql.conf:

max_connections = 500
shared_buffers = 1GB
effective_cache_size = 1GB
random_page_cost = 2.0
cpu_tuple_cost = 0.03
wal_buffers = 32MB
work_mem = 100MB
maintenance_work_mem = 512MB
commit_delay = 50
commit_siblings = 15
checkpoint_segments = 256
log_checkpoints = on

and this partial script for testing:

pg_ctl initdb -D tt -w
cp ~/postgresql.conf tt
pg_ctl -D tt -w start
createdb -p 54320 pgb
pgbench -s 20 -p 54320 -d pgb -i

Just an idea: can you check whether using a fillfactor different from 100
changes anything?

pgbench -s 20 -p 54320 -d pgb -F 90 -i

pgbench -j 80 -c 80 -T 120 -p 54320 pgb
pg_ctl -D tt -w stop

That is, does extending tables and indexes to add updated tuples play
any role here?

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


#21Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Hannu Krosing (#20)
#22Bruce Momjian
bruce@momjian.us
In reply to: Jon Nelson (#21)
#23Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Bruce Momjian (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jon Nelson (#21)
#25Bruce Momjian
bruce@momjian.us
In reply to: Jon Nelson (#23)
#26Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Tom Lane (#24)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jon Nelson (#26)
#28Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Tom Lane (#27)
#29Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Jon Nelson (#28)
#30Andres Freund
andres@anarazel.de
In reply to: Jon Nelson (#21)
#31Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Andres Freund (#30)
#32Andres Freund
andres@anarazel.de
In reply to: Jon Nelson (#31)
#33Bruce Momjian
bruce@momjian.us
In reply to: Thom Brown (#7)
#34Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Bruce Momjian (#33)
#35Bruce Momjian
bruce@momjian.us
In reply to: Jon Nelson (#34)
#36Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Bruce Momjian (#35)
#37Andres Freund
andres@anarazel.de
In reply to: Jon Nelson (#36)
#38Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#37)
#39Bruce Momjian
bruce@momjian.us
In reply to: Stephen Frost (#38)
#40Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#24)
#41Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#40)
#42Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#41)
#43Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#42)
#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#42)
#45Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#44)
#46KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Andres Freund (#41)
#47Bruce Momjian
bruce@momjian.us
In reply to: Stephen Frost (#38)
#48Thom Brown
thom@linux.com
In reply to: Bruce Momjian (#47)
#49Jon Nelson
jnelson+pgsql@jamponi.net
In reply to: Thom Brown (#48)