8.4 open item: copy performance regression?
Any objections if I add:
http://archives.postgresql.org/pgsql-performance/2009-06/msg00215.php
to the (currently empty) list of open items for 8.4?
A 25-30% performance regression in our main bulk loading mechanism
should at least be explained before the release...
Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
A 25-30% performance regression in our main bulk loading mechanism
should at least be explained before the release...
I think a performance regression of that magnitude merits holding up
a release to resolve.
Note that in a follow-up post showing the slow 8.4 copying on Linux,
the copy was 11.7% to 19.3% *faster* on 8.4 when the WAL writing was
suppressed:
http://archives.postgresql.org/pgsql-performance/2009-06/msg00219.php
Extracting from that post:
# I can reproduce that on Linux (CentOS 5.3/x86_64, Nehalem Xeon E5530)
# on 8.4 I get:
#
# 3m59/4m01/3m56s runtime and a profile of
#
# samples % symbol name
# 636302 19.6577 XLogInsert
# 415510 12.8366 CopyReadLine
# on 8.3.7 I get 2m58s,2m54s,2m55s
#
# and a profile of:
#
# samples % symbol name
# 460966 16.2924 XLogInsert
# 307386 10.8643 CopyReadLine
# If I do the same test utilizing WAL bypass the picture changes:
#
# 8.3 runtimes:2m16,2min14s,2min22s
# 8.4 runtime: 2m1s,2m,1m59s
Is there a reason to believe that the XLogInsert part of this *only*
affects bulk loads?
-Kevin
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
Any objections if I add:
http://archives.postgresql.org/pgsql-performance/2009-06/msg00215.php
to the (currently empty) list of open items for 8.4?
I am unable to duplicate any slowdown on this test case. AFAICT
8.4 and 8.3 branch tip are about the same speed; if anything 8.4
is faster. Testing on x86_64 Fedora 10 ...
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
Any objections if I add:
http://archives.postgresql.org/pgsql-performance/2009-06/msg00215.php
to the (currently empty) list of open items for 8.4?
I am unable to duplicate any slowdown on this test case. AFAICT
8.4 and 8.3 branch tip are about the same speed; if anything 8.4
is faster. Testing on x86_64 Fedora 10 ...
I just ran the specified test on:
Linux ATHENA 2.6.16.60-0.31-smp #1 SMP Tue Oct 7 16:16:29 UTC 2008
x86_64 x86_64 x86_64 GNU/Linux
Linux version 2.6.16.60-0.31-smp (geeko@buildhost) (gcc version 4.1.2
20070115 (SUSE Linux)) #1 SMP Tue Oct 7 16:16:29 UTC 2008
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
PATCHLEVEL = 2
I did configure with a prefix and these options:
--enable-integer-datetimes --enable-debug --disable-nls --with-libxml
I did initdb with --no-locale
8.3.7
real 0m24.249s
real 0m24.054s
real 0m24.361s
8.4rc1
real 0m33.503s
real 0m34.198s
real 0m33.931s
-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
I am unable to duplicate any slowdown on this test case.
[ Kevin can ]
It'd be useful first off to figure out if it's a CPU or I/O issue.
Is there any visible difference in vmstat output? Also, try turning
off autovacuum in both cases, just to see if that's related.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
It'd be useful first off to figure out if it's a CPU or I/O issue.
Is there any visible difference in vmstat output? Also, try turning
off autovacuum in both cases, just to see if that's related.
Both took slightly longer with autovacuum off, but probably just
within the noise.
It's hard to interpret the vmstat output, largely because I chose to
run this on one of our biggest servers, which is not currently serving
an application per se but acting as a replication target, and this being
"off hours" it is busy running the sync process to the source machines.
This involves generating md5 sums on both sides for blocks of rows,
which is pretty CPU-intensive. There is very little disk output from
that right now, pretty light on the disk reads, but keeping a few CPUs
pretty busy generating those md5 sums.
I've got to go keep an appointment, but I'll come back and see if I
can do more. For now, here's the raw vmstat, in case someone can pick
out info I'm missing:
-Kevin
Attachments:
vmstat.txt (text/plain)
On Thu, Jun 18, 2009 at 05:20:08PM -0400, Tom Lane wrote:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
Any objections if I add:
http://archives.postgresql.org/pgsql-performance/2009-06/msg00215.php
to the (currently empty) list of open items for 8.4?
I am unable to duplicate any slowdown on this test case. AFAICT
8.4 and 8.3 branch tip are about the same speed; if anything 8.4
is faster. Testing on x86_64 Fedora 10 ...
regards, tom lane
What is not clear from Stefan's function listing is how the 8.4
server could issue 33% more XLogInsert() and CopyReadLine()
calls than the 8.3.7 server using the same input file. That would
account for the slowdown but not why it is happening.
Cheers,
Ken
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
I've got to go keep an appointment
Sorry about that. Back now. Anything else I can do to help with
this?
-Kevin
Kenneth Marshall <ktm@rice.edu> wrote:
What is not clear from Stefan's function listing is how the 8.4
server could issue 33% more XLogInsert() and CopyReadLine()
calls than the 8.3.7 server using the same input file.
I thought those were profiling numbers -- the number of times a timer
checked what was executing and found it in that method. Which
suggests that those two methods are probably slower now than in 8.3.7,
at least in some environments.
-Kevin
It doesn't look like it's related to autovacuum. I re-ran the test against
the two solaris boxes with autovacuum turned off and the results look about
the same.
8.3.7 - Solaris 10 11/06 s10x_u3wos_10 X86
real 0m43.662s
user 0m0.001s
sys 0m0.003s
real 0m43.565s
user 0m0.001s
sys 0m0.003s
real 0m43.742s
user 0m0.001s
sys 0m0.003s
8.4rc1 - Solaris 10 11/06 s10x_u3wos_10 X86
real 0m59.304s
user 0m0.001s
sys 0m0.003s
real 0m58.653s
user 0m0.001s
sys 0m0.003s
real 1m0.253s
user 0m0.001s
sys 0m0.003s
8.3.7 - Solaris 10 8/07 s10x_u4wos_12b X86
real 0m38.981s
user 0m0.002s
sys 0m0.004s
real 0m39.879s
user 0m0.002s
sys 0m0.004s
real 0m39.111s
user 0m0.002s
sys 0m0.004s
8.4rc1 - Solaris 10 8/07 s10x_u4wos_12b X86
real 0m50.647s
user 0m0.002s
sys 0m0.004s
real 0m49.453s
user 0m0.002s
sys 0m0.004s
real 0m49.725s
user 0m0.002s
sys 0m0.004s
Alan
On Thu, Jun 18, 2009 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
I am unable to duplicate any slowdown on this test case.
[ Kevin can ]
It'd be useful first off to figure out if it's a CPU or I/O issue.
Is there any visible difference in vmstat output? Also, try turning
off autovacuum in both cases, just to see if that's related.
regards, tom lane
Yes, you are right. I thought that they were absolute function
counts. The data makes more sense now.
Regards,
Ken
On Thu, Jun 18, 2009 at 07:03:34PM -0500, Kevin Grittner wrote:
Kenneth Marshall <ktm@rice.edu> wrote:
What is not clear from Stefan's function listing is how the 8.4
server could issue 33% more XLogInsert() and CopyReadLine()
calls than the 8.3.7 server using the same input file.
I thought those were profiling numbers -- the number of times a timer
checked what was executing and found it in that method. Which
suggests that those two methods are probably slower now than in 8.3.7,
at least in some environments.
-Kevin
Kevin Grittner wrote:
8.3.7
real 0m24.249s
real 0m24.054s
real 0m24.361s
8.4rc1
real 0m33.503s
real 0m34.198s
real 0m33.931s
Ugh. This looks like a poster child case for a benchfarm ...
Is there any chance you guys could triangulate this a bit? Good initial
triangulation points might be the end of each commitfest. (I have a
vested interest in making sure COPY performance doesn't regress, since
it will affect parallel restore's speed in spades.)
cheers
andrew
On 6/19/09, Andrew Dunstan <andrew@dunslane.net> wrote:
Kevin Grittner wrote:
8.3.7
real 0m24.249s
real 0m24.054s
real 0m24.361s
8.4rc1
real 0m33.503s
real 0m34.198s
real 0m33.931s
Ugh. This looks like a poster child case for a benchfarm ...
Is there any chance you guys could triangulate this a bit? Good initial
triangulation points might be the end of each commitfest. (I have a vested
interest in making sure COPY performance doesn't regress, since it will
affect parallel restore's speed in spades.)
git bisect?
--
marko
Andrew Dunstan wrote:
Kevin Grittner wrote:
8.3.7
real 0m24.249s
real 0m24.054s
real 0m24.361s
8.4rc1
real 0m33.503s
real 0m34.198s
real 0m33.931s
Ugh. This looks like a poster child case for a benchfarm ...
indeed...
Is there any chance you guys could triangulate this a bit? Good initial
triangulation points might be the end of each commitfest. (I have a
vested interest in making sure COPY performance doesn't regress, since
it will affect parallel restore's speed in spades.)
Maybe parallel restore is the reason we haven't noticed this earlier.
The case that regressed this way is WAL logged COPY, COPY that can
bypass WAL (which typically happens in parallel restore now) is actually
a bit faster in my testing in 8.4.
I will try and see if I can figure out what caused the regression...
Stefan
Just eyeing the code ... another thing we changed since 8.3 is to enable
posix_fadvise() calls for WAL. Any of the complainants want to try diking
out this bit of code (near line 2580 in src/backend/access/transam/xlog.c)?
#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
if (!XLogArchivingActive() &&
(get_sync_bit(sync_method) & PG_O_DIRECT) == 0)
(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
#endif
regards, tom lane
Tom Lane wrote:
Just eyeing the code ... another thing we changed since 8.3 is to enable
posix_fadvise() calls for WAL. Any of the complainants want to try diking
out this bit of code (near line 2580 in src/backend/access/transam/xlog.c)?
#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
if (!XLogArchivingActive() &&
(get_sync_bit(sync_method) & PG_O_DIRECT) == 0)
(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
#endif
doesn't seem to cause any noticeable difference for me...
Stefan
Tom Lane wrote:
Just eyeing the code ... another thing we changed since 8.3 is to enable
posix_fadvise() calls for WAL. Any of the complainants want to try diking
out this bit of code (near line 2580 in src/backend/access/transam/xlog.c)?
#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
if (!XLogArchivingActive() &&
(get_sync_bit(sync_method) & PG_O_DIRECT) == 0)
(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
#endif
ok after a bit of bisecting I'm happy to announce the winner of the contest:
http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php
this patch causes a 25-30% performance regression for WAL logged copy,
however in the WAL bypass case (maybe that was what got tested?) it
results in a 20% performance increase.
the raw numbers using the upthread posted minimal postgresql.conf are:
post patch/wal logged: 4min10s/4min19/4min12
post patch/wal bypass: 1m55s/1m58s/2m00
prepatch/wal logged: 2m55s/3min00/2m59
prepatch/wal bypass: 2m22s/2m18s/2m20s
Stefan
On Fri, Jun 19, 2009 at 07:49:31PM +0200, Stefan Kaltenbrunner wrote:
Tom Lane wrote:
Just eyeing the code ... another thing we changed since 8.3 is to enable
posix_fadvise() calls for WAL. Any of the complainants want to try diking
out this bit of code (near line 2580 in
src/backend/access/transam/xlog.c)?
#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
if (!XLogArchivingActive() &&
(get_sync_bit(sync_method) & PG_O_DIRECT) == 0)
(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
#endif
ok after a bit of bisecting I'm happy to announce the winner of the
contest:
http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php
this patch causes a 25-30% performance regression for WAL logged copy,
however in the WAL bypass case (maybe that was what got tested?) it results
in a 20% performance increase.
the raw numbers using the upthread posted minimal postgresql.conf are:
post patch/wal logged: 4min10s/4min19/4min12
post patch/wal bypass: 1m55s/1m58s/2m00
prepatch/wal logged: 2m55s/3min00/2m59
prepatch/wal bypass: 2m22s/2m18s/2m20s
Stefan
Great! Maybe just increasing the size of the BULKWRITE ring,
possibly as a function of the shared_memory is all that is
needed. 256kB is the currently coded ring_size in storage/buffer/freelist.c
Ken
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
ok after a bit of bisecting I'm happy to announce the winner of the contest:
http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php
this patch causes a 25-30% performance regression for WAL logged copy,
however in the WAL bypass case (maybe that was what got tested?) it
results in a 20% performance increase.
Hmm. What that patch actually changes is that it prevents a bulk insert
(ie COPY in) from trashing the entire shared-buffers arena. I think the
reason for the WAL correlation is that once it's filled the ring buffer,
creating new pages requires writing out old ones, and the
WAL-before-data rule means that the copy process has to block waiting
for WAL to go down to disk before it can write. When it's allowed to
use the whole arena there is more chance for some of that writing to be
done by the walwriter or bgwriter. But the details are going to depend
on the platform's CPU vs I/O balance, which no doubt explains why some
of us don't see it.
I don't think we want to revert that patch --- not trashing the whole
buffer arena seems like a Good Thing from a system-wide point of view,
even if it makes individual COPY operations go slower. However, we
could maybe play around with the tradeoffs a bit. In particular it
seems like it would be useful to experiment with different ring buffer
sizes. Could you try increasing the ring size allowed in
src/backend/storage/buffer/freelist.c for the BULKWRITE case
***************
*** 384,389 ****
--- 384,392 ----
case BAS_BULKREAD:
ring_size = 256 * 1024 / BLCKSZ;
break;
+ case BAS_BULKWRITE:
+ ring_size = 256 * 1024 / BLCKSZ;
+ break;
case BAS_VACUUM:
ring_size = 256 * 1024 / BLCKSZ;
break;
and see if maybe we can buy back most of the loss with not too much
of a ring size increase?
regards, tom lane
Tom Lane wrote:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
ok after a bit of bisecting I'm happy to announce the winner of the contest:
http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php
this patch causes a 25-30% performance regression for WAL logged copy,
however in the WAL bypass case (maybe that was what got tested?) it
results in a 20% performance increase.
Hmm. What that patch actually changes is that it prevents a bulk insert
(ie COPY in) from trashing the entire shared-buffers arena. I think the
reason for the WAL correlation is that once it's filled the ring buffer,
creating new pages requires writing out old ones, and the
WAL-before-data rule means that the copy process has to block waiting
for WAL to go down to disk before it can write. When it's allowed to
use the whole arena there is more chance for some of that writing to be
done by the walwriter or bgwriter. But the details are going to depend
on the platform's CPU vs I/O balance, which no doubt explains why some
of us don't see it.
hmm - In my case both the CPU (an Intel E5530 Nehalem) and the IO
subsystem (8GB Fibre Channel connected NetApp with 4GB cache) are pretty
fast. And even with, say, fsync=off, 8.4RC1 is only slightly faster than
8.3 with the same config and fsync=on, so maybe there is a secondary
effect at play too.
I don't think we want to revert that patch --- not trashing the whole
buffer arena seems like a Good Thing from a system-wide point of view,
even if it makes individual COPY operations go slower. However, we
could maybe play around with the tradeoffs a bit. In particular it
seems like it would be useful to experiment with different ring buffer
sizes. Could you try increasing the ring size allowed in
src/backend/storage/buffer/freelist.c for the BULKWRITE case
***************
*** 384,389 ****
--- 384,392 ----
case BAS_BULKREAD:
ring_size = 256 * 1024 / BLCKSZ;
break;
+ case BAS_BULKWRITE:
+ ring_size = 256 * 1024 / BLCKSZ;
+ break;
case BAS_VACUUM:
ring_size = 256 * 1024 / BLCKSZ;
break;
and see if maybe we can buy back most of the loss with not too much
of a ring size increase?
already started testing that once I found the offending commit.
256 * 1024 / BLCKSZ
4min10s/4min19/4min12
512 * 1024 / BLCKSZ
3min27s/3min32s
1024 * 1024 / BLCKSZ
3min14s/3min12s
2048 * 1024 / BLCKSZ
3min02/3min02
4096 * 1024 / BLCKSZ
2m59/2m58s
8192 * 1024 / BLCKSZ
2m59/2m59s
so 4096 * 1024 / BLCKSZ seems to be the sweet spot and also results in
more or less the same performance that 8.3 had.
Stefan