testing HS/SR - 1 vs 2 performance
Using 9.0devel cvs HEAD, 2010.04.08.
I am trying to understand the performance difference
between primary and standby under a standard pgbench
read-only test.
server has 32 GB RAM, 2 quad-core CPUs.
primary:
tps = 34606.747930 (including connections establishing)
tps = 34527.078068 (including connections establishing)
tps = 34654.297319 (including connections establishing)
standby:
tps = 700.346283 (including connections establishing)
tps = 717.576886 (including connections establishing)
tps = 740.522472 (including connections establishing)
transaction type: SELECT only
scaling factor: 1000
query mode: simple
number of clients: 20
number of threads: 1
duration: 900 s
both instances have
max_connections = 100
shared_buffers = 256MB
checkpoint_segments = 50
effective_cache_size = 16GB
See also:
http://archives.postgresql.org/pgsql-testers/2010-04/msg00005.php
(differences with scale 10_000)
I understand that in the scale=1000 case, there is a huge
cache effect, but why doesn't that apply to the pgbench runs
against the standby? (and for the scale=10_000 case the
differences are still rather large)
Maybe these differences are as expected. I don't find
any explanation in the documentation.
thanks,
Erik Rijkers
On Sat, Apr 10, 2010 at 8:23 AM, Erik Rijkers <er@xs4all.nl> wrote:
I understand that in the scale=1000 case, there is a huge
cache effect, but why doesn't that apply to the pgbench runs
against the standby? (and for the scale=10_000 case the
differences are still rather large)
I guess this performance degradation happened because the large number of
buffer replacements caused UpdateMinRecoveryPoint() to be called often. So I
think increasing shared_buffers would improve the performance significantly.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Apr 12, 2010 at 5:06 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Sat, Apr 10, 2010 at 8:23 AM, Erik Rijkers <er@xs4all.nl> wrote:
I understand that in the scale=1000 case, there is a huge
cache effect, but why doesn't that apply to the pgbench runs
against the standby? (and for the scale=10_000 case the
differences are still rather large)

I guess this performance degradation happened because the large number of
buffer replacements caused UpdateMinRecoveryPoint() to be called often. So I
think increasing shared_buffers would improve the performance significantly.
I think we need to investigate this more. It's not going to look good
for the project if people find that a hot standby server runs two
orders of magnitude slower than the primary.
...Robert
On Sat, April 10, 2010 01:23, Erik Rijkers wrote:
Using 9.0devel cvs HEAD, 2010.04.08.
I am trying to understand the performance difference
between primary and standby under a standard pgbench
read-only test.
server has 32 GB RAM, 2 quad-core CPUs.
primary:
tps = 34606.747930 (including connections establishing)
tps = 34527.078068 (including connections establishing)
tps = 34654.297319 (including connections establishing)
standby:
tps = 700.346283 (including connections establishing)
tps = 717.576886 (including connections establishing)
tps = 740.522472 (including connections establishing)
transaction type: SELECT only
scaling factor: 1000
query mode: simple
number of clients: 20
number of threads: 1
duration: 900 s
both instances have
max_connections = 100
shared_buffers = 256MB
checkpoint_segments = 50
effective_cache_size = 16GB
See also:
http://archives.postgresql.org/pgsql-testers/2010-04/msg00005.php
(differences with scale 10_000)
To my surprise, I have since seen the opposite behaviour, with the standby giving fast runs and
the primary slow.
FWIW, I've run a larger set of tests overnight (against the same 9.0devel
instances as in the earlier email).
These results are generally more balanced.
for scale in 10 100 500 1000
    for clients in 1 5 10 20
        for port in 6565 6566    # 6565 = primary, 6566 = standby
            for run in `seq 1 3`
                pgbench ...
                sleep $(( (scale / 10) * 60 ))
            done
        done
    done
done
(so below, each combination shows 3 primary runs followed by 3 standby runs)
scale: 10 clients: 1 tps = 15219.019272 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 15301.847615 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 15238.907436 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 12129.928289 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 12151.711589 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 12203.494512 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 5 tps = 60248.120599 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 5 tps = 60827.949875 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 5 tps = 61167.447476 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 5 tps = 50750.385403 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 5 tps = 50600.891436 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 5 tps = 50486.857610 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 10 tps = 60307.739327 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 60264.230349 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 60146.370598 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 50455.537671 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 49877.000813 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 50097.949766 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 20 tps = 43355.220657 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 43352.725422 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 43496.085623 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 37169.126299 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 37100.260450 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 37342.758507 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 1 tps = 12514.185089 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 12542.842198 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 12595.688640 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 10435.681851 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 10456.983353 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 10434.213044 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 5 tps = 48682.166988 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 5 tps = 48656.883485 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 5 tps = 48687.894655 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 5 tps = 41901.629933 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 5 tps = 41953.386791 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 5 tps = 41787.962712 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 10 tps = 48704.247239 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 48941.190050 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 48603.077936 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 42948.666272 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 42767.793899 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 42612.670983 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 20 tps = 36350.454258 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 36373.088111 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 36490.886781 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 32235.811228 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 32253.837906 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 32144.189047 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 1 tps = 11733.254970 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 11726.665739 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 11617.622548 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 9769.861175 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 9878.465752 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 9808.236216 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 5 tps = 45185.900553 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 5 tps = 45170.334037 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 5 tps = 45136.596374 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 5 tps = 39231.863815 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 5 tps = 39336.889619 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 5 tps = 39269.483772 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 10 tps = 45468.080680 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 45727.159963 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 45399.241367 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 40759.108042 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 40783.287718 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 40858.007847 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 20 tps = 34729.742313 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 34705.119029 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 34617.517224 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 31252.355034 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 31234.885791 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 31273.307637 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 1 tps = 220.024691 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 294.855794 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 375.152757 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 295.965959 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 1036.517110 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 9167.012603 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 5 tps = 1241.224282 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 5 tps = 1894.806301 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 5 tps = 18532.885549 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 5 tps = 1497.491279 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 5 tps = 1480.164166 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 5 tps = 3470.769236 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 10 tps = 2414.552333 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 19248.609443 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 45059.231609 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 1648.526373 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 3659.800008 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 35900.769857 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 20 tps = 2462.855864 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 27168.407568 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 34438.802096 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 2933.220489 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 25586.972428 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 30926.189621 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
On Mon, Apr 12, 2010 at 7:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Apr 12, 2010 at 5:06 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Sat, Apr 10, 2010 at 8:23 AM, Erik Rijkers <er@xs4all.nl> wrote:
I understand that in the scale=1000 case, there is a huge
cache effect, but why doesn't that apply to the pgbench runs
against the standby? (and for the scale=10_000 case the
differences are still rather large)

I guess this performance degradation happened because the large number of
buffer replacements caused UpdateMinRecoveryPoint() to be called often. So I
think increasing shared_buffers would improve the performance significantly.

I think we need to investigate this more. It's not going to look good
for the project if people find that a hot standby server runs two
orders of magnitude slower than the primary.
As a data point, I did a read-only pgbench test and found that the
standby runs about 15% slower than the primary with identical hardware
and configs.
--
Jim Mlodgenski
EnterpriseDB (http://www.enterprisedb.com)
On Mon, April 12, 2010 14:22, Erik Rijkers wrote:
On Sat, April 10, 2010 01:23, Erik Rijkers wrote:
Oops, typos in that pseudo loop:
of course there was a pgbench init step after that first line.
for scale in 10 100 500 1000
    pgbench ...    # initialise
    sleep $(( (scale / 10) * 60 ))
    for clients in 1 5 10 20
        for port in 6565 6566    # 6565 = primary, 6566 = standby
            for run in `seq 1 3`
                pgbench ...
                sleep 120
            done
        done
    done
done
On Mon, Apr 12, 2010 at 8:32 AM, Jim Mlodgenski <jimmy76@gmail.com> wrote:
I think we need to investigate this more. It's not going to look good
for the project if people find that a hot standby server runs two
orders of magnitude slower than the primary.

As a data point, I did a read-only pgbench test and found that the
standby runs about 15% slower than the primary with identical hardware
and configs.
Hmm. That's not great, but it's a lot better than 50x. I wonder what
was different in Erik's environment. Does running in standby mode use
more memory, such that it might have pushed the machine over the line
into swap?
Or if it's CPU load, maybe Erik could gprof it?
...Robert
* Robert Haas <robertmhaas@gmail.com> [100412 07:10]:
I think we need to investigate this more. It's not going to look good
for the project if people find that a hot standby server runs two
orders of magnitude slower than the primary.
Yes, it's not "good", but it's a known problem. We've had people
complaining that WAL replay can't keep up with the WAL stream from a
heavily loaded server.
The master producing the WAL stream has $XXX separate read/modify
processes working over the data dir, and is bottlenecked by the
serialized WAL stream. All the seek+read delays are parallelized and
overlapping.
But the slave (traditionally a PITR slave, now also HS/SR) has all
that read-modify-write happening in single-threaded fashion, meaning
that WAL record $X+1 waits until the buffer that record $X needs to
modify is read in. All the seek+read delays are serialized.
You can optimize that by keeping more of the pages in buffers (shared,
or OS cache), but the WAL producer, being by its very nature a
multi-task IO load producing random reads/writes, is always going to go
quicker than the single-stream random-IO WAL consumer...
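As a rough illustration of that arithmetic, here is a toy model with
assumed numbers (8 ms per random read, 20 concurrent backends), not
measurements:

#include <stdio.h>

/*
 * Toy model of the point above: the master overlaps random-read latency
 * across many backends, while WAL replay pays it one record at a time.
 * The 8 ms latency and 20 backends are assumptions for illustration.
 */
int
main(void)
{
    const double seek_ms  = 8.0;    /* assumed random-read latency */
    const int    backends = 20;     /* concurrent writers on the master */

    double master_reads_per_sec = backends * (1000.0 / seek_ms);
    double replay_reads_per_sec = 1000.0 / seek_ms;

    printf("master (overlapped):  %.0f random reads/s\n",
           master_reads_per_sec);
    printf("standby (serialized): %.0f random reads/s\n",
           replay_reads_per_sec);
    return 0;
}

The gap scales with whatever read concurrency the master sustains.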
a.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
And I see now that he's doing a stream of read-only queries on a slave,
presumably with no WAL even being replayed...
Sorry for the noise....
a.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
Aidan Van Dyk <aidan@highrise.ca> wrote:
We've had people complaining that WAL replay can't keep up with the
WAL stream from a heavily loaded server.
I thought this thread was about the slow performance running a mix
of read-only queries on the slave versus the master, which doesn't
seem to have anything to do with the old issue you're describing.
-Kevin
I could reproduce this on my laptop; the standby is about 20% slower. I ran
oprofile, and what stands out as the difference between the master and the
standby is that on the standby about 20% of the CPU time is spent in
hash_seq_search(). The call path is GetSnapshotData() ->
KnownAssignedXidsGetAndSetXmin() -> hash_seq_search(). That explains the
difference in performance.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
I could reproduce this on my laptop; the standby is about 20% slower. I ran
oprofile, and what stands out as the difference between the master and the
standby is that on the standby about 20% of the CPU time is spent in
hash_seq_search(). The call path is GetSnapshotData() ->
KnownAssignedXidsGetAndSetXmin() -> hash_seq_search(). That explains the
difference in performance.
The slowdown is proportional to the max_connections setting in the
standby. A 20% slowdown might still be acceptable, but if you increase
max_connections to, say, 1000, things get really slow. I wouldn't
recommend max_connections=1000, of course, but I think we need to do
something about this. We should change the KnownAssignedXids data
structure from a hash table into something that's quicker to scan.
Preferably something that scans in O(N), where N is the number of
entries actually present, not the maximum number of entries it can
hold, as it is with the hash table currently.
A quick fix would be to check whether there are any entries in the hash
table before scanning it. That would eliminate the overhead when there
are no in-progress transactions on the master. But as soon as there's
even one, the overhead comes back.
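A minimal standalone sketch of both ideas together (hypothetical names,
not an actual patch): keep the xids in a dense array with a count, so a
snapshot scans only the entries actually present, and the empty case
costs a single comparison:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define MAX_KNOWN_XIDS 4096     /* would be sized from max_connections */

static TransactionId knownXids[MAX_KNOWN_XIDS];
static int numKnownXids = 0;    /* entries actually present */

/* Recovery side: xids arrive in ascending order, so appending suffices. */
static void
known_xids_add(TransactionId xid)
{
    if (numKnownXids < MAX_KNOWN_XIDS)
        knownXids[numKnownXids++] = xid;
}

/*
 * Snapshot side: copy out the known-assigned xids.  Cost is O(N) in the
 * entries present; the empty case (idle master) is one comparison
 * instead of a hash_seq_search() over the whole table.
 */
static int
known_xids_get(TransactionId *dst, int maxn)
{
    int n = numKnownXids;

    if (n == 0)
        return 0;               /* the quick-fix fastpath */
    if (n > maxn)
        n = maxn;
    memcpy(dst, knownXids, n * sizeof(TransactionId));
    return n;
}

int
main(void)
{
    TransactionId snap[MAX_KNOWN_XIDS];

    known_xids_add(100);
    known_xids_add(101);
    printf("%d xids in snapshot\n", known_xids_get(snap, MAX_KNOWN_XIDS));
    return 0;
}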
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
We should change the KnownAssignedXids data structure from a hash
table into something that's quicker to scan. Preferably something that
scans in O(N), where N is the number of entries actually present, not
the maximum number of entries it can hold, as it is with the hash table
currently.
So it's pretty good news that red-black trees made it into 9.0, isn't it? :)
A quick fix would be to check whether there are any entries in the hash
table before scanning it. That would eliminate the overhead when there
are no in-progress transactions on the master. But as soon as there's
even one, the overhead comes back.
That doesn't sound like the typical case, does it?
--
dim
On Tue, 2010-04-13 at 21:09 +0300, Heikki Linnakangas wrote:
Heikki Linnakangas wrote:
I could reproduce this on my laptop; the standby is about 20% slower. I ran
oprofile, and what stands out as the difference between the master and the
standby is that on the standby about 20% of the CPU time is spent in
hash_seq_search(). The call path is GetSnapshotData() ->
KnownAssignedXidsGetAndSetXmin() -> hash_seq_search(). That explains the
difference in performance.

The slowdown is proportional to the max_connections setting in the
standby. A 20% slowdown might still be acceptable, but if you increase
max_connections to, say, 1000, things get really slow. I wouldn't
recommend max_connections=1000, of course, but I think we need to do
something about this. We should change the KnownAssignedXids data
structure from a hash table into something that's quicker to scan.
Preferably something that scans in O(N), where N is the number of
entries actually present, not the maximum number of entries it can
hold, as it is with the hash table currently.
There's a tradeoff here to consider. KnownAssignedXids faces two
workloads: one for each WAL record, where we check whether the xid is
already known assigned, and one for snapshots. The current
implementation is optimised towards recovery performance, not snapshot
performance.
A quick fix would be to check whether there are any entries in the hash
table before scanning it. That would eliminate the overhead when there
are no in-progress transactions on the master. But as soon as there's
even one, the overhead comes back.
Any fix should be fairly quick because of the way it's modularised, with
something like this in mind.

I'll try a circular buffer implementation, with a fastpath. I should
have something in a few hours.
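For discussion, a minimal standalone sketch of that shape (hypothetical
names; a sketch only, not the patch posted later in the thread): a ring
of xids where both the emptiness fastpath and the scan cost depend only
on the live entries:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

#define RING_SIZE 4096          /* power of two: cheap index wraparound */

typedef struct
{
    TransactionId xids[RING_SIZE];
    uint32_t    head;           /* next slot to fill */
    uint32_t    tail;           /* oldest live entry */
} XidRing;

/* The fastpath: an empty ring costs one comparison per snapshot. */
static bool
ring_is_empty(const XidRing *r)
{
    return r->head == r->tail;
}

/* Recovery side: xids are added in ascending order at the head. */
static bool
ring_add(XidRing *r, TransactionId xid)
{
    if (r->head - r->tail >= RING_SIZE)
        return false;           /* full; real code would compress */
    r->xids[r->head % RING_SIZE] = xid;
    r->head++;
    return true;
}

/* Snapshot side: O(N) in live entries, regardless of RING_SIZE. */
static int
ring_collect(const XidRing *r, TransactionId *dst, int maxn)
{
    int n = 0;
    uint32_t i;

    for (i = r->tail; i != r->head && n < maxn; i++)
        dst[n++] = r->xids[i % RING_SIZE];
    return n;
}

int
main(void)
{
    XidRing r = {{0}, 0, 0};
    TransactionId out[RING_SIZE];

    ring_add(&r, 100);
    ring_add(&r, 101);
    printf("empty=%d live=%d\n", ring_is_empty(&r),
           ring_collect(&r, out, RING_SIZE));
    return 0;
}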
--
Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote:
On Tue, 2010-04-13 at 21:09 +0300, Heikki Linnakangas wrote:
A quick fix would be to check whether there are any entries in the hash
table before scanning it. That would eliminate the overhead when there
are no in-progress transactions on the master. But as soon as there's
even one, the overhead comes back.

Any fix should be fairly quick because of the way it's modularised, with
something like this in mind. I'll try a circular buffer implementation,
with a fastpath.
I started experimenting with a sorted-array-based implementation on
Tuesday but got carried away with other stuff. I've now got back to it
and cleaned it up.

How does the attached patch look? It's probably similar to what you had
in mind.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachment: knownassignedxids-array-2.patch (text/x-diff, +162/-110)
On Fri, 2010-04-16 at 11:29 +0300, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Tue, 2010-04-13 at 21:09 +0300, Heikki Linnakangas wrote:
A quick fix would be to check whether there are any entries in the hash
table before scanning it. That would eliminate the overhead when there
are no in-progress transactions on the master. But as soon as there's
even one, the overhead comes back.

Any fix should be fairly quick because of the way it's modularised, with
something like this in mind. I'll try a circular buffer implementation,
with a fastpath.

I started experimenting with a sorted-array-based implementation on
Tuesday but got carried away with other stuff. I've now got back to it
and cleaned it up. How does the attached patch look? It's probably
similar to what you had in mind.
It looks like a second version of what I'm working on and about to
publish. I'll take that as a compliment!

My patch is attached here also, for discussion.

The two patches look the same in their main parts, though I have quite a
few extra tweaks in there, which you can read about in the comments. One
tweak I don't have is the use of the presence array that allows a
sensible bsearch, so I'll alter my patch to use that idea but keep the
rest of my code.
--
Simon Riggs www.2ndQuadrant.com
Attachment: circular_knownassigned.patch (text/x-patch, +826/-43)
Simon Riggs wrote:
On Fri, 2010-04-16 at 11:29 +0300, Heikki Linnakangas wrote:
How does the attached patch look? It's probably similar to what you had
in mind.

It looks like a second version of what I'm working on and about to
publish. I'll take that as a compliment!

My patch is attached here also, for discussion.

The two patches look the same in their main parts, though I have quite a
few extra tweaks in there, which you can read about in the comments.
Yeah. Yours seems a lot more complex with all those extra tweaks; I
would suggest keeping it simple. I did realize there's one bug in my
patch: I didn't handle xid wraparound correctly in the binary search; I
need to use TransactionIdFollows instead of plain >.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, 2010-04-16 at 14:47 +0300, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Fri, 2010-04-16 at 11:29 +0300, Heikki Linnakangas wrote:
How does the attached patch look? It's probably similar to what you had
in mind.

It looks like a second version of what I'm working on and about to
publish. I'll take that as a compliment!

My patch is attached here also, for discussion.

The two patches look the same in their main parts, though I have quite a
few extra tweaks in there, which you can read about in the comments.

Yeah. Yours seems a lot more complex with all those extra tweaks; I
would suggest keeping it simple. I did realize there's one bug in my
patch: I didn't handle xid wraparound correctly in the binary search; I
need to use TransactionIdFollows instead of plain >.
Almost done; yes, it's much simpler. I wrote a lot of that in the wee
small hours last night, so the difference is amusing.

And I spotted that bug, plus the off-by-one error. I've just rewritten
all the other parts, so no worries.
--
Simon Riggs www.2ndQuadrant.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
I didn't handle xid wraparound correctly in the binary search; I need
to use TransactionIdFollows instead of plain >.
I think you're outsmarting yourself there. A binary search will in fact
*not work* with circular xid comparison (this is exactly why there's no
btree opclass for XID). You need to use plain >, and make sure the
array you're searching is ordered that way too. The other way might
accidentally fail to malfunction if you only tested ranges of XIDs that
weren't long enough to wrap around, but that doesn't make it right.
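A tiny standalone program makes the point concrete (the comparison
mimics the style of TransactionIdFollows(); illustrative values, not
PostgreSQL source): circular xid comparison is not transitive, so there
is no total order for a binary search to rely on:

#include <stdio.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Circular comparison in the style of TransactionIdFollows(). */
static int
xid_follows(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) > 0;
}

int
main(void)
{
    /* Three xids spread around the 32-bit circle. */
    TransactionId x = 100;
    TransactionId y = 1500000000u;
    TransactionId z = 3000000000u;

    /*
     * Each one "follows" the previous, and the first also "follows" the
     * last: y > x, z > y, and yet x > z.  With no transitivity there is
     * no consistent sort order, so a binary search using this comparison
     * can silently return wrong answers once the xid range wraps.
     */
    printf("y follows x: %d\n", xid_follows(y, x));   /* prints 1 */
    printf("z follows y: %d\n", xid_follows(z, y));   /* prints 1 */
    printf("x follows z: %d\n", xid_follows(x, z));   /* prints 1 */
    return 0;
}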
regards, tom lane