pgbench vs. wait events
Hi,
I decided to do some testing on hydra (IBM-provided community
resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
the newly-enhanced wait event stuff to try to get an idea of what
we're waiting for during pgbench. I did 30-minute pgbench runs with
various configurations, but all had max_connections = 200,
shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
on. During each run, I ran this psql script in another window and
captured the output:
\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5
Then, I used a little shell-scripting to count up the number of times
each wait event occurred in the output. First, I tried scale factor
3000 with 32 clients and got these results:
1 LWLockTranche | buffer_mapping
9 LWLockNamed | CLogControlLock
14 LWLockNamed | ProcArrayLock
16 Lock | tuple
25 LWLockNamed | CheckpointerCommLock
49 LWLockNamed | WALBufMappingLock
122 LWLockTranche | clog
182 Lock | transactionid
287 LWLockNamed | XidGenLock
1300 Client | ClientRead
1375 LWLockTranche | buffer_content
3990 Lock | extend
21014 LWLockNamed | WALWriteLock
28497 |
58279 LWLockTranche | wal_insert
tps = 1150.803133 (including connections establishing)
What jumps out here, at least to me, is that there is furious
contention on both the wal_insert locks and on WALWriteLock.
Apparently, the system simply can't get WAL on disk fast enough to
keep up with this workload. Relation extension locks and
buffer_content locks are also pretty common, both ahead of
ClientRead, a relatively uncommon wait event on this test. The load
average on the system was only about 3 during this test, indicating
that most processes are in fact spending most of their time off-CPU.
The first thing I tried was switching to unlogged tables, which
produces these results:
1 BufferPin | BufferPin
1 LWLockTranche | lock_manager
2 LWLockTranche | buffer_mapping
8 LWLockNamed | ProcArrayLock
9 LWLockNamed | CheckpointerCommLock
9 LWLockNamed | CLogControlLock
11 LWLockTranche | buffer_content
37 LWLockTranche | clog
153 Lock | tuple
388 LWLockNamed | XidGenLock
827 Lock | transactionid
1267 Client | ClientRead
20631 Lock | extend
91767 |
tps = 1223.239416 (including connections establishing)
If you don't look at the TPS number, these results look like a vast
improvement. The overall amount of time spent not waiting for
anything is now much higher, and the problematic locks have largely
disappeared from the picture. However, the load average now shoots up
to about 30, because most of the time that the backends are "not
waiting for anything" they are in fact in kernel wait state D; that
is, they're stuck doing I/O. This suggests that we might want to
consider advertising a wait state when a backend is doing I/O, so we
can measure this sort of thing.
Next, I tried lowering the scale factor to something that fits in
shared buffers. Here are the results at scale factor 300:
14 Lock | tuple
22 LWLockTranche | lock_manager
39 LWLockNamed | WALBufMappingLock
331 LWLockNamed | CLogControlLock
461 LWLockNamed | ProcArrayLock
536 Lock | transactionid
552 Lock | extend
716 LWLockTranche | buffer_content
763 LWLockNamed | XidGenLock
2113 LWLockNamed | WALWriteLock
6190 LWLockTranche | wal_insert
25002 Client | ClientRead
78466 |
tps = 27651.562835 (including connections establishing)
Obviously, there's a vast increase in TPS, and the backends seem to
spend most of their time actually doing work. ClientRead is now the
overwhelmingly dominant wait event, although wal_insert and
WALWriteLock contention is clearly still a significant problem.
Contention on other locks is apparently quite rare. Notice that
client reads are really significant here - more than 20% of the time
we sample what a backend is doing, it's waiting for the next query.
It seems unlikely that performance on this workload can be improved
very much by optimizing anything other than WAL writing, because no
other wait event appears in more than 1% of the samples. It's not
clear how much of the WAL-related stuff is artificial lock contention
and how much of it is the limited speed at which the disk can rotate.
However, I was curious about what's going on with CLogControlLock,
pursuant to the other thread on that topic, so I reran this with
unlogged tables. At scale factor 300, 32 clients, unlogged tables, I
get these results:
6 LWLockTranche | wal_insert
12 Lock | tuple
26 LWLockTranche | lock_manager
39 LWLockTranche | buffer_content
353 LWLockNamed | CLogControlLock
377 Lock | extend
491 Lock | transactionid
500 LWLockNamed | ProcArrayLock
696 LWLockNamed | XidGenLock
27685 Client | ClientRead
84930 |
WAL contention is eliminated, and CLog contention doesn't increase.
Next, I tried ramping up the client count to see what effect that had.
These next three results are all at scale factor 300 with unlogged
tables.
64 clients:
1 BufferPin | BufferPin
2 LWLockTranche | buffer_mapping
51 LWLockTranche | wal_insert
52 Lock | relation
132 LWLockTranche | lock_manager
154 LWLockTranche | buffer_content
209 Lock | tuple
940 Lock | extend
1412 LWLockNamed | XidGenLock
2337 LWLockNamed | CLogControlLock
2813 LWLockNamed | ProcArrayLock
3859 Lock | transactionid
83760 Client | ClientRead
133840 |
tps = 36672.484524 (including connections establishing)
96 clients:
1 LWLockTranche | proc
53 LWLockTranche | wal_insert
305 LWLockTranche | buffer_content
391 LWLockTranche | lock_manager
797 Lock | tuple
1074 Lock | extend
2492 LWLockNamed | XidGenLock
4185 LWLockNamed | ProcArrayLock
6003 LWLockNamed | CLogControlLock
11348 Lock | transactionid
142881 Client | ClientRead
174011 |
tps = 37214.257661 (including connections establishing)
192 clients:
1 LWLockTranche | buffer_mapping
2 BufferPin | BufferPin
4 LWLockTranche | clog
8 LWLockTranche | proc
87 LWLockTranche | wal_insert
708 LWLockTranche | buffer_content
1892 LWLockTranche | lock_manager
2225 Lock | extend
4796 LWLockNamed | ProcArrayLock
5105 LWLockNamed | XidGenLock
7982 Lock | tuple
11882 LWLockNamed | CLogControlLock
64091 Lock | transactionid
275481 |
308627 Client | ClientRead
tps = 35018.906863 (including connections establishing)
There are a couple of interesting things here. First, CLogControlLock
contention becomes more significant as the number of clients goes
up. At 32 clients, it is the 6th-most frequent wait event; at 192
clients, it's moved up to 4th, and it's clearly growing at a faster
rate than some of the others. As we go from 32 to 64 to 96 to 192
clients, ProcArrayLock is captured in 500, 2813, 4185, 4796 samples,
but CLogControlLock is captured in 353, 2337, 6003, 11882 samples, so
it goes from being only 2/3 of ProcArrayLock to being more than twice
ProcArrayLock. However, in absolute terms, it's still not very
significant. Even at 192 clients, where it's the fourth-most-common
wait event, it appears in only 1.7% of samples, which means backends
are only waiting for it about 1.7% of the time. It appears that, on
this system, this workload is NEVER going to get meaningfully faster
by improving CLogControlLock.
Second, ClientRead becomes a bigger and bigger issue as the number of
clients increases; by 192 clients, it appears in 45% of the samples.
That basically means that pgbench is increasingly unable to keep up
with the server; for whatever reason, it suffers more than the server
does from the increasing lack of CPU resources. Third,
Lock/transactionid and Lock/tuple become more and more common as the
number of clients increases; these happen when two different pgbench
threads decide to hit the same row at the same time. Due to the
birthday paradox this increases pretty quickly as the number of
clients ramps up, but it's not really a server issue: there's nothing
the server can do about the possibility that two or more clients pick
the same random number at the same time.
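As a rough back-of-the-envelope check (assuming the collisions are mostly
on the scale-factor-sized pgbench_branches table, so b = 300 candidate rows
here), the expected number of client pairs aiming at the same branch row at
any instant grows quadratically with the number of clients n:

E[\text{colliding pairs}] \approx \binom{n}{2} \cdot \frac{1}{b} = \frac{n(n-1)}{2b}

That works out to roughly 1.7 pairs at 32 clients but about 61 pairs at 192
clients, so a 6x increase in clients gives a ~37x increase in expected
collisions, which lines up with the sharply super-linear growth in the
transactionid and tuple wait counts above.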
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Oct 6, 2016 at 11:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Hi,
I decided to do some testing on hydra (IBM-provided community
resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
the newly-enhanced wait event stuff to try to get an idea of what
we're waiting for during pgbench. I did 30-minute pgbench runs with
various configurations, but all had max_connections = 200,
shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
on. During each run, I ran this psql script in another window and
captured the output:
\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5
Then, I used a little shell-scripting to count up the number of times
each wait event occurred in the output. First, I tried scale factor
3000 with 32 clients and got these results:
Scale factor 3000 obviously doesn't fit in shared_buffers. But does it fit
in RAM? That is, are the backends doing real IO, or are they just doing fake
IO to the kernel's fs cache?
Cheers,
Jeff
On Thu, Oct 6, 2016 at 4:40 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Scale factor 3000 obviously doesn't fit in shared_buffers. But does it fit
in RAM? That is, are the backends doing real IO, or are they just doing fake IO
to the kernel's fs cache?
That's a good question.
[rhaas@hydra ~]$ free -g
                     total       used       free     shared    buffers     cached
Mem:                    61         26         34          0          0         24
-/+ buffers/cache:       2         58
Swap:                   19          4         15
rhaas=# select pg_size_pretty(pg_relation_size('pgbench_accounts'));
pg_size_pretty
----------------
38 GB
(1 row)
rhaas=# select pg_size_pretty(pg_database_size(current_database()));
pg_size_pretty
----------------
44 GB
(1 row)
That's pretty tight, especially since I now notice Andres also left a
postmaster running on this machine back in April, with
shared_buffers=8GB. 44GB for this database plus 8GB for
shared_buffers plus 8GB for the other postmaster's shared_buffers
leaves basically no slack, so it was probably doing quite a bit of
real I/O, especially after the database got a bit of bloat.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Oct 7, 2016 at 3:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I decided to do some testing on hydra (IBM-provided community
resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
the newly-enhanced wait event stuff to try to get an idea of what
we're waiting for during pgbench. I did 30-minute pgbench runs with
various configurations, but all had max_connections = 200,
shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
on. During each run, I ran this psql script in another window and
captured the output:
Nice. Thanks for sharing.
\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5
Then, I used a little shell-scripting to count up the number of times
each wait event occurred in the output.
Or an INSERT SELECT on an unlogged table? No need of extra maths then.
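A minimal sketch of that idea (the scratch table name is chosen purely for
illustration) could look like this:

-- collect samples into an unlogged scratch table instead of capturing text
CREATE UNLOGGED TABLE wait_samples (
    wait_event_type text,
    wait_event      text
);

-- run this under \watch 0.5 in place of the plain SELECT
INSERT INTO wait_samples
SELECT wait_event_type, wait_event
FROM pg_stat_activity
WHERE pid != pg_backend_pid();

-- afterwards the per-event counts fall straight out of GROUP BY
SELECT wait_event_type, wait_event, count(*)
FROM wait_samples
GROUP BY 1, 2
ORDER BY count(*);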
First, I tried scale factor
3000 with 32 clients and got these results:
1 LWLockTranche | buffer_mapping
[...]
21014 LWLockNamed | WALWriteLock
28497 |
58279 LWLockTranche | wal_insert
tps = 1150.803133 (including connections establishing)
What jumps out here, at least to me, is that there is furious
contention on both the wal_insert locks and on WALWriteLock.
Apparently, the system simply can't get WAL on disk fast enough to
keep up with this workload. Relation extension locks and
buffer_content locks are also pretty common, both ahead of
ClientRead, a relatively uncommon wait event on this test. The load
average on the system was only about 3 during this test, indicating
that most processes are in fact spending most of their time off-CPU.
Would increasing the number of WAL insert slots help? With your tests
this is at 8 all the time.
The first thing I tried was switching to unlogged tables, which
produces these results:
1 BufferPin | BufferPin
1 LWLockTranche | lock_manager
2 LWLockTranche | buffer_mapping
8 LWLockNamed | ProcArrayLock
9 LWLockNamed | CheckpointerCommLock
9 LWLockNamed | CLogControlLock
11 LWLockTranche | buffer_content
37 LWLockTranche | clog
153 Lock | tuple
388 LWLockNamed | XidGenLock
827 Lock | transactionid
1267 Client | ClientRead
20631 Lock | extend
91767 |
tps = 1223.239416 (including connections establishing)
If you don't look at the TPS number, these results look like a vast
improvement. The overall amount of time spent not waiting for
anything is now much higher, and the problematic locks have largely
disappeared from the picture. However, the load average now shoots up
to about 30, because most of the time that the backends are "not
waiting for anything" they are in fact in kernel wait state D; that
is, they're stuck doing I/O. This suggests that we might want to
consider advertising a wait state when a backend is doing I/O, so we
can measure this sort of thing.
Maybe something in fd.c. It may be a good idea to actually have a
look at perf output on Linux to see where this contention is
happening, use that to decide where to place a wait point,
and then see whether other OSes share a similar pattern.
--
Michael
Robert,
This contention on WAL reminds me of another scenario I've heard about
that was similar.
To fix things, what happened was that the first person to block would be
responsible for writing out all buffers for anyone blocked behind "him".
For example, if you have many threads A, B, C, D:
While A is writing to WAL and holding the lock, B arrives and of course
blocks, then C comes along and blocks as well, then D.
Finally A finishes its write and then ....
Now you have two options for resolving this, either
1) A drops its lock, B picks up the lock... B writes its buffer and then
drops the lock. Then C gets the lock, writes its buffer, drops the
lock, then finally D gets the lock, writes its buffer and then drops the
lock.
2) A then writes out B's, C's, and D's buffers, then A drops the lock,
B, C and D wake up, note that their respective buffers are written and
just return. This greatly speeds up the system. (just be careful to
make sure A doesn't do "too much work" otherwise you can get a sort of
livelock if too many threads are blocked behind it, generally only issue
one additional flush on behalf of other threads, do not "loop until the
queue is empty")
I'm not sure if this is actually possible with the way WAL is
implemented (or perhaps this strategy is already implemented), but if it
isn't done already, it's definitely worth doing, as it can speed things up
enormously.
On 10/6/16 11:38 AM, Robert Haas wrote:
[...]
On Thu, Oct 6, 2016 at 11:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Next, I tried lowering the scale factor to something that fits in
shared buffers. Here are the results at scale factor 300:
14 Lock | tuple
22 LWLockTranche | lock_manager
39 LWLockNamed | WALBufMappingLock
331 LWLockNamed | CLogControlLock
461 LWLockNamed | ProcArrayLock
536 Lock | transactionid
552 Lock | extend
716 LWLockTranche | buffer_content
763 LWLockNamed | XidGenLock
2113 LWLockNamed | WALWriteLock
6190 LWLockTranche | wal_insert
25002 Client | ClientRead
78466 |
tps = 27651.562835 (including connections establishing)
Obviously, there's a vast increase in TPS, and the backends seem to
spend most of their time actually doing work. ClientRead is now the
overwhelmingly dominant wait event, although wal_insert and
WALWriteLock contention is clearly still a significant problem.
Contention on other locks is apparently quite rare. Notice that
client reads are really significant here - more than 20% of the time
we sample what a backend is doing, it's waiting for the next query.
It seems unlikely that performance on this workload can be improved
very much by optimizing anything other than WAL writing, because no
other wait event appears in more than 1% of the samples. It's not
clear how much of the WAL-related stuff is artificial lock contention
and how much of it is the limited speed at which the disk can rotate.
What happens if you turn fsync off? Once an xlog file is fully written, it
is immediately fsynced, even if the backend is holding WALWriteLock or
wal_insert (or both) at the time, and even if synchronous_commit is off.
Even assuming this machine has a BBU so that it doesn't have to wait for disk
rotation, fsyncs are still expensive because the kernel has to find all the
data and get it sent over to the BBU, while holding locks.
....
Second, ClientRead becomes a bigger and bigger issue as the number of
clients increases; by 192 clients, it appears in 45% of the samples.
That basically means that pgbench is increasingly unable to keep up
with the server; for whatever reason, it suffers more than the server
does from the increasing lack of CPU resources.
I would be careful about that interpretation. If you asked pgbench, it
would probably have the opposite opinion.
The backend tosses its response at the kernel (which will never block,
because the pgbench responses are all small and the kernel will buffer
them) and then goes into ClientRead. After the backend goes into ClientRead,
the kernel needs to find and wake up pgbench and deliver the response, and
pgbench has to receive and process the response. Only then does it create
a new query (I've toyed before with having pgbench construct the next query
while it is waiting for a response on the previous one, but that didn't
seem promising, and much of pgbench has been rewritten since then) and pass
the query back to the kernel. Then the kernel has to find and wake up the
backend and deliver the new query. So for a reasonable chunk of the time
that the server thinks it is waiting for the client, the client also thinks
it is waiting for the server.
I think we need to come up with some benchmarking queries which get more
work done per round-trip to the database. And build them into the binary,
because otherwise people won't use them as much as they should if they have
to pass "-f" files around mailing lists and blog postings. For example,
we could enclose the 5 statements of the TPC-B-like script in a single function
which takes aid, bid, tid, and delta as arguments. And presumably we could
drop the other two statements (BEGIN and COMMIT) as well, and rely on
autocommit to get that job done. So we could go from 7 statements to 1.
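A rough sketch of what such a function could look like, just to illustrate
the idea (the function and parameter names are placeholders; the statements
mirror the built-in TPC-B-like script):

CREATE FUNCTION pgbench_transaction(p_aid int, p_bid int, p_tid int, p_delta int)
RETURNS int AS $$
DECLARE
    result int;
BEGIN
    UPDATE pgbench_accounts SET abalance = abalance + p_delta WHERE aid = p_aid;
    SELECT abalance INTO result FROM pgbench_accounts WHERE aid = p_aid;
    UPDATE pgbench_tellers SET tbalance = tbalance + p_delta WHERE tid = p_tid;
    UPDATE pgbench_branches SET bbalance = bbalance + p_delta WHERE bid = p_bid;
    INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
        VALUES (p_tid, p_bid, p_aid, p_delta, CURRENT_TIMESTAMP);
    RETURN result;
END;
$$ LANGUAGE plpgsql;

The per-transaction script then shrinks to the \set lines plus a single
"SELECT pgbench_transaction(:aid, :bid, :tid, :delta);", with autocommit
providing the transaction boundaries.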
Third,
Lock/transactionid and Lock/tuple become more and more common as the
number of clients increases; these happen when two different pgbench
threads decide to hit the same row at the same time. Due to the
birthday paradox this increases pretty quickly as the number of
clients ramps up, but it's not really a server issue: there's nothing
the server can do about the possibility that two or more clients pick
the same random number at the same time.
What I have done in the past is chop a zero off from:
#define naccounts 100000
and recompile pgbench. Then you can increase the scale factor so that you
have less contention on pgbench_branches while still fitting the data in
shared_buffers, or in RAM.
Cheers,
Jeff
Hi,
On 2016-10-06 14:38:56 -0400, Robert Haas wrote:
Obviously, there's a vast increase in TPS, and the backends seem to
spend most of their time actually doing work. ClientRead is now the
overwhelmingly dominant wait event, although wal_insert and
WALWriteLock contention is clearly still a significant problem.
Contention on other locks is apparently quite rare. Notice that
client reads are really significant here - more than 20% of the time
we sample what a backend is doing, it's waiting for the next query.
It seems unlikely that performance on this workload can be improved
very much by optimizing anything other than WAL writing, because no
other wait event appears in more than 1% of the samples.
I don't think you meant that, but this sounds a bit like there's
nothing to do performance-wise - I don't think that's true. There's no
significant lock contention, yes. But that obviously doesn't mean we can't
use resources more efficiently.
Second, ClientRead becomes a bigger and bigger issue as the number of
clients increases; by 192 clients, it appears in 45% of the samples.
That basically means that pgbench is increasingly unable to keep up
with the server; for whatever reason, it suffers more than the server
does from the increasing lack of CPU resources.
Isn't that more a question of pgbench threads being preempted, so server
processes can run? Since there aren't enough hardware threads for both
servers and clients to always run, you *have* to switch between them.
Usually a read from the client after processing a query will cause a
directed wakeup of the other thread (which is why it's not a very
frequently observed state), but if the queue of to-be-run tasks is very
long that'll not happen.
Greetings,
Andres Freund
Hi,
On 2016-10-06 18:15:58 -0400, Robert Haas wrote:
That's pretty tight, especially since I now notice Andres also left a
postmaster running on this machine back in April, with
shared_buffers=8GB.
Oops, sorry for that.
- Andres
Hi,
On 2016-10-06 20:52:22 -0700, Alfred Perlstein wrote:
This contention on WAL reminds me of another scenario I've heard about that
was similar.
To fix things, what happened was that the first person to block would be
responsible for writing out all buffers for anyone blocked behind "him".
We pretty much do that already. But while that's happening, the other
would-be-writers show up as blocking on the lock. We do use kind of
an odd locking model for the waiters (LWLockAcquireOrWait()), which
waits for the lock to be released, but doesn't try to acquire it
afterwards. Instead the WAL position is rechecked, and in many cases
we'll be done afterwards, because enough has been written out.
Greetings,
Andres Freund
On 10/7/16 10:42 AM, Andres Freund wrote:
Hi,
On 2016-10-06 20:52:22 -0700, Alfred Perlstein wrote:
This contention on WAL reminds me of another scenario I've heard about that
was similar.
To fix things, what happened was that the first person to block would be
responsible for writing out all buffers for anyone blocked behind "him".
We pretty much do that already. But while that's happening, the other
would-be-writers show up as blocking on the lock. We do use kind of
an odd locking model for the waiters (LWLockAcquireOrWait()), which
waits for the lock to be released, but doesn't try to acquire it
afterwards. Instead the WAL position is rechecked, and in many cases
we'll be done afterwards, because enough has been written out.
Greetings,
Andres Freund
Are the batched writes all done before fsync is called?
Are you sure that A only calls fsync after flushing all the buffers from
B, C, and D? Or will it fsync twice? Is there instrumentation to show
that?
I know there's a tremendous level of skill involved in this code, but I'm
simply asking in case there are some tricks.
Another strategy that may work is actually intentionally
waiting/buffering a few ms between flushes/fsyncs: for example, make
sure that the number of flushes per second doesn't exceed some
configurable amount, because each flush likely eats at least one IOP from
the disk and there is a maximum IOPS per disk, so you might as well buffer
more if you're exceeding that IOPS count. You'll trade some latency,
but gain throughput for doing that.
Does this make sense? Again apologies if this has been covered. Is
there a whitepaper or blog post or clear way I can examine the algorithm
wrt locks/buffering for flushing WAL logs?
-Alfred
On Fri, Oct 7, 2016 at 1:39 PM, Andres Freund <andres@anarazel.de> wrote:
On 2016-10-06 14:38:56 -0400, Robert Haas wrote:
Obviously, there's a vast increase in TPS, and the backends seem to
spend most of their time actually doing work. ClientRead is now the
overwhelmingly dominant wait event, although wal_insert and
WALWriteLock contention is clearly still a significant problem.
Contention on other locks is apparently quite rare. Notice that
client reads are really significant here - more than 20% of the time
we sample what a backend is doing, it's waiting for the next query.
It seems unlikely that performance on this workload can be improved
very much by optimizing anything other than WAL writing, because no
other wait event appears in more than 1% of the samples.
I don't think you meant that, but this sounds a bit like there's
nothing to do performance-wise - I don't think that's true. There's no
significant lock contention, yes. But that obviously doesn't mean we can't
use resources more efficiently.
Yeah, right. Doing the same stuff faster will surely help. It just
doesn't look like we can get anywhere by improving the locking, unless
it's something related to WAL writing.
Second, ClientRead becomes a bigger and bigger issue as the number of
clients increases; by 192 clients, it appears in 45% of the samples.
That basically means that pgbench is increasingly unable to keep up
with the server; for whatever reason, it suffers more than the server
does from the increasing lack of CPU resources.
Isn't that more a question of pgbench threads being preempted, so server
processes can run? Since there aren't enough hardware threads for both
servers and clients to always run, you *have* to switch between them.
Usually a read from the client after processing a query will cause a
directed wakeup of the other thread (which is why it's not a very
frequently observed state), but if the queue of to-be-run tasks is very
long that'll not happen.
Yeah, maybe I shouldn't have said it suffers more than the server, but
rather along with the server.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Oct 7, 2016 at 11:51 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
What happens if you turn fsync off? Once an xlog file is fully written, it
is immediately fsynced, even if the backend is holding WALWriteLock or
wal_insert (or both) at the time, and even if synchronous_commit is off.
Even assuming this machine has a BBU so that it doesn't have to wait for disk
rotation, fsyncs are still expensive because the kernel has to find all the
data and get it sent over to the BBU, while holding locks.
Scale factor 300, 32 clients, fsync=off:
5 Lock | tuple
18 LWLockTranche | lock_manager
24 LWLockNamed | WALWriteLock
88 LWLockTranche | buffer_content
265 LWLockTranche | wal_insert
373 LWLockNamed | CLogControlLock
496 LWLockNamed | ProcArrayLock
532 Lock | extend
540 LWLockNamed | XidGenLock
545 Lock | transactionid
27067 Client | ClientRead
85364 |
But I'm not sure you're right about the way the fsync=off code works.
I think pg_fsync()/pg_fsync_writethrough()/pg_fsync_no_writethrough()
look at enableFsync and just do nothing if it's false.
Second, ClientRead becomes a bigger and bigger issue as the number of
clients increases; by 192 clients, it appears in 45% of the samples.
That basically means that pgbench is increasingly unable to keep up
with the server; for whatever reason, it suffers more than the server
does from the increasing lack of CPU resources.
I would be careful about that interpretation. If you asked pgbench, it
would probably have the opposite opinion.
Yeah, that's possible.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Oct 7, 2016 at 11:34 PM, Alfred Perlstein <alfred@freebsd.org> wrote:
Are the batched writes all done before fsync is called?
In most cases, but if, while writing, the WAL segment (16MB in size) is
finished, we do fsync and then do the remaining writes, and at the end
again perform fsync.
Are you sure that A only calls fsync after flushing all the buffers from B,
C, and D? Or will it fsync twice?
I don't think it is guaranteed; however, if B, C and D have finished
writing to WAL buffers before A starts writing out buffers, then A will
combine all those writes and call fsync just once.
Is there instrumentation to show that?
Currently, I don't think we have instrumentation to show that, but you
can add it if you want. In XLogFlush(), we do print the write request
values (if XLOG_DEBUG is set and the code has been compiled with the
WAL_DEBUG option); you could put one print in XLogWrite() once the
write/flush is done and another print when the request is satisfied
without calling XLogWrite().
I know there's a tremendous level of skill involved in this code, but I'm simply
asking in case there are some tricks.
Another strategy that may work is actually intentionally waiting/buffering
a few ms between flushes/fsyncs,
We do that before attempting to write, if the user has set the "commit_delay"
and "commit_siblings" GUC parameters.
Now, here we can't buffer the fsync requests, as currently we are doing
both writes and fsync under one lock. However, if we can split the
work such that writes are done under one lock and fsync under a separate
lock, then we can probably try to buffer fsync requests, and after
fsyncing the current pending requests, recheck whether there are
more pending requests and try to flush them.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Oct 7, 2016 at 1:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Oct 7, 2016 at 11:51 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
What happens if you turn fsync off? Once an xlog file is fully written, it
is immediately fsynced, even if the backend is holding WALWriteLock or
wal_insert (or both) at the time, and even if synchronous_commit is off.
Even assuming this machine has a BBU so that it doesn't have to wait for disk
rotation, fsyncs are still expensive because the kernel has to find all the
data and get it sent over to the BBU, while holding locks.
Scale factor 300, 32 clients, fsync=off:
5 Lock | tuple
18 LWLockTranche | lock_manager
24 LWLockNamed | WALWriteLock
88 LWLockTranche | buffer_content
265 LWLockTranche | wal_insert
373 LWLockNamed | CLogControlLock
496 LWLockNamed | ProcArrayLock
532 Lock | extend
540 LWLockNamed | XidGenLock
545 Lock | transactionid
27067 Client | ClientRead
85364 |
Did the TPS go up appreciably?
But I'm not sure you're right about the way the fsync=off code works.
I think pg_fsync()/pg_fsync_writethrough()/pg_fsync_no_writethrough()
look at enableFsync and just do nothing if it's false.
I think we are in agreement. I don't know which part you think is wrong.
When I said immediately, I didn't mean unconditionally.
Anyway, based on the reduced wait events, I think this shows that if we
need to do something in the xlog area, probably what it would be is to add
a queue of fully written but un-synced xlog files, so that the syncing can
be delegated to the background wal writer process. And of course anyone
needing to actually flush their xlog would have to start by flushing the
queue.
(Or perhaps just make the xlog files bigger, and call it a day)
Cheers,
Jeff
On Fri, Oct 7, 2016 at 11:14 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
Another strategy that may work is actually intentionally waiting/buffering
a few ms between flushes/fsyncs,
We do that before attempting to write, if the user has set the "commit_delay"
and "commit_siblings" GUC parameters.
If you have a fast, high resolution timer, then one thing you can do is
keep track of when the previous xlog sync finished. Then instead of having
commit_delay be an absolute amount of time to sleep, it would mean "wait
until that amount of time has passed since the previous sync finished." So
you would set it based on the RPM of your drive, so that the time it is
sleeping to allow more work to happen from other processes is time it would
have to spend waiting on rotational delay anyway.
But I dropped this, because it would be hard to tune, hard to implement in
a cross-platform way, and because anyone with such high performance needs
is just going to buy a nonvolatile write-cache and be done with it.
Now, here we can't buffer the fsync requests, as currently we are doing
both writes and fsync under one lock. However, if we can split the
work such that writes are done under one lock and fsync under a separate
lock, then we can probably try to buffer fsync requests, and after
fsyncing the current pending requests, recheck whether there are
more pending requests and try to flush them.
What I implemented at one point was:
(Already have the lock before getting here)
Write to the extent it is ready to be written.
Update the shared structure to reflect the written-up-to position.
Drop the lock
fsync
Take the lock again
update the shared structure to reflect the flushed-up-to position.
Drop the lock again.
This way, multiple processes could all be waiting on the kernel's fsync
response, rather than on each others locks. What I was hoping would happen
is that if one process wrote everything that was ready and called fsync,
while it was waiting for the platter to come around to the writing head,
more processes could make more data ready, write that more data, and call
an fsync of their own. And the kernel would be smart enough to amalgamate
them together. But the kernel evidently was not that smart, and performance
did not improve.
Cheers,
Jeff
On Fri, Oct 7, 2016 at 8:51 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
I think we need to come up with some benchmarking queries which get more
work done per round-trip to the database. And build them into the binary,
because otherwise people won't use them as much as they should if they have
to pass "-f" files around mailing lists and blog postings. For example,
we could enclose the 5 statements of the TPC-B-like script in a single function
which takes aid, bid, tid, and delta as arguments. And presumably we could
drop the other two statements (BEGIN and COMMIT) as well, and rely on
autocommit to get that job done. So we could go from 7 statements to 1.
Here is an implementation of that. I've included the calling code as a
patch to pgbench, because if I make it a separate -f file then it is a pain
to get the correct scale and settings of naccounts, etc., into it.
The create script could be integrated into pgbench -i if this is something
we might want to commit.
This gives me an almost 3-fold increase in performance on a system with
fsync turned off:
pgbench -b tpcb-func -T30 -c8 -j8
tps = 24193.197877 (excluding connections establishing)
pgbench -b tpcb-like -T30 -c8 -j8
tps = 8434.746300 (excluding connections establishing)
Cheers,
Jeff
Attachments:
pgbench_function.patch (application/octet-stream)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
new file mode 100644
index d44cfda..a48076e
*** a/src/bin/pgbench/pgbench.c
--- b/src/bin/pgbench/pgbench.c
*************** static const BuiltinScript builtin_scrip
*** 423,428 ****
--- 423,437 ----
"END;\n"
},
{
+ "tpcb-func-like",
+ "<builtin: TPC-B (sort of) in PL/pgSQL>",
+ "\\set aid random(1, " CppAsString2(naccounts) " * :scale)\n"
+ "\\set bid random(1, " CppAsString2(nbranches) " * :scale)\n"
+ "\\set tid random(1, " CppAsString2(ntellers) " * :scale)\n"
+ "\\set delta random(-5000, 5000)\n"
+ "select * from pgbench_transaction(:aid, :bid, :tid, :delta);\n"
+ },
+ {
"simple-update",
"<builtin: simple update>",
"\\set aid random(1, " CppAsString2(naccounts) " * :scale)\n"
On Thu, Oct 6, 2016 at 02:38:56PM -0400, Robert Haas wrote:
Hi,
I decided to do some testing on hydra (IBM-provided community
resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
the newly-enhanced wait event stuff to try to get an idea of what
we're waiting for during pgbench. I did 30-minute pgbench runs with
various configurations, but all had max_connections = 200,
shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
on. During each run, I ran this psql script in another window and
captured the output:
This is a great study that shows how the new instrumentation has given
us a new window into performance. I am frankly surprised we got as far
as we did in finding performance bottlenecks before we had this
instrumentation.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Mon, Oct 10, 2016 at 9:44 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Thu, Oct 6, 2016 at 02:38:56PM -0400, Robert Haas wrote:
I decided to do some testing on hydra (IBM-provided community
resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
the newly-enhanced wait event stuff to try to get an idea of what
we're waiting for during pgbench. I did 30-minute pgbench runs with
various configurations, but all had max_connections = 200,
shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
on. During each run, I ran this psql script in another window and
captured the output:
This is a great study that shows how the new instrumentation has given
us a new window into performance. I am frankly surprised we got as far
as we did in finding performance bottlenecks before we had this
instrumentation.
Thanks, and +1.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company