BUG #7494: WAL replay speed depends heavily on the shared_buffers size

Started by Valentine Gogichashvili · over 13 years ago · 12 messages · pgsql-bugs

The following bug has been logged on the website:

Bug reference: 7494
Logged by: Valentine Gogichashvili
Email address: valgog@gmail.com
PostgreSQL version: 9.0.7
Operating system: Linux version 2.6.32-5-amd64 (Debian 2.6.32-41)
Description:

We are experiencing strange(?) behavior on the replication slave machines.
The master machine has a very heavy update load, where many processes are
updating lots of data. It generates up to 30GB of WAL files per hour.
Normally it is not a problem for the slave machines to replay this amount of
WAL on time and keep up with the master. But at some moments the slaves
"hang", with 100% CPU usage on the WAL replay process and 3% IOWait, needing
up to 30 seconds to process one WAL file. Once this tipping point is
reached, a huge WAL replication lag builds up quite fast, which also leads
to overfilling of the XLOG directory on the slave machines, as the WAL
receiver puts the WAL files it gets via streaming replication into the XLOG
directory (which, in many cases, is a separate disk partition of quite
limited size).

We also noticed that reducing the shared_buffers parameter on the slave
machines from our normal 20-32 GB down to 2 GB increases the speed of WAL
replay dramatically. After restarting a slave machine with a much lower
shared_buffers value, replay becomes up to 10-20 times faster.

The attached graph shows a typical WAL replication delay curve for one of
the slaves.

In that graph, the small (up to 6GB) replication delay peaks during the
night are caused by long-running transactions on this slave, which pause WAL
replay to prevent replication conflicts. The last, big peaks sometimes also
start with such a wait for a long-running transaction on the slave, but then
they keep growing as described above.

I know that there is only one process replaying data that was generated by
many backends on the master machine. But why does replay performance depend
so much on the shared_buffers parameter, and can it be optimized?
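For reference, the lag itself can be expressed in bytes by comparing pg_last_xlog_receive_location() with pg_last_xlog_replay_location() on the slave. A rough Python sketch of the LSN arithmetic (assuming the modern hi*2^32 + lo byte mapping; 9.0's WAL layout skipped one segment per 4GB file, so treat it as approximate there):

```python
# Rough sketch, not authoritative: convert an LSN string like '262E/FD000000'
# to a byte position, assuming the modern mapping (hi * 2**32 + lo).
def lsn_to_bytes(lsn):
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

def replay_lag_bytes(receive_lsn, replay_lsn):
    """Bytes of WAL received but not yet replayed."""
    return lsn_to_bytes(receive_lsn) - lsn_to_bytes(replay_lsn)

# Exactly one 16MB WAL file of lag:
print(replay_lag_bytes('262E/FE000000', '262E/FD000000'))  # 16777216
```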

With best regards,

Valentine Gogichashvili

#2Bruce Momjian
bruce@momjian.us
In reply to: Valentine Gogichashvili (#1)
Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

On Wed, Aug 15, 2012 at 10:10:42AM +0000, valgog@gmail.com wrote:


We warn against making shared_buffers > 8GB, and this is perhaps another
good reason. The problem is probably that shared buffers fill up with lots
of dirty data, and the kernel is unable to absorb all the data coming in
during a checkpoint. It is also possible that the buffer management
overhead is just too high with that many buffers.

It is also possible that Postgres 9.1 or the upcoming 9.2 would do better
in this regard.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#3John R Pierce
pierce@hogranch.com
In reply to: Bruce Momjian (#2)
Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

On 08/15/12 6:36 AM, Bruce Momjian wrote:

We warn against making shared buffers > 8GB, and this is perhaps another
good reason.

I generally keep it at no more than 2GB, as I've never found any
performance improvement from going higher on systems with 48GB of RAM, and
have seen more than a few performance degradations with larger shared
buffers.

--
john r pierce N 37, W 122
santa cruz ca mid-left coast

#4Andres Freund
andres@anarazel.de
In reply to: Valentine Gogichashvili (#1)
Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

Hi,

On Wednesday, August 15, 2012 12:10:42 PM valgog@gmail.com wrote:


Could you try to get a profile of that 100% cpu time?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#5Valentine Gogichashvili
valgog@gmail.com

In reply to: Andres Freund (#4)
Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

Hello Andres,

Could you try to get a profile of that 100% cpu time?

Yes I will try to do it.

-- Valentine

#6Valentine Gogichashvili
valgog@gmail.com

In reply to: John R Pierce (#3)
Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

Hello John,

On Wed, Aug 15, 2012 at 3:40 PM, John R Pierce <pierce@hogranch.com> wrote:

I generally keep it at no more than 2gb as I've never found any
performance improvements going higher, on systems with 48gb ram, and had
more than a few performance degradations with larger shared buffers.

we see up to a 10x performance increase with bigger shared_buffers in the
case of this database. The main database entities are about 20GB in size,
and we see that performance drops considerably when running with
shared_buffers smaller than that.

Regards,

-- Valentin

#7John R Pierce
pierce@hogranch.com
In reply to: Valentine Gogichashvili (#6)
Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

On 08/15/12 9:02 AM, Valentine Gogichashvili wrote:


we see up to a 10x performance increase with bigger shared_buffers in the
case of this database. The main database entities are about 20GB in size,
and we see that performance drops considerably when running with
shared_buffers smaller than that.

Do you adjust effective_cache_size accordingly? With the smaller
shared_buffers, we typically find at least half or more of physical
memory is available as OS-level disk cache, as shown by the 'cached'
output of 'free' or similar after the system has been running long
enough to fully populate its disk cache. This parameter has a
significant impact on the planner's estimation of the best way to
execute a given query. Also, especially if you're executing queries
that process a lot of rows and have to do sorts and such, increasing
work_mem is quite helpful.
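To make that concrete, the division John describes might look like this in postgresql.conf for a 48GB machine (illustrative values only, not tuned recommendations):

```
shared_buffers = 2GB          # keep modest; leave most RAM to the OS cache
effective_cache_size = 24GB   # ~half of RAM; what the planner assumes is cached
work_mem = 64MB               # per sort/hash step; raise for row-heavy queries
```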

--
john r pierce N 37, W 122
santa cruz ca mid-left coast

#8Valentine Gogichashvili
valgog@gmail.com

In reply to: Andres Freund (#4)
Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

Hello Andres,

here is the process, which right now is actually not using any CPU at all;
shared_buffers is set to 2GB:

50978 postgres 20 0 2288m 2.0g 2.0g S 0.0 1.6 4225:34 postgres:
startup process recovering 000000050000262E000000FD

It has been hanging on that file for several minutes now.

and here is the strace:

$ strace -c -f -p 50978
Process 50978 attached - interrupt to quit
Process 50978 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.82    0.007999          37       215           select
  2.73    0.000230           1       215           getppid
  2.45    0.000207           1       215       215 stat
------ ----------- ----------- --------- --------- ----------------
100.00    0.008436                   645       215 total

What kind of additional profiling information would you like to see?

Regards,

-- Valentin

#9Valentine Gogichashvili
valgog@gmail.com

In reply to: Valentine Gogichashvili (#8)
Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

Hello again,

now I have restarted the slave with shared_buffers set to 32GB, and the
recovery process is at 100% CPU:

$ strace -c -f -p 27076
Process 27076 attached - interrupt to quit
Process 27076 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 67.65    0.002127           0     18127           lseek
 28.98    0.000911           0     16496           getppid
  2.58    0.000081           0       758           read
  0.80    0.000025           4         6         1 open
  0.00    0.000000           0         4           close
  0.00    0.000000           0         1           stat
  0.00    0.000000           0       493       493 unlink
------ ----------- ----------- --------- --------- ----------------
100.00    0.003144                 35885       494 total

$ strace -c -f -p 27076
Process 27076 attached - interrupt to quit
Process 27076 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 44.44    0.000240           0      5976           lseek
 38.15    0.000206           0      5522           getppid
 17.41    0.000094           1       172           read
  0.00    0.000000           0         4         1 open
  0.00    0.000000           0         2           close
  0.00    0.000000           0         1           stat
  0.00    0.000000           0       159       159 unlink
------ ----------- ----------- --------- --------- ----------------
100.00    0.000540                 11836       160 total

Then I restarted again with shared_buffers set back to 2GB. Now the replay
process is relatively fast again, and here is the strace:

$ strace -c -f -p 36450
Process 36450 attached - interrupt to quit
Process 39336 attached
Process 39337 attached
Process 39337 detached
Process 39338 attached
Process 39338 detached
Process 39339 attached
Process 39339 detached
Process 39336 detached
Process 39359 attached
Process 39360 attached
Process 39360 detached
Process 39361 attached
Process 39361 detached
Process 39362 attached
Process 39362 detached
Process 39359 detached
Process 40168 attached
Process 40169 attached
Process 40169 detached
Process 40170 attached
Process 40170 detached
Process 40171 attached
Process 40171 detached
Process 40168 detached
Process 36450 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 88.21    0.668159       31817        21         9 wait4
  4.69    0.035553           6      5970           read
  1.92    0.014572           0    120565           lseek
  1.91    0.014474           5      2957      2954 unlink
  1.55    0.011766           8      1545           write
  1.27    0.009582           0    109274           getppid
  0.31    0.002340           3       762           fcntl
  0.03    0.000245          20        12           clone
  0.02    0.000163           1       219           rt_sigprocmask
  0.02    0.000162           1       154        10 open
  0.01    0.000092           0       216           mmap
  0.01    0.000084           1       133        42 stat
  0.01    0.000066           6        12           getrlimit
  0.01    0.000046           0       178           close
  0.00    0.000037           1        27           munmap
  0.00    0.000035           6         6           pipe
  0.00    0.000035          12         3           set_tid_address
  0.00    0.000030           1        42           getegid
  0.00    0.000021           0       183           rt_sigaction
  0.00    0.000000           0       108           fstat
  0.00    0.000000           0       102           mprotect
  0.00    0.000000           0        36           brk
  0.00    0.000000           0         9           rt_sigreturn
  0.00    0.000000           0         9         9 ioctl
  0.00    0.000000           0        48        12 access
  0.00    0.000000           0         6           dup2
  0.00    0.000000           0         6           getpid
  0.00    0.000000           0        12           execve
  0.00    0.000000           0         9           uname
  0.00    0.000000           0        42           getuid
  0.00    0.000000           0        42           getgid
  0.00    0.000000           0        45           geteuid
  0.00    0.000000           0         6           getpgrp
  0.00    0.000000           0         6           statfs
  0.00    0.000000           0        12           arch_prctl
  0.00    0.000000           0         6         3 futex
  0.00    0.000000           0         3           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.757462                242786      3039 total

With best regards,

-- Valentine Gogichashvili

#10Valentine Gogichashvili
valgog@gmail.com
In reply to: John R Pierce (#7)
Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

Hello John,

Yes, effective_cache_size is set to 50% of the RAM = 64GB, but as I
mentioned already, we measure a considerable performance increase when
increasing shared_buffers to values where the most important tables fit
completely.

Regards,

-- Valentine

#11Andres Freund
andres@anarazel.de
In reply to: Valentine Gogichashvili (#9)
Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

Hi Valentine,

On Thursday, August 16, 2012 05:05:11 PM Valentine Gogichashvili wrote:


Those straces don't indicate contention in the kernel to me.

Could you get a profile with 'perf', or with 'oprofile' if you're more
familiar with it? With the former it's just something like:
perf record -F 10000 -p 27076
perf report > somefile

And then send in somefile.

Greetings,

Andres

#12Andres Freund
andres@anarazel.de
In reply to: Valentine Gogichashvili (#1)
Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size

On Friday, August 17, 2012 12:51:44 PM Valentine Gogichashvili wrote:

Hello Andres,

here are the results of perf profiling:
https://gist.github.com/3b8cb0c15661da439632
Also attached the files to that mail.

For a problematic case of big shared_buffers:

# Events: 320K cycles
#
# Overhead  Command   Shared Object      Symbol
# ........  ........  .................  ..................................
#
    98.70%  postgres  postgres           [.] DropRelFileNodeBuffers
     0.18%  postgres  postgres           [.] RecordIsValid
     0.11%  postgres  [kernel.kallsyms]  [k] native_write_msr_safe
     0.07%  postgres  [kernel.kallsyms]  [k] dyntick_save_progress_counter
     0.06%  postgres  [kernel.kallsyms]  [k] scheduler_tick
     0.03%  postgres  [kernel.kallsyms]  [k] _spin_lock
     0.03%  postgres  [kernel.kallsyms]  [k] __do_softirq
     0.03%  postgres  [kernel.kallsyms]  [k] rcu_process_callbacks
     0.03%  postgres  postgres           [.] hash_search_with_hash_value
     0.03%  postgres  [kernel.kallsyms]  [k] native_read_msr_safe
     0.02%  postgres  libc-2.12.so       [.] memcpy
     0.02%  postgres  [kernel.kallsyms]  [k] rcu_process_gp_end
     0.02%  postgres  [kernel.kallsyms]  [k] apic_timer_interrupt
     0.02%  postgres  [kernel.kallsyms]  [k] run_timer_softirq
     0.02%  postgres  [kernel.kallsyms]  [k] system_call

Ok, that explains it. You're frequently dropping/truncating tables? That
currently requires walking through all of shared buffers and invalidating
all buffered pages belonging to that table. That obviously scales linearly
with shared_buffers and is particularly expensive on multi-socket machines.

Unless you're running 9.3, it will even lock each single page, which is a
relatively expensive and slow operation. Depending on how adventurous you
are, you could try backporting e8d029a30b5a5fb74b848a8697b1dfa3f66d9697 and
see how big the benefits are for you.
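To illustrate the scaling with a simplified Python model (not the actual C code): each drop/truncate replayed scans every buffer header, so the cost tracks the total number of buffers, not the number of pages the relation actually had cached. At 8kB pages, shared_buffers = 32GB means about 4.2 million headers per drop versus about 262k at 2GB.

```python
# Simplified model of DropRelFileNodeBuffers: every relation drop/truncate
# during replay scans ALL buffer headers to invalidate that relation's pages.
def drop_rel_buffers(buffers, rel):
    scanned = 0
    for buf in buffers:              # linear scan over all of shared_buffers
        scanned += 1
        if buf['rel'] == rel:
            buf['rel'] = None        # invalidate this page
    return scanned

# Buffer-header counts at 8kB pages:
nbuffers_32gb = 32 * 1024**3 // 8192   # 4,194,304 headers
nbuffers_2gb = 2 * 1024**3 // 8192     # 262,144 headers

buffers = [{'rel': 'big_table' if i % 4 == 0 else 'other'}
           for i in range(1000)]
print(drop_rel_buffers(buffers, 'big_table'))  # 1000: cost = total buffers
```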

Greetings,

Andres