BUG #9635: Wal sender process is using 100% CPU
The following bug has been logged on the website:
Bug reference: 9635
Logged by: Jamie Koceniak
Email address: jkoceniak@mediamath.com
PostgreSQL version: 9.1.9
Operating system: x86_64-unknown-linux-gnu (Debian 4.7.2-5) 64-bit
Description:
Periodically throughout the day, we keep seeing the wal sender process
utilize 100% of the CPU. We began noticing this after we added 2 new slave
servers, going from 2 to 4 slaves. See top results and I also included our
wal settings. Thanks!
top - 05:03:18 up 174 days, 4:51, 2 users, load average: 5.57, 4.75, 3.16
Tasks: 387 total, 8 running, 379 sleeping, 0 stopped, 0 zombie
%Cpu(s): 29.3 us, 4.7 sy, 0.0 ni, 65.3 id, 0.4 wa, 0.0 hi, 0.4 si, 0.0 st
MiB Mem: 290797 total, 218532 used, 72264 free, 311 buffers
MiB Swap: 7812 total, 1 used, 7811 free, 206978 cached
  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
30244 postgres 20  0 8497m 5608 2820 R  100  0.0   1:44.72 postgres: wal sen
14447 postgres 20  0 8497m 5596 2816 R  100  0.0   3:11.27 postgres: wal sen
16075 postgres 20  0 8497m 5600 2820 R  100  0.0   3:32.32 postgres: wal sen
 8177 postgres 20  0 8497m 5360 2820 S   36  0.0   0:03.35 postgres: wal sen
 4920 postgres 20  0 9647m 9.3g 8.1g S    3  3.3   1097:40 postgres: writer
 4923 postgres 20  0 68872 2072  788 S    3  0.0 511:01.76 postgres: archive
 4921 postgres 20  0 8496m  18m  17m S    2  0.0 593:36.38 postgres: wal wri
 7853 root     20  0 23432 1836 1176 R    1  0.0   0:00.44 top
 4916 postgres 20  0 8492m 229m 228m S    0  0.1 598:44.57 /usr/lib/postgres
Current Wal settings:
name | setting |
------------------------------+-------------+
max_wal_senders | 10 |
wal_block_size | 8192 |
wal_buffers | 2048 |
wal_keep_segments | 5000 |
wal_level | hot_standby |
wal_receiver_status_interval | 10 |
wal_segment_size | 2048 |
wal_sender_delay | 1000 |
wal_sync_method | fdatasync |
wal_writer_delay | 200 |
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On 03/19/2014 07:13 PM, jkoceniak@mediamath.com wrote:
The following bug has been logged on the website:
Bug reference: 9635
Logged by: Jamie Koceniak
Email address: jkoceniak@mediamath.com
PostgreSQL version: 9.1.9
Operating system: x86_64-unknown-linux-gnu (Debian 4.7.2-5) 64-bit
Description:
Periodically throughout the day, we keep seeing the wal sender process
utilize 100% of the CPU. We began noticing this after we added 2 new slave
servers, going from 2 to 4 slaves. See top results and I also included our
wal settings. Thanks!
Hmm. There was a bug that caused walsender to go into a busy-loop, when
the standby disconnected. But that was fixed in 9.1.4 already.
Could you attach to one of the processes with gdb and see what functions
are being called? Or strace, or "perf top" or something.
- Heikki
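For anyone following along, Heikki's suggestion can be scripted roughly like this; a sketch, assuming gdb, strace, and perf are installed and the commands run as root or the postgres user (the awk pattern is an assumption about how ps labels walsender processes on this system):

```shell
# List walsender PIDs: their command line contains "wal sender".
walsender_pids() {
    ps -eo pid,cmd | awk '/postgres: wal sender/ {print $1}'
}

# Then, for each busy PID (commented out; these need a live server):
#   gdb -batch -p "$pid" -ex bt       # one-shot backtrace of the process
#   timeout 10 strace -c -p "$pid"    # 10-second syscall count summary
#   perf top -p "$pid"                # live profile of just this process
```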
Hi Heikki,
I performed an strace on one of the pids that was consuming 100% of the cpu.
This is just a portion of the log.
I also trimmed out the quoted data in the read() and sendto() calls; not sure if you need it.
A majority of the calls in the trace log were recvfrom() and sendto().
EAGAIN (Resource temporarily unavailable) came up time and time again.
# strace log during 100% cpu utilization on the master
getppid() = 5474
recvfrom(10, 0x7f9dfec1f753, 5, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
read(7, ""..., 131072) = 131072
sendto(10, ""..., 3178, 0, NULL, 0) = 3178
sendto(10, ""..., 5381, 0, NULL, 0) = 5381
getppid() = 5474
recvfrom(10, 0x7f9dfec1f753, 5, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
read(7, ""..., 131072) = 131072
sendto(10, ""..., 3338, 0, NULL, 0) = 3338
getppid() = 5474
recvfrom(10, " ", 5, 0, NULL, NULL) = 5
recvfrom(10, "", 32, 0, NULL, NULL) = 32
recvfrom(10, "", 5, 0, NULL, NULL) = 5
recvfrom(10, ""..., 48, 0, NULL, NULL) = 48
fcntl(10, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(10, F_SETFL, O_RDWR) = 0
fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
recvfrom(10, 0x7f9dfec1f753, 5, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
select(11, [3 10], [10], NULL, {1, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
write(6, "\0", 1) = 1
rt_sigreturn(0x6) = -1 EINTR (Interrupted system call)
read(3, "\0", 16) = 1
getppid() = 5474
recvfrom(10, 0x7f9dfec1f753, 5, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(10, ""..., 2715, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
read(3, 0x7fff961b0780, 16) = -1 EAGAIN (Resource temporarily unavailable)
select(11, [3 10], [10], NULL, {1, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
write(6, "\0", 1) = 1
rt_sigreturn(0x6) = -1 EINTR (Interrupted system call)
read(3, "\0", 16) = 1
getppid() = 5474
I currently don't have perf installed.
I can get that installed and send results if you need it.
Thanks!
-----Original Message-----
From: Heikki Linnakangas [mailto:hlinnakangas@vmware.com]
Sent: Tuesday, March 25, 2014 4:44 AM
To: Jamie Koceniak
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #9635: Wal sender process is using 100% CPU
On 03/19/2014 07:13 PM, jkoceniak@mediamath.com wrote:
The following bug has been logged on the website:
Bug reference: 9635
Logged by: Jamie Koceniak
Email address: jkoceniak@mediamath.com
PostgreSQL version: 9.1.9
Operating system: x86_64-unknown-linux-gnu (Debian 4.7.2-5) 64-bit
Description:
Periodically throughout the day, we keep seeing the wal sender process
utilize 100% of the CPU. We began noticing this after we added 2 new
slave servers, going from 2 to 4 slaves. See top results and I also
included our wal settings. Thanks!
Hmm. There was a bug that caused walsender to go into a busy-loop, when the standby disconnected. But that was fixed in 9.1.4 already.
Could you attach to one of the processes with gdb and see what functions are being called? Or strace, or "perf top" or something.
- Heikki
On 05/05/2014 10:37 PM, Jamie Koceniak wrote:
Hi Heikki,
I performed an strace on one of the pids that was consuming 100% of the cpu.
This is just a portion of the log.
I also trimmed out the quoted data in the read() and sendto() calls; not sure if you need it.
A majority of the calls in the trace log were recvfrom() and sendto().
EAGAIN (Resource temporarily unavailable) came up time and time again.
# strace log during 100% cpu utilization on the master
Hmm, that trace looks normal to me. It doesn't look like it's
busy-looping, as I would expect if it's using 100% of the CPU.
Can you post the full log, or at least a larger chunk? I don't need the
data in the quotes from read() and sendto() that you removed, but I would
like to see the trace from a longer period of time to see what the
pattern is.
- Heikki
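A longer trace without the payload bytes can be captured with strace's `-s 0` option, which truncates every printed data string; a sketch (the PID is a placeholder, and the `summarize_trace` helper is mine, not part of strace):

```shell
# Timestamped trace with all quoted data suppressed, e.g.:
#   strace -tt -s 0 -p 30244 -o walsender.trace    # Ctrl-C when done

# Quick tally of which syscalls dominate a trace file:
summarize_trace() {
    # With -tt the syscall name is the second field, up to its "(".
    awk '{sub(/\(.*/, "", $2); print $2}' "$1" | sort | uniq -c | sort -rn
}
```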
Hi Heikki,
Attached is an strace log for one of the pids that was running at 100% cpu.
Note: when the wal_senders hit 100% cpu, our slave servers tend to lag the master for up to around 3-4 minutes until the wal_sender cpu utilization drops. This is very problematic for us since we are trying to use the slaves for reads in our application.
Here are the perf top results:
39.49% libz.so.1.2.7 [.] 0x000000000000301e
32.97% postgres [.] 0x00000000000cf150
6.50% postgres [.] nocachegetattr
2.63% postgres [.] btint4cmp
1.64% postgres [.] _start
1.40% [unknown] [.] 0x00007f9dfca6a34b
1.06% libc-2.13.so [.] 0x000000000012934b
0.94% postgres [.] slot_getattr
0.85% libcrypto.so.1.0.0 [.] 0x000000000009f48f
0.69% postgres [.] ExecProject
0.67% libz.so.1.2.7 [.] adler32
0.66% postgres [.] GetMemoryChunkSpace
0.58% [kernel] [k] copy_user_generic_string
0.45% postgres [.] heap_fill_tuple
0.38% postgres [.] execTuplesMatch
0.27% postgres [.] FunctionCall2Coll
0.27% postgres [.] ExecProcNode
0.25% postgres [.] heap_tuple_untoast_attr
0.22% postgres [.] MemoryContextAlloc
0.21% postgres [.] heap_form_minimal_tuple
0.21% postgres [.] pfree
0.19% postgres [.] heap_compute_data_size
Here is some data using perf record and perf report:
+ 33.93% 193019 postgres postgres [.] 0x0000000000369e27
+ 30.06% 173825 postgres libz.so.1.2.7 [.] 0x00000000000036c0
+ 10.21% 60618 postgres [unknown] [.] 0x00007f9dfc2fd529
+ 4.81% 27275 postgres postgres [.] nocachegetattr
+ 2.55% 14443 postgres postgres [.] btint4cmp
+ 1.22% 6916 postgres postgres [.] ExecProject
+ 1.20% 6837 postgres postgres [.] _start
+ 1.10% 6495 postgres libc-2.13.so [.] 0x000000000012b8e9
+ 0.84% 4760 postgres postgres [.] heap_fill_tuple
+ 0.74% 4265 postgres [kernel.kallsyms] [k] copy_user_generic_string
+ 0.71% 4001 postgres postgres [.] GetMemoryChunkSpace
+ 0.66% 3881 postgres libcrypto.so.1.0.0 [.] 0x0000000000081ec7
+ 0.63% 3673 postgres libz.so.1.2.7 [.] adler32
+ 0.43% 2448 postgres postgres [.] slot_getattr
+ 0.40% 2277 postgres postgres [.] ExecProcNode
+ 0.40% 2250 postgres postgres [.] heap_form_minimal_tuple
+ 0.37% 2078 postgres postgres [.] heap_compute_data_size
+ 0.31% 1739 postgres postgres [.] ExecScan
+ 0.27% 1779 postgres [kernel.kallsyms] [k] page_fault
+ 0.24% 1374 postgres postgres [.] LogicalTapeWrite
+ 0.24% 4206 swapper [kernel.kallsyms] [k] intel_idle
+ 0.21% 1206 postgres postgres [.] slot_getsomeattrs
Also, here are our wal settings on the master.
Do these settings appear ok?
wal_block_size | 8192 |
wal_buffers | 16MB |
wal_keep_segments | 5000 |
wal_level | hot_standby |
wal_receiver_status_interval | 10s |
wal_segment_size | 16MB |
wal_sender_delay | 1s |
wal_sync_method | fdatasync |
wal_writer_delay | 200ms |
Thanks!
-----Original Message-----
From: Heikki Linnakangas [mailto:hlinnakangas@vmware.com]
Sent: Tuesday, May 06, 2014 4:01 AM
To: Jamie Koceniak
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #9635: Wal sender process is using 100% CPU
On 05/05/2014 10:37 PM, Jamie Koceniak wrote:
Hi Heikki,
I performed an strace on one of the pids that was consuming 100% of the cpu.
This is just a portion of the log.
I also trimmed out the quoted data in the read() and sendto() calls; not sure if you need it.
A majority of the calls in the trace log were recvfrom() and sendto().
EAGAIN (Resource temporarily unavailable) came up time and time again.
# strace log during 100% cpu utilization on the master
Hmm, that trace looks normal to me. It doesn't look like it's busy-looping, as I would expect if it's using 100% of the CPU.
Can you post the full log, or at least a larger chunk? I don't need the data in the quotes from read() and sendto() that you removed, but I would like to see the trace from a longer period of time to see what the pattern is.
- Heikki
Attachments:
wal_sender_trace.txt (text/plain)
On 05/06/2014 07:44 PM, Jamie Koceniak wrote:
Hi Heikki,
Attached is an strace log for one of the pids that was running at 100% cpu.
Note: when the wal_senders hit 100% cpu, our slave servers tend to lag the master for up to around 3-4 minutes until the wal_sender cpu utilization drops. This is very problematic for us since we are trying to use the slaves for reads in our application.
Here is perf top results:
39.49% libz.so.1.2.7 [.] 0x000000000000301e
32.97% postgres [.] 0x00000000000cf150
6.50% postgres [.] nocachegetattr
2.63% postgres [.] btint4cmp
1.64% postgres [.] _start
1.40% [unknown] [.] 0x00007f9dfca6a34b
1.06% libc-2.13.so [.] 0x000000000012934b
0.94% postgres [.] slot_getattr
0.85% libcrypto.so.1.0.0 [.] 0x000000000009f48f
0.69% postgres [.] ExecProject
0.67% libz.so.1.2.7 [.] adler32
0.66% postgres [.] GetMemoryChunkSpace
0.58% [kernel] [k] copy_user_generic_string
0.45% postgres [.] heap_fill_tuple
0.38% postgres [.] execTuplesMatch
0.27% postgres [.] FunctionCall2Coll
0.27% postgres [.] ExecProcNode
0.25% postgres [.] heap_tuple_untoast_attr
0.22% postgres [.] MemoryContextAlloc
0.21% postgres [.] heap_form_minimal_tuple
0.21% postgres [.] pfree
0.19% postgres [.] heap_compute_data_size
Hmm, these results seem to include all processes, not just the WAL sender.
Here is some data using perf record and perf report:
+ 33.93% 193019 postgres postgres [.] 0x0000000000369e27
+ 30.06% 173825 postgres libz.so.1.2.7 [.] 0x00000000000036c0
+ 10.21% 60618 postgres [unknown] [.] 0x00007f9dfc2fd529
+ 4.81% 27275 postgres postgres [.] nocachegetattr
+ 2.55% 14443 postgres postgres [.] btint4cmp
+ 1.22% 6916 postgres postgres [.] ExecProject
+ 1.20% 6837 postgres postgres [.] _start
+ 1.10% 6495 postgres libc-2.13.so [.] 0x000000000012b8e9
+ 0.84% 4760 postgres postgres [.] heap_fill_tuple
+ 0.74% 4265 postgres [kernel.kallsyms] [k] copy_user_generic_string
+ 0.71% 4001 postgres postgres [.] GetMemoryChunkSpace
+ 0.66% 3881 postgres libcrypto.so.1.0.0 [.] 0x0000000000081ec7
+ 0.63% 3673 postgres libz.so.1.2.7 [.] adler32
+ 0.43% 2448 postgres postgres [.] slot_getattr
+ 0.40% 2277 postgres postgres [.] ExecProcNode
+ 0.40% 2250 postgres postgres [.] heap_form_minimal_tuple
+ 0.37% 2078 postgres postgres [.] heap_compute_data_size
+ 0.31% 1739 postgres postgres [.] ExecScan
+ 0.27% 1779 postgres [kernel.kallsyms] [k] page_fault
+ 0.24% 1374 postgres postgres [.] LogicalTapeWrite
+ 0.24% 4206 swapper [kernel.kallsyms] [k] intel_idle
+ 0.21% 1206 postgres postgres [.] slot_getsomeattrs
Same here.
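To pin down how much of the walsender's own time goes where, perf can be limited to a single PID and its report summed per shared object; a sketch (the `dso_overhead` helper is my own, not a perf feature, and the PID is a placeholder):

```shell
# Profile only the walsender instead of the whole machine, e.g.:
#   perf record -g -p 30244 -- sleep 30 && perf report --stdio > report.txt

# Sum the overhead a report attributes to one shared object (e.g. libz).
# Report lines look like: "+ 30.06% 173825 postgres libz.so.1.2.7 [.] ..."
dso_overhead() {  # usage: dso_overhead libz < report.txt
    awk -v dso="$1" '$0 ~ dso {
        for (i = 1; i <= NF; i++)
            if ($i ~ /%$/) { sub(/%/, "", $i); s += $i; break }
    } END { printf "%.2f\n", s }'
}
```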
Also, here are our wal settings on the master.
Do these settings appear ok?
wal_block_size | 8192 |
wal_buffers | 16MB |
wal_keep_segments | 5000 |
wal_level | hot_standby |
wal_receiver_status_interval | 10s |
wal_segment_size | 16MB |
wal_sender_delay | 1s |
wal_sync_method | fdatasync |
wal_writer_delay | 200ms |
Looks reasonable to me.
So, the strace output repeats this, with small variations in the number
of bytes written in each sendto() call:
recvfrom(10, 0x7f9dfec1f753, 5, 0, 0, 0) = -1 EAGAIN (Resource
temporarily unavailable)
read(7, ..., 131072) = 131072
sendto(10, ..., 3690, 0, NULL, 0) = 3690
sendto(10, ..., 3605, 0, NULL, 0) = 3605
sendto(10, ..., 3669, 0, NULL, 0) = 3669
sendto(10, ..., 3653, 0, NULL, 0) = 3653
sendto(10, ..., 3637, 0, NULL, 0) = 3637
sendto(10, ..., 3621, 0, NULL, 0) = 3621
sendto(10, ..., 3669, 0, NULL, 0) = 3669
sendto(10, ..., 3605, 0, NULL, 0) = 3605
sendto(10, ..., 53, 0, NULL, 0) = 53
getppid() = 5474
The read() reads 128kB of WAL from the file in pg_xlog, and it is then
sent to the client. Now, it's pretty surprising that the number of bytes
written with sendto() doesn't match the number of bytes read from the
file. Unless you're using SSL with compression. That would also explain
the high amount of CPU time spent in libz in the perf output.
Try disabling SSL compression, with sslcompression=0 option in
primary_conninfo, or disable SSL altogether if the network is
trustworthy enough for that.
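In recovery.conf on each standby, that would look roughly like this (host and user are placeholder values; whether libpq honors `sslcompression` depends on the client library version):

```
# recovery.conf on the standby -- placeholder host/user values
primary_conninfo = 'host=master.example.com user=replication sslcompression=0'
```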
- Heikki
Hi,
On 2014-03-19 17:13:56 +0000, jkoceniak@mediamath.com wrote:
Periodically throughout the day, we keep seeing the wal sender process
utilize 100% of the CPU. We began noticing this after we added 2 new slave
servers, going from 2 to 4 slaves. See top results and I also included our
wal settings. Thanks!
A typical reason for this is that the standbys are connecting over
ssl. Is that the case? If so, intentionally?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
I was able to recreate the issue.
Basically, we have a job that inserts 30+ million records into a table.
During the loading of this table, the wal_sender cpu spikes to 100% and appears to stay at 100% until the replication is completed on the slave.
I was able to test inserting this much data on another master/slave configuration with streaming replication and saw the same results.
Other than offloading this job, is there anything we can do to decrease the replication lag?
Also, is this expected behavior due to the volume of data that got inserted and needs to get replicated?
Thanks!
-----Original Message-----
From: Heikki Linnakangas [mailto:hlinnakangas@vmware.com]
Sent: Tuesday, May 06, 2014 10:33 AM
To: Jamie Koceniak
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #9635: Wal sender process is using 100% CPU
On 05/06/2014 07:44 PM, Jamie Koceniak wrote:
Hi Heikki,
Attached is an strace log for one of the pids that was running at 100% cpu.
Note: when the wal_senders hit 100% cpu, our slave servers tend to lag the master for up to around 3-4 minutes until the wal_sender cpu utilization drops. This is very problematic for us since we are trying to use the slaves for reads in our application.
Here is perf top results:
39.49% libz.so.1.2.7 [.] 0x000000000000301e
32.97% postgres [.] 0x00000000000cf150
6.50% postgres [.] nocachegetattr
2.63% postgres [.] btint4cmp
1.64% postgres [.] _start
1.40% [unknown] [.] 0x00007f9dfca6a34b
1.06% libc-2.13.so [.] 0x000000000012934b
0.94% postgres [.] slot_getattr
0.85% libcrypto.so.1.0.0 [.] 0x000000000009f48f
0.69% postgres [.] ExecProject
0.67% libz.so.1.2.7 [.] adler32
0.66% postgres [.] GetMemoryChunkSpace
0.58% [kernel] [k] copy_user_generic_string
0.45% postgres [.] heap_fill_tuple
0.38% postgres [.] execTuplesMatch
0.27% postgres [.] FunctionCall2Coll
0.27% postgres [.] ExecProcNode
0.25% postgres [.] heap_tuple_untoast_attr
0.22% postgres [.] MemoryContextAlloc
0.21% postgres [.] heap_form_minimal_tuple
0.21% postgres [.] pfree
0.19% postgres [.] heap_compute_data_size
Hmm, these results seem to include all processes, not just the WAL sender
Here is some data using perf record and perf report:
+ 33.93% 193019 postgres postgres [.] 0x0000000000369e27
+ 30.06% 173825 postgres libz.so.1.2.7 [.] 0x00000000000036c0
+ 10.21% 60618 postgres [unknown] [.] 0x00007f9dfc2fd529
+ 4.81% 27275 postgres postgres [.] nocachegetattr
+ 2.55% 14443 postgres postgres [.] btint4cmp
+ 1.22% 6916 postgres postgres [.] ExecProject
+ 1.20% 6837 postgres postgres [.] _start
+ 1.10% 6495 postgres libc-2.13.so [.] 0x000000000012b8e9
+ 0.84% 4760 postgres postgres [.] heap_fill_tuple
+ 0.74% 4265 postgres [kernel.kallsyms] [k] copy_user_generic_string
+ 0.71% 4001 postgres postgres [.] GetMemoryChunkSpace
+ 0.66% 3881 postgres libcrypto.so.1.0.0 [.] 0x0000000000081ec7
+ 0.63% 3673 postgres libz.so.1.2.7 [.] adler32
+ 0.43% 2448 postgres postgres [.] slot_getattr
+ 0.40% 2277 postgres postgres [.] ExecProcNode
+ 0.40% 2250 postgres postgres [.] heap_form_minimal_tuple
+ 0.37% 2078 postgres postgres [.] heap_compute_data_size
+ 0.31% 1739 postgres postgres [.] ExecScan
+ 0.27% 1779 postgres [kernel.kallsyms] [k] page_fault
+ 0.24% 1374 postgres postgres [.] LogicalTapeWrite
+ 0.24% 4206 swapper [kernel.kallsyms] [k] intel_idle
+ 0.21% 1206 postgres postgres [.] slot_getsomeattrs
Same here.
Also, here are our wal settings on the master.
Do these settings appear ok?
wal_block_size | 8192 |
wal_buffers | 16MB |
wal_keep_segments | 5000 |
wal_level | hot_standby |
wal_receiver_status_interval | 10s |
wal_segment_size | 16MB |
wal_sender_delay | 1s |
wal_sync_method | fdatasync |
wal_writer_delay | 200ms |
Looks reasonable to me.
So, the strace output repeats this, with small variations in the number of bytes written in each sendto() call:
recvfrom(10, 0x7f9dfec1f753, 5, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
read(7, ..., 131072) = 131072
sendto(10, ..., 3690, 0, NULL, 0) = 3690
sendto(10, ..., 3605, 0, NULL, 0) = 3605
sendto(10, ..., 3669, 0, NULL, 0) = 3669
sendto(10, ..., 3653, 0, NULL, 0) = 3653
sendto(10, ..., 3637, 0, NULL, 0) = 3637
sendto(10, ..., 3621, 0, NULL, 0) = 3621
sendto(10, ..., 3669, 0, NULL, 0) = 3669
sendto(10, ..., 3605, 0, NULL, 0) = 3605
sendto(10, ..., 53, 0, NULL, 0) = 53
getppid() = 5474
The read() reads 128kB of WAL from the file in pg_xlog, and it is then sent to the client. Now, it's pretty surprising that the number of bytes written with sendto() doesn't match the number of bytes read from the file. Unless you're using SSL with compression. That would also explain the high amount of CPU time spent in libz in the perf output.
Try disabling SSL compression, with sslcompression=0 option in primary_conninfo, or disable SSL altogether if the network is trustworthy enough for that.
- Heikki
Hi Andres,
The ssl option is on in postgresql.conf but we don't have any hostssl lines or cert configured as an authentication method in our pg_hba.conf file.
In our pg_hba.conf we have configured each replication host as trust:
host replication all xx.xxx.xxx.xx/32 trust
I replied in a previous email that we have a job that inserts 30+ million records into a table.
During the loading of this table, the wal_sender cpu spikes to 100% and appears to stay at 100% until the replication is completed on the slave.
I have been able to reproduce this. Other than offloading this job, is there anything we can do to improve the replication lag?
Thanks!
-----Original Message-----
From: Andres Freund [mailto:andres@2ndquadrant.com]
Sent: Tuesday, May 06, 2014 3:42 PM
To: Jamie Koceniak
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #9635: Wal sender process is using 100% CPU
Hi,
On 2014-03-19 17:13:56 +0000, jkoceniak@mediamath.com wrote:
Periodically throughout the day, we keep seeing the wal sender process
utilize 100% of the CPU. We began noticing this after we added 2 new
slave servers, going from 2 to 4 slaves. See top results and I also
included our wal settings. Thanks!
A typical reason for this is that the standbys are connecting over ssl. Is that the case? If so, intentionally?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2014-05-07 03:02:22 +0000, Jamie Koceniak wrote:
The ssl option is on in postgres.conf but we don't have any hostssl lines or cert configured as an authentication method in our pg_hba.conf file.
Add sslmode=disable to the standby's primary_conninfo to be certain. The
profile suggests ssl is active. 'host' allows ssl connections, it just
doesn't force them to be used.
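Concretely, that means something like the following (host, user, and address are placeholder values; `hostnossl` is the pg_hba-side way to reject SSL connections outright):

```
# recovery.conf on each standby: refuse to negotiate SSL at all
primary_conninfo = 'host=master.example.com user=replication sslmode=disable'

# or, enforced from the master's side in pg_hba.conf:
hostnossl  replication  all  xx.xxx.xxx.xx/32  trust
```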
Also, zlib seems to be a contention point. If that profile is from a
standby, could it be that you're concurrently generating base backups?
Or compressing files in a archive_command?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services