Linux OOM-Killer

Started by bastiabout 12 years ago6 messagesgeneral
Jump to latest
#1basti
mailinglist@unix-solution.de

Hello,

we have a database master Version:
PostgreSQL 9.1.6 on x86_64-unknown-linux-gnu, compiled by gcc (Debian
4.7.2-2) 4.7.2, 64-bit
and a WAL-Replication Slave with hot-standby version:
PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian
4.7.2-5) 4.7.2, 64-bit.

Since a few days we had problems with the Linux OOM-Killer.
Some simple query that normally take around 6-7 minutes now takes 5 hours.
We did not change any configuration values the last days.

First of all I have set

vm.overcommit_memory=2
vm.overcommit_ratio=80

Here are some values of my DB-Master config, the Server has 32GB RAM and
is only for database, no other service.
Did anybody see some mistakes?

I'am not shure if work_mem, maintenance_work_mem and
effective_cache_size is set correct.

/etc/postgresql/9.1/main/postgresql.conf

max_connections = 200
ssl = true
shared_buffers = 6GB # min 128kB
work_mem = 192MB # min 64kB
maintenance_work_mem = 1GB # min 1MB
wal_level = hot_standby
synchronous_commit = off
wal_buffers = 16MB
checkpoint_segments = 16
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'rsync -a %p -e "ssh -i
/var/lib/postgresql/.ssh/id_rsa"
postgres@my_postgres_slave:/var/lib/postgresql/9.1/wals/dolly_main/%f
</dev/null'
max_wal_senders = 1
wal_keep_segments = 32
random_page_cost = 2.0
effective_cache_size = 22GB
default_statistics_target = 100
constraint_exclusion = off
join_collapse_limit = 1
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d.log'
log_min_duration_statement = 40000
log_lock_waits = on
track_counts = on
autovacuum = on
log_autovacuum_min_duration = 5000
autovacuum_max_workers = 4
datestyle = 'iso, dmy'
deadlock_timeout = 1s

Thanks a lot!
Basti

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: basti (#1)
Re: Linux OOM-Killer

Hi,

On 17 Březen 2014, 11:45, basti wrote:

Hello,

we have a database master Version:
PostgreSQL 9.1.6 on x86_64-unknown-linux-gnu, compiled by gcc (Debian
4.7.2-2) 4.7.2, 64-bit
and a WAL-Replication Slave with hot-standby version:
PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian
4.7.2-5) 4.7.2, 64-bit.

You're missing >18 months of fixes on the master (slightly less on the
slave).

Since a few days we had problems with the Linux OOM-Killer.
Some simple query that normally take around 6-7 minutes now takes 5 hours.
We did not change any configuration values the last days.

First of all I have set

vm.overcommit_memory=2
vm.overcommit_ratio=80

Here are some values of my DB-Master config, the Server has 32GB RAM and
is only for database, no other service.
Did anybody see some mistakes?

How much swap do you have?

I'am not shure if work_mem, maintenance_work_mem and
effective_cache_size is set correct.

That's hard to say. I don't see any immediate issue there, but it really
depends on your application. For example 200 connections with
work_mem=192MB may be dangerous if many connections are active at the same
time.

/etc/postgresql/9.1/main/postgresql.conf

max_connections = 200
ssl = true
shared_buffers = 6GB # min 128kB
work_mem = 192MB # min 64kB
maintenance_work_mem = 1GB # min 1MB
wal_level = hot_standby
synchronous_commit = off
wal_buffers = 16MB
checkpoint_segments = 16
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'rsync -a %p -e "ssh -i
/var/lib/postgresql/.ssh/id_rsa"
postgres@my_postgres_slave:/var/lib/postgresql/9.1/wals/dolly_main/%f
</dev/null'
max_wal_senders = 1
wal_keep_segments = 32
random_page_cost = 2.0
effective_cache_size = 22GB
default_statistics_target = 100
constraint_exclusion = off
join_collapse_limit = 1
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d.log'
log_min_duration_statement = 40000
log_lock_waits = on
track_counts = on
autovacuum = on
log_autovacuum_min_duration = 5000
autovacuum_max_workers = 4
datestyle = 'iso, dmy'
deadlock_timeout = 1s

So what does the query do? Show us explain plan (explain analyze would be
nice, but if it's running so slow).

Which kernel is this? When the OOM strikes, it should print detailed into
into the log - what does it say?

When you look at "top" output, which processes consume most memory? Are
there multiple backends consuming a lot of memory? What queries are they
running?

Assuming you have a monitoring system in place, collecting memory stats
(you should have that), what does it say about history? Is there a sudden
increase in consumed memory or something suspicious?

regards
Tomas

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3basti
mailinglist@unix-solution.de
In reply to: Tomas Vondra (#2)
Re: Linux OOM-Killer

uname -a
Linux h2085616 3.2.0-3-amd64 #1 SMP Mon Jul 23 02:45:17 UTC 2012 x86_64
GNU/Linux

At any time there are not more than 20-30 Connections at once.

Swap is disabled.
free -m
total used free shared buffers cached
Mem: 32215 16163 16051 0 40 14842
-/+ buffers/cache: 1281 30934
Swap: 0 0 0

With the updates there is a little bit tricky:
https://wiki.postgresql.org/wiki/Nov2013ReplicationIssue

If you are currently using 9.2.4, 9.1.9 or 9.0.13 and use any form of
builtin replication do not install the most recent update. Instead, wait
for the next update (9.2.6, 9.1.11 and 9.0.15) to come out.

Options for users who have already updated, or are running 9.3, include:

if you are using 9.2.5, 9.1.10 or 9.0.14, downgrade your replica
servers to the prior update release (9.2.4, 9.1.9 or 9.0.13).

On 17.03.2014 12:12, Tomas Vondra wrote:

Hi,

On 17 Březen 2014, 11:45, basti wrote:

Hello,

we have a database master Version:
PostgreSQL 9.1.6 on x86_64-unknown-linux-gnu, compiled by gcc (Debian
4.7.2-2) 4.7.2, 64-bit
and a WAL-Replication Slave with hot-standby version:
PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian
4.7.2-5) 4.7.2, 64-bit.

You're missing >18 months of fixes on the master (slightly less on the
slave).

Since a few days we had problems with the Linux OOM-Killer.
Some simple query that normally take around 6-7 minutes now takes 5 hours.
We did not change any configuration values the last days.

First of all I have set

vm.overcommit_memory=2
vm.overcommit_ratio=80

Here are some values of my DB-Master config, the Server has 32GB RAM and
is only for database, no other service.
Did anybody see some mistakes?

How much swap do you have?

I'am not shure if work_mem, maintenance_work_mem and
effective_cache_size is set correct.

That's hard to say. I don't see any immediate issue there, but it really
depends on your application. For example 200 connections with
work_mem=192MB may be dangerous if many connections are active at the same
time.

/etc/postgresql/9.1/main/postgresql.conf

max_connections = 200
ssl = true
shared_buffers = 6GB # min 128kB
work_mem = 192MB # min 64kB
maintenance_work_mem = 1GB # min 1MB
wal_level = hot_standby
synchronous_commit = off
wal_buffers = 16MB
checkpoint_segments = 16
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'rsync -a %p -e "ssh -i
/var/lib/postgresql/.ssh/id_rsa"
postgres@my_postgres_slave:/var/lib/postgresql/9.1/wals/dolly_main/%f
</dev/null'
max_wal_senders = 1
wal_keep_segments = 32
random_page_cost = 2.0
effective_cache_size = 22GB
default_statistics_target = 100
constraint_exclusion = off
join_collapse_limit = 1
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d.log'
log_min_duration_statement = 40000
log_lock_waits = on
track_counts = on
autovacuum = on
log_autovacuum_min_duration = 5000
autovacuum_max_workers = 4
datestyle = 'iso, dmy'
deadlock_timeout = 1s

So what does the query do? Show us explain plan (explain analyze would be
nice, but if it's running so slow).

Which kernel is this? When the OOM strikes, it should print detailed into
into the log - what does it say?

When you look at "top" output, which processes consume most memory? Are
there multiple backends consuming a lot of memory? What queries are they
running?

Assuming you have a monitoring system in place, collecting memory stats
(you should have that), what does it say about history? Is there a sudden
increase in consumed memory or something suspicious?

regards
Tomas

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: basti (#3)
Re: Linux OOM-Killer

On Mon, Mar 17, 2014 at 12:21:30PM +0100, basti wrote:

uname -a
Linux h2085616 3.2.0-3-amd64 #1 SMP Mon Jul 23 02:45:17 UTC 2012 x86_64
GNU/Linux

At any time there are not more than 20-30 Connections at once.

Swap is disabled.
free -m
total used free shared buffers cached
Mem: 32215 16163 16051 0 40 14842
-/+ buffers/cache: 1281 30934
Swap: 0 0 0

One really should add at least a bit of swap (and monitor
it closely) such that an out of physical RAM situation
does not amount to a hard limit.

It doesn't matter if it is slow.

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Laurenz Albe
laurenz.albe@cybertec.at
In reply to: basti (#3)
Re: Linux OOM-Killer

basti wrote:

Since a few days we had problems with the Linux OOM-Killer.
Some simple query that normally take around 6-7 minutes now takes 5 hours.
We did not change any configuration values the last days.

First of all I have set

vm.overcommit_memory=2
vm.overcommit_ratio=80

Swap is disabled.
free -m
total used free shared buffers cached
Mem: 32215 16163 16051 0 40 14842
-/+ buffers/cache: 1281 30934
Swap: 0 0 0

That together means that you cannot use more than 80% of your
RAM. Are you hitting that limit?

See the description of overcommit_ratio in
https://www.kernel.org/doc/Documentation/sysctl/vm.txt

I would definitely add some swap.

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Adrian Klaver
adrian.klaver@aklaver.com
In reply to: basti (#3)
Re: Linux OOM-Killer

On 03/17/2014 04:21 AM, basti wrote:

uname -a
Linux h2085616 3.2.0-3-amd64 #1 SMP Mon Jul 23 02:45:17 UTC 2012 x86_64
GNU/Linux

At any time there are not more than 20-30 Connections at once.

Swap is disabled.
free -m
total used free shared buffers cached
Mem: 32215 16163 16051 0 40 14842
-/+ buffers/cache: 1281 30934
Swap: 0 0 0

With the updates there is a little bit tricky:
https://wiki.postgresql.org/wiki/Nov2013ReplicationIssue

If you are currently using 9.2.4, 9.1.9 or 9.0.13 and use any form of
builtin replication do not install the most recent update. Instead, wait
for the next update (9.2.6, 9.1.11 and 9.0.15) to come out.

Options for users who have already updated, or are running 9.3, include:

if you are using 9.2.5, 9.1.10 or 9.0.14, downgrade your replica
servers to the prior update release (9.2.4, 9.1.9 or 9.0.13).

Well basically the above is saying (in your case) avoid 9.1.10 by either
staying below 9.1.10 or skipping over it to a higher version. FYI
currently the 9.1.x series is at 9.1.12

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general