Core system becoming unresponsive due to CPU load over 300
Hi to all!
We've got a problem with a very serious, repetitive incident on our core
system. Namely, CPU load spikes to 300-400 and the whole db becomes
unresponsive. From the db point of view nothing special is happening: memory
looks fine, disk I/O is OK, and the only problem is the huge CPU load. The
kernel parameters that increase with load are always the same:
* page tables size
* Committed_AS
* Active anon
<http://www.postgresql-archive.org/file/t342733/pagetables.png>
and the total number of connections is increasing very fast (but I suppose
that's a symptom, not the root cause, of the CPU load) and exceeds
max_connections (1000).
System:
* CentOS Linux release 7.2.1511 (Core)
* Linux 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux
* postgresql95-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-contrib-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-docs-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-libs-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-server-9.5.5-1PGDG.rhel7.x86_64
* 4 sockets/80 cores
* vm.dirty_background_bytes = 0
* vm.dirty_background_ratio = 2
* vm.dirty_bytes = 0
* vm.dirty_expire_centisecs = 3000
* vm.dirty_ratio = 20
* vm.dirty_writeback_centisecs = 500
After the first incident we changed:
* increased shared_buffers to 16GB (completely on huge pages; previously
2GB)
* adjusted vm.nr_hugepages to 8000 (we've got 2MB pages)
* changed vm.overcommit_memory = 2 and vm.overcommit_ratio = 99
* disabled transparent huge pages (they were unfortunately set to
'always' before)
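As a quick sanity check on that hugepage sizing (a back-of-the-envelope sketch; the actual shared memory request also includes some overhead beyond shared_buffers, so check the exact value PostgreSQL asks for at startup):

```shell
# 16GB of shared_buffers backed by 2MB huge pages (numbers from the list above)
shared_buffers_mb=$((16 * 1024))
hugepage_mb=2
pages_needed=$((shared_buffers_mb / hugepage_mb))
echo "huge pages needed for shared_buffers alone: $pages_needed"  # 8192
```

Note that 16GB / 2MB works out to 8192 pages, slightly above the 8000 configured, so it may be worth double-checking that the whole segment actually fits.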
It's a highly transactional db. Today I've run:
select now(), txid_current();
and the results:
3 339 351 transactions between 2017-10-10 14:42 and 2017-10-10 16:24
* db size 1.1TB
* RAM over 500GB
* biggest tables (the rest isn't big):
369 GB
48 GB
48 GB
34 GB
23 GB
19 GB
19 GB
17 GB
16 GB
12 GB
9910 MB
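For reference, a size listing like the one above can be produced with a query along these lines (a sketch using standard catalog functions; the LIMIT is arbitrary):

```sql
-- Largest tables by total size (heap + indexes + TOAST)
SELECT oid::regclass AS relation,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 11;
```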
We have captured some of db statistics, for instance bgwriter and
buffercache.
Today the load spikes happened at:
1). 10:44
2). 11:04
(and then several times during a day)
The first occurrence was yesterday around 6 PM.
What we observed back then was, for instance, an autovacuum process running
to prevent wraparound on the biggest table (369GB). We ran VACUUM FREEZE
manually after this happened, but before that we gathered statistics with
the query:
SELECT
oid::regclass::text AS table,
age(relfrozenxid) AS xid_age,
mxid_age(relminmxid) AS mxid_age,
least(
(SELECT setting::int
FROM pg_settings
WHERE name = 'autovacuum_freeze_max_age') - age(relfrozenxid),
(SELECT setting::int
FROM pg_settings
WHERE name = 'autovacuum_multixact_freeze_max_age') -
mxid_age(relminmxid)
) AS tx_before_wraparound_vacuum,
pg_size_pretty(pg_total_relation_size(oid)) AS size,
pg_stat_get_last_autovacuum_time(oid) AS last_autovacuum
FROM pg_class
ORDER BY tx_before_wraparound_vacuum;
and the biggest table which was vacuumed looked like:
xid_age=217310511, mxid_age=8156548, tx_before_wraparound_vacuum=-17310511,
size=369 GB, last_autovacuum=2017-09-30 01:57:33.972068+02
So, from the kernel stats we know that the failure happens when the db is
trying to allocate some huge number of pages (page tables size, anons,
Committed_AS). But what is triggering this situation?
I suppose it could be lazy autovacuum (just standard settings). So
autovacuum had to read the whole 369GB yesterday to freeze xids; today it
did the same on some other tables.
Another idea is a too-small shared_buffers setting.
Today it looked like:
<http://www.postgresql-archive.org/file/t342733/buffercache1040.png>
'c' means count; the number after 'c' is the usage count, so 'c5dirty' here
means the count of dirty pages with usagecount=5.
That is the snapshot before and after the failure at 10:44.
before and after the spike at 11:04:
<http://www.postgresql-archive.org/file/t342733/buffercache1104.png>
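Snapshots like these can be taken with the pg_buffercache extension, roughly as follows (a sketch; it needs CREATE EXTENSION pg_buffercache, and scanning the view takes buffer header locks, so it is not free on a busy system):

```sql
-- Buffer counts grouped by usage count and dirty flag;
-- e.g. the row (5, true, N) corresponds to "c5dirty" above.
SELECT usagecount, isdirty, count(*) AS buffers
FROM pg_buffercache
GROUP BY usagecount, isdirty
ORDER BY usagecount, isdirty;
```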
My interpretation is the following:
the count of clean buffers with a high usagecount is decreasing, while the
count of buffers with usagecount 0 and 1 is very unstable -> so the buffers
have no time to age in shared buffers and are thrown out?
bgwriter stats:
<http://www.postgresql-archive.org/file/t342733/bgwriter.png>
the biggest number of buffers is cleaned by backends - so there are no free
buffers with usagecount 0, and LWLock waits happen?
So would increasing shared_buffers be a solution?
Please help, it's happening quite often and I'm not sure which way is the
right one...
--
Sent from: http://www.postgresql-archive.org/PostgreSQL-general-f1843780.html
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Tue, Oct 10, 2017 at 2:40 PM, pinker <pinker@onet.eu> wrote:
Hi to all!
We've got a problem with a very serious, repetitive incident on our core
system. Namely, CPU load spikes to 300-400 and the whole db becomes
unresponsive. From the db point of view nothing special is happening: memory
looks fine, disk I/O is OK, and the only problem is the huge CPU load. The
kernel parameters that increase with load are always the same:
The solution here is to reduce the number of connections, usually via
some kind of connection pooling. Any db server will have its max
throughput at around the number of CPU cores == connections (give or
take a factor of 2). Outside that, performance falls off, and the curve has
a very sharp knee on the far side as the number of connections goes up.
Reduce connections and the db runs faster; increase them and it slows until
it eventually falls over.
pgbouncer and pgpool II are useful on the db end, look at pooling
options on the app side as well.
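For example, a minimal pgbouncer setup in transaction pooling mode might look like this (a config sketch; the database name, ports and pool sizes here are hypothetical and need tuning to the workload):

```ini
; pgbouncer.ini (sketch)
[databases]
coredb = host=127.0.0.1 port=5432 dbname=coredb

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling: a server connection is held only for one transaction
pool_mode = transaction
; keep server-side connections near the core count, not at max_connections
default_pool_size = 80
max_client_conn = 1000
```

The app can still open up to 1000 client connections, but only ~80 sessions hit PostgreSQL at once. Transaction pooling breaks session-level features (prepared statements, advisory locks, temp tables), so session mode may be required instead.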
Thank you Scott,
we are planning to do it today. But are you sure it will help in this case?
2017-10-10 23:40 GMT+03:00 pinker <pinker@onet.eu>:
We've got a problem with a very serious, repetitive incident on our core
system. Namely, CPU load spikes to 300-400 and the whole db becomes
unresponsive. From the db point of view nothing special is happening: memory
looks fine, disk I/O is OK, and the only problem is the huge CPU load. The
kernel parameters that increase with load are always the same:
Can you provide output of `iostat -myx 10` at the “peak” moments, please?
Also, it'd be good to look at more detailed bgwriter/checkpointer stats.
You can find more details in this post:
http://blog.postgresql-consulting.com/2017/03/deep-dive-into-postgres-stats_27.html
(You might want to reset 'shared' stats here.)
--
Victor Yegorov
On 10/10/2017 10:40 PM, pinker wrote:
Hi to all!
We've got a problem with a very serious, repetitive incident on our core
system. Namely, CPU load spikes to 300-400 and the whole db becomes
What is "CPU load"? Perhaps you mean "load average"?
Also, what are the basic system parameters (number of cores, RAM), it's
difficult to help without knowing that.
unresponsive. From the db point of view nothing special is happening: memory
looks fine, disk I/O is OK, and the only problem is the huge CPU load. The
kernel parameters that increase with load are always the same:
* page tables size
* Committed_AS
* Active anon
<http://www.postgresql-archive.org/file/t342733/pagetables.png>
and the total number of connections is increasing very fast (but I suppose
that's a symptom, not the root cause, of the CPU load) and exceeds
max_connections (1000).
I doubt you have 1000 cores in your system, so 1000 connections active
at the same time is guaranteed to cause issues. What we see quite often
is a minor hiccup (occasional slow query) snowballing into much more
serious trouble exactly because of this.
Queries get slower for some reason, application starts opening more
connections (through a built-in connection pool) to run more queries,
that further increases pressure, slows the queries even more, ...
As Scott suggested, you should consider using a connection pool.
System:
* CentOS Linux release 7.2.1511 (Core)
* Linux 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux
* postgresql95-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-contrib-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-docs-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-libs-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-server-9.5.5-1PGDG.rhel7.x86_64
* 4 sockets/80 cores
* vm.dirty_background_bytes = 0
* vm.dirty_background_ratio = 2
* vm.dirty_bytes = 0
* vm.dirty_expire_centisecs = 3000
* vm.dirty_ratio = 20
* vm.dirty_writeback_centisecs = 500
after the first incident we have changed:
* increased shared_buffers to 16GB (completely on huge pages. previously
2GB)
* adjusted vm.nr_hugepages to 8000 (we've got 2mb pages)
* changed vm.overcommit_memory = 2 and vm.overcommit_ratio = 99
* disabled transparent huge pages (they were set before unfortunately to
'always')
It's a highly transactional db. Today I've run:
select now(), txid_current();
and the results:
3 339 351 transactions between 2017-10-10 14:42 and 2017-10-10 16:24
Well, 3M transactions over ~2h period is just ~450tps, so nothing
extreme. Not sure how large the transactions are, of course.
... snip ...
So, from the kernel stats we know that the failure happens when the db is
trying to allocate some huge number of pages (page tables size, anons,
Committed_AS). But what is triggering this situation?
Something gets executed on the database. We have no idea what it is, but
it should be in the system logs. And you should see the process in 'top'
with large amounts of virtual memory ...
I suppose it could be lazy autovacuum (just standard settings). So
autovacuum had to read the whole 369GB yesterday to freeze xids; today it
did the same on some other tables.
Possible, but it shouldn't allocate more than maintenance_work_mem, so I
don't see why it would allocate so much virtual memory.
Another possibility is a run-away query that consumes a lot of work_mem.
Another idea is too small shared buffers setting.
... snip ...
bgwriter stats:
<http://www.postgresql-archive.org/file/t342733/bgwriter.png>
Yes, this suggests you probably have shared_buffers set too low, but
it's impossible to say if increasing the size will help - perhaps your
active set (part of DB you regularly access) is way too big.
Measure cache hit ratio (see pg_stat_database.blks_hit and blks_read),
and then you can decide.
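The ratio can be computed roughly like this (a sketch; pg_stat_database counters are cumulative since the last stats reset, so diff them over an interval for a current number):

```sql
-- Shared-buffers cache hit ratio per database
SELECT datname,
       blks_hit,
       blks_read,
       round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 4) AS hit_ratio
FROM pg_stat_database
WHERE datname NOT LIKE 'template%';
```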
You may also make the bgwriter more aggressive - that won't really
improve the hit ratio, it will only make enough room for the backends.
But I don't quite see how this could cause the severe problems you have,
as I assume this is kinda regular behavior on that system. Hard to say
without more data.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Victor Yegorov wrote
Can you provide output of `iostat -myx 10` at the “peak” moments, please?
sure, please find it here:
https://pastebin.com/f2Pv6hDL
Victor Yegorov wrote
Also, it'd be good to look in more detailed bgwriter/checkpointer stats.
You can find more details in this post:
http://blog.postgresql-consulting.com/2017/03/deep-dive-into-postgres-stats_27.html
(You might want to reset 'shared' stats here.)
thank you for the link, it's a really nice explanation. Here you'll find the
full bgwriter stats: https://pastebin.com/VA8pyfXj
On Tue, Oct 10, 2017 at 3:53 PM, pinker <pinker@onet.eu> wrote:
Victor Yegorov wrote
Can you provide output of `iostat -myx 10` at the “peak” moments, please?
sure, please find it here:
https://pastebin.com/f2Pv6hDL
Ouch, unless I'm reading that wrong, your IO subsystem seems to be REALLY slow.
Scott Marlowe-2 wrote
Ouch, unless I'm reading that wrong, your IO subsystem seems to be REALLY
slow.
it's a huge array where a lot is happening, for instance data snapshots :/
The LUN this db is on is dm-7.
I'm a DBA with zero knowledge about storage arrays, so any advice will be
much appreciated :)
2017-10-11 0:53 GMT+03:00 pinker <pinker@onet.eu>:
Can you provide output of `iostat -myx 10` at the “peak” moments, please?
sure, please find it here:
https://pastebin.com/f2Pv6hDL
Looks like `sdg` and `sdm` are the ones used most.
Can you describe what's on those devices? Do you have WAL and DB sitting
together?
Where are the DB log files stored?
Here you'll find the
full bgwriter stats: https://pastebin.com/VA8pyfXj
Can you, please, provide the output of this query (linked from the article
mentioned):
https://gist.github.com/lesovsky/4587d70f169739c01d4525027c087d14
And also this query:
SELECT name,version,source FROM pg_settings WHERE source NOT IN
('default','override');
--
Victor Yegorov
Tomas Vondra-4 wrote
What is "CPU load"? Perhaps you mean "load average"?
Yes, I wasn't exact: I meant system CPU usage. It can be seen here - this is
the graph from yesterday's failure (after 6 p.m.):
<http://www.postgresql-archive.org/file/t342733/cpu.png>
So as one can see, connection spikes follow CPU spikes...
Tomas Vondra-4 wrote
Also, what are the basic system parameters (number of cores, RAM), it's
difficult to help without knowing that.
I have actually written everything in the first post:
80 cores across 4 sockets
over 500GB RAM
Tomas Vondra-4 wrote
Well, 3M transactions over ~2h period is just ~450tps, so nothing
extreme. Not sure how large the transactions are, of course.
There's quite a lot going on. Most of them are complicated stored procedures.
Tomas Vondra-4 wrote
Something gets executed on the database. We have no idea what it is, but
it should be in the system logs. And you should see the process in 'top'
with large amounts of virtual memory ...
Yes, it would be much easier if it were just a single query at the top, but
most of the CPU is eaten by the system itself and I'm not sure why. I
suppose, because of the page tables size and anon pages, that it is NUMA
related.
Tomas Vondra-4 wrote
Another possibility is a run-away query that consumes a lot of work_mem.
It was exactly my first guess. work_mem is set to ~350MB and I see a lot of
stored procedures with unnecessary WITH clauses (i.e. materialization) and
right after them an IN query over those results (hash).
Tomas Vondra-4 wrote
Measure cache hit ratio (see pg_stat_database.blks_hit and blks_read),
and then you can decide.
Thank you for the tip. I always do it but hadn't here; the result is
0.992969610990056 - so increasing it is rather pointless.
Tomas Vondra-4 wrote
You may also make the bgwriter more aggressive - that won't really
improve the hit ratio, it will only make enough room for the backends.
yes, I probably will
Tomas Vondra-4 wrote
But I don't quite see how this could cause the severe problems you have,
as I assume this is kinda regular behavior on that system. Hard to say
without more data.
I can provide you with any data you need :)
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 10/10/2017 3:28 PM, pinker wrote:
It was exactly my first guess. work_mem is set to ~ 350MB and I see a lot of
stored procedures with unnecessary WITH clauses (i.e. materialization) and
right after it IN query with results of that (hash).
1000 connections all running queries that each need one work_mem allocation
will consume 1000*350MB == 350GB of your RAM, and many queries use several
work_mem's.
if the vast majority of your operations are OLTP and only access a few
rows, then a large work_mem is NOT a good idea. If you're doing large
aggregate operations like OLAP for reporting or whatever, then that's
another story, but generally that sort of thing does NOT use 1000
connections.
--
john r pierce, recycling bits in santa cruz
Victor Yegorov wrote
Looks like `sdg` and `sdm` are the ones used most.
Can you describe what's on those devices? Do you have WAL and DB sitting
together?
Where DB log files are stored?
it's multipath with the same LUN for PGDATA and pg_log, but a separate one
for xlogs and archives.
mpatha dm-4 IBM ,2145
size=2.0T features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| |- 7:0:1:2 sdg 8:96 active undef running
| `- 8:0:1:2 sdm 8:192 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
|- 7:0:0:2 sdd 8:48 active undef running
`- 8:0:0:2 sdj 8:144 active undef running
Victor Yegorov wrote
Can you, please, provide the output of this query (linked from the article
mentioned):
https://gist.github.com/lesovsky/4587d70f169739c01d4525027c087d14
00:26:51.226024|120 days
03:05:37.987175|0.6|7.99|300.63|0.46|12673500.4|162.00|0.34|0.51|0.37|1.22|26.721|27.7|41.8|30.6|4.47|34.27|--------------------------------------|21532|124|6510377185|9920323|449049896|677360078|2321057|495798075|0
Victor Yegorov wrote
And also this query:
SELECT name,version,source FROM pg_settings WHERE source NOT IN
('default','override');
application_name | client | psql
archive_command | configuration file | <deleted>
archive_mode | configuration file | on
autovacuum | configuration file | on
autovacuum_max_workers | configuration file | 10
checkpoint_completion_target | configuration file | 0.9
checkpoint_timeout | configuration file | 480
client_encoding | client | UTF8
DateStyle | configuration file | ISO, MDY
default_statistics_target | configuration file | 350
default_text_search_config | configuration file | pg_catalog.english
effective_cache_size | configuration file | 52428800
enable_indexscan | configuration file | on
huge_pages | configuration file | on
lc_messages | configuration file | en_US.UTF-8
lc_monetary | configuration file | en_US.UTF-8
lc_numeric | configuration file | en_US.UTF-8
lc_time | configuration file | en_US.UTF-8
listen_addresses | configuration file | *
log_autovacuum_min_duration | configuration file | 0
log_checkpoints | configuration file | on
log_connections | configuration file | on
log_destination | configuration file | stderr
log_directory | configuration file | pg_log
log_disconnections | configuration file | on
log_duration | configuration file | off
log_filename | configuration file | postgresql-%a.log
log_line_prefix | configuration file | %t [%p]: [%l-1] user=%u,db=%d
log_lock_waits | configuration file | on
log_min_duration_statement | configuration file | 0
log_rotation_age | configuration file | 1440
log_rotation_size | configuration file | 0
log_temp_files | configuration file | 0
log_timezone | configuration file | Poland
log_truncate_on_rotation | configuration file | on
logging_collector | configuration file | on
maintenance_work_mem | configuration file | 2097152
max_connections | configuration file | 1000
max_stack_depth | environment variable | 2048
max_wal_senders | configuration file | 10
max_wal_size | configuration file | 640
random_page_cost | configuration file | 1
shared_buffers | configuration file | 2097152
temp_buffers | configuration file | 16384
TimeZone | configuration file | Poland
track_functions | configuration file | all
track_io_timing | configuration file | off
wal_buffers | configuration file | 2048
wal_keep_segments | configuration file | 150
wal_level | configuration file | hot_standby
work_mem | configuration file | 393216+
On 10/11/2017 12:28 AM, pinker wrote:
Tomas Vondra-4 wrote
What is "CPU load"? Perhaps you mean "load average"?
Yes, I wasn't exact: I mean system cpu usage, it can be seen here - it's the
graph from yesterday's failure (after 6p.m.):
<http://www.postgresql-archive.org/file/t342733/cpu.png>
So as one can see connections spikes follow cpu spikes...
I'm probably a bit dumb (after all, it's 1AM over here), but can you
explain the CPU chart? I'd understand percentages (say, 75% CPU used)
but what do the seconds / fractions mean? E.g. when the system time
reaches 5 seconds, what does that mean?
Tomas Vondra-4 wrote
Also, what are the basic system parameters (number of cores, RAM), it's
difficult to help without knowing that.
I have actually written everything in the first post:
80 CPU and 4 sockets
over 500GB RAM
Apologies, missed that bit.
Tomas Vondra-4 wrote
Well, 3M transactions over ~2h period is just ~450tps, so nothing
extreme. Not sure how large the transactions are, of course.
There's quite a lot going on. Most of them are complicated stored procedures.
OK.
Tomas Vondra-4 wrote
Something gets executed on the database. We have no idea what it is, but
it should be in the system logs. And you should see the process in 'top'
with large amounts of virtual memory ...
Yes, it would be much easier if it were just a single query at the top, but
most of the CPU is eaten by the system itself and I'm not sure why. I
suppose, because of the page tables size and anon pages, that it is NUMA
related.
Have you tried profiling using perf? That usually identifies hot spots
pretty quickly - either in PostgreSQL code or in the kernel.
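A minimal perf session could look like this (a sketch; it needs the perf package and usually root, and at a modest sampling rate the overhead is small):

```shell
# System-wide CPU profile for 30 seconds at ~99 samples/s per CPU,
# with call graphs so kernel hot spots show their callers
perf record -a -g -F 99 -- sleep 30
# Summarize the hottest functions (kernel and user space) from perf.data
perf report --sort symbol
```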
Tomas Vondra-4 wrote
Another possibility is a run-away query that consumes a lot of work_mem.
It was exactly my first guess. work_mem is set to ~ 350MB and I see a lot of
stored procedures with unnecessary WITH clauses (i.e. materialization) and
right after it IN query with results of that (hash).
Depends on how much data is in the CTEs. We don't really allocate all of
work_mem at once, but bit by bit.
Tomas Vondra-4 wrote
Measure cache hit ratio (see pg_stat_database.blks_hit and blks_read),
and then you can decide.
Thank you for the tip. I always do it but hadn't here; the result is
0.992969610990056 - so increasing it is rather pointless.
Yeah.
Tomas Vondra-4 wrote
You may also make the bgwriter more aggressive - that won't really
improve the hit ratio, it will only make enough room for the backends.
yes, I probably will
On the other hand, the numbers are rather low. I mean, the backends
seem to be evicting ~15k buffers over a 5-minute period, which is pretty
much nothing (~400kB/s). I wouldn't bother tuning this.
Tomas Vondra-4 wrote
But I don't quite see how this could cause the severe problems you have,
as I assume this is kinda regular behavior on that system. Hard to say
without more data.
I can provide you with any data you need :)
What I meant is that if the system evicts this amount of buffers all the
time (i.e. there doesn't seem to be any sudden spike), then it's
unlikely to be the cause (or related to it).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2017-10-10 13:40:07 -0700, pinker wrote:
and the total number of connections is increasing very fast (but I suppose
that's a symptom, not the root cause, of the CPU load) and exceeds
max_connections (1000).
Others mentioned already that that's worth improving.
System:
* CentOS Linux release 7.2.1511 (Core)
* Linux 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux
Some versions of this kernel have had serious problems with transparent
hugepages. I'd try turning that off. I think it defaults to off even in
that version, but also make sure zone_reclaim_mode is disabled.
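Both can be checked quickly like this (a sketch; the paths are the standard ones on RHEL/CentOS 7, and the runtime changes shown in the comments need root and should also be made persistent, e.g. via tuned or sysctl.conf):

```shell
# Current THP mode; the bracketed value is active, e.g. "always madvise [never]"
cat /sys/kernel/mm/transparent_hugepage/enabled
# zone_reclaim_mode should report 0 (disabled)
cat /proc/sys/vm/zone_reclaim_mode
# To disable both at runtime (as root):
#   echo never > /sys/kernel/mm/transparent_hugepage/enabled
#   sysctl -w vm.zone_reclaim_mode=0
```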
* postgresql95-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-contrib-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-docs-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-libs-9.5.5-1PGDG.rhel7.x86_64
* postgresql95-server-9.5.5-1PGDG.rhel7.x86_64
* 4 sockets/80 cores
9.6 has quite a few scalability improvements over 9.5. I don't know
whether it's feasible for you to update, but if so, it's worth trying.
How about taking a perf profile to investigate?
* vm.dirty_background_bytes = 0
* vm.dirty_background_ratio = 2
* vm.dirty_bytes = 0
* vm.dirty_expire_centisecs = 3000
* vm.dirty_ratio = 20
* vm.dirty_writeback_centisecs = 500
I'd suggest monitoring /proc/meminfo for the amount of Dirty and
Writeback memory, and see whether rapid changes therein coincide with
periods of slowdown.
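A simple way to watch those two counters (a sketch; the interval and iteration count are arbitrary - in practice you'd leave it looping during an incident):

```shell
# Sample Dirty and Writeback from /proc/meminfo a few times, 5 seconds apart
for i in 1 2 3; do
    date '+%H:%M:%S'
    grep -E '^(Dirty|Writeback):' /proc/meminfo
    sleep 5
done
```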
Greetings,
Andres Freund
Tomas Vondra-4 wrote
I'm probably a bit dumb (after all, it's 1AM over here), but can you
explain the CPU chart? I'd understand percentages (say, 75% CPU used)
but what do the seconds / fractions mean? E.g. when the system time
reaches 5 seconds, what does that mean?
hehe, no, you've just spotted a mistake - it's supposed to be 50 cores :)
out of 80 in total
Tomas Vondra-4 wrote
Have you tried profiling using perf? That usually identifies hot spots
pretty quickly - either in PostgreSQL code or in the kernel.
I was always afraid of the overhead, but maybe it's time to start ...
Tomas Vondra-4 wrote
What I meant is that if the system evicts this amount of buffers all the
time (i.e. there doesn't seem to be any sudden spike), then it's
unlikely to be the cause (or related to it).
I was actually thinking about a scenario where different sessions want to
read/write from or to many different relfilenodes at the same time - what
could cause page swapping between shared buffers and the OS cache? We see
that context switches on the CPU are increasing as well. The kernel
documentation says that using page tables instead of the Translation
Lookaside Buffer (TLB) is very costly, and on some blogs I have seen
recommendations that using huge pages (so more addresses can fit in the
TLB) will help here, but PostgreSQL, unlike Oracle, cannot use them for
anything other than page buffering (so 16GB) ... so process memory still
needs to use 4k pages.
Or memory fragmentation?
Andres Freund wrote
Others mentioned already that that's worth improving.
Yes, we are just setting up pgbouncer
Andres Freund wrote
Some versions of this kernel have had serious problems with transparent
hugepages. I'd try turning that off. I think it defaults to off even in
that version, but also make sure zone_reclaim_mode is disabled.
Yes, I'm aware of that, so I always set it to 'never'.
But thank you for the zone_reclaim_mode tip.
Andres Freund wrote
9.6 has quite some scalability improvements over 9.5. I don't know
whether it's feasible for you to update, but if so, it's worth trying.
How about taking a perf profile to investigate?
Both are on my to do list :)
Andres Freund wrote
I'd suggest monitoring /proc/meminfo for the amount of Dirty and
Writeback memory, and see whether rapid changes therein coincide with
periods of slowdown.
yes, I was monitoring it the whole day, and that's the reason why I changed
dirty_background_ratio, but both of them were flat - without any bigger
spikes.
On Tue, Oct 10, 2017 at 01:40:07PM -0700, pinker wrote:
Hi to all!
We've got a problem with a very serious, repetitive incident on our core
system. Namely, CPU load spikes to 300-400 and the whole db becomes
unresponsive. From the db point of view nothing special is happening: memory
looks fine, disk I/O is OK, and the only problem is the huge CPU load. The
kernel parameters that increase with load are always the same:
* disabled transparent huge pages (they were set before unfortunately to
'always')
Did you also try disabling KSM?
echo 2 |sudo tee /sys/kernel/mm/ksm/run
I believe for us that was affecting a postgres VM (QEMU/KVM) and maybe not
postgres itself. Worth a try?
/messages/by-id/20170718180152.GE17566@telsasoft.com
Justin
On 10/11/2017 02:26 AM, pinker wrote:
Tomas Vondra-4 wrote
I'm probably a bit dumb (after all, it's 1AM over here), but can you
explain the CPU chart? I'd understand percentages (say, 75% CPU used)
but what do the seconds / fractions mean? E.g. when the system time
reaches 5 seconds, what does that mean?
hehe, no, you've just spotted a mistake - it's supposed to be 50 cores :)
out of 80 in total
Ah, so it should say '50 cores' instead of '5s'? Well, that's a busy
system I guess.
Tomas Vondra-4 wrote
Have you tried profiling using perf? That usually identifies hot spots
pretty quickly - either in PostgreSQL code or in the kernel.
I was always afraid of the overhead, but maybe it's time to start ...
I don't follow. If you're not in trouble, a little bit of additional
overhead is not an issue (but you generally don't need profiling at that
moment). If you're already in trouble, then spending a bit of CPU time
on basic CPU profile is certainly worth it.
Tomas Vondra-4 wrote
What I meant is that if the system evicts this amount of buffers all the
time (i.e. there doesn't seem to be any sudden spike), then it's
unlikely to be the cause (or related to it).
I was actually thinking about a scenario where different sessions want to
read/write from or to many different relfilenodes at the same time - what
could cause page swapping between shared buffers and the OS cache?
Perhaps. If the sessions only do reads, that would not be visible in
buffers_backend, I believe (not sure ATM, would have to check the source).
But it'd be visible in buffers_alloc and certainly in blks_read.
We see that context switches on the CPU are increasing as well. The kernel
documentation says that using page tables instead of the Translation
Lookaside Buffer (TLB) is very costly, and on some blogs I have seen
recommendations that using huge pages (so more addresses can fit in the
TLB) will help here, but PostgreSQL, unlike Oracle, cannot use them for
anything other than page buffering (so 16GB) ... so process memory still
needs to use 4k pages.
The context switches are likely due to large number of runnable
processes competing for the CPU.
Also, memory bandwidth is increasingly an issue on big boxes ...
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Oct 10, 2017 at 4:28 PM, pinker <pinker@onet.eu> wrote:
Yes, it would be much easier if it were just a single query at the top,
but most of the CPU is eaten by the system itself and I'm not sure why.
You are experiencing a context switch storm. The OS is spending so
much time switching between 1,000+ processes that it doesn't have
any time left to do much else.
Any chance this situation has something to do with the latest reveal of the
Meltdown vulnerability?
https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html