General performance/load issue
Hello everyone,
I'm having some trouble with a PostgreSQL server.
We're using PG as a database backend for a very big website (lots of data
and a lot of traffic).
The issue: the server suddenly (about 1 hour after a restart) becomes slow (queries stop
responding), load rises (>20 instead of 1), and iowait rises (20 to 70%).
Version : 9.0.5
Server : Dual Xeon X5650 (24 cores total)
Memory : 48 GB
Disks : SSD
Top when overloaded :
21537 postgres 20 0 6420m 899m 892m D 22 1.9 0:01.86 postgres
21534 postgres 20 0 6554m 1.3g 1.1g D 14 2.7 0:01.72 postgres
21518 postgres 20 0 6419m 2.1g 2.1g D 10 4.6 0:02.10 postgres
21038 postgres 20 0 6716m 3.8g 3.5g D 10 8.0 0:45.46 postgres
21103 postgres 20 0 6571m 3.7g 3.6g D 9 7.8 0:32.19 postgres
21079 postgres 20 0 6575m 3.8g 3.7g D 8 8.1 0:36.39 postgres
21359 postgres 20 0 6563m 1.9g 1.8g D 4 4.1 0:09.10 postgres
21422 postgres 20 0 6563m 1.9g 1.7g D 4 3.9 0:08.34 postgres
19656 postgres 20 0 6727m 5.3g 5.0g D 3 11.1 1:58.25 postgres
21418 postgres 20 0 6685m 2.1g 1.9g D 2 4.5 0:12.42 postgres
21413 postgres 20 0 6693m 2.1g 1.8g D 2 4.4 0:11.06 postgres
21541 postgres 20 0 6421m 719m 711m D 1 1.5 0:00.48 postgres
14044 postgres 20 0 6418m 755m 750m D 1 1.6 0:04.71 postgres
21326 postgres 20 0 6685m 2.2g 2.0g D 1 4.7 0:15.82 postgres
21031 postgres 20 0 6688m 3.6g 3.4g D 1 7.7 0:44.18 postgres
21055 postgres 20 0 6575m 4.1g 3.9g D 1 8.6 0:39.11 postgres
21357 postgres 20 0 6693m 2.3g 2.0g D 1 4.9 0:13.74 postgres
Memory when overloaded :
~ # free -m
total used free shared buffers cached
Mem: 48339 47087 1251 0 248 38720
-/+ buffers/cache: 8118 40220
Swap: 8190 1 8189
Postgresql.conf :
max_connections = 50
shared_buffers = 12G
temp_buffers = 40MB
work_mem = 128MB
maintenance_work_mem = 256MB
effective_cache_size = 12GB
max_files_per_process = 8192
fsync = off
checkpoint_segments = 256
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9
seq_page_cost = 2.0
random_page_cost = 2.0
Did I do anything wrong? Any idea?
Regards,
Gaëtan
On 24 November 2011, 14:51, Gaëtan Allart wrote:
Hello everyone,
I'm having some trouble with a PostgreSQL server.
We're using PG as a database backend for a very big website (lots of data
and a lot of traffic).
The issue: the server suddenly (about 1 hour after a restart) becomes slow (queries stop
responding), load rises (>20 instead of 1), and iowait rises (20 to 70%).
Version : 9.0.5
Server : Dual Xeon X5650 (24 cores total)
Memory : 48 GB
Disks : SSD
Top when overloaded :
Top is not the most useful tool here, I guess. Use "iotop" (will show you
which processes are doing the I/O) and tools like vmstat / iostat.
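For illustration, the kind of commands being suggested here (a sketch, not from the thread; iotop generally needs root, and the 5-second interval is just an example):
iotop -o     # show only the processes/threads actually doing I/O
iostat -x 5  # extended per-device utilization stats every 5 seconds
vmstat 5     # overall memory / swap / block I/O / CPU summary every 5 seconds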
Postgresql.conf :
max_connections = 50
shared_buffers = 12G
temp_buffers = 40MB
work_mem = 128MB
maintenance_work_mem = 256MB
max_files_per_process = 8192
checkpoint_segments = 256
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9
Fine. Let's see the options that look suspicious.
effective_cache_size = 12GB
Why have you set it like this? According to the "free" output you've
posted the cache has about 38G, so why just 12G here? There are possible
reasons, but I don't think this is the case.
fsync = off
A really bad idea. I guess your data are worthless to you, right?
seq_page_cost = 2.0
random_page_cost = 2.0
Eh? First of all, what really matters is the relative value of those two
settings, and it's a good habit to leave seq_page_cost = 1.0 and change just
the other values.
Plus random I/O is not as cheap as sequential I/O even on SSD drives,
so I'd recommend something like this:
seq_page_cost = 1.0
random_page_cost = 2.0 (or maybe 1.5)
Anyway this needs to be tested properly - watch the performance and tune
if needed.
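As a sketch of how such testing could be done (not part of the original thread; the database name "mydb" and the test query are placeholders):
psql -d mydb <<'SQL'
-- try the suggested costs for this session only and compare plans/timings;
-- once validated, put the same values in postgresql.conf and reload
SET seq_page_cost = 1.0;
SET random_page_cost = 1.5;
EXPLAIN ANALYZE SELECT 1;   -- replace with a representative query
SQL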
Did I do anything wrong? Any idea?
Not sure. My guess is you're getting bitten by a checkpoint. We need to
know a few more details.
1) What is dirty_background_ratio / dirty_ratio (see /proc/sys/vm/ directory)
2) enable log_checkpoints in postgresql.conf and see how it correlates to
the bad performance
3) check which processes are responsible for the I/O (use iotop)
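A sketch of how to collect (1) and (2) (the config file and data directory paths are assumptions; adjust them to your installation):
cat /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio
echo "log_checkpoints = on" >> /path/to/postgresql.conf  # hypothetical path; the last setting in the file wins
pg_ctl reload -D /path/to/data                           # log_checkpoints only needs a reload, not a restart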
Tomas
Hi Tomas,
I will be using iotop ;)
Right now, most I/O comes from "postgres: wal writer process".
-> effective_cache_size
Okay, I'll raise it to 32GB.
-> fsync : changed to on ;)
-> seq_pages : I'll run tests. Thanks.
-> dirty :
cat /proc/sys/vm/dirty_ratio
20
cat /proc/sys/vm/dirty_background_ratio
10
Thanks a lot Tomas. You're really helpful!
Gaëtan
On 24 November 2011, 15:27, Gaëtan Allart wrote:
Hi Tomas,
I will be using iotop ;)
Right now, most I/O comes from "postgres: wal writer process".
What do you mean by "most I/O" - how much data is it writing? Is there a
vacuum running at the same time? What other processes are doing I/O?
Post a few lines of "iostat -x 5" so we know what kind of I/O we're
dealing with.
-> dirty :
cat /proc/sys/vm/dirty_ratio
20
cat /proc/sys/vm/dirty_background_ratio
10
This means the cache may contain up to 3.2GB of dirty data before the system
starts to write it out in the background, and about 6.4GB before the
processes can't use the write cache any more.
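The arithmetic behind those figures, as a sketch (the ~32GB of page-cache-eligible memory is an assumption, roughly matching the "cached" column of the free output above):
echo "scale=1; 32 * 0.10" | bc   # ~3.2 GB: background writeback starts (dirty_background_ratio = 10)
echo "scale=1; 32 * 0.20" | bc   # ~6.4 GB: writers get throttled (dirty_ratio = 20)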
What about /proc/sys/vm/dirty_expire_centisecs?
Is there something interesting in the postgresql.log? Autovacuum maybe?
Tomas
On Thu, Nov 24, 2011 at 9:09 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
Fine. Let's see the options that look suspicious.
I think you missed some suspicious settings... I'd recommend setting
shared_buffers to 8GB, and I'd likely reduce checkpoint_segments to
30 and set the checkpoint timeout back to 5 minutes. Everything about
the way this server is configured (including those vm settings) is
pushing it towards delaying the WAL/buffer/checkpoint work as long as
possible, which matches with the idea of good performance initially,
followed by a period of poor performance and heavy I/O.
On a side note, I'd guess your work_mem is probably too high. 50
(connections) x 128 (MB work_mem) x 2 (sorts per query) = 12GB RAM,
which is 25% of the total RAM on the box. That doesn't necessarily mean
game over, but it seems like it wouldn't be that hard to end up thrashing
with that setup. YMMV.
Robert Treat
conjecture: xzilla.net
consulting: omniti.com
On 24 November 2011, 16:39, Robert Treat wrote:
I think you missed some suspicious settings... I'd recommend setting
shared_buffers to 8GB, and I'd likely reduce checkpoint_segments to
30 and set the checkpoint timeout back to 5 minutes. Everything about
the way this server is configured (including those vm settings) is
pushing it towards delaying the WAL/buffer/checkpoint work as long as
possible, which matches with the idea of good performance initially,
followed by a period of poor performance and heavy I/O.
Yes, checkpoints were my first thought too. OTOH the OP reported that most
of the I/O is caused by WAL writer - that's not exactly the part that does
the work during checkpoint. Plus the WAL may not be postponed, as it's
usually O_DIRECT and fsynced, right.
You're right that the writes are postponed, but I generally see that as a
good thing when combined with spread checkpoints. And even with those vm
settings (about 3.2GB for background writes), I wouldn't expect this
behaviour (because the page cache usually expires after 30 seconds). Say
100% of the shared buffers are dirty and need to be written. You
have 27 minutes (30*0.9) to do that - that means about 8MB/s. With a 30
second expiry there might be about 240MB accumulated before pdflush starts to
write the data to the SSD. And the SSD can surely handle more than 50MB/s. So
why the long delay? The question is what else is going on there.
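The rough arithmetic behind the ~8MB/s figure (a sketch using the settings quoted above: 12GB of shared_buffers, a 30-minute timeout and a 0.9 completion target):
echo "scale=1; (12 * 1024) / (30 * 60 * 0.9)" | bc   # prints 7.5 - roughly the "about 8MB/s" mentioned above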
But all this is just guessing - I want to see the log_checkpoint message,
iostat results etc.
On a side note, I'd guess your work_mem is probably too high. 50
(connections) x 128 (MB work_mem) x 2 (sorts per query) = 12GB RAM,
which is 25% of the total RAM on the box. That doesn't necessarily mean
game over, but it seems like it wouldn't be that hard to end up thrashing
with that setup. YMMV.
This is one of the reasons why effective_cache_size should be lower than
32GB, probably ...
Tomas
Hi Robert,
Thanks for your help as well.
You're right about checkpoints: it runs pretty well at the start and then
runs into heavy I/O.
I've changed these settings and also reduced work_mem a little and
reduced effective_cache_size, btw.
LOG: parameter "work_mem" changed to "96MB"
LOG: parameter "effective_cache_size" changed to "24GB"
LOG: parameter "checkpoint_segments" changed to "40"
LOG: parameter "checkpoint_timeout" changed to "5min"
Apparently, it's been running fine since I made the first changes
recommended by Tomas. Let's wait for a couple of hours again to confirm
this.
Gaëtan
Tomas,
I've enabled logging of checkpoints.
I'm waiting for the next I/O crisis…
Gaëtan
On 24 November 2011 17:02, Tomas Vondra <tv@fuzzy.cz> wrote:
This is one of the reasons why effective_cache_size should be lower than
32GB, probably ...
According to the 'free' output, 38GB is what is there right now.
effective_cache_size is just informative, so you can set it to 1TB
without any memory issue.
And it is OS cache + PG cache.
There is not enough information yet to be sure what's happening.
The log_checkpoints output will help for sure.
--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation
Finally, it "crashed" againŠ :-(
Here's the output of iotop while databased was inaccessible :
32361 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(34847) idle
32244 be/4 postgres 163.48 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(34660) SELECT
32045 be/4 postgres 7.78 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(33765) SELECT
32158 be/4 postgres 7.78 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(34112) SELECT
32242 be/4 postgres 7.78 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(34632) SELECT
32372 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(38858) idle in transaction
32231 be/4 postgres 15.57 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(38602) SELECT
28811 be/4 postgres 3.89 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(40594) SELECT
32190 be/4 postgres 3.89 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(38497) SELECT
And the latest logs :
LOG: checkpoint complete: wrote 3192 buffers (0.2%); 0 transaction log
file(s) added, 1 removed, 0 recycled; write=262.636 s, sync=135.456 s,
total=416.630 s
LOG: checkpoint complete: wrote 716 buffers (0.0%); 0 transaction log
file(s) added, 1 removed, 0 recycled; write=165.497 s, sync=17.111 s,
total=193.199 s
WARNING: pgstat wait timeout
STATEMENT: SELECT 'commit',sum(pg_stat_get_db_xact_commit(oid)) FROM
pg_database WHERE datname=$1
UNION ALL
SELECT
'rollback',sum(pg_stat_get_db_xact_rollback(oid)) FROM pg_database WHERE
datname=$2
LOG: checkpoint complete: wrote 699 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=203.023 s, sync=119.037 s,
total=335.012 s
LOG: checkpoint complete: wrote 348 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=120.412 s, sync=0.020 s,
total=120.435 s
Does this help?
Gaëtan
On 24 November 2011 20:38, Gaëtan Allart <gaetan@nexylan.com> wrote:
Finally, it "crashed" againŠ :-(
Here's the output of iotop while databased was inaccessible :
32361 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(34847) idle
32244 be/4 postgres 163.48 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(34660) SELECT
32045 be/4 postgres 7.78 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(33765) SELECT
32158 be/4 postgres 7.78 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(34112) SELECT
32242 be/4 postgres 7.78 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(34632) SELECT
32372 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(38858) idle in transaction
32231 be/4 postgres 15.57 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(38602) SELECT
28811 be/4 postgres 3.89 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(40594) SELECT
32190 be/4 postgres 3.89 K/s 0.00 B/s 0.00 % 99.99 % postgres: mydb
mydb host(38497) SELECTAnd the latest logs :
LOG: checkpoint complete: wrote 3192 buffers (0.2%); 0 transaction log
file(s) added, 1 removed, 0 recycled; write=262.636 s, sync=135.456 s,
total=416.630 s
LOG: checkpoint complete: wrote 716 buffers (0.0%); 0 transaction log
file(s) added, 1 removed, 0 recycled; write=165.497 s, sync=17.111 s,
total=193.199 sWARNING: pgstat wait timeout
STATEMENT: SELECT 'commit',sum(pg_stat_get_db_xact_commit(oid)) FROM
pg_database WHERE datname=$1
UNION ALL
SELECT
'rollback',sum(pg_stat_get_db_xact_rollback(oid)) FROM pg_database WHERE
datname=$2LOG: checkpoint complete: wrote 699 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=203.023 s, sync=119.037 s,
total=335.012 s
LOG: checkpoint complete: wrote 348 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=120.412 s, sync=0.020 s,
total=120.435 sDoes this help?
Yes.
It seems you have an issue with your checkpoint syncing time; it is
fixed in 9.1 and backported to 9.0 here:
http://projects.2ndquadrant.com/backports
It is possible you have other problems that explain the issue you
have. An immediate solution before trying a patch is to reduce your
shared_buffers setting to something very low, like 1GB.
--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation
On 24 November 2011, 23:19, Cédric Villemain wrote:
On 24 November 2011 20:38, Gaëtan Allart wrote:
Finally, it "crashed" again… :-(
Here's the output of iotop while the database was inaccessible:
This iotop output is rather incomplete - it does not show which of the
processes actually did the I/O. The fact that a process does 99% of the
I/O and yet performs no reads or writes is a bit strange.
Use "iotop -o" and post the result.
It seems you have an issue with your checkpoint syncing time; it is
fixed in 9.1 and backported to 9.0 here:
http://projects.2ndquadrant.com/backports
People generally don't want to apply backports on their own, especially
when it's a production server and when it's unclear it actually fixes the
issue they have. I'm not sure about that.
It is possible you have other problems that explain the issue you
have. An immediate solution before trying a patch is to reduce your
shared_buffers setting to something very low, like 1GB.
Well, a low shared_buffers setting was used especially before 8.3, when
spread checkpoints were not available. It prevents the I/O overload when
the database suddenly decides to write all of the dirty buffers. But he's
on 9.0 (so he already has spread checkpoints).
Plus the number of buffers he's writing is negligible - usually about 700
buffers (6MB), 3192 buffers (25MB) at most. That surely should not be a
problem for the SSD he's using.
Tomas
On 25 November 2011 11:25, Tomas Vondra <tv@fuzzy.cz> wrote:
People generally don't want to apply backports on their own, especially
when it's a production server and when it's unclear it actually fixes the
issue they have. I'm not sure about that.
I agree that most people don't want to do that themselves, but if it
happens to be the solution they can proceed or ask someone to do it.
People want to see their production system back to a normal situation;
here the limited information is not enough to be sure, but the
checkpoint sync times are clear: the sync times are not right.
It is very probable that compacting the fsyncs will help, but it is not
certain it is required yet.
Well, a low shared_buffers setting was used especially before 8.3, when
spread checkpoints were not available. It prevents the I/O overload when
the database suddenly decides to write all of the dirty buffers. But he's
on 9.0 (so he already has spread checkpoints).
It is a different animal here.
Plus the number of buffers he's writing is negligible - usually about 700
buffers (6MB), 3192 buffers (25MB) at most. That surely should not be a
problem for the SSD he's using.
See the blog entry from Greg Smith:
http://blog.2ndquadrant.com/en/2011/06/backporting-and-checkpoint-tro.html
And the slides of his talk at PGCon 2011:
http://www.2ndquadrant.com/static/2quad/media/pdfs/talks/WriteStuff-PGCon2011.pdf
I was just pointing out that there are known issues in this area, with
known solutions.
Getting more information on vacuum activity, bgwriter activity should help too.
Gaëtan, do you still have critical performance issues, or is it back to
a normal/loaded (but not overloaded) situation?
--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation
On 25 November 2011, 12:43, Cédric Villemain wrote:
I agree that most people don't want to do that themselves, but if it
happens to be the solution they can proceed or ask someone to do it.
People want to see their production system back to a normal situation;
here the limited information is not enough to be sure, but the
checkpoint sync times are clear: the sync times are not right.
It is very probable that compacting the fsyncs will help, but it is not
certain it is required yet.
Yes, the sync times are quite crazy. Especially given the tiny number of
buffers to be written and the fact that the SSD should handle random I/O
quite well.
See the blog entry from Greg Smith:
http://blog.2ndquadrant.com/en/2011/06/backporting-and-checkpoint-tro.html
And the slides of his talk at PGCon 2011:
http://www.2ndquadrant.com/static/2quad/media/pdfs/talks/WriteStuff-PGCon2011.pdf
I was just pointing out that there are known issues in this area, with
known solutions.
Thanks for the links, interesting stuff. Still, my impression is that the
SSD is stressed by something else, and the high fsync values during a
checkpoint are merely a symptom. So fixing a checkpoint (using the
backpatch) won't actually fix the issue. But I'm just guessing here.
Getting more information on vacuum activity, bgwriter activity should help
too.
Yes, that'd be nice. Gaëtan, can you post bgwriter-related options from
postgresql.conf and two snapshots from pg_stat_bgwriter (say 5 minutes
apart, collected when the db is slow)? A complete 'iotop -o' output would
be nice too.
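For instance, something along these lines would capture the two snapshots (a sketch; connection options are assumptions):
psql -U postgres -c "SELECT *, now() FROM pg_stat_bgwriter;"
sleep 300   # wait ~5 minutes while the server is misbehaving
psql -U postgres -c "SELECT *, now() FROM pg_stat_bgwriter;"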
BTW what filesystem are we talking about? What mount options have you used?
What about /proc/sys/vm/dirty_expire_centisecs?
Tomas
Hello Tomas and Cédric,
Right now, the server is not all right. Load is above 30 and queries are
slow like hell.
Here's the complete iotop. Note the 71 MB/s writes (apparently on SELECT
queries).
Total DISK READ: 633.35 K/s | Total DISK WRITE: 71.06 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
27352 be/4 postgres 15.64 K/s 86.01 K/s 0.00 % 99.99 % postgres:
database database 176.31.228.6(38816) SELECT
20226 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 99.99 % postgres:
database database 176.31.228.6(34166) SELECT
26950 be/4 postgres 23.46 K/s 0.00 B/s 0.00 % 82.14 % postgres:
database database 46.105.104.205(40820) SELECT
23160 be/4 postgres 3.91 K/s 0.00 B/s 0.00 % 81.14 % postgres:
database database 46.105.104.205(58091) SELECT
29184 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 79.17 % postgres:
database database 46.105.104.205(51047) SELECT
27271 be/4 postgres 23.46 K/s 234.58 K/s 0.00 % 77.15 % postgres:
database database 46.105.104.205(42315) SELECT
28224 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 74.09 % postgres:
database database 46.105.104.205(49871) SELECT
27450 be/4 postgres 23.46 K/s 0.00 B/s 0.00 % 70.12 % postgres:
database database 46.105.104.205(49096) SELECT
29206 be/4 postgres 121.20 K/s 0.00 B/s 0.00 % 62.99 % postgres:
database database 176.31.228.6(39787) SELECT
8202 be/4 postgres 129.02 K/s 0.00 B/s 0.00 % 60.59 % postgres:
database database 91.121.89.14(57291) SELECT
29209 be/4 postgres 3.91 K/s 0.00 B/s 0.00 % 55.75 % postgres:
database database 176.31.228.6(39807) idle
27046 be/4 postgres 3.91 K/s 0.00 B/s 0.00 % 52.35 % postgres:
database database 46.105.104.205(41285) SELECT
29217 be/4 postgres 54.73 K/s 0.00 B/s 0.00 % 50.18 % postgres:
database database 46.105.104.205(51178) SELECT
3249 be/4 root 3.91 K/s 320.59 K/s 0.00 % 40.31 % [kjournald]
28918 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 34.83 % postgres:
database database 46.105.104.205(50164) SELECT
26856 be/4 postgres 7.82 K/s 234.58 K/s 0.00 % 34.30 % postgres:
database database 46.105.104.205(40589) SELECT
29205 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 34.17 % postgres:
database database 176.31.228.6(39784) SELECT
4175 be/4 root 0.00 B/s 0.00 B/s 0.00 % 1.84 % [flush-8:0]
28905 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 0.02 % postgres:
database database 46.105.104.205(50125) SELECT
28919 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 0.01 % postgres:
database database 46.105.104.205(50167) SELECT
19807 be/4 postgres 3.91 K/s 0.00 B/s 0.00 % 0.00 % postgres:
database database 176.31.228.6(33598) SELECT
28703 be/4 postgres 0.00 B/s 86.01 K/s 0.00 % 0.00 % postgres:
database database 176.31.228.6(39260) SELECT
29183 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 0.00 % postgres:
database database 46.105.104.205(51044) SELECT
Here is dirty_expire_centisecs :
cat /proc/sys/vm/dirty_expire_centisecs
3000
Bgwriter configuration is default :
#bgwriter_delay = 200ms # 10-10000ms between rounds
#bgwriter_lru_maxpages = 100 # 0-1000 max buffers written/round
#bgwriter_lru_multiplier = 2.0 # 0-10.0 multiplier on buffers scanned/round
Is there anything I can provide to help you ?
Thanks again,
Gaëtan
On 25 November 2011 23:47, Gaëtan Allart <gaetan@nexylan.com> wrote:
Is there anything I can provide to help you?
The checkpoint logs and the output of pg_stat_bgwriter (as asked by Tomas).
It is probable that shared_buffers is too small for your workload
(expected); do you still have issues with checkpoint sync duration?
You should increase shared_buffers (you can iterate) *and* try not to
have the long checkpoint syncs (keep shared_buffers <= 8GB, as already
mentioned upthread).
The output you provided does not show everything, and the SELECTs are
not writing that much. Which processes are writing?
--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation
Here are the latest checkpoint logs :
LOG: checkpoint complete: wrote 842 buffers (0.1%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=168.970 s, sync=0.005 s,
total=168.977 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 318 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 2 recycled; write=63.818 s, sync=0.006 s,
total=63.825 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 744 buffers (0.1%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=147.035 s, sync=0.006 s,
total=147.043 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 108 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=35.410 s, sync=14.921 s,
total=54.811 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 393 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=88.835 s, sync=43.210 s,
total=135.728 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 914 buffers (0.1%); 0 transaction log
file(s) added, 0 removed, 1 recycled; write=148.162 s, sync=14.249 s,
total=170.481 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 202 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 1 recycled; write=53.152 s, sync=0.004 s,
total=53.159 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 897 buffers (0.1%); 0 transaction log
file(s) added, 0 removed, 1 recycled; write=42.414 s, sync=1.175 s,
total=58.957 s
LOG: checkpoint starting: shutdown immediate
LOG: checkpoint complete: wrote 666 buffers (0.1%); 0 transaction log
file(s) added, 0 removed, 1 recycled; write=0.027 s, sync=1.600 s,
total=1.630 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 627 buffers (0.1%); 0 transaction log
file(s) added, 0 removed, 1 recycled; write=125.856 s, sync=0.006 s,
total=125.864 s
LOG: checkpoint starting: time
And pg_stat_bgwriter :
postgres=# select *,now() from pg_stat_bgwriter;
 checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc |              now
-------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------+-------------------------------
               388 |              13 |             494948 |       4306591 |            13555 |         7458743 |    7835244602 | 2011-11-26 00:43:47.232924+01
(1 row)
postgres=# select *,now() from pg_stat_bgwriter;
 checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc |              now
-------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------+-------------------------------
               389 |              13 |             501802 |       4352198 |            13809 |         7469220 |    7839778941 | 2011-11-26 00:49:00.680779+01
(1 row)
The processes that were writing were "SELECT" queries against the database.
Gaëtan
Le 26/11/11 00:17, « Cédric Villemain »
<cedric.villemain.debian@gmail.com> a écrit :
Show quoted text
Le 25 novembre 2011 23:47, Gaëtan Allart <gaetan@nexylan.com> a écrit :
Hello Tomas and Cédric,
Right now, the server is not all right. Load is above 30 and queries are
slow like hell.Here's the complete iotop. Note the 71 MB/s writes (apparently on SELECT
queries).Total DISK READ: 633.35 K/s | Total DISK WRITE: 71.06 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND27352 be/4 postgres 15.64 K/s 86.01 K/s 0.00 % 99.99 % postgres:
database database 176.31.228.6(38816) SELECT
20226 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 99.99 % postgres:
database database 176.31.228.6(34166) SELECT
26950 be/4 postgres 23.46 K/s 0.00 B/s 0.00 % 82.14 % postgres:
database database 46.105.104.205(40820) SELECT
23160 be/4 postgres 3.91 K/s 0.00 B/s 0.00 % 81.14 % postgres:
database database 46.105.104.205(58091) SELECT
29184 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 79.17 % postgres:
database database 46.105.104.205(51047) SELECT
27271 be/4 postgres 23.46 K/s 234.58 K/s 0.00 % 77.15 % postgres:
database database 46.105.104.205(42315) SELECT
28224 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 74.09 % postgres:
database database 46.105.104.205(49871) SELECT
27450 be/4 postgres 23.46 K/s 0.00 B/s 0.00 % 70.12 % postgres:
database database 46.105.104.205(49096) SELECT
29206 be/4 postgres 121.20 K/s 0.00 B/s 0.00 % 62.99 % postgres:
database database 176.31.228.6(39787) SELECT
8202 be/4 postgres 129.02 K/s 0.00 B/s 0.00 % 60.59 % postgres:
database database 91.121.89.14(57291) SELECT
29209 be/4 postgres 3.91 K/s 0.00 B/s 0.00 % 55.75 % postgres:
database database 176.31.228.6(39807) idle
27046 be/4 postgres 3.91 K/s 0.00 B/s 0.00 % 52.35 % postgres:
database database 46.105.104.205(41285) SELECT
29217 be/4 postgres 54.73 K/s 0.00 B/s 0.00 % 50.18 % postgres:
database database 46.105.104.205(51178) SELECT
3249 be/4 root 3.91 K/s 320.59 K/s 0.00 % 40.31 % [kjournald]
28918 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 34.83 % postgres:
database database 46.105.104.205(50164) SELECT
26856 be/4 postgres 7.82 K/s 234.58 K/s 0.00 % 34.30 % postgres:
database database 46.105.104.205(40589) SELECT
29205 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 34.17 % postgres:
database database 176.31.228.6(39784) SELECT
4175 be/4 root 0.00 B/s 0.00 B/s 0.00 % 1.84 % [flush-8:0]
28905 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 0.02 % postgres:
database database 46.105.104.205(50125) SELECT
28919 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 0.01 % postgres:
database database 46.105.104.205(50167) SELECT
19807 be/4 postgres 3.91 K/s 0.00 B/s 0.00 % 0.00 % postgres:
database database 176.31.228.6(33598) SELECT
28703 be/4 postgres 0.00 B/s 86.01 K/s 0.00 % 0.00 % postgres:
database database 176.31.228.6(39260) SELECT
29183 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 0.00 % postgres:
database database 46.105.104.205(51044) SELECTHere is dirty_expire_centisecs :
cat /proc/sys/vm/dirty_expire_centisecs
3000Bgwriter configuration is default :
#bgwriter_delay = 200ms # 10-10000ms between rounds
#bgwriter_lru_maxpages = 100 # 0-1000 max buffers written/round
#bgwriter_lru_multiplier = 2.0 # 0-10.0 multipler on buffers
scanned/roundIs there anything I can provide to help you ?
the checkpoints logs and the output of pg_stat_bgwriter (as asked by
Tomas).It is probable that shared_buffers are too small for your workload
(expected), do you still have issue with checkpoint sync duration ?You should increase shared_buffers (you can iterate) *and* try to not
have the long checkpoint sync. (keep shared_buffers <= 8GB, as already
mentioned upthread)The output you provided does not show everything, and the select are
not looking to write that much. What process are writing ?Thanks again,
Gaëtan
Le 25/11/11 14:12, « Tomas Vondra » <tv@fuzzy.cz> a écrit :
On 25 Listopad 2011, 12:43, Cédric Villemain wrote:
Le 25 novembre 2011 11:25, Tomas Vondra <tv@fuzzy.cz> a écrit :
On 24 Listopad 2011, 23:19, Cédric Villemain wrote:
It seems you have an issue with your checkpoint syncing time; it is fixed in 9.1 and backported to 9.0 here:
http://projects.2ndquadrant.com/backports
People generally don't want to apply backports on their own, especially when it's a production server and when it's unclear it actually fixes the issue they have. I'm not sure about that.
I agree that most people don't want to do that themselves, but if it happens to be the solution they can proceed or ask someone to do it. People want to see their production system back to a normal situation; the limited information here is not enough to be sure, but the checkpoint sync times are clear: the sync times are not correct. It is very probable that compacting the fsyncs will help, but it is not certain it is required yet.
Yes, the sync times are quite crazy. Especially given the tiny number of
buffers to be written and the fact that the SSD should handle random I/O
quite well.
It is possible you have other problems that explain the issue you have. An immediate solution before trying a patch is to reduce your shared_buffers setting to something very low, like 1GB.
Well, using low shared_buffers was common especially before 8.3, when spread checkpoints were not available. It prevents the I/O overload when the database suddenly decides to write all of the dirty buffers. But he's on 9.0 (so he already has spread checkpoints).
It is a different animal here.
Plus the number of buffers he's writing is negligible - usually about 700 buffers (6MB), 3192 buffers (25MB) at most. That surely should not be a problem for the SSD he's using.
See the blog entry from Greg Smith:
http://blog.2ndquadrant.com/en/2011/06/backporting-and-checkpoint-tro.html
And the slides of his talk at pgconf2011:
http://www.2ndquadrant.com/static/2quad/media/pdfs/talks/WriteStuff-PGCon2011.pdf
I was just pointing out that there are known issues in this area, with known solutions.
Thanks for the links, interesting stuff. Still, my impression is that the SSD is stressed by something else, and the high fsync values during a checkpoint are merely a symptom. So fixing the checkpoint (using the backpatch) won't actually fix the issue. But I'm just guessing here.
Getting more information on vacuum activity and bgwriter activity should help too.
Yes, that'd be nice. Gaëtan, can you post the bgwriter-related options from postgresql.conf and two snapshots of pg_stat_bgwriter (say 5 minutes apart, collected when the db is slow)? A complete 'iotop -o' output would be nice too.
BTW what filesystem are we talking about? What mount options have you used?
What about /proc/sys/vm/dirty_expire_centisecs?
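Something like this should do for the snapshots (a minimal sketch - run it twice, about 5 minutes apart, and keep both outputs):

SELECT now() AS captured_at, * FROM pg_stat_bgwriter;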
Tomas
--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation
On 26.11.2011 00:17, Cédric Villemain wrote:
On 25 November 2011 23:47, Gaëtan Allart <gaetan@nexylan.com> wrote:
Hello Tomas and Cédric,
Right now, the server is not all right. Load is above 30 and queries are slow like hell.
Here's the complete iotop. Note the 71 MB/s writes (apparently on SELECT queries).
Total DISK READ: 633.35 K/s | Total DISK WRITE: 71.06 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
27352 be/4 postgres 15.64 K/s 86.01 K/s 0.00 % 99.99 % postgres:
database database 176.31.228.6(38816) SELECT
20226 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 99.99 % postgres:
database database 176.31.228.6(34166) SELECT
26950 be/4 postgres 23.46 K/s 0.00 B/s 0.00 % 82.14 % postgres:
database database 46.105.104.205(40820) SELECT
23160 be/4 postgres 3.91 K/s 0.00 B/s 0.00 % 81.14 % postgres:
database database 46.105.104.205(58091) SELECT
29184 be/4 postgres 7.82 K/s 0.00 B/s 0.00 % 79.17 % postgres:
database database 46.105.104.205(51047) SELECT
Gaetan, you keep deleting the interesting rows for some reason. None of
the rows you've posted writes more than a few KB/s - what about the rows
that actually write those 71MB/s?
Here is dirty_expire_centisecs :
cat /proc/sys/vm/dirty_expire_centisecs
3000
OK. That's a default value and it's usually too high (just like the
ratios), but don't change it until we find out what's wrong.
Bgwriter configuration is default :
#bgwriter_delay = 200ms # 10-10000ms between rounds
#bgwriter_lru_maxpages = 100 # 0-1000 max buffers written/round
#bgwriter_lru_multiplier = 2.0 # 0-10.0 multiplier on buffers scanned/round
OK.
Is there anything I can provide to help you ?
the checkpoints logs and the output of pg_stat_bgwriter (as asked by Tomas).
It is probable that shared_buffers are too small for your workload
(expected), do you still have issue with checkpoint sync duration ?
I don't think that's the case. Too small shared buffers usually cause a
lot of reads (especially when all the queries are SELECTs as here), but
he has problems with writes.
And according to the pg_stat_bgwriter Gaetan posted, the checkpoints
wrote about 54MB, bgwriter about 370MB and backends about 80MB (during
the 5 minutes between snapshots).
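(For reference, those MB figures are just the buffer-counter deltas between the two snapshots multiplied by the block size - roughly like this, assuming the default 8 kB blocks:)

-- cumulative counters; take two runs a few minutes apart and subtract
SELECT buffers_checkpoint * 8 / 1024 AS checkpoint_mb,
       buffers_clean * 8 / 1024 AS bgwriter_mb,
       buffers_backend * 8 / 1024 AS backend_mb
  FROM pg_stat_bgwriter;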
So I'm really wondering where those 70MB/s of writes come from.
Two things just crossed my mind. The first one is hint bits - this may cause a SELECT to write a lot of data. But I guess this is included in the pg_stat_bgwriter stats.
The second one is on-disk sorting - this happens when a query needs to sort so much data it can't be done in work_mem, so the data are pushed to the disk, and AFAIK it's not included in pg_stat_bgwriter.
But he has work_mem set to 128MB so that's rather unlikely. Gaetan, can you verify that the queries that write the most data to the disk are not performing any huge sorts etc.? (See in iotop which postgres process is writing a lot of data, then use the PID and pg_stat_activity to find out which query it's executing.)
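For example (a sketch - on 9.0 the columns are called procpid and current_query, and 27352 is just one of the PIDs from your iotop output):

SELECT procpid, usename, waiting, query_start, current_query
  FROM pg_stat_activity
 WHERE procpid = 27352;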
Tomas
A better view of iotop :
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
16974 be/4 postgres 46.33 K/s 0.00 B/s 0.00 % 7.21 % postgres:
database database 46.105.111.92(54930) idle
383 be/4 postgres 7.72 K/s 0.00 B/s 0.00 % 1.56 % postgres:
database database 46.105.104.205(51706) idle in transaction
15934 be/4 postgres 0.00 B/s 38.61 K/s 0.00 % 0.04 % postgres: wal
writer process
31487 be/4 postgres 0.00 B/s 11.32 M/s 0.00 % 0.00 % postgres:
database database 46.105.104.205(48616) SELECT
29541 be/4 postgres 0.00 B/s 11.52 M/s 0.00 % 0.00 % postgres:
database database 46.105.104.205(40675) SELECT
31419 be/4 postgres 0.00 B/s 5.99 M/s 0.00 % 0.00 % postgres:
database database 46.105.104.205(48476) SELECT
31875 be/4 postgres 0.00 B/s 15.23 M/s 0.00 % 0.00 % postgres:
database database 46.105.104.205(50228) SELECT
30985 be/4 postgres 0.00 B/s 10.55 M/s 0.00 % 0.00 % postgres:
database database 46.105.104.205(47672) SELECT
30902 be/4 postgres 0.00 B/s 6.04 M/s 0.00 % 0.00 % postgres:
database database 176.31.228.6(43295) SELECT
30903 be/4 postgres 0.00 B/s 4.79 M/s 0.00 % 0.00 % postgres:
database database 46.105.104.205(47565) SELECT
15933 be/4 postgres 0.00 B/s 77.22 K/s 0.00 % 0.00 % postgres:
writer process
As you can see, SELECTs are writing a great deal to the disk.
At this moment, pg_stat_activity shows many, many queries running at the same time.
It looks like all queries are suddenly writing to disk rather than one particular query, which makes me think of a buffer issue or something.
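(Something along these lines shows them all - on 9.0 idle backends report '<IDLE>' as the query text:)

SELECT procpid, now() - query_start AS runtime, waiting, current_query
  FROM pg_stat_activity
 WHERE current_query <> '<IDLE>'
 ORDER BY runtime DESC;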
This is an example of a query writing at 10 MB/s :
SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS "id",
"table"."flux_id" AS "flux_id", "table"."locale_id" AS "locale_id",
"table"."url_article" AS "url_article", "table"."original_url" AS
"original_url", "table"."name" AS "name", "table"."description" AS
"description", "table"."content" AS "content", "table"."permis" AS
"permis", "table"."reviewed" AS "reviewed", "table"."author_id" AS
"author_id", "table"."poster_id" AS "poster_id", "table"."post_date" AS
"post_date", "table"."edit_date" AS "edit_date", "table"."add_date" AS
"add_date", "table"."comments_open" AS "comments_open", "table"."site_id"
AS "site_id", "table"."is_local" AS "is_local", "table"."status" AS
"status", "table"."visits" AS "visits", "table"."votes" AS "votes",
"table"."score" AS "score", "arti
Checkpoint logs still show very long write times :
LOG: checkpoint complete: wrote 92 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 1 recycled; write=49.622 s, sync=6.510 s,
total=63.625 s
Couldn't this be a hardware issue ?
Best regards,
Gaëtan
On 26 November 2011, 10:45, Gaëtan Allart wrote:
A better view of iotop :
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
31875 be/4 postgres 0.00 B/s 15.23 M/s 0.00 % 0.00 % postgres:
database database 46.105.104.205(50228) SELECT
30985 be/4 postgres 0.00 B/s 10.55 M/s 0.00 % 0.00 % postgres:
database database 46.105.104.205(47672) SELECT
As you can see, SELECTs are writing a great deal to the disk.
At this moment, pg_stat_activity shows many, many queries running at the same time.
It looks like all queries are suddenly writing to disk rather than one particular query, which makes me think of a buffer issue or something.
No, I don't think this is an issue with the size of shared buffers. That'd
influence the pg_stat_bgwriter - the buffers_backend would grow much
faster, and it's not the case.
So I'm guessing it's either hint bits or sorting. If I had to, I'd
probably guess about the hint bits - see for example this thread
http://archives.postgresql.org/pgsql-performance/2008-05/msg00182.php
it more or less behaves like what you described so far.
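(If it really is hint bits, touching the freshly loaded or heavily updated tables once should make the writes go away - a rough sketch, "articles" being a hypothetical table name:)

-- a VACUUM (or even a single full read) sets the hint bits up front,
-- so later SELECTs no longer have to rewrite those pages
VACUUM articles;
-- or simply:
SELECT count(*) FROM articles;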
This is an example of a query writing at 10 MB/s :
SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS "id",
"table"."flux_id" AS "flux_id", "table"."locale_id" AS "locale_id",
"table"."url_article" AS "url_article", "table"."original_url" AS
"original_url", "table"."name" AS "name", "table"."description" AS
"description", "table"."content" AS "content", "table"."permis" AS
"permis", "table"."reviewed" AS "reviewed", "table"."author_id" AS
"author_id", "table"."poster_id" AS "poster_id", "table"."post_date" AS
"post_date", "table"."edit_date" AS "edit_date", "table"."add_date" AS
"add_date", "table"."comments_open" AS "comments_open", "table"."site_id"
AS "site_id", "table"."is_local" AS "is_local", "table"."status" AS
"status", "table"."visits" AS "visits", "table"."votes" AS "votes",
"table"."score" AS "score", "arti
Post EXPLAIN ANALYZE output of this query (use explain.depesz.com to post
it).
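Something like this (BUFFERS is available in 9.0; the temp read/written figures in the output will reveal any on-disk sorts):

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM (SELECT DISTINCT ...) AS sub;  -- the full query from above goes here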
Checkpoint logs still show very long write times :
LOG: checkpoint complete: wrote 92 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 1 recycled; write=49.622 s, sync=6.510 s,
total=63.625 s
No, that's fine - that's what spread checkpoints do. Once the
checkpoint_timeout expires, the system decides to perform a checkpoint,
i.e. it has to write all dirty (modified) shared buffers to the disk. But
it knows the next checkpoint will happen in checkpoint_timeout, so it has
about 5 minutes to write all the data.
So it says something like 'I need to write 540MB and want to write that in 270 seconds (5 minutes * completion_target), so I'll write at 2MB/s'. So the write phase is expected to take a long time. But it has to sync the data at the end, and that's where the problems usually happen - so the important thing is 'sync', and that has improved significantly. 6 seconds is not great but it's not a big issue I guess.
Couldn't this be a hardware issue ?
I don't think so. The problem is that the queries write a lot of data, and
the SSD can't help in this case. The SSDs provide much better random
performance (compared to spinners), but the sequential performance is not
much better. So pushing 70MB/s to the SSD may fully utilize it.
Have you tested the basic performance (like bonnie++, dd etc) before going
into production? There are some common SSD-related mistakes (e.g. not
aligning the partitions properly) that may easily cause 30% performance
loss. But this won't fix the problem, although you could get a bit better
performance.
Tomas