BUG #14714: long running sessions from remote instance seems to hang some times
The following bug has been logged on the website:
Bug reference: 14714
Logged by: Josef Machytka
Email address: josef.machytka@gmail.com
PostgreSQL version: 9.6.3
Operating system: Debian jessie
Description:
On pg 9.6.* we have sometimes strange problem with some sessions. We
experienced it from bash scripts, golang program and node.js applications.
Environment are Google Compute Engine instances with Debian 8.
Sometimes some session goes into some kind of "drowsy" state and runs
incredibly slowly. It causes some very low disk IO and runs like 20x slower
then usual. This happens mainly when database is under heavy load and it
happens randomly to different tasks. It happens randomly to both connections
from local instance (cronjob running on instance with PG) and remote
connections from other instances.
When we restart task it runs normally. I cannot see any changes in process
priority in OS, process switches between sleeping and running state and I do
not see any locks or waits in pg_stat_activity or pg_locks. Usually all
other processes finish OK and this "drowsy" one runs "for ever..."
We implemented timeouts for PG tasks into our programs to restart tasks but
it very annoying and we have to use long enough timeouts anyway so it delays
tasks.... It happens from time to time also to dblink connections inside pg
function so we had to rewrite our programs which were running without any
problems on PG 9.5. Problems started when we switched to PG 9.6 and we
experienced them on all minor versions - we currently run 9.6.3 and problem
is still present...
Do you please have any advice what to check or what setting might help?
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
josef.machytka@gmail.com writes:
Sometimes some session goes into some kind of "drowsy" state and runs
incredibly slowly. It causes some very low disk IO and runs like 20x slower
then usual.
Is it using a lot of CPU time according to ps or vmstat?
Do you please have any advice what to check or what setting might help?
If it's eating CPU time, try using "perf" to find out where.
regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
2017-06-19 16:49 GMT+02:00 <josef.machytka@gmail.com>:
The following bug has been logged on the website:
Bug reference: 14714
Logged by: Josef Machytka
Email address: josef.machytka@gmail.com
PostgreSQL version: 9.6.3
Operating system: Debian jessie
Description:On pg 9.6.* we have sometimes strange problem with some sessions. We
experienced it from bash scripts, golang program and node.js applications.
Environment are Google Compute Engine instances with Debian 8.Sometimes some session goes into some kind of "drowsy" state and runs
incredibly slowly. It causes some very low disk IO and runs like 20x slower
then usual. This happens mainly when database is under heavy load and it
happens randomly to different tasks. It happens randomly to both
connections
from local instance (cronjob running on instance with PG) and remote
connections from other instances.When we restart task it runs normally. I cannot see any changes in process
priority in OS, process switches between sleeping and running state and I
do
not see any locks or waits in pg_stat_activity or pg_locks. Usually all
other processes finish OK and this "drowsy" one runs "for ever..."We implemented timeouts for PG tasks into our programs to restart tasks but
it very annoying and we have to use long enough timeouts anyway so it
delays
tasks.... It happens from time to time also to dblink connections inside pg
function so we had to rewrite our programs which were running without any
problems on PG 9.5. Problems started when we switched to PG 9.6 and we
experienced them on all minor versions - we currently run 9.6.3 and problem
is still present...Do you please have any advice what to check or what setting might help?
it can be spin lock issue. Try to use "perf top". Install postgresql-server
debug symbols before.
what is your share buffers configuration?
Regards
Pavel
Show quoted text
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On Mon, Jun 19, 2017 at 7:49 AM, <josef.machytka@gmail.com> wrote:
The following bug has been logged on the website:
Bug reference: 14714
Logged by: Josef Machytka
Email address: josef.machytka@gmail.com
PostgreSQL version: 9.6.3
Operating system: Debian jessie
Description:On pg 9.6.* we have sometimes strange problem with some sessions. We
experienced it from bash scripts, golang program and node.js applications.
Environment are Google Compute Engine instances with Debian 8.Sometimes some session goes into some kind of "drowsy" state and runs
incredibly slowly. It causes some very low disk IO and runs like 20x slower
then usual. This happens mainly when database is under heavy load and it
happens randomly to different tasks. It happens randomly to both
connections
from local instance (cronjob running on instance with PG) and remote
connections from other instances.
One possibility is that another process inserted a bunch of new tuples and
committed them, and now the existing session has to dig through all those
new tuples and ignore them because they are too new to be visible to it.
This would particularly be a problem where the queries have things like
"MAX(indexed_column)" or "ORDER BY indexed_column desc LIMIT 1" and where
the newly insert tuples all have values which are at the end of the range
being probed by that query. Although I don't know why this would have
gotten worse from 9.5 to 9.6. The change should have been in the other
direction.
Is this on a master or on a hot standby?
Cheers,
Jeff
Today I got several sessions in this "drowsy" state and it almost caused
collapse of the whole reporting system because aggregations were not fully
calculated.
Looks like now on PG 9.6.3 is this problem even worse then on previous
9.6.x subversions...
I checked "top", "iotop -u postgres", "perf top" and "perf top -u postgres"
and as far as I could see those sessions are running although they are
marked as "S" sleeping in "top" command almost all the time.
From time to time they cause very low CPU load, something like from 0,5 to
0,8% - which is visible only in "perf" output. In "top" I see almost all
the time 0.0% on CPU.
The same with disk IO - iotop shows only a few MB/s from time to time.
They are marked as "active" in "pg_stat_activity" and no wait events are
shown.
GCE Instance has Debian 8, 10 CPUs, 46GB RAM
Our database has 2390 GB is distributed over 3 different tablespaces on 3
different GCE standard persistent disks, wal logs and temp tablespace have
separate disk.
Out pg config is:
shared_buffers = 8GB
temp_buffers = 512MB
work_mem = 192MB
maintenance_work_mem = 2GB
max_connections = 150
max_wal_size = 8GB
min_wal_size = 4GB
max_worker_processes = 50
max_parallel_workers_per_gather = 4
There are several procedures which set work_mem to 256MB because tests
showed it speeds up processing really significantly.
Reason is when we switched from PG 9.5 to PG 9.6 we found that many old
queries had suddenly very bad explain plan and very long run time.
Only setting work_mem from previous 64MB to the current values helped to
get the same or better results for those queries on 9.6.
Of course we upgraded instance with more memory. But we do not use swap.
Shall I look for some special perf symbols in output? Or is there a
possibility to debug it deeper?
Thanks
Josef
On 19 June 2017 at 17:56, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Show quoted text
2017-06-19 16:49 GMT+02:00 <josef.machytka@gmail.com>:
The following bug has been logged on the website:
Bug reference: 14714
Logged by: Josef Machytka
Email address: josef.machytka@gmail.com
PostgreSQL version: 9.6.3
Operating system: Debian jessie
Description:On pg 9.6.* we have sometimes strange problem with some sessions. We
experienced it from bash scripts, golang program and node.js applications.
Environment are Google Compute Engine instances with Debian 8.Sometimes some session goes into some kind of "drowsy" state and runs
incredibly slowly. It causes some very low disk IO and runs like 20x
slower
then usual. This happens mainly when database is under heavy load and it
happens randomly to different tasks. It happens randomly to both
connections
from local instance (cronjob running on instance with PG) and remote
connections from other instances.When we restart task it runs normally. I cannot see any changes in process
priority in OS, process switches between sleeping and running state and I
do
not see any locks or waits in pg_stat_activity or pg_locks. Usually all
other processes finish OK and this "drowsy" one runs "for ever..."We implemented timeouts for PG tasks into our programs to restart tasks
but
it very annoying and we have to use long enough timeouts anyway so it
delays
tasks.... It happens from time to time also to dblink connections inside
pg
function so we had to rewrite our programs which were running without any
problems on PG 9.5. Problems started when we switched to PG 9.6 and we
experienced them on all minor versions - we currently run 9.6.3 and
problem
is still present...Do you please have any advice what to check or what setting might help?
it can be spin lock issue. Try to use "perf top". Install
postgresql-server debug symbols before.what is your share buffers configuration?
Regards
Pavel
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
2017-06-23 11:26 GMT+02:00 Josef Machytka <josef.machytka@gmail.com>:
Today I got several sessions in this "drowsy" state and it almost caused
collapse of the whole reporting system because aggregations were not fully
calculated.
Looks like now on PG 9.6.3 is this problem even worse then on previous
9.6.x subversions...I checked "top", "iotop -u postgres", "perf top" and "perf top -u
postgres" and as far as I could see those sessions are running although
they are marked as "S" sleeping in "top" command almost all the time.
From time to time they cause very low CPU load, something like from 0,5 to
0,8% - which is visible only in "perf" output. In "top" I see almost all
the time 0.0% on CPU.
The same with disk IO - iotop shows only a few MB/s from time to time.
They are marked as "active" in "pg_stat_activity" and no wait events are
shown.GCE Instance has Debian 8, 10 CPUs, 46GB RAM
Our database has 2390 GB is distributed over 3 different tablespaces on 3
different GCE standard persistent disks, wal logs and temp tablespace have
separate disk.Out pg config is:
shared_buffers = 8GB
temp_buffers = 512MB
work_mem = 192MB
maintenance_work_mem = 2GB
max_connections = 150
max_wal_size = 8GB
min_wal_size = 4GB
max_worker_processes = 50
max_parallel_workers_per_gather = 4There are several procedures which set work_mem to 256MB because tests
showed it speeds up processing really significantly.
Reason is when we switched from PG 9.5 to PG 9.6 we found that many old
queries had suddenly very bad explain plan and very long run time.
Only setting work_mem from previous 64MB to the current values helped to
get the same or better results for those queries on 9.6.
Of course we upgraded instance with more memory. But we do not use swap.Shall I look for some special perf symbols in output? Or is there a
possibility to debug it deeper?
what is perf top output?
you can install debug symbols for postgres and for kernel
https://wiki.debian.org/DebugPackage
Regards
Pavel
Thanks
Show quoted text
Josef
On 19 June 2017 at 17:56, Pavel Stehule <pavel.stehule@gmail.com> wrote:
2017-06-19 16:49 GMT+02:00 <josef.machytka@gmail.com>:
The following bug has been logged on the website:
Bug reference: 14714
Logged by: Josef Machytka
Email address: josef.machytka@gmail.com
PostgreSQL version: 9.6.3
Operating system: Debian jessie
Description:On pg 9.6.* we have sometimes strange problem with some sessions. We
experienced it from bash scripts, golang program and node.js
applications.
Environment are Google Compute Engine instances with Debian 8.Sometimes some session goes into some kind of "drowsy" state and runs
incredibly slowly. It causes some very low disk IO and runs like 20x
slower
then usual. This happens mainly when database is under heavy load and it
happens randomly to different tasks. It happens randomly to both
connections
from local instance (cronjob running on instance with PG) and remote
connections from other instances.When we restart task it runs normally. I cannot see any changes in
process
priority in OS, process switches between sleeping and running state and
I do
not see any locks or waits in pg_stat_activity or pg_locks. Usually all
other processes finish OK and this "drowsy" one runs "for ever..."We implemented timeouts for PG tasks into our programs to restart tasks
but
it very annoying and we have to use long enough timeouts anyway so it
delays
tasks.... It happens from time to time also to dblink connections inside
pg
function so we had to rewrite our programs which were running without any
problems on PG 9.5. Problems started when we switched to PG 9.6 and we
experienced them on all minor versions - we currently run 9.6.3 and
problem
is still present...Do you please have any advice what to check or what setting might help?
it can be spin lock issue. Try to use "perf top". Install
postgresql-server debug symbols before.what is your share buffers configuration?
Regards
Pavel
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
sorry, here is file with sample - basically several group by aggregations
over different partitioned tables have been locked in this "drowsy" state
On 23 June 2017 at 11:37, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Show quoted text
2017-06-23 11:26 GMT+02:00 Josef Machytka <josef.machytka@gmail.com>:
Today I got several sessions in this "drowsy" state and it almost caused
collapse of the whole reporting system because aggregations were not fully
calculated.
Looks like now on PG 9.6.3 is this problem even worse then on previous
9.6.x subversions...I checked "top", "iotop -u postgres", "perf top" and "perf top -u
postgres" and as far as I could see those sessions are running although
they are marked as "S" sleeping in "top" command almost all the time.
From time to time they cause very low CPU load, something like from 0,5
to 0,8% - which is visible only in "perf" output. In "top" I see almost all
the time 0.0% on CPU.
The same with disk IO - iotop shows only a few MB/s from time to time.
They are marked as "active" in "pg_stat_activity" and no wait events are
shown.GCE Instance has Debian 8, 10 CPUs, 46GB RAM
Our database has 2390 GB is distributed over 3 different tablespaces on 3
different GCE standard persistent disks, wal logs and temp tablespace have
separate disk.Out pg config is:
shared_buffers = 8GB
temp_buffers = 512MB
work_mem = 192MB
maintenance_work_mem = 2GB
max_connections = 150
max_wal_size = 8GB
min_wal_size = 4GB
max_worker_processes = 50
max_parallel_workers_per_gather = 4There are several procedures which set work_mem to 256MB because tests
showed it speeds up processing really significantly.
Reason is when we switched from PG 9.5 to PG 9.6 we found that many old
queries had suddenly very bad explain plan and very long run time.
Only setting work_mem from previous 64MB to the current values helped to
get the same or better results for those queries on 9.6.
Of course we upgraded instance with more memory. But we do not use swap.Shall I look for some special perf symbols in output? Or is there a
possibility to debug it deeper?what is perf top output?
you can install debug symbols for postgres and for kernel
https://wiki.debian.org/DebugPackageRegards
Pavel
Thanks
Josef
On 19 June 2017 at 17:56, Pavel Stehule <pavel.stehule@gmail.com> wrote:
2017-06-19 16:49 GMT+02:00 <josef.machytka@gmail.com>:
The following bug has been logged on the website:
Bug reference: 14714
Logged by: Josef Machytka
Email address: josef.machytka@gmail.com
PostgreSQL version: 9.6.3
Operating system: Debian jessie
Description:On pg 9.6.* we have sometimes strange problem with some sessions. We
experienced it from bash scripts, golang program and node.js
applications.
Environment are Google Compute Engine instances with Debian 8.Sometimes some session goes into some kind of "drowsy" state and runs
incredibly slowly. It causes some very low disk IO and runs like 20x
slower
then usual. This happens mainly when database is under heavy load and it
happens randomly to different tasks. It happens randomly to both
connections
from local instance (cronjob running on instance with PG) and remote
connections from other instances.When we restart task it runs normally. I cannot see any changes in
process
priority in OS, process switches between sleeping and running state and
I do
not see any locks or waits in pg_stat_activity or pg_locks. Usually all
other processes finish OK and this "drowsy" one runs "for ever..."We implemented timeouts for PG tasks into our programs to restart tasks
but
it very annoying and we have to use long enough timeouts anyway so it
delays
tasks.... It happens from time to time also to dblink connections
inside pg
function so we had to rewrite our programs which were running without
any
problems on PG 9.5. Problems started when we switched to PG 9.6 and we
experienced them on all minor versions - we currently run 9.6.3 and
problem
is still present...Do you please have any advice what to check or what setting might help?
it can be spin lock issue. Try to use "perf top". Install
postgresql-server debug symbols before.what is your share buffers configuration?
Regards
Pavel
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Attachments:
perfoutput.txttext/plain; charset=US-ASCII; name=perfoutput.txtDownload
Josef Machytka <josef.machytka@gmail.com> writes:
sorry, here is file with sample - basically several group by aggregations
over different partitioned tables have been locked in this "drowsy" state
The amount of time being spent in the kernel is strikingly high. I wonder
if you're seeing some variant of the old "context swap storm" problem.
Try watching the output of "vmstat 1" for awhile to see if the cs rate
is high.
regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
If there are no queries on database I see number of context switches from
~900 to ~1500.
I checked our monitoring based on node_exporter, Prometheus and Grafana and
when this "drowsiness" of sessions happens I see number of context switches
from ~10 000 to ~25 000 with a few peeks ~28 000.
But similar numbers or even slightly higher I see also in other cases of
high load on the database and they did not caused any "drowsy" sessions. In
fact monitoring does not show any significant differences between cases
with "drowsy" sessions and without them...If you have some advices about
what else to monitor I will add.
On 23 June 2017 at 16:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Show quoted text
Josef Machytka <josef.machytka@gmail.com> writes:
sorry, here is file with sample - basically several group by aggregations
over different partitioned tables have been locked in this "drowsy" stateThe amount of time being spent in the kernel is strikingly high. I wonder
if you're seeing some variant of the old "context swap storm" problem.
Try watching the output of "vmstat 1" for awhile to see if the cs rate
is high.regards, tom lane
Hi,
On 2017-06-19 14:49:07 +0000, josef.machytka@gmail.com wrote:
The following bug has been logged on the website:
Bug reference: 14714
Logged by: Josef Machytka
Email address: josef.machytka@gmail.com
PostgreSQL version: 9.6.3
Operating system: Debian jessie
Description:On pg 9.6.* we have sometimes strange problem with some sessions. We
experienced it from bash scripts, golang program and node.js applications.
Environment are Google Compute Engine instances with Debian 8.Sometimes some session goes into some kind of "drowsy" state and runs
incredibly slowly. It causes some very low disk IO and runs like 20x slower
then usual. This happens mainly when database is under heavy load and it
happens randomly to different tasks. It happens randomly to both connections
from local instance (cronjob running on instance with PG) and remote
connections from other instances.When we restart task it runs normally. I cannot see any changes in process
priority in OS, process switches between sleeping and running state and I do
not see any locks or waits in pg_stat_activity or pg_locks. Usually all
other processes finish OK and this "drowsy" one runs "for ever..."
Which version of the os & kernel is this? This kind of sounds like the
symptoms that we've observed more frequently a while ago with
the kernel's transparent_hugepages management.
Greetings,
Andres Freund
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Debian 8:
Linux <hostname> 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u1 (2017-06-18)
x86_64 GNU/Linux
On 23 June 2017 at 18:57, Andres Freund <andres@anarazel.de> wrote:
Show quoted text
Hi,
On 2017-06-19 14:49:07 +0000, josef.machytka@gmail.com wrote:
The following bug has been logged on the website:
Bug reference: 14714
Logged by: Josef Machytka
Email address: josef.machytka@gmail.com
PostgreSQL version: 9.6.3
Operating system: Debian jessie
Description:On pg 9.6.* we have sometimes strange problem with some sessions. We
experienced it from bash scripts, golang program and node.jsapplications.
Environment are Google Compute Engine instances with Debian 8.
Sometimes some session goes into some kind of "drowsy" state and runs
incredibly slowly. It causes some very low disk IO and runs like 20xslower
then usual. This happens mainly when database is under heavy load and it
happens randomly to different tasks. It happens randomly to bothconnections
from local instance (cronjob running on instance with PG) and remote
connections from other instances.When we restart task it runs normally. I cannot see any changes in
process
priority in OS, process switches between sleeping and running state and
I do
not see any locks or waits in pg_stat_activity or pg_locks. Usually all
other processes finish OK and this "drowsy" one runs "for ever..."Which version of the os & kernel is this? This kind of sounds like the
symptoms that we've observed more frequently a while ago with
the kernel's transparent_hugepages management.Greetings,
Andres Freund
Hi,
(please don't top-quote on this list)
On 2017-06-26 12:27:04 +0200, Josef Machytka wrote:
Debian 8:
Linux <hostname> 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u1 (2017-06-18)
x86_64 GNU/Linux
You should chekc the following parameters:
cat /proc/sys/vm/zone_reclaim_mode
cat /sys/kernel/mm/transparent_hugepage/enabled
and disable them:
echo 0 > /proc/sys/vm/zone_reclaim_mode
echo never > /sys/kernel/mm/transparent_hugepage/enabled
- Andres
Greetings,
Andres Freund
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Hi,
thank you for an advice - I set it and tested for several days.
Good news is we do not have "drowsy" sessions any more.
But performance of queries on huge tables went seriously down. Even with
parallel processing.
It is understandable regarding memory management but not sustainable for
the future since our tables are growing almost exponentially. In these days
we have weekly increase in size >100GB and it goes still higher. Database
now have 2,5 TB. We had to move some of our queries from PostgreSQL to
Bigquery because performance was really bad...
I checked history of this database and previously PostgreSQL 9.5 was
running on Google GCE instance with Ubuntu 14.04 now we have new instance
with Debian 8 and PostgeSQL 9.6.
On previous instance with Ubuntu we did not have all these problems at all.
I checked and Ubuntu 14.04 has by default hugepages enabled (value
“always”) + fixed setting of number of huge pages but Debian 8 has default
setting “madvise” - memory adviser.
Please has someone here some deeper experiences with differences between
Ubuntu and Debian as environment for PostgreSQL as highly used OLAP
database? Because so far it looks like switching to Debian was rather bad
choice....
Thanks
Josef