psql is hanging
We have a long script of sql that we run, several thousand lines of sql. If
I execute the script from start to finish, somewhere in the middle of it,
one sql command will hang and take 2 to 3 hours. During this time, "htop"
shows 100% cpu usage with a postgresql process, so it appears to be doing
something.
If I stop that command, then immediately rerun the rest of the script
starting from the command that "hung", the "hung" command then completes in
5 seconds and the rest of the script just continues on perfectly. I can
interrupt the command within 30 mins of hanging, or 2 hours of hanging;
regardless of when I interrupt the hung command, it always immediately
completes successfully in < 5 seconds when I rerun the script immediately
after that.
I have noticed this in multiple scripts, multiple commands; one was an
"insert into", a different script hung on a "update" and a third one would
hang on a "create table x as (select ... )". These are all short running
commands normally, and we have indexes on all the columns for the
appropriate "where" clauses.
I can't send the sql (forbidden), but the actual command does not appear to
be the issue. Rerunning the same little section of sql manually, which
includes the hung command, always takes less than a minute, but the entire
script, which should only take 30 minutes in total, will always hang at the
same spot; interrupting it (regardless of how long it was running) and
rerunning then immediately completes.
The "hung" command does eventually finish at times after 2 to 3 hours, most
often we can't afford to wait for it to run since this batch of sql has to
run daily.
Some other notes:
- Environment is Ubuntu 16.04; postgresql version was originally 9.5(?) but
we have upgraded to 11.1 using postgresql packages for ubuntu and the
problem has persisted across postgresql versions.
- There are no other queries running on the postgresql server during this
time, and server load is idling (other than whatever psql is doing).
- It is an Amazon server, with ZFS as the file system; it has 50% disk
free on the storage partition that the postgresql database is on.
- We have 60 gigs of ram total, 8 cores.
- We have run the entire script using "create unlogged table" everywhere as
well, no difference.
- We have made tunings to the postgresql configuration to make it use more
RAM.
- All sql commands are sent through a custom Java application; so we use
java + JDBC to connect to postgresql for this. The script has custom
commands which are interpreted by the Java app, so we can't test running
the exact script start to finish in psql native tools since the native
tools won't understand our custom sql. However the java app is very simple.
All individual sql commands are committed immediately after they run so
there is no long transaction in play here. The postgresql jdbc driver is
"postgresql-42.2.4".
- One may suspect the custom java app; but again, htop shows postgresql
using 100% CPU usage on one core, so postgresql is chewing on the command,
doing something.
I have attached our current postgresql.conf file.
We're kind of pulling out our hair here, any ideas?
Thanks!
On 11/30/18 7:53 AM, John Smith wrote:
We have a long script of sql that we run, several thousand lines of sql.
If I execute the script from start to finish, somewhere in the middle of
it, one sql command will hang and take 2 to 3 hours. During this time,
"htop" shows 100% cpu usage with a postgresql process, so it appears to
be doing something.
First, a naming note. psql is the name of the Postgres command line
client. What you are actually referring to is the Postgres server.
Second, without the actual commands, solving this is going to be
difficult to impossible.
That being said, I would suspect some sort of locking issue. I would
start with pg_stat_activity:
https://www.postgresql.org/docs/10/monitoring-stats.html#PG-STAT-ACTIVITY-VIEW
See what it says when you are in the 'hung' phase.
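A query along these lines, run from a second session while the command is
stuck, shows what each backend is doing and whether it is waiting on
anything:

```sql
-- one row per backend; the wait_event columns are NULL when on-CPU
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS running_for,
       left(query, 60)    AS query
FROM pg_stat_activity
WHERE state <> 'idle';
```

If wait_event_type shows 'Lock', you have a locking problem; if the wait
columns stay NULL while one core is pegged, the backend is genuinely
executing, which points at a bad plan rather than a lock.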
--
Adrian Klaver
adrian.klaver@aklaver.com
We're kind of pulling out our hair here, any ideas?
You might try issuing the command
analyze;
right *before* the command that hangs.
The rationale behind this idea is that your script changed data
and the "hung" command uses a wrong plan based on outdated statistics.
By the time you run it manually it would be fast again, because in the
mean time the statistics have been updated automatically.
Analyze forces an immediate update of the statistics on the whole
database.
This can itself take some time, depending on the size of the
database. If my idea works, you can then further optimize by
doing analyze only on the changed tables.
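A sketch of how that placement would look in the script (table names here
are invented placeholders):

```sql
-- ... earlier script steps that change a lot of rows ...
UPDATE orders SET status = 'archived'
WHERE created < now() - interval '1 year';

-- refresh planner statistics before the step that used to hang
ANALYZE;            -- whole database: simplest, but slowest
-- ANALYZE orders;  -- or just the changed table, once known

-- the formerly "hung" command follows here
```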
Bye,
Chris.
We're kind of pulling out our hair here, any ideas?
You might try issuing the command
analyze;
right *before* the command that hangs.
Clarification: I meant to say: "put it into the script at the location right before the command, that hangs".
Bye,
Chris.
On 11/30/2018 09:53 AM, John Smith wrote:
[original message snipped]
We have a seemingly identical problem in v9.6 on our big prod databases. It
worked perfectly in 8.4 *and works perfectly in 9.6 on our small staging servers*.
A giant sql script that drops the oldest child tables from a large set of
partitioned tables (by dropping all relevant triggers, indexes, functions
etc, dropping the oldest children, adding the new children then recreating
all the relevant triggers, indexes, functions etc).
In 9.6, it works perfectly when I run the script directly from psql.
--
Angular momentum makes the world go 'round.
On Fri, Nov 30, 2018 at 8:05 AM Chris Mair <chris@1006.org> wrote:
We're kind of pulling out our hair here, any ideas?
You might try issuing the command
analyze;
right *before* the command that hangs.
You might consider trying the "auto_explain" module (
https://www.postgresql.org/docs/current/auto-explain.html ). This will let
you "trap" the query plan used for the long-running query. Then compare
that query plan with a manually run explain when it runs quickly to see if
they differ. If they do, it suggests that bad statistics are a likely
culprit and Chris' suggestion of running analyze will help.
You only need to analyze those tables used in the query and, most likely,
only tables that were substantially changed within a moderately short
period prior to the start of the query.
Autovacuum, which handles analyze as well, typically defaults to
checking for tables that need attention every minute, so for processes that
have a lot of steps it becomes "luck of the draw" whether or not a needed
analyze is run after a substantial table change and before that table is
used.
We frequently put specific "analyze" statements in such scripts immediately
following bulk-update statements.
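For a one-off run you can load auto_explain in the session without touching
postgresql.conf, something like this (the threshold is just an example
value; LOAD requires superuser):

```sql
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '30s';  -- log plans slower than 30s
SET auto_explain.log_analyze = false;       -- true adds per-node timing, with overhead
```

The captured plans land in the server log, ready to diff against a fast
manual EXPLAIN of the same statement.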
Cheers,
Steve
Ron <ronljohnsonjr@gmail.com> writes:
We have a seemingly identical problem in v9.6 on our big prod databases. It
worked perfectly in 8.4 *and works perfectly in 9.6 on our small staging servers*.
A giant sql script that drops the oldest child tables from a large set of
partitioned tables (by dropping all relevant triggers, indexes, functions
etc, dropping the oldest children, adding the new children then recreating
all the relevant triggers, indexes, functions etc).
A script like that is going to be taking exclusive locks on a whole lot
of tables. My guess is that you are running up against lock conflicts
against other activity in the DB, and that the "hang" is just waiting
for other transactions to release their table locks. Looking into
the pg_locks view could be informative.
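Roughly, a query like this (joined to pg_stat_activity for context) surfaces
backends stuck waiting on a lock:

```sql
-- rows with granted = false are backends waiting on a lock
SELECT l.pid,
       l.locktype,
       l.mode,
       l.relation::regclass AS relation,
       a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;
```

An empty result while the command is stuck means it is not waiting on locks,
and attention shifts back to the query plan.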
regards, tom lane
On Fri, 30 Nov 2018 at 15:53, John Smith <localdevjs@gmail.com> wrote:
We have a long script of sql that we run, several thousand lines of sql. If I execute the script
from start to finish, somewhere in the middle of it, one sql command will hang and take 2 to
3 hours. During this time, "htop" shows 100% cpu usage with a postgresql process, so it
appears to be doing something.
If I stop that command, then immediately rerun the rest of the script starting from the command that "hung",
the "hung" command then completes in 5 seconds and the rest of the script just continues on perfectly.
I assume that JDBC doesn't put a transaction around your whole script
if you send it in one hit?
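One way to check that from the database side is to watch the backend from a
second session while the script runs:

```sql
-- with autocommit, each statement is its own transaction, so xact_start
-- tracks query_start; a long-lived xact_start (or state
-- 'idle in transaction') means something is holding a transaction open
SELECT pid, state, xact_start, query_start
FROM pg_stat_activity
WHERE datname = current_database();
```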
Geoff
On 11/30/2018 10:38 AM, Tom Lane wrote:
A script like that is going to be taking exclusive locks on a whole lot
of tables. My guess is that you are running up against lock conflicts
against other activity in the DB, and that the "hang" is just waiting
for other transactions to release their table locks. Looking into
the pg_locks view could be informative.
We thought of that, but...
1. Even when that process is the only one touching the database, it hangs.
2. It worked perfectly on 8.4.
3. It works perfectly on 9.6 using the small staging database.
4. It works perfectly when running the script from psql.
Before that weekly job runs again, the application team is going to upgrade
to JDBC driver v42.2.5 (don't know what they're using now), and add
protocolVersion=2 to the connection string. I'll post an update next Wednesday.
--
Angular momentum makes the world go 'round.
Thanks Chris and Steve for the analyze suggestion; that will be my next
test!
I like the "add an analyze" idea; the two most likely causes of the
phenomenon (to my mind) are either:
a) Something's getting locked and Tom Lane's idea of checking pg_locks
when you notice it's stopped can help track down the problem.
Further to that, the thing I'd be expecting to see if the problem is
locking is that the connection that's blocked up will be waiting on a
lock held by another connection.
It's near certain that you'll find that the connection doing the work
will have LOTS of locks outstanding; that's not a problem at all;
that's perfectly normal. You need to look keenly for locks that have
not yet been granted.
b) I find it common in my environments to need manual ANALYZE
requests all the time because I'll set up temporary tables (which the
autovacuum daemon can't do anything about); as the stats are
lacking, that leads to awful query plans that make queries run badly.
If you use temporary tables, that's like a heavy "thumb on the scale"
that can lead to awful performance, unless those tables get an ANALYZE
after getting populated.
Unpredictable slowness can certainly result from tables having changed
size leading to pathological query plans. ANALYZE will help.
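The temporary-table pattern above, sketched with invented table and column
names:

```sql
CREATE TEMP TABLE staging AS
SELECT customer_id, sum(amount) AS total
FROM payments
GROUP BY customer_id;

-- autovacuum never sees temp tables, so gather stats by hand
ANALYZE staging;

-- joins against staging now get sane row estimates
SELECT s.customer_id, s.total
FROM staging s
JOIN customers c ON c.id = s.customer_id
WHERE s.total > 1000;
```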
On 11/30/2018 09:53 AM, John Smith wrote:
[original message snipped]
The (apparent) solution for us was to upgrade the JDBC driver from 42.2.1 to
42.2.5. Note, though, that our giant script only does DDL statements.
--
Angular momentum makes the world go 'round.