Why would log_lock_waits affect a query plan?

Started by Evan Martinover 8 years ago3 messagesgeneral
Jump to latest
#1Evan Martin
postgresql2@realityexists.net

I have an application that imports a lot of data and the does some
queries on it to build some caches in the database, all in one long
transaction. One of those cache updates repeatedly calls a plpgsql
function, which internally does some SQL queries. Sometimes this is
much, much slower than usual: 3-7 hours instead of 12-15 minutes. It was
totally reproducible when it happened, though (running on the same
machine, same input data).

It turns out that the problem only happens when the "log_lock_waits"
setting was OFF! Many machines had it ON (to troubleshoot a different
problem), so they never experienced it.

I eventually tracked it down to the query plan chosen for one particular
query in the plpgsql function: using a Nested Loop makes it fast and
using a Hash Join makes it very slow. Running an ANALYZE on one of the
tables involved fixes the problem - the fast query plan is chosen all
the time. This itself is a bit strange, because I was already running
ANALYZE on all tables after the data import - it seems that I needed to
run it a second time? But what I'd really like to understand is: why did
setting log_lock_waits to ON always change the query plan to use a
Nested Loop? It's just not something I'd ever expect to affect a query plan.

By the way, I also found that the problem does not occur if I commit
before the cache updates. This was with PostgreSQL 9.6.3 running on
Windows x64, if that matters.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Evan Martin (#1)
Re: Why would log_lock_waits affect a query plan?

Evan Martin <postgresql2@realityexists.net> writes:

I have an application that imports a lot of data and the does some
queries on it to build some caches in the database, all in one long
transaction. One of those cache updates repeatedly calls a plpgsql
function, which internally does some SQL queries. Sometimes this is
much, much slower than usual: 3-7 hours instead of 12-15 minutes. It was
totally reproducible when it happened, though (running on the same
machine, same input data).

It turns out that the problem only happens when the "log_lock_waits"
setting was OFF! Many machines had it ON (to troubleshoot a different
problem), so they never experienced it.

I eventually tracked it down to the query plan chosen for one particular
query in the plpgsql function: using a Nested Loop makes it fast and
using a Hash Join makes it very slow. Running an ANALYZE on one of the
tables involved fixes the problem - the fast query plan is chosen all
the time. This itself is a bit strange, because I was already running
ANALYZE on all tables after the data import - it seems that I needed to
run it a second time?

Are you using the problematic function earlier in the process? I wonder
if it's cached a bad plan that dates from when there was much less data
in the table. I also wonder if maybe the specific table is being updated
after the ANALYZEs. In the situation you've got here, you can't expect
any help from auto-analyze; only your own manual ANALYZEs are going to be
able to see the uncommitted data in the tables.

But what I'd really like to understand is: why did
setting log_lock_waits to ON always change the query plan to use a
Nested Loop? It's just not something I'd ever expect to affect a query plan.

TBH, I don't believe it. There are a lot of moving parts here, but
I don't see how that could be relevant.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Evan Martin
postgresql@realityexists.net
In reply to: Tom Lane (#2)
Re: Why would log_lock_waits affect a query plan?

On 19/07/2017 11:52 PM, Tom Lane wrote:

Evan Martin <postgresql2@realityexists.net> writes:

I have an application that imports a lot of data and the does some
queries on it to build some caches in the database, all in one long
transaction. One of those cache updates repeatedly calls a plpgsql
function, which internally does some SQL queries. Sometimes this is
much, much slower than usual: 3-7 hours instead of 12-15 minutes. It was
totally reproducible when it happened, though (running on the same
machine, same input data).
It turns out that the problem only happens when the "log_lock_waits"
setting was OFF! Many machines had it ON (to troubleshoot a different
problem), so they never experienced it.
I eventually tracked it down to the query plan chosen for one particular
query in the plpgsql function: using a Nested Loop makes it fast and
using a Hash Join makes it very slow. Running an ANALYZE on one of the
tables involved fixes the problem - the fast query plan is chosen all
the time. This itself is a bit strange, because I was already running
ANALYZE on all tables after the data import - it seems that I needed to
run it a second time?

Are you using the problematic function earlier in the process? I wonder
if it's cached a bad plan that dates from when there was much less data
in the table. I also wonder if maybe the specific table is being updated
after the ANALYZEs. In the situation you've got here, you can't expect
any help from auto-analyze; only your own manual ANALYZEs are going to be
able to see the uncommitted data in the tables.

No, that function is not used earlier and the table is not updated after
the ANALYZE. I've also tried running DISCARD PLANS right after the
ANALYZE, which made no difference.
All the data import is done first, then ANALYZE, then the cache updates.

But what I'd really like to understand is: why did
setting log_lock_waits to ON always change the query plan to use a
Nested Loop? It's just not something I'd ever expect to affect a query plan.

TBH, I don't believe it. There are a lot of moving parts here, but
I don't see how that could be relevant.

I agree, it's totally bizarre, which is why I'm posting here, but I've
tried it many times now and the results are very clear (at least on my
machine, yet to confirm on others). I change nothing, except
log_lock_waits, re-import the same data and see different query plans
and very different speeds for that function. (For testing purposes I
don't commit the transaction at the end and just start with a clean DB
each time.)

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general