Delay locking partitions during query execution

Started by David Rowley, over 7 years ago, 30 messages, pgsql-hackers
#1 David Rowley
dgrowleyml@gmail.com

Over on [1] I'm proposing to delay locking partitions of a partitioned
table that's the target of an INSERT or UPDATE command until we first
route a tuple to the partition. Currently, we go and lock all
partitions, even if we just insert a single tuple to a single
partition. The patch in [1] improves the situation when there are many
partitions and only a few tuples to route to just a few partitions.

Over here and along similar lines to the above, but this time I'd like
to take this even further and change things so we don't lock *any*
partitions during AcquireExecutorLocks() and instead just lock them
when we first access them with ExecGetRangeTableRelation(). This
improves the situation when many partitions get run-time pruned, as
we'll never bother locking those at all since we'll never call
ExecGetRangeTableRelation() on them. We'll of course still lock the
partitioned table so that plan invalidation works correctly.

This does make the locking order less well defined, but I'm already
proposing something similar in [1], and over there I've mentioned that
I can't see any huge issues with doing that. We already don't lock all
partitions inside AcquireExecutorLocks() during INSERT with VALUES
anyway.

The attached patch implements this delayed locking. A quick benchmark
shows a pretty good performance improvement when there are a large
number of partitions and run-time pruning prunes almost all of them.

Setup:
create table hashp (a int) partition by hash(a);
select 'create table hashp' || x::text || ' partition of hashp for values with (modulus 10000, remainder ' || x::text || ');'
from generate_series(0,9999) x;
\gexec
insert into hashp select generate_series(1,1000000);

bench.sql:
\set p_a 13315
select * from hashp where a = :p_a;

Master: 10000 parts

$ pgbench -n -f bench.sql -M prepared -T 60 postgres
tps = 108.882749 (excluding connections establishing)
tps = 108.245437 (excluding connections establishing)

delaylock: 10000 parts

$ pgbench -n -f bench.sql -M prepared -T 60 postgres
tps = 1068.289505 (excluding connections establishing)
tps = 1092.797015 (excluding connections establishing)

More could be done to make this quite a bit faster again, but that
mostly requires the range table coming directly from the planner as an
array and marking which array elements require locking with a
Bitmapset. This'll save having to loop over the entire large array
that mostly does not need anything locked. Similar can be done for
ExecCheckRTPerms(), but that's also for another day. With those
changes and some further tweaks done on the Append/MergeAppend code,
tps is about 22k on my machine, which is just slightly slower than
with an equivalent non-partitioned table.

I'll add the attached patch to the January commitfest.

[1]: /messages/by-id/CAKJS1f-=FnMqmQP6qitkD+xEddxw22ySLP-0xFk3JAqUX2yfMw@mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v1-0001-Allow-lock-acquisitions-for-partitions-to-be-dela.patch (application/octet-stream, +51 −14)
#2 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Rowley (#1)
Re: Delay locking partitions during query execution

On 12/3/18 12:42 PM, David Rowley wrote:

...

Master: 10000 parts

$ pgbench -n -f bench.sql -M prepared -T 60 postgres
tps = 108.882749 (excluding connections establishing)
tps = 108.245437 (excluding connections establishing)

delaylock: 10000 parts

$ pgbench -n -f bench.sql -M prepared -T 60 postgres
tps = 1068.289505 (excluding connections establishing)
tps = 1092.797015 (excluding connections establishing)

I'm a bit confused, because I can't reproduce any such speedup. I've
used the attached script that varies the number of partitions (which
worked quite nicely in the INSERT thread), but I'm getting results like
this:

partitions 0 100 1000 10000
--------------------------------------------
master 49 1214 186 11
patched 53 1225 187 11

So I don't see any significant speedup, for some reason :-(

Before I start digging into this, is there something that needs to be
done to enable it?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

select.sql (application/sql)
run-select.sh (application/x-shellscript)
#3 David Rowley
dgrowleyml@gmail.com
In reply to: Tomas Vondra (#2)
Re: Delay locking partitions during query execution

On Fri, 4 Jan 2019 at 02:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I'm a bit confused, because I can't reproduce any such speedup. I've
used the attached script that varies the number of partitions (which
worked quite nicely in the INSERT thread), but I'm getting results like
this:

partitions 0 100 1000 10000
--------------------------------------------
master 49 1214 186 11
patched 53 1225 187 11

So I don't see any significant speedup, for some reason :-(

Before I start digging into this, is there something that needs to be
done to enable it?

Thanks for looking at this.

One thing I seem to quite often forget to mention is that I was running with:

plan_cache_mode = force_generic_plan
max_parallel_workers_per_gather = 0;

Without changing plan_cache_mode, the planner would likely never
favour a generic plan, since such a plan will not appear to be very
efficient due to the lack of costing for run-time partition pruning.

Also, then with a generic plan, the planner will likely want to build
a parallel plan since it sees up to 10k partitions that need to be
scanned. max_parallel_workers_per_gather = 0 puts it right.

(Ideally, the planner would cost run-time pruning, but it's not quite
so simple for RANGE partitions with non-equality operators. Likely
we'll want to fix that one day, but that's not for here)

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#4 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Rowley (#3)
Re: Delay locking partitions during query execution

On 1/3/19 10:50 PM, David Rowley wrote:

On Fri, 4 Jan 2019 at 02:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I'm a bit confused, because I can't reproduce any such speedup. I've
used the attached script that varies the number of partitions (which
worked quite nicely in the INSERT thread), but I'm getting results like
this:

partitions 0 100 1000 10000
--------------------------------------------
master 49 1214 186 11
patched 53 1225 187 11

So I don't see any significant speedup, for some reason :-(

Before I start digging into this, is there something that needs to be
done to enable it?

Thanks for looking at this.

One thing I seem to quite often forget to mention is that I was running with:

plan_cache_mode = force_generic_plan
max_parallel_workers_per_gather = 0;

Without changing plan_cache_mode then the planner would likely never
favour a generic plan since it will not appear to be very efficient
due to the lack of consideration to the costing of run-time partition
pruning.

Also, then with a generic plan, the planner will likely want to build
a parallel plan since it sees up to 10k partitions that need to be
scanned. max_parallel_workers_per_gather = 0 puts it right.

(Ideally, the planner would cost run-time pruning, but it's not quite
so simple for RANGE partitions with non-equality operators. Likely
we'll want to fix that one day, but that's not for here)

Nope, that doesn't seem to make any difference :-( In all cases the
resulting plan (with 10k partitions) looks like this:

test=# explain analyze select * from hashp where a = 13442;

QUERY PLAN
-----------------------------------------------------------------------
Append (cost=0.00..41.94 rows=13 width=4)
(actual time=0.018..0.018 rows=0 loops=1)
-> Seq Scan on hashp6784 (cost=0.00..41.88 rows=13 width=4)
(actual time=0.017..0.018 rows=0 loops=1)
Filter: (a = 13442)
Planning Time: 75.870 ms
Execution Time: 0.471 ms
(5 rows)

and it doesn't change (the timings or shape) no matter how I set any of
the GUCs.

Furthermore, I've repeatedly run into this issue:

test=# \d hashp
ERROR: unrecognized token: "false"
LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...
^
I have no idea why it breaks like this, and it's somewhat random (i.e.
not readily reproducible). But I've only ever seen it with this patch
applied.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5 David Rowley
dgrowleyml@gmail.com
In reply to: Tomas Vondra (#4)
Re: Delay locking partitions during query execution

On Fri, 4 Jan 2019 at 11:48, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Nope, that doesn't seem to make any difference :-( In all cases the
resulting plan (with 10k partitions) looks like this:

test=# explain analyze select * from hashp where a = 13442;

QUERY PLAN
-----------------------------------------------------------------------
Append (cost=0.00..41.94 rows=13 width=4)
(actual time=0.018..0.018 rows=0 loops=1)
-> Seq Scan on hashp6784 (cost=0.00..41.88 rows=13 width=4)
(actual time=0.017..0.018 rows=0 loops=1)
Filter: (a = 13442)
Planning Time: 75.870 ms
Execution Time: 0.471 ms
(5 rows)

and it doesn't change (the timings on shape) no matter how I set any of
the GUCs.

For this to work, run-time pruning needs to take place, so it must be
a PREPAREd statement.

With my test I used:

bench.sql:
\set p_a 13315
select * from hashp where a = :p_a;

$ pgbench -n -f bench.sql -M prepared -T 60 postgres

You'll know you're getting a generic plan when you see "Filter (a =
$1)" and see "Subplans Removed: 9999" below the Append.

Furthermore, I've repeatedly ran into this issue:

test=# \d hashp
ERROR: unrecognized token: "false"
LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...
^
I have no idea why it breaks like this, and it's somewhat random (i.e.
not readily reproducible). But I've only ever seen it with this patch
applied.

You'll probably need to initdb with the patch applied as there's a new
field in RangeTblEntry. If there's a serialised one of these stored in
the catalogue somewhere then the new read function will have
issues reading the old serialised format.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#6 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Rowley (#5)
Re: Delay locking partitions during query execution

On 1/3/19 11:57 PM, David Rowley wrote:

On Fri, 4 Jan 2019 at 11:48, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Nope, that doesn't seem to make any difference :-( In all cases the
resulting plan (with 10k partitions) looks like this:

test=# explain analyze select * from hashp where a = 13442;

QUERY PLAN
-----------------------------------------------------------------------
Append (cost=0.00..41.94 rows=13 width=4)
(actual time=0.018..0.018 rows=0 loops=1)
-> Seq Scan on hashp6784 (cost=0.00..41.88 rows=13 width=4)
(actual time=0.017..0.018 rows=0 loops=1)
Filter: (a = 13442)
Planning Time: 75.870 ms
Execution Time: 0.471 ms
(5 rows)

and it doesn't change (the timings on shape) no matter how I set any of
the GUCs.

For this to work, run-time pruning needs to take place, so it must be
a PREPAREd statement.

With my test I used:

bench.sql:
\set p_a 13315
select * from hashp where a = :p_a;

$ pgbench -n -f bench.sql -M prepared -T 60 postgres

You'll know you're getting a generic plan when you see "Filter (a =
$1)" and see "Subplans Removed: 9999" below the Append.

Indeed, with prepared statements I now see some improvements:

partitions 0 100 1000 10000
--------------------------------------------
master 19 1590 2090 128
patched 18 1780 6820 1130

So, that's nice. I wonder why the throughput drops so fast between 1k
and 10k partitions, but I'll look into that later.

Does this mean this optimization can only ever work with prepared
statements, or can it be made to work with regular plans too?

Furthermore, I've repeatedly ran into this issue:

test=# \d hashp
ERROR: unrecognized token: "false"
LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...
^
I have no idea why it breaks like this, and it's somewhat random (i.e.
not readily reproducible). But I've only ever seen it with this patch
applied.

You'll probably need to initdb with the patch applied as there's a new
field in RangeTblEntry. If there's a serialised one of these stored in
the in the catalogue somewhere then the new read function will have
issues reading the old serialised format.

D'oh! That explains it, because switching from/to patched binaries might
have easily been triggering the error. I've checked that there are no
changes to catalogs, but it did not occur to me adding a new RTE field
could have such consequences ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7 David Rowley
dgrowleyml@gmail.com
In reply to: Tomas Vondra (#6)
Re: Delay locking partitions during query execution

On Fri, 4 Jan 2019 at 13:01, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On 1/3/19 11:57 PM, David Rowley wrote:

You'll know you're getting a generic plan when you see "Filter (a =
$1)" and see "Subplans Removed: 9999" below the Append.

Indeed, with prepared statements I now see some improvements:

partitions 0 100 1000 10000
--------------------------------------------
master 19 1590 2090 128
patched 18 1780 6820 1130

So, that's nice. I wonder why the throughput drops so fast between 1k
and 10k partitions, but I'll look into that later.

Those look strange. Why is it so slow with the non-partitioned case?
I'd have expected that to be the fastest result.

Does this mean this optimization can only ever work with prepared
statements, or can it be made to work with regular plans too?

That's a good question. I can confirm it's only of any use if run-time
pruning occurs during init plan. For this patch to be any use, you'll
need to see a "Subplans Removed: <N>" in the query's EXPLAIN ANALYZE
output. If you don't see this then all Append/MergeAppend subplans
were initialised and the relation lock would have been obtained
regardless of whether delaylock is set for the relation. The effect of the
patch here would just have been to obtain the lock during the first
call to ExecGetRangeTableRelation() for that relation instead of
during AcquireExecutorLocks(). There may actually be a tiny overhead
in this case since AcquireExecutorLocks() must skip the delaylock
relations, but they'll get locked later anyway. I doubt you could
measure that though.

When run-time pruning is able to prune partitions before execution
starts then the optimisation is useful since AcquireExecutorLocks()
won't obtain the lock and ExecGetRangeTableRelation() won't be called
for the pruned partitions' rels, as we don't bother to init the
Append/MergeAppend subplan for those.

I'm a little unsure if there are any cases where this type of run-time
pruning can occur when PREPAREd statements are not in use. Initplan
parameters can't prune before executor run since we need to run the
executor to obtain the values of those. Likewise for evaluation of
volatile functions. So I think run-time pruning before initplan is
only ever going to happen for PARAM_EXTERN type parameters, i.e. with
PREPAREd statements (REF: analyze_partkey_exprs() partprune.c).
Without PREPAREd statements, if the planner itself was unable to prune
the partitions it would already have obtained the lock during
planning, so AcquireExecutorLocks(), in this case, would bump into the
local lock hash table entry and forego trying to obtain the lock
itself. That's not free, but it's significantly faster than obtaining
a lock.

Or in short... it's only good for prepared statements where the
statement's parameters allow for run-time pruning. However, that's a
pretty large case since the planner is still very slow at planning for
large numbers of partitions, meaning it's common (or at least it will
be) for people to use PREPAREd statements and plan_cache_mode =
force_generic_plan.

Furthermore, I've repeatedly ran into this issue:

test=# \d hashp
ERROR: unrecognized token: "false"
LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...
^
I have no idea why it breaks like this, and it's somewhat random (i.e.
not readily reproducible). But I've only ever seen it with this patch
applied.

You'll probably need to initdb with the patch applied as there's a new
field in RangeTblEntry. If there's a serialised one of these stored in
the in the catalogue somewhere then the new read function will have
issues reading the old serialised format.

D'oh! That explains it, because switching from/to patched binaries might
have easily been triggering the error. I've checked that there are no
changes to catalogs, but it did not occur to me adding a new RTE field
could have such consequences ...

schema-wise, no changes, but data-wise, there are changes.

$ pg_dump --schema=pg_catalog --data-only postgres | grep ":rellockmode" | wc -l
121

All of which are inside the pg_rewrite table:

$ pg_dump --schema=pg_catalog --data-only --table=pg_rewrite postgres | grep ":rellockmode" | wc -l
121

I just used ":rellockmode" here as it's a field that exists in RangeTblEntry.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#8 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Rowley (#7)
Re: Delay locking partitions during query execution

On 1/4/19 1:53 AM, David Rowley wrote:

On Fri, 4 Jan 2019 at 13:01, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On 1/3/19 11:57 PM, David Rowley wrote:

You'll know you're getting a generic plan when you see "Filter (a =
$1)" and see "Subplans Removed: 9999" below the Append.

Indeed, with prepared statements I now see some improvements:

partitions 0 100 1000 10000
--------------------------------------------
master 19 1590 2090 128
patched 18 1780 6820 1130

So, that's nice. I wonder why the throughput drops so fast between 1k
and 10k partitions, but I'll look into that later.

Those look strange. Why is it so slow with the non-partitioned case?
I'd have expected that to be the fastest result.

Because there are 1M rows in the table, and it's doing a seqscan.

That also makes the other cases difficult to compare, because with few
partitions there will be multiple pages per partition, scanned
sequentially. And with many partitions it's likely only a single page
with a couple of rows on it. I'll think about constructing a better
benchmark, to make it easier to compare - perhaps by using a single row
per table and/or adding indexes. Or something ...

Does this mean this optimization can only ever work with prepared
statements, or can it be made to work with regular plans too?

That's a good question. I confirm it's only any use of run-time
pruning occurs during init plan. For this patch to be any use, you'll
need to see a "Subplans Removed: <N>" in the query's EXPLAIN ANALYZE
output. If you don't see this then all Append/MergeAppend subplans
were initialised and the relation lock would have been obtained
regardless of if delaylock is set for the relation. The effect of the
patch here would just have been to obtain the lock during the first
call to ExecGetRangeTableRelation() for that relation instead of
during AcquireExecutorLocks(). There may actually be a tiny overhead
in this case since AcquireExecutorLocks() must skip the delaylock
relations, but they'll get locked later anyway. I doubt you could
measure that though.

When run-time pruning is able to prune partitions before execution
starts then the optimisation is useful since AcquireExecutorLocks()
won't obtain the lock and ExecGetRangeTableRelation() won't be called
for all pruned partition's rels as we don't bother to init the
Append/MergeAppend subplan for those.

I'm a little unsure if there are any cases where this type of run-time
pruning can occur when PREPAREd statements are not in use. Initplan
parameters can't prune before executor run since we need to run the
executor to obtain the values of those. Likewise for evaluation of
volatile functions. So I think run-time pruning before initplan is
only ever going to happen for PARAM_EXTERN type parameters, i.e. with
PREPAREd statements (REF: analyze_partkey_exprs() partprune.c).
Without PREPAREd statements, if the planner itself was unable to prune
the partitions it would already have obtained the lock during
planning, so AcquireExecutorLocks(), in this case, would bump into the
local lock hash table entry and forego trying to obtain the lock
itself. That's not free, but it's significantly faster than obtaining
a lock.

Or in short... it only good for prepared statements where the
statement's parameters allow for run-time pruning. However, that's a
pretty large case since the planner is still very slow at planning for
large numbers of partitions, meaning it's common (or at least it will
be) for people to use PREPAREd statement and plan_cache_mode =
force_generic_plan;

OK, thanks for the explanation. One more reason to use prepared
statements in such cases ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9 David Rowley
dgrowleyml@gmail.com
In reply to: Tomas Vondra (#8)
Re: Delay locking partitions during query execution

On Sat, 5 Jan 2019 at 03:12, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

partitions 0 100 1000 10000
--------------------------------------------
master 19 1590 2090 128
patched 18 1780 6820 1130

So, that's nice. I wonder why the throughput drops so fast between 1k
and 10k partitions, but I'll look into that later.

Those look strange. Why is it so slow with the non-partitioned case?
I'd have expected that to be the fastest result.

Because there are 1M rows in the table, and it's doing a seqscan.

Of course. My test did the same, but I didn't consider that because I
had so few rows per partition. Likely just adding an index would have
made it make more sense.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10 David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#1)
Re: Delay locking partitions during query execution

On Tue, 4 Dec 2018 at 00:42, David Rowley <david.rowley@2ndquadrant.com> wrote:

Over here and along similar lines to the above, but this time I'd like
to take this even further and change things so we don't lock *any*
partitions during AcquireExecutorLocks() and instead just lock them
when we first access them with ExecGetRangeTableRelation().

I've attached a rebase version of this. The previous version
conflicted with some changes made in b60c397599.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v2-0001-Allow-lock-acquisitions-for-partitions-to-be-dela.patch (application/octet-stream, +52 −14)
#11 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#7)
Re: Delay locking partitions during query execution

On 2019/01/04 9:53, David Rowley wrote:

Without PREPAREd statements, if the planner itself was unable to prune
the partitions it would already have obtained the lock during
planning, so AcquireExecutorLocks(), in this case, would bump into the
local lock hash table entry and forego trying to obtain the lock
itself. That's not free, but it's significantly faster than obtaining
a lock.

Hmm, AcquireExecutorLocks is only called if prepared statements are used,
and then only if a generic cached plan is reused.

GetCachedPlan -> CheckCachedPlan -> AcquireExecutorLocks

In GetCachedPlan:

    if (!customplan)
    {
        if (CheckCachedPlan(plansource))

Thanks,
Amit

#12 David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#11)
Re: Delay locking partitions during query execution

On Thu, 17 Jan 2019 at 17:18, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2019/01/04 9:53, David Rowley wrote:

Without PREPAREd statements, if the planner itself was unable to prune
the partitions it would already have obtained the lock during
planning, so AcquireExecutorLocks(), in this case, would bump into the
local lock hash table entry and forego trying to obtain the lock
itself. That's not free, but it's significantly faster than obtaining
a lock.

Hmm, AcquireExecutorLocks is only called if prepared statements are used
and that too if a generic cached plan is reused.

GetCachedPlan -> CheckCachedPlan -> AcquireExecutorLocks

Ah, okay. Thanks for the correction, but either way, it's really only
useful when run-time pruning prunes partitions before initialising the
Append/MergeAppend node.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#13 David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#10)
Re: Delay locking partitions during query execution

On Sat, 12 Jan 2019 at 23:42, David Rowley <david.rowley@2ndquadrant.com> wrote:

I've attached a rebase version of this. The previous version
conflicted with some changes made in b60c397599.

I've attached another rebased version. This one fixes up the conflict
with e0c4ec07284.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v3-0001-Allow-lock-acquisitions-for-partitions-to-be-dela.patch (application/octet-stream, +52 −14)
#14 David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#1)
Re: Delay locking partitions during query execution

On Tue, 4 Dec 2018 at 00:42, David Rowley <david.rowley@2ndquadrant.com> wrote:

Over here and along similar lines to the above, but this time I'd like
to take this even further and change things so we don't lock *any*
partitions during AcquireExecutorLocks() and instead just lock them
when we first access them with ExecGetRangeTableRelation(). This
improves the situation when many partitions get run-time pruned, as
we'll never bother locking those at all since we'll never call
ExecGetRangeTableRelation() on them. We'll of course still lock the
partitioned table so that plan invalidation works correctly.

I started looking at this patch again and I do see a problem with it.
Let me explain:

Currently during LockRelationOid() when we obtain a lock on a relation
we'll check for rel cache invalidation messages. This means that
during AcquireExecutorLocks(), as called from CheckCachedPlan(), we
could miss invalidation messages since we're no longer necessarily
locking the partitions.

I think this only creates problems for a partition's reloptions being
changed and cached plans possibly not being properly invalidated when
that happens. I might be wrong about that, but if I'm not, there
might still also be more important reasons to ensure we invalidate the
plan properly in the future.

The following shows the problem:

create table listp (a int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 partition of listp for values in(2);
insert into listp select x from generate_series(1,2)x,
generate_series(1,100000);
analyze listp;

-- session 1;
begin;
prepare q1 as select count(*) from listp;
explain (costs off,analyze, timing off, summary off) execute q1;
QUERY PLAN
----------------------------------------------------------------------------------
Finalize Aggregate (actual rows=1 loops=1)
-> Gather (actual rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (actual rows=1 loops=3)
-> Parallel Append (actual rows=66667 loops=3)
              ->  Parallel Seq Scan on listp2 (actual rows=33333 loops=3)
              ->  Parallel Seq Scan on listp1 (actual rows=100000 loops=1)
(8 rows)
-- session 2
alter table listp1 set (parallel_workers=0);

-- session 1

explain (costs off, analyze, timing off, summary off) execute q1; --
same as previous plan, i.e. still does a parallel scan on listp1.

One way around this would be to always perform an invalidation on the
partition's parent when performing a relcache invalidation on the
partition. We could perhaps invalidate all the way up to the top
level partitioned table, that way we could just obtain a lock on the
target partitioned table during AcquireExecutorLocks(). I'm currently
setting the delaylock flag to false for leaf partitions only.

It's not particularly great to think of invalidating the partitioned
table's relcache entry for this, but it's also not so great that we
lock all partitions when runtime pruning might prune away the
partition anyway. I don't see a way that we can have both, but I'm
happy to hear people's thoughts about this.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#15 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#14)
Re: Delay locking partitions during query execution

On 2019/01/28 10:26, David Rowley wrote:

On Tue, 4 Dec 2018 at 00:42, David Rowley <david.rowley@2ndquadrant.com> wrote:

Over here and along similar lines to the above, but this time I'd like
to take this even further and change things so we don't lock *any*
partitions during AcquireExecutorLocks() and instead just lock them
when we first access them with ExecGetRangeTableRelation(). This
improves the situation when many partitions get run-time pruned, as
we'll never bother locking those at all since we'll never call
ExecGetRangeTableRelation() on them. We'll of course still lock the
partitioned table so that plan invalidation works correctly.

I started looking at this patch again and I do see a problem with it.
Let me explain:

Currently during LockRelationOid() when we obtain a lock on a relation
we'll check for rel cache invalidation messages. This means that
during AcquireExecutorLocks(), as called from CheckCachedPlan(), we
could miss invalidation messages since we're no longer necessarily
locking the partitions.

I think this only creates problems for partition's reloptions being
changed and cached plans possibly being not properly invalidated when
that happens, but I might be wrong about that, but if I'm not, there
still also might be more important reasons to ensure we invalidate the
plan properly in the future.

It seems to me that plancache.c doesn't really need to perform
AcquireExecutorLocks()/LockRelationOid() to learn that a partition's
reloptions property has changed to discard a generic plan and build a new
one. AFAICT, PlanCacheRelCallback() takes care of resetting a cached plan
by observing that an invalidation message that it received either from
the same session or from another session belongs to one of the relations
in PlannedStmt.relationOids. That list must already contain all
partitions' OIDs.

Session 1:
prepare q as select count(*) from p;
explain (costs off) execute q(1);
QUERY PLAN
────────────────────────────────────────────────────
Finalize Aggregate
-> Gather
Workers Planned: 2
-> Partial Aggregate
-> Parallel Append
-> Parallel Seq Scan on p4
-> Parallel Seq Scan on p1
-> Parallel Seq Scan on p2
-> Parallel Seq Scan on p3
-> Parallel Seq Scan on p_def
(10 rows)

-- same session (p1 can no longer use parallel scan)
alter table p1 set (parallel_workers=0);
explain (costs off) execute q;
                     QUERY PLAN
────────────────────────────────────────────────────
 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Parallel Append
                     ->  Seq Scan on p1
                     ->  Parallel Seq Scan on p4
                     ->  Parallel Seq Scan on p2
                     ->  Parallel Seq Scan on p3
                     ->  Parallel Seq Scan on p_def
(10 rows)

-- another session
alter table p1 reset (parallel_workers);

-- back to session 1 (p1 can now use parallel scan)
explain (costs off) execute q;
                     QUERY PLAN
────────────────────────────────────────────────────
 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Parallel Append
                     ->  Parallel Seq Scan on p4
                     ->  Parallel Seq Scan on p1
                     ->  Parallel Seq Scan on p2
                     ->  Parallel Seq Scan on p3
                     ->  Parallel Seq Scan on p_def
(10 rows)

Thanks,
Amit

#16David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#15)
Re: Delay locking partitions during query execution

On Mon, 28 Jan 2019 at 20:45, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

It seems to me that plancache.c doesn't really need to perform
AcquireExecutorLocks()/LockRelationOid() to learn that a partition's
reloptions property has changed to discard a generic plan and build a new
one. AFAICT, PlanCacheRelCallback() takes care of resetting a cached plan
by observing that an invalidation message that it received either from
the same session or from another session belongs to one of the relations
in PlannedStmt.relationOids. That list must already contain all
partitions' OIDs.

Really? So when you tried my case you properly got a plan with a
non-parallel Seq Scan on listp1?

I imagine you didn't with yours since we check for relcache
invalidations at the start of a transaction. I performed both my
EXECUTEs in the same transaction.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#16)
Re: Delay locking partitions during query execution

On 2019/01/28 20:27, David Rowley wrote:

On Mon, 28 Jan 2019 at 20:45, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

It seems to me that plancache.c doesn't really need to perform
AcquireExecutorLocks()/LockRelationOid() to learn that a partition's
reloptions property has changed to discard a generic plan and build a new
one. AFAICT, PlanCacheRelCallback() takes care of resetting a cached plan
by observing that an invalidation message that it received either from
the same session or from another session belongs to one of the relations
in PlannedStmt.relationOids. That list must already contain all
partitions' OIDs.

Really? So when you tried my case you properly got a plan with a
non-parallel Seq Scan on listp1?

I imagine you didn't with yours since we check for relcache
invalidations at the start of a transaction. I performed both my
EXECUTEs in the same transaction.

Yeah, I performed each EXECUTE in a new transaction, so not the same case
as yours. Sorry about the noise.

However, I tried the example as you described, and the plan *doesn't*
change after the concurrent update of reloptions on master (without
the patch) either.

session 1:
begin;
prepare q1 as select count(*) from listp;
explain (costs off, analyze, timing off, summary off) execute q1;
                                QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────
 Finalize Aggregate (actual rows=1 loops=1)
   ->  Gather (actual rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual rows=1 loops=3)
               ->  Parallel Append (actual rows=66667 loops=3)
                     ->  Parallel Seq Scan on listp1 (actual rows=33333 loops=3)
                     ->  Parallel Seq Scan on listp2 (actual rows=50000 loops=2)
(8 rows)

session 2:
alter table listp1 set (parallel_workers=0);

session 1:
explain (costs off, analyze, timing off, summary off) execute q1;
                                QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────
 Finalize Aggregate (actual rows=1 loops=1)
   ->  Gather (actual rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual rows=1 loops=3)
               ->  Parallel Append (actual rows=66667 loops=3)
                     ->  Parallel Seq Scan on listp1 (actual rows=33333 loops=3)
                     ->  Parallel Seq Scan on listp2 (actual rows=50000 loops=2)
(8 rows)

I then built master with -DCATCACHE_FORCE_RELEASE and the plan does
change, but only because syscache misses here and there result in
opening the relevant system catalog, which then calls
AcceptInvalidationMessages().

In the normal build, invalidation messages don't seem to be processed even
by calling AcquireExecutorLocks(), which is perhaps not normal. It seems
that the following condition in LockRelationOid() is never true when called
from AcquireExecutorLocks():

    if (res != LOCKACQUIRE_ALREADY_CLEAR)
    {
        AcceptInvalidationMessages();
        MarkLockClear(locallock);
    }

Bug, maybe?

Thanks,
Amit

#18David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#17)
Re: Delay locking partitions during query execution

On Tue, 29 Jan 2019 at 19:42, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

However, I tried the example as you described and the plan *doesn't*
change due to concurrent update of reloptions with master (without the
patch) either.

Well, I didn't think to try that. I just assumed I had broken it.
Could well be related to lowering the lock levels on setting these
reloptions. Once upon a time, they required an AEL but that was
changed in e5550d5fec6. So previous to that commit you'd have been
blocked from making the concurrent change while the other session was
in a transaction.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#19Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#14)
Re: Delay locking partitions during query execution

On Sun, Jan 27, 2019 at 8:26 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:

One way around this would be to always perform an invalidation on the
partition's parent when performing a relcache invalidation on the
partition. We could perhaps invalidate all the way up to the top
level partitioned table, that way we could just obtain a lock on the
target partitioned table during AcquireExecutorLocks(). I'm currently
only setting the delaylock flag to false for leaf partitions.

Would this problem go away if we adopted the proposal discussed in
/messages/by-id/24823.1544220272@sss.pgh.pa.us and, if so, is that
a good fix?

I don't quite understand why this is happening. It seems like as long
as you take at least one new lock, you'll process *every* pending
invalidation message, and that should trigger replanning as long as
the dependencies are correct. But maybe the issue is that you hold
all the other locks you need already, and the lock on the partition at
issue is only acquired during execution, at which point it's too late
to replan. If so, then I think something along the lines of the above
might make a lot of sense.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#19)
Re: Delay locking partitions during query execution

On 1/31/19 10:31 PM, Robert Haas wrote:

On Sun, Jan 27, 2019 at 8:26 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:

One way around this would be to always perform an invalidation on the
partition's parent when performing a relcache invalidation on the
partition. We could perhaps invalidate all the way up to the top
level partitioned table, that way we could just obtain a lock on the
target partitioned table during AcquireExecutorLocks(). I'm currently
only setting the delaylock flag to false for leaf partitions.

Would this problem go away if we adopted the proposal discussed in
/messages/by-id/24823.1544220272@sss.pgh.pa.us and, if so, is that
a good fix?

I don't quite understand why this is happening. It seems like as long
as you take at least one new lock, you'll process *every* pending
invalidation message, and that should trigger replanning as long as
the dependencies are correct. But maybe the issue is that you hold
all the other locks you need already, and the lock on the partition at
issue is only acquired during execution, at which point it's too late
to replan. If so, then I think something along the lines of the above
might make a lot of sense.

It happens because ConditionalLockRelation does this:

    if (res != LOCKACQUIRE_ALREADY_CLEAR)
    {
        AcceptInvalidationMessages();
        MarkLockClear(locallock);
    }

and with the prepared statement already planned, we end up skipping that
block because res == LOCKACQUIRE_ALREADY_CLEAR, which happens because
GrantLockLocal() (called from LockAcquireExtended()) finds the relation
already locked.

I don't know if this is correct or not, though.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#21David Rowley
dgrowleyml@gmail.com
In reply to: Robert Haas (#19)
#22Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#21)
#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#22)
#24David Rowley
dgrowleyml@gmail.com
In reply to: Tomas Vondra (#23)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#24)
#26David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#25)
#27David Rowley
dgrowleyml@gmail.com
In reply to: Robert Haas (#22)
#28Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Rowley (#27)
#29David Rowley
dgrowleyml@gmail.com
In reply to: Tomas Vondra (#28)
#30Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#29)