pgsql: Support partition pruning at execution time
Support partition pruning at execution time
Existing partition pruning is only able to work at plan time, for query
quals that appear in the parsed query. This is good but limiting, as
parameter values that only become known later can also be used to prune
partitions further.
This commit adds support for pruning, during execution, Append subnodes
which cannot possibly contain any matching tuples, by evaluating Params
to determine the minimum set of subnodes that can possibly match.
We support more than just simple Params in WHERE clauses. Support
additionally includes:
1. Parameterized Nested Loop Joins: The parameter from the outer side of the
join can be used to determine the minimum set of inner-side partitions to
scan.
2. Initplans: Once an initplan has been executed, we can determine which
partitions match the value it returned. Both cases are sketched below.
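For illustration, here is a minimal sketch of both cases (the table and
column names are hypothetical examples, not taken from the commit, and
whether the planner actually chooses a parameterized nested loop depends
on costs and statistics):
-- a range-partitioned table plus a small driving table
create table probe (t int);
create table measurements (t int, v int) partition by range (t);
create table measurements_p1 partition of measurements
  for values from (0) to (100);
create table measurements_p2 partition of measurements
  for values from (100) to (200);
-- 1. parameterized nested loop join: each outer row's probe.t value
-- determines which measurements partitions the inner side must scan
select * from probe join measurements m on m.t = probe.t;
-- 2. initplan: the subquery runs once, and its result is then used to
-- prune measurements partitions before they are scanned
select * from measurements where t = (select max(t) from probe);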
Partition pruning is performed in two ways. When Params external to the plan
are found to match the partition key, we attempt to prune away unneeded Append
subplans during the initialization of the executor. This allows us to bypass
the initialization of non-matching subplans, meaning they won't appear in the
EXPLAIN or EXPLAIN ANALYZE output.
For parameters whose values only become known during actual execution, the
pruning of these subplans must wait until then. Subplans which are
eliminated during this stage of pruning are still visible in the EXPLAIN
output. To determine whether pruning has actually taken place, the
EXPLAIN ANALYZE output must be examined. If a certain Append subplan was
never executed because its partition was eliminated, the execution
timing area will state "(never executed)". In the case of parameterized
nested loops, on the other hand, the number of loops shown in the EXPLAIN
ANALYZE output for some subplans may simply appear lower than for others,
because those subplans were scanned fewer times. This is because the list
of matching subnodes is re-evaluated whenever a parameter which was found
to match the partition key changes value.
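To see whether pruning happened at run time, the EXPLAIN ANALYZE output
can be inspected along these lines (a sketch continuing the hypothetical
tables above; the exact plan shape depends on costing):
explain (analyze, costs off, summary off, timing off)
select * from measurements where t = (select max(t) from probe);
-- Subplans pruned at executor initialization do not appear in the plan at
-- all and are summarized as "Subplans Removed: N"; subplans pruned only
-- during execution remain visible but are marked "(never executed)".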
This commit required some additional infrastructure for building a data
structure that can translate the matching partition IDs, as returned by
get_matching_partitions, into indexes of the subpaths list found in node
types such as Append, MergeAppend and ModifyTable. This allows us to
translate a list of clauses into a Bitmapset of all the subpath indexes
which must be included to satisfy the clause list.
Author: David Rowley, based on an earlier effort by Beena Emerson
Reviewers: Amit Langote, Robert Haas, Amul Sul, Rajkumar Raghuwanshi,
Jesper Pedersen
Discussion: /messages/by-id/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A@mail.gmail.com
Branch
------
master
Details
-------
https://git.postgresql.org/pg/commitdiff/499be013de65242235ebdde06adb08db887f0ea5
Modified Files
--------------
doc/src/sgml/perform.sgml | 12 +
src/backend/commands/explain.c | 51 +-
src/backend/executor/execPartition.c | 419 +++++++++
src/backend/executor/nodeAppend.c | 268 ++++--
src/backend/nodes/copyfuncs.c | 21 +
src/backend/nodes/nodeFuncs.c | 28 +-
src/backend/nodes/outfuncs.c | 28 +
src/backend/nodes/readfuncs.c | 20 +
src/backend/optimizer/path/allpaths.c | 12 +-
src/backend/optimizer/path/joinrels.c | 2 +-
src/backend/optimizer/plan/createplan.c | 45 +-
src/backend/optimizer/plan/planner.c | 8 +-
src/backend/optimizer/prep/prepunion.c | 6 +-
src/backend/optimizer/util/pathnode.c | 26 +-
src/backend/partitioning/partprune.c | 267 +++++-
src/include/executor/execPartition.h | 77 ++
src/include/nodes/execnodes.h | 12 +-
src/include/nodes/nodes.h | 1 +
src/include/nodes/plannodes.h | 5 +
src/include/nodes/primnodes.h | 23 +
src/include/optimizer/pathnode.h | 2 +-
src/include/partitioning/partprune.h | 14 +
src/test/regress/expected/partition_prune.out | 1135 +++++++++++++++++++++++++
src/test/regress/sql/partition_prune.sql | 344 ++++++++
24 files changed, 2714 insertions(+), 112 deletions(-)
On 8 April 2018 at 09:02, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Support partition pruning at execution time
I'm looking at buildfarm member lapwing's failure [1] now.
Probably it can be fixed by adding a vacuum, but will need a few mins
to test and produce a patch.
[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=lapwing&dt=2018-04-07%2021%3A20%3A01&stg=check
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Support partition pruning at execution time
Buildfarm member lapwing doesn't like this. I can reproduce the
failures here by setting force_parallel_mode = regress. Kind
of looks like instrumentation counts aren't getting propagated
from workers back to the leader?
regards, tom lane
On 8 April 2018 at 09:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Support partition pruning at execution time
Buildfarm member lapwing doesn't like this. I can reproduce the
failures here by setting force_parallel_mode = regress. Kind
of looks like instrumentation counts aren't getting propagated
from workers back to the leader?
I'm looking at this now. I've tried adding vacuum (analyze) to the
tables before the queries in order to have relallvisible set so that
the index only scan's "Heap Fetches" becomes stable, but very weirdly
it still sometimes fetches from the heap after having vacuumed.
To help see what's going on while testing this I added:
select relname,relallvisible from pg_Class where relname like 'tprt%'
and relkind = 'r';
just before the:
explain (analyze, costs off, summary off, timing off)
select * from tbl1 join tprt on tbl1.col1 > tprt.col1;
Sometimes I see:
relname | relallvisible
---------+---------------
tprt_1 | 0
tprt_2 | 1
Other times I see:
relname | relallvisible
---------+---------------
tprt_1 | 0
tprt_2 | 0
I thought maybe something might be holding a pin on a page somewhere
and vacuum could be skipping it, so I added a VERBOSE to the vacuum
and I see:
Skipped 0 pages due to buffer pins, 0 frozen pages.
I'd considered just doing: set enable_indexonlyscan = off; for all
these tests, but I don't have an explanation for this vacuum behaviour
yet.
I'll need to dig through the vacuum code that sets the visibility bit
and see if there's some good reason for this. I have a local patch
ready to go that does set enable_indexonlyscan = off;
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 8 April 2018 at 10:59, David Rowley <david.rowley@2ndquadrant.com> wrote:
Sometimes I see:
relname | relallvisible
---------+---------------
tprt_1 | 0
tprt_2 | 1
Other times I see:
relname | relallvisible
---------+---------------
tprt_1 | 0
tprt_2 | 0
The minimum set of commands I can find to recreate this are:
drop table if exists tprt;
create table tprt (col1 int) partition by range (col1);
create table tprt_1 partition of tprt for values from (1) to (5001);
create index tprt1_idx on tprt_1 (col1);
insert into tprt values (10), (20), (501), (502), (505), (1001), (4500);
vacuum tprt; select relname,relallvisible from pg_Class where relname
like 'tprt%' and relkind = 'r';
I get relallvisible = 0 once in maybe 20 or so attempts.
I didn't manage to get the same without a partitioned table.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 8 April 2018 at 11:26, David Rowley <david.rowley@2ndquadrant.com> wrote:
On 8 April 2018 at 10:59, David Rowley <david.rowley@2ndquadrant.com> wrote:
Sometimes I see:
relname | relallvisible
---------+---------------
tprt_1 | 0
tprt_2 | 1
Other times I see:
relname | relallvisible
---------+---------------
tprt_1 | 0
tprt_2 | 0
The minimum set of commands I can find to recreate this are:
drop table if exists tprt;
create table tprt (col1 int) partition by range (col1);
create table tprt_1 partition of tprt for values from (1) to (5001);
create index tprt1_idx on tprt_1 (col1);
insert into tprt values (10), (20), (501), (502), (505), (1001), (4500);
vacuum tprt; select relname,relallvisible from pg_Class where relname
like 'tprt%' and relkind = 'r';
I get relallvisible = 0 once in maybe 20 or so attempts.
I didn't manage to get the same without a partitioned table.
Anyway, this does not seem related to this patch. So no point in the
build farm blaming it. There might be some reasonable explanation for
this that I just can't think of now.
I've attached a patch which gets rid of the index only scans in the tests.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
runtime_pruning_make_tests_stable.patch (application/octet-stream)
Yeah, I don't quite understand this problem, and I tend to agree that
it likely isn't this patch's fault. However, for the moment I'm going
to avoid pushing the patch you propose because maybe there's a bug
elsewhere and it'd be good to understand it. I'm looking at it now.
If others would prefer me to push David's patch (or do so themselves),
I'm not dead set against that.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tom Lane wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Support partition pruning at execution time
Buildfarm member lapwing doesn't like this. I can reproduce the
failures here by setting force_parallel_mode = regress. Kind
of looks like instrumentation counts aren't getting propagated
from workers back to the leader?
This theory seems correct; the counters are getting incremented, yet
explain later prints them as zero. What is the code that is supposed to
propagate the instrumentation counts?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 8 April 2018 at 12:15, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Yeah, I don't quite understand this problem, and I tend to agree that
it likely isn't this patch's fault. However, for the moment I'm going
to avoid pushing the patch you propose because maybe there's a bug
elsewhere and it'd be good to understand it. I'm looking at it now.
If others would prefer me to push David's patch (or do so themselves),
I'm not dead set against that.
I just wanted to share this:
#!/bin/bash
# repeatedly rebuild and vacuum the table, logging whenever relallvisible stays 0
for i in {1..1000000}
do
    if [ "$(psql --no-psqlrc -w -v ON_ERROR_STOP=0 -d postgres -q \
            -A -F " " -t <<EOF
drop table if exists tprt;
create table tprt (col1 int);
create index tprt_idx on tprt (col1);
insert into tprt values (10), (20), (501),
(502), (505), (1001), (4500);
vacuum tprt;
select relallvisible from pg_class where
relname like 'tprt%' and relkind = 'r';
EOF
)" = "0" ];
    then
        echo "[$(date --iso-8601=seconds)]: 0"
    fi
done
If I run this, I only get the wrong result from the visibility map in
60-second intervals:
Check this output:
[2018-04-08T02:50:34+0000]: 0
[2018-04-08T02:50:34+0000]: 0
[2018-04-08T02:50:34+0000]: 0
[2018-04-08T02:50:34+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
It happens 12 or 13 times on my machine, then does not happen again
for 60 seconds, then happens again.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 8 April 2018 at 14:56, David Rowley <david.rowley@2ndquadrant.com> wrote:
It happens 12 or 13 times on my machine, then does not happen again
for 60 seconds, then happens again.
Setting autovacuum_naptime to 10 seconds makes it occur in 10 second
intervals...
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
"David" == David Rowley <david.rowley@2ndquadrant.com> writes:
It happens 12 or 13 times on my machine, then does not happen again
for 60 seconds, then happens again.
David> Setting autovacuum_naptime to 10 seconds makes it occur in 10
David> second intervals...
Analyze (including auto-analyze on a different table entirely) has a
snapshot, which can hold back OldestXmin, hence preventing the
all-visible flag from being set.
--
Andrew (irc:RhodiumToad)
On 8 April 2018 at 15:02, David Rowley <david.rowley@2ndquadrant.com> wrote:
On 8 April 2018 at 14:56, David Rowley <david.rowley@2ndquadrant.com> wrote:
It happens 12 or 13 times on my machine, then does not happen again
for 60 seconds, then happens again.
Setting autovacuum_naptime to 10 seconds makes it occur in 10 second
intervals...
Ok, I thought it might have been some concurrent vacuum on the table
but the only tables I see being vacuumed are system tables.
I tried performing a manual vacuum of each of these and could not get
it to trigger, but then I did:
select * from pg_class;
from another session and then the script starts spitting out some errors.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 8 April 2018 at 15:21, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
David> Setting autovacuum_naptime to 10 seconds makes it occur in 10
David> second intervals...
Analyze (including auto-analyze on a different table entirely) has a
snapshot, which can hold back OldestXmin, hence preventing the
all-visible flag from being set.
urg, that's true.
Seems like there are no bugs here, then:
begin work;
set transaction isolation level repeatable read;
select * from pg_class;
-- do nothing
makes the script go crazy.
You're right, thanks.
I guess the patch I sent is the way forward with this.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
"David" == David Rowley <david.rowley@2ndquadrant.com> writes:
Setting autovacuum_naptime to 10 seconds makes it occur in 10 second
intervals...
David> Ok, I thought it might have been some concurrent vacuum on the
David> table but the only tables I see being vacuumed are system
David> tables.
It's not vacuum that tends to be the problem, but analyze (on any
table). Lazy-vacuum's snapshots are mostly ignored for computing global
xmin horizons by other vacuums, but analyze's snapshots are not.
David> I tried performing a manual vacuum of each of these and could
David> not get it to trigger, but then I did:
David> select * from pg_class;
David> from another session and then the script starts spitting out
David> some errors.
Obviously, because the select holds a snapshot and therefore also holds
back OldestXmin.
You can't ever assume that data you just inserted will become
all-visible just because you just vacuumed the table, unless you know
that there is NO concurrent activity that might have a snapshot (and no
other possible reason why OldestXmin might be older than your insert).
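One way to see which backends hold a snapshot that is holding OldestXmin
back is to query pg_stat_activity (a rough sketch; backend_xmin is the
column of interest):
select pid, state, backend_xmin, query
from pg_stat_activity
where backend_xmin is not null
order by age(backend_xmin) desc;
-- the row with the greatest age(backend_xmin) is the snapshot holding the
-- xmin horizon back the furthest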
--
Andrew (irc:RhodiumToad)
On 8 April 2018 at 15:34, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
You can't ever assume that data you just inserted will become
all-visible just because you just vacuumed the table, unless you know
that there is NO concurrent activity that might have a snapshot (and no
other possible reason why OldestXmin might be older than your insert).
Thanks. I got it. It just slipped my slightly paranoid and sleep deprived mind.
I've attached my proposed fix for the unstable regression tests. I
removed the vacuums I'd added in the last version and commented why
we're doing set enable_indexonlyscan = off;
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
runtime_pruning_make_tests_stable_v2.patch (application/octet-stream)
"David" == David Rowley <david.rowley@2ndquadrant.com> writes:
David> I've attached my proposed fix for the unstable regression tests.
David> I removed the vacuums I'd added in the last version and
David> commented why we're doing set enable_indexonlyscan = off;
Looks basically sane - I'll try it out and commit it shortly.
--
Andrew (irc:RhodiumToad)
Andrew Gierth wrote:
"David" == David Rowley <david.rowley@2ndquadrant.com> writes:
David> I've attached my proposed fix for the unstable regression tests.
David> I removed the vacuums I'd added in the last version and
David> commented why we're doing set enable_indexonlyscan = off;
Looks basically sane - I'll try it out and commit it shortly.
Thanks for cleaning that up. I'll look into why the test (without this
commit) fails with force_parallel_mode=regress next week.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
"Alvaro" == Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Alvaro> Thanks for cleaning that up. I'll look into why the test
Alvaro> (without this commit) fails with force_parallel_mode=regress
Alvaro> next week.
Seems clear enough to me - the "Heap Fetches" statistic is kept in the
IndexOnlyScanState node in its own field, not part of ss.ps.instrument,
and is therefore not reported from workers to leader.
--
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
"Alvaro" == Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Alvaro> Thanks for cleaning that up. I'll look into why the test
Alvaro> (without this commit) fails with force_parallel_mode=regress
Alvaro> next week.
Seems clear enough to me - the "Heap Fetches" statistic is kept in the
IndexOnlyScanState node in its own field, not part of ss.ps.instrument,
and is therefore not reported from workers to leader.
BTW, pademelon just exhibited a different instability in this test:
*** /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/partition_prune.out Sun Apr 8 01:56:04 2018
--- /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/partition_prune.out Sun Apr 8 17:48:14 2018
***************
*** 1606,1612 ****
-> Partial Aggregate (actual rows=1 loops=3)
-> Parallel Append (actual rows=0 loops=3)
Subplans Removed: 6
! -> Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
-> Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
--- 1606,1612 ----
-> Partial Aggregate (actual rows=1 loops=3)
-> Parallel Append (actual rows=0 loops=3)
Subplans Removed: 6
! -> Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=2)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
-> Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
======================================================================
Dunno quite what to make of that, but this animal previously passed
at commit
b47a86f Sun Apr 8 05:35:42 2018 UTC Attempt to stabilize partition_prune test output.
so it's not a consistent failure.
regards, tom lane
On 9 April 2018 at 09:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:
BTW, pademelon just exhibited a different instability in this test:
*** /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/partition_prune.out Sun Apr 8 01:56:04 2018
--- /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/partition_prune.out Sun Apr 8 17:48:14 2018
***************
*** 1606,1612 ****
-> Partial Aggregate (actual rows=1 loops=3)
-> Parallel Append (actual rows=0 loops=3)
Subplans Removed: 6
! -> Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
-> Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
--- 1606,1612 ----
-> Partial Aggregate (actual rows=1 loops=3)
-> Parallel Append (actual rows=0 loops=3)
Subplans Removed: 6
! -> Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=2)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
-> Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
Reading the code, it looks like a bug in choose_next_subplan_for_worker().
The following needs to be changed for this patch:
/* Advance to next plan. */
pstate->pa_next_plan++;
I'll think a bit harder about the best way to fix and submit a patch
for it later.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services