pgsql: Support partition pruning at execution time

Started by Alvaro Herreraalmost 8 years ago39 messages

alvherre@alvh.no-ip.org

almost 8 years ago

Support partition pruning at execution time

Existing partition pruning is only able to work at plan time, for query
quals that appear in the parsed query. This is good but limiting, as
there can be parameters that appear later that can be usefully used to
further prune partitions.

This commit adds support for pruning subnodes of Append which cannot
possibly contain any matching tuples, during execution, by evaluating
Params to determine the minimum set of subnodes that can possibly match.
We support more than just simple Params in WHERE clauses. Support
additionally includes:

1. Parameterized Nested Loop Joins: The parameter from the outer side of the
join can be used to determine the minimum set of inner side partitions to
scan.

2. Initplans: Once an initplan has been executed we can then determine which
partitions match the value from the initplan.

Partition pruning is performed in two ways. When Params external to the plan
are found to match the partition key we attempt to prune away unneeded Append
subplans during the initialization of the executor. This allows us to bypass
the initialization of non-matching subplans meaning they won't appear in the
EXPLAIN or EXPLAIN ANALYZE output.

For parameters whose value is only known during the actual execution
then the pruning of these subplans must wait. Subplans which are
eliminated during this stage of pruning are still visible in the EXPLAIN
output. In order to determine if pruning has actually taken place, the
EXPLAIN ANALYZE must be viewed. If a certain Append subplan was never
executed due to the elimination of the partition then the execution
timing area will state "(never executed)". Whereas, if, for example in
the case of parameterized nested loops, the number of loops stated in
the EXPLAIN ANALYZE output for certain subplans may appear lower than
others due to the subplan having been scanned fewer times. This is due
to the list of matching subnodes having to be evaluated whenever a
parameter which was found to match the partition key changes.

This commit required some additional infrastructure that permits the
building of a data structure which is able to perform the translation of
the matching partition IDs, as returned by get_matching_partitions, into
the list index of a subpaths list, as exist in node types such as
Append, MergeAppend and ModifyTable. This allows us to translate a list
of clauses into a Bitmapset of all the subpath indexes which must be
included to satisfy the clause list.

Author: David Rowley, based on an earlier effort by Beena Emerson
Reviewers: Amit Langote, Robert Haas, Amul Sul, Rajkumar Raghuwanshi,
Jesper Pedersen
Discussion: /messages/by-id/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/499be013de65242235ebdde06adb08db887f0ea5

Modified Files
--------------
doc/src/sgml/perform.sgml | 12 +
src/backend/commands/explain.c | 51 +-
src/backend/executor/execPartition.c | 419 +++++++++
src/backend/executor/nodeAppend.c | 268 ++++--
src/backend/nodes/copyfuncs.c | 21 +
src/backend/nodes/nodeFuncs.c | 28 +-
src/backend/nodes/outfuncs.c | 28 +
src/backend/nodes/readfuncs.c | 20 +
src/backend/optimizer/path/allpaths.c | 12 +-
src/backend/optimizer/path/joinrels.c | 2 +-
src/backend/optimizer/plan/createplan.c | 45 +-
src/backend/optimizer/plan/planner.c | 8 +-
src/backend/optimizer/prep/prepunion.c | 6 +-
src/backend/optimizer/util/pathnode.c | 26 +-
src/backend/partitioning/partprune.c | 267 +++++-
src/include/executor/execPartition.h | 77 ++
src/include/nodes/execnodes.h | 12 +-
src/include/nodes/nodes.h | 1 +
src/include/nodes/plannodes.h | 5 +
src/include/nodes/primnodes.h | 23 +
src/include/optimizer/pathnode.h | 2 +-
src/include/partitioning/partprune.h | 14 +
src/test/regress/expected/partition_prune.out | 1135 +++++++++++++++++++++++++
src/test/regress/sql/partition_prune.sql | 344 ++++++++
24 files changed, 2714 insertions(+), 112 deletions(-)

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: Alvaro Herrera (#1)

Re: pgsql: Support partition pruning at execution time

On 8 April 2018 at 09:02, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Support partition pruning at execution time

I'm looking at buildfarm member lapwing's failure [1]https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=lapwing&dt=2018-04-07%2021%3A20%3A01&stg=check now.

Probably it can be fixed by adding a vacuum, but will need a few mins
to test and produce a patch.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=lapwing&dt=2018-04-07%2021%3A20%3A01&stg=check

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Tom Lane

tgl@sss.pgh.pa.us

almost 8 years ago

In reply to: Alvaro Herrera (#1)

Re: pgsql: Support partition pruning at execution time

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Support partition pruning at execution time

Buildfarm member lapwing doesn't like this. I can reproduce the
failures here by setting force_parallel_mode = regress. Kind
of looks like instrumentation counts aren't getting propagated
from workers back to the leader?

regards, tom lane

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: Tom Lane (#3)

Re: pgsql: Support partition pruning at execution time

On 8 April 2018 at 09:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Support partition pruning at execution time

Buildfarm member lapwing doesn't like this. I can reproduce the
failures here by setting force_parallel_mode = regress. Kind
of looks like instrumentation counts aren't getting propagated
from workers back to the leader?

I'm looking at this now. I've tried adding vacuum (analyze) to the
tables before the queries in order to have relallvisible set so that
the index only scan's "Heap Fetches" becomes stable, but very weirdly
it still sometimes fetches from the heap after having vacuumed.

To help see what's going on while testing this I added:

select relname,relallvisible from pg_Class where relname like 'tprt%'
and relkind = 'r';

just before the:

explain (analyze, costs off, summary off, timing off)
select * from tbl1 join tprt on tbl1.col1 > tprt.col1;

Sometimes I see:

relname | relallvisible
---------+---------------
tprt_1 | 0
tprt_2 | 1

Other times I see:

relname | relallvisible
---------+---------------
tprt_1 | 0
tprt_2 | 0

I thought maybe something might be holding a pin on a page somewhere
and vacuum could be skipping it, so I added a VERBOSE to the vacuum
and I see:

Skipped 0 pages due to buffer pins, 0 frozen pages.

I'd considered just doing: set enable_indexonly_scan = off; for all
these tests, but I don't have an explanation for this vacuum behaviour
yet.

I'll need to dig through the vacuum code that sets the visibility bit
and see if there's some good reason for this. I have a local patch
ready to go for the set enable_indexonlyscan = off;

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: David Rowley (#4)

Re: pgsql: Support partition pruning at execution time

On 8 April 2018 at 10:59, David Rowley <david.rowley@2ndquadrant.com> wrote:

Sometimes I see:

relname | relallvisible
---------+---------------
tprt_1 | 0
tprt_2 | 1

Other times I see:

relname | relallvisible
---------+---------------
tprt_1 | 0
tprt_2 | 0

The minimum set of commands I can find to recreate this are:

drop table if exists tprt;
create table tprt (col1 int) partition by range (col1);
create table tprt_1 partition of tprt for values from (1) to (5001);
create index tprt1_idx on tprt_1 (col1);
insert into tprt values (10), (20), (501), (502), (505), (1001), (4500);
vacuum tprt; select relname,relallvisible from pg_Class where relname
like 'tprt%' and relkind = 'r';

I get relallvisible = 0 once in maybe 20 or so attempts.

I didn't manage to get the same without a partitioned table.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: David Rowley (#5)

1 attachment(s)

Re: pgsql: Support partition pruning at execution time

On 8 April 2018 at 11:26, David Rowley <david.rowley@2ndquadrant.com> wrote:

On 8 April 2018 at 10:59, David Rowley <david.rowley@2ndquadrant.com> wrote:

Sometimes I see:

relname | relallvisible
---------+---------------
tprt_1 | 0
tprt_2 | 1

Other times I see:

relname | relallvisible
---------+---------------
tprt_1 | 0
tprt_2 | 0

The minimum set of commands I can find to recreate this are:

drop table if exists tprt;
create table tprt (col1 int) partition by range (col1);
create table tprt_1 partition of tprt for values from (1) to (5001);
create index tprt1_idx on tprt_1 (col1);
insert into tprt values (10), (20), (501), (502), (505), (1001), (4500);
vacuum tprt; select relname,relallvisible from pg_Class where relname
like 'tprt%' and relkind = 'r';

I get relallvisible = 0 once in maybe 20 or so attempts.

I didn't manage to get the same without a partitioned table.

Anyway, this does not seem related to this patch. So no point in the
build farm blaming it. There might be some reasonable explanation for
this that I just can't think of now.

I've attached a patch which gets rid of the index only scans in the tests.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

runtime_pruning_make_tests_stable.patchapplication/octet-stream; name=runtime_pruning_make_tests_stable.patchDownload

diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 9762cc9748..de339797f5 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -1347,6 +1347,8 @@ create table ab_a3 partition of ab for values in(3) partition by list (b);
 create table ab_a3_b1 partition of ab_a3 for values in (1);
 create table ab_a3_b2 partition of ab_a3 for values in (2);
 create table ab_a3_b3 partition of ab_a3 for values in (3);
+-- disable to prevent variable "Heap Fetches:" output
+set enable_indexonlyscan = off;
 prepare ab_q1 (int, int, int) as
 select * from ab where a between $1 and $2 and b <= $3;
 -- Execute query 5 times to allow choose_custom_plan
@@ -1709,7 +1711,7 @@ create table lprt_a (a int not null);
 insert into lprt_a select 0 from generate_series(1,100);
 -- and insert some values that we should find.
 insert into lprt_a values(1),(1);
-analyze lprt_a;
+vacuum (analyze) lprt_a;
 create index ab_a2_b1_a_idx on ab_a2_b1 (a);
 create index ab_a2_b2_a_idx on ab_a2_b2 (a);
 create index ab_a2_b3_a_idx on ab_a2_b3 (a);
@@ -1754,8 +1756,8 @@ execute ab_q6 (1, 2, 3);
 (1 row)
 
 explain (analyze, costs off, summary off, timing off) execute ab_q6 (0, 0, 1);
-                                               QUERY PLAN                                               
---------------------------------------------------------------------------------------------------------
+                                            QUERY PLAN                                             
+---------------------------------------------------------------------------------------------------
  Finalize Aggregate (actual rows=1 loops=1)
    ->  Gather (actual rows=2 loops=1)
          Workers Planned: 1
@@ -1765,39 +1767,31 @@ explain (analyze, costs off, summary off, timing off) execute ab_q6 (0, 0, 1);
                      ->  Parallel Seq Scan on lprt_a a (actual rows=51 loops=2)
                            Filter: (a = ANY ('{0,0,1}'::integer[]))
                      ->  Append (actual rows=0 loops=102)
-                           ->  Index Only Scan using ab_a1_b1_a_idx on ab_a1_b1 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b1_a_idx on ab_a1_b1 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a1_b2_a_idx on ab_a1_b2 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b2_a_idx on ab_a1_b2 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a1_b3_a_idx on ab_a1_b3 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b3_a_idx on ab_a1_b3 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b1_a_idx on ab_a2_b1 (never executed)
+                           ->  Index Scan using ab_a2_b1_a_idx on ab_a2_b1 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b2_a_idx on ab_a2_b2 (never executed)
+                           ->  Index Scan using ab_a2_b2_a_idx on ab_a2_b2 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b3_a_idx on ab_a2_b3 (never executed)
+                           ->  Index Scan using ab_a2_b3_a_idx on ab_a2_b3 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b1_a_idx on ab_a3_b1 (never executed)
+                           ->  Index Scan using ab_a3_b1_a_idx on ab_a3_b1 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b2_a_idx on ab_a3_b2 (never executed)
+                           ->  Index Scan using ab_a3_b2_a_idx on ab_a3_b2 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b3_a_idx on ab_a3_b3 (never executed)
+                           ->  Index Scan using ab_a3_b3_a_idx on ab_a3_b3 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-(36 rows)
+(27 rows)
 
 insert into lprt_a values(3),(3);
+vacuum (analyze) lprt_a;
 explain (analyze, costs off, summary off, timing off) execute ab_q6 (1, 0, 3);
-                                               QUERY PLAN                                               
---------------------------------------------------------------------------------------------------------
+                                            QUERY PLAN                                             
+---------------------------------------------------------------------------------------------------
  Finalize Aggregate (actual rows=1 loops=1)
    ->  Gather (actual rows=2 loops=1)
          Workers Planned: 1
@@ -1807,38 +1801,29 @@ explain (analyze, costs off, summary off, timing off) execute ab_q6 (1, 0, 3);
                      ->  Parallel Seq Scan on lprt_a a (actual rows=52 loops=2)
                            Filter: (a = ANY ('{1,0,3}'::integer[]))
                      ->  Append (actual rows=0 loops=104)
-                           ->  Index Only Scan using ab_a1_b1_a_idx on ab_a1_b1 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b1_a_idx on ab_a1_b1 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a1_b2_a_idx on ab_a1_b2 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b2_a_idx on ab_a1_b2 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a1_b3_a_idx on ab_a1_b3 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b3_a_idx on ab_a1_b3 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b1_a_idx on ab_a2_b1 (never executed)
+                           ->  Index Scan using ab_a2_b1_a_idx on ab_a2_b1 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b2_a_idx on ab_a2_b2 (never executed)
+                           ->  Index Scan using ab_a2_b2_a_idx on ab_a2_b2 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b3_a_idx on ab_a2_b3 (never executed)
+                           ->  Index Scan using ab_a2_b3_a_idx on ab_a2_b3 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b1_a_idx on ab_a3_b1 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a3_b1_a_idx on ab_a3_b1 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b2_a_idx on ab_a3_b2 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a3_b2_a_idx on ab_a3_b2 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b3_a_idx on ab_a3_b3 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a3_b3_a_idx on ab_a3_b3 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-(36 rows)
+(27 rows)
 
 explain (analyze, costs off, summary off, timing off) execute ab_q6 (1, 0, 0);
-                                               QUERY PLAN                                               
---------------------------------------------------------------------------------------------------------
+                                            QUERY PLAN                                             
+---------------------------------------------------------------------------------------------------
  Finalize Aggregate (actual rows=1 loops=1)
    ->  Gather (actual rows=2 loops=1)
          Workers Planned: 1
@@ -1849,39 +1834,31 @@ explain (analyze, costs off, summary off, timing off) execute ab_q6 (1, 0, 0);
                            Filter: (a = ANY ('{1,0,0}'::integer[]))
                            Rows Removed by Filter: 1
                      ->  Append (actual rows=0 loops=102)
-                           ->  Index Only Scan using ab_a1_b1_a_idx on ab_a1_b1 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b1_a_idx on ab_a1_b1 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a1_b2_a_idx on ab_a1_b2 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b2_a_idx on ab_a1_b2 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a1_b3_a_idx on ab_a1_b3 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b3_a_idx on ab_a1_b3 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b1_a_idx on ab_a2_b1 (never executed)
+                           ->  Index Scan using ab_a2_b1_a_idx on ab_a2_b1 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b2_a_idx on ab_a2_b2 (never executed)
+                           ->  Index Scan using ab_a2_b2_a_idx on ab_a2_b2 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b3_a_idx on ab_a2_b3 (never executed)
+                           ->  Index Scan using ab_a2_b3_a_idx on ab_a2_b3 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b1_a_idx on ab_a3_b1 (never executed)
+                           ->  Index Scan using ab_a3_b1_a_idx on ab_a3_b1 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b2_a_idx on ab_a3_b2 (never executed)
+                           ->  Index Scan using ab_a3_b2_a_idx on ab_a3_b2 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b3_a_idx on ab_a3_b3 (never executed)
+                           ->  Index Scan using ab_a3_b3_a_idx on ab_a3_b3 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-(37 rows)
+(28 rows)
 
 delete from lprt_a where a = 1;
+vacuum (analyze) lprt_a;
 explain (analyze, costs off, summary off, timing off) execute ab_q6 (1, 0, 0);
-                                           QUERY PLAN                                            
--------------------------------------------------------------------------------------------------
+                                         QUERY PLAN                                         
+--------------------------------------------------------------------------------------------
  Finalize Aggregate (actual rows=1 loops=1)
    ->  Gather (actual rows=2 loops=1)
          Workers Planned: 1
@@ -1892,34 +1869,25 @@ explain (analyze, costs off, summary off, timing off) execute ab_q6 (1, 0, 0);
                            Filter: (a = ANY ('{1,0,0}'::integer[]))
                            Rows Removed by Filter: 1
                      ->  Append (actual rows=0 loops=100)
-                           ->  Index Only Scan using ab_a1_b1_a_idx on ab_a1_b1 (never executed)
+                           ->  Index Scan using ab_a1_b1_a_idx on ab_a1_b1 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a1_b2_a_idx on ab_a1_b2 (never executed)
+                           ->  Index Scan using ab_a1_b2_a_idx on ab_a1_b2 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a1_b3_a_idx on ab_a1_b3 (never executed)
+                           ->  Index Scan using ab_a1_b3_a_idx on ab_a1_b3 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b1_a_idx on ab_a2_b1 (never executed)
+                           ->  Index Scan using ab_a2_b1_a_idx on ab_a2_b1 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b2_a_idx on ab_a2_b2 (never executed)
+                           ->  Index Scan using ab_a2_b2_a_idx on ab_a2_b2 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b3_a_idx on ab_a2_b3 (never executed)
+                           ->  Index Scan using ab_a2_b3_a_idx on ab_a2_b3 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b1_a_idx on ab_a3_b1 (never executed)
+                           ->  Index Scan using ab_a3_b1_a_idx on ab_a3_b1 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b2_a_idx on ab_a3_b2 (never executed)
+                           ->  Index Scan using ab_a3_b2_a_idx on ab_a3_b2 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b3_a_idx on ab_a3_b3 (never executed)
+                           ->  Index Scan using ab_a3_b3_a_idx on ab_a3_b3 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-(37 rows)
+(28 rows)
 
 reset enable_hashjoin;
 reset enable_mergejoin;
@@ -2011,61 +1979,50 @@ create index tprt4_idx on tprt_4 (col1);
 create index tprt5_idx on tprt_5 (col1);
 create index tprt6_idx on tprt_6 (col1);
 insert into tprt values (10), (20), (501), (502), (505), (1001), (4500);
+vacuum (analyze) tprt;
 set enable_hashjoin = off;
 set enable_mergejoin = off;
 explain (analyze, costs off, summary off, timing off)
 select * from tbl1 join tprt on tbl1.col1 > tprt.col1;
-                                  QUERY PLAN                                   
--------------------------------------------------------------------------------
+                                QUERY PLAN                                
+--------------------------------------------------------------------------
  Nested Loop (actual rows=6 loops=1)
    ->  Seq Scan on tbl1 (actual rows=2 loops=1)
    ->  Append (actual rows=3 loops=2)
-         ->  Index Only Scan using tprt1_idx on tprt_1 (actual rows=2 loops=2)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 4
-         ->  Index Only Scan using tprt2_idx on tprt_2 (actual rows=2 loops=1)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 2
-         ->  Index Only Scan using tprt3_idx on tprt_3 (never executed)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt4_idx on tprt_4 (never executed)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt5_idx on tprt_5 (never executed)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt6_idx on tprt_6 (never executed)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 0
-(21 rows)
+         ->  Index Scan using tprt1_idx on tprt_1 (actual rows=2 loops=2)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt2_idx on tprt_2 (actual rows=2 loops=1)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt3_idx on tprt_3 (never executed)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt4_idx on tprt_4 (never executed)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt5_idx on tprt_5 (never executed)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt6_idx on tprt_6 (never executed)
+               Index Cond: (tbl1.col1 > col1)
+(15 rows)
 
 explain (analyze, costs off, summary off, timing off)
 select * from tbl1 join tprt on tbl1.col1 = tprt.col1;
-                                  QUERY PLAN                                   
--------------------------------------------------------------------------------
+                                QUERY PLAN                                
+--------------------------------------------------------------------------
  Nested Loop (actual rows=2 loops=1)
    ->  Seq Scan on tbl1 (actual rows=2 loops=1)
    ->  Append (actual rows=1 loops=2)
-         ->  Index Only Scan using tprt1_idx on tprt_1 (never executed)
+         ->  Index Scan using tprt1_idx on tprt_1 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
+         ->  Index Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 2
-         ->  Index Only Scan using tprt3_idx on tprt_3 (never executed)
+         ->  Index Scan using tprt3_idx on tprt_3 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt4_idx on tprt_4 (never executed)
+         ->  Index Scan using tprt4_idx on tprt_4 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt5_idx on tprt_5 (never executed)
+         ->  Index Scan using tprt5_idx on tprt_5 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt6_idx on tprt_6 (never executed)
+         ->  Index Scan using tprt6_idx on tprt_6 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-(21 rows)
+(15 rows)
 
 select tbl1.col1, tprt.col1 from tbl1
 inner join tprt on tbl1.col1 > tprt.col1
@@ -2091,59 +2048,50 @@ order by tbl1.col1, tprt.col1;
 
 -- Multiple partitions
 insert into tbl1 values (1001), (1010), (1011);
+vacuum (analyze) tbl1;
+set enable_seqscan = off;
+set enable_bitmapscan = off;
 explain (analyze, costs off, summary off, timing off)
 select * from tbl1 inner join tprt on tbl1.col1 > tprt.col1;
-                                  QUERY PLAN                                   
--------------------------------------------------------------------------------
+                                QUERY PLAN                                
+--------------------------------------------------------------------------
  Nested Loop (actual rows=23 loops=1)
    ->  Seq Scan on tbl1 (actual rows=5 loops=1)
    ->  Append (actual rows=5 loops=5)
-         ->  Index Only Scan using tprt1_idx on tprt_1 (actual rows=2 loops=5)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 10
-         ->  Index Only Scan using tprt2_idx on tprt_2 (actual rows=3 loops=4)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 11
-         ->  Index Only Scan using tprt3_idx on tprt_3 (actual rows=1 loops=2)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 2
-         ->  Index Only Scan using tprt4_idx on tprt_4 (never executed)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt5_idx on tprt_5 (never executed)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt6_idx on tprt_6 (never executed)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 0
-(21 rows)
+         ->  Index Scan using tprt1_idx on tprt_1 (actual rows=2 loops=5)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt2_idx on tprt_2 (actual rows=3 loops=4)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt3_idx on tprt_3 (actual rows=1 loops=2)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt4_idx on tprt_4 (never executed)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt5_idx on tprt_5 (never executed)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt6_idx on tprt_6 (never executed)
+               Index Cond: (tbl1.col1 > col1)
+(15 rows)
 
 explain (analyze, costs off, summary off, timing off)
 select * from tbl1 inner join tprt on tbl1.col1 = tprt.col1;
-                                  QUERY PLAN                                   
--------------------------------------------------------------------------------
+                                QUERY PLAN                                
+--------------------------------------------------------------------------
  Nested Loop (actual rows=3 loops=1)
    ->  Seq Scan on tbl1 (actual rows=5 loops=1)
    ->  Append (actual rows=1 loops=5)
-         ->  Index Only Scan using tprt1_idx on tprt_1 (never executed)
+         ->  Index Scan using tprt1_idx on tprt_1 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
+         ->  Index Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 2
-         ->  Index Only Scan using tprt3_idx on tprt_3 (actual rows=0 loops=3)
+         ->  Index Scan using tprt3_idx on tprt_3 (actual rows=0 loops=3)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 1
-         ->  Index Only Scan using tprt4_idx on tprt_4 (never executed)
+         ->  Index Scan using tprt4_idx on tprt_4 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt5_idx on tprt_5 (never executed)
+         ->  Index Scan using tprt5_idx on tprt_5 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt6_idx on tprt_6 (never executed)
+         ->  Index Scan using tprt6_idx on tprt_6 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-(21 rows)
+(15 rows)
 
 select tbl1.col1, tprt.col1 from tbl1
 inner join tprt on tbl1.col1 > tprt.col1
@@ -2190,30 +2138,24 @@ delete from tbl1;
 insert into tbl1 values (4400);
 explain (analyze, costs off, summary off, timing off)
 select * from tbl1 join tprt on tbl1.col1 < tprt.col1;
-                                  QUERY PLAN                                   
--------------------------------------------------------------------------------
+                                QUERY PLAN                                
+--------------------------------------------------------------------------
  Nested Loop (actual rows=1 loops=1)
    ->  Seq Scan on tbl1 (actual rows=1 loops=1)
    ->  Append (actual rows=1 loops=1)
-         ->  Index Only Scan using tprt1_idx on tprt_1 (never executed)
-               Index Cond: (col1 > tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt2_idx on tprt_2 (never executed)
-               Index Cond: (col1 > tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt3_idx on tprt_3 (never executed)
-               Index Cond: (col1 > tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt4_idx on tprt_4 (never executed)
-               Index Cond: (col1 > tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt5_idx on tprt_5 (never executed)
-               Index Cond: (col1 > tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt6_idx on tprt_6 (actual rows=1 loops=1)
-               Index Cond: (col1 > tbl1.col1)
-               Heap Fetches: 1
-(21 rows)
+         ->  Index Scan using tprt1_idx on tprt_1 (never executed)
+               Index Cond: (tbl1.col1 < col1)
+         ->  Index Scan using tprt2_idx on tprt_2 (never executed)
+               Index Cond: (tbl1.col1 < col1)
+         ->  Index Scan using tprt3_idx on tprt_3 (never executed)
+               Index Cond: (tbl1.col1 < col1)
+         ->  Index Scan using tprt4_idx on tprt_4 (never executed)
+               Index Cond: (tbl1.col1 < col1)
+         ->  Index Scan using tprt5_idx on tprt_5 (never executed)
+               Index Cond: (tbl1.col1 < col1)
+         ->  Index Scan using tprt6_idx on tprt_6 (actual rows=1 loops=1)
+               Index Cond: (tbl1.col1 < col1)
+(15 rows)
 
 select tbl1.col1, tprt.col1 from tbl1
 inner join tprt on tbl1.col1 < tprt.col1
@@ -2228,30 +2170,24 @@ delete from tbl1;
 insert into tbl1 values (10000);
 explain (analyze, costs off, summary off, timing off)
 select * from tbl1 join tprt on tbl1.col1 = tprt.col1;
-                               QUERY PLAN                               
-------------------------------------------------------------------------
+                            QUERY PLAN                             
+-------------------------------------------------------------------
  Nested Loop (actual rows=0 loops=1)
    ->  Seq Scan on tbl1 (actual rows=1 loops=1)
    ->  Append (actual rows=0 loops=1)
-         ->  Index Only Scan using tprt1_idx on tprt_1 (never executed)
+         ->  Index Scan using tprt1_idx on tprt_1 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt2_idx on tprt_2 (never executed)
+         ->  Index Scan using tprt2_idx on tprt_2 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt3_idx on tprt_3 (never executed)
+         ->  Index Scan using tprt3_idx on tprt_3 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt4_idx on tprt_4 (never executed)
+         ->  Index Scan using tprt4_idx on tprt_4 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt5_idx on tprt_5 (never executed)
+         ->  Index Scan using tprt5_idx on tprt_5 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt6_idx on tprt_6 (never executed)
+         ->  Index Scan using tprt6_idx on tprt_6 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-(21 rows)
+(15 rows)
 
 select tbl1.col1, tprt.col1 from tbl1
 inner join tprt on tbl1.col1 = tprt.col1
@@ -2260,6 +2196,8 @@ order by tbl1.col1, tprt.col1;
 ------+------
 (0 rows)
 
+reset enable_seqscan;
+reset enable_bitmapscan;
 drop table tbl1, tprt;
 -- Test with columns defined in varying orders between each level
 create table part_abc (a int not null, b int not null, c int not null) partition by list (a);
@@ -2466,3 +2404,4 @@ select * from boolp where a = (select value from boolvalues where not value);
 (9 rows)
 
 drop table boolp;
+reset enable_indexonlyscan;
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index 1d712b4ada..7d75c9adf9 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -256,6 +256,9 @@ create table ab_a3_b1 partition of ab_a3 for values in (1);
 create table ab_a3_b2 partition of ab_a3 for values in (2);
 create table ab_a3_b3 partition of ab_a3 for values in (3);
 
+-- disable to prevent variable "Heap Fetches:" output
+set enable_indexonlyscan = off;
+
 prepare ab_q1 (int, int, int) as
 select * from ab where a between $1 and $2 and b <= $3;
 
@@ -359,7 +362,7 @@ insert into lprt_a select 0 from generate_series(1,100);
 -- and insert some values that we should find.
 insert into lprt_a values(1),(1);
 
-analyze lprt_a;
+vacuum (analyze) lprt_a;
 
 create index ab_a2_b1_a_idx on ab_a2_b1 (a);
 create index ab_a2_b2_a_idx on ab_a2_b2 (a);
@@ -385,11 +388,13 @@ execute ab_q6 (1, 2, 3);
 explain (analyze, costs off, summary off, timing off) execute ab_q6 (0, 0, 1);
 
 insert into lprt_a values(3),(3);
+vacuum (analyze) lprt_a;
 
 explain (analyze, costs off, summary off, timing off) execute ab_q6 (1, 0, 3);
 explain (analyze, costs off, summary off, timing off) execute ab_q6 (1, 0, 0);
 
 delete from lprt_a where a = 1;
+vacuum (analyze) lprt_a;
 
 explain (analyze, costs off, summary off, timing off) execute ab_q6 (1, 0, 0);
 
@@ -435,6 +440,7 @@ create index tprt6_idx on tprt_6 (col1);
 
 insert into tprt values (10), (20), (501), (502), (505), (1001), (4500);
 
+vacuum (analyze) tprt;
 set enable_hashjoin = off;
 set enable_mergejoin = off;
 
@@ -454,6 +460,11 @@ order by tbl1.col1, tprt.col1;
 
 -- Multiple partitions
 insert into tbl1 values (1001), (1010), (1011);
+vacuum (analyze) tbl1;
+
+set enable_seqscan = off;
+set enable_bitmapscan = off;
+
 explain (analyze, costs off, summary off, timing off)
 select * from tbl1 inner join tprt on tbl1.col1 > tprt.col1;
 
@@ -488,6 +499,9 @@ select tbl1.col1, tprt.col1 from tbl1
 inner join tprt on tbl1.col1 = tprt.col1
 order by tbl1.col1, tprt.col1;
 
+reset enable_seqscan;
+reset enable_bitmapscan;
+
 drop table tbl1, tprt;
 
 -- Test with columns defined in varying orders between each level
@@ -581,3 +595,5 @@ explain (analyze, costs off, summary off, timing off)
 select * from boolp where a = (select value from boolvalues where not value);
 
 drop table boolp;
+
+reset enable_indexonlyscan;
\ No newline at end of file

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 8 years ago

In reply to: David Rowley (#4)

Re: pgsql: Support partition pruning at execution time

Yeah, I don't quite understand this problem, and I tend to agree that
it likely isn't this patch's fault. However, for the moment I'm going
to avoid pushing the patch you propose because maybe there's a bug
elsewhere and it'd be good to understand it. I'm looking at it now.

If others would prefer me to push David's patch (or do so themselves),
I'm not dead set against that.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 8 years ago

In reply to: Tom Lane (#3)

Re: pgsql: Support partition pruning at execution time

Tom Lane wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Support partition pruning at execution time

Buildfarm member lapwing doesn't like this. I can reproduce the
failures here by setting force_parallel_mode = regress. Kind
of looks like instrumentation counts aren't getting propagated
from workers back to the leader?

This theory seems correct; the counters are getting incremented, yet
explain later prints them as zero. What is the code that is supposed to
propagate the instrumentation counts?

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: Alvaro Herrera (#7)

Re: pgsql: Support partition pruning at execution time

On 8 April 2018 at 12:15, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Yeah, I don't quite understand this problem, and I tend to agree that
it likely isn't this patch's fault. However, for the moment I'm going
to avoid pushing the patch you propose because maybe there's a bug
elsewhere and it'd be good to understand it. I'm looking at it now.

If others would prefer me to push David's patch (or do so themselves),
I'm not dead set against that.

I just wanted to share this:

#!/bin/bash
for i in {1..1000000}
do
if [ $(psql --no-psqlrc -w -v ON_ERROR_STOP=0 -d postgres -q
-A -F " " -t <<EOF
drop table if exists tprt;
create table tprt (col1 int);
create index tprt_idx on tprt (col1);
insert into tprt values (10), (20), (501),
(502), (505), (1001), (4500);
vacuum tprt;
select relallvisible from pg_Class where
relname like 'tprt%' and relkind = 'r';
EOF
) = "0" ];
then
echo "[$(date --iso-8601=seconds)]: 0"
fi

done

If I run this I only get the wrong result from the visibility map in
60 second intervals:

Check this output:

[2018-04-08T02:50:34+0000]: 0
[2018-04-08T02:50:34+0000]: 0
[2018-04-08T02:50:34+0000]: 0
[2018-04-08T02:50:34+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:50:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:51:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0
[2018-04-08T02:52:35+0000]: 0

It happens 12 or 13 times on my machine, then does not happen again
for 60 seconds, then happens again.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: David Rowley (#9)

Re: pgsql: Support partition pruning at execution time

On 8 April 2018 at 14:56, David Rowley <david.rowley@2ndquadrant.com> wrote:

It happens 12 or 13 times on my machine, then does not happen again
for 60 seconds, then happens again.

Setting autovacuum_naptime to 10 seconds makes it occur in 10 second
intervals...

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#11

Andrew Gierth

andrew@tao11.riddles.org.uk

almost 8 years ago

In reply to: David Rowley (#10)

Re: pgsql: Support partition pruning at execution time

"David" == David Rowley <david.rowley@2ndquadrant.com> writes:

It happens 12 or 13 times on my machine, then does not happen again
for 60 seconds, then happens again.

David> Setting autovacuum_naptime to 10 seconds makes it occur in 10
David> second intervals...

Analyze (including auto-analyze on a different table entirely) has a
snapshot, which can hold back OldestXmin, hence preventing the
all-visible flag from being set.

--
Andrew (irc:RhodiumToad)

#12

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: David Rowley (#10)

Re: pgsql: Support partition pruning at execution time

On 8 April 2018 at 15:02, David Rowley <david.rowley@2ndquadrant.com> wrote:

On 8 April 2018 at 14:56, David Rowley <david.rowley@2ndquadrant.com> wrote:

It happens 12 or 13 times on my machine, then does not happen again
for 60 seconds, then happens again.

Setting autovacuum_naptime to 10 seconds makes it occur in 10 second
intervals...

Ok, I thought it might have been some concurrent vacuum on the table
but the only tables I see being vacuumed are system tables.

I tried performing a manual vacuum of each of these and could not get
it to trigger, but then I did:

select * from pg_class;

from another session and then the script starts spitting out some errors.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#13

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: Andrew Gierth (#11)

Re: pgsql: Support partition pruning at execution time

On 8 April 2018 at 15:21, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:

David> Setting autovacuum_naptime to 10 seconds makes it occur in 10
David> second intervals...

Analyze (including auto-analyze on a different table entirely) has a
snapshot, which can hold back OldestXmin, hence preventing the
all-visible flag from being set.

urg, that's true.

Seems like there's no bugs here then;

begin work;
set transaction isolation level repeatable read;
select * from pg_class;
-- do nothing

makes the script go crazy.

You're right, thanks.

I guess the patch I sent is the way forward with this.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#14

Andrew Gierth

andrew@tao11.riddles.org.uk

almost 8 years ago

In reply to: David Rowley (#12)

Re: pgsql: Support partition pruning at execution time

"David" == David Rowley <david.rowley@2ndquadrant.com> writes:

Setting autovacuum_naptime to 10 seconds makes it occur in 10 second
intervals...

David> Ok, I thought it might have been some concurrent vacuum on the
David> table but the only tables I see being vacuumed are system
David> tables.

It's not vacuum that tends to be the problem, but analyze (on any
table). Lazy-vacuum's snapshots are mostly ignored for computing global
xmin horizons by other vacuums, but analyze's snapshots are not.

David> I tried performing a manual vacuum of each of these and could
David> not get it to trigger, but then I did:

David> select * from pg_class;

David> from another session and then the script starts spitting out
David> some errors.

Obviously, because the select holds a snapshot and therefore also holds
back OldestXmin.

You can't ever assume that data you just inserted will become
all-visible just because you just vacuumed the table, unless you know
that there is NO concurrent activity that might have a snapshot (and no
other possible reason why OldestXmin might be older than your insert).

--
Andrew (irc:RhodiumToad)

#15

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: Andrew Gierth (#14)

1 attachment(s)

Re: pgsql: Support partition pruning at execution time

On 8 April 2018 at 15:34, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:

You can't ever assume that data you just inserted will become
all-visible just because you just vacuumed the table, unless you know
that there is NO concurrent activity that might have a snapshot (and no
other possible reason why OldestXmin might be older than your insert).

Thanks. I got it. It just slipped my slightly paranoid and sleep deprived mind.

I've attached my proposed fix for the unstable regression tests. I
removed the vacuums I'd added in the last version and commented why
we're doing set enable_indesonlyscan = off;

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

runtime_pruning_make_tests_stable_v2.patchapplication/octet-stream; name=runtime_pruning_make_tests_stable_v2.patchDownload

diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 9762cc9748..df3fca025e 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -1347,6 +1347,10 @@ create table ab_a3 partition of ab for values in(3) partition by list (b);
 create table ab_a3_b1 partition of ab_a3 for values in (1);
 create table ab_a3_b2 partition of ab_a3 for values in (2);
 create table ab_a3_b3 partition of ab_a3 for values in (3);
+-- Disallow index only scans as concurrent transactions may stop visibility
+-- bits being set causing "Heap Fetches" to be unstable in the EXPLAIN ANALYZE
+-- output.
+set enable_indexonlyscan = off;
 prepare ab_q1 (int, int, int) as
 select * from ab where a between $1 and $2 and b <= $3;
 -- Execute query 5 times to allow choose_custom_plan
@@ -1754,8 +1758,8 @@ execute ab_q6 (1, 2, 3);
 (1 row)
 
 explain (analyze, costs off, summary off, timing off) execute ab_q6 (0, 0, 1);
-                                               QUERY PLAN                                               
---------------------------------------------------------------------------------------------------------
+                                            QUERY PLAN                                             
+---------------------------------------------------------------------------------------------------
  Finalize Aggregate (actual rows=1 loops=1)
    ->  Gather (actual rows=2 loops=1)
          Workers Planned: 1
@@ -1765,39 +1769,30 @@ explain (analyze, costs off, summary off, timing off) execute ab_q6 (0, 0, 1);
                      ->  Parallel Seq Scan on lprt_a a (actual rows=51 loops=2)
                            Filter: (a = ANY ('{0,0,1}'::integer[]))
                      ->  Append (actual rows=0 loops=102)
-                           ->  Index Only Scan using ab_a1_b1_a_idx on ab_a1_b1 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b1_a_idx on ab_a1_b1 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a1_b2_a_idx on ab_a1_b2 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b2_a_idx on ab_a1_b2 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a1_b3_a_idx on ab_a1_b3 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b3_a_idx on ab_a1_b3 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b1_a_idx on ab_a2_b1 (never executed)
+                           ->  Index Scan using ab_a2_b1_a_idx on ab_a2_b1 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b2_a_idx on ab_a2_b2 (never executed)
+                           ->  Index Scan using ab_a2_b2_a_idx on ab_a2_b2 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b3_a_idx on ab_a2_b3 (never executed)
+                           ->  Index Scan using ab_a2_b3_a_idx on ab_a2_b3 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b1_a_idx on ab_a3_b1 (never executed)
+                           ->  Index Scan using ab_a3_b1_a_idx on ab_a3_b1 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b2_a_idx on ab_a3_b2 (never executed)
+                           ->  Index Scan using ab_a3_b2_a_idx on ab_a3_b2 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b3_a_idx on ab_a3_b3 (never executed)
+                           ->  Index Scan using ab_a3_b3_a_idx on ab_a3_b3 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-(36 rows)
+(27 rows)
 
 insert into lprt_a values(3),(3);
 explain (analyze, costs off, summary off, timing off) execute ab_q6 (1, 0, 3);
-                                               QUERY PLAN                                               
---------------------------------------------------------------------------------------------------------
+                                            QUERY PLAN                                             
+---------------------------------------------------------------------------------------------------
  Finalize Aggregate (actual rows=1 loops=1)
    ->  Gather (actual rows=2 loops=1)
          Workers Planned: 1
@@ -1807,38 +1802,29 @@ explain (analyze, costs off, summary off, timing off) execute ab_q6 (1, 0, 3);
                      ->  Parallel Seq Scan on lprt_a a (actual rows=52 loops=2)
                            Filter: (a = ANY ('{1,0,3}'::integer[]))
                      ->  Append (actual rows=0 loops=104)
-                           ->  Index Only Scan using ab_a1_b1_a_idx on ab_a1_b1 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b1_a_idx on ab_a1_b1 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a1_b2_a_idx on ab_a1_b2 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b2_a_idx on ab_a1_b2 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a1_b3_a_idx on ab_a1_b3 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b3_a_idx on ab_a1_b3 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b1_a_idx on ab_a2_b1 (never executed)
+                           ->  Index Scan using ab_a2_b1_a_idx on ab_a2_b1 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b2_a_idx on ab_a2_b2 (never executed)
+                           ->  Index Scan using ab_a2_b2_a_idx on ab_a2_b2 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b3_a_idx on ab_a2_b3 (never executed)
+                           ->  Index Scan using ab_a2_b3_a_idx on ab_a2_b3 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b1_a_idx on ab_a3_b1 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a3_b1_a_idx on ab_a3_b1 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b2_a_idx on ab_a3_b2 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a3_b2_a_idx on ab_a3_b2 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b3_a_idx on ab_a3_b3 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a3_b3_a_idx on ab_a3_b3 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-(36 rows)
+(27 rows)
 
 explain (analyze, costs off, summary off, timing off) execute ab_q6 (1, 0, 0);
-                                               QUERY PLAN                                               
---------------------------------------------------------------------------------------------------------
+                                            QUERY PLAN                                             
+---------------------------------------------------------------------------------------------------
  Finalize Aggregate (actual rows=1 loops=1)
    ->  Gather (actual rows=2 loops=1)
          Workers Planned: 1
@@ -1849,39 +1835,30 @@ explain (analyze, costs off, summary off, timing off) execute ab_q6 (1, 0, 0);
                            Filter: (a = ANY ('{1,0,0}'::integer[]))
                            Rows Removed by Filter: 1
                      ->  Append (actual rows=0 loops=102)
-                           ->  Index Only Scan using ab_a1_b1_a_idx on ab_a1_b1 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b1_a_idx on ab_a1_b1 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a1_b2_a_idx on ab_a1_b2 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b2_a_idx on ab_a1_b2 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a1_b3_a_idx on ab_a1_b3 (actual rows=0 loops=2)
+                           ->  Index Scan using ab_a1_b3_a_idx on ab_a1_b3 (actual rows=0 loops=2)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b1_a_idx on ab_a2_b1 (never executed)
+                           ->  Index Scan using ab_a2_b1_a_idx on ab_a2_b1 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b2_a_idx on ab_a2_b2 (never executed)
+                           ->  Index Scan using ab_a2_b2_a_idx on ab_a2_b2 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b3_a_idx on ab_a2_b3 (never executed)
+                           ->  Index Scan using ab_a2_b3_a_idx on ab_a2_b3 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b1_a_idx on ab_a3_b1 (never executed)
+                           ->  Index Scan using ab_a3_b1_a_idx on ab_a3_b1 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b2_a_idx on ab_a3_b2 (never executed)
+                           ->  Index Scan using ab_a3_b2_a_idx on ab_a3_b2 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b3_a_idx on ab_a3_b3 (never executed)
+                           ->  Index Scan using ab_a3_b3_a_idx on ab_a3_b3 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-(37 rows)
+(28 rows)
 
 delete from lprt_a where a = 1;
 explain (analyze, costs off, summary off, timing off) execute ab_q6 (1, 0, 0);
-                                           QUERY PLAN                                            
--------------------------------------------------------------------------------------------------
+                                         QUERY PLAN                                         
+--------------------------------------------------------------------------------------------
  Finalize Aggregate (actual rows=1 loops=1)
    ->  Gather (actual rows=2 loops=1)
          Workers Planned: 1
@@ -1892,34 +1869,25 @@ explain (analyze, costs off, summary off, timing off) execute ab_q6 (1, 0, 0);
                            Filter: (a = ANY ('{1,0,0}'::integer[]))
                            Rows Removed by Filter: 1
                      ->  Append (actual rows=0 loops=100)
-                           ->  Index Only Scan using ab_a1_b1_a_idx on ab_a1_b1 (never executed)
+                           ->  Index Scan using ab_a1_b1_a_idx on ab_a1_b1 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a1_b2_a_idx on ab_a1_b2 (never executed)
+                           ->  Index Scan using ab_a1_b2_a_idx on ab_a1_b2 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a1_b3_a_idx on ab_a1_b3 (never executed)
+                           ->  Index Scan using ab_a1_b3_a_idx on ab_a1_b3 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b1_a_idx on ab_a2_b1 (never executed)
+                           ->  Index Scan using ab_a2_b1_a_idx on ab_a2_b1 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b2_a_idx on ab_a2_b2 (never executed)
+                           ->  Index Scan using ab_a2_b2_a_idx on ab_a2_b2 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a2_b3_a_idx on ab_a2_b3 (never executed)
+                           ->  Index Scan using ab_a2_b3_a_idx on ab_a2_b3 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b1_a_idx on ab_a3_b1 (never executed)
+                           ->  Index Scan using ab_a3_b1_a_idx on ab_a3_b1 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b2_a_idx on ab_a3_b2 (never executed)
+                           ->  Index Scan using ab_a3_b2_a_idx on ab_a3_b2 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-                           ->  Index Only Scan using ab_a3_b3_a_idx on ab_a3_b3 (never executed)
+                           ->  Index Scan using ab_a3_b3_a_idx on ab_a3_b3 (never executed)
                                  Index Cond: (a = a.a)
-                                 Heap Fetches: 0
-(37 rows)
+(28 rows)
 
 reset enable_hashjoin;
 reset enable_mergejoin;
@@ -2015,57 +1983,45 @@ set enable_hashjoin = off;
 set enable_mergejoin = off;
 explain (analyze, costs off, summary off, timing off)
 select * from tbl1 join tprt on tbl1.col1 > tprt.col1;
-                                  QUERY PLAN                                   
--------------------------------------------------------------------------------
+                                QUERY PLAN                                
+--------------------------------------------------------------------------
  Nested Loop (actual rows=6 loops=1)
    ->  Seq Scan on tbl1 (actual rows=2 loops=1)
    ->  Append (actual rows=3 loops=2)
-         ->  Index Only Scan using tprt1_idx on tprt_1 (actual rows=2 loops=2)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 4
-         ->  Index Only Scan using tprt2_idx on tprt_2 (actual rows=2 loops=1)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 2
-         ->  Index Only Scan using tprt3_idx on tprt_3 (never executed)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt4_idx on tprt_4 (never executed)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt5_idx on tprt_5 (never executed)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt6_idx on tprt_6 (never executed)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 0
-(21 rows)
+         ->  Index Scan using tprt1_idx on tprt_1 (actual rows=2 loops=2)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt2_idx on tprt_2 (actual rows=2 loops=1)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt3_idx on tprt_3 (never executed)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt4_idx on tprt_4 (never executed)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt5_idx on tprt_5 (never executed)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt6_idx on tprt_6 (never executed)
+               Index Cond: (tbl1.col1 > col1)
+(15 rows)
 
 explain (analyze, costs off, summary off, timing off)
 select * from tbl1 join tprt on tbl1.col1 = tprt.col1;
-                                  QUERY PLAN                                   
--------------------------------------------------------------------------------
+                                QUERY PLAN                                
+--------------------------------------------------------------------------
  Nested Loop (actual rows=2 loops=1)
    ->  Seq Scan on tbl1 (actual rows=2 loops=1)
    ->  Append (actual rows=1 loops=2)
-         ->  Index Only Scan using tprt1_idx on tprt_1 (never executed)
+         ->  Index Scan using tprt1_idx on tprt_1 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
+         ->  Index Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 2
-         ->  Index Only Scan using tprt3_idx on tprt_3 (never executed)
+         ->  Index Scan using tprt3_idx on tprt_3 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt4_idx on tprt_4 (never executed)
+         ->  Index Scan using tprt4_idx on tprt_4 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt5_idx on tprt_5 (never executed)
+         ->  Index Scan using tprt5_idx on tprt_5 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt6_idx on tprt_6 (never executed)
+         ->  Index Scan using tprt6_idx on tprt_6 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-(21 rows)
+(15 rows)
 
 select tbl1.col1, tprt.col1 from tbl1
 inner join tprt on tbl1.col1 > tprt.col1
@@ -2093,57 +2049,45 @@ order by tbl1.col1, tprt.col1;
 insert into tbl1 values (1001), (1010), (1011);
 explain (analyze, costs off, summary off, timing off)
 select * from tbl1 inner join tprt on tbl1.col1 > tprt.col1;
-                                  QUERY PLAN                                   
--------------------------------------------------------------------------------
+                                QUERY PLAN                                
+--------------------------------------------------------------------------
  Nested Loop (actual rows=23 loops=1)
    ->  Seq Scan on tbl1 (actual rows=5 loops=1)
    ->  Append (actual rows=5 loops=5)
-         ->  Index Only Scan using tprt1_idx on tprt_1 (actual rows=2 loops=5)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 10
-         ->  Index Only Scan using tprt2_idx on tprt_2 (actual rows=3 loops=4)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 11
-         ->  Index Only Scan using tprt3_idx on tprt_3 (actual rows=1 loops=2)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 2
-         ->  Index Only Scan using tprt4_idx on tprt_4 (never executed)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt5_idx on tprt_5 (never executed)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt6_idx on tprt_6 (never executed)
-               Index Cond: (col1 < tbl1.col1)
-               Heap Fetches: 0
-(21 rows)
+         ->  Index Scan using tprt1_idx on tprt_1 (actual rows=2 loops=5)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt2_idx on tprt_2 (actual rows=3 loops=4)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt3_idx on tprt_3 (actual rows=1 loops=2)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt4_idx on tprt_4 (never executed)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt5_idx on tprt_5 (never executed)
+               Index Cond: (tbl1.col1 > col1)
+         ->  Index Scan using tprt6_idx on tprt_6 (never executed)
+               Index Cond: (tbl1.col1 > col1)
+(15 rows)
 
 explain (analyze, costs off, summary off, timing off)
 select * from tbl1 inner join tprt on tbl1.col1 = tprt.col1;
-                                  QUERY PLAN                                   
--------------------------------------------------------------------------------
+                                QUERY PLAN                                
+--------------------------------------------------------------------------
  Nested Loop (actual rows=3 loops=1)
    ->  Seq Scan on tbl1 (actual rows=5 loops=1)
    ->  Append (actual rows=1 loops=5)
-         ->  Index Only Scan using tprt1_idx on tprt_1 (never executed)
+         ->  Index Scan using tprt1_idx on tprt_1 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
+         ->  Index Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 2
-         ->  Index Only Scan using tprt3_idx on tprt_3 (actual rows=0 loops=3)
+         ->  Index Scan using tprt3_idx on tprt_3 (actual rows=0 loops=3)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 1
-         ->  Index Only Scan using tprt4_idx on tprt_4 (never executed)
+         ->  Index Scan using tprt4_idx on tprt_4 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt5_idx on tprt_5 (never executed)
+         ->  Index Scan using tprt5_idx on tprt_5 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt6_idx on tprt_6 (never executed)
+         ->  Index Scan using tprt6_idx on tprt_6 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-(21 rows)
+(15 rows)
 
 select tbl1.col1, tprt.col1 from tbl1
 inner join tprt on tbl1.col1 > tprt.col1
@@ -2190,30 +2134,24 @@ delete from tbl1;
 insert into tbl1 values (4400);
 explain (analyze, costs off, summary off, timing off)
 select * from tbl1 join tprt on tbl1.col1 < tprt.col1;
-                                  QUERY PLAN                                   
--------------------------------------------------------------------------------
+                                QUERY PLAN                                
+--------------------------------------------------------------------------
  Nested Loop (actual rows=1 loops=1)
    ->  Seq Scan on tbl1 (actual rows=1 loops=1)
    ->  Append (actual rows=1 loops=1)
-         ->  Index Only Scan using tprt1_idx on tprt_1 (never executed)
-               Index Cond: (col1 > tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt2_idx on tprt_2 (never executed)
-               Index Cond: (col1 > tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt3_idx on tprt_3 (never executed)
-               Index Cond: (col1 > tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt4_idx on tprt_4 (never executed)
-               Index Cond: (col1 > tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt5_idx on tprt_5 (never executed)
-               Index Cond: (col1 > tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt6_idx on tprt_6 (actual rows=1 loops=1)
-               Index Cond: (col1 > tbl1.col1)
-               Heap Fetches: 1
-(21 rows)
+         ->  Index Scan using tprt1_idx on tprt_1 (never executed)
+               Index Cond: (tbl1.col1 < col1)
+         ->  Index Scan using tprt2_idx on tprt_2 (never executed)
+               Index Cond: (tbl1.col1 < col1)
+         ->  Index Scan using tprt3_idx on tprt_3 (never executed)
+               Index Cond: (tbl1.col1 < col1)
+         ->  Index Scan using tprt4_idx on tprt_4 (never executed)
+               Index Cond: (tbl1.col1 < col1)
+         ->  Index Scan using tprt5_idx on tprt_5 (never executed)
+               Index Cond: (tbl1.col1 < col1)
+         ->  Index Scan using tprt6_idx on tprt_6 (actual rows=1 loops=1)
+               Index Cond: (tbl1.col1 < col1)
+(15 rows)
 
 select tbl1.col1, tprt.col1 from tbl1
 inner join tprt on tbl1.col1 < tprt.col1
@@ -2228,30 +2166,24 @@ delete from tbl1;
 insert into tbl1 values (10000);
 explain (analyze, costs off, summary off, timing off)
 select * from tbl1 join tprt on tbl1.col1 = tprt.col1;
-                               QUERY PLAN                               
-------------------------------------------------------------------------
+                            QUERY PLAN                             
+-------------------------------------------------------------------
  Nested Loop (actual rows=0 loops=1)
    ->  Seq Scan on tbl1 (actual rows=1 loops=1)
    ->  Append (actual rows=0 loops=1)
-         ->  Index Only Scan using tprt1_idx on tprt_1 (never executed)
+         ->  Index Scan using tprt1_idx on tprt_1 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt2_idx on tprt_2 (never executed)
+         ->  Index Scan using tprt2_idx on tprt_2 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt3_idx on tprt_3 (never executed)
+         ->  Index Scan using tprt3_idx on tprt_3 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt4_idx on tprt_4 (never executed)
+         ->  Index Scan using tprt4_idx on tprt_4 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt5_idx on tprt_5 (never executed)
+         ->  Index Scan using tprt5_idx on tprt_5 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-         ->  Index Only Scan using tprt6_idx on tprt_6 (never executed)
+         ->  Index Scan using tprt6_idx on tprt_6 (never executed)
                Index Cond: (col1 = tbl1.col1)
-               Heap Fetches: 0
-(21 rows)
+(15 rows)
 
 select tbl1.col1, tprt.col1 from tbl1
 inner join tprt on tbl1.col1 = tprt.col1
@@ -2466,3 +2398,4 @@ select * from boolp where a = (select value from boolvalues where not value);
 (9 rows)
 
 drop table boolp;
+reset enable_indexonlyscan;
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index 1d712b4ada..7fe93bbc04 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -256,6 +256,11 @@ create table ab_a3_b1 partition of ab_a3 for values in (1);
 create table ab_a3_b2 partition of ab_a3 for values in (2);
 create table ab_a3_b3 partition of ab_a3 for values in (3);
 
+-- Disallow index only scans as concurrent transactions may stop visibility
+-- bits being set causing "Heap Fetches" to be unstable in the EXPLAIN ANALYZE
+-- output.
+set enable_indexonlyscan = off;
+
 prepare ab_q1 (int, int, int) as
 select * from ab where a between $1 and $2 and b <= $3;
 
@@ -581,3 +586,5 @@ explain (analyze, costs off, summary off, timing off)
 select * from boolp where a = (select value from boolvalues where not value);
 
 drop table boolp;
+
+reset enable_indexonlyscan;
\ No newline at end of file

#16

Andrew Gierth

andrew@tao11.riddles.org.uk

almost 8 years ago

In reply to: David Rowley (#15)

Re: pgsql: Support partition pruning at execution time

"David" == David Rowley <david.rowley@2ndquadrant.com> writes:

David> I've attached my proposed fix for the unstable regression tests.
David> I removed the vacuums I'd added in the last version and
David> commented why we're doing set enable_indesonlyscan = off;

Looks basically sane - I'll try it out and commit it shortly.

--
Andrew (irc:RhodiumToad)

#17

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 8 years ago

In reply to: Andrew Gierth (#16)

Re: pgsql: Support partition pruning at execution time

Andrew Gierth wrote:

"David" == David Rowley <david.rowley@2ndquadrant.com> writes:

David> I've attached my proposed fix for the unstable regression tests.
David> I removed the vacuums I'd added in the last version and
David> commented why we're doing set enable_indesonlyscan = off;

Looks basically sane - I'll try it out and commit it shortly.

Thanks for cleaning that up. I'll look into why the test (without this
commit) fails with force_parallel_mode=regress next week.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18

Andrew Gierth

andrew@tao11.riddles.org.uk

almost 8 years ago

In reply to: Alvaro Herrera (#17)

Re: pgsql: Support partition pruning at execution time

"Alvaro" == Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Alvaro> Thanks for cleaning that up. I'll look into why the test
Alvaro> (without this commit) fails with force_parallel_mode=regress
Alvaro> next week.

Seems clear enough to me - the "Heap Fetches" statistic is kept in the
IndexOnlyScanState node in its own field, not part of ss.ps.instrument,
and is therefore not reported from workers to leader.

--
Andrew (irc:RhodiumToad)

#19

Tom Lane

tgl@sss.pgh.pa.us

almost 8 years ago

In reply to: Andrew Gierth (#18)

Re: pgsql: Support partition pruning at execution time

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

"Alvaro" == Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
Alvaro> Thanks for cleaning that up. I'll look into why the test
Alvaro> (without this commit) fails with force_parallel_mode=regress
Alvaro> next week.

Seems clear enough to me - the "Heap Fetches" statistic is kept in the
IndexOnlyScanState node in its own field, not part of ss.ps.instrument,
and is therefore not reported from workers to leader.

BTW, pademelon just exhibited a different instability in this test:

*** /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/partition_prune.out	Sun Apr  8 01:56:04 2018
--- /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/partition_prune.out	Sun Apr  8 17:48:14 2018
***************
*** 1606,1612 ****
           ->  Partial Aggregate (actual rows=1 loops=3)
                 ->  Parallel Append (actual rows=0 loops=3)
                       Subplans Removed: 6
!                      ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
                             Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
                       ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
                             Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
--- 1606,1612 ----
           ->  Partial Aggregate (actual rows=1 loops=3)
                 ->  Parallel Append (actual rows=0 loops=3)
                       Subplans Removed: 6
!                      ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=2)
                             Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
                       ->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
                             Filter: ((a >= $1) AND (a <= $2) AND (b < 4))

======================================================================

Dunno quite what to make of that, but this animal previously passed
at commit
b47a86f Sun Apr 8 05:35:42 2018 UTC Attempt to stabilize partition_prune test output.
so it's not a consistent failure.

regards, tom lane

#20

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: Tom Lane (#19)

Re: pgsql: Support partition pruning at execution time

On 9 April 2018 at 09:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, pademelon just exhibited a different instability in this test:

*** /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/partition_prune.out  Sun Apr  8 01:56:04 2018
--- /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/partition_prune.out   Sun Apr  8 17:48:14 2018
***************
*** 1606,1612 ****
->  Partial Aggregate (actual rows=1 loops=3)
->  Parallel Append (actual rows=0 loops=3)
Subplans Removed: 6
!                      ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
--- 1606,1612 ----
->  Partial Aggregate (actual rows=1 loops=3)
->  Parallel Append (actual rows=0 loops=3)
Subplans Removed: 6
!                      ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=2)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))

Reading code it looks like a bug in choose_next_subplan_for_worker():

The following needs to be changed for this patch:

/* Advance to next plan. */
pstate->pa_next_plan++;

I'll think a bit harder about the best way to fix and submit a patch
for it later.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#21

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: David Rowley (#20)

2 attachment(s)

Re: pgsql: Support partition pruning at execution time

On 9 April 2018 at 13:03, David Rowley <david.rowley@2ndquadrant.com> wrote:

On 9 April 2018 at 09:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, pademelon just exhibited a different instability in this test:

*** /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/expected/partition_prune.out  Sun Apr  8 01:56:04 2018
--- /home/bfarm/bf-data/HEAD/pgsql.build/src/test/regress/results/partition_prune.out   Sun Apr  8 17:48:14 2018
***************
*** 1606,1612 ****
->  Partial Aggregate (actual rows=1 loops=3)
->  Parallel Append (actual rows=0 loops=3)
Subplans Removed: 6
!                      ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=1)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
--- 1606,1612 ----
->  Partial Aggregate (actual rows=1 loops=3)
->  Parallel Append (actual rows=0 loops=3)
Subplans Removed: 6
!                      ->  Parallel Seq Scan on ab_a2_b1 (actual rows=0 loops=2)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))
->  Parallel Seq Scan on ab_a2_b2 (actual rows=0 loops=1)
Filter: ((a >= $1) AND (a <= $2) AND (b < 4))

Reading code it looks like a bug in choose_next_subplan_for_worker():

The following needs to be changed for this patch:

/* Advance to next plan. */
pstate->pa_next_plan++;

I'll think a bit harder about the best way to fix and submit a patch
for it later.

Okay, I've written and attached a fix for this. I'm not 100% certain
that this is the cause of the problem on pademelon, but the code does
look wrong, so needs to be fixed. Hopefully, it'll make pademelon
happy, if not I'll think a bit harder about what might be causing that
instability.

I've also attached a 2nd patch to fix a spelling mistake and a
misleading comment in the code.

The misleading comment claimed we unset the extern params so we didn't
perform pruning again using these. I'd failed to update this comment
after I realised that we still need to attempt to prune again with the
external params since quals against the partition key may actually
contain a mix of exec and external params, which would mean that it's
only possible to prune partitions using both types of params. No
actual code needs to be updated since the 2nd pass of pruning uses
"allparams" anyway. We could actually get away without the bms_free()
and set to NULL in the lines below the comment, but I wanted some way
to ensure that we never write any code which calls the function twice
on the same PartitionPruneState, but maybe I'm just overly paranoid
there.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Fix-incorrect-logic-for-choosing-the-next-Parallel-A.patchapplication/octet-stream; name=0001-Fix-incorrect-logic-for-choosing-the-next-Parallel-A.patchDownload

From 18260f490a3d8355e09c18e216fc1ec743df1432 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Mon, 9 Apr 2018 14:49:01 +1200
Subject: [PATCH 1/2] Fix incorrect logic for choosing the next Parallel Append
 subplan

In 499be013de support for pruning unneeded Append subnodes was added. The
logic in that commit was not correct and was not correctly checking if the
next subplan was in fact a valid subplan. This could cause parallel workers
processes to be given a subplan to work on which didn't require any work.

In passing also add a comment to explain what first_partial_plan means.
---
 src/backend/executor/nodeAppend.c | 54 ++++++++++++++++++++++++++++++---------
 src/include/nodes/plannodes.h     |  5 ++++
 2 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index b135b61324..ddca1b2a2a 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -547,6 +547,11 @@ choose_next_subplan_for_leader(AppendState *node)
 			LWLockRelease(&pstate->pa_lock);
 			return false;
 		}
+
+		/*
+		 * We needn't pay attention to as_valid_subplans here as all invalid
+		 * plans have been marked as finished.
+		 */
 		node->as_whichplan--;
 	}
 
@@ -612,18 +617,32 @@ choose_next_subplan_for_worker(AppendState *node)
 	/* Save the plan from which we are starting the search. */
 	node->as_whichplan = pstate->pa_next_plan;
 
-	/* Loop until we find a subplan to execute. */
+	/* Loop until we find a valid subplan to execute. */
 	while (pstate->pa_finished[pstate->pa_next_plan])
 	{
-		if (pstate->pa_next_plan < node->as_nplans - 1)
+		int			nextplan = bms_next_member(node->as_valid_subplans,
+											   pstate->pa_next_plan);
+
+		if (nextplan >= 0)
 		{
-			/* Advance to next plan. */
-			pstate->pa_next_plan++;
+			/* Advance to the next valid plan. */
+			pstate->pa_next_plan = nextplan;
 		}
 		else if (node->as_whichplan > append->first_partial_plan)
 		{
-			/* Loop back to first partial plan. */
-			pstate->pa_next_plan = append->first_partial_plan;
+			/* Try looping back to first valid partial plan. */
+			nextplan = bms_next_member(node->as_valid_subplans,
+									   append->first_partial_plan - 1);
+
+			/*
+			 * There mightn't be any valid partial plans.  If there's none
+			 * then set pa_next_plan to as_whichplan so that we exit in the
+			 * test below, otherwise just select the first valid partial plan.
+			 */
+			if (nextplan < 0)
+				pstate->pa_next_plan = node->as_whichplan;
+			else
+				pstate->pa_next_plan = nextplan;
 		}
 		else
 		{
@@ -644,16 +663,27 @@ choose_next_subplan_for_worker(AppendState *node)
 	}
 
 	/* Pick the plan we found, and advance pa_next_plan one more time. */
-	node->as_whichplan = pstate->pa_next_plan++;
-	if (pstate->pa_next_plan >= node->as_nplans)
+	node->as_whichplan = pstate->pa_next_plan;
+	pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
+										   pstate->pa_next_plan);
+
+	/*
+	 * If there are no more valid plans then try setting the next plan to the
+	 * first valid partial plan.
+	 */
+	if (pstate->pa_next_plan < 0)
 	{
-		if (append->first_partial_plan < node->as_nplans)
-			pstate->pa_next_plan = append->first_partial_plan;
+		int			nextplan = bms_next_member(node->as_valid_subplans,
+											   append->first_partial_plan - 1);
+
+		if (nextplan >= 0)
+			pstate->pa_next_plan = nextplan;
 		else
 		{
 			/*
-			 * We have only non-partial plans, and we already chose the last
-			 * one; so arrange for the other workers to immediately bail out.
+			 * There are no valid partial plans, and we already chose the last
+			 * non-partial plan; so flag that there's nothing more for our
+			 * fellow workers to do.
 			 */
 			pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
 		}
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index c3e5c2c79f..c5c33cd336 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -255,6 +255,11 @@ typedef struct Append
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
 	List	   *appendplans;
+
+	/*
+	 * All 'appendplans' preceding this index are non-partial plans. All
+	 * 'appendplans' from this index onwards are partial plans.
+	 */
 	int			first_partial_plan;
 
 	/*
-- 
2.16.2.windows.1

0002-Fix-misleading-comment-and-typo.patchapplication/octet-stream; name=0002-Fix-misleading-comment-and-typo.patchDownload

From 88d74aa37e95f34000ab19f704994da9e4c2e7f7 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Mon, 9 Apr 2018 14:53:49 +1200
Subject: [PATCH 2/2] Fix misleading comment and typo

Clean up of 499be013de
---
 src/backend/executor/execPartition.c | 5 ++---
 src/include/partitioning/partprune.h | 2 +-
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 50bea2c2ce..b9d5159354 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1524,9 +1524,8 @@ ExecFindInitialMatchingSubPlans(PartitionPruneState *prunestate, int nsubnodes)
 
 	/*
 	 * Record that partition pruning has been performed for external params.
-	 * This partly also serves to ensure we never call this function twice
-	 * with the same input and also so that ExecFindMatchingSubPlans is aware
-	 * that pruning has already been performed for external Params.
+	 * These are not required again and NULLifying them helps ensure nothing
+	 * accidentally calls this function twice on the same PartitionPruneState.
 	 */
 	bms_free(prunestate->extparams);
 	prunestate->extparams = NULL;
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index f68b6c9935..2ae2fd16ed 100644
--- a/src/include/partitioning/partprune.h
+++ b/src/include/partitioning/partprune.h
@@ -40,7 +40,7 @@ typedef struct PartitionPruneContext
 
 	/*
 	 * Can be set when the context is used from the executor to allow params
-	 * found matching the partition key to be evaulated.
+	 * found matching the partition key to be evaluated.
 	 */
 	PlanState  *planstate;
 
-- 
2.16.2.windows.1

#22

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: David Rowley (#21)

Re: pgsql: Support partition pruning at execution time

On 9 April 2018 at 15:03, David Rowley <david.rowley@2ndquadrant.com> wrote:

On 9 April 2018 at 13:03, David Rowley <david.rowley@2ndquadrant.com> wrote:
Okay, I've written and attached a fix for this. I'm not 100% certain
that this is the cause of the problem on pademelon, but the code does
look wrong, so needs to be fixed. Hopefully, it'll make pademelon
happy, if not I'll think a bit harder about what might be causing that
instability.

Added to PG11 open items list [1]https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Open_Issues.

[1]: https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Open_Issues

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#23

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 8 years ago

In reply to: David Rowley (#21)

Re: pgsql: Support partition pruning at execution time

David Rowley wrote:

Okay, I've written and attached a fix for this. I'm not 100% certain
that this is the cause of the problem on pademelon, but the code does
look wrong, so needs to be fixed. Hopefully, it'll make pademelon
happy, if not I'll think a bit harder about what might be causing that
instability.

Pushed it just now. Let's see what happens with pademelon now.

The misleading comment claimed we unset the extern params so we didn't
perform pruning again using these. I'd failed to update this comment
after I realised that we still need to attempt to prune again with the
external params since quals against the partition key may actually
contain a mix of exec and external params, which would mean that it's
only possible to prune partitions using both types of params. No
actual code needs to be updated since the 2nd pass of pruning uses
"allparams" anyway. We could actually get away without the bms_free()
and set to NULL in the lines below the comment, but I wanted some way
to ensure that we never write any code which calls the function twice
on the same PartitionPruneState, but maybe I'm just overly paranoid
there.

Pushed this earlier today, thanks.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#24

Tom Lane

tgl@sss.pgh.pa.us

almost 8 years ago

In reply to: Alvaro Herrera (#23)

Re: pgsql: Support partition pruning at execution time

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

David Rowley wrote:

Okay, I've written and attached a fix for this. I'm not 100% certain
that this is the cause of the problem on pademelon, but the code does
look wrong, so needs to be fixed. Hopefully, it'll make pademelon
happy, if not I'll think a bit harder about what might be causing that
instability.

Pushed it just now. Let's see what happens with pademelon now.

I've had pademelon's host running a "make installcheck" loop all day
trying to reproduce the problem. I haven't gotten a bite yet (although
at 15+ minutes per cycle, this isn't a huge number of tests). I think
we were remarkably (un)lucky to see the problem so quickly after the
initial commit, and I'm afraid pademelon isn't going to help us prove
much about whether this was the same issue.

This does remind me quite a bit though of the ongoing saga with the
postgres_fdw test instability. Given the frequency with which that's
failing in the buildfarm, you would not think it's impossible to
reproduce outside the buildfarm, and yet I'm here to tell you that
it's pretty damn hard. I haven't succeeded yet, and that's not for
lack of trying. Could there be something about the buildfarm
environment that makes these sorts of things more likely?

regards, tom lane

#25

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 8 years ago

In reply to: Andrew Gierth (#18)

2 attachment(s)

Re: pgsql: Support partition pruning at execution time

Andrew Gierth wrote:

"Alvaro" == Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Alvaro> Thanks for cleaning that up. I'll look into why the test
Alvaro> (without this commit) fails with force_parallel_mode=regress
Alvaro> next week.

Seems clear enough to me - the "Heap Fetches" statistic is kept in the
IndexOnlyScanState node in its own field, not part of ss.ps.instrument,
and is therefore not reported from workers to leader.

Right, thanks for the pointer.

So here's a patch that makes thing behave as expected. I noticed that
instrument->nfiltered3 was available, so I used that to keep the
counter. I wanted to print it using show_instrumentation_count (which
has the nice property that you don't even have to test for es->analyze),
but it doesn't work, because it divides the number by nloops, which is
not what we want in this case. (It also doesn't print if the counter is
zero, which maybe is desirable for the other counters but probably not
for this one).

I then noticed that support for nfiltered3 was incomplete; hence 0001.
(I then noticed that nfiltered3 was added for MERGE. It looks wrong to
me.)

Frankly, I don't like this. I would rather have an instrument->ntuples2
rather than these "divide this by nloops, sometimes" schizoid counters.
This is already being misused by ON CONFLICT (see "other_path" in
show_modifytable_info). But it seems like a correct fix would require
more code.

Anyway, the partition_prune test works correctly now (after reverting
AndrewSN's b47a86f5008f26) in both force_parallel_mode settings.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-print-nfiltered3.patchtext/plain; charset=us-asciiDownload

From 10b29c7706efd00279182164de14592643ff7f40 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Mon, 9 Apr 2018 18:42:04 -0300
Subject: [PATCH 1/2] print nfiltered3

---
 src/backend/commands/explain.c    | 4 +++-
 src/backend/executor/instrument.c | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 989b6aad67..347fbbe1cf 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2592,7 +2592,9 @@ show_instrumentation_count(const char *qlabel, int which,
 	if (!es->analyze || !planstate->instrument)
 		return;
 
-	if (which == 2)
+	if (which == 3)
+		nfiltered = planstate->instrument->nfiltered3;
+	else if (which == 2)
 		nfiltered = planstate->instrument->nfiltered2;
 	else
 		nfiltered = planstate->instrument->nfiltered1;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 86252cee1f..d3045f57ac 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -159,6 +159,7 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
 	dst->nloops += add->nloops;
 	dst->nfiltered1 += add->nfiltered1;
 	dst->nfiltered2 += add->nfiltered2;
+	dst->nfiltered3 += add->nfiltered3;
 
 	/* Add delta of buffer usage since entry to node's totals */
 	if (dst->need_bufusage)
-- 
2.11.0

0002-use-nfiltered3-instead-of-ad-hoc-counter.patchtext/plain; charset=us-asciiDownload

From acaf8e8af643f05605eab21ad3e5a04a8714f06a Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Mon, 9 Apr 2018 18:41:20 -0300
Subject: [PATCH 2/2] use nfiltered3 instead of ad-hoc counter

---
 src/backend/commands/explain.c           | 8 ++------
 src/backend/executor/nodeIndexonlyscan.c | 3 +--
 src/include/nodes/execnodes.h            | 1 -
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 347fbbe1cf..434051e7df 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1459,12 +1459,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			if (es->analyze)
-			{
-				long		heapFetches =
-					((IndexOnlyScanState *) planstate)->ioss_HeapFetches;
-
-				ExplainPropertyInteger("Heap Fetches", NULL, heapFetches, es);
-			}
+				ExplainPropertyInteger("Heap Fetches", NULL,
+									   planstate->instrument->nfiltered3, es);
 			break;
 		case T_BitmapIndexScan:
 			show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index ddc0ae9061..ef835f13c2 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -162,7 +162,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 			/*
 			 * Rats, we have to visit the heap to check visibility.
 			 */
-			node->ioss_HeapFetches++;
+			InstrCountFiltered3(node, 1);
 			tuple = index_fetch_heap(scandesc);
 			if (tuple == NULL)
 				continue;		/* no visible tuple, try next index entry */
@@ -509,7 +509,6 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	indexstate->ss.ps.plan = (Plan *) node;
 	indexstate->ss.ps.state = estate;
 	indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
-	indexstate->ioss_HeapFetches = 0;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 06456f07cc..d00839c161 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1387,7 +1387,6 @@ typedef struct IndexOnlyScanState
 	Relation	ioss_RelationDesc;
 	IndexScanDesc ioss_ScanDesc;
 	Buffer		ioss_VMBuffer;
-	long		ioss_HeapFetches;
 	Size		ioss_PscanLen;
 } IndexOnlyScanState;
 
-- 
2.11.0

#26

Tom Lane

tgl@sss.pgh.pa.us

almost 8 years ago

In reply to: Alvaro Herrera (#25)

Re: pgsql: Support partition pruning at execution time

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

I then noticed that support for nfiltered3 was incomplete; hence 0001.
(I then noticed that nfiltered3 was added for MERGE. It looks wrong to
me.)

In that case, it's likely to go away when Simon reverts MERGE. Suggest
you hold off committing until he's done so, as he probably already has
some conflicts to deal with and doesn't need another.

Frankly, I don't like this. I would rather have an instrument->ntuples2
rather than these "divide this by nloops, sometimes" schizoid counters.
This is already being misused by ON CONFLICT (see "other_path" in
show_modifytable_info). But it seems like a correct fix would require
more code.

The question then becomes whether to put back nfiltered3, or to do
something more to your liking. Think I'd vote for the latter.

regards, tom lane

#27

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: Alvaro Herrera (#25)

Re: pgsql: Support partition pruning at execution time

On 10 April 2018 at 09:58, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

I then noticed that support for nfiltered3 was incomplete; hence 0001.
(I then noticed that nfiltered3 was added for MERGE. It looks wrong to
me.)

Frankly, I don't like this. I would rather have an instrument->ntuples2
rather than these "divide this by nloops, sometimes" schizoid counters.
This is already being misused by ON CONFLICT (see "other_path" in
show_modifytable_info). But it seems like a correct fix would require
more code.

+1 for a new field for this and making ON CONFLICT use it.

ntuples2 seems fine. If we make it too specific then we'll end up with
lots more than we need.

I don't think re-using the filter counters are very good when it's not
for filtering.

MERGE was probably just following the example made by ON CONFLICT.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#28

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 8 years ago

In reply to: Tom Lane (#26)

1 attachment(s)

Re: pgsql: Support partition pruning at execution time

Changed CC to pgsql-hackers.

Tom Lane wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Frankly, I don't like this. I would rather have an instrument->ntuples2
rather than these "divide this by nloops, sometimes" schizoid counters.
This is already being misused by ON CONFLICT (see "other_path" in
show_modifytable_info). But it seems like a correct fix would require
more code.

The question then becomes whether to put back nfiltered3, or to do
something more to your liking. Think I'd vote for the latter.

Doing it properly is not a lot of code actually. Patch attached. ON
CONFLICT is not changed by this patch, but that's a straightforward
change.

Questions:
1. Do we want to back-patch this to 10? I suppose (without checking)
that EXPLAIN ANALYZE is already reporting bogus numbers for parallel
index-only scans, so I think we should do that.

2. Do we want to revert Andrew's test stabilization patch? If I
understand correctly, the problem is the inverse of what was diagnosed:
"any running transaction at the time of the test could prevent pages
from being set as all-visible". That's correct, but the test doesn't
depend on pages being all-visible -- quite the contrary, it depends on
the pages NOT being all-visible (which is why the HeapFetches counts are
all non-zero). Since the pages contain very few tuples, autovacuum
should never process the tables anyway.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

heapfetches.patchtext/plain; charset=us-asciiDownload

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 989b6aad67..fe0419311b 100644
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
***************
*** 1459,1470 **** ExplainNode(PlanState *planstate, List *ancestors,
  				show_instrumentation_count("Rows Removed by Filter", 1,
  										   planstate, es);
  			if (es->analyze)
! 			{
! 				long		heapFetches =
! 					((IndexOnlyScanState *) planstate)->ioss_HeapFetches;
! 
! 				ExplainPropertyInteger("Heap Fetches", NULL, heapFetches, es);
! 			}
  			break;
  		case T_BitmapIndexScan:
  			show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
--- 1459,1466 ----
  				show_instrumentation_count("Rows Removed by Filter", 1,
  										   planstate, es);
  			if (es->analyze)
! 				ExplainPropertyInteger("Heap Fetches", NULL,
! 									   planstate->instrument->ntuples2, es);
  			break;
  		case T_BitmapIndexScan:
  			show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
diff --git a/src/backend/executor/insindex 86252cee1f..fe5d55904d 100644
*** a/src/backend/executor/instrument.c
--- b/src/backend/executor/instrument.c
***************
*** 156,161 **** InstrAggNode(Instrumentation *dst, Instrumentation *add)
--- 156,162 ----
  	dst->startup += add->startup;
  	dst->total += add->total;
  	dst->ntuples += add->ntuples;
+ 	dst->ntuples2 += add->ntuples2;
  	dst->nloops += add->nloops;
  	dst->nfiltered1 += add->nfiltered1;
  	dst->nfiltered2 += add->nfiltered2;
diff --git a/src/backend/executor/nodeInindex ddc0ae9061..3a02a99621 100644
*** a/src/backend/executor/nodeIndexonlyscan.c
--- b/src/backend/executor/nodeIndexonlyscan.c
***************
*** 162,168 **** IndexOnlyNext(IndexOnlyScanState *node)
  			/*
  			 * Rats, we have to visit the heap to check visibility.
  			 */
! 			node->ioss_HeapFetches++;
  			tuple = index_fetch_heap(scandesc);
  			if (tuple == NULL)
  				continue;		/* no visible tuple, try next index entry */
--- 162,168 ----
  			/*
  			 * Rats, we have to visit the heap to check visibility.
  			 */
! 			InstrCountTuples2(node, 1);
  			tuple = index_fetch_heap(scandesc);
  			if (tuple == NULL)
  				continue;		/* no visible tuple, try next index entry */
***************
*** 509,515 **** ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
  	indexstate->ss.ps.plan = (Plan *) node;
  	indexstate->ss.ps.state = estate;
  	indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
- 	indexstate->ioss_HeapFetches = 0;
  
  	/*
  	 * Miscellaneous initialization
--- 509,514 ----
diff --git a/src/include/executor/instrument.h index 28eb0093d4..0ba407f83a 100644
*** a/src/include/executor/instrument.h
--- b/src/include/executor/instrument.h
***************
*** 57,62 **** typedef struct Instrumentation
--- 57,63 ----
  	double		startup;		/* Total startup time (in seconds) */
  	double		total;			/* Total total time (in seconds) */
  	double		ntuples;		/* Total tuples produced */
+ 	double		ntuples2;		/* Secondary generic tuple counter */
  	double		nloops;			/* # of run cycles for this node */
  	double		nfiltered1;		/* # tuples removed by scanqual or joinqual OR
  								 * # tuples inserted by MERGE */
diff --git a/src/include/nodes/execnodesindex 06456f07cc..deab875466 100644
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 1004,1009 **** typedef struct PlanState
--- 1004,1014 ----
  #define outerPlanState(node)		(((PlanState *)(node))->lefttree)
  
  /* Macros for inline access to certain instrumentation counters */
+ #define InstrCountTuples2(node, delta) \
+ 	do { \
+ 		if (((PlanState *)(node))->instrument) \
+ 			((PlanState *)(node))->instrument->ntuples2 += (delta); \
+ 	} while (0)
  #define InstrCountFiltered1(node, delta) \
  	do { \
  		if (((PlanState *)(node))->instrument) \
***************
*** 1368,1374 **** typedef struct IndexScanState
   *		RelationDesc	   index relation descriptor
   *		ScanDesc		   index scan descriptor
   *		VMBuffer		   buffer in use for visibility map testing, if any
-  *		HeapFetches		   number of tuples we were forced to fetch from heap
   *		ioss_PscanLen	   Size of parallel index-only scan descriptor
   * ----------------
   */
--- 1373,1378 ----

#29

Tom Lane

tgl@sss.pgh.pa.us

almost 8 years ago

In reply to: Alvaro Herrera (#28)

Re: pgsql: Support partition pruning at execution time

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Questions:
1. Do we want to back-patch this to 10? I suppose (without checking)
that EXPLAIN ANALYZE is already reporting bogus numbers for parallel
index-only scans, so I think we should do that.

You can't back-patch a change in struct Instrumentation; that'd be
a ABI break. Maybe there are no third parties directly touching that
struct, but I wouldn't bet on it.

2. Do we want to revert Andrew's test stabilization patch? If I
understand correctly, the problem is the inverse of what was diagnosed:
"any running transaction at the time of the test could prevent pages
from being set as all-visible". That's correct, but the test doesn't
depend on pages being all-visible -- quite the contrary, it depends on
the pages NOT being all-visible (which is why the HeapFetches counts are
all non-zero). Since the pages contain very few tuples, autovacuum
should never process the tables anyway.

I did not especially like the original test output, because even without
the bug at hand, it seemed to me the number of heap fetches might vary
depending on BLCKSZ. Given that the point of the test is just to check
partition pruning, seems like IOS vs regular indexscan isn't a critical
difference. I'd keep Andrew's change but fix the comment.

regards, tom lane

#30

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: Alvaro Herrera (#28)

Re: pgsql: Support partition pruning at execution time

On 11 April 2018 at 03:14, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

2. Do we want to revert Andrew's test stabilization patch? If I
understand correctly, the problem is the inverse of what was diagnosed:
"any running transaction at the time of the test could prevent pages
from being set as all-visible". That's correct, but the test doesn't
depend on pages being all-visible -- quite the contrary, it depends on
the pages NOT being all-visible (which is why the HeapFetches counts are
all non-zero). Since the pages contain very few tuples, autovacuum
should never process the tables anyway.

I think it's probably a good idea to revert it once the
instrumentation is working correctly. It appears this found a bug in
that code, so is probably useful to keep just in case something else
breaks it in the future.

I don't think there is too much risk of instability from other
sources. There's no reason an auto-vacuum would trigger and cause a
change in heap fetches. We only delete one row from lprt_a, that's not
going to trigger an auto-vacuum.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#31

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: Tom Lane (#29)

Re: pgsql: Support partition pruning at execution time

On 11 April 2018 at 03:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

2. Do we want to revert Andrew's test stabilization patch? If I
understand correctly, the problem is the inverse of what was diagnosed:
"any running transaction at the time of the test could prevent pages
from being set as all-visible". That's correct, but the test doesn't
depend on pages being all-visible -- quite the contrary, it depends on
the pages NOT being all-visible (which is why the HeapFetches counts are
all non-zero). Since the pages contain very few tuples, autovacuum
should never process the tables anyway.

I did not especially like the original test output, because even without
the bug at hand, it seemed to me the number of heap fetches might vary
depending on BLCKSZ. Given that the point of the test is just to check
partition pruning, seems like IOS vs regular indexscan isn't a critical
difference. I'd keep Andrew's change but fix the comment.

hmm, I don't feel strongly about reverting or not, but if
[auto-]vacuum has not visited the table, then I don't see why BLCKSZ
would have an effect here.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#32

Robert Haas

robertmhaas@gmail.com

almost 8 years ago

In reply to: Alvaro Herrera (#28)

Re: pgsql: Support partition pruning at execution time

On Tue, Apr 10, 2018 at 11:14 AM, Alvaro Herrera
<alvherre@alvh.no-ip.org> wrote:

Questions:
1. Do we want to back-patch this to 10? I suppose (without checking)
that EXPLAIN ANALYZE is already reporting bogus numbers for parallel
index-only scans, so I think we should do that.

I haven't looked at this closely, but have you considered adding
bespoke code for IndexOnlyScan that works like
ExecSortRetrieveInstrumentation and ExecHashRetrieveInstrumentation
already do rather than jamming this into struct Instrumentation?

I'm inclined to view any node-specific instrumentation that's not
being pulled back to the leader as a rough edge to be filed down when
it bothers somebody more than an outright bug, but perhaps that is an
unduly lenient view. Still, if we take the view that it's an outright
bug, I suspect we find that there may be at least a few more of those.
I was pretty much oblivious to this problem during the initial
parallel query development and mistakenly assumed that bringing over
struct Instrumentation was good enough. It emerged late in the game
that this wasn't really the case, but holding up the whole feature
because some nodes might have details not reported on a per-worker
basis didn't really seem to make sense. Whether that was the right
call is obviously arguable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#33

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 8 years ago

In reply to: Robert Haas (#32)

Re: pgsql: Support partition pruning at execution time

Robert Haas wrote:

On Tue, Apr 10, 2018 at 11:14 AM, Alvaro Herrera
<alvherre@alvh.no-ip.org> wrote:

Questions:
1. Do we want to back-patch this to 10? I suppose (without checking)
that EXPLAIN ANALYZE is already reporting bogus numbers for parallel
index-only scans, so I think we should do that.

I haven't looked at this closely, but have you considered adding
bespoke code for IndexOnlyScan that works like
ExecSortRetrieveInstrumentation and ExecHashRetrieveInstrumentation
already do rather than jamming this into struct Instrumentation?

Thanks for pointing these out -- I hadn't come across these.

My initial impression is that those two are about transferring
instrumentation state that's quite a bit more complicated than what my
patch proposes for indexonly scan, which is just a single integer
counter. For example, in the case of Sort, for each worker we display
separately the method and type and amount of memory. In the hash case,
we aggregate all of them together for some reason (though I'm not clear
about this one:
/*
* In a parallel-aware hash join, for now we report the
* maximum peak memory reported by any worker.
*/
hinstrument.space_peak =
Max(hinstrument.space_peak, worker_hi->space_peak);
-- why shouldn't we sum these values?)

In contrast, in an indexonly scan you have a single counter and it
doesn't really matter the distribution of fetches done by workers, so it
seems okay to aggregate them all in a single counter. And it being so
simple, it seems reasonable to me to put it in Instrumentation rather
than have a dedicated struct.

I'm inclined to view any node-specific instrumentation that's not
being pulled back to the leader as a rough edge to be filed down when
it bothers somebody more than an outright bug, but perhaps that is an
unduly lenient view. Still, if we take the view that it's an outright
bug, I suspect we find that there may be at least a few more of those.

OK. As Tom indicated, it's not possible to backpatch this anyway.
Given that nobody has complained to date, it seems okay to be lenient
about this. Seeking a backpatchable solution seems more trouble than is
worth.

I was pretty much oblivious to this problem during the initial
parallel query development and mistakenly assumed that bringing over
struct Instrumentation was good enough. It emerged late in the game
that this wasn't really the case, but holding up the whole feature
because some nodes might have details not reported on a per-worker
basis didn't really seem to make sense. Whether that was the right
call is obviously arguable.

I certainly don't blame you for that.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#34

Robert Haas

robertmhaas@gmail.com

almost 8 years ago

In reply to: Alvaro Herrera (#33)

Re: pgsql: Support partition pruning at execution time

On Tue, Apr 10, 2018 at 12:29 PM, Alvaro Herrera
<alvherre@alvh.no-ip.org> wrote:

In contrast, in an indexonly scan you have a single counter and it
doesn't really matter the distribution of fetches done by workers, so it
seems okay to aggregate them all in a single counter. And it being so
simple, it seems reasonable to me to put it in Instrumentation rather
than have a dedicated struct.

I don't have a strong opinion on that. Since we know how many tuples
were processed by each worker, knowing how many heap fetches we have
on a per-worker basis seems like a good thing to have, too. On the
other hand, maybe EXPLAIN (ANALYZE, VERBOSE) would give us that
anyway, since it knows about displaying per-worker instrumentation
(look for /* Show worker detail */). If it doesn't, then that's
probably bad, because I'm pretty sure that when I installed that code
the stuff that got displayed for worker instrumentation pretty much
matched the stuff that got displayed for overall instrumentation.

In any case part of the point is that Instrumentation is supposed to
be a generic structure that contains things that are for the most part
common to all node types. So what MERGE did to that structure looks
like in inadvisable kludge to me. I'd get rid of that and do it a
different way rather than propagate the idea that nfilteredX is
scratch space that can mean something different in every separate
node.

I was pretty much oblivious to this problem during the initial
parallel query development and mistakenly assumed that bringing over
struct Instrumentation was good enough. It emerged late in the game
that this wasn't really the case, but holding up the whole feature
because some nodes might have details not reported on a per-worker
basis didn't really seem to make sense. Whether that was the right
call is obviously arguable.

I certainly don't blame you for that.

Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#35

Alvaro Herrera

alvherre@alvh.no-ip.org

almost 8 years ago

In reply to: Robert Haas (#34)

Re: pgsql: Support partition pruning at execution time

Robert Haas wrote:

I don't have a strong opinion on that. Since we know how many tuples
were processed by each worker, knowing how many heap fetches we have
on a per-worker basis seems like a good thing to have, too. On the
other hand, maybe EXPLAIN (ANALYZE, VERBOSE) would give us that
anyway, since it knows about displaying per-worker instrumentation
(look for /* Show worker detail */). If it doesn't, then that's
probably bad, because I'm pretty sure that when I installed that code
the stuff that got displayed for worker instrumentation pretty much
matched the stuff that got displayed for overall instrumentation.

So, after some experimentation I was able to produce this output with
the previously proposed patch:
QUERY PLAN

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Gather Merge (cost=1000.51..27083147.02 rows=22950230 width=12) (actual time=1198.906..1198.906 rows=0 loops=1)
Output: tbl1.col1, tprt_1.col1, tprt_1.col1
Workers Planned: 5
Workers Launched: 5
Buffers: shared hit=267516
-> Nested Loop (cost=0.43..24318369.79 rows=4590046 width=12) (actual time=1183.291..1183.291 rows=0 loops=6)
Output: tbl1.col1, tprt_1.col1, tprt_1.col1
Join Filter: (tbl1.col1 = tprt_1.col1)
Rows Removed by Join Filter: 1200012
Buffers: shared hit=1836184
Worker 0: actual time=1174.526..1174.526 rows=0 loops=1
Buffers: shared hit=245599
Worker 1: actual time=1178.845..1178.845 rows=0 loops=1
Buffers: shared hit=270694
Worker 2: actual time=1185.703..1185.703 rows=0 loops=1
Buffers: shared hit=313607
Worker 3: actual time=1189.595..1189.595 rows=0 loops=1
Buffers: shared hit=383714
Worker 4: actual time=1176.965..1176.965 rows=0 loops=1
Buffers: shared hit=355054
-> Parallel Index Only Scan using tprt1_idx on public.tprt_1 (cost=0.43..63100.29 rows=360004 width=4) (actual time=0.041..209.113 rows=300003 loops=6)
Output: tprt_1.col1
Heap Fetches: 1800018
Buffers: shared hit=36166
Worker 0: actual time=0.035..192.255 rows=240750 loops=1
Buffers: shared hit=4849
Worker 1: actual time=0.031..184.332 rows=265307 loops=1
Buffers: shared hit=5387
Worker 2: actual time=0.046..272.412 rows=307509 loops=1
Buffers: shared hit=6098
Worker 3: actual time=0.035..212.866 rows=376064 loops=1
Buffers: shared hit=7650
Worker 4: actual time=0.036..201.076 rows=348048 loops=1
Buffers: shared hit=7006
-> Seq Scan on public.tbl1 (cost=0.00..35.50 rows=2550 width=4) (actual time=0.001..0.002 rows=4 loo
Time: 1205,533 ms (00:01,206)

You're right that per-worker heap-fetches would be probably better, but
that seems like a project larger than I'm willing to tackle at this
point in the devel cycle -- mainly because that "worker detail" does not
have node-specific handling, so there's a bunch of new code to be
written there. This amount of detail seems about right for me: I'm not
sure that you really care all that much about how many fetches each
worker had to do, but rather you care about how many there were in the
table as a whole.

Another query (without verbose) gives this:

QUERY PLAN
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Gather (cost=20000001002.10..20249583898.07 rows=80390828 width=12) (actual time=1554.752..6633.272 rows=3600036 loops=1)
Workers Planned: 1
Workers Launched: 1
Single Copy: true
-> Nested Loop (cost=20000000002.10..20241543815.27 rows=80390828 width=12) (actual time=1547.552..5832.355 rows=3600036 loops=1)
Join Filter: (tbl1.col1 = tprt_1.col1)
Rows Removed by Join Filter: 21600216
Buffers: shared hit=75688 read=54346
-> Merge Append (cost=2.10..371288.65 rows=6305163 width=4) (actual time=0.758..1980.024 rows=6300063 loops=1)
Sort Key: tprt_1.col1
Buffers: shared hit=75688 read=54345
-> Index Only Scan using tprt1_idx on tprt_1 (cost=0.43..77500.44 rows=1800018 width=4) (actual time=0.211..387.916 rows=1800018 loops=1)
Heap Fetches: 1800018
Buffers: shared hit=22028 read=14035
-> Index Only Scan using tprt2_idx on tprt_2 (cost=0.43..121147.34 rows=2700027 width=4) (actual time=0.180..628.350 rows=2700027 loops=1)
Heap Fetches: 2700027
Buffers: shared hit=40597 read=26251
-> Index Only Scan using tprt3_idx on tprt_3 (cost=0.42..29722.56 rows=900009 width=4) (actual time=0.153..184.429 rows=900009 loops=1)
Heap Fetches: 900009
Buffers: shared hit=6538 read=7029
-> Index Only Scan using tprt4_idx on tprt_4 (cost=0.15..82.41 rows=2550 width=4) (actual time=0.023..0.023 rows=0 loops=1)
Heap Fetches: 0
Buffers: shared hit=1
-> Index Only Scan using tprt5_idx on tprt_5 (cost=0.15..82.41 rows=2550 width=4) (actual time=0.018..0.018 rows=0 loops=1)
Heap Fetches: 0
Buffers: shared hit=1
-> Index Only Scan using tprt6_idx on tprt_6 (cost=0.42..29734.56 rows=900009 width=4) (actual time=0.165..191.343 rows=900009 loops=1)
Heap Fetches: 900009
Buffers: shared hit=6523 read=7030
-> Materialize (cost=10000000000.00..10000000048.25 rows=2550 width=4) (actual time=0.000..0.000 rows=4 loops=6300063)
Buffers: shared read=1
-> Seq Scan on tbl1 (cost=10000000000.00..10000000035.50 rows=2550 width=4) (actual time=0.033..0.036 rows=4 loops=1)
Buffers: shared read=1
Planning Time: 4.344 ms
Execution Time: 7008.095 ms
(35 filas)

Looks alright to me.

In any case part of the point is that Instrumentation is supposed to
be a generic structure that contains things that are for the most part
common to all node types. So what MERGE did to that structure looks
like in inadvisable kludge to me. I'd get rid of that and do it a
different way rather than propagate the idea that nfilteredX is
scratch space that can mean something different in every separate
node.

Agreed. My proposed patch adds another generic counter "tuples2", which
can be used at least by ON CONFLICT.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#36

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: Tom Lane (#24)

Re: pgsql: Support partition pruning at execution time

On 10 April 2018 at 08:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

David Rowley wrote:

Okay, I've written and attached a fix for this. I'm not 100% certain
that this is the cause of the problem on pademelon, but the code does
look wrong, so needs to be fixed. Hopefully, it'll make pademelon
happy, if not I'll think a bit harder about what might be causing that
instability.

Pushed it just now. Let's see what happens with pademelon now.

I've had pademelon's host running a "make installcheck" loop all day
trying to reproduce the problem. I haven't gotten a bite yet (although
at 15+ minutes per cycle, this isn't a huge number of tests). I think
we were remarkably (un)lucky to see the problem so quickly after the
initial commit, and I'm afraid pademelon isn't going to help us prove
much about whether this was the same issue.

This does remind me quite a bit though of the ongoing saga with the
postgres_fdw test instability. Given the frequency with which that's
failing in the buildfarm, you would not think it's impossible to
reproduce outside the buildfarm, and yet I'm here to tell you that
it's pretty damn hard. I haven't succeeded yet, and that's not for
lack of trying. Could there be something about the buildfarm
environment that makes these sorts of things more likely?

coypu just demonstrated that this was not the cause of the problem [1]https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=coypu&dt=2018-04-11%2004%3A17%3A38&stg=install-check-C

I'll study the code a bit more and see if I can think why this might
be happening.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=coypu&dt=2018-04-11%2004%3A17%3A38&stg=install-check-C

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#37

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: David Rowley (#36)

4 attachment(s)

Re: pgsql: Support partition pruning at execution time

On 11 April 2018 at 18:58, David Rowley <david.rowley@2ndquadrant.com> wrote:

On 10 April 2018 at 08:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

David Rowley wrote:

Okay, I've written and attached a fix for this. I'm not 100% certain
that this is the cause of the problem on pademelon, but the code does
look wrong, so needs to be fixed. Hopefully, it'll make pademelon
happy, if not I'll think a bit harder about what might be causing that
instability.

Pushed it just now. Let's see what happens with pademelon now.

I've had pademelon's host running a "make installcheck" loop all day
trying to reproduce the problem. I haven't gotten a bite yet (although
at 15+ minutes per cycle, this isn't a huge number of tests). I think
we were remarkably (un)lucky to see the problem so quickly after the
initial commit, and I'm afraid pademelon isn't going to help us prove
much about whether this was the same issue.

This does remind me quite a bit though of the ongoing saga with the
postgres_fdw test instability. Given the frequency with which that's
failing in the buildfarm, you would not think it's impossible to
reproduce outside the buildfarm, and yet I'm here to tell you that
it's pretty damn hard. I haven't succeeded yet, and that's not for
lack of trying. Could there be something about the buildfarm
environment that makes these sorts of things more likely?

coypu just demonstrated that this was not the cause of the problem [1]

I'll study the code a bit more and see if I can think why this might
be happening.

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=coypu&dt=2018-04-11%2004%3A17%3A38&stg=install-check-C

I've spent a bit of time tonight trying to dig into this problem to
see if I can figure out what's going on.

I ended up running the following script on both a Linux x86_64 machine
and also a power8 machine.

#!/bin/bash
for x in {1..1000}
do
echo "$x";
for i in {1..1000}
do
psql -d postgres -f test.sql -o test.out
diff -u test.out test.expect
done
done

I was unable to recreate this problem after about 700k loops on the
Linux machine and 130k loops on the power8.

I've emailed the owner of coypu to ask if it would be possible to get
access to the machine, or have him run the script to see if it does
actually fail. Currently waiting to hear back.

I've made another pass over the nodeAppend.c code and I'm unable to
see what might cause this, although I did discover a bug where
first_partial_plan is not set taking into account that some subplans
may have been pruned away during executor init. The only thing I think
this would cause is for parallel workers to not properly help out with
some partial plans if some earlier subplans were pruned. I can see no
reason for this to have caused this particular issue since the
first_partial_plan would be 0 with and without the attached fix.

Tom, would there be any chance you could run the above script for a
while on pademelon to see if it can in fact reproduce the problem?
coypu did show this problem in the install check, so I don't think it
will need the other concurrent tests to fail. If you can recreate,
after adjusting the expected output, does the problem still exist in
5c0675215?

I also checked with other tests perform an EXPLAIN ANALYZE of a plan
with a Parallel Append and I see there's none. So I've not ruled out
that this is an existing bug. git grep "explain.*analyze" also does
not show much outside of the partition_prune tests either.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

test.sqltext/plain; charset=US-ASCII; name=test.sqlDownload

test.expectapplication/octet-stream; name=test.expectDownload

first_partial_plan_fix.patchapplication/octet-stream; name=first_partial_plan_fix.patchDownload

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index d062cfddac..188399cc12 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -204,6 +204,13 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendplanstates = (PlanState **) palloc(nplans *
 											 sizeof(PlanState *));
 
+	/*
+	 * We must determine the first valid partial plan.  Default this to point
+	 * beyond the final valid plan.  We'll determine the actual first valid
+	 * partial subplan in the loop below.
+	 */
+	appendstate->as_first_partial_plan = nplans;
+
 	/*
 	 * call ExecInitNode on each of the valid plans to be executed and save
 	 * the results into the appendplanstates array.
@@ -215,6 +222,13 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 		{
 			Plan	   *initNode = (Plan *) lfirst(lc);
 
+			/*
+			 * Record the lowest appendplanstates index which is a partial
+			 * plan.
+			 */
+			if (i >= node->first_partial_plan && j < appendstate->as_first_partial_plan)
+				appendstate->as_first_partial_plan = j;
+
 			appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
 		}
 		i++;
@@ -556,7 +570,7 @@ choose_next_subplan_for_leader(AppendState *node)
 	}
 
 	/* If non-partial, immediately mark as finished. */
-	if (node->as_whichplan < append->first_partial_plan)
+	if (node->as_whichplan < node->as_first_partial_plan)
 		node->as_pstate->pa_finished[node->as_whichplan] = true;
 
 	LWLockRelease(&pstate->pa_lock);
@@ -629,14 +643,14 @@ choose_next_subplan_for_worker(AppendState *node)
 			/* Advance to the next valid plan. */
 			pstate->pa_next_plan = nextplan;
 		}
-		else if (node->as_whichplan > append->first_partial_plan)
+		else if (node->as_whichplan > node->as_first_partial_plan)
 		{
 			/*
 			 * Try looping back to the first valid partial plan, if there is
 			 * one.  If there isn't, arrange to bail out below.
 			 */
 			nextplan = bms_next_member(node->as_valid_subplans,
-									   append->first_partial_plan - 1);
+									   node->as_first_partial_plan - 1);
 			pstate->pa_next_plan =
 				nextplan < 0 ? node->as_whichplan : nextplan;
 		}
@@ -670,7 +684,7 @@ choose_next_subplan_for_worker(AppendState *node)
 	if (pstate->pa_next_plan < 0)
 	{
 		int			nextplan = bms_next_member(node->as_valid_subplans,
-											   append->first_partial_plan - 1);
+											   node->as_first_partial_plan - 1);
 
 		if (nextplan >= 0)
 			pstate->pa_next_plan = nextplan;
@@ -686,7 +700,7 @@ choose_next_subplan_for_worker(AppendState *node)
 	}
 
 	/* If non-partial, immediately mark as finished. */
-	if (node->as_whichplan < append->first_partial_plan)
+	if (node->as_whichplan < node->as_first_partial_plan)
 		node->as_pstate->pa_finished[node->as_whichplan] = true;
 
 	LWLockRelease(&pstate->pa_lock);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index deab875466..cff64337e2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1150,6 +1150,8 @@ struct AppendState
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
 	int			as_nplans;
 	int			as_whichplan;
+	int			as_first_partial_plan; /* Index of 'appendplans' containing
+										* the first partial plan */
 	ParallelAppendState *as_pstate; /* parallel coordination info */
 	Size		pstate_len;		/* size of parallel coordination info */
 	struct PartitionPruneState *as_prune_state;

setup.sqltext/plain; charset=US-ASCII; name=setup.sqlDownload

#38

Alvaro Herrera

alvherre@alvh.no-ip.org

over 7 years ago

In reply to: David Rowley (#37)

Re: pgsql: Support partition pruning at execution time

David Rowley wrote:

I've made another pass over the nodeAppend.c code and I'm unable to
see what might cause this, although I did discover a bug where
first_partial_plan is not set taking into account that some subplans
may have been pruned away during executor init. The only thing I think
this would cause is for parallel workers to not properly help out with
some partial plans if some earlier subplans were pruned. I can see no
reason for this to have caused this particular issue since the
first_partial_plan would be 0 with and without the attached fix.

Pushed this.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#39

David Rowley

david.rowley@2ndquadrant.com

over 7 years ago

In reply to: Alvaro Herrera (#38)

Re: pgsql: Support partition pruning at execution time

On 18 April 2018 at 07:26, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

David Rowley wrote:

I've made another pass over the nodeAppend.c code and I'm unable to
see what might cause this, although I did discover a bug where
first_partial_plan is not set taking into account that some subplans
may have been pruned away during executor init. The only thing I think
this would cause is for parallel workers to not properly help out with
some partial plans if some earlier subplans were pruned. I can see no
reason for this to have caused this particular issue since the
first_partial_plan would be 0 with and without the attached fix.

Pushed this.

Thanks!

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services