cost_index() and path row estimate.

Started by Bernd Helmlealmost 11 years ago3 messageshackers
Jump to latest
#1Bernd Helmle
mailings@oopsware.de

While looking into a customer performance problem, i saw this in
costsize.c, cost_index() (9.3.6, but it looks the same in HEAD):

/* Mark the path with the correct row estimate */
if (path->path.param_info)
{
path->path.rows = path->path.param_info->ppi_rows;
/* also get the set of clauses that should be enforced by the scan */
allclauses = list_concat(list_copy(path->path.param_info->ppi_clauses),
baserel->baserestrictinfo);
}
else
{
path->path.rows = baserel->rows;
/* allclauses should just be the rel's restriction clauses */
allclauses = baserel->baserestrictinfo;
}

What i'm wondering is the else branch, where the baserel row estimate is
assigned to the
IndexPath. However, it occurs to me that in conjunction with a partial
index, the overall row estimate will be constrained by the row estimate
from the partial index itself, no? I see that this doesn't have an impact
on the cost estimation of the index scan itself, since cost_index() does
this later:

/* estimate number of main-table tuples fetched */
tuples_fetched = clamp_row_est(indexSelectivity * baserel->tuples);

I stumpled across this, since i see heavy misestimates in the EXPLAIN
output, where the estimated row count from the partial index vs. the real
row count heavily differs. In the customers case, there are two tables,
where one of the relation has many tuples in the JOIN condition which are
NULLs, like:

CREATE TABLE a2(id integer);
CREATE TABLE b2(id integer);

INSERT INTO a2 SELECT NULL FROM generate_series(1, 9800) AS t(id);
INSERT INTO a2 SELECT t.id FROM generate_series(1, 200) AS t(id);
INSERT INTO b2 SELECT t.id FROM generate_series(1, 200) AS t(id);

CREATE INDEX ON a2(id) WHERE id IS NOT NULL;
CREATE INDEX ON b2(id);

Here i get the following plan:

EXPLAIN ANALYZE SELECT * FROM b2 INNER JOIN a2 ON a2.id = b2.id ;

Merge Join (cost=10.79..12.63 rows=4 width=8) (actual time=0.084..0.291
rows=200 loops=1)
Merge Cond: (b2.id = a2.id)
-> Sort (cost=10.64..11.14 rows=200 width=4) (actual time=0.069..0.082
rows=200 loops=1)
Sort Key: b2.id
Sort Method: quicksort Memory: 35kB
-> Seq Scan on b2 (cost=0.00..3.00 rows=200 width=4) (actual
time=0.010..0.027 rows=200 loops=1)
-> Index Only Scan using a2_id_idx on a2 (cost=0.14..15.14 rows=10000
width=4) (actual time=0.012..0.074 rows=200 loops=1)
Heap Fetches: 200
Total runtime: 0.350 ms

EXPLAIN ANALYZE SELECT * FROM b2 INNER JOIN a2 ON a2.id = b2.id WHERE a2.id
IS NOT NULL;

Merge Join (cost=10.79..12.12 rows=1 width=8) (actual time=0.080..0.281
rows=200 loops=1)
Merge Cond: (b2.id = a2.id)
-> Sort (cost=10.64..11.14 rows=200 width=4) (actual time=0.063..0.070
rows=200 loops=1)
Sort Key: b2.id
Sort Method: quicksort Memory: 35kB
-> Seq Scan on b2 (cost=0.00..3.00 rows=200 width=4) (actual
time=0.010..0.034 rows=200 loops=1)
-> Index Only Scan using a2_id_idx on a2 (cost=0.14..15.64 rows=200
width=4) (actual time=0.012..0.052 rows=200 loops=1)
Index Cond: (id IS NOT NULL)
Heap Fetches: 200
Total runtime: 0.335 ms

With the partial index predicate explicitly specified the row estimate is
accurate, without the predicate the row estimate of the index scan on
a2_id_idx assumes 10000.

It's very likely i miss something really important here, could someone shed
some light on this?

--
Thanks

Bernd

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bernd Helmle (#1)
Re: cost_index() and path row estimate.

Bernd Helmle <mailings@oopsware.de> writes:

While looking into a customer performance problem, i saw this in
costsize.c, cost_index() (9.3.6, but it looks the same in HEAD):
...
What i'm wondering is the else branch, where the baserel row estimate is
assigned to the
IndexPath. However, it occurs to me that in conjunction with a partial
index, the overall row estimate will be constrained by the row estimate
from the partial index itself, no?

No. The non-parameterized paths for a given relation must all have the
same rowcount estimates; otherwise the path comparison logic fails
fundamentally. Another way to look at it is that every correct path
will yield the same number of rows in reality; so it would be wrong to
give a path that makes use of a partial index a rowcount advantage over
a path that is not using the partial index but nonetheless is enforcing
exactly the same set of scan restriction clauses.

What could potentially make sense is to detect applicability of partial or
unique indexes earlier than we do now, and use that knowledge to adjust
the relation's rows estimate overall, for all paths. But I'm not sure
how to do that without either (a) making the code a lot messier than it
is now or (b) duplicating a lot of work or (c) both.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Bernd Helmle
mailings@oopsware.de
In reply to: Tom Lane (#2)
Re: cost_index() and path row estimate.

--On 1. Mai 2015 11:30:51 -0700 Tom Lane <tgl@sss.pgh.pa.us> wrote:

No. The non-parameterized paths for a given relation must all have the
same rowcount estimates; otherwise the path comparison logic fails
fundamentally. Another way to look at it is that every correct path
will yield the same number of rows in reality; so it would be wrong to
give a path that makes use of a partial index a rowcount advantage over
a path that is not using the partial index but nonetheless is enforcing
exactly the same set of scan restriction clauses.

Thanks for the explanation.

--
Thanks

Bernd

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers