BUG #8598: Row count estimates of partial indexes

Started by Marko Tiikkajaover 12 years ago9 messagesbugs
Jump to latest
#1Marko Tiikkaja
marko@joh.to

The following bug has been logged on the website:

Bug reference: 8598
Logged by: Marko Tiikkaja
Email address: marko@joh.to
PostgreSQL version: 9.1.9
Operating system: Linux
Description:

Hi,

We have the following partial index on a small subset of a larger table:

"index_transactions_transaction_balance_details" btree (transactionid)
WHERE NOT processed AND accountbalancesdailyid IS NOT NULL

However, querying with the WHERE clause completely ignores the
pg_class.reltuples value for the index:

=# EXPLAIN ANALYZE SELECT * FROM transactions WHERE NOT processed AND
accountbalancesdailyid IS NOT NULL;

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using index_transactions_transaction_balance_details on
transactions (cost=0.00..3883160.47 rows=66259403 width=130) (actual
time=0.033..18.268 rows=13962 loops=1)
Total runtime: 18.874 ms
(2 rows)

.. which makes for some silly joins when this index is part of a larger
query.

Is this expected on 9.1? Has this been fixed in more recent versions?

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Marko Tiikkaja (#1)
Re: BUG #8598: Row count estimates of partial indexes

"marko@joh.to" <marko@joh.to> wrote:

   "index_transactions_transaction_balance_details" btree
(transactionid)
WHERE NOT processed AND accountbalancesdailyid IS NOT NULL

However, querying with the WHERE clause completely ignores the
pg_class.reltuples value for the index:

What is the pg_class.reltuples value for the index?

=# EXPLAIN ANALYZE SELECT * FROM transactions WHERE NOT processed
AND accountbalancesdailyid IS NOT NULL;

Index Scan using index_transactions_transaction_balance_details
on transactions  (cost=0.00..3883160.47 rows=66259403 width=130)
(actual time=0.033..18.268 rows=13962 loops=1)

.. which makes for some silly joins when this index is part of a
larger query.

Is this expected on 9.1?  Has this been fixed in more recent
versions?

Please provide a little more information:

SELECT version();
SELECT name, current_setting(name), source
  FROM pg_settings
  WHERE source NOT IN ('default', 'override');

What VACUUM or ANALYZE commands are run outside of autovacuum?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Tiikkaja (#1)
Re: BUG #8598: Row count estimates of partial indexes

marko@joh.to writes:

We have the following partial index on a small subset of a larger table:
"index_transactions_transaction_balance_details" btree (transactionid)
WHERE NOT processed AND accountbalancesdailyid IS NOT NULL
However, querying with the WHERE clause completely ignores the
pg_class.reltuples value for the index:

Yup. Row count estimates are derived by estimating the selectivity of the
given WHERE clauses and multiplying by the (estimated) current table size.
In the particular case you show here, with a partial index that *exactly*
matches the WHERE clause, we could get a better answer by looking at the
index size --- but that doesn't scale to any less simplistic case, such
as a query with additional WHERE clauses.

It's also important to realize that reltuples for an index is a whole lot
less trustworthy than it is for a table; ANALYZE doesn't update the
former, for example. And scaling from the last-reported VACUUM stats
to current reality is going to be shakier.

So on the whole, I don't think this would be a good idea.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#4Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#3)
Re: BUG #8598: Row count estimates of partial indexes

On 11/17/13, 5:29 PM, Tom Lane wrote:

marko@joh.to writes:

We have the following partial index on a small subset of a larger table:
"index_transactions_transaction_balance_details" btree (transactionid)
WHERE NOT processed AND accountbalancesdailyid IS NOT NULL
However, querying with the WHERE clause completely ignores the
pg_class.reltuples value for the index:

Yup. Row count estimates are derived by estimating the selectivity of the
given WHERE clauses and multiplying by the (estimated) current table size.
In the particular case you show here, with a partial index that *exactly*
matches the WHERE clause, we could get a better answer by looking at the
index size --- but that doesn't scale to any less simplistic case, such
as a query with additional WHERE clauses.

It's also important to realize that reltuples for an index is a whole lot
less trustworthy than it is for a table; ANALYZE doesn't update the
former, for example. And scaling from the last-reported VACUUM stats
to current reality is going to be shakier.

So on the whole, I don't think this would be a good idea.

Any suggestions for a workaround? When reading this index as a part of
a bigger query the horrible estimate ensures that nobody's having fun.
I currently have something like:

SELECT * FROM
(SELECT * FROM transactions WHERE <partial index> LIMIT 25000)
transactions
JOIN ..

And I *really* don't like that as a workaround.

Regards,
Marko Tiikkaja

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Tiikkaja (#4)
Re: BUG #8598: Row count estimates of partial indexes

Marko Tiikkaja <marko@joh.to> writes:

Any suggestions for a workaround? When reading this index as a part of
a bigger query the horrible estimate ensures that nobody's having fun.

Why is the estimate so bad? I suppose the answer is that those two
columns are very strongly correlated. Maybe you could refactor your
data representation to avoid that?

The long-term answer as far as Postgres is concerned is to learn about
cross-column correlations, but that's not happening in the near future.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#6Jeff Janes
jeff.janes@gmail.com
In reply to: Marko Tiikkaja (#4)
Re: BUG #8598: Row count estimates of partial indexes

On Sun, Nov 17, 2013 at 11:55 AM, Marko Tiikkaja <marko@joh.to> wrote:

On 11/17/13, 5:29 PM, Tom Lane wrote:

marko@joh.to writes:

We have the following partial index on a small subset of a larger table:
"index_transactions_transaction_balance_details" btree
(transactionid)
WHERE NOT processed AND accountbalancesdailyid IS NOT NULL
However, querying with the WHERE clause completely ignores the
pg_class.reltuples value for the index:

Yup. Row count estimates are derived by estimating the selectivity of the
given WHERE clauses and multiplying by the (estimated) current table size.
In the particular case you show here, with a partial index that *exactly*
matches the WHERE clause, we could get a better answer by looking at the
index size --- but that doesn't scale to any less simplistic case, such
as a query with additional WHERE clauses.

It's also important to realize that reltuples for an index is a whole lot
less trustworthy than it is for a table; ANALYZE doesn't update the
former, for example. And scaling from the last-reported VACUUM stats
to current reality is going to be shakier.

So on the whole, I don't think this would be a good idea.

Any suggestions for a workaround? When reading this index as a part of a
bigger query the horrible estimate ensures that nobody's having fun. I
currently have something like:

Define a new column which is true iff the where condition is true? It
sounds like that one magic combination has a meaning all of its own, so it
would make sense to encode it in one column.

SELECT * FROM
(SELECT * FROM transactions WHERE <partial index> LIMIT 25000)
transactions
JOIN ..

And I *really* don't like that as a workaround.

I've wanted a function that always returns true, but which the planner
things returns false most of the time, for use in such situations. It
looks like you can make one of these with a compiled module (by creating an
operator and then wrapping that in a function), but I have not found a way
to do it without using C. (CREATE FUNCTION takes a COST and ROWS, but not
a SELECTIVITY.)

Cheers,

Jeff

#7Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#5)
Re: BUG #8598: Row count estimates of partial indexes

On 11/17/13 9:18 PM, Tom Lane wrote:

Marko Tiikkaja <marko@joh.to> writes:

Any suggestions for a workaround? When reading this index as a part of
a bigger query the horrible estimate ensures that nobody's having fun.

Why is the estimate so bad? I suppose the answer is that those two
columns are very strongly correlated. Maybe you could refactor your
data representation to avoid that?

I'll look into that, thanks.

The long-term answer as far as Postgres is concerned is to learn about
cross-column correlations, but that's not happening in the near future.

I'm completely clueless about how the planner works, but wouldn't it be
easier to have some kind of separate stats for the conditions in partial
indexes? It seems better in all cases than trying infer the stats from
cross-column correlations, even if we had that.

Regards,
Marko Tiikkaja

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Tiikkaja (#7)
Re: BUG #8598: Row count estimates of partial indexes

Marko Tiikkaja <marko@joh.to> writes:

On 11/17/13 9:18 PM, Tom Lane wrote:

The long-term answer as far as Postgres is concerned is to learn about
cross-column correlations, but that's not happening in the near future.

I'm completely clueless about how the planner works, but wouldn't it be
easier to have some kind of separate stats for the conditions in partial
indexes? It seems better in all cases than trying infer the stats from
cross-column correlations, even if we had that.

There's been some discussion of providing a way to hint to ANALYZE about
which combinations of columns are worth gathering cross-column statistics
for. But partial index predicates seem like a pretty bad mechanism for
that.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#9Claudio Freire
klaussfreire@gmail.com
In reply to: Tom Lane (#8)
Re: BUG #8598: Row count estimates of partial indexes

On Mon, Nov 18, 2013 at 11:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Marko Tiikkaja <marko@joh.to> writes:

On 11/17/13 9:18 PM, Tom Lane wrote:

The long-term answer as far as Postgres is concerned is to learn about
cross-column correlations, but that's not happening in the near future.

I'm completely clueless about how the planner works, but wouldn't it be
easier to have some kind of separate stats for the conditions in partial
indexes? It seems better in all cases than trying infer the stats from
cross-column correlations, even if we had that.

There's been some discussion of providing a way to hint to ANALYZE about
which combinations of columns are worth gathering cross-column statistics
for. But partial index predicates seem like a pretty bad mechanism for
that.

regards, tom lane

Why?

If there's a partial index on some predicate, it does mean the
predicate is of common occurence or at least important and it's quite
expectable that more precise estimations regarding those queries
valuable.

Analyze should simply record the selectivity of partial index
predicates as it would the MFV of the boolean variable equal to the
predicate's result, and modifying the MFV estimation code to look up
for those specific stats doesn't seem too difficult.

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs