Expression index ignores column statistics target

Started by Michael Fuhrover 20 years ago11 messages
#1Michael Fuhr
mike@fuhr.org

I've noticed that row count estimates for expression indexes appear
to rely on default_statistics_target rather than on a column's
actual statistics target. That is, if I use ALTER TABLE SET
STATISTICS to increase a column's statistics target and then run
ANALYZE, then estimates for non-expression-index queries improve
as expected. However, queries that use an expression index remain
accurate for only around the N most common values, where N is the
default_statistics_target that was in effect when ANALYZE ran. I'm
still rummaging through the archives looking for past discussion;
is this behavior a known limitation or just an oversight?

CREATE TABLE foo (x integer);

CREATE INDEX foo_x_idx ON foo (x);
CREATE INDEX foo_abs_x_idx ON foo (abs(x));

INSERT INTO foo (x)
SELECT r1 % r2
FROM generate_series(1, 100) AS g1(r1),
generate_series(1, 100) AS g2(r2);

SET default_statistics_target TO 15;
ALTER TABLE foo ALTER COLUMN x SET STATISTICS 20;
ANALYZE foo;

SELECT most_common_vals FROM pg_stats WHERE attname = 'x';
most_common_vals
-----------------------------------------------------
{0,1,2,3,4,5,6,7,8,10,9,11,12,14,13,15,16,19,17,18}
(1 row)

EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 13;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=2.72..50.28 rows=205 width=4) (actual time=0.370..1.766 rows=220 loops=1)
Recheck Cond: (x = 13)
-> Bitmap Index Scan on foo_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.314..0.314 rows=220 loops=1)
Index Cond: (x = 13)
Total runtime: 2.905 ms
(5 rows)

EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 13;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=2.72..50.80 rows=205 width=4) (actual time=0.358..1.720 rows=220 loops=1)
Recheck Cond: (abs(x) = 13)
-> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.305..0.305 rows=220 loops=1)
Index Cond: (abs(x) = 13)
Total runtime: 2.875 ms
(5 rows)

EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 18;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=2.60..49.75 rows=172 width=4) (actual time=0.312..1.442 rows=180 loops=1)
Recheck Cond: (x = 18)
-> Bitmap Index Scan on foo_x_idx (cost=0.00..2.60 rows=172 width=0) (actual time=0.262..0.262 rows=180 loops=1)
Index Cond: (x = 18)
Total runtime: 2.393 ms
(5 rows)

EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 18;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=2.22..43.65 rows=63 width=4) (actual time=0.313..1.436 rows=180 loops=1)
Recheck Cond: (abs(x) = 18)
-> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.22 rows=63 width=0) (actual time=0.263..0.263 rows=180 loops=1)
Index Cond: (abs(x) = 18)
Total runtime: 2.418 ms
(5 rows)

--
Michael Fuhr

#2Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Michael Fuhr (#1)
Re: Expression index ignores column statistics target

This is expected. The main TODO items is:

* Allow accurate statistics to be collected on indexes with more than
one column or expression indexes, perhaps using per-index statistics

Basically, we don't have multi-column or expression statistics. ANALYZE
just analyzes columns, even if an expression index exists.

---------------------------------------------------------------------------

Michael Fuhr wrote:

I've noticed that row count estimates for expression indexes appear
to rely on default_statistics_target rather than on a column's
actual statistics target. That is, if I use ALTER TABLE SET
STATISTICS to increase a column's statistics target and then run
ANALYZE, then estimates for non-expression-index queries improve
as expected. However, queries that use an expression index remain
accurate for only around the N most common values, where N is the
default_statistics_target that was in effect when ANALYZE ran. I'm
still rummaging through the archives looking for past discussion;
is this behavior a known limitation or just an oversight?

CREATE TABLE foo (x integer);

CREATE INDEX foo_x_idx ON foo (x);
CREATE INDEX foo_abs_x_idx ON foo (abs(x));

INSERT INTO foo (x)
SELECT r1 % r2
FROM generate_series(1, 100) AS g1(r1),
generate_series(1, 100) AS g2(r2);

SET default_statistics_target TO 15;
ALTER TABLE foo ALTER COLUMN x SET STATISTICS 20;
ANALYZE foo;

SELECT most_common_vals FROM pg_stats WHERE attname = 'x';
most_common_vals
-----------------------------------------------------
{0,1,2,3,4,5,6,7,8,10,9,11,12,14,13,15,16,19,17,18}
(1 row)

EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 13;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=2.72..50.28 rows=205 width=4) (actual time=0.370..1.766 rows=220 loops=1)
Recheck Cond: (x = 13)
-> Bitmap Index Scan on foo_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.314..0.314 rows=220 loops=1)
Index Cond: (x = 13)
Total runtime: 2.905 ms
(5 rows)

EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 13;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=2.72..50.80 rows=205 width=4) (actual time=0.358..1.720 rows=220 loops=1)
Recheck Cond: (abs(x) = 13)
-> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.305..0.305 rows=220 loops=1)
Index Cond: (abs(x) = 13)
Total runtime: 2.875 ms
(5 rows)

EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 18;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=2.60..49.75 rows=172 width=4) (actual time=0.312..1.442 rows=180 loops=1)
Recheck Cond: (x = 18)
-> Bitmap Index Scan on foo_x_idx (cost=0.00..2.60 rows=172 width=0) (actual time=0.262..0.262 rows=180 loops=1)
Index Cond: (x = 18)
Total runtime: 2.393 ms
(5 rows)

EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 18;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=2.22..43.65 rows=63 width=4) (actual time=0.313..1.436 rows=180 loops=1)
Recheck Cond: (abs(x) = 18)
-> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.22 rows=63 width=0) (actual time=0.263..0.263 rows=180 loops=1)
Index Cond: (abs(x) = 18)
Total runtime: 2.418 ms
(5 rows)

--
Michael Fuhr

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Fuhr (#1)
Re: Expression index ignores column statistics target

Michael Fuhr <mike@fuhr.org> writes:

I've noticed that row count estimates for expression indexes appear
to rely on default_statistics_target rather than on a column's
actual statistics target. That is, if I use ALTER TABLE SET
STATISTICS to increase a column's statistics target and then run
ANALYZE, then estimates for non-expression-index queries improve
as expected. However, queries that use an expression index remain
accurate for only around the N most common values, where N is the
default_statistics_target that was in effect when ANALYZE ran.

The code does in fact honor per-column statistics targets attached to
expression indexes, viz

alter table myfuncindex alter column pg_expression_1 set statistics 100;

This isn't documented, mainly because pg_dump doesn't promise to dump
such things, which it doesn't do because I didn't want to see the
"pg_expression_N" naming for expression index columns become graven on
stone tablets. I seem to recall bringing up the question of whether
we could find a less implementation-specific way of commanding this
behavior, but I can't find it in the archives right now.

regards, tom lane

#4Michael Fuhr
mike@fuhr.org
In reply to: Bruce Momjian (#2)
Re: Expression index ignores column statistics target

On Fri, Sep 30, 2005 at 11:59:26PM -0400, Bruce Momjian wrote:

This is expected. The main TODO items is:

* Allow accurate statistics to be collected on indexes with more than
one column or expression indexes, perhaps using per-index statistics

Basically, we don't have multi-column or expression statistics. ANALYZE
just analyzes columns, even if an expression index exists.

But the row count estimates imply that expression index queries do
use column statistics, presumably as a proxy in the absence of
expression statistics. This looks like a relevant commit:

http://archives.postgresql.org/pgsql-committers/2004-02/msg00124.php

The behavior I observed is that the planner does appear to use
column statistics when planning an expression index query, but it
doesn't appear to honor a column's non-default statistics target.
In other words:

* Row count estimates for expression index queries (at least simple
ones) are reasonably accurate for the N most common column values,
where N is the value of default_statistics_target when ANALYZE
was run.

* Specifically setting the column's statistics target with ALTER
TABLE SET STATISTICS doesn't result in better statistics for
expression index queries.

That difference in behavior seems odd: if default_statistics_target
has an effect, why doesn't ALTER TABLE SET STATISTICS?

--
Michael Fuhr

#5Michael Fuhr
mike@fuhr.org
In reply to: Tom Lane (#3)
Re: Expression index ignores column statistics target

On Sat, Oct 01, 2005 at 12:53:03AM -0400, Tom Lane wrote:

The code does in fact honor per-column statistics targets attached to
expression indexes, viz

alter table myfuncindex alter column pg_expression_1 set statistics 100;

Aha -- that's the piece I didn't know about. I was wondering where
those statistics were being stored, since they were affected by
default_statistics_target but not by per-column statistics targets.
And now I see them when I don't restrict queries against pg_stats
by just the table or column name. Thanks.

--
Michael Fuhr

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#3)
Re: Expression index ignores column statistics target

I wrote:

I seem to recall bringing up the question of whether
we could find a less implementation-specific way of commanding this
behavior, but I can't find it in the archives right now.

Ah, here it is:
http://archives.postgresql.org/pgsql-hackers/2004-03/msg00502.php

No responses :-(

regards, tom lane

#7Michael Fuhr
mike@fuhr.org
In reply to: Tom Lane (#6)
Re: Expression index ignores column statistics target

On Sat, Oct 01, 2005 at 02:19:06AM -0400, Tom Lane wrote:

I wrote:

I seem to recall bringing up the question of whether
we could find a less implementation-specific way of commanding this
behavior, but I can't find it in the archives right now.

Ah, here it is:
http://archives.postgresql.org/pgsql-hackers/2004-03/msg00502.php

No responses :-(

Would an ALTER INDEX SET STATISTICS form be possible?

--
Michael Fuhr

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Fuhr (#7)
Re: Expression index ignores column statistics target

Michael Fuhr <mike@fuhr.org> writes:

On Sat, Oct 01, 2005 at 02:19:06AM -0400, Tom Lane wrote:

Ah, here it is:
http://archives.postgresql.org/pgsql-hackers/2004-03/msg00502.php

Would an ALTER INDEX SET STATISTICS form be possible?

It's not so much the table/index misnomer that's bothering me, it's
the lack of a clean way to identify which column of the index you
are talking about.

regards, tom lane

#9Michael Fuhr
mike@fuhr.org
In reply to: Tom Lane (#8)
Re: Expression index ignores column statistics target

On Sat, Oct 01, 2005 at 02:42:32AM -0400, Tom Lane wrote:

Michael Fuhr <mike@fuhr.org> writes:

Would an ALTER INDEX SET STATISTICS form be possible?

It's not so much the table/index misnomer that's bothering me, it's
the lack of a clean way to identify which column of the index you
are talking about.

Ah, I see -- I wasn't thinking about expressions in multicolumn
indexes. What about identifying the column with the expression
itself, ala quote_ident(pg_get_indexdef())? That might be tedious
for the user to type but it would be attractive from a self-documentation
standpoint.

ALTER INDEX indexname ALTER COLUMN "the expression" SET STATISTICS 100;

I do see that indexes allow multiple instances of the same expression,
so this approach could be ambiguous. Or should such repetition be
prohibited as it is with column names?

test=> CREATE TABLE foo (x integer);
CREATE TABLE
test=> CREATE INDEX foo1_idx ON foo (x, x);
ERROR: duplicate key violates unique constraint "pg_attribute_relid_attnam_index"
test=> CREATE INDEX foo2_idx ON foo (abs(x), abs(x));
CREATE INDEX

--
Michael Fuhr

#10Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#3)
Re: Expression index ignores column statistics target

Tom Lane wrote:

Michael Fuhr <mike@fuhr.org> writes:

I've noticed that row count estimates for expression indexes appear
to rely on default_statistics_target rather than on a column's
actual statistics target. That is, if I use ALTER TABLE SET
STATISTICS to increase a column's statistics target and then run
ANALYZE, then estimates for non-expression-index queries improve
as expected. However, queries that use an expression index remain
accurate for only around the N most common values, where N is the
default_statistics_target that was in effect when ANALYZE ran.

The code does in fact honor per-column statistics targets attached to
expression indexes, viz

alter table myfuncindex alter column pg_expression_1 set statistics 100;

This isn't documented, mainly because pg_dump doesn't promise to dump
such things, which it doesn't do because I didn't want to see the
"pg_expression_N" naming for expression index columns become graven on
stone tablets. I seem to recall bringing up the question of whether
we could find a less implementation-specific way of commanding this
behavior, but I can't find it in the archives right now.

Is this a TODO?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Fuhr (#9)
Re: Expression index ignores column statistics target

Michael Fuhr <mike@fuhr.org> writes:

ALTER INDEX indexname ALTER COLUMN "the expression" SET STATISTICS 100;

Yeah, that could probably be made to work.

I do see that indexes allow multiple instances of the same expression,
so this approach could be ambiguous.

I can't think of an actual use for that, though, so we could just ignore
the possible ambiguity. Or we could have the ALTER update all columns
matching the given expression.

test=> CREATE INDEX foo1_idx ON foo (x, x);
ERROR: duplicate key violates unique constraint "pg_attribute_relid_attnam_index"

Hmm, seems like there should be a more direct check for this ...

regards, tom lane