Expression index ignores column statistics target
I've noticed that row count estimates for expression indexes appear
to rely on default_statistics_target rather than on a column's
actual statistics target. That is, if I use ALTER TABLE SET
STATISTICS to increase a column's statistics target and then run
ANALYZE, then estimates for non-expression-index queries improve
as expected. However, queries that use an expression index remain
accurate for only around the N most common values, where N is the
default_statistics_target that was in effect when ANALYZE ran. I'm
still rummaging through the archives looking for past discussion;
is this behavior a known limitation or just an oversight?
CREATE TABLE foo (x integer);
CREATE INDEX foo_x_idx ON foo (x);
CREATE INDEX foo_abs_x_idx ON foo (abs(x));
INSERT INTO foo (x)
SELECT r1 % r2
FROM generate_series(1, 100) AS g1(r1),
generate_series(1, 100) AS g2(r2);
SET default_statistics_target TO 15;
ALTER TABLE foo ALTER COLUMN x SET STATISTICS 20;
ANALYZE foo;
SELECT most_common_vals FROM pg_stats WHERE attname = 'x';
most_common_vals
-----------------------------------------------------
{0,1,2,3,4,5,6,7,8,10,9,11,12,14,13,15,16,19,17,18}
(1 row)
EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 13;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=2.72..50.28 rows=205 width=4) (actual time=0.370..1.766 rows=220 loops=1)
Recheck Cond: (x = 13)
-> Bitmap Index Scan on foo_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.314..0.314 rows=220 loops=1)
Index Cond: (x = 13)
Total runtime: 2.905 ms
(5 rows)
EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 13;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=2.72..50.80 rows=205 width=4) (actual time=0.358..1.720 rows=220 loops=1)
Recheck Cond: (abs(x) = 13)
-> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.305..0.305 rows=220 loops=1)
Index Cond: (abs(x) = 13)
Total runtime: 2.875 ms
(5 rows)
EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 18;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=2.60..49.75 rows=172 width=4) (actual time=0.312..1.442 rows=180 loops=1)
Recheck Cond: (x = 18)
-> Bitmap Index Scan on foo_x_idx (cost=0.00..2.60 rows=172 width=0) (actual time=0.262..0.262 rows=180 loops=1)
Index Cond: (x = 18)
Total runtime: 2.393 ms
(5 rows)
EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 18;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=2.22..43.65 rows=63 width=4) (actual time=0.313..1.436 rows=180 loops=1)
Recheck Cond: (abs(x) = 18)
-> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.22 rows=63 width=0) (actual time=0.263..0.263 rows=180 loops=1)
Index Cond: (abs(x) = 18)
Total runtime: 2.418 ms
(5 rows)
--
Michael Fuhr
This is expected. The main TODO items is:
* Allow accurate statistics to be collected on indexes with more than
one column or expression indexes, perhaps using per-index statistics
Basically, we don't have multi-column or expression statistics. ANALYZE
just analyzes columns, even if an expression index exists.
---------------------------------------------------------------------------
Michael Fuhr wrote:
I've noticed that row count estimates for expression indexes appear
to rely on default_statistics_target rather than on a column's
actual statistics target. That is, if I use ALTER TABLE SET
STATISTICS to increase a column's statistics target and then run
ANALYZE, then estimates for non-expression-index queries improve
as expected. However, queries that use an expression index remain
accurate for only around the N most common values, where N is the
default_statistics_target that was in effect when ANALYZE ran. I'm
still rummaging through the archives looking for past discussion;
is this behavior a known limitation or just an oversight?CREATE TABLE foo (x integer);
CREATE INDEX foo_x_idx ON foo (x);
CREATE INDEX foo_abs_x_idx ON foo (abs(x));INSERT INTO foo (x)
SELECT r1 % r2
FROM generate_series(1, 100) AS g1(r1),
generate_series(1, 100) AS g2(r2);SET default_statistics_target TO 15;
ALTER TABLE foo ALTER COLUMN x SET STATISTICS 20;
ANALYZE foo;SELECT most_common_vals FROM pg_stats WHERE attname = 'x';
most_common_vals
-----------------------------------------------------
{0,1,2,3,4,5,6,7,8,10,9,11,12,14,13,15,16,19,17,18}
(1 row)EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 13;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=2.72..50.28 rows=205 width=4) (actual time=0.370..1.766 rows=220 loops=1)
Recheck Cond: (x = 13)
-> Bitmap Index Scan on foo_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.314..0.314 rows=220 loops=1)
Index Cond: (x = 13)
Total runtime: 2.905 ms
(5 rows)EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 13;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=2.72..50.80 rows=205 width=4) (actual time=0.358..1.720 rows=220 loops=1)
Recheck Cond: (abs(x) = 13)
-> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.72 rows=205 width=0) (actual time=0.305..0.305 rows=220 loops=1)
Index Cond: (abs(x) = 13)
Total runtime: 2.875 ms
(5 rows)EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 18;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=2.60..49.75 rows=172 width=4) (actual time=0.312..1.442 rows=180 loops=1)
Recheck Cond: (x = 18)
-> Bitmap Index Scan on foo_x_idx (cost=0.00..2.60 rows=172 width=0) (actual time=0.262..0.262 rows=180 loops=1)
Index Cond: (x = 18)
Total runtime: 2.393 ms
(5 rows)EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 18;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=2.22..43.65 rows=63 width=4) (actual time=0.313..1.436 rows=180 loops=1)
Recheck Cond: (abs(x) = 18)
-> Bitmap Index Scan on foo_abs_x_idx (cost=0.00..2.22 rows=63 width=0) (actual time=0.263..0.263 rows=180 loops=1)
Index Cond: (abs(x) = 18)
Total runtime: 2.418 ms
(5 rows)--
Michael Fuhr---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Michael Fuhr <mike@fuhr.org> writes:
I've noticed that row count estimates for expression indexes appear
to rely on default_statistics_target rather than on a column's
actual statistics target. That is, if I use ALTER TABLE SET
STATISTICS to increase a column's statistics target and then run
ANALYZE, then estimates for non-expression-index queries improve
as expected. However, queries that use an expression index remain
accurate for only around the N most common values, where N is the
default_statistics_target that was in effect when ANALYZE ran.
The code does in fact honor per-column statistics targets attached to
expression indexes, viz
alter table myfuncindex alter column pg_expression_1 set statistics 100;
This isn't documented, mainly because pg_dump doesn't promise to dump
such things, which it doesn't do because I didn't want to see the
"pg_expression_N" naming for expression index columns become graven on
stone tablets. I seem to recall bringing up the question of whether
we could find a less implementation-specific way of commanding this
behavior, but I can't find it in the archives right now.
regards, tom lane
On Fri, Sep 30, 2005 at 11:59:26PM -0400, Bruce Momjian wrote:
This is expected. The main TODO items is:
* Allow accurate statistics to be collected on indexes with more than
one column or expression indexes, perhaps using per-index statisticsBasically, we don't have multi-column or expression statistics. ANALYZE
just analyzes columns, even if an expression index exists.
But the row count estimates imply that expression index queries do
use column statistics, presumably as a proxy in the absence of
expression statistics. This looks like a relevant commit:
http://archives.postgresql.org/pgsql-committers/2004-02/msg00124.php
The behavior I observed is that the planner does appear to use
column statistics when planning an expression index query, but it
doesn't appear to honor a column's non-default statistics target.
In other words:
* Row count estimates for expression index queries (at least simple
ones) are reasonably accurate for the N most common column values,
where N is the value of default_statistics_target when ANALYZE
was run.
* Specifically setting the column's statistics target with ALTER
TABLE SET STATISTICS doesn't result in better statistics for
expression index queries.
That difference in behavior seems odd: if default_statistics_target
has an effect, why doesn't ALTER TABLE SET STATISTICS?
--
Michael Fuhr
On Sat, Oct 01, 2005 at 12:53:03AM -0400, Tom Lane wrote:
The code does in fact honor per-column statistics targets attached to
expression indexes, vizalter table myfuncindex alter column pg_expression_1 set statistics 100;
Aha -- that's the piece I didn't know about. I was wondering where
those statistics were being stored, since they were affected by
default_statistics_target but not by per-column statistics targets.
And now I see them when I don't restrict queries against pg_stats
by just the table or column name. Thanks.
--
Michael Fuhr
I wrote:
I seem to recall bringing up the question of whether
we could find a less implementation-specific way of commanding this
behavior, but I can't find it in the archives right now.
Ah, here it is:
http://archives.postgresql.org/pgsql-hackers/2004-03/msg00502.php
No responses :-(
regards, tom lane
On Sat, Oct 01, 2005 at 02:19:06AM -0400, Tom Lane wrote:
I wrote:
I seem to recall bringing up the question of whether
we could find a less implementation-specific way of commanding this
behavior, but I can't find it in the archives right now.Ah, here it is:
http://archives.postgresql.org/pgsql-hackers/2004-03/msg00502.phpNo responses :-(
Would an ALTER INDEX SET STATISTICS form be possible?
--
Michael Fuhr
Michael Fuhr <mike@fuhr.org> writes:
On Sat, Oct 01, 2005 at 02:19:06AM -0400, Tom Lane wrote:
Ah, here it is:
http://archives.postgresql.org/pgsql-hackers/2004-03/msg00502.php
Would an ALTER INDEX SET STATISTICS form be possible?
It's not so much the table/index misnomer that's bothering me, it's
the lack of a clean way to identify which column of the index you
are talking about.
regards, tom lane
On Sat, Oct 01, 2005 at 02:42:32AM -0400, Tom Lane wrote:
Michael Fuhr <mike@fuhr.org> writes:
Would an ALTER INDEX SET STATISTICS form be possible?
It's not so much the table/index misnomer that's bothering me, it's
the lack of a clean way to identify which column of the index you
are talking about.
Ah, I see -- I wasn't thinking about expressions in multicolumn
indexes. What about identifying the column with the expression
itself, ala quote_ident(pg_get_indexdef())? That might be tedious
for the user to type but it would be attractive from a self-documentation
standpoint.
ALTER INDEX indexname ALTER COLUMN "the expression" SET STATISTICS 100;
I do see that indexes allow multiple instances of the same expression,
so this approach could be ambiguous. Or should such repetition be
prohibited as it is with column names?
test=> CREATE TABLE foo (x integer);
CREATE TABLE
test=> CREATE INDEX foo1_idx ON foo (x, x);
ERROR: duplicate key violates unique constraint "pg_attribute_relid_attnam_index"
test=> CREATE INDEX foo2_idx ON foo (abs(x), abs(x));
CREATE INDEX
--
Michael Fuhr
Tom Lane wrote:
Michael Fuhr <mike@fuhr.org> writes:
I've noticed that row count estimates for expression indexes appear
to rely on default_statistics_target rather than on a column's
actual statistics target. That is, if I use ALTER TABLE SET
STATISTICS to increase a column's statistics target and then run
ANALYZE, then estimates for non-expression-index queries improve
as expected. However, queries that use an expression index remain
accurate for only around the N most common values, where N is the
default_statistics_target that was in effect when ANALYZE ran.The code does in fact honor per-column statistics targets attached to
expression indexes, vizalter table myfuncindex alter column pg_expression_1 set statistics 100;
This isn't documented, mainly because pg_dump doesn't promise to dump
such things, which it doesn't do because I didn't want to see the
"pg_expression_N" naming for expression index columns become graven on
stone tablets. I seem to recall bringing up the question of whether
we could find a less implementation-specific way of commanding this
behavior, but I can't find it in the archives right now.
Is this a TODO?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Michael Fuhr <mike@fuhr.org> writes:
ALTER INDEX indexname ALTER COLUMN "the expression" SET STATISTICS 100;
Yeah, that could probably be made to work.
I do see that indexes allow multiple instances of the same expression,
so this approach could be ambiguous.
I can't think of an actual use for that, though, so we could just ignore
the possible ambiguity. Or we could have the ALTER update all columns
matching the given expression.
test=> CREATE INDEX foo1_idx ON foo (x, x);
ERROR: duplicate key violates unique constraint "pg_attribute_relid_attnam_index"
Hmm, seems like there should be a more direct check for this ...
regards, tom lane