PoC/WIP: Extended statistics on expressions

Started by Tomas Vondraover 5 years ago107 messageshackers
Jump to latest
#1Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi,

Attached is a PoC/WIP patch adding support for extended statistics on
expressions. This is by no means "ready" - most of the stuff works, but
often in a rather hackish way. I certainly don't expect this to pass
regression tests, for example.

There's an example demonstrating how this works for two queries at the
end of this message. Now let's talk about the main parts of the patch:

1) extending grammar to allow expressions, not just plain columns

Fairly straighforward, I think. I'm sure the logic which expressions
are allowed is not 100% (e.g. volatile functions etc.) but that's
a detail we can deal with later.

2) store the expressions in pg_statistic_ext catalog

I ended up adding a separate column, similar to indexprs, except that
the order of columns/expressions does not matter, so we don't need to
bother with storing 0s in stxkeys - we simply consider expressions to
be "after" all the simple columns.

3) build statistics

This should work too, for all three types of statistics we have (mcv,
dependencies and ndistinct). This should work too, although the code
changes are often very hackish "to make it work".

The main challenge here was how to represent the expressions in the
statistics - e.g. in ndistinct, which track ndistinct estimates for
combinations of parameters, and so far we used attnums for that. I
decided the easiest way it to keep doing that, but offset the
expressions by MaxHeapAttributeNumber. That seems to work, but maybe
there's a better way.

4) apply the statistics

This is the hard part, really, and the exact state of the support
depends on type of statistics.

For ndistinct coefficients, it generally works. I'm sure there may be
bugs in estimate_num_groups, etc. but in principle it works.

For MCV lists, it generally works too - you can define statistics on
the expressions and the estimates should improve. The main downside
here is that it requires at least two expressions, otherwise we can't
build/apply the extended statistics. So for example

SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0

may be estimated "correctly", once you drop any of the conditions it
gets much worse as we don't have stats for individual expressions.
That's rather annoying - it does not break the extended MCV, but the
behavior will certainly cause confusion.

For functional dependencies, the estimation does not work yet. Also,
the missing per-column statistics have bigger impact than on MCV,
because while MCV can work fine without it, the dependencies heavily
rely on the per-column estimates. We only apply "corrections" based
on the dependency degree, so we still need (good) per-column
estimates, which does not quite work with the expressions.

Of course, the lack of per-expression statistics may be somewhat
fixed by adding indexes on expressions, but that's kinda expensive.

Now, a simple example demonstrating how this improves estimates - let's
create a table with 1M rows, and do queries with mod() expressions on
it. It might be date_trunc() or something similar, that'd work too.

table with 1M rows
==================

test=# create table t (a int);
test=# insert into t select i from generate_series(1,1000000) s(i);
test=# analyze t;

poorly estimated queries
========================

test=# explain (analyze, timing off) select * from t where mod(a,3) = 0
and mod(a,7) = 0;
QUERY PLAN

----------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..24425.00 rows=25 width=4) (actual rows=47619
loops=1)
Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0))
Rows Removed by Filter: 952381
Planning Time: 0.329 ms
Execution Time: 156.675 ms
(5 rows)

test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t
group by 1, 2;
QUERY PLAN

-----------------------------------------------------------------------------------------------
HashAggregate (cost=75675.00..98487.50 rows=1000000 width=8) (actual
rows=21 loops=1)
Group Key: mod(a, 3), mod(a, 7)
Planned Partitions: 32 Batches: 1 Memory Usage: 1561kB
-> Seq Scan on t (cost=0.00..19425.00 rows=1000000 width=8) (actual
rows=1000000 loops=1)
Planning Time: 0.277 ms
Execution Time: 502.803 ms
(6 rows)

improved estimates
==================

test=# create statistics s1 (ndistinct) on mod(a,3), mod(a,7) from t;
test=# analyze t;

test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t
group by 1, 2;
QUERY PLAN

-----------------------------------------------------------------------------------------------
HashAggregate (cost=24425.00..24425.31 rows=21 width=8) (actual
rows=21 loops=1)
Group Key: mod(a, 3), mod(a, 7)
Batches: 1 Memory Usage: 24kB
-> Seq Scan on t (cost=0.00..19425.00 rows=1000000 width=8) (actual
rows=1000000 loops=1)
Planning Time: 0.135 ms
Execution Time: 500.092 ms
(6 rows)

test=# create statistics s2 (mcv) on mod(a,3), mod(a,7) from t;
test=# analyze t;

test=# explain (analyze, timing off) select * from t where mod(a,3) = 0
and mod(a,7) = 0;
QUERY PLAN

-------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..24425.00 rows=46433 width=4) (actual
rows=47619 loops=1)
Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0))
Rows Removed by Filter: 952381
Planning Time: 0.702 ms
Execution Time: 152.280 ms
(5 rows)

Clearly, estimates for both queries are significantly improved. Of
course, this example is kinda artificial/simplistic.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Support-for-extended-statistics-on-expressi-20201116.patchtext/x-patch; charset=UTF-8; name=0001-Support-for-extended-statistics-on-expressi-20201116.patchDownload+2839-282
#2Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#1)
Re: PoC/WIP: Extended statistics on expressions

On 11/16/20 2:49 PM, Tomas Vondra wrote:

Hi,

...

4) apply the statistics

This is the hard part, really, and the exact state of the support
depends on type of statistics.

For ndistinct coefficients, it generally works. I'm sure there may be
bugs in estimate_num_groups, etc. but in principle it works.

For MCV lists, it generally works too - you can define statistics on
the expressions and the estimates should improve. The main downside
here is that it requires at least two expressions, otherwise we can't
build/apply the extended statistics. So for example

SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0

may be estimated "correctly", once you drop any of the conditions it
gets much worse as we don't have stats for individual expressions.
That's rather annoying - it does not break the extended MCV, but the
behavior will certainly cause confusion.

For functional dependencies, the estimation does not work yet. Also,
the missing per-column statistics have bigger impact than on MCV,
because while MCV can work fine without it, the dependencies heavily
rely on the per-column estimates. We only apply "corrections" based
on the dependency degree, so we still need (good) per-column
estimates, which does not quite work with the expressions.

Of course, the lack of per-expression statistics may be somewhat
fixed by adding indexes on expressions, but that's kinda expensive.

FWIW after re-reading [1]/messages/by-id/6331.1579041473@sss.pgh.pa.us, I think the plan to build pg_statistic rows
for expressions and stash them in pg_statistic_ext_data is the way to
go. I was thinking that maybe we'll need some new statistics type to
request this, e.g.

CREATE STATISTICS s (expressions) ON ...

but on second thought I think we should just build this whenever there
are expressions in the definition. It'll require some changes (e.g. we
require at least two items in the list, but we'll want to allow building
stats on a single expression too, I guess), but that's doable.

Of course, we don't have any catalogs with composite types yet, so it's
not 100% sure this will work, but it's worth a try.

regards

[1]: /messages/by-id/6331.1579041473@sss.pgh.pa.us
/messages/by-id/6331.1579041473@sss.pgh.pa.us

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#2)
Re: PoC/WIP: Extended statistics on expressions

Hi,

attached is a significantly improved version of the patch, allowing
defining extended statistics on expressions. This fixes most of the
problems in the previous WIP version and AFAICS it does pass all
regression tests (including under valgrind). There's a bunch of FIXMEs
and a couple loose ends, but overall I think it's ready for reviews.

Overall, the patch does two main things:

* it adds a new "expressions" statistics kind, building per-expression
statistics (i.e it's similar to having expression index)

* it allows using expressions in definition of extended statistics, and
properly handles that in all existing statistics kinds (dependencies,
mcv, ndistinct)

The expression handling mostly copies what we do for indexes, with
similar restrictions - no volatile functions, aggregates etc. The list
of expressions is stored in pg_statistic_ext catalog, but unlike for
indexes we don't need to worry about the exact order of elements, so
there are no "0" for expressions in stxkeys etc. We simply assume the
expressions come after simple columns, and that's it.

To reference expressions in the built statistics (e.g. in a dependency)
we use "special attnums" computed from the expression index by adding
MaxHeapAttributeNumber. So the first expression has attnum 1601 etc.

This mapping expressions to attnums is used both while building and
applying the statistics to clauses, as it makes the whole process much
simpler than dealing with attnums and expressions entirely separately.

The first part allows us to do something like this:

CREATE TABLE t (a int);
INSERT INTO t SELECT i FROM generate_series(1,1000000) s(i);
ANALYZE t;

EXPLAIN (ANALYZE, TIMING OFF)
SELECT * FROM t WHERE mod(a,10) = 0;

CREATE STATISTICS s (expressions) ON mod(a,10) FROM t;
ANALYZE t;

EXPLAIN (ANALYZE, TIMING OFF)
SELECT * FROM t WHERE mod(a,10) = 0;

Without the statistics we get this:

QUERY PLAN
--------------------------------------------------------
Seq Scan on t (cost=0.00..19425.00 rows=5000 width=4)
(actual rows=100000 loops=1)
Filter: (mod(a, 10) = 0)
Rows Removed by Filter: 900000
Planning Time: 0.216 ms
Execution Time: 157.552 ms
(5 rows)

while with the statistics we get this

QUERY PLAN
----------------------------------------------------------
Seq Scan on t (cost=0.00..19425.00 rows=100900 width=4)
(actual rows=100000 loops=1)
Filter: (mod(a, 10) = 0)
Rows Removed by Filter: 900000
Planning Time: 0.399 ms
Execution Time: 157.530 ms
(5 rows)

So that's pretty nice improvement. In practice you could get the same
effect by creating an expression index

CREATE INDEX ON t (mod(a,10));

but of course that's far from free - there's cost to maintain the index,
it blocks HOT, and it takes space on disk. The statistics have none of
these issues.

Implementation-wise, this simply builds per-column statistics for each
expression, and stashes them into a new column in pg_statistic_ext_data
catalog as an array of elements with pg_statistic composite type. And
then in selfuncs.c we look not just at indexes, but also at this when
looking for expression stats.

So that gives us the per-expression stats. This is enabled by default
when you don't specify the statistics type and the definition includes
any expression that is not a simple column reference. Otherwise you may
also request it explicitly by using "expressions" in the CREATE.

Now, the second part is really just a natural extension of the existing
stats to also work with expressions. The easiest thing is probably to
show some examples, so consider this:

CREATE TABLE t (a INT, b INT, c INT);
INSERT INTO t SELECT i, i, i FROM generate_series(1,1000000) s(i);
ANALYZE t;

which without any statistics gives us this:

EXPLAIN (ANALYZE, TIMING OFF)
SELECT 1 FROM t WHERE mod(a,10) = 0 AND mod(b,5) = 0;

QUERY PLAN
------------------------------------------------------
Seq Scan on t (cost=0.00..25406.00 rows=25 width=4)
(actual rows=100000 loops=1)
Filter: ((mod(a, 10) = 0) AND (mod(b, 5) = 0))
Rows Removed by Filter: 900000
Planning Time: 0.080 ms
Execution Time: 161.445 ms
(5 rows)

EXPLAIN (ANALYZE, TIMING OFF)
SELECT 1 FROM t GROUP BY mod(a,10), mod(b,5);

QUERY PLAN
------------------------------------------------------------------
HashAggregate (cost=76656.00..99468.50 rows=1000000 width=12)
(actual rows=10 loops=1)
Group Key: mod(a, 10), mod(b, 5)
Planned Partitions: 32 Batches: 1 Memory Usage: 1561kB
-> Seq Scan on t (cost=0.00..20406.00 rows=1000000 width=8)
(actual rows=1000000 loops=1)
Planning Time: 0.232 ms
Execution Time: 514.446 ms
(6 rows)

and now let's add statistics on the expressions:

CREATE STATISTICS s ON mod(a,10), mod(b,5) FROM t;
ANALYZE t;

which ends up with these spot-on estimates:

EXPLAIN (ANALYZE, TIMING OFF)
SELECT 1 FROM t WHERE mod(a,10) = 0 AND mod(b,5) = 0;

QUERY PLAN
---------------------------------------------------------
Seq Scan on t (cost=0.00..25406.00 rows=97400 width=4)
(actual rows=100000 loops=1)
Filter: ((mod(a, 10) = 0) AND (mod(b, 5) = 0))
Rows Removed by Filter: 900000
Planning Time: 0.366 ms
Execution Time: 159.207 ms
(5 rows)

EXPLAIN (ANALYZE, TIMING OFF)
SELECT 1 FROM t GROUP BY mod(a,10), mod(b,5);

QUERY PLAN
-----------------------------------------------------------------
HashAggregate (cost=25406.00..25406.15 rows=10 width=12)
(actual rows=10 loops=1)
Group Key: mod(a, 10), mod(b, 5)
Batches: 1 Memory Usage: 24kB
-> Seq Scan on t (cost=0.00..20406.00 rows=1000000 width=8)
(actual rows=1000000 loops=1)
Planning Time: 0.299 ms
Execution Time: 530.793 ms
(6 rows)

Of course, this is a very simple query, but hopefully you get the idea.

There's about two main areas where I think might be hidden issues:

1) We're kinda faking the pg_statistic entries, and I suppose there
might be some loose ends (e.g. with respect to ACLs etc.).

2) Memory management while evaluating the expressions during analyze is
kinda simplistic, we're probably keeping the memory around longer than
needed etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Extended-statistics-on-expressions-20201122.patchtext/x-patch; charset=UTF-8; name=0001-Extended-statistics-on-expressions-20201122.patchDownload+4346-337
#4Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#3)
Re: PoC/WIP: Extended statistics on expressions

On Sun, Nov 22, 2020 at 08:03:51PM +0100, Tomas Vondra wrote:

attached is a significantly improved version of the patch, allowing
defining extended statistics on expressions. This fixes most of the
problems in the previous WIP version and AFAICS it does pass all
regression tests (including under valgrind). There's a bunch of FIXMEs
and a couple loose ends, but overall I think it's ready for reviews.

I was looking at the previous patch, so now read this one instead, and attach
some proposed fixes.

+ * This matters especially for * expensive expressions, of course.

+   The expression can refer only to columns of the underlying table, but
+   it can use all columns, not just the ones the statistics is defined
+   on.

I don't know what these are trying to say?

+                                errmsg("statistics expressions and predicates can refer only to the table being indexed")));
+        * partial-index predicates.  Create it in the per-index context to be

I think these are copied and shouldn't mention "indexes" or "predicates". Or
should statistics support predicates, too ?

Idea: if a user specifies no stakinds, and there's no expression specified,
then you automatically build everything except for expressional stats. But if
they specify only one statistics "column", it gives an error. If that's a
non-simple column reference, should that instead build *only* expressional
stats (possibly with a NOTICE, since the user might be intending to make MV
stats).

I think pg_stats_ext should allow inspecting the pg_statistic data in
pg_statistic_ext_data.stxdexprs. I guess array_agg() should be ordered by
something, so maybe it should use ORDINALITY (?)

I hacked more on bootstrap.c so included that here.

--
Justin

Attachments:

0001-bootstrap-convert-Typ-to-a-List.patchtext/x-diff; charset=us-asciiDownload+31-39
0002-Allow-composite-types-in-bootstrap.patchtext/x-diff; charset=us-asciiDownload+28-1
0003-Extended-statistics-on-expressions.patchtext/x-diff; charset=us-asciiDownload+4328-337
0004-Fix-silly-errors.patchtext/x-diff; charset=us-asciiDownload+6-10
0005-Small-language-fixen.patchtext/x-diff; charset=us-asciiDownload+13-14
0006-Some-cleanup.patchtext/x-diff; charset=us-asciiDownload+44-87
0007-WIP-Update-pg_stats_ext-for-expressional-stats-incom.patchtext/x-diff; charset=us-asciiDownload+211-1
#5Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#4)
Re: PoC/WIP: Extended statistics on expressions

On 11/23/20 3:26 AM, Justin Pryzby wrote:

On Sun, Nov 22, 2020 at 08:03:51PM +0100, Tomas Vondra wrote:

attached is a significantly improved version of the patch, allowing
defining extended statistics on expressions. This fixes most of the
problems in the previous WIP version and AFAICS it does pass all
regression tests (including under valgrind). There's a bunch of FIXMEs
and a couple loose ends, but overall I think it's ready for reviews.

I was looking at the previous patch, so now read this one instead, and attach
some proposed fixes.

+ * This matters especially for * expensive expressions, of course.

The point this was trying to make is that we evaluate the expressions
only once, and use the results to build all extended statistics. Instead
of leaving it up to every "build" to re-evaluate it.

+   The expression can refer only to columns of the underlying table, but
+   it can use all columns, not just the ones the statistics is defined
+   on.

I don't know what these are trying to say?

D'oh. That's bogus para, copied from the CREATE INDEX docs (where it
talked about the index predicate, which is irrelevant here).

+                                errmsg("statistics expressions and predicates can refer only to the table being indexed")));
+        * partial-index predicates.  Create it in the per-index context to be

I think these are copied and shouldn't mention "indexes" or "predicates". Or
should statistics support predicates, too ?

Right. Stupid copy-pasto.

Idea: if a user specifies no stakinds, and there's no expression specified,
then you automatically build everything except for expressional stats. But if
they specify only one statistics "column", it gives an error. If that's a
non-simple column reference, should that instead build *only* expressional
stats (possibly with a NOTICE, since the user might be intending to make MV
stats).

Right, that was the intention - but I messed up and it works only if you
specify the "expressions" kind explicitly (and I also added the ERROR
message to expected output by mistake). I agree we should handle this
automatically, so that

CREATE STATISTICS s ON (a+b) FROM t

works and only creates statistics for the expression.

I think pg_stats_ext should allow inspecting the pg_statistic data in
pg_statistic_ext_data.stxdexprs. I guess array_agg() should be ordered by
something, so maybe it should use ORDINALITY (?)

I agree we should expose the expression statistics, but I'm not
convinced we should do that in the pg_stats_ext view itself. The problem
is that it's a table bested in a table, essentially, with non-trivial
structure, so I was thinking about adding a separate view exposing just
this one part. Something like pg_stats_ext_expressions, with about the
same structure as pg_stats, or something.

I hacked more on bootstrap.c so included that here.

Thanks. As for the 0004-0007 patches:

0004 - Seems fine. IMHO not really "silly errors" but OK.

0005 - Mostly OK. The docs wording mostly comes from CREATE INDEX docs,
though. The paragraph about "t1" is old, so if we want to reword it then
maybe we should backpatch too.

0006 - Not sure. I think CreateStatistics can be fixed with less code,
keeping it more like PG13 (good for backpatching). Not sure why rename
extended statistics to multi-variate statistics - we use "extended"
everywhere. Not sure what's the point of serialize_expr_stats changes,
that's code is mostly copy-paste from update_attstats.

0007 - I suspect this makes the pg_stats_ext too complex to work with,
IMHO we should move this to a separate view.

Thanks for the review! I'll try to look more closely at those patches
sometime next week, and merge most of it.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#5)
Re: PoC/WIP: Extended statistics on expressions

Hi,

Attached is an updated version of the patch series, merging some of the
changes proposed by Justin. I've kept the bootstrap patches separate, at
least for now.

As for the individual 0004-0007 patches:

1) 0004 - merged as is

2) 0005 - I've merged some of the docs changes, but some of the wording
was copied from CREATE INDEX docs in which case I've kept that. I've
also not merged changed to pre-existing docs, like the t1 example which
is unrelated to this patch.

OTOH I've corrected the t3 example description, which was somewhat bogus
and unrelated to what the example actually did. I've also removed the
irrelevant para which originally described index predicates and was
copied from CREATE INDEX docs by mistake.

3) 0006 - I've committed something similar / less invasive, achieving
the same goals (I think), and I've added a couple regression tests.

4) 0007 - I agreed we need a way to expose the stats, but including this
in pg_stats_ext seems rather inconvenient (table in a table is difficult
to work with). Instead I've added a new catalog pg_stats_ext_exprs with
structure similar to pg_stats. I've also added the expressions to the
pg_stats_ext catalog, which was only showing the attributes, and some
basic docs for the catalog changes.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-bootstrap-convert-Typ-to-a-List-20201123.patchtext/x-patch; charset=UTF-8; name=0001-bootstrap-convert-Typ-to-a-List-20201123.patchDownload+31-39
0002-Allow-composite-types-in-bootstrap-20201123.patchtext/x-patch; charset=UTF-8; name=0002-Allow-composite-types-in-bootstrap-20201123.patchDownload+28-1
0003-Extended-statistics-on-expressions-20201123.patchtext/x-patch; charset=UTF-8; name=0003-Extended-statistics-on-expressions-20201123.patchDownload+4815-356
#7Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#5)
Re: PoC/WIP: Extended statistics on expressions

On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:

0004 - Seems fine. IMHO not really "silly errors" but OK.

This is one of the same issues you pointed out - shadowing a variable.
Could be backpatched.

On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:

+                                errmsg("statistics expressions and predicates can refer only to the table being indexed")));
+        * partial-index predicates.  Create it in the per-index context to be

I think these are copied and shouldn't mention "indexes" or "predicates". Or
should statistics support predicates, too ?

Right. Stupid copy-pasto.

Right, but then I was wondering if CREATE STATS should actually support
predicates, since one use case is to do what indexes do without their overhead.
I haven't thought about it enough yet.

0006 - Not sure. I think CreateStatistics can be fixed with less code,
keeping it more like PG13 (good for backpatching). Not sure why rename
extended statistics to multi-variate statistics - we use "extended"
everywhere.

-       if (build_expressions && (list_length(stxexprs) == 0))
+       if (!build_expressions_only && (list_length(stmt->exprs) < 2))
                ereport(ERROR,  
                                (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
-                                errmsg("extended expression statistics require at least one expression")));
+                                errmsg("multi-variate statistics require at least two columns")));

I think all of "CREATE STATISTICS" has been known as "extended stats", so I
think it may be confusing to say that it requires two columns for the general
facility.

Not sure what's the point of serialize_expr_stats changes,
that's code is mostly copy-paste from update_attstats.

Right. I think "i" is poor variable name when it isn't a loop variable and not
of limited scope.

0007 - I suspect this makes the pg_stats_ext too complex to work with,
IMHO we should move this to a separate view.

Right - then unnest() the whole thing and return one row per expression rather
than array, as you've done. Maybe the docs should say that this returns one
row per expression.

Looking quickly at your new patch: I guess you know there's a bunch of
lingering references to "indexes" and "predicates":

I don't know if you want to go to the effort to prohibit this.
postgres=# CREATE STATISTICS asf ON (tableoid::int+1) FROM t;
CREATE STATISTICS

I think a lot of people will find this confusing:

postgres=# CREATE STATISTICS asf ON i FROM t;
ERROR: extended statistics require at least 2 columns
postgres=# CREATE STATISTICS asf ON (i) FROM t;
CREATE STATISTICS

postgres=# CREATE STATISTICS asf (expressions) ON i FROM t;
ERROR: extended expression statistics require at least one expression
postgres=# CREATE STATISTICS asf (expressions) ON (i) FROM t;
CREATE STATISTICS

I haven't looked, but is it possible to make it work without parens ?

--
Justin

#8Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#7)
Re: PoC/WIP: Extended statistics on expressions

On 11/24/20 5:23 PM, Justin Pryzby wrote:

On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:

0004 - Seems fine. IMHO not really "silly errors" but OK.

This is one of the same issues you pointed out - shadowing a variable.
Could be backpatched.

On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:

+                                errmsg("statistics expressions and predicates can refer only to the table being indexed")));
+        * partial-index predicates.  Create it in the per-index context to be

I think these are copied and shouldn't mention "indexes" or "predicates". Or
should statistics support predicates, too ?

Right. Stupid copy-pasto.

Right, but then I was wondering if CREATE STATS should actually support
predicates, since one use case is to do what indexes do without their overhead.
I haven't thought about it enough yet.

Well, it's not supported now, so the message is bogus. I'm not against
supporting "partial statistics" with predicates in the future, but it's
going to be non-trivial project on it's own. It's not something I can
bolt onto the current patch easily.

0006 - Not sure. I think CreateStatistics can be fixed with less code,
keeping it more like PG13 (good for backpatching). Not sure why rename
extended statistics to multi-variate statistics - we use "extended"
everywhere.

-       if (build_expressions && (list_length(stxexprs) == 0))
+       if (!build_expressions_only && (list_length(stmt->exprs) < 2))
ereport(ERROR,  
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
-                                errmsg("extended expression statistics require at least one expression")));
+                                errmsg("multi-variate statistics require at least two columns")));

I think all of "CREATE STATISTICS" has been known as "extended stats", so I
think it may be confusing to say that it requires two columns for the general
facility.

Not sure what's the point of serialize_expr_stats changes,
that's code is mostly copy-paste from update_attstats.

Right. I think "i" is poor variable name when it isn't a loop variable and not
of limited scope.

OK, I understand. I'll consider tweaking that.

0007 - I suspect this makes the pg_stats_ext too complex to work with,
IMHO we should move this to a separate view.

Right - then unnest() the whole thing and return one row per expression rather
than array, as you've done. Maybe the docs should say that this returns one
row per expression.

Looking quickly at your new patch: I guess you know there's a bunch of
lingering references to "indexes" and "predicates":

I don't know if you want to go to the effort to prohibit this.
postgres=# CREATE STATISTICS asf ON (tableoid::int+1) FROM t;
CREATE STATISTICS

Hmm, we're already rejecting system attributes, I suppose we should do
the same thing for expressions on system attributes.

I think a lot of people will find this confusing:

postgres=# CREATE STATISTICS asf ON i FROM t;
ERROR: extended statistics require at least 2 columns
postgres=# CREATE STATISTICS asf ON (i) FROM t;
CREATE STATISTICS

postgres=# CREATE STATISTICS asf (expressions) ON i FROM t;
ERROR: extended expression statistics require at least one expression
postgres=# CREATE STATISTICS asf (expressions) ON (i) FROM t;
CREATE STATISTICS

I haven't looked, but is it possible to make it work without parens ?

Hmm, you're right that may be surprising. I suppose we could walk the
expressions while creating the statistics, and replace such trivial
expressions with the nested variable, but I haven't tried. I wonder what
the CREATE INDEX behavior would be in these cases.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#6)
Re: PoC/WIP: Extended statistics on expressions

Hi,

Attached is a patch series rebased on top of 25a9e54d2d which improves
estimation of OR clauses. There were only a couple minor conflicts.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-bootstrap-convert-Typ-to-a-List-20201203.patchtext/x-patch; charset=UTF-8; name=0001-bootstrap-convert-Typ-to-a-List-20201203.patchDownload+31-39
0002-Allow-composite-types-in-bootstrap-20201203.patchtext/x-patch; charset=UTF-8; name=0002-Allow-composite-types-in-bootstrap-20201203.patchDownload+28-1
0003-Extended-statistics-on-expressions-20201203.patchtext/x-patch; charset=UTF-8; name=0003-Extended-statistics-on-expressions-20201203.patchDownload+4816-357
#10Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#9)
Re: PoC/WIP: Extended statistics on expressions

On Thu, 3 Dec 2020 at 15:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Attached is a patch series rebased on top of 25a9e54d2d.

After reading this thread and [1]/messages/by-id/1009.1579038764@sss.pgh.pa.us, I think I prefer the name
"standard" rather than "expressions", because it is meant to describe
the kind of statistics being built rather than what they apply to, but
maybe that name doesn't actually need to be exposed to the end user:

Looking at the current behaviour, there are a couple of things that
seem a little odd, even though they are understandable. For example,
the fact that

CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;

fails, but

CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;

succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax

CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;

tends to suggest that it's going to create statistics on the pair of
expressions, describing their correlation, when actually it builds 2
independent statistics. Also, this error text isn't entirely accurate:

CREATE STATISTICS s ON col FROM tbl;
ERROR: extended statistics require at least 2 columns

because extended statistics don't always require 2 columns, they can
also just have an expression, or multiple expressions and 0 or 1
columns.

I think a lot of this stems from treating "expressions" in the same
way as the other (multi-column) stats kinds, and it might actually be
neater to have separate documented syntaxes for single- and
multi-column statistics:

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
ON (expression)
FROM table_name

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
[ ( statistics_kind [, ... ] ) ]
ON { column_name | (expression) } , { column_name | (expression) } [, ...]
FROM table_name

The first syntax would create single-column stats, and wouldn't accept
a statistics_kind argument, because there is only one kind of
single-column statistic. Maybe that might change in the future, but if
so, it's likely that the kinds of single-column stats will be
different from the kinds of multi-column stats.

In the second syntax, the only accepted kinds would be the current
multi-column stats kinds (ndistinct, dependencies, and mcv), and it
would always build stats describing the correlations between the
columns listed. It would continue to build standard/expression stats
on any expressions in the list, but that's more of an implementation
detail.

It would no longer be possible to do "CREATE STATISTICS s
(expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
issue 2 separate "CREATE STATISTICS" commands, but that seems more
logical, because they're independent stats.

The parsing code might not change much, but some of the errors would
be different. For example, the errors "building only extended
expression statistics on simple columns not allowed" and "extended
expression statistics require at least one expression" would go away,
and the error "extended statistics require at least 2 columns" might
become more specific, depending on the stats kind.

Regards,
Dean

[1]: /messages/by-id/1009.1579038764@sss.pgh.pa.us

#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#10)
Re: PoC/WIP: Extended statistics on expressions

On 12/7/20 10:56 AM, Dean Rasheed wrote:

On Thu, 3 Dec 2020 at 15:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Attached is a patch series rebased on top of 25a9e54d2d.

After reading this thread and [1], I think I prefer the name
"standard" rather than "expressions", because it is meant to describe
the kind of statistics being built rather than what they apply to, but
maybe that name doesn't actually need to be exposed to the end user:

Looking at the current behaviour, there are a couple of things that
seem a little odd, even though they are understandable. For example,
the fact that

CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;

fails, but

CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;

succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax

CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;

tends to suggest that it's going to create statistics on the pair of
expressions, describing their correlation, when actually it builds 2
independent statistics. Also, this error text isn't entirely accurate:

CREATE STATISTICS s ON col FROM tbl;
ERROR: extended statistics require at least 2 columns

because extended statistics don't always require 2 columns, they can
also just have an expression, or multiple expressions and 0 or 1
columns.

I think a lot of this stems from treating "expressions" in the same
way as the other (multi-column) stats kinds, and it might actually be
neater to have separate documented syntaxes for single- and
multi-column statistics:

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
ON (expression)
FROM table_name

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
[ ( statistics_kind [, ... ] ) ]
ON { column_name | (expression) } , { column_name | (expression) } [, ...]
FROM table_name

The first syntax would create single-column stats, and wouldn't accept
a statistics_kind argument, because there is only one kind of
single-column statistic. Maybe that might change in the future, but if
so, it's likely that the kinds of single-column stats will be
different from the kinds of multi-column stats.

In the second syntax, the only accepted kinds would be the current
multi-column stats kinds (ndistinct, dependencies, and mcv), and it
would always build stats describing the correlations between the
columns listed. It would continue to build standard/expression stats
on any expressions in the list, but that's more of an implementation
detail.

It would no longer be possible to do "CREATE STATISTICS s
(expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
issue 2 separate "CREATE STATISTICS" commands, but that seems more
logical, because they're independent stats.

The parsing code might not change much, but some of the errors would
be different. For example, the errors "building only extended
expression statistics on simple columns not allowed" and "extended
expression statistics require at least one expression" would go away,
and the error "extended statistics require at least 2 columns" might
become more specific, depending on the stats kind.

I think it makes sense in general. I see two issues with this approach,
though:

* By adding expression/standard stats for individual statistics, it
makes the list of statistics longer - I wonder if this might have
measurable impact on lookups in this list.

* I'm not sure it's a good idea that the second syntax would always
build the per-expression stats. Firstly, it seems a bit strange that it
behaves differently than the other kinds. Secondly, I wonder if there
are cases where it'd be desirable to explicitly disable building these
per-expression stats. For example, what if we have multiple extended
statistics objects, overlapping on a couple expressions. It seems
pointless to build the stats for all of them.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#11)
Re: PoC/WIP: Extended statistics on expressions

On Mon, 7 Dec 2020 at 14:15, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 12/7/20 10:56 AM, Dean Rasheed wrote:

it might actually be
neater to have separate documented syntaxes for single- and
multi-column statistics:

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
ON (expression)
FROM table_name

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
[ ( statistics_kind [, ... ] ) ]
ON { column_name | (expression) } , { column_name | (expression) } [, ...]
FROM table_name

I think it makes sense in general. I see two issues with this approach,
though:

* By adding expression/standard stats for individual statistics, it
makes the list of statistics longer - I wonder if this might have
measurable impact on lookups in this list.

* I'm not sure it's a good idea that the second syntax would always
build the per-expression stats. Firstly, it seems a bit strange that it
behaves differently than the other kinds. Secondly, I wonder if there
are cases where it'd be desirable to explicitly disable building these
per-expression stats. For example, what if we have multiple extended
statistics objects, overlapping on a couple expressions. It seems
pointless to build the stats for all of them.

Hmm, I'm not sure it would really be a good idea to build MCV stats on
expressions without also building the standard stats for those
expressions, otherwise the assumptions that
mcv_combine_selectivities() makes about simple_sel and mcv_basesel
wouldn't really hold. But then, if multiple MCV stats shared the same
expression, it would be quite wasteful to build standard stats on the
expression more than once.

It feels like it should build a single extended stats object for each
unique expression, with appropriate dependencies for any MCV stats
that used those expressions, but I'm not sure how complex that would
be. Dropping the last MCV stat object using a standard expression stat
object might reasonably drop the expression stats ... except if they
were explicitly created by the user, independently of any MCV stats.
That could get quite messy.

Regards,
Dean

#13Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#12)
Re: PoC/WIP: Extended statistics on expressions

On 12/7/20 5:02 PM, Dean Rasheed wrote:

On Mon, 7 Dec 2020 at 14:15, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 12/7/20 10:56 AM, Dean Rasheed wrote:

it might actually be
neater to have separate documented syntaxes for single- and
multi-column statistics:

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
ON (expression)
FROM table_name

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
[ ( statistics_kind [, ... ] ) ]
ON { column_name | (expression) } , { column_name | (expression) } [, ...]
FROM table_name

I think it makes sense in general. I see two issues with this approach,
though:

* By adding expression/standard stats for individual statistics, it
makes the list of statistics longer - I wonder if this might have
measurable impact on lookups in this list.

* I'm not sure it's a good idea that the second syntax would always
build the per-expression stats. Firstly, it seems a bit strange that it
behaves differently than the other kinds. Secondly, I wonder if there
are cases where it'd be desirable to explicitly disable building these
per-expression stats. For example, what if we have multiple extended
statistics objects, overlapping on a couple expressions. It seems
pointless to build the stats for all of them.

Hmm, I'm not sure it would really be a good idea to build MCV stats on
expressions without also building the standard stats for those
expressions, otherwise the assumptions that
mcv_combine_selectivities() makes about simple_sel and mcv_basesel
wouldn't really hold. But then, if multiple MCV stats shared the same
expression, it would be quite wasteful to build standard stats on the
expression more than once.

Yeah. You're right it'd be problematic to build MCV on expressions
without having the per-expression stats. In fact, that's exactly the
problem what forced me to add the per-expression stats to this patch.
Originally I planned to address it in a later patch, but I had to move
it forward.

So I think you're right we need to ensure we have standard stats for
each expression at least once, to make this work well.

It feels like it should build a single extended stats object for each
unique expression, with appropriate dependencies for any MCV stats
that used those expressions, but I'm not sure how complex that would
be. Dropping the last MCV stat object using a standard expression stat
object might reasonably drop the expression stats ... except if they
were explicitly created by the user, independently of any MCV stats.
That could get quite messy.

Possibly. But I don't think it's worth the extra complexity. I don't
expect people to have a lot of overlapping stats, so the amount of
wasted space and CPU time is expected to be fairly limited.

So I don't think it's worth spending too much time on this now. Let's
just do what you proposed, and revisit this later if needed.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#13)
Re: PoC/WIP: Extended statistics on expressions

On Tue, 8 Dec 2020 at 12:44, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Possibly. But I don't think it's worth the extra complexity. I don't
expect people to have a lot of overlapping stats, so the amount of
wasted space and CPU time is expected to be fairly limited.

So I don't think it's worth spending too much time on this now. Let's
just do what you proposed, and revisit this later if needed.

Yes, I think that's a reasonable approach to take. As long as the
documentation makes it clear that building MCV stats also causes
standard expression stats to be built on any expressions included in
the list, then the user will know and can avoid duplication most of
the time. I don't think there's any need for code to try to prevent
that -- just as we don't bother with code to prevent a user building
multiple indexes on the same column.

The only case where duplication won't be avoidable is where there are
multiple MCV stats sharing the same expression, but that's probably
quite unlikely in practice, and it seems acceptable to leave improving
that as a possible future optimisation.

Regards,
Dean

#15Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#14)
Re: PoC/WIP: Extended statistics on expressions

On 12/11/20 1:58 PM, Dean Rasheed wrote:

On Tue, 8 Dec 2020 at 12:44, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Possibly. But I don't think it's worth the extra complexity. I don't
expect people to have a lot of overlapping stats, so the amount of
wasted space and CPU time is expected to be fairly limited.

So I don't think it's worth spending too much time on this now. Let's
just do what you proposed, and revisit this later if needed.

Yes, I think that's a reasonable approach to take. As long as the
documentation makes it clear that building MCV stats also causes
standard expression stats to be built on any expressions included in
the list, then the user will know and can avoid duplication most of
the time. I don't think there's any need for code to try to prevent
that -- just as we don't bother with code to prevent a user building
multiple indexes on the same column.

The only case where duplication won't be avoidable is where there are
multiple MCV stats sharing the same expression, but that's probably
quite unlikely in practice, and it seems acceptable to leave improving
that as a possible future optimisation.

OK. Attached is an updated version, reworking it this way.

I tried tweaking the grammar to differentiate these two syntax variants,
but that led to shift/reduce conflicts with the existing ones. I tried
fixing that, but I ended up doing that in CreateStatistics().

The other thing is that we probably can't tie this to just MCV, because
functional dependencies need the per-expression stats too. So I simply
build expression stats whenever there's at least one expression.

I also decided to keep the "expressions" statistics kind - it's not
allowed to specify it in CREATE STATISTICS, but it's useful internally
as it allows deciding whether to build the stats in a single place.
Otherwise we'd need to do that every time we build the statistics, etc.

I added a brief explanation to the sgml docs, not sure if that's good
enough - maybe it needs more details.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-bootstrap-convert-Typ-to-a-List-20201211.patchtext/x-patch; charset=UTF-8; name=0001-bootstrap-convert-Typ-to-a-List-20201211.patchDownload+31-39
0002-Allow-composite-types-in-bootstrap-20201211.patchtext/x-patch; charset=UTF-8; name=0002-Allow-composite-types-in-bootstrap-20201211.patchDownload+28-1
0003-Extended-statistics-on-expressions-20201211.patchtext/x-patch; charset=UTF-8; name=0003-Extended-statistics-on-expressions-20201211.patchDownload+4805-353
#16Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#15)
Re: PoC/WIP: Extended statistics on expressions

On Fri, 11 Dec 2020 at 20:17, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

OK. Attached is an updated version, reworking it this way.

Cool. I think this is an exciting development, so I hope it makes it
into the next release.

I have started looking at it. So far I have only looked at the
catalog, parser and client changes, but I thought it's worth posting
my comments so far.

I tried tweaking the grammar to differentiate these two syntax variants,
but that led to shift/reduce conflicts with the existing ones. I tried
fixing that, but I ended up doing that in CreateStatistics().

Yeah, that makes sense. I wasn't expecting the grammar to change.

The other thing is that we probably can't tie this to just MCV, because
functional dependencies need the per-expression stats too. So I simply
build expression stats whenever there's at least one expression.

Makes sense.

I also decided to keep the "expressions" statistics kind - it's not
allowed to specify it in CREATE STATISTICS, but it's useful internally
as it allows deciding whether to build the stats in a single place.
Otherwise we'd need to do that every time we build the statistics, etc.

Yes, I thought that would be the easiest way to do it. Essentially the
"expressions" stats kind is an internal implementation detail, hidden
from the user, because it's built automatically when required, so you
don't need to (and can't) explicitly ask for it. This new behaviour
seems much more logical to me.

I added a brief explanation to the sgml docs, not sure if that's good
enough - maybe it needs more details.

Yes, I think that could use a little tidying up, but I haven't looked
too closely yet.

Some other comments:

* I'm not sure I understand the need for 0001. Wasn't there an earlier
version of this patch that just did it by re-populating the type
array, but which still had it as an array rather than turning it into
a list? Making it a list falsifies some of the comments and
function/variable name choices in that file.

* There's a comment typo in catalog/Makefile -- "are are reputedly
other...", should be "there are reputedly other...".

* Looking at the pg_stats_ext view, I think perhaps expressions stats
should be omitted entirely from that view, since it doesn't show any
useful information about them. So it could remove "e" from the "kinds"
array, and exclude rows whose only kind is "e", since such rows have
no interesting data in them. Essentially, the new view
pg_stats_ext_exprs makes having any expression stats in pg_stats_ext
redundant. Hiding this data in pg_stats_ext would also be consistent
with making the "expressions" stats kind hidden from the user.

* In gram.y, it wasn't quite obvious why you converted the column list
for CREATE STATISTICS from an expr_list to a stats_params list. I
figured it out, and it makes sense, but I think it could use a
comment, perhaps something along the lines of the one for index_elem,
e.g.:

/*
* Statistics attributes can be either simple column references, or arbitrary
* expressions in parens. For compatibility with index attributes permitted
* in CREATE INDEX, we allow an expression that's just a function call to be
* written without parens.
*/

* In parse_func.c and parse_agg.c, there are a few new error strings
that use the abbreviation "stats expressions", whereas most other
errors refer to "statistics expressions". For consistency, I think
they should all be the latter.

* In generateClonedExtStatsStmt(), I think the "expressions" stats
kind needs to be explicitly excluded, otherwise CREATE TABLE (LIKE
...) fails if the source table has expression stats.

* CreateStatistics() uses ShareUpdateExclusiveLock, but in
tcop/utility.c the relation is opened with a ShareLock. ISTM that the
latter lock mode should be made to match CreateStatistics().

* Why does the new code in tcop/utility.c not use
RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation?
That seems preferable to doing the ACL check in CreateStatistics().
For one thing, as it stands, it allows the lock to be taken even if
the user doesn't own the table. Is it intentional that the current
code allows extended stats to be created on system catalogs? That
would be one thing that using RangeVarCallbackOwnsRelation would
change, but I can't see a reason to allow it.

* In src/bin/psql/describe.c, I think the \d output should also
exclude the "expressions" stats kind and just list the other kinds (or
have no kinds list at all, if there are no other kinds), to make it
consistent with the CREATE STATISTICS syntax.

* The pg_dump output for a stats object whose only kind is
"expressions" is broken -- it includes a spurious "()" for the kinds
list.

That's it for now. I'll look at the optimiser changes next, and try to
post more comments later this week.

Regards,
Dean

#17Justin Pryzby
pryzby@telsasoft.com
In reply to: Dean Rasheed (#16)
Re: PoC/WIP: Extended statistics on expressions

On Mon, Jan 04, 2021 at 03:34:08PM +0000, Dean Rasheed wrote:

* I'm not sure I understand the need for 0001. Wasn't there an earlier
version of this patch that just did it by re-populating the type
array, but which still had it as an array rather than turning it into
a list? Making it a list falsifies some of the comments and
function/variable name choices in that file.

This part is from me.

I can review the names if it's desired , but it'd be fine to fall back to the
earlier patch. I thought a pglist was cleaner, but it's not needed.

--
Justin

#18Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#16)
Re: PoC/WIP: Extended statistics on expressions

On 1/4/21 4:34 PM, Dean Rasheed wrote:

...

Some other comments:

* I'm not sure I understand the need for 0001. Wasn't there an earlier
version of this patch that just did it by re-populating the type
array, but which still had it as an array rather than turning it into
a list? Making it a list falsifies some of the comments and
function/variable name choices in that file.

That's a bit done to Justin - I think it's fine to use the older version
repopulating the type array, but that question is somewhat unrelated to
this patch.

* There's a comment typo in catalog/Makefile -- "are are reputedly
other...", should be "there are reputedly other...".

* Looking at the pg_stats_ext view, I think perhaps expressions stats
should be omitted entirely from that view, since it doesn't show any
useful information about them. So it could remove "e" from the "kinds"
array, and exclude rows whose only kind is "e", since such rows have
no interesting data in them. Essentially, the new view
pg_stats_ext_exprs makes having any expression stats in pg_stats_ext
redundant. Hiding this data in pg_stats_ext would also be consistent
with making the "expressions" stats kind hidden from the user.

Hmmm, not sure. I'm not sure removing 'e' from the array is a good idea.
On the one hand it's internal detail, on the other hand most of that
view is internal detail too. Excluding rows with only 'e' seems
reasonable, though. I need to think about this.

* In gram.y, it wasn't quite obvious why you converted the column list
for CREATE STATISTICS from an expr_list to a stats_params list. I
figured it out, and it makes sense, but I think it could use a
comment, perhaps something along the lines of the one for index_elem,
e.g.:

/*
* Statistics attributes can be either simple column references, or arbitrary
* expressions in parens. For compatibility with index attributes permitted
* in CREATE INDEX, we allow an expression that's just a function call to be
* written without parens.
*/

OH, right. I'd have trouble figuring this myself, and I wrote that code
myself only one or two months ago.

* In parse_func.c and parse_agg.c, there are a few new error strings
that use the abbreviation "stats expressions", whereas most other
errors refer to "statistics expressions". For consistency, I think
they should all be the latter.

OK, will fix.

* In generateClonedExtStatsStmt(), I think the "expressions" stats
kind needs to be explicitly excluded, otherwise CREATE TABLE (LIKE
...) fails if the source table has expression stats.

Yeah, will fix. I guess this also means we're missing some tests.

* CreateStatistics() uses ShareUpdateExclusiveLock, but in
tcop/utility.c the relation is opened with a ShareLock. ISTM that the
latter lock mode should be made to match CreateStatistics().

Not sure, will check.

* Why does the new code in tcop/utility.c not use
RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation?
That seems preferable to doing the ACL check in CreateStatistics().
For one thing, as it stands, it allows the lock to be taken even if
the user doesn't own the table. Is it intentional that the current
code allows extended stats to be created on system catalogs? That
would be one thing that using RangeVarCallbackOwnsRelation would
change, but I can't see a reason to allow it.

I think I copied the code from somewhere - probably expression indexes,
or something like that. Not a proof that it's the right/better way to do
this, though.

* In src/bin/psql/describe.c, I think the \d output should also
exclude the "expressions" stats kind and just list the other kinds (or
have no kinds list at all, if there are no other kinds), to make it
consistent with the CREATE STATISTICS syntax.

Not sure I understand. Why would this make it consistent with CREATE
STATISTICS? Can you elaborate?

* The pg_dump output for a stats object whose only kind is
"expressions" is broken -- it includes a spurious "()" for the kinds
list.

Will fix. Again, this suggests there are TAP tests missing.

That's it for now. I'll look at the optimiser changes next, and try to
post more comments later this week.

Thanks!

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#18)
Re: PoC/WIP: Extended statistics on expressions

On Tue, 5 Jan 2021 at 00:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 1/4/21 4:34 PM, Dean Rasheed wrote:

* In src/bin/psql/describe.c, I think the \d output should also
exclude the "expressions" stats kind and just list the other kinds (or
have no kinds list at all, if there are no other kinds), to make it
consistent with the CREATE STATISTICS syntax.

Not sure I understand. Why would this make it consistent with CREATE
STATISTICS? Can you elaborate?

This isn't absolutely essential, but I think it would be neater. For
example, if I have a table with stats like this:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS foo_s_ab (mcv) ON a,b FROM foo;

then the \d output is as follows:

\d foo
Table "public.foo"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
a | integer | | |
b | integer | | |
Statistics objects:
"public"."foo_s_ab" (mcv) ON a, b FROM foo

and the stats line matches the DDL used to create the stats. It could,
for example, be copy-pasted and tweaked to create similar stats on
another table, but even if that's not very likely, it's neat that it
reflects how the stats were created.

OTOH, if there are expressions in the list, it produces something like this:

Table "public.foo"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
a | integer | | |
b | integer | | |
Statistics objects:
"public"."foo_s_ab" (mcv, expressions) ON a, b, ((a * b)) FROM foo

which no longer matches the DDL used, and isn't part of an accepted
syntax, so seems a bit inconsistent.

In general, if we're making the "expressions" kind an internal
implementation detail that just gets built automatically when needed,
then I think we should hide it from this sort of output, so the list
of kinds matches the list that the user used when the stats were
created.

Regards,
Dean

#20Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#19)
Re: PoC/WIP: Extended statistics on expressions

On 1/5/21 3:10 PM, Dean Rasheed wrote:

On Tue, 5 Jan 2021 at 00:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 1/4/21 4:34 PM, Dean Rasheed wrote:

* In src/bin/psql/describe.c, I think the \d output should also
exclude the "expressions" stats kind and just list the other kinds (or
have no kinds list at all, if there are no other kinds), to make it
consistent with the CREATE STATISTICS syntax.

Not sure I understand. Why would this make it consistent with CREATE
STATISTICS? Can you elaborate?

This isn't absolutely essential, but I think it would be neater. For
example, if I have a table with stats like this:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS foo_s_ab (mcv) ON a,b FROM foo;

then the \d output is as follows:

\d foo
Table "public.foo"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
a | integer | | |
b | integer | | |
Statistics objects:
"public"."foo_s_ab" (mcv) ON a, b FROM foo

and the stats line matches the DDL used to create the stats. It could,
for example, be copy-pasted and tweaked to create similar stats on
another table, but even if that's not very likely, it's neat that it
reflects how the stats were created.

OTOH, if there are expressions in the list, it produces something like this:

Table "public.foo"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
a | integer | | |
b | integer | | |
Statistics objects:
"public"."foo_s_ab" (mcv, expressions) ON a, b, ((a * b)) FROM foo

which no longer matches the DDL used, and isn't part of an accepted
syntax, so seems a bit inconsistent.

In general, if we're making the "expressions" kind an internal
implementation detail that just gets built automatically when needed,
then I think we should hide it from this sort of output, so the list
of kinds matches the list that the user used when the stats were
created.

Hmm, I see. You're probably right it's not necessary to show this, given
the modified handling of expression stats (which makes them an internal
detail, not exposed to users). I'll tweak this.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#20)
#22Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Dean Rasheed (#21)
#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#22)
#24Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#23)
#25Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#24)
#26Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#25)
#27Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#26)
#28Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#27)
#29Zhihong Yu
zyu@yugabyte.com
In reply to: Tomas Vondra (#28)
#30Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#28)
#31Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Zhihong Yu (#29)
#32Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#30)
#33Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#28)
#34Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Justin Pryzby (#33)
#35Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#34)
#36Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#35)
#37Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#36)
#38Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#37)
#39Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#38)
#40Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#39)
#41Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#40)
#42Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Justin Pryzby (#41)
#43Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#42)
#44Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#40)
#45Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#39)
#46Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#45)
#47Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#46)
#48Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#47)
#49Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#17)
#50Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#49)
#51Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#48)
#52Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#51)
#53Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#52)
#54Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#53)
#55Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#54)
#56Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#55)
#57Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#56)
#58Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Dean Rasheed (#57)
#59Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#58)
#60Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#59)
#61Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#60)
#62Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#61)
#63Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#61)
#64Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#63)
#65Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#64)
#66Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#65)
#67Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#66)
#68Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#67)
#69Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#68)
#70Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#62)
#71Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#62)
#72Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#69)
#73Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#72)
#74Justin Pryzby
pryzby@telsasoft.com
In reply to: Dean Rasheed (#73)
#75Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Justin Pryzby (#74)
#76Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#72)
#77Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#76)
#78Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#77)
#79Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#76)
#80Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#76)
#81Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#76)
#82Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#81)
#83Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#82)
#84Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#83)
#85Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#84)
#86Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#85)
#87Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#85)
#88Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#44)
#89Noah Misch
noah@leadboat.com
In reply to: Tomas Vondra (#85)
#90Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Noah Misch (#89)
#91Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#90)
#92Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#91)
#93Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#92)
#94Noah Misch
noah@leadboat.com
In reply to: Tomas Vondra (#90)
#95Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Noah Misch (#94)
#96Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#44)
#97Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#11)
#98Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#96)
#99Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#97)
#100Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#99)
#101Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#100)
#102Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#99)
#103Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#102)
#104Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#103)
#105Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#104)
#106Justin Pryzby
pryzby@telsasoft.com
In reply to: Tomas Vondra (#103)
#107Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Justin Pryzby (#106)