Allow to collect statistics on virtual generated columns

Started by Yugo Nagataabout 1 year ago25 messageshackers

nagata@sraoss.co.jp

about 1 year ago

Hi hackers,

Now we can create a table with a virtual generated column, but
when a condition in WHERE clause contains virtual generated column,
estimated rows are not correct since no statistics on this is
collectef.

[Ex.1]

test=# CREATE TABLE t (i int, v int GENERATED ALWAYS AS (i+1) VIRTUAL);
CREATE TABLE

test=# INSERT INTO t SELECT generate_series(1,1000);
INSERT 0 1000

test=# INSERT INTO t SELECT 1 FROM generate_series(1,1000);
INSERT 0 1000

test=# EXPLAIN ANALYZE SELECT * FROM t WHERE v = 2;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..36.02 rows=9 width=8) (actual time=0.093..3.059 rows=1001.00 loops=1)
Filter: ((i + 1) = 2)
Rows Removed by Filter: 999
Buffers: shared hit=9
Planning:
Buffers: shared hit=26
Planning Time: 1.142 ms
Execution Time: 3.434 ms
(8 rows)

Therefore, I would like to allow to collect statistics on virtual enerated columns.

I think there are at least three approaches for this.

(1) Allow the normal ANALYZE to collect statistics on virtual generated columns

ANALYZE expands virtual generated columns' expression, and collects statistics
on evaluated values. In this approach, the statistics on virtual generated columns
are collected in default, but ANALYZE on the table would become a bit expensive.

(2) Allow to create an index on virtual generated column

This is proposed in [1]/messages/by-id/CACJufxGao-cypdNhifHAdt8jHfK6-HX=tRBovBkgRuxw063GaA@mail.gmail.com. This proposal itself would be useful, I believe it is better
to provide a way to collect statistics without cost of creating an index.

[1]: /messages/by-id/CACJufxGao-cypdNhifHAdt8jHfK6-HX=tRBovBkgRuxw063GaA@mail.gmail.com

(3) Allow to create extended statistics on virtual generated columns

In this approach, ANALYZE processes virtual generated columns only if corresponding
extended statistics are defined. Although we can create extended statistics on
expressions of virtual generated columns even in the current implementation, this enables
that users to create a useful statistics this just by specifying a column name without
specifying complex expression.

I can also think of two variations for this approach.

(3a)
At the timing when an extended statistics is created, virtual generated columns are
expanded, and the statistics is defined on this expression.

(3b)
At the timing when an extended statistics is created, virtual generated columns are
NOT expanded. The statistics is defined on the virtual generated column itself and,
the expression is expanded when ANALYZE processes the extended statistics.

I've attached a draft patch based on (3a). However, if it is possible we could change
the definition of generated columns in future (as proposed in [2]/messages/by-id/CACJufxH3VETr7orF5rW29GnDk3n1wWbOE3WdkHYd3iPGrQ9E_A@mail.gmail.com), (3b) might be preferred.

[2]: /messages/by-id/CACJufxH3VETr7orF5rW29GnDk3n1wWbOE3WdkHYd3iPGrQ9E_A@mail.gmail.com

Here is an example of how the patch works.

[Ex.2]

test=# CREATE STATISTICS exstat ON v FROM t;
CREATE STATISTICS
test=# ANALYZE t;
ANALYZE
test=# EXPLAIN ANALYZE SELECT * FROM t WHERE v = 2;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..41.50 rows=1001 width=8) (actual time=0.067..2.422 rows=1001.00 loops=1)
Filter: ((i + 1) = 2)
Rows Removed by Filter: 999
Buffers: shared hit=9
Planning:
Buffers: shared hit=14
Planning Time: 0.785 ms
Execution Time: 2.744 ms
(8 rows)

What do you think of this? Which approach of (1), (3a), or (3b) is good?
Or, completely different approach is better?
With your feedback, I would like to progress or rework the patch.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>

Andres Freund

andres@anarazel.de

12 months ago

In reply to: Yugo Nagata (#1)

Re: Allow to collect statistics on virtual generated columns

Hi,

On 2025-04-22 18:10:06 +0900, Yugo Nagata wrote:

With your feedback, I would like to progress or rework the patch.

Right now the tests seem to always fail:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F5714

Fails e.g. with:
https://api.cirrus-ci.com/v1/artifact/task/5921189782093824/testrun/build/testrun/regress/regress/regression.diffs

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/stats_ext.out /tmp/cirrus-ci-build/build/testrun/regress/regress/results/stats_ext.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/stats_ext.out	2025-05-26 00:59:01.813042000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress/regress/results/stats_ext.out	2025-05-26 01:02:20.350387000 +0000
@@ -56,7 +56,6 @@
 ERROR:  unrecognized statistics kind "unrecognized"
 -- incorrect expressions
 CREATE STATISTICS tst ON (y) FROM ext_stats_test; -- single column reference
-ERROR:  extended statistics require at least 2 columns
 CREATE STATISTICS tst ON y + z FROM ext_stats_test; -- missing parentheses
 ERROR:  syntax error at or near "+"
 LINE 1: CREATE STATISTICS tst ON y + z FROM ext_stats_test;
@@ -69,25 +68,24 @@
 -- statistics on virtual generated column not allowed
 CREATE TABLE ext_stats_test1 (x int, y int, z int GENERATED ALWAYS AS (x+y) VIRTUAL, w xid);
 CREATE STATISTICS tst on z from ext_stats_test1;
-ERROR:  statistics creation on virtual generated columns is not supported
 CREATE STATISTICS tst on (z) from ext_stats_test1;
-ERROR:  statistics creation on virtual generated columns is not supported
+ERROR:  statistics object "tst" already exists
 CREATE STATISTICS tst on (z+1) from ext_stats_test1;
-ERROR:  statistics creation on virtual generated columns is not supported
+ERROR:  statistics object "tst" already exists
 CREATE STATISTICS tst (ndistinct) ON z from ext_stats_test1;
-ERROR:  statistics creation on virtual generated columns is not supported
+ERROR:  statistics object "tst" already exists
 -- statistics on system column not allowed
 CREATE STATISTICS tst on tableoid from ext_stats_test1;
-ERROR:  statistics creation on system columns is not supported
+ERROR:  statistics object "tst" already exists
 CREATE STATISTICS tst on (tableoid) from ext_stats_test1;
-ERROR:  statistics creation on system columns is not supported
+ERROR:  statistics object "tst" already exists
 CREATE STATISTICS tst on (tableoid::int+1) from ext_stats_test1;
-ERROR:  statistics creation on system columns is not supported
+ERROR:  statistics object "tst" already exists
 CREATE STATISTICS tst (ndistinct) ON xmin from ext_stats_test1;
-ERROR:  statistics creation on system columns is not supported
+ERROR:  statistics object "tst" already exists
 -- statistics without a less-than operator not supported
 CREATE STATISTICS tst (ndistinct) ON w from ext_stats_test1;
-ERROR:  column "w" cannot be used in statistics because its type xid has no default btree operator class
+ERROR:  statistics object "tst" already exists
 DROP TABLE ext_stats_test1;
 -- Ensure stats are dropped sanely, and test IF NOT EXISTS while at it
 CREATE TABLE ab1 (a INTEGER, b INTEGER, c INTEGER);

Greetings,

Andres

Yugo Nagata

nagata@sraoss.co.jp

11 months ago

In reply to: Andres Freund (#2)

Re: Allow to collect statistics on virtual generated columns

On Tue, 17 Jun 2025 10:43:41 -0400
Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2025-04-22 18:10:06 +0900, Yugo Nagata wrote:

With your feedback, I would like to progress or rework the patch.

Right now the tests seem to always fail:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F571

Thank you for letting me know it.

I've attached an updated patch to fix the test failure.

However, I'm now reconsidering the current approach, where the expression
of a virtual generated column is expanded at the time of creating extended
statistics. This seems not be ideal, as the statistics would become useless
if the expression is later modified.

Instead, I'm thinking of an alternative approach: expanding the expression
at the time statistics are collected.

Best regards,
Yugo Nagata

Fails e.g. with:
https://api.cirrus-ci.com/v1/artifact/task/5921189782093824/testrun/build/testrun/regress/regress/regression.diffs

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/stats_ext.out /tmp/cirrus-ci-build/build/testrun/regress/regress/results/stats_ext.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/stats_ext.out	2025-05-26 00:59:01.813042000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress/regress/results/stats_ext.out	2025-05-26 01:02:20.350387000 +0000
@@ -56,7 +56,6 @@
ERROR:  unrecognized statistics kind "unrecognized"
-- incorrect expressions
CREATE STATISTICS tst ON (y) FROM ext_stats_test; -- single column reference
-ERROR:  extended statistics require at least 2 columns
CREATE STATISTICS tst ON y + z FROM ext_stats_test; -- missing parentheses
ERROR:  syntax error at or near "+"
LINE 1: CREATE STATISTICS tst ON y + z FROM ext_stats_test;
@@ -69,25 +68,24 @@
-- statistics on virtual generated column not allowed
CREATE TABLE ext_stats_test1 (x int, y int, z int GENERATED ALWAYS AS (x+y) VIRTUAL, w xid);
CREATE STATISTICS tst on z from ext_stats_test1;
-ERROR:  statistics creation on virtual generated columns is not supported
CREATE STATISTICS tst on (z) from ext_stats_test1;
-ERROR:  statistics creation on virtual generated columns is not supported
+ERROR:  statistics object "tst" already exists
CREATE STATISTICS tst on (z+1) from ext_stats_test1;
-ERROR:  statistics creation on virtual generated columns is not supported
+ERROR:  statistics object "tst" already exists
CREATE STATISTICS tst (ndistinct) ON z from ext_stats_test1;
-ERROR:  statistics creation on virtual generated columns is not supported
+ERROR:  statistics object "tst" already exists
-- statistics on system column not allowed
CREATE STATISTICS tst on tableoid from ext_stats_test1;
-ERROR:  statistics creation on system columns is not supported
+ERROR:  statistics object "tst" already exists
CREATE STATISTICS tst on (tableoid) from ext_stats_test1;
-ERROR:  statistics creation on system columns is not supported
+ERROR:  statistics object "tst" already exists
CREATE STATISTICS tst on (tableoid::int+1) from ext_stats_test1;
-ERROR:  statistics creation on system columns is not supported
+ERROR:  statistics object "tst" already exists
CREATE STATISTICS tst (ndistinct) ON xmin from ext_stats_test1;
-ERROR:  statistics creation on system columns is not supported
+ERROR:  statistics object "tst" already exists
-- statistics without a less-than operator not supported
CREATE STATISTICS tst (ndistinct) ON w from ext_stats_test1;
-ERROR:  column "w" cannot be used in statistics because its type xid has no default btree operator class
+ERROR:  statistics object "tst" already exists
DROP TABLE ext_stats_test1;
-- Ensure stats are dropped sanely, and test IF NOT EXISTS while at it
CREATE TABLE ab1 (a INTEGER, b INTEGER, c INTEGER);

Greetings,

Andres

--
Yugo Nagata <nagata@sraoss.co.jp>

Yugo Nagata

nagata@sraoss.co.jp

10 months ago

In reply to: Yugo Nagata (#3)

Re: Allow to collect statistics on virtual generated columns

Hi,

On Tue, 24 Jun 2025 17:05:33 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

Instead, I'm thinking of an alternative approach: expanding the expression
at the time statistics are collected.

I've attached a new patch in this approache.

This allows to collect statistics on virtual generated columns.

During ANALYZE, generation expressions are expanded, and statistics are computed
using compute_expr_stats(). To support this, both compute_expr_stats() and AnlExprData
are now exported from extended_stats.c. However, since they are no longer specific
to extended statistics, it might be better to move them to analyze.c and vacuum.h.

To enable the optimizer to make use of these statistics, a new field named
virtual_gencols is added to RelOptInfo. This field holds the expressions of
virtual generated columns in the table. In examine_variable(), if an expression
in a WHERE clause matches a virtual generated column, the corresponding statistics
are used for that expression.

Example:

- Before applying the patch, the cardinality estimate is erroneous.

test=# create table t (i int, j int generated always as (i*10) virtual);
CREATE TABLE
test=# insert into t select generate_series(1,1000);
INSERT 0 1000
test=# insert into t select 1 from generate_series(1,1000);
INSERT 0 1000
test=# analyze t;
ANALYZE
test=# explain analyze select * from t where j = 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..0.02 rows=1 width=8) (actual time=0.031..0.806 rows=1001.00 loops=1)
Filter: ((i * 10) = 10)
Rows Removed by Filter: 999
Buffers: shared hit=9
Planning:
Buffers: shared hit=10
Planning Time: 0.299 ms
Execution Time: 0.948 ms
(8 rows)

- After applying the patch, the cardinality estimate is correct.

test=# analyze t;
ANALYZE
test=# explain analyze select * from t where j = 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..41.50 rows=1001 width=8) (actual time=0.034..0.871 rows=1001.00 loops=1)
Filter: ((i * 10) = 10)
Rows Removed by Filter: 999
Buffers: shared hit=9
Planning:
Buffers: shared hit=6
Planning Time: 0.374 ms
Execution Time: 1.028 ms
(8 rows)

Note that the patch is still a work in progress, so documentation and tests are not included.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>

Yugo Nagata

nagata@sraoss.co.jp

10 months ago

In reply to: Yugo Nagata (#4)

Re: Allow to collect statistics on virtual generated columns

On Fri, 1 Aug 2025 00:28:30 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

Hi,

On Tue, 24 Jun 2025 17:05:33 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

Instead, I'm thinking of an alternative approach: expanding the expression
at the time statistics are collected.

I've attached a new patch in this approache.

This allows to collect statistics on virtual generated columns.

During ANALYZE, generation expressions are expanded, and statistics are computed
using compute_expr_stats(). To support this, both compute_expr_stats() and AnlExprData
are now exported from extended_stats.c. However, since they are no longer specific
to extended statistics, it might be better to move them to analyze.c and vacuum.h.

To enable the optimizer to make use of these statistics, a new field named
virtual_gencols is added to RelOptInfo. This field holds the expressions of
virtual generated columns in the table. In examine_variable(), if an expression
in a WHERE clause matches a virtual generated column, the corresponding statistics
are used for that expression.

Example:

- Before applying the patch, the cardinality estimate is erroneous.

test=# create table t (i int, j int generated always as (i*10) virtual);
CREATE TABLE
test=# insert into t select generate_series(1,1000);
INSERT 0 1000
test=# insert into t select 1 from generate_series(1,1000);
INSERT 0 1000
test=# analyze t;
ANALYZE
test=# explain analyze select * from t where j = 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..0.02 rows=1 width=8) (actual time=0.031..0.806 rows=1001.00 loops=1)
Filter: ((i * 10) = 10)
Rows Removed by Filter: 999
Buffers: shared hit=9
Planning:
Buffers: shared hit=10
Planning Time: 0.299 ms
Execution Time: 0.948 ms
(8 rows)

- After applying the patch, the cardinality estimate is correct.

test=# analyze t;
ANALYZE
test=# explain analyze select * from t where j = 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..41.50 rows=1001 width=8) (actual time=0.034..0.871 rows=1001.00 loops=1)
Filter: ((i * 10) = 10)
Rows Removed by Filter: 999
Buffers: shared hit=9
Planning:
Buffers: shared hit=6
Planning Time: 0.374 ms
Execution Time: 1.028 ms
(8 rows)

Note that the patch is still a work in progress, so documentation and tests are not included.

I've attached an updated patch.

I modified the documentation to remove the statement that virtual generated columns
do not have statistics.

In addition, I added a test to ensure that statistics on virtual generated columns
are available.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>

Yugo Nagata

nagata@sraoss.co.jp

9 months ago

In reply to: Yugo Nagata (#5)

Re: Allow to collect statistics on virtual generated columns

On Fri, 8 Aug 2025 12:21:25 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

On Fri, 1 Aug 2025 00:28:30 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

Hi,

On Tue, 24 Jun 2025 17:05:33 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

Instead, I'm thinking of an alternative approach: expanding the expression
at the time statistics are collected.

I've attached a new patch in this approache.

This allows to collect statistics on virtual generated columns.

During ANALYZE, generation expressions are expanded, and statistics are computed
using compute_expr_stats(). To support this, both compute_expr_stats() and AnlExprData
are now exported from extended_stats.c. However, since they are no longer specific
to extended statistics, it might be better to move them to analyze.c and vacuum.h.

To enable the optimizer to make use of these statistics, a new field named
virtual_gencols is added to RelOptInfo. This field holds the expressions of
virtual generated columns in the table. In examine_variable(), if an expression
in a WHERE clause matches a virtual generated column, the corresponding statistics
are used for that expression.

Example:

- Before applying the patch, the cardinality estimate is erroneous.

test=# create table t (i int, j int generated always as (i*10) virtual);
CREATE TABLE
test=# insert into t select generate_series(1,1000);
INSERT 0 1000
test=# insert into t select 1 from generate_series(1,1000);
INSERT 0 1000
test=# analyze t;
ANALYZE
test=# explain analyze select * from t where j = 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..0.02 rows=1 width=8) (actual time=0.031..0.806 rows=1001.00 loops=1)
Filter: ((i * 10) = 10)
Rows Removed by Filter: 999
Buffers: shared hit=9
Planning:
Buffers: shared hit=10
Planning Time: 0.299 ms
Execution Time: 0.948 ms
(8 rows)

- After applying the patch, the cardinality estimate is correct.

test=# analyze t;
ANALYZE
test=# explain analyze select * from t where j = 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..41.50 rows=1001 width=8) (actual time=0.034..0.871 rows=1001.00 loops=1)
Filter: ((i * 10) = 10)
Rows Removed by Filter: 999
Buffers: shared hit=9
Planning:
Buffers: shared hit=6
Planning Time: 0.374 ms
Execution Time: 1.028 ms
(8 rows)

Note that the patch is still a work in progress, so documentation and tests are not included.

I've attached an updated patch.

I modified the documentation to remove the statement that virtual generated columns
do not have statistics.

In addition, I added a test to ensure that statistics on virtual generated columns
are available.

I've attached a rebased patch.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>

Yugo Nagata

nagata@sraoss.co.jp

9 months ago

In reply to: Yugo Nagata (#6)

Re: Allow to collect statistics on virtual generated columns

On Wed, 20 Aug 2025 14:10:28 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

On Fri, 8 Aug 2025 12:21:25 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

On Fri, 1 Aug 2025 00:28:30 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

Hi,

On Tue, 24 Jun 2025 17:05:33 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

Instead, I'm thinking of an alternative approach: expanding the expression
at the time statistics are collected.

I've attached a new patch in this approache.

This allows to collect statistics on virtual generated columns.

During ANALYZE, generation expressions are expanded, and statistics are computed
using compute_expr_stats(). To support this, both compute_expr_stats() and AnlExprData
are now exported from extended_stats.c. However, since they are no longer specific
to extended statistics, it might be better to move them to analyze.c and vacuum.h.

To enable the optimizer to make use of these statistics, a new field named
virtual_gencols is added to RelOptInfo. This field holds the expressions of
virtual generated columns in the table. In examine_variable(), if an expression
in a WHERE clause matches a virtual generated column, the corresponding statistics
are used for that expression.

Example:

- Before applying the patch, the cardinality estimate is erroneous.

test=# create table t (i int, j int generated always as (i*10) virtual);
CREATE TABLE
test=# insert into t select generate_series(1,1000);
INSERT 0 1000
test=# insert into t select 1 from generate_series(1,1000);
INSERT 0 1000
test=# analyze t;
ANALYZE
test=# explain analyze select * from t where j = 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..0.02 rows=1 width=8) (actual time=0.031..0.806 rows=1001.00 loops=1)
Filter: ((i * 10) = 10)
Rows Removed by Filter: 999
Buffers: shared hit=9
Planning:
Buffers: shared hit=10
Planning Time: 0.299 ms
Execution Time: 0.948 ms
(8 rows)

- After applying the patch, the cardinality estimate is correct.

test=# analyze t;
ANALYZE
test=# explain analyze select * from t where j = 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..41.50 rows=1001 width=8) (actual time=0.034..0.871 rows=1001.00 loops=1)
Filter: ((i * 10) = 10)
Rows Removed by Filter: 999
Buffers: shared hit=9
Planning:
Buffers: shared hit=6
Planning Time: 0.374 ms
Execution Time: 1.028 ms
(8 rows)

Note that the patch is still a work in progress, so documentation and tests are not included.

I've attached an updated patch.

I modified the documentation to remove the statement that virtual generated columns
do not have statistics.

In addition, I added a test to ensure that statistics on virtual generated columns
are available.

I've attached a rebased patch.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>

Yugo Nagata

nagata@sraoss.co.jp

5 months ago

In reply to: Yugo Nagata (#7)

Re: Allow to collect statistics on virtual generated columns

On Tue, 2 Sep 2025 16:33:41 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

This allows to collect statistics on virtual generated columns.

During ANALYZE, generation expressions are expanded, and statistics are computed
using compute_expr_stats(). To support this, both compute_expr_stats() and AnlExprData
are now exported from extended_stats.c. However, since they are no longer specific
to extended statistics, it might be better to move them to analyze.c and vacuum.h.

To enable the optimizer to make use of these statistics, a new field named
virtual_gencols is added to RelOptInfo. This field holds the expressions of
virtual generated columns in the table. In examine_variable(), if an expression
in a WHERE clause matches a virtual generated column, the corresponding statistics
are used for that expression.

Example:

- Before applying the patch, the cardinality estimate is erroneous.

test=# create table t (i int, j int generated always as (i*10) virtual);
CREATE TABLE
test=# insert into t select generate_series(1,1000);
INSERT 0 1000
test=# insert into t select 1 from generate_series(1,1000);
INSERT 0 1000
test=# analyze t;
ANALYZE
test=# explain analyze select * from t where j = 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..0.02 rows=1 width=8) (actual time=0.031..0.806 rows=1001.00 loops=1)
Filter: ((i * 10) = 10)
Rows Removed by Filter: 999
Buffers: shared hit=9
Planning:
Buffers: shared hit=10
Planning Time: 0.299 ms
Execution Time: 0.948 ms
(8 rows)

- After applying the patch, the cardinality estimate is correct.

test=# analyze t;
ANALYZE
test=# explain analyze select * from t where j = 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..41.50 rows=1001 width=8) (actual time=0.034..0.871 rows=1001.00 loops=1)
Filter: ((i * 10) = 10)
Rows Removed by Filter: 999
Buffers: shared hit=9
Planning:
Buffers: shared hit=6
Planning Time: 0.374 ms
Execution Time: 1.028 ms
(8 rows)

Note that the patch is still a work in progress, so documentation and tests are not included.

I've attached an updated patch.

I modified the documentation to remove the statement that virtual generated columns
do not have statistics.

In addition, I added a test to ensure that statistics on virtual generated columns
are available.

I've attached an updated patch that fixes the broken test since 10c4fe074a.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>

Dean Rasheed

dean.a.rasheed@gmail.com

2 months ago

In reply to: Yugo Nagata (#8)

Re: Allow to collect statistics on virtual generated columns

On Wed, 31 Dec 2025 at 10:45, Yugo Nagata <nagata@sraoss.co.jp> wrote:

I've attached an updated patch that fixes the broken test since 10c4fe074a.

I took a look at this, and the patch appears to work as intended.
However, I have my doubts as to whether this is the best approach.

Building stats on a virtual generated column is potentially quite
expensive, and not something that everyone will want, so I think this
really should be an optional feature that people can selectively
enable, if they want.

Therefore, I think that the previous approach was probably better. If
I'm understanding it correctly, that allowed

CREATE STATISTICS stat_name ON virt_col FROM tbl;

as well as allowing statistics to be built on expressions including
virtual generated columns, making it more flexible.

The problem with this previous approach was that it didn't work
correctly if a virtual generated column's expression was changed using
ALTER TABLE ... SET EXPRESSION. I think that could be solved by
expanding generated column expressions at ANALYZE time, rather than at
CREATE STATISTICS time. So then the expression stored in the catalogs
would be one referring to virtual generated columns, not their
expanded forms (compare a CHECK constraint referring to a virtual
generated column).

Regards,
Dean

#10

Yugo Nagata

nagata@sraoss.co.jp

2 months ago

In reply to: Dean Rasheed (#9)

Re: Allow to collect statistics on virtual generated columns

On Tue, 24 Mar 2026 09:18:06 +0000
Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

Thank you for your comments!

On Wed, 31 Dec 2025 at 10:45, Yugo Nagata <nagata@sraoss.co.jp> wrote:

I've attached an updated patch that fixes the broken test since 10c4fe074a.

I took a look at this, and the patch appears to work as intended.
However, I have my doubts as to whether this is the best approach.

Building stats on a virtual generated column is potentially quite
expensive, and not something that everyone will want, so I think this
really should be an optional feature that people can selectively
enable, if they want.

I agree that collecting statistics on a virtual generated column should
be optional. While it can be disabled or enabled with:

ALTER TABLE ... ALTER COLUMN ... SET STATISTICS ...

this is not very intuitive or discoverable for users, so a more explicit
interface might be preferable.

Therefore, I think that the previous approach was probably better. If
I'm understanding it correctly, that allowed

CREATE STATISTICS stat_name ON virt_col FROM tbl;

as well as allowing statistics to be built on expressions including
virtual generated columns, making it more flexible.

The problem with this previous approach was that it didn't work
correctly if a virtual generated column's expression was changed using
ALTER TABLE ... SET EXPRESSION. I think that could be solved by
expanding generated column expressions at ANALYZE time, rather than at
CREATE STATISTICS time.

Yes. I think this approach could work.

One concern is that CREATE STATISTICS does not allow creating extended
statistics on a single colmun. If we try this, it raises the following error:

ERROR: extended statistics require at least 2 columns

Therefore, if we take this approach, we would need to add an exception
to this rule for virtual columns.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>

#11

Dean Rasheed

dean.a.rasheed@gmail.com

2 months ago

In reply to: Yugo Nagata (#10)

Re: Allow to collect statistics on virtual generated columns

On Tue, 24 Mar 2026 at 10:57, Yugo Nagata <nagata@sraoss.co.jp> wrote:

One concern is that CREATE STATISTICS does not allow creating extended
statistics on a single colmun. If we try this, it raises the following error:

ERROR: extended statistics require at least 2 columns

Therefore, if we take this approach, we would need to add an exception
to this rule for virtual columns.

I think that would be fine. It makes sense because a virtual column
expands to an expression, and we already allow CREATE STATISTICS on a
single expression.

Regards,
Dean

#12

Yugo Nagata

nagata@sraoss.co.jp

2 months ago

In reply to: Dean Rasheed (#11)

Re: Allow to collect statistics on virtual generated columns

On Tue, 24 Mar 2026 11:43:47 +0000
Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On Tue, 24 Mar 2026 at 10:57, Yugo Nagata <nagata@sraoss.co.jp> wrote:

One concern is that CREATE STATISTICS does not allow creating extended
statistics on a single colmun. If we try this, it raises the following error:

ERROR: extended statistics require at least 2 columns

Therefore, if we take this approach, we would need to add an exception
to this rule for virtual columns.

I think that would be fine. It makes sense because a virtual column
expands to an expression, and we already allow CREATE STATISTICS on a
single expression.

I see, that makes sense. I'll update the patch accordingly.

Regards,
Yugo Nagata

Regards,
Dean

--
Yugo Nagata <nagata@sraoss.co.jp>

#13

Yugo Nagata

nagata@sraoss.co.jp

2 months ago

In reply to: Dean Rasheed (#11)

Re: Allow to collect statistics on virtual generated columns

On Tue, 24 Mar 2026 11:43:47 +0000
Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On Tue, 24 Mar 2026 at 10:57, Yugo Nagata <nagata@sraoss.co.jp> wrote:

One concern is that CREATE STATISTICS does not allow creating extended
statistics on a single colmun. If we try this, it raises the following error:

ERROR: extended statistics require at least 2 columns

Therefore, if we take this approach, we would need to add an exception
to this rule for virtual columns.

I think that would be fine. It makes sense because a virtual column
expands to an expression, and we already allow CREATE STATISTICS on a
single expression.

I've attached a patch updated along these lines.

This allows users to define extended statistics on virtual generated
columns. Expressions including such columns are stored in the catalog
as-is, and expanded at ANALYZE time. Extended statistics on a single
virtual generated column are also allowed, since it is treated as a
single expression.

To enable the optimizer to make use of these statistics, expressions
are also expanded at planning time.

Additional documentation and tests are needed, but are not yet included.

Is this approach better then the previous one?

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>

#14

Yugo Nagata

nagata@sraoss.co.jp

2 months ago

In reply to: Yugo Nagata (#13)

Re: Allow to collect statistics on virtual generated columns

On Wed, 25 Mar 2026 13:01:27 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

On Tue, 24 Mar 2026 11:43:47 +0000
Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On Tue, 24 Mar 2026 at 10:57, Yugo Nagata <nagata@sraoss.co.jp> wrote:

One concern is that CREATE STATISTICS does not allow creating extended
statistics on a single colmun. If we try this, it raises the following error:

ERROR: extended statistics require at least 2 columns

Therefore, if we take this approach, we would need to add an exception
to this rule for virtual columns.

I think that would be fine. It makes sense because a virtual column
expands to an expression, and we already allow CREATE STATISTICS on a
single expression.

I've attached a patch updated along these lines.

This allows users to define extended statistics on virtual generated
columns. Expressions including such columns are stored in the catalog
as-is, and expanded at ANALYZE time. Extended statistics on a single
virtual generated column are also allowed, since it is treated as a
single expression.

To enable the optimizer to make use of these statistics, expressions
are also expanded at planning time.

Additional documentation and tests are needed, but are not yet included.

Is this approach better then the previous one?

I've attached an updated patch including the documentation and tests.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>

#15

Dean Rasheed

dean.a.rasheed@gmail.com

2 months ago

In reply to: Yugo Nagata (#14)

Re: Allow to collect statistics on virtual generated columns

On Thu, 26 Mar 2026 at 15:09, Yugo Nagata <nagata@sraoss.co.jp> wrote:

I've attached an updated patch including the documentation and tests.

I think this is a much better approach.

One thing that stands out in CreateStatistics() is that the check for
a less-than operator can be skipped if there is just a single virtual
generated column, for the same reason as for statistics on a single
expression. I.e., it should be possible to build statistics on a
single virtual generated column of any data type.

Regards,
Dean

#16

Yugo Nagata

nagata@sraoss.co.jp

2 months ago

In reply to: Dean Rasheed (#15)

Re: Allow to collect statistics on virtual generated columns

On Thu, 26 Mar 2026 16:00:38 +0000
Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On Thu, 26 Mar 2026 at 15:09, Yugo Nagata <nagata@sraoss.co.jp> wrote:

I've attached an updated patch including the documentation and tests.

I think this is a much better approach.

One thing that stands out in CreateStatistics() is that the check for
a less-than operator can be skipped if there is just a single virtual
generated column, for the same reason as for statistics on a single
expression. I.e., it should be possible to build statistics on a
single virtual generated column of any data type.

I've attached a revised patch to skip the less-than operator check
for a single virtual generated column.

In fact, this change skips the check for any single column,
not just virtual generated columns. However, using a non-virtual
single column will result in an error elsewhere.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>

#17

Dean Rasheed

dean.a.rasheed@gmail.com

2 months ago

In reply to: Dean Rasheed (#15)

Re: Allow to collect statistics on virtual generated columns

On Thu, 26 Mar 2026 at 16:00, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On Thu, 26 Mar 2026 at 15:09, Yugo Nagata <nagata@sraoss.co.jp> wrote:

I've attached an updated patch including the documentation and tests.

Looking at get_relation_statistics(), I think that you need to call
expand_generated_columns_in_expr() *before* ChangeVarNodes() so that
Vars in the expanded expression end up with the correct varno.

This obviously affects queries with more than one table in the FROM
clause, e.g.:

drop table if exists foo;
create table foo (a int, b int generated always as (a*2) virtual);
insert into foo select x from generate_series(1,10) x;
insert into foo select 100 from generate_series(1,500);
create statistics s on b from foo;
analyse foo;
explain select * from foo f1, foo f2 where f1.b = 200 and f2.b = 200;

QUERY PLAN
-------------------------------------------------------------------
Nested Loop (cost=0.00..47.56 rows=1500 width=16)
-> Seq Scan on foo f1 (cost=0.00..10.65 rows=500 width=4)
Filter: ((a * 2) = 200)
-> Materialize (cost=0.00..10.66 rows=3 width=4)
-> Seq Scan on foo f2 (cost=0.00..10.65 rows=3 width=4)
Filter: ((a * 2) = 200)
(6 rows)

Regards,
Dean

#18

Yugo Nagata

nagata@sraoss.co.jp

2 months ago

In reply to: Dean Rasheed (#17)

Re: Allow to collect statistics on virtual generated columns

On Thu, 26 Mar 2026 17:25:23 +0000
Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On Thu, 26 Mar 2026 at 16:00, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On Thu, 26 Mar 2026 at 15:09, Yugo Nagata <nagata@sraoss.co.jp> wrote:

I've attached an updated patch including the documentation and tests.

Looking at get_relation_statistics(), I think that you need to call
expand_generated_columns_in_expr() *before* ChangeVarNodes() so that
Vars in the expanded expression end up with the correct varno.

Thank you for the review.

I've corrected the placement of expand_generated_columns_in_expr()
so that it is called before ChangeVarNodes(). A revised patch is attached.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>

#19

Dean Rasheed

dean.a.rasheed@gmail.com

2 months ago

In reply to: Yugo Nagata (#16)

Re: Allow to collect statistics on virtual generated columns

On Thu, 26 Mar 2026 at 17:18, Yugo Nagata <nagata@sraoss.co.jp> wrote:

On Thu, 26 Mar 2026 16:00:38 +0000
Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On Thu, 26 Mar 2026 at 15:09, Yugo Nagata <nagata@sraoss.co.jp> wrote:

I've attached an updated patch including the documentation and tests.

I think this is a much better approach.

One thing that stands out in CreateStatistics() is that the check for
a less-than operator can be skipped if there is just a single virtual
generated column, for the same reason as for statistics on a single
expression. I.e., it should be possible to build statistics on a
single virtual generated column of any data type.

I've attached a revised patch to skip the less-than operator check
for a single virtual generated column.

In fact, this change skips the check for any single column,
not just virtual generated columns. However, using a non-virtual
single column will result in an error elsewhere.

Makes sense. I think the comments could just be simplified to say
"Disallow data types without a less-than operator in multivariate
statistics", and add the word "multivariate" to the error message, as
in the expression case.

Regards,
Dean

#20

Yugo Nagata

nagata@sraoss.co.jp

2 months ago

In reply to: Dean Rasheed (#19)

Re: Allow to collect statistics on virtual generated columns

On Thu, 26 Mar 2026 17:44:01 +0000
Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On Thu, 26 Mar 2026 at 17:18, Yugo Nagata <nagata@sraoss.co.jp> wrote:

On Thu, 26 Mar 2026 16:00:38 +0000
Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On Thu, 26 Mar 2026 at 15:09, Yugo Nagata <nagata@sraoss.co.jp> wrote:

I've attached an updated patch including the documentation and tests.

I think this is a much better approach.

One thing that stands out in CreateStatistics() is that the check for
a less-than operator can be skipped if there is just a single virtual
generated column, for the same reason as for statistics on a single
expression. I.e., it should be possible to build statistics on a
single virtual generated column of any data type.

I've attached a revised patch to skip the less-than operator check
for a single virtual generated column.

In fact, this change skips the check for any single column,
not just virtual generated columns. However, using a non-virtual
single column will result in an error elsewhere.

Makes sense. I think the comments could just be simplified to say
"Disallow data types without a less-than operator in multivariate
statistics", and add the word "multivariate" to the error message, as
in the expression case.