Add starelid, attnum to pg_stats and leverage this in pg_dump
Per side conversation in [1]/messages/by-id/aZ3RbTE8Racscykc@nathan, this patch exposes pg_statistic.starelid in
the pg_stats view (0001) and then changes pg_dump to use the starelid in
the queries on pg_stats rather than the combination of schemaname+relname,
which gets a bit clumsy in bulk array fetching, and also leads to bad query
plans against pg_stats because the security barrier is also an optimization
barrier, which means that queries often try to hash join, which causes the
very large table pg_statistic to be sequentially scanned, and that's a bad
time. Currently we get around this by adding a redundant qual to the query
which gooses the plan towards an index, and that works fine for now, but
there's no guarantee that it will continue to work in the future, so this
change brings some plan safety as well.
0001 also exposes pg_statistic.attnum. This is no direct application of
this in 0002, but people have often expressed surprise that pg_dump orders
the pg_restore_attribute_stats() calls by attname rather than attnum, and
if we wanted to change that now we could (albeit only on new versions).
Attachments:
v1-0001-Add-attnum-and-starelid-columns-to-pg_stats.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Add-attnum-and-starelid-columns-to-pg_stats.patchDownload+338-100
v1-0002-pg_dump-Use-starelid-in-getAttributeStats.patchtext/x-patch; charset=US-ASCII; name=v1-0002-pg_dump-Use-starelid-in-getAttributeStats.patchDownload+83-20
Hi,
Per side conversation in [1], this patch exposes pg_statistic.starelid in the pg_stats view (0001)
and then changes pg_dump to use the starelid in the queries on pg_stats rather than the
combination of schemaname+relname, which gets a bit clumsy in bulk array fetching, and
also leads to bad query plans against pg_stats because the security barrier is also an
optimization barrier, which means that queries often try to hash join, which causes the
very large table pg_statistic to be sequentially scanned, and that's a bad time. Currently
we get around this by adding a redundant qual to the query which gooses the plan towards
an index, and that works fine for now, but there's no guarantee that it will continue to work
in the future, so this change brings some plan safety as well.
This alone seems like a good enough reason to include startled in
pg_stat, besides other
future use-cases this will simplify as discussed in [1]/messages/by-id/aZ3RbTE8Racscykc@nathan
0001 also exposes pg_statistic.attnum. This is no direct application of this in 0002,
but people have often expressed surprise that pg_dump orders the
pg_restore_attribute_stats() calls by attname rather than attnum,
and if we wanted to change that now we could (albeit only on new versions).
This seems logical
v1-0001 comments:
1/
- nspname AS schemaname,
- relname AS tablename,
- attname AS attname,
+ n.nspname AS schemaname,
+ c.relname AS tablename,
+ s.starelid AS starelid,
+ a.attnum AS attnum,
+ a.attname AS attname,
Ideally, we would want the OID and attnum to be first columns to match
the pg_stat_activity/pg_stat_progress style, but that will be too invasive.
The new attname column should be moved after tablename, however.
2/
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>attnum</structfield> <type>name</type>
+ (references <link
linkend="catalog-pg-attribute"><structname>pg_attribute</structname></link>.<structfield>attnum</structfield>)
+ </para>
+ <para>
+ Position of column described by this row
+ </para></entry>
+ </row>
- This <type> should be int2
- Maybe for the describtion, it should be something like:
"The number of the column, as in
<structname>pg_attribute</structname>.<structfield>attnum</structfield>"
to be close to the description in pg_attribute
3/
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>starelid</structfield> <type>oid</type>
+ </para>
+ <para>
+ ID of the relation
+ </para></entry>
+ </row>
This should indicate it is a reference to pg_class.oid, like so:
```
(references <link
linkend="catalog-pg-class"><structname>pg_class</structname></link>.<structfield>oid</structfield>)
```
Maybe "ID of the table or index" is better, since this can only be a
table or index
for pg_stats.
I dislike the existing "pg_stats.tablename", since this can also be an
expression index.
"pg_stats.relation" with a description of "Name of table or index" is
more appropriate.
It is a change that we can possibly make in a major version. Looked
through the archives,
and did not see this being reported/discussed.
v1-0002:
I examined the version variants of getAttributeStats and all looks good,
and also ran a test on a 18 and 19 server version with the patched pg_dump
client and all looks good.
One minor comment is:
+ pg_fatal("statistics table oid information missing");
I noticed in other pg_fatal messages, we include OIDs
pg_fatal("could not find function definition for function
with OID %u",
cast->castfunc);
Should we do the same here?
[1]: /messages/by-id/aZ3RbTE8Racscykc@nathan
--
Sami Imseih
Amazon Web Services (AWS)
Corey Huinker <corey.huinker@gmail.com> writes:
Per side conversation in [1], this patch exposes pg_statistic.starelid in
the pg_stats view (0001) and then changes pg_dump to use the starelid in
the queries on pg_stats rather than the combination of schemaname+relname,
I don't object to the idea, but I don't like exposing the name
"starelid". That just screams implementation artifact, and it
goes against the idea that pg_stats is trying to hide the physical
representation of pg_statistic. I wish we could use "tableoid",
but that's taken already as a system column. Maybe "tableid" or
"tablerelid"?
Also, the proposed column ordering seems excessively random:
n.nspname AS schemaname,
c.relname AS tablename,
s.starelid AS starelid,
a.attnum AS attnum,
a.attname AS attname,
stainherit AS inherited,
I could see either of these as plausible:
n.nspname AS schemaname,
c.relname AS tablename,
s.starelid AS tableid,
a.attname AS attname,
a.attnum AS attnum,
stainherit AS inherited,
n.nspname AS schemaname,
c.relname AS tablename,
a.attname AS attname,
s.starelid AS tableid,
a.attnum AS attnum,
stainherit AS inherited,
but I don't see the rationale behind your version.
regards, tom lane
Ideally, we would want the OID and attnum to be first columns to match
the pg_stat_activity/pg_stat_progress style, but that will be too invasive.
I don't think there is an order that will make anyone happy, let alone
everyone.
The existing columns (schemaname, tablename, attname, inherited) constitute
a grain, so its hard to separate them.
Once we add attnum, that can replace attname in the above grain. Once we
add starelid/tableid, then that can be combined win inherited and either
attnum/attname for a grain. The ordering of these columns will largely fall
down to the perspective of the person seeking to use it.
- This <type> should be int2
- Maybe for the describtion, it should be something like:
+1
"The number of the column, as in
<structname>pg_attribute</structname>.<structfield>attnum</structfield>"
We don't do that for attname, why would attnum be different?
+ <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>starelid</structfield> <type>oid</type> + </para> + <para> + ID of the relation + </para></entry> + </row>This should indicate it is a reference to pg_class.oid, like so:
I went with pg_attribute, but the reference is there now.
Maybe "ID of the table or index" is better, since this can only be a
table or index
for pg_stats.
I went with table, because it matches the tablename definition (i.e. it is
equally inaccurate).
I dislike the existing "pg_stats.tablename", since this can also be an
expression index.
"pg_stats.relation" with a description of "Name of table or index" is
more appropriate.
It is a change that we can possibly make in a major version. Looked
through the archives,
and did not see this being reported/discussed.
I don't see it changing in any version, minor or major.
v1-0002:
I examined the version variants of getAttributeStats and all looks good,
and also ran a test on a 18 and 19 server version with the patched pg_dump
client and all looks good.One minor comment is:
+ pg_fatal("statistics table oid information missing");
I noticed in other pg_fatal messages, we include OIDs
pg_fatal("could not find function definition for function
with OID %u",
cast->castfunc);Should we do the same here?
If I had the oid, I wouldn't have the error. :)
I've modified the error message to give the namespace/tag of the entry.
Will post revised patch later.
I don't object to the idea, but I don't like exposing the name
"starelid". That just screams implementation artifact, and it
goes against the idea that pg_stats is trying to hide the physical
representation of pg_statistic. I wish we could use "tableoid",
but that's taken already as a system column. Maybe "tableid" or
"tablerelid"?
I went with tableid, as it pairs with tablename, and at least they're
inaccurate in the same way.
Also, the proposed column ordering seems excessively random:
n.nspname AS schemaname,
c.relname AS tablename,
s.starelid AS starelid,
a.attnum AS attnum,
a.attname AS attname,
stainherit AS inherited,I could see either of these as plausible:
n.nspname AS schemaname,
c.relname AS tablename,
s.starelid AS tableid,
a.attname AS attname,
a.attnum AS attnum,
stainherit AS inherited,n.nspname AS schemaname,
c.relname AS tablename,
a.attname AS attname,
s.starelid AS tableid,
a.attnum AS attnum,
stainherit AS inherited,but I don't see the rationale behind your version.
The rationale went like this:
In order to get the grain of this table you need:
( (schemaname AND tablename) OR tableid )
AND
(attnum OR attname)
AND
inherited
As I stated earlier, I don't think there's a way to "win" with the column
ordering here, any ordering we choose will look awkward to at least one use
case. Your first suggestion seems fine to me, so I'm going with that one in
this round.
Attachments:
v2-0001-Add-tableid-and-attnum-columns-to-pg_stats.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Add-tableid-and-attnum-columns-to-pg_stats.patchDownload+339-100
v2-0002-pg_dump-Use-tableid-in-getAttributeStats.patchtext/x-patch; charset=US-ASCII; name=v2-0002-pg_dump-Use-tableid-in-getAttributeStats.patchDownload+84-20
I dislike the existing "pg_stats.tablename", since this can also be an
expression index.
"pg_stats.relation" with a description of "Name of table or index" is
more appropriate.
It is a change that we can possibly make in a major version. Looked
through the archives,
and did not see this being reported/discussed.I don't see it changing in any version, minor or major.
This could be a separate discussion as it's not the fault of this patch,
but clearly "tablename" is not correct here.
I noticed in other pg_fatal messages, we include OIDs
pg_fatal("could not find function definition for function
with OID %u",
cast->castfunc);Should we do the same here?
If I had the oid, I wouldn't have the error. :)
oops, you're right.
I noticed that you changed the tests to selecting individual columns. I am
not clear as to why this is better?
-SELECT *
+SELECT schemaname, tablename, attname, attnum, inherited, null_frac, avg_width,
+ n_distinct, most_common_vals, most_common_freqs, histogram_bounds,
+ correlation, most_common_elems, most_common_elem_freqs,
+ elem_count_histogram, range_length_histogram, range_empty_frac,
+ range_bounds_histogram
Otherwise v2 LGTM.
--
Sami Imseih
Amazon Web Services (AWS)
"pg_stats.relation" with a description of "Name of table or index" is
more appropriate.
It is a change that we can possibly make in a major version. Looked
through the archives,
and did not see this being reported/discussed.I don't see it changing in any version, minor or major.
This could be a separate discussion as it's not the fault of this patch,
but clearly "tablename" is not correct here.
Certainly up for debate, but changing it would break existing scripts, and
that's generally a non-starter around here.
I noticed that you changed the tests to selecting individual columns. I am
not clear as to why this is better?-SELECT * +SELECT schemaname, tablename, attname, attnum, inherited, null_frac, avg_width, + n_distinct, most_common_vals, most_common_freqs, histogram_bounds, + correlation, most_common_elems, most_common_elem_freqs, + elem_count_histogram, range_length_histogram, range_empty_frac, + range_bounds_histogram
Well, the oid is now a part of the view, and that's not stable from one
regression run to the next, so we have to exclude it.
I noticed that you changed the tests to selecting individual columns. I am
not clear as to why this is better?-SELECT * +SELECT schemaname, tablename, attname, attnum, inherited, null_frac, avg_width, + n_distinct, most_common_vals, most_common_freqs, histogram_bounds, + correlation, most_common_elems, most_common_elem_freqs, + elem_count_histogram, range_length_histogram, range_empty_frac, + range_bounds_histogram
Makes sense.
BTW, I do not see a CF enty for this.
Thanks!
--
Sami Imseih
Amazon Web Services (AWS)
BTW, I do not see a CF enty for this.
I just created https://commitfest.postgresql.org/patch/6568/ - thanks!
On Thu, Mar 05, 2026 at 04:40:56PM -0500, Corey Huinker wrote:
I just created https://commitfest.postgresql.org/patch/6568/ - thanks!
I just skimmed through the latest patches, and they look generally
reasonable to me. I'll take a closer look soon and hopefully get these
committed.
Note to self: 0001 requires a catversion bump.
--
nathan
On Fri, Mar 06, 2026 at 05:08:58PM -0600, Nathan Bossart wrote:
I just skimmed through the latest patches, and they look generally
reasonable to me. I'll take a closer look soon and hopefully get these
committed.
Oh, a question I forgot to ask: why wouldn't we do the same thing for
pg_stats_ext and pg_stats_ext_exprs?
--
nathan