Extended Statistics set/restore/clear functions.

Started by Corey Huinkerover 1 year ago141 messageshackers

corey.huinker@gmail.com

over 1 year ago

This is a separate thread for work started in [1]/messages/by-id/CADkLM=c6NHdXU+d+m1yASZ4NT_qye1LCnaU2Vgf8YJ80jJT-Qg@mail.gmail.com but focused purely on
getting the following functions working:

* pg_set_extended_stats
* pg_clear_extended_stats
* pg_restore_extended_stats

These functions are analogous to their relation/attribute counterparts, use
the same calling conventions, and build upon the same basic infrastructure.

I think it is important that we get these implemented because they close
the gap that was left in terms of the ability to modify existing statistics
and to round out the work being done to carry over statistics via
dump/restore and pg_upgrade i [1]/messages/by-id/CADkLM=c6NHdXU+d+m1yASZ4NT_qye1LCnaU2Vgf8YJ80jJT-Qg@mail.gmail.com.

The purpose of each patch is as follows (adapted from previous thread):

0001 - This makes the input function for pg_ndistinct functional.

0002 - This makes the input function for pg_dependencies functional.

0003 - Makes several static functions in attribute_stats.c public for use
by extended stats. One of those is get_stat_attr_type(), which in the last
patchset was modified to take an attribute name rather than attnum, thus
saving a syscache lookup. However, extended stats identifies attributes by
attnum not name, so that optimization had to be set aside, at least
temporarily.

0004 - These implement the functions pg_set_extended_stats(),
pg_clear_extended_stats(), and pg_restore_extended_stats() and behave like
their relation/attribute equivalents. If we can get these committed and
used by pg_dump, then we don't have to debate how to handle post-upgrade
steps for users who happen to have extended stats vs the approximately
99.75% of users who do not have extended stats.

This patchset does not presently include any work to integrate these
functions into pg_dump, but may do so once that work is settled, or it may
become its own thread.

[1]: /messages/by-id/CADkLM=c6NHdXU+d+m1yASZ4NT_qye1LCnaU2Vgf8YJ80jJT-Qg@mail.gmail.com
/messages/by-id/CADkLM=c6NHdXU+d+m1yASZ4NT_qye1LCnaU2Vgf8YJ80jJT-Qg@mail.gmail.com

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 1 year ago

In reply to: Corey Huinker (#1)

Re: Extended Statistics set/restore/clear functions.

Hi,

Thanks for continuing to work on this.

On 1/22/25 19:17, Corey Huinker wrote:

This is a separate thread for work started in [1] but focused purely on
getting the following functions working:

* pg_set_extended_stats
* pg_clear_extended_stats
* pg_restore_extended_stats

These functions are analogous to their relation/attribute counterparts,
use the same calling conventions, and build upon the same basic
infrastructure.

I think it is important that we get these implemented because they close
the gap that was left in terms of the ability to modify existing
statistics and to round out the work being done to carry over statistics
via dump/restore and pg_upgrade i [1].

The purpose of each patch is as follows (adapted from previous thread):

0001 - This makes the input function for pg_ndistinct functional.

0002 - This makes the input function for pg_dependencies functional.

I only quickly skimmed the patches, but a couple comments:

1) I think it makes perfect sense to use the JSON parsing for the input
functions, but maybe it'd be better to adjust the format a bit to make
that even easier?

Right now the JSON "keys" have structure, which means we need some ad
hoc parsing. Maybe we should make it "proper JSON" by moving that into
separate key/value, e.g. for ndistinct we might replace this:

{"1, 2": 2323, "1, 3" : 3232, ...}

with this:

[ {"keys": [1, 2], "ndistinct" : 2323},
{"keys": [1, 3], "ndistinct" : 3232},
... ]

so a regular JSON array of objects, with keys an "array". And similarly
for dependencies.

Yes, it's more verbose, but maybe better for "mechanical" processing?

2) Do we need some sort of validation? Perhaps this was discussed in the
other thread and I missed that, but isn't it a problem that happily
accept e.g. this?

{"6666, 6666" : 1, "1, -222": 14, ...}

That has duplicate keys with bogus attribute numbers, stats on (bogus)
system attributes, etc. I suspect this may easily cause problems during
planning (if it happens to visit those statistics).

Maybe that's acceptable - ultimately the user could import something
broken in a much subtler way, of course. But the pg_set_attribute_stats
seems somewhat more protected against this, because it gets the attr as
a separate argument.

I recall I wished to have the attnum in the output function, but that
was not quite possible because we don't know the relid (and thus the
descriptor) in that function.

Is it a good idea to rely on the input/output format directly? How will
that deal with cross-version differences? Doesn't it mean the in/out
format is effectively fixed, or at least has to be backwards compatible
(i.e. new version has to handle any value from older versions)?

Or what if I want to import the stats for a table with slightly
different structure (e.g. because dump/restore skips dropped columns).
Won't that be a problem with the format containing raw attnums? Or is
this a use case we don't expect to work?

For the per-attribute stats it's probably fine, because that's mostly
just a collection of regular data types (scalar values or arrays of
values, ...) and we're not modifying them except for maybe adding new
fields. But extended stats seem more complex, so maybe it's different?

I remember a long discussion about the format at the very beginning of
this patch series, and the conclusion clearly was to have a function
that import stats for one attribute at a time. And that seems to be
working fine, but extended stats values have more internal structure, so
perhaps they need to do something more complicated.

0003 - Makes several static functions in attribute_stats.c public for use
by extended stats. One of those is get_stat_attr_type(), which in the last
patchset was modified to take an attribute name rather than attnum, thus
saving a syscache lookup. However, extended stats identifies attributes by
attnum not name, so that optimization had to be set aside, at least
temporarily.

0004 - These implement the functions pg_set_extended_stats(),
pg_clear_extended_stats(), and pg_restore_extended_stats() and behave like
their relation/attribute equivalents. If we can get these committed and
used by pg_dump, then we don't have to debate how to handle post-upgrade
steps for users who happen to have extended stats vs the approximately
99.75% of users who do not have extended stats.

I see there's a couple MCV-specific functions in the extended_stats.c.
Shouldn't those go into mvc.c instead?

FWIW there's a bunch of whitespace issues during git apply.

This patchset does not presently include any work to integrate these
functions into pg_dump, but may do so once that work is settled, or it
may become its own thread.

OK. Thanks for the patch!

regards

--
Tomas Vondra

Corey Huinker

corey.huinker@gmail.com

over 1 year ago

In reply to: Tomas Vondra (#2)

Re: Extended Statistics set/restore/clear functions.

On Wed, Jan 22, 2025 at 5:50 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi,

Thanks for continuing to work on this.

On 1/22/25 19:17, Corey Huinker wrote:

This is a separate thread for work started in [1] but focused purely on
getting the following functions working:

* pg_set_extended_stats
* pg_clear_extended_stats
* pg_restore_extended_stats

These functions are analogous to their relation/attribute counterparts,
use the same calling conventions, and build upon the same basic
infrastructure.

I think it is important that we get these implemented because they close
the gap that was left in terms of the ability to modify existing
statistics and to round out the work being done to carry over statistics
via dump/restore and pg_upgrade i [1].

The purpose of each patch is as follows (adapted from previous thread):

0001 - This makes the input function for pg_ndistinct functional.

0002 - This makes the input function for pg_dependencies functional.

I only quickly skimmed the patches, but a couple comments:

1) I think it makes perfect sense to use the JSON parsing for the input
functions, but maybe it'd be better to adjust the format a bit to make
that even easier?

Right now the JSON "keys" have structure, which means we need some ad
hoc parsing. Maybe we should make it "proper JSON" by moving that into
separate key/value, e.g. for ndistinct we might replace this:

{"1, 2": 2323, "1, 3" : 3232, ...}

with this:

[ {"keys": [1, 2], "ndistinct" : 2323},
{"keys": [1, 3], "ndistinct" : 3232},
... ]

so a regular JSON array of objects, with keys an "array". And similarly
for dependencies.

That is almost exactly what I did back when the stats import functions took
a nested JSON argument.

The biggest problem with changing that format is that the old format would
still show up in the system being exported, so we would have to process
that format as well as the new one.

Yes, it's more verbose, but maybe better for "mechanical" processing?

It absolutely would be better for processing, but we'd still have to read
the old format from older systems. I suppose the pg_dump code could do
some SQL gymnastics to convert the old json-but-sad format into the
processing-friendly format of the future, and I could easily adapt what
I've already written over a year ago to that task. I suppose it's just a
matter of having the community behind changing the output format to enable
a better input format.

2) Do we need some sort of validation? Perhaps this was discussed in the
other thread and I missed that, but isn't it a problem that happily
accept e.g. this?

{"6666, 6666" : 1, "1, -222": 14, ...}

That has duplicate keys with bogus attribute numbers, stats on (bogus)
system attributes, etc. I suspect this may easily cause problems during
planning (if it happens to visit those statistics).

We used to have _lots_ of validation for data quality issues, much of which
was removed at the request of reviewers. However, much of that discussion
was about the health of the statistics, but these are standalone data
types, maybe they're held to a higher standard. If so, what sort of checks
do you think would be needed and/or wanted?

So far, I can imagine the following:

* no negative attnums in key list
* no duplicate attnums in key list

anything else?

Maybe that's acceptable - ultimately the user could import something
broken in a much subtler way, of course. But the pg_set_attribute_stats
seems somewhat more protected against this, because it gets the attr as
a separate argument.

The datatype itself is in isolation, but once we've created a valid
pg_ndistinct or pg_dependencies, there's nothing stopping us from
validating the values in the datatype against the statistics object and the
relation it belongs to, but that might get the same resistance that I got
to say, ensuring that frequency lists were monotonically decreasing.

I recall I wished to have the attnum in the output function, but that
was not quite possible because we don't know the relid (and thus the
descriptor) in that function.

Is it a good idea to rely on the input/output format directly? How will
that deal with cross-version differences? Doesn't it mean the in/out
format is effectively fixed, or at least has to be backwards compatible
(i.e. new version has to handle any value from older versions)?

Presently there are no cross-version differences, though earlier I address
the pros and cons of changing it. No matter what, the burden of having a
valid format is on the user in the case of pg_set_extended_stats() and
pg_restore_extended_stats() has a server version number associated, so the
option of handling a format change could be baked in, but then we're doing
version tests and input typecasts like we do with ANYARRAY types. Not
impossible, but definitely more work.

Or what if I want to import the stats for a table with slightly
different structure (e.g. because dump/restore skips dropped columns).
Won't that be a problem with the format containing raw attnums? Or is
this a use case we don't expect to work?

The family of pg_set_*_stats functions expect the input to be meaningful
and correct for that relation on that server version. Any attnum
translation would have to be done by the user to adapt to the new or
changed relation.

The family of pg_restore_*_stats functions are designed to be forward
compatible, and to work across versions but for basically the same relation
or relation of the same shape. Basically, they're for pg_restore and
pg_upgrade, so no changes in attnums would be expected.

For the per-attribute stats it's probably fine, because that's mostly
just a collection of regular data types (scalar values or arrays of
values, ...) and we're not modifying them except for maybe adding new
fields. But extended stats seem more complex, so maybe it's different?

I had that working by attname matching way back in the early days, but it
would involve creating an intermediate format for translating the attnums
to attnames on export, and then re-translating them on the way back in.

I suppose someone could write the following utility functions

pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct) ->
json
pg_xlat_ndistinct_from_attnames(relation reloid, ndist json) ->
pg_ndistinct

and that would bridge the gap for the special case where you want to adapt
pg_ndistinct from one table structure to a slightly different one.

I remember a long discussion about the format at the very beginning of
this patch series, and the conclusion clearly was to have a function
that import stats for one attribute at a time. And that seems to be
working fine, but extended stats values have more internal structure, so
perhaps they need to do something more complicated.

I believe this *is* doing something more complicated, especially with the
MCVList, though I was very pleased to see how much of the existing
infrastructure I was able to reuse.

0003 - Makes several static functions in attribute_stats.c public for use
by extended stats. One of those is get_stat_attr_type(), which in the

last

patchset was modified to take an attribute name rather than attnum, thus
saving a syscache lookup. However, extended stats identifies attributes

by

attnum not name, so that optimization had to be set aside, at least
temporarily.

0004 - These implement the functions pg_set_extended_stats(),
pg_clear_extended_stats(), and pg_restore_extended_stats() and behave

like

their relation/attribute equivalents. If we can get these committed and
used by pg_dump, then we don't have to debate how to handle post-upgrade
steps for users who happen to have extended stats vs the approximately
99.75% of users who do not have extended stats.

I see there's a couple MCV-specific functions in the extended_stats.c.
Shouldn't those go into mvc.c instead?

I wanted to put it there, but there was a reason I didn't and I've now
forgotten what it was. I'll make an effort to relocate it to mcv.c.

For that matter, it might make sense to break out the expressions code into
its own file, because every other stat attribute has its own. Thoughts on
that?

FWIW there's a bunch of whitespace issues during git apply.

OK. Thanks for the patch!

Thanks for the feedback, please keep it coming. I think it's really
important that extended stats, though used rarely, be included in our
dump/restore/upgrade changes so as to make for a more consistent user
experience.

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 1 year ago

In reply to: Corey Huinker (#3)

Re: Extended Statistics set/restore/clear functions.

On 1/23/25 15:51, Corey Huinker wrote:

On Wed, Jan 22, 2025 at 5:50 PM Tomas Vondra <tomas@vondra.me
<mailto:tomas@vondra.me>> wrote:

Hi,

Thanks for continuing to work on this.

On 1/22/25 19:17, Corey Huinker wrote:

This is a separate thread for work started in [1] but focused

purely on

getting the following functions working:

* pg_set_extended_stats
* pg_clear_extended_stats
* pg_restore_extended_stats

These functions are analogous to their relation/attribute

counterparts,

use the same calling conventions, and build upon the same basic
infrastructure.

I think it is important that we get these implemented because they

close

the gap that was left in terms of the ability to modify existing
statistics and to round out the work being done to carry over

statistics

via dump/restore and pg_upgrade i [1].

The purpose of each patch is as follows (adapted from previous

thread):

0001 - This makes the input function for pg_ndistinct functional.

0002 - This makes the input function for pg_dependencies functional.

I only quickly skimmed the patches, but a couple comments:

1) I think it makes perfect sense to use the JSON parsing for the input
functions, but maybe it'd be better to adjust the format a bit to make
that even easier?

Right now the JSON "keys" have structure, which means we need some ad
hoc parsing. Maybe we should make it "proper JSON" by moving that into
separate key/value, e.g. for ndistinct we might replace this:

{"1, 2": 2323, "1, 3" : 3232, ...}

with this:

[ {"keys": [1, 2], "ndistinct" : 2323},
{"keys": [1, 3], "ndistinct" : 3232},
... ]

so a regular JSON array of objects, with keys an "array". And similarly
for dependencies.

That is almost exactly what I did back when the stats import functions
took a nested JSON argument.

The biggest problem with changing that format is that the old format
would still show up in the system being exported, so we would have to
process that format as well as the new one.

Yes, it's more verbose, but maybe better for "mechanical" processing?

It absolutely would be better for processing, but we'd still have to
read the old format from older systems. I suppose the pg_dump code
could do some SQL gymnastics to convert the old json-but-sad format into
the processing-friendly format of the future, and I could easily adapt
what I've already written over a year ago to that task. I suppose it's
just a matter of having the community behind changing the output format
to enable a better input format.

D'oh! I always forget about the backwards compatibility issue, i.e. that
we still need to ingest values from already released versions. Yeah,
that makes the format change less beneficial.

2) Do we need some sort of validation? Perhaps this was discussed in the
other thread and I missed that, but isn't it a problem that happily
accept e.g. this?

{"6666, 6666" : 1, "1, -222": 14, ...}

That has duplicate keys with bogus attribute numbers, stats on (bogus)
system attributes, etc. I suspect this may easily cause problems during
planning (if it happens to visit those statistics).

We used to have _lots_ of validation for data quality issues, much of
which was removed at the request of reviewers. However, much of that
discussion was about the health of the statistics, but these are
standalone data types, maybe they're held to a higher standard. If so,
what sort of checks do you think would be needed and/or wanted?

So far, I can imagine the following:

* no negative attnums in key list
* no duplicate attnums in key list

anything else?

Yeah, I recall there were a lot of checks initially and we dropped them
over time. I'm not asking to reinstate all of those thorough checks.

At this point I was really thinking only about validating the attnums,
i.e. to make sure it's a valid attribute in the table / statistics. That
is something the pg_set_attribute_stats() enforce too, thanks to having
a separate argument for the attribute name.

That's where I'd stop. I don't want to do checks on the statistics
content, like verifying the frequencies in the MCV sum up to 1.0 or
stuff like that. I think we're not doing that for pg_set_attribute_stats
either (and I'd bet one could cause a lot of "fun" this way).

Maybe that's acceptable - ultimately the user could import something
broken in a much subtler way, of course. But the pg_set_attribute_stats
seems somewhat more protected against this, because it gets the attr as
a separate argument.

The datatype itself is in isolation, but once we've created a valid
pg_ndistinct or pg_dependencies, there's nothing stopping us from
validating the values in the datatype against the statistics object and
the relation it belongs to, but that might get the same resistance that
I got to say, ensuring that frequency lists were monotonically decreasing.

Understood. IMHO it's fine to say we're not validating the statistics
are "consistent" but I think we should check it matches the definition.

I recall I wished to have the attnum in the output function, but that
was not quite possible because we don't know the relid (and thus the
descriptor) in that function.

Is it a good idea to rely on the input/output format directly? How will
that deal with cross-version differences? Doesn't it mean the in/out
format is effectively fixed, or at least has to be backwards compatible
(i.e. new version has to handle any value from older versions)?

Presently there are no cross-version differences, though earlier I
address the pros and cons of changing it. No matter what, the burden of
having a valid format is on the user in the case of
pg_set_extended_stats() and pg_restore_extended_stats() has a server
version number associated, so the option of handling a format change
could be baked in, but then we're doing version tests and input
typecasts like we do with ANYARRAY types. Not impossible, but definitely
more work.

OK, makes sense.

Or what if I want to import the stats for a table with slightly
different structure (e.g. because dump/restore skips dropped columns).
Won't that be a problem with the format containing raw attnums? Or is
this a use case we don't expect to work?

The family of pg_set_*_stats functions expect the input to be meaningful
and correct for that relation on that server version. Any attnum
translation would have to be done by the user to adapt to the new or
changed relation.

The family of pg_restore_*_stats functions are designed to be forward
compatible, and to work across versions but for basically the same
relation or relation of the same shape. Basically, they're for
pg_restore and pg_upgrade, so no changes in attnums would be expected.

For the per-attribute stats it's probably fine, because that's mostly
just a collection of regular data types (scalar values or arrays of
values, ...) and we're not modifying them except for maybe adding new
fields. But extended stats seem more complex, so maybe it's different?

I had that working by attname matching way back in the early days, but
it would involve creating an intermediate format for translating the
attnums to attnames on export, and then re-translating them on the way
back in.

I suppose someone could write the following utility functions

pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct) -

json

pg_xlat_ndistinct_from_attnames(relation reloid, ndist json) ->
pg_ndistinct

and that would bridge the gap for the special case where you want to
adapt pg_ndistinct from one table structure to a slightly different one.

I remember a long discussion about the format at the very beginning of
this patch series, and the conclusion clearly was to have a function
that import stats for one attribute at a time. And that seems to be
working fine, but extended stats values have more internal structure, so
perhaps they need to do something more complicated.

I believe this *is* doing something more complicated, especially with
the MCVList, though I was very pleased to see how much of the existing
infrastructure I was able to reuse.

0003 - Makes several static functions in attribute_stats.c public

for use

by extended stats. One of those is get_stat_attr_type(), which in

the last

patchset was modified to take an attribute name rather than

attnum, thus

saving a syscache lookup. However, extended stats identifies

attributes by

attnum not name, so that optimization had to be set aside, at least
temporarily.

0004 - These implement the functions pg_set_extended_stats(),
pg_clear_extended_stats(), and pg_restore_extended_stats() and

behave like

their relation/attribute equivalents. If we can get these

committed and

used by pg_dump, then we don't have to debate how to handle post-

upgrade

steps for users who happen to have extended stats vs the approximately
99.75% of users who do not have extended stats.

I see there's a couple MCV-specific functions in the extended_stats.c.
Shouldn't those go into mvc.c instead?

I wanted to put it there, but there was a reason I didn't and I've now
forgotten what it was. I'll make an effort to relocate it to mcv.c.

For that matter, it might make sense to break out the expressions code
into its own file, because every other stat attribute has its own.
Thoughts on that?

+1 to that, if it reduced unnecessary code duplication

FWIW there's a bunch of whitespace issues during git apply.

+1

OK. Thanks for the patch!

Thanks for the feedback, please keep it coming. I think it's really
important that extended stats, though used rarely, be included in our
dump/restore/upgrade changes so as to make for a more consistent user
experience.

I agree, and I appreciate you working on it.

--
Tomas Vondra

Corey Huinker

corey.huinker@gmail.com

over 1 year ago

In reply to: Tomas Vondra (#4)

Re: Extended Statistics set/restore/clear functions.

* no negative attnums in key list

Disregard this suggestion - negative attnums mean the Nth expression in the
extended stats object, though it boggles the mind how we could have 222
expressions...

* no duplicate attnums in key list

This one is still live, am considering.

At this point I was really thinking only about validating the attnums,

i.e. to make sure it's a valid attribute in the table / statistics. That
is something the pg_set_attribute_stats() enforce too, thanks to having
a separate argument for the attribute name.

That's where I'd stop. I don't want to do checks on the statistics
content, like verifying the frequencies in the MCV sum up to 1.0 or
stuff like that. I think we're not doing that for pg_set_attribute_stats

Agreed.

either (and I'd bet one could cause a lot of "fun" this way).

If by "fun" you mean "create a fuzzing tool", then yes.

As an aside, the "big win" in all these functions is the ability to dump a
database --no-data, but have all the schema and statistics, thus allowing
for checking query plans on existing databases with sensitive data while
not actually exposing the data (except mcv, obvs), nor spending the I/O to
load that data.

Understood. IMHO it's fine to say we're not validating the statistics
are "consistent" but I think we should check it matches the definition.

I suppose someone could write the following utility functions

pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct) -

json

pg_xlat_ndistinct_from_attnames(relation reloid, ndist json) ->
pg_ndistinct

and that would bridge the gap for the special case where you want to
adapt pg_ndistinct from one table structure to a slightly different one.

OK

As they'll be pure-SQL functions, I'll likely post the definitions here,
but not put them into a patch unless it draws interest.

For that matter, it might make sense to break out the expressions code

into its own file, because every other stat attribute has its own.
Thoughts on that?

+1 to that, if it reduced unnecessary code duplication

I'm uncertain that it actually would deduplicate any code, but I'll
certainly try.

Corey Huinker

corey.huinker@gmail.com

over 1 year ago

In reply to: Corey Huinker (#3)

Re: Extended Statistics set/restore/clear functions.

I see there's a couple MCV-specific functions in the extended_stats.c.

Shouldn't those go into mvc.c instead?

I wanted to put it there, but there was a reason I didn't and I've now
forgotten what it was. I'll make an effort to relocate it to mcv.c.

Looking at it now, I see that check_mcvlist_array() expects access to
extarginfo, so I either I add a bunch of Datum-aware and extarginfo-aware
code to mcv.c, or I do those checks outside of import_mcvlist() and leave
check_mcvlist_array() where it is, which is my current inclination, though
obviously it's not a perfectly clean break.

Corey Huinker

corey.huinker@gmail.com

over 1 year ago

In reply to: Tomas Vondra (#4)

Re: Extended Statistics set/restore/clear functions.

I agree, and I appreciate you working on it.

I've tried to incorporate all your feedback:

* duplicates in pg_ndistinct and pg_dependencies now cause the input
function to fail, with tests for each

* all the non-array, non-parametery stuff moved out of import_mcvlist so
that import_mcvlist can move to mcv.c in a sane way

* whitespace issues

* validating attnums inside pg_ndistinct. I held off on doing
pg_dependencies pending feedback on how I handled pg_ndistinct. tests for
invalid positive (attnum isn't involved in the extended stats object), 0
attnum (just wrong) and invalid negative (more than the number of
expressions in the object)

Still to do:

* attnum checking for dependencies, should be simple, once we've locked
down how to do it for pg_ndistinct

* move import_expressions code to expressions.c along with some other
expression-specific stuff.

* more tests, probably

I'd like to merge these down to 3 patches again, but I'm keeping them
separate for this patchset to isolate the attnum-checking code for this
go-round.

Corey Huinker

corey.huinker@gmail.com

over 1 year ago

In reply to: Corey Huinker (#7)

Re: Extended Statistics set/restore/clear functions.

I'd like to merge these down to 3 patches again, but I'm keeping them
separate for this patchset to isolate the attnum-checking code for this
go-round.

These are mock-ups of the to/from JSON functions, but building from/to text
rather than the not-yet-committed pg_ndistinct and pg_dependencies data
types. Currently they're done with JSON rather than JSONB because I assume
that the ordering within the datatype matters. We can probably preserve
order by adding an "order" field populated by WITH ORDINALITY.

To get all Jurrassic about this: I've spent some time thinking about
whether we CAN make these functions, it's time to consider whether we
SHOULD. And that leads me to a couple of points:

p1. We could switch to the new formats without any change to the internal
representation, but pg_dump would always need to know about the old formats.

p2. The JSON format is both more understandable and easier to manipulate.

p3. If we thought the number of people using extended stats was small, the
number of people tweaking extended stats is going to be smaller.

So that gives us a few paths forward:

o1. Switch to the new input/output format, and the queries inside these
functions get incorporated into some future pg_dump queries.

o2. Keep the old formats, create these functions inside the system and
whoever wants to use them can use them.

o3. Keep old formats, and make these functions work as the CASTs to and
from JSON/JSONB.

o4. Keep old formats, could create these functions in an extension.

o5. Keep old formats, leave these function definitions here for a future
intrepid hacker.

jian he

jian.universality@gmail.com

over 1 year ago

In reply to: Corey Huinker (#7)

Re: Extended Statistics set/restore/clear functions.

hi.
I reviewed 0001 only.

in src/backend/statistics/mvdistinct.c

no need #include "nodes/pg_list.h" since
src/include/statistics/statistics.h sub level include "nodes/pg_list.h"

no need #include "utils/palloc.h"
sicne #include "postgres.h"
already included it.

select '[{"6, -32768,,": -11}]'::pg_ndistinct;
ERROR: malformed pg_ndistinct: "[{"6, -32768,,": -11}]"
LINE 1: select '[{"6, -32768,,": -11}]'::pg_ndistinct;
^
DETAIL: All ndistinct count values are scalar doubles.
imho, this errdetail message is not good.

select '{}'::pg_ndistinct ;
segfault

select '{"1,":"1"}'::pg_ndistinct ;
ERROR: malformed pg_ndistinct: "{"1,":"1"}"
LINE 1: select '{"1,":"1"}'::pg_ndistinct ;
^
DETAIL: All ndistinct attnum lists must be a comma separated list of attnums.

imho, this errdetail message is not good. would be better saying that
"length of list of attnums must be larger than 1".

select pt1.typnamespace, pt1.typarray, pt1.typcategory, pt1.typname
from pg_type pt1
where pt1.typname ~*'distinct';

typcategory (Z) marked as Internal-use types. and there is no
pg_ndistinct array type,
not sure this is fine.

all the errcode one pair of the parenthesis is unnecessary.
for example
(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
errdetail("Must begin with \"{\"")));
can change to
errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
errdetail("Must begin with \"{\""));

see https://www.postgresql.org/docs/current/error-message-reporting.html

#10

jian he

jian.universality@gmail.com

over 1 year ago

In reply to: jian he (#9)

Re: Extended Statistics set/restore/clear functions.

hi.

select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
pg_ndistinct
------------------------
{"1, 37": -2147483648}
(1 row)

this is not what we expected?
For the VALUE part of pg_ndistinct, float8 has 3 special values: inf, -inf, NaN.

For the key part of pg_ndistinct, see example.
select '{"1, 16\t":"1"}'::pg_ndistinct;
here \t is not tab character, ascii 9. it's two characters: backslash
and character "t".
so here it should error out?
(apply this to \n, \r, \b)

pg_ndistinct_in(PG_FUNCTION_ARGS)
ending part should be:

freeJsonLexContext(lex);
if (result == JSON_SUCCESS)
{
......
}
else
{
ereturn(parse_state.escontext, (Datum) 0,
errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("malformed pg_ndistinct: \"%s\"", str),
errdetail("Must be valid JSON."));
PG_RETURN_NULL();
}
result should be either JSON_SUCCESS or anything else.

all these functions:
ndistinct_object_start, ndistinct_array_start,
ndistinct_object_field_start, ndistinct_array_element_start
have
ndistinctParseState *parse = state;

do we need to change it to
ndistinctParseState *parse = (ndistinctParseState *)state;
?

ndistinctParseState need to add to src/tools/pgindent/typedefs.list
probably change it to "typedef struct ndistinctParseState".
also struct ndistinctParseState need placed in the top of
src/backend/statistics/mvdistinct.c
not in line 340?

#11

Corey Huinker

corey.huinker@gmail.com

over 1 year ago

In reply to: jian he (#9)

Re: Extended Statistics set/restore/clear functions.

On Tue, Jan 28, 2025 at 11:25 AM jian he <jian.universality@gmail.com>
wrote:

hi.
I reviewed 0001 only.

in src/backend/statistics/mvdistinct.c

no need #include "nodes/pg_list.h" since
src/include/statistics/statistics.h sub level include "nodes/pg_list.h"

no need #include "utils/palloc.h"
sicne #include "postgres.h"
already included it.

Noted.

select '[{"6, -32768,,": -11}]'::pg_ndistinct;
ERROR: malformed pg_ndistinct: "[{"6, -32768,,": -11}]"
LINE 1: select '[{"6, -32768,,": -11}]'::pg_ndistinct;
^
DETAIL: All ndistinct count values are scalar doubles.
imho, this errdetail message is not good.

What error message do you think is appropriate in that situation?

select '{}'::pg_ndistinct ;
segfault

Mmm, gotta look into that!

select '{"1,":"1"}'::pg_ndistinct ;
ERROR: malformed pg_ndistinct: "{"1,":"1"}"
LINE 1: select '{"1,":"1"}'::pg_ndistinct ;
^
DETAIL: All ndistinct attnum lists must be a comma separated list of
attnums.

imho, this errdetail message is not good. would be better saying that
"length of list of attnums must be larger than 1".

That sounds better.

typcategory (Z) marked as Internal-use types. and there is no
pg_ndistinct array type,
not sure this is fine.

I think it's probably ok for now. The datatype currently has no utility
other than extended statistics, and I'm doubtful that it ever will.

#12

Corey Huinker

corey.huinker@gmail.com

over 1 year ago

In reply to: jian he (#10)

Re: Extended Statistics set/restore/clear functions.

On Wed, Jan 29, 2025 at 2:50 AM jian he <jian.universality@gmail.com> wrote:

hi.

select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
pg_ndistinct
------------------------
{"1, 37": -2147483648}
(1 row)

I think my initial reaction is to just refuse those special values, but
I'll look into the parsing code to see what can be done.

this is not what we expected?

For the VALUE part of pg_ndistinct, float8 has 3 special values: inf,
-inf, NaN.

For the key part of pg_ndistinct, see example.
select '{"1, 16\t":"1"}'::pg_ndistinct;
here \t is not tab character, ascii 9. it's two characters: backslash
and character "t".
so here it should error out?
(apply this to \n, \r, \b)

I don't have a good answer as to what should happen here. Special cases
like this make Tomas' suggestion to change the in/out format more
attractive.

pg_ndistinct_in(PG_FUNCTION_ARGS)
ending part should be:

freeJsonLexContext(lex);
if (result == JSON_SUCCESS)
{
......
}
else
{
ereturn(parse_state.escontext, (Datum) 0,
errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("malformed pg_ndistinct: \"%s\"", str),
errdetail("Must be valid JSON."));
PG_RETURN_NULL();
}
result should be either JSON_SUCCESS or anything else.

all these functions:
ndistinct_object_start, ndistinct_array_start,
ndistinct_object_field_start, ndistinct_array_element_start
have
ndistinctParseState *parse = state;

do we need to change it to
ndistinctParseState *parse = (ndistinctParseState *)state;
?

The compiler isn't complaining so far, but I see no harm in it.

#13

Corey Huinker

corey.huinker@gmail.com

over 1 year ago

In reply to: Corey Huinker (#12)

Re: Extended Statistics set/restore/clear functions.

select '{"1, 0B100101":"NaN"}'::pg_ndistinct;

pg_ndistinct
------------------------
{"1, 37": -2147483648}
(1 row)

I think my initial reaction is to just refuse those special values, but
I'll look into the parsing code to see what can be done.

I noticed that the output function for pg_ndistinct casts that value to an
integer before formatting it %d, so it's being treated as an integer even
if it is not stored as one. After some consultation with Tomas, it made the
most sense to just replicate this on the input side as well, and that is
addressed in the patches below.

I've updated and rebased the patches.

The existing pg_ndistinct and pg_dependences formats were kept as-is. The
formats are clumsy, more processing-friendly formats would be easier, but
the need for such processing is minimal bordering on theoretical, so there
is little impact in keeping the historical format.

There are now checks to ensure that the pg_ndistinct or pg_dependencies
value assigned to an extended statistics object actually makes sense for
that object. What this amounts to is checking that for every attnum cited,
the positive attnums are also ones found the in the stxkeys of the
pg_statistic_ext tuple, and the negative attnums correspond do not exceed
the number of expressions in the attnum. In other words, if the stats
object has no expressions in it, then no negative numbers will be accepted,
if it has 2 expressions than any value -3 or lower will be rejected, etc.

All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc.

#14

Corey Huinker

corey.huinker@gmail.com

over 1 year ago

In reply to: Corey Huinker (#13)

Re: Extended Statistics set/restore/clear functions.

I think my initial reaction is to just refuse those special values, but
I'll look into the parsing code to see what can be done.

I noticed that the output function for pg_ndistinct casts that value to an
integer before formatting it %d, so it's being treated as an integer even
if it is not stored as one. After some consultation with Tomas, it made the
most sense to just replicate this on the input side as well, and that is
addressed in the patches below.

I've updated and rebased the patches.

The existing pg_ndistinct and pg_dependences formats were kept as-is. The
formats are clumsy, more processing-friendly formats would be easier, but
the need for such processing is minimal bordering on theoretical, so there
is little impact in keeping the historical format.

There are now checks to ensure that the pg_ndistinct or pg_dependencies
value assigned to an extended statistics object actually makes sense for
that object. What this amounts to is checking that for every attnum cited,
the positive attnums are also ones found the in the stxkeys of the
pg_statistic_ext tuple, and the negative attnums correspond do not exceed
the number of expressions in the attnum. In other words, if the stats
object has no expressions in it, then no negative numbers will be accepted,
if it has 2 expressions than any value -3 or lower will be rejected, etc.

All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc.

A rebasing, and a few changes
* regnamespace and name parameters changed to statistics_schemaname as text
and statistics_name as text, so that there's one less thing that can
potentially fail in an upgrade
* schema lookup and stat name lookup failures now issue a warning and
return false, rather than ERROR
* elevel replaced with hardcoded WARNING most everywhere, as has been done
with relation/attribute stats

#15

Corey Huinker

corey.huinker@gmail.com

about 1 year ago

In reply to: Corey Huinker (#14)

Re: Extended Statistics set/restore/clear functions.

Just rebasing.

#16

Corey Huinker

corey.huinker@gmail.com

about 1 year ago

In reply to: Corey Huinker (#15)

Re: Extended Statistics set/restore/clear functions.

On Mon, Mar 31, 2025 at 1:10 AM Corey Huinker <corey.huinker@gmail.com>
wrote:

Just rebasing.

At pgconf.dev this year, the subject of changing the formats of
pg_ndistinct and pg_depdentencies came up again.

To recap: presently these datatypes have no working input function, but
would need one for statistics import to work on extended statistics. The
existing input formats are technically JSON, but the keys themselves are a
comma-separated list of attnums, so they require additional parsing. That
parsing is already done in the patches in this thread, but overall the
format is terrible for any sort of manipulation, like the manipulation that
people might want to do to translate the values to a table with a different
column order (say, after a restore of a table that had dropped columns), or
to do query planner experiments.

Because the old formats don't have a corresponding input function, there is
no risk of the ouptut not matching required inputs, but there will be once
we add new input functions, so this is our last chance to change the format
to something we like better.

The old format can be trivially translated via functions posted earlier in
this thread back in January (pg_xlat_ndistinct_to_attnames,
pg_xlat_dependencies_to_attnames) as well as the reverse (s/_to_/_from_/),
so dumping values from older versions will not be difficult.

I believe that we should take this opportunity to make the change. While we
don't have a pressing need to manipulate these structures now, we might in
the future and failing to do so now makes a later change much harder.

With that in mind, I'd like people to have a look at the proposed format
change if pg_ndistinct (the changes to pg_dependencies are similar), to see
if they want to make any improvements or comments. As you can see, the new
format is much less compact (about 3x as large), which could get bad if the
number of elements grew by a lot, but the number of elements is tied to the
number of factors in the extended support (N choose N, then N choose N-1,
etc, excluding choose 1), so this can't get too out of hand.

Existing format (newlines/formatting added by me to make head-to-head
comparison easier):

'{"2, 3": 4,
"2, -1": 4,
"2, -2": 4,
"3, -1": 4,
"3, -2": 4,
"-1, -2": 3,
"2, 3, -1": 4,
"2, 3, -2": 4,
"2, -1, -2": 4,
"3, -1, -2": 4}'::pg_ndistinct

Proposed new format (again, all formatting here is just for ease of humans
reading):

' [ {"attributes" : [2,3], "ndistinct" : 4},
{"attributes" : [2,-1], "ndistinct" : 4},
{"attributes" : [2,-2], "ndistinct" : 4},
{"attributes" : [3,-1], "ndistinct" : 4},
{"attributes" : [3,-2], "ndistinct" : 4},
{"attributes" : [-1,-2], "ndistinct" : 3},
{"attributes" : [2,3,-1], "ndistinct" : 4},
{"attributes" : [2,3,-2], "ndistinct" : 4},
{"attributes" : [2,-1,-2], "ndistinct" : 4},
{"attributes" : [3,-1,-2], "ndistinct" : 4}]'::pg_ndistinct

The pg_dependencies structure is only slightly more complex:

An abbreviated example:

{"2 => 1": 1.000000, "2 => -1": 1.000000, ..., "2, -2 => -1": 1.000000, "3,
-1 => 2": 1.000000},

Becomes:

[ {"attributes": [2], "dependency": 1, "degree": 1.000000},
{"attributes": [2], "dependency": -1, "degree": 1.000000},
{"attributes": [2, -2], "dependency": -1, "degree": 1.000000},
...,
{"attributes": [2, -2], "dependency": -1, "degree": 1.000000},
{"attributes": [3, -1], "dependency": 2, "degree": 1.000000}]

Any thoughts on using/improving these structures?

#17

Corey Huinker

corey.huinker@gmail.com

about 1 year ago

In reply to: Corey Huinker (#16)

Re: Extended Statistics set/restore/clear functions.

Any thoughts on using/improving these structures?

Hearing no objections, here is the latest patchset.

0001 - Changes input/output functions of pg_ndistinct to the format
described earlier.
0002 - Changes input/output functions of pg_dependencies to the format
described earlier.
0003 - Makes some previously internal/static attribute stats functions
visible to extended_stats.c, because the exprs attribute is basically an
array of partially filled-out pg_statistic rows.
0004 - Adds pg_restore_attribute_stats(), pg_clear_attribute_stats(), in
the pattern of their relation/attribute brethren.
0005 - adds the dumping and restoring of extended statistics back to v10.
No command line flag changes needed.

#18

Corey Huinker

corey.huinker@gmail.com

11 months ago

In reply to: Corey Huinker (#17)

Re: Extended Statistics set/restore/clear functions.

On Sat, Jun 21, 2025 at 8:12 PM Corey Huinker <corey.huinker@gmail.com>
wrote:

Any thoughts on using/improving these structures?

Hearing no objections, here is the latest patchset.

0001 - Changes input/output functions of pg_ndistinct to the format
described earlier.
0002 - Changes input/output functions of pg_dependencies to the format
described earlier.
0003 - Makes some previously internal/static attribute stats functions
visible to extended_stats.c, because the exprs attribute is basically an
array of partially filled-out pg_statistic rows.
0004 - Adds pg_restore_attribute_stats(), pg_clear_attribute_stats(), in
the pattern of their relation/attribute brethren.
0005 - adds the dumping and restoring of extended statistics back to v10.
No command line flag changes needed.

Rebased. Enjoy.

#19

Corey Huinker

corey.huinker@gmail.com

10 months ago

In reply to: Corey Huinker (#18)

Re: Extended Statistics set/restore/clear functions.

Rebased. Enjoy.

Rebased.

Show quoted text

#20

Corey Huinker

corey.huinker@gmail.com

8 months ago

In reply to: Corey Huinker (#19)

Re: Extended Statistics set/restore/clear functions.

On Fri, Sep 12, 2025 at 12:55 AM Corey Huinker <corey.huinker@gmail.com>
wrote:

Rebased. Enjoy.

Rebased.

And rebased again to conform to 688dc6299 and 4bd919129.

Extended Statistics set/restore/clear functions.

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments: