Add expressions to pg_restore_extended_stats()
This is a new thread that continues the work [1] of the Extended Statistics
set/restore/clear functions thread [2], which itself was a continuation of
the work [3] of the Statistics Export and Import thread [4], all of which
is too much for anyone to review, so I'll give a recap:
A longstanding complaint about postgres has been the amount of time
required to rebuild stats after a major version upgrade, during which the
database is online but under heavy I/O and queries of any real complexity
will get _terrible_ query plans because they lack the optimizer statistics
which the previous major version had just moments ago, but weren't carried
over to the new version.
Version 18 introduced the ability to import stats at the relation level and
attribute level [3], and these stats are now dumped and restored by
default via pg_upgrade. This meant that most customers could reduce the
time in which the database was online but in a state of degraded
performance. It was, however, not a complete solution, because it still
lacked statistics for extended objects (i.e. CREATE STATISTICS xyz...) and
custom statistic kinds like those found in extensions like PostGIS. Still,
this made things better for 99% of installations, and while it was not
trivial to determine if a given instance was in that 1%, enhancements were
made to vacuumdb [5] to detect what tables were missing statistics and
analyze just those tables, thus reducing the scope of the I/O-intensive
rebuild period for those in the unlucky 1%.
Work in this 19-dev cycle has sought to close that 1% gap by importing
statistics for extended statistics objects. These stats are quite a bit
more complicated than their relation and attribute equivalents, but good
progress has been made [1], resulting in the carryover of many statistics
types: ndistinct, dependencies, and MCV. The one exception is the
statistics associated with expressions in the definition of the statistics
object (i.e. CREATE STATISTICS xyz ON upper(name), ...). This last type of
statistics has proved to be a tough nut to crack for reasons I will
describe in detail. We could stop here, but if we did we would actually
create work for vacuumdb, specifically the code that processes the
--missing-stats-only option, which currently looks for extended
statistics data (pg_statistic_ext_data) rows that match the object
definition (pg_statistic_ext) rows, and considers any match to be
sufficient for "not missing stats". That assumption would no longer hold in
the case of stats objects that have expressions, because they'd be missing
their stxdexprs stats. While we can teach vacuumdb about that difference, we
could instead solve the expressions problem, and close the statistics gap
even further [6].
We have a working but not thoroughly tested implementation (attached).
There remains one sticky problem: the serialization format of the
statistics stored in pg_statistic_ext_data.stxdexprs. The type of that
attribute is pg_catalog.pg_statistic[], which is to say an array of
records whose length equals the number of expressions in the extended
statistics object. pg_statistic is where attribute statistics are
imported, so the structure has all the complexity of attribute stats
itself, slightly reduced by the fact that the fields describing the
attribute are left as InvalidOid values, but still quite complicated.
Several of the attributes in pg_statistic are of type ANYARRAY, because
they are most-common-values/histogram/most-common-elements arrays, each of
which has a composite datatype determined by the datatype of the
expression(s) and other columns in the object definition. This presents a
problem for utilities like deconstruct_array(), in that the datatype to be
deconstructed varies by column and by the datatype of the expression
definition, and that datatype could itself be an array, which
deconstruct_array() would then try to deconstruct... there is no way to
get deconstruct_array() to stop two levels deep.
This problem was solved for pg_restore_attribute_stats by having pg_dump
export the ANYARRAY values CAST-ed to type "text" rather than "text[]",
which allowed each type of statistics to be decomposed according to its
own rules, and that worked fine when each statistic type became a parameter
in pg_restore_attribute_stats(). But now we have all of those types, and
we're getting them multiple times (once per expression), so that method
doesn't quite scale.
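The "cast each ANYARRAY to text, then decompose by its own rules" approach boils down to splitting exactly one level of array syntax and treating everything inside an element as opaque. Here is a rough sketch in Python, purely illustrative of the text format (the server does this in C with deconstruct_array() and type-specific input functions, not with anything like this):

```python
def split_pg_array(text):
    """Split a one-level Postgres array literal into raw element strings.

    Quoted elements are unescaped; inner structure (e.g. a nested array
    literal) is deliberately left as opaque text, to be handed to the
    element type's own input function -- the "stop after one level" idea.
    """
    assert text.startswith("{") and text.endswith("}")
    body, elems, i = text[1:-1], [], 0
    while i < len(body):
        if body[i] == '"':                      # quoted element
            i += 1
            buf = []
            while body[i] != '"':
                if body[i] == "\\":
                    i += 1                      # take escaped char literally
                buf.append(body[i])
                i += 1
            elems.append("".join(buf))
            i += 1                              # skip closing quote
        else:                                   # unquoted element up to comma
            j = body.find(",", i)
            j = len(body) if j == -1 else j
            tok = body[i:j]
            elems.append(None if tok == "NULL" else tok)
            i = j
        if i < len(body) and body[i] == ",":
            i += 1
    return elems
```

Note how the nested array `{1,1}` comes back as uninterpreted text, which is exactly what a type-specific input function wants to receive.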
I've considered several ways around this issue:
1. Defining a strict order for the statistics types, following the order
they appear in pg_stats_ext (null_frac, avg_width, n_distinct,
most_common_elems, ...) and then exprs from pg_stats_ext_exprs in last
place. Each value is CAST-ed to "text",
which means that we can deconstruct them in a fashion very similar to how
we did for attribute stats. Those text values are put into an array in the
strict order, and those arrays are aggregated into a 2-D array.
Pros:
- This method is implemented and it works, and the code for it is reusing
tools and coding patterns we've already incorporated (deconstruct_array,
safe input functions, arglist arrays). Patch attached.
Cons:
- The ordering is completely opaque. Documenting that ordering might help a
little, but there's nothing intuitive about it and checking it has been an
irritant to author and reviewer alike.
- This opaque ordering must be kept in sync between
pg_restore_extended_stats and pg_dump or else statistical garbage will
result.
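For illustration, decoding option 1 amounts to nothing more than zipping values against a fixed column order. In this Python sketch the order is an assumption made up for the example (it mirrors the pg_stats_ext_exprs view); the real patch would have to pin down and document its own order, and that opacity is exactly the con being described:

```python
# Hypothetical fixed column order; illustrative only, not the patch's order.
STAT_ORDER = [
    "null_frac", "avg_width", "n_distinct", "most_common_vals",
    "most_common_freqs", "histogram_bounds", "correlation",
    "most_common_elems", "most_common_elem_freqs", "elem_count_histogram",
]

def decode_expr_row(row):
    """Map one inner text[] of the 2-D array onto stat names by position.

    None stands in for SQL NULL; values stay as text so the caller can run
    them through type-specific input functions later.
    """
    if len(row) != len(STAT_ORDER):
        raise ValueError("row length does not match expected stat count")
    return dict(zip(STAT_ORDER, row))
```

A reader of the serialized form has no way to recover STAT_ORDER from the data itself, which is why producer and consumer must stay in lockstep.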
2. Defining a record type specifically for the purpose.
Pros:
- It could be decomposed via the standard composite input function, and
then each type deconstructed on its own terms.
Cons:
- It's type-clutter, and a type that is likely only used during upgrades.
- When new stats types are introduced, the structure would also have to
change, breaking typecasts of composite values from older versions. This
alone makes this option unpalatable to most reviewers, and I can't advocate
for it.
3. Keeping the 2-D text array in #1, but each "row" is a list of
kwargs-like pairs like the arguments used in pg_restore_attribute_stats
(i.e. ARRAY['null_frac','0.5','avg_width','1.0','most_common_values',...])
Pros:
- Flexibility in ordering
- Clarity at a glance, provided that the reader has seen the kwargs
convention of the pg_restore_*_stats functions.
- Still uses existing tooling like #1, and not that much more overhead.
- The name-value pairing problem has the same solution as the arg-pairing
that the function already does
Cons:
- adds overhead of storing the stat type names, and the key-value pairing
- the 2-D nature of the array requires that the number of elements be
fixed, so we couldn't leave out a stat type from one row unless we left it
out of the other one as well
- adds the possibility of duplicate names
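A sketch of how option 3's rows could be decoded, with the duplicate-name possibility handled explicitly (Python, illustrative only; the real function would feed each value through its type-specific input routine):

```python
def decode_kwargs_row(row):
    """Decode one ['key','value','key','value',...] row into a dict,
    rejecting duplicate stat names (one of the cons listed above)."""
    if len(row) % 2 != 0:
        raise ValueError("key/value row must have an even number of elements")
    stats = {}
    for key, value in zip(row[0::2], row[1::2]):
        if key in stats:
            raise ValueError(f"duplicate stat name: {key}")
        stats[key] = value
    return stats
```

The loop is essentially the same arg-pairing the pg_restore_*_stats functions already do for their own variadic arguments.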
4. JSON. The outer structure would be an array of objects, each object
a set of key/value pairs.
Pros:
- Flexibility in ordering
- Clarity at a glance in a format well understood even without prior
knowledge of our kwargs convention
- we have already implemented similar things for the new formats of
pg_ndistinct and pg_dependencies.
- This method currently has the interest of Michael Paquier, the committer
of all the v19 work to date.
Cons:
- Requires implementing a state engine to parse the JSON, check for missing
values, and resolve duplicates. We do that for pg_dependencies, and it takes
800-ish lines of code to handle 3 key names; this would have 10.
- the escaping of values in a composite record CASTed to text and then
further encoded as a JSON value would be extremely unreadable, and likely
quite bloated.
- using JSON for stats serialization met with immediate strong opposition
from several reviewers. That resistance may not be there for this vastly
reduced scope, especially in light of the new JSON-compatible formats for
pg_ndistinct and pg_dependencies, but it does give me pause.
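For comparison, the shape of option 4 and the kind of structural checking it implies could look like this (a Python sketch using the standard json module; the key names are assumed to mirror pg_stats_ext_exprs, and the warn-and-ignore policy for unknown keys is an assumption, not settled behavior):

```python
import json

# Key names assumed to mirror the pg_stats_ext_exprs view columns.
KNOWN_KEYS = {
    "null_frac", "avg_width", "n_distinct", "correlation",
    "most_common_vals", "most_common_freqs", "histogram_bounds",
    "most_common_elems", "most_common_elem_freqs", "elem_count_histogram",
}

def parse_exprs(doc):
    """Validate the outer array-of-objects structure and collect unknown
    keys; values stay as-is (strings/numbers/nulls) for later conversion."""
    exprs = json.loads(doc)
    if not isinstance(exprs, list):
        raise ValueError("exprs must be a JSON array")
    warnings = []
    for i, obj in enumerate(exprs):
        if not isinstance(obj, dict):
            raise ValueError(f"exprs[{i}] must be a JSON object")
        for key in sorted(obj.keys() - KNOWN_KEYS):
            warnings.append(f"unknown key {key!r} in exprs[{i}], ignored")
    return exprs, warnings
```

The structural checks that a typed 2-D text array gets for free (every row the same shape, every element text) have to be spelled out by hand here, which is part of the state-engine cost listed in the cons.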
And...that's the major decision point. If we solve that, the rest is far
less controversial. My apologies that this summary itself needs a summary.
Thanks for reading. Eager to hear perspectives on the serialization methods
proposed, or suggestions of other methods.
[1]: https://commitfest.postgresql.org/patch/5517/
[2]: /messages/by-id/aTE4AL7U0dp1Jjrx@paquier.xyz
[3]: https://commitfest.postgresql.org/patch/4538/
[4]: /messages/by-id/CADkLM=cB0rF3p_FuWRTMSV0983ihTRpsH+OCpNyiqE7Wk0vUWA@mail.gmail.com
[5]: https://commitfest.postgresql.org/patch/5523/
[6]: The issue of custom statistic kinds like those found in PostGIS would
still remain.
Attachments:
v1-0001-Add-support-for-exprs-in-pg_restore_extended_stat.patch
On Fri, Jan 30, 2026 at 12:08:49AM -0500, Corey Huinker wrote:
3. Keeping the 2-D text array in #1, but each "row" is a list of
kwargs-like pairs like the arguments used in pg_restore_attribute_stats
(i.e. ARRAY['null_frac','0.5','avg_width','1.0','most_common_values',...])

4. JSON. The outer structure would be an array of objects, each object
would be a key-value.
I'd still favor 4 on the grounds that it's easier to edit and read,
which would be more in line with the MCV, dependencies, ndistinct and
att/rel statistics. The format proposed in the attached patch is hard
to work with, anyway. Now, I do take your point that composite
record values cast into a single text value could be confusing
(double-quote issues, I guess?), so perhaps a text[] as in 3 would be
better adapted for readability. We could also enforce some checks based
on the order of the arguments in the input array, so that duplicates are
not an issue, I guess?
Structurally, I feel that import_expressions() is overcomplicated, and
with the correct structure tracking the state of each field, I would
try to reduce a bit the duplication that the patch presents, aiming at
fewer callers of statatt_build_stavalues() and statatt_set_slot(),
perhaps.
I have also posted a few more arguments here, for reference:
/messages/by-id/aXvqN2fhDJZhL2RS@paquier.xyz
--
Michael
On Fri, Jan 30, 2026 at 12:55 AM Michael Paquier <michael@paquier.xyz>
wrote:
On Fri, Jan 30, 2026 at 12:08:49AM -0500, Corey Huinker wrote:
3. Keeping the 2-D text array in #1, but each "row" is a list of
kwargs-like pairs like the arguments used in pg_restore_attribute_stats
(i.e. ARRAY['null_frac','0.5','avg_width','1.0','most_common_values',...])

4. JSON. The outer structure would be an array of objects, each object
would be a key-value.

I'd still favor 4 on the grounds that it's easier to edit and read,
which would be more in line with the MCV, dependencies, ndistinct and
att/rel statistics. The format proposed in the attached patch is hard
to work with, anyway. Now, I do take your point that composite
record values cast into a single text value could be confusing
(double-quote issues, I guess?), so perhaps a text[] as in 3 would be
better adapted for readability.
Hmm, maybe it isn't so bad:
SELECT '{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}'::text[];
text
---------------------------------------------------
{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}
(1 row)
SELECT
array_to_json('{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}'::text[]);
array_to_json
---------------------------------------------------
["{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"]
(1 row)
Mind you, this is an ANYARRAY first cast to text. If we cast
pg_stats_ext_exprs.most_common_vals directly to JSON then it'll drill
down into the innards of the composite values because it can see the local
datatypes, and that breaks our ability to use regular input functions. I
learned that the hard way when using JSON for serializing attribute stats
stuff when this effort first began.
Before seeing that, I wanted to try option 3 first, as it brings clarity
with no real increase in tooling other than looking for repeated keywords,
but if you're hyped for json then I'll try that first.
We could also enforce some checks based
on the order of the arguments in the input array, so that duplicates are
not an issue, I guess?
If we're doing a kwargs-thing then I may as well just track which keywords
were already used. We already bail out on the whole expressions array at
the first sign of inconsistency, so it's not like we have to decide which
of the duplicates to keep.
Structurally, I feel that import_expressions() is overcomplicated, and
with the correct structure tracking the state of each field, I would
try to reduce a bit the duplication that the patch presents, aiming at
fewer callers of statatt_build_stavalues() and statatt_set_slot(),
perhaps.
I don't think we can get around those. It's a limitation of how the
sta(kindN/opN/collN/numbersN/valuesN) system in pg_statistic works. We want
to fill in each stakind as we find it, and we don't know how many of them
we've already filled out. An array of records would have been better, but
we've got 5 parallel arrays of scalars and we have to live with it.
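The parallel-array constraint can be sketched like this (Python, purely to illustrate the shape of the problem; set_slot and the 0-means-unused convention are stand-ins, not the patch's actual function or representation):

```python
STATISTIC_NUM_SLOTS = 5  # pg_statistic has stakind1..stakind5, etc.

def set_slot(stakinds, stanumbers, stavalues, kind, numbers, values):
    """Fill the first unused slot across the parallel arrays.

    Because the catalog stores five parallel scalars rather than an array
    of records, placing each newly found stakind means a linear scan for
    the next free position."""
    for i in range(STATISTIC_NUM_SLOTS):
        if stakinds[i] == 0:            # 0 marks an unused slot here
            stakinds[i], stanumbers[i], stavalues[i] = kind, numbers, values
            return i
    raise ValueError("no free statistic slot")
```

With a true array of records the caller could simply append; the five parallel scalars are why the setter has to be called once per stakind found.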
On Fri, Jan 30, 2026 at 3:07 AM Corey Huinker <corey.huinker@gmail.com>
wrote:
[...]
Ok, so I refactored the exprs input to be a jsonb array, and the good news
is that it results in legitimately more readable test cases. See below:
-- ok, with warning: extra exprs param
SELECT pg_catalog.pg_restore_extended_stats(
'schemaname', 'stats_import',
'relname', 'test',
'statistics_schemaname', 'stats_import',
'statistics_name', 'test_stat_mcelem',
'inherited', false,
'exprs', '[
{
"avg_width": 33,
"null_frac": 0,
"n_distinct": -1,
"correlation": 1,
"histogram_bounds":
"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}",
"most_common_vals": null,
"most_common_elems": "{-1,0,1,2,3}",
"most_common_freqs": null,
"elem_count_histogram":
"{1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1.5}",
"most_common_elem_freqs":
"{0.25,0.25,0.5,0.25,0.25,0.25,0.5,0.25}",
"bad_param": "text no one will ever parse"
}
]'::jsonb);
WARNING: malformed expr expression: "bad_param": is not an expression key
name, value ignored
The downside is that it results in 200 more lines of code in
extended_stats_funcs.c and 500 more lines of test output in
stats_import.sql.
There are a few reasons for this:
- It's up to us to validate the [{expr},{expr},...] structure ourselves,
something a 2-D text array did rather succinctly.
- null is a valid jsonb value, and we have to check for it. This quirk
alone results in 4 more test cases, though some other test cases did go
away.
- JSON strings are not NUL-terminated C strings, so we have to convert them
first to C strings and then to text datums to leverage the existing
attribute stats functions
I briefly thought about making the exprs input parameter jsonb[] thus
saving some structure checks, but what little that buys us is then taken
away by making the pg_dump command that much more complex.
There are probably some simplifications that can happen:
- casting all stat values to text to avoid having to deal with the
neither-fish-nor-fowl json numeric values.
- utility function for converting jbvString values to TextDatum.
- utility function to simplify json iteration in a simple key/value object
- deciding that unknown parameters aren't worth a warning
So...does this look better? Is it worth it?
Attachments:
v2jsonb-0001-Add-support-for-exprs-in-pg_restore_extended.patch
On Sun, Feb 01, 2026 at 07:39:04PM -0500, Corey Huinker wrote:
The downside is that it results in 200 more lines of code in
extended_stats_funcs.c and 500 more lines of test output in
stats_import.sql.

There are a few reasons for this:
- It's up to us to validate the [{expr},{expr},...] structure ourselves,
something a 2-D text array did rather succinctly.
- null is a valid jsonb value, and we have to check for it. This quirk
alone results in 4 more test cases, though some other test cases did go
away.
- JSON strings are not NUL-terminated C strings, so we have to convert them
first to C strings and then to text datums to leverage the existing
attribute stats functions
This one would not actually exist if using a 1-D text[] with elements
made of key/value/key/value as the array deconstruction would take
care of that, wouldn't it?
There are probably some simplifications that can happen:
- casting all stat values to text to avoid having to deal with the
neither-fish-nor-fowl json numeric values.
Hmm. We'd still need to deal with buggy inputs even if this
conversion is enforced in pg_dump.
- utility function for converting jbvString values to TextDatum.
- utility function to simplify json iteration in a simple key/value object
- deciding that unknown parameters aren't worth a warning
Do you actually need check_valid_expr_argnames() at all? At the end
of import_pg_statistic() we know that all the parameters have been
found. So we could just check on the numbers of values fetched and
complain based on NUM_ATTRIBUTE_STATS_ELEMS? That would remove some
code, not impact the safety. key_lookup_notnull() could be changed to
return three possible states rather than a boolean: okay and NULL
value, okay and not-NULL value, error if key not found in JSON blob or
incorrect parsing of the input value (aka infinite, input function
error, etc). Then increment the counter if one of the first two states
is found as NULL can be valid for some of the key/value pairs.
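The tri-state lookup being suggested could be sketched roughly like this (a Python enum standing in for a C enum; key_lookup_notnull is the patch's function, but the body shown is an assumption about the proposed behavior, not its actual code):

```python
from enum import Enum

class KeyState(Enum):
    OK_NULL = 0       # key present, value is NULL (valid for some stats)
    OK_NOT_NULL = 1   # key present, value usable
    ERROR = 2         # key missing (or, in the real code, a parse failure)

def key_lookup(obj, key):
    """Three-state lookup instead of a boolean: callers count the first
    two states toward the expected number of stat elements and bail out
    on ERROR."""
    if key not in obj:
        return KeyState.ERROR, None
    value = obj[key]
    if value is None:
        return KeyState.OK_NULL, None
    return KeyState.OK_NOT_NULL, value
```

Counting OK_NULL and OK_NOT_NULL results against NUM_ATTRIBUTE_STATS_ELEMS is what would let the caller detect missing keys without a separate name-checking pass.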
jbv_get_float4_datum() and jbv_get_int4_datum() are alike; they could
be refactored as a single function that takes a type name as input for
the error message and an input function pointer. That would cut some
code.
So...does this look better? Is it worth it?
Well, it does to me. The pg_dump bits based on jsonb_build_object()
are rather attractive in shape. This addresses the concerns I had
about item ordering and unicity of the keys in a natural way through
the data type filtering.
I don't see a need for key_lookup(), its sole caller is
key_lookup_notnull().
+ 'exprs', '{{0,4,-0.75,"{1}","{0.5}","{-1,0}",-0.6,NULL,NULL,NULL},{0.25,4,-0.5,"{2}","{0.5}",NULL,1,NULL,NULL,NULL}}'::text[]);
This is in the documentation, and would need fixing.
+WARNING: invalid expression elements most_common_vals and most_common_freqs: conflicting NULL/NOT NULL
For this warning in the regression tests, you should only need one
element, reducing the number of input lines?
The patch has a query checking the contents of pg_stats_ext_exprs for
test_mr_stat, no imports. This links to the multirange case mentioned
in import_pg_statistic().
Something that looks missing here is that we do not attach to which
negative attribute an item of "exprs" links to. It seems to me that
this needs to include an extra key called "attribute" with a negative
associated with it, so that we can rely on the attnum instead of the
order of the items restored.
+ * Create a pg_statisitc tuple from an expression container.
Typo here.
+ * The attytypids, attytypmods, and atttypcols arrays have all the
Incorrect variable names.
+ errmsg("malformed expressions \"%s\": must be a regular numeric",
errmsg not project-style.
+ errmsg("malformed expressions \"%s\": exprected array of %d, found %d",
Bis repetita, and s/exprected/expected/.
+ ereport(WARNING,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("cannot specify parameter \"%s\".",
+ extarginfo[EXPRESSIONS_ARG].argname),
+ errhint("Extended statistics object "
+ "does not support statistics of this type."));
errmsg on a single line, object name to be included in the errhint. No
dot at the end of errmsg.
+ errmsg("invalid expression element \"%s\": must be type string"
This one also should be reworded.
--
Michael
This one would not actually exist if using a 1-D text[] with elements
made of key/value/key/value as the array deconstruction would take
care of that, wouldn't it?
If it were 1-D we'd have to know where one expression ended and the next
one started. If you're asking what I think you're asking, an array like:
array[
  array['null_frac','0.4','avg_width', '...', ...],
  array['null_frac','0.5','avg_width', '...', ...]
]
this would force us to either require an order for the keys, in which case,
why bother, or it would require a loop through each 1-D array comparing
each keyword against the list of known keywords so we can put the values in
order (and resolve duplicates). It's not the cleanest thing, but I have
done it before, and it eventually resulted in
stats_fill_fcinfo_from_arg_pairs(), which could be adapted to handle this
as well.
Hmm. We'd still need to deal with buggy inputs even if this
conversion is enforced in pg_dump.
True, but we're already doing it in both the 2-D text array version and the
json version; the only difference is the number of hoops we have to jump
through to get an input to a *Safe() input function.
- utility function for converting jbvString values to TextDatum.
- utility function to simplify json iteration in a simple key/value object
- deciding that unknown parameters aren't worth a warning
Do you actually need check_valid_expr_argnames() at all?
No, not if we're ok silently missing the "ok, with warning: extra exprs
param" test. It seems like something we should report, but also something
we shouldn't fail on.
At the end
of import_pg_statistic() we know that all the parameters have been
found.
We don't, actually. Any un-found parameters would be left at their default
values without error. It would be easy enough to set a flag to true for
each one found and report on that, but I haven't implemented that yet.
So we could just check on the numbers of values fetched and
complain based on NUM_ATTRIBUTE_STATS_ELEMS? That would remove some
code, not impact the safety. key_lookup_notnull() could be changed to
return three possible states rather than a boolean: okay and NULL
value, okay and not-NULL value, error if key not found in JSON blob or
incorrect parsing of the input value (aka infinite, input function
error, etc). Then increment the counter if one of the first two states
is found as NULL can be valid for some of the key/value pairs.
In the situation you're describing, the *lack* of an expected key is itself
grounds for a WARNING and failing the whole import_expressions() assuming
that the version parameter indicated that servers of that vintage knew
about that parameter. I'd be ok with it.
jbv_get_float4_datum() and jbv_get_int4_datum() are alike; they could
be refactored as a single function that takes a type name as input for
the error message and an input function pointer. That would cut some
code.
Yes, a bit. That's also true of their predecessors text_to_float4 and
text_to_int4, so it's a savings, but not unique to this implementation.
So...does this look better? Is it worth it?
Well, it does to me. The pg_dump bits based on jsonb_build_object()
are rather attractive in shape. This addresses the concerns I had
about item ordering and unicity of the keys in a natural way through
the data type filtering.
I had actually figured that the pg_dump part would turn you OFF of
this approach.
The JSONB type is a bit of a saving grace in that if you had duplicate
input keys {"x": "y", "x": "z"} it'll keep only the last one, {"x": "z"},
so duplicate keys resolve themselves before we ever see them. That's one of
the few areas where I liked what the jsonb api did for us.
I don't see a need for key_lookup(), its sole caller is
key_lookup_notnull().
A lot of this code was exhumed from the v1 patch of the relation/attribute
stats functions [1], and it can be taken further to verify the JSON type
of the value (in our case we'd only want numeric/string test variants, or
perhaps only the string-test variant).
+ 'exprs',
'{{0,4,-0.75,"{1}","{0.5}","{-1,0}",-0.6,NULL,NULL,NULL},{0.25,4,-0.5,"{2}","{0.5}",NULL,1,NULL,NULL,NULL}}'::text[]);

This is in the documentation, and would need fixing.
Lots of things need fixing if we're going this route. Part of this patch
was to see if the results horrified you, and since they clearly haven't,
the fixing is now worth it.
+WARNING: invalid expression elements most_common_vals and
most_common_freqs: conflicting NULL/NOT NULL

For this warning in the regression tests, you should only need one
element, reducing the number of input lines?
That's true IF we decide that missing expression keys are a thing that we
allow, and per conversation above it's not clear yet that we do.
The patch has a query checking the contents of pg_stats_ext_exprs for
test_mr_stat, no imports. This links to the multirange case mentioned
in import_pg_statistic().
There *is* an input, we just have to check output from pg_stats_ext()
first, and then pg_stats_ext_exprs(). I suppose I could combine the queries
but chose not to for simplicity.
As those comments mention, there are attribute-level stat types specific
for ranges that are not reflected in extended stats via ANALYZE, so
importing expressions on test_mr isn't all that interesting, but it's there
for completeness.
Something that looks missing here is that we do not attach to which
negative attribute an item of "exprs" links to. It seems to me that
this needs to include an extra key called "attribute" with a negative
associated with it, so that we can rely on the attnum instead of the
order of the items restored.
Oof. So there is no such attribute number exposed in pg_stats_ext_exprs;
we have to rely on the deterministic order in which they are fetched from
the view. The "expr" text definition is essentially a node tree, which we
can't guarantee stays stable across major versions, so that's no help.
jsonb_agg() doesn't guarantee order on its own, but it can be enforced via
an ORDINALITY-like trick. I had to do this before, so I can excavate that
code again. Thanks for the reminder.
(Many other corrections redacted).
+1 to all cases.
Since clearly, you're on board with the jsonb idea, I'll make the next
version all about cleanups and tightening up this method.
The outstanding questions from the above are, given a jsonb tree of [ {
"null_frac": 0.5, ...}, {"avg_width": "1", ...}], which for notation I'll
call an array of expr1 and expr2:
1. If an expected key is missing from either expr1 or expr2, is that worth
failing the whole import_expressions()?
2. Is an unexpected key in either expr1 or expr2 worth failing the whole
import_expressions()?
3. If the answer to #2 is no, do we need to warn about the existence of the
weird parameter at all, or just silently skip it?
4. Does casting the numeric scalar values (null_frac, correlation,
avg_width, n_distinct) to text make sense, since we have to put them
through type-specific input functions anyway?
5. Do any of these jsonb-twiddling operations look generically useful
enough to put in jsonb_utils.c?
(my answers would be 1. no, 2. no, 3. can't skip it, have to issue a
warning on the first weird param, but it doesn't fail anything 4. yes, and
5. maybe jbvString to cstring and/or jbvString to TextDatum)
[1]: /messages/by-id/attachment/149862/v1-0001-Introduce-the-system-view-pg_stats_export-and-the.patch
On Mon, Feb 02, 2026 at 03:21:35AM -0500, Corey Huinker wrote:
I had actually figured that the pg_dump part would actually turn you OFF of
this approach.
Sorry, I guess. :)
For this warning in the regression tests, you should only need one
element, reducing the number of input lines?

That's true IF we decide that missing expression keys are a thing that we
allow, and per conversation above it's not clear yet that we do.
If ANALYZE can generate partial data for a set of expressions (aka
generating some data for a portion of the expressions defined but skipping
some of them for reason x_y_z), the restore function needs to
cope with this policy (I'd need to double-check the code to be sure,
I am not sure off the top of my head now and I cannot do that today,
unfortunately).
Something that looks missing here is that we do not attach to which
negative attribute an item of "exprs" links to. It seems to me that
this needs to include an extra key called "attribute" with a negative
associated with it, so as we can rely on the attnum instead of the
order of the items restored.Oof. So there is no such attribute number exposed in pg_stat_ext_exprs(),
we have to rely on the deterministic order that they are fetched from the
view. The "expr" text definition is essentially a node tree, which we can't
guarantee stays stable across major versions, so that's no help.
jsonb_agg() doesn't guarantee order on its own, but it can be enforced via
an ORDINALITY-like trick. I had to do this before; I can excavate that code
again. Thanks for the reminder.
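A minimal sketch of that trick (the input array here is illustrative, not
the patch's actual data): jsonb_agg() accepts an ORDER BY inside the
aggregate, and WITH ORDINALITY supplies a stable sort key.

```sql
-- Sketch: pin down jsonb_agg() ordering with WITH ORDINALITY.
SELECT jsonb_agg(stats ORDER BY ord)
FROM unnest(ARRAY['{"null_frac": "0"}'::jsonb,
                  '{"null_frac": "0.5"}'::jsonb])
     WITH ORDINALITY AS t(stats, ord);
```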
Hmm. I would not rely on the ordering of the items as they are
scanned, that seems like a recipe for disaster. We have a text
representation of the expression, as of pg_stats_ext_exprs.expr. This
could be mapped with the elements of the text[] in pg_stats_ext.exprs.
Here is a wilder idea: why not just putting the expression text itself
in the data given in input of the restore function rather than a
guessed argument number? For the case of manual stats injections,
that kind of makes things simpler.
Since clearly, you're on board with the jsonb idea, I'll make the next
version all about cleanups and tightening up this method.
Yeah, this approach makes sense to me in terms of clarity and item
handling for the expression parts. Now, as you have mentioned me
offline, there was a discussion about applying this format to other
parts of the arguments, which would be around here I guess:
/messages/by-id/CADkLM=fhVk3BJ8z20iQW2wGfuOk+ZSgzNHUVGtLpmvzbQ9Ontg@mail.gmail.com
This area of the discussion focused on the catalog data types we have
for dependencies and ndistinct, as well.
The outstanding questions from the above are, given a jsonb tree of [ {
"null_frac": 0.5, ...}, {"avg_width": "1", ...}], which for notation I'll
call an array of expr1 and expr2:
1. If an expected key is missing from either expr1 or expr2, is that worth
failing the whole import_expressions()?
2. Is an unexpected key in either expr1 or expr2 worth failing the whole
import_expressions()?
3. If the answer to #2 is no, do we need to warn about the existence of the
weird parameter at all, or just silently skip it?
Yeah, it sounds to me that we should just set ok=false and give up
rather than have a semi-filled set of numbers for a single extended
stats object. There is an argument in favor of that: it can simplify
the detection of missing stats for a single extended stats definition.
I understand that you'd want to keep going with loading the data even
if it's partial. My question is: is it possible for ANALYZE to fill
in only a portion of the expressions and can these be partially
skipped? If the answer to my question is yes, the restore function
should do the same and my idea of the matter is wrong. If the answer
to my question is no, then your idea on this matter is the right one.
4. Does casting the numeric scalar values (null_frac, correlation,
avg_width, n_distinct) to text make sense, since we have to put them
through type-specific input functions anyway?
In terms of JSON, it makes the use of a representation simpler. I
don't think that we need to apply a strict casting when fetching them.
5. Do any of these jsonb-twiddling operations look generically useful
enough to put in jsonb_utils.c?
This one I cannot say yet. It depends on the patch as written.
@Tomas (now added in CC for confirmation): would you see a problem
against applying a JSONB data type to the argument for the restore of
extended stats? This level of data serialization would be required
when inserting data for what would show up into pg_stats_ext_exprs.
--
Michael
For this warning in the regression tests, you should only need one
element, reducing the number of input lines?
That's true IF we decide that missing expression keys are a thing that we
allow, and per conversation above it's not clear yet that we do.
If ANALYZE can generate some partial data for a set of expressions (aka
generating some data for a portion of the expressions defined but skip
some of them because of reason x_y_z), the restore function needs to
cope with this policy (I'd need to double-check the code to be sure,
not sure on top of my mind now and I cannot do that today,
unfortunately).
In this latest patch, I decided that we can tolerate missing exprs
parameters, and as you'll see from the 0002 patch, it does result in
considerable decluttering.
Hmm. I would not rely on the ordering of the items as they are
scanned, that seems like a recipe for disaster. We have a text
representation of the expression, as of pg_stats_ext_exprs.expr. This
could be mapped with the elements of the text[] in pg_stats_ext.exprs.
Here is a wilder idea: why not just putting the expression text itself
in the data given in input of the restore function rather than a
guessed argument number? For the case of manual stats injections,
that kind of makes things simpler.
The patch I submitted here hasn't yet implemented using the expression text
as a key, and honestly I'm not too hyped on the prospect of trying.
I think that:
1. I don't think we can guarantee that the expression node text is stable
across major versions, and that would break upgrades, the primary function
of this code.
2. Anyone wanting to modify/hack the exprs values has almost certainly
extracted it using the jsonb_build_object() code in pg_dump, so they
already have all expressions before editing.
3. Array unnest() has proven to give a stable order in all tests so far.
4. We don't decompose mcv into its parts, so why do that for exprs?
Yeah, it sounds to me that we should just set ok=false and give up
rather than have a semi-filled set of numbers for a single extended
stats object. There is an argument in favor of that: it can simplify
the detection of missing stats for a single extended stats definition.
I understand that you'd want to keep going with loading the data even
if it's partial. My question is: is it possible for ANALYZE to fill
in only a portion of the expressions and can these be partially
skipped? If the answer to my question is yes, the restore function
should do the same and my idea of the matter is wrong. If the answer
to my question is no, then your idea on this matter is the right one.
In which case I think you'll like the latest patchset.
4. Does casting the numeric scalar values (null_frac, correlation,
avg_width, n_distinct) to text make sense, since we have to put them
through type-specific input functions anyway?
In terms of JSON, it makes the use of a representation simpler. I
don't think that we need to apply a strict casting when fetching them.
Everything is jbvStrings now.
@Tomas (now added in CC for confirmation): would you see a problem
against applying a JSONB data type to the argument for the restore of
extended stats? This level of data serialization would be required
when inserting data for what would show up into pg_stats_ext_exprs.
Very interested to hear his thoughts as well.
Attachments:
v3-0001-Add-support-for-exprs-in-pg_restore_extended_stat.patch
On Mon, Feb 02, 2026 at 09:17:03PM -0500, Corey Huinker wrote:
1. I don't think we can guarantee that the expression node text is stable
across major versions, and that would break upgrades, the primary function
of this code.
2. Anyone wanting to modify/hack the exprs values has almost certainly
extracted it using the jsonb_build_object() code in pg_dump, so they
already have all expressions before editing.
3. Array unnest() has proven to give a stable order in all tests so far.
4. We don't decompose mcv into it's parts, so why do that for exprs?
Not including a trace regarding to which expression a row refers to
sounds like a design mistake to me, particularly because JSON is, by
design, JSON, and we don't have ordering requirements. If we don't
include an expression text, I'm OK to give up on this idea. But let's
at least include a negative attribute number with an "attribute"
field. We could cross-check it with the number of expressions defined
in the statext object.
On second thought, as we already use negative attribute numbers for
ndistinct and dependencies, perhaps it's not a bad choice to use a
negative number anyway. As the attribute number assigned depends on
the order of the elements in pg_stats_ext.exprs, I'd suggest to tweak
the pg_dump query to rely on that rather than ORDINALITY and the order
where the rows of pg_stats_ext_exprs are scanned. Using the order of
the elements in the definition of the stats object is predictable. A
sequential scan of a catalog offers no real guarantees.
--
Michael
Not including a trace regarding to which expression a row refers to
sounds like a design mistake to me, particularly because JSON is, by
design, JSON, and we don't have ordering requirements.
I'd have no issue including the expr definition in the existing object
container - an optional parameter that is allowed (no warning) but which has no
effect. We could then compare that optional parameter against the text
representation of the node and issue a non-failing warning on any
differences.
But in a way we're already getting this type checking if the expressions
have different datatypes and either of them has most_common_values,
histogram_bounds, or most_common_elems - as all of those require input
coercion of the array values to datatypes of the expressions, and any
failure in any of those dooms the whole exprs array.
It's true that jsonb reorders object keys, but it does not jumble the
order of elements inside an array. The 2-D text[] export has exactly this
same problem, btw.
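A quick illustration of that behavior (literals chosen for the demo):

```sql
-- jsonb normalizes object keys (sorted, de-duplicated) but preserves
-- the order of elements inside an array:
SELECT '{"b": 1, "a": 2}'::jsonb;      -- {"a": 2, "b": 1}
SELECT '[{"b": 1}, {"a": 2}]'::jsonb;  -- [{"b": 1}, {"a": 2}]
```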
If we don't
include an expression text, I'm OK to give up on this idea. But let's
at least include a negative attribute number with an "attribute"
field.
That feels like the same thing as including the expr definition as a
parameter, and I'm as amenable to this as I am to the expr node text.
We could cross-check it with the number of expressions defined
in the statext object.
Yep.
On second thought, as we already use negative attribute numbers for
ndistinct and dependencies, perhaps it's not a bad choice to use a
negative number anyway. As the attribute number assigned depends on
the order of the elements in pg_stats_ext.exprs, I'd suggest to tweak
the pg_dump query to rely on that rather than ORDINALITY and the order
where the rows of pg_stats_ext_exprs are scanned. Using the order of
the elements in the definition of the stats object is predictable. A
sequential scan of a catalog offers no real guarantees.
This has come up before. pg_stats, infuriatingly, has an attname but not an
attnum.
pg_stats_ext_exprs, equally infuriatingly, does not expose
pg_statistic_ext.oid, which means to join back from pg_stats_ext_exprs to
pg_statistic_ext, we'd have to re-join the namespace and...yeah, it gets
yucky.
But instead of doing that, we could fetch
the pg_get_statisticsobjdef_expressions() when we first fetch the extstats
definitions, use that as a parameter in the dump query, and unnest that
with ordinality and join back to pg_stats_ext_expr. I think that's the best
way of enforcing the order of the expressions and not just trusting the
unnest() in the view definition.
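A rough sketch of that query shape (the :stxoid parameter and the join
condition are illustrative, not the actual pg_dump query):

```sql
-- Sketch: take expression order from the stats object definition via
-- pg_get_statisticsobjdef_expressions(), then join back to the view.
SELECT d.ord, d.expr, e.null_frac, e.avg_width
FROM unnest(pg_get_statisticsobjdef_expressions(:stxoid))
     WITH ORDINALITY AS d(expr, ord)
LEFT JOIN pg_stats_ext_exprs e ON e.expr = d.expr
ORDER BY d.ord;
```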
In other news, pg_get_statisticsobjdef_expressions appears to be
undocumented, though it has existed since v14, when extended stats
expressions were introduced, so I think we're safe there.
On Mon, Feb 02, 2026 at 10:34:14PM -0500, Corey Huinker wrote:
I'd have no issue including the expr definition in the existing object
container - an optional parameter that allow (no warning) but which has no
effect. We could then compare that optional parameter against the text
representation of the node and issue a non-failing warning on any
differences.
Well, you are making me doubt that relying on the expression text is a
good idea at all, so your argument regarding its portability across
releases is winning. Let's really add a negative attnum built based
on the position of the expressions in pg_stats_ext.exprs, though.
This should also lead to better error messages on restore depending on
the expression data we expect (I'll check ANALYZE about the
possibility of partial generation of the expression stats data
tomorrow or so).
But in a way we're already getting this type checking if the expressions
have different datatypes and either of them has most_common_values,
histogram_bounds, or most_common_elems - as all of those require input
coercion of the array values to datatypes of the expressions, and any
failure in any of those dooms the whole exprs array.
Yeah, but that does not help if two expressions use the same types,
which is possible. On the contrary, that sounds kind of confusing
to rely on for this portion of the statext data to restore.
--
Michael
This should also lead to better error messages on restore depending on
the expression data we expect (I'll check ANALYZE about the
possibility of partial generation of the expression stats data
tomorrow or so).
compute_expr_stats() is a bit confusing. There's a tcnt counter that looks
like it was meant to be independent of the loop variable, but currently
there is no code path that skips setting the datum at the tcnt index before
incrementing tcnt. I checked the history, and it's like that all the way
back to the introduction of expression stats in a4d75c86bf1522.
But in a way we're already getting this type checking if the expressions
have different datatypes and either of them have most_common_values,
histogram_bounds, or most_common_elems - as all of those require input
coercion of the array values to datatypes of the expressions, and any
failure in any of those dooms the whole exprs array.
Yeah, but that does not help if two expressions use the same types,
which is possible. On the contrary, that sounds kind of confusing
to rely on for this portion of the statext data to restore.
I wouldn't suggest relying on it, but our test cases do currently have
multi-expression stats, and if the order of them was getting jumbled due to
parallel query or something we'd be getting intermittent failures.
Anyway, I added AttrNumber exprnum to the errhint() on the errors
encountered in import_expressions, all EXCEPT for the ones in
statatt_build_stavalues, as the error messages there are localized to the
array values being imported, and I'm not sure how to append an errhint() to
an already existing error_data before rethrowing it.
Attachments:
v4-0001-Add-support-for-exprs-in-pg_restore_extended_stat.patch
On Tue, Feb 03, 2026 at 03:27:53AM -0500, Corey Huinker wrote:
This should also lead to better error messages on restore depending on
the expression data we expect (I'll check ANALYZE about the
possibility of partial generation of the expression stats data
tomorrow or so).
compute_expr_stats() is a bit confusing. There's a tcnt counter that looks
like it was meant to be independent of the loop variable, but currently
there is no code path that skips setting the datum at the tcnt index before
incrementing tcnt. I checked the history, and it's like that all the way
back to the introduction of expression stats in a4d75c86bf1522.
I have finally spent some time studying a bit more how these stats are
computed, inserted into the catalogs by analyze, then loaded. Indeed,
we first have the computation of the numbers in compute_expr_stats(),
all of that introduced by a4d75c86bf1522.
The code path that catches my attention is how the data is inserted,
though, as in serialize_expr_stats(), and we have the following thing
that skips the insertion of a pg_statistic tuple for an expression
with invalid data (I thought this should be the case, now I am sure it
is the case):
if (!stats->stats_valid)
{
astate = accumArrayResult(astate,
(Datum) 0,
true,
typOid,
CurrentMemoryContext);
continue;
}
So depending on what ANALYZE thinks in terms of what data is valid or
not, it is entirely possible that we don't insert any data at all for
some of the expressions.
When the expression data is loaded in examine_variable()@selfuncs.c,
the code expects to find a valid tuple each time it calls
statext_expressions_load(). There is a loop before doing the load
to make sure that there is any stats data available.
That pretty much answers the previous question I had that I could not
answer to yet a few mails ago: the restore function needs to be
flexible with the restore of expressions, and it looks like we should
not expect all the inputs to be set. It is OK to not restore any data
at all for some of the expressions, or skip the whole set of expressions
if nothing is given in input because we found no expression stats data.
As far as I can see, the proposed patch is incorrect and inconsistent
with the backend regarding all that. For example:
CREATE TABLE test (name text);
CREATE STATISTICS stats_exprs (dependencies)
ON lower(name), upper(name)
FROM test;
So based on my read of the code, we could expect zero, one or two
pg_statistic tuples based on what ANALYZE thinks. The restore
function always expects two elements in its inner array. IMO, we
should have the flexibility to pass down 0, 1 or 2 elements in this
1-D array, without failing as long as the elements of the input data
have valid data.
The patch allows an object pattern like this one, for example, with a
second object being empty:
'exprs', '[{"avg_width": "7", [all valid fields] .. }, {}]
Unfortunately, on such input we insert TWO pg_statistic tuples. We
could rely on the order of the items in the 1-D array, but it seems to
me that we will have a much easier life if we shape the input data
based on the following rules, matching with the policy that ANALYZE
can allow:
- An input array can have as many elements as the number of
expressions.
- The expression a single element refers to ought to be tracked in
each element, individually. This means at least the addition of a
negative attribute number in EACH element (I am entirely rejecting my
wilder idea of the expression text at this stage). If the "exprs"
data has no element for a given expression, that's fine, this has the
same meaning as no tuples found in pg_statistic for an expression.
- An element in the input array should have all its key/value pairs
set, if set for an expression.
Does this analysis make sense to you?
--
Michael
The code paths that catch my attention is how the data is inserted,
though, as in serialize_expr_stats(), and we have the following thing
that skips the insertion of a pg_statistic tuple for an expression
with invalid data (I thought this should be the case, now I am sure it
is the case):
if (!stats->stats_valid)
{
astate = accumArrayResult(astate,
(Datum) 0,
true,
typOid,
CurrentMemoryContext);
continue;
}
So depending on what ANALYZE thinks in terms of what data is valid or
not, it is entirely possible that we don't insert any data at all for
some of the expressions.
+1
"valid" in this case could also mean "totally inconclusive and therefore
not worth storing".
That pretty much answers the previous question I had that I could not
answer to yet a few mails ago: the restore function needs to be
flexible with the restore of expressions, and it looks like we should
not expect all the inputs to be set. It is OK to not restore any data
at all for some of the expressions, or skip the all set of expressions
if nothing is given in input because we found no expression stats data.
I'm all for tolerating an undercount in exported statistics. But this got
me wondering how the system view handles lining up stats to its expression
counterpart when there's an undercount.
And, uh, well, it doesn't:
FROM pg_statistic_ext s JOIN pg_class c ON (c.oid = s.stxrelid)
LEFT JOIN pg_statistic_ext_data sd ON (s.oid = sd.stxoid)
LEFT JOIN pg_namespace cn ON (cn.oid = c.relnamespace)
LEFT JOIN pg_namespace sn ON (sn.oid = s.stxnamespace)
JOIN LATERAL (
SELECT unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS
expr,
unnest(sd.stxdexpr)::pg_statistic AS a
) stat ON (stat.expr IS NOT NULL)
Two unnests like that will just null-pad the shorter list, so if there is a
missing pg_statistic, then the system view could be giving us stats from a
different expression. Now, we would probably detect that if the data types
don't line up...but the data types could be compatible enough...ick.
As far as I can see, the proposed patch is incorrect and inconsistent
with the backend regarding all that. For example:
CREATE TABLE test (name text);
CREATE STATISTICS stats_exprs (dependencies)
ON lower(name), upper(name)
FROM test;
So based on my read of the code, we could expect zero, one or two
pg_statistic tuples based on what ANALYZE thinks. The restore
function always expects two elements in its inner array. IMO, we
should have the flexibility to pass down 0, 1 or 2 elements in this
1-D array, without failing as long as the elements of the input data
have valid data.
The patch allows an object pattern like this one, for example, with a
second object being empty:
'exprs', '[{"avg_width": "7", [all valid fields] .. }, {}]
Unfortunately, on such input we insert TWO pg_statistic tuples. We
could rely on the order of the items in the 1-D array, but it seems to
me that we will have a much easier life if we shape the input data
based on the following rules, matching with the policy that ANALYZE
can allow:
- An input array can have as many elements as the number of
expressions.
+1
- The expression a single element refers to ought to be tracked in
each element, individually. This means at least the addition of a
negative attribute number in EACH element (I am entirely rejecting my
wilder idea of the expression text at this stage). If the "exprs"
data has no element for a given expression, that's fine, this has the
same meaning as no tuples found in pg_statistic for an expression.
I agree in principle, but I also fear that the system view may be wrong, so
even if we lined up the expression text from pg_statistic_ext with the
expression text from pg_stats_ext, the view itself would have already
mis-identified which stats go where, and we have no direct access to
pg_statistic_ext_data because of the security barrier.
And before you ask, there is *nothing* in pg_statistic_ext_data that
indicates which expression it belongs to in pg_statistic_ext beyond its
position within the array, and we now know that isn't guaranteed.
- An element in the input array should have all its key/value pairs
set, if set for an expression.
That won't fly, because there will be pg_dumps from prior versions that
won't know about future keys which would then be required.
Does this analysis make sense to you?
It does, but I'm not seeing how we line up the right stats to the right
element in the event of an undercount. I've been digging around in the
planner, and it seems like it just takes the
Anum_pg_statistic_ext_data_stxdexpr, expands it, and then
statext_expressions_load just assumes that the array will have something at
that index. I hope I'm misreading this, because we may have uncovered
something broken.
Then again it could be that expressions always have stats built for them,
so the undercount never actually happens, thus the arrays all line up.
On Tue, Feb 03, 2026 at 11:05:58PM -0500, Corey Huinker wrote:
So depending on what ANALYZE thinks in terms of what data is valid or
not, it is entirely possible that we don't insert any data at all for
some of the expressions.
+1
"valid" in this case could also mean "totally inconclusive and therefore
not worth storing".
Yeah, I guess so in the context of ANALYZE.
My initial read of statext_expressions_load() was actually a bit
wrong: examine_variable() will always go through the load(), but if
serialize_expr_stats() has skipped one expression due to invalid data,
we just store a "(Datum) 0" aka NULL in the array. So there is some
data, and that also means that we will always have a number of
elements equal to the number of expressions (accumArrayResult adds the
value). Sorry for the confusion!
It also means that it should be possible to write
import_pg_statistic() so that it does the same thing as the
serialization. How about just allowing a NULL in the JSON array as an
equivalent of no data found for an expression? If you do that, your
initial idea about requiring the same number of elements in the array
would be consistent with the backend, without having to include any
information about the expression text or its attnum in the elements of
the JSON array.
For example, if a statext object has three expressions, the first and
second ones have no data but not the third, then we go like that:
[null,null,{"avg_width": "7", ... }]
A NULL value is appended for the first two ones in the result
ArrayBuildState to mark them as not valid, and the third one has valid
data. When loading this data statext_expressions_load() will see the
"invalid" value and act accordingly.
That would require a couple of extra lines in
import_pg_statistic(), so that accumArrayResult() can be called with
disnull=true to do the same as the backend, with "ok" set to true to
account for the fact that it is a valid pattern.
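In code, mirroring the backend snippet quoted earlier in the thread (a
sketch only; the surrounding loop and variable names are illustrative):

```c
/* Sketch: treat a JSON null element as "no stats for this expression",
 * appending a NULL array element exactly like serialize_expr_stats(). */
if (elem_is_null)
{
    astate = accumArrayResult(astate,
                              (Datum) 0,
                              true,     /* disnull */
                              typOid,
                              CurrentMemoryContext);
    ok = true;          /* a NULL element is a valid pattern */
    continue;
}
```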
And before you ask, there is *nothing* in pg_statistic_ext_data that
indicates which expression it belongs to in pg_statistic_ext beyond its
position within the array, and we now know that isn't guaranteed.
Yeah, I wanted to rely on something external, but it does not seem we
can get that, so. Your idea to rely on the number of elements in the
array to be the same as the number of expressions would enforce that.
If the MCV inputs complain, well, at least the user would get some
information about what is wrong.
- An element in the input array should have all its key/value pairs
set, if set for an expression.
That won't fly, because there will be pg_dumps from prior versions that
won't know about future keys which would then be required.
Okay, point taken.
It does, but I'm not seeing how we line up the right stats to the right
element in the event of an undercount. I've been digging around in the
planner, and it seems like it just takes the
Anum_pg_statistic_ext_data_stxdexpr, expands it, and then
statext_expressions_load just assumes that the array will have something at
that index. I hope I'm misreading this, because we may have uncovered
something broken.
It looks like your take is right and that my first impression was
wrong: if an expression has invalid data accumArrayResult() stores a
NULL value, so we will never be short. So there is some data, and
your patch could be tweaked so that an expression in a set is marked as
invalid by storing a NULL in the result array at its location.
--
Michael
My initial read of statext_expressions_load() was actually a bit
wrong: examine_variable() will always go through the load(), but if
serialize_expr_stats() has skipped one expression due to invalid data,
we just store a "(Datum) 0" aka NULL in the array. So there is some
data, and that also means that we will always have a number of
elements equal to the number of expressions (accumArrayResult adds the
value). Sorry for the confusion!
That is an enormous relief.
It also means that it should be possible to write
import_pg_statistic() so that it does the same thing as the
serialization. How about just allowing a NULL in the JSON array as an
equivalent of no data found for an expression? If you do that, your
initial idea about requiring the same number of elements in the array
would be consistent with the backend, without having to include any
information about the expression text or its attnum in the elements of
the JSON array.
Whew.
That would require a couple of extra lines in
import_pg_statistic(), so that accumArrayResult() can be called with
disnull=true to do the same as the backend, with "ok" set to true to
account for the fact that it is a valid pattern.
+1
- An element in the input array should have all its key/value pairs
set, if set for an expression.
That won't fly, because there will be pg_dumps from prior versions that
won't know about future keys which would then be required.
Okay, point taken.
We also get *much* shorter regression tests if we allow values irrelevant to
the test to be excluded.
It does, but I'm not seeing how we line up the right stats to the right
element in the event of an undercount. I've been digging around in the
planner, and it seems like it just takes the
Anum_pg_statistic_ext_data_stxdexpr, expands it, and then
statext_expressions_load just assumes that the array will have something at
that index. I hope I'm misreading this, because we may have uncovered
something broken.
It looks like your take is right and that my first impression was
wrong: if an expression has invalid data accumArrayResult() stores a
NULL value, so we will never be short. So there is some data, and
your patch could be tweaked so that an expression in a set is marked as
invalid by storing a NULL in the result array at its location.
I'll switch to adding the nulls to the array result, and add tests for both
leading and trailing expr missing.
Mild change of subject, it seems that we can't get the expression fake
attnum context into the errors we re-throw in statatt_build_stavalues - it
might make sense to bring a version of that function into
extended_stats_funcs where we can add the extra parameters for context (and
avoid the need for a text datum version of some longish strings that we've
already just converted from json-string to c-string). If I did
make a new function, then that'd be 2 statatt_* functions that no longer
need to be visible outside of attribute_stats.c. Thoughts on both making
the new function, and maybe sending a few of these statatts back to
static-land?
On Wed, Feb 04, 2026 at 12:14:06AM -0500, Corey Huinker wrote:
I'll switch to adding the nulls to the array result, and add tests for both
leading and trailing expr missing.
OK, fine by me. Thanks. Let's make something happen to finish all
that.
Mild change of subject, it seems that we can't get the expression fake
attnum context into the errors we re-throw in statatt_build_stavalues - it
might make sense to bring a version of that function into
extended_stats_funcs where we can add the extra parameters for context (and
avoid the need for a text datum version of some longish strings that we've
already just converted from json-string to c-string). If I did
make a new function, then that'd be 2 statatt_* functions that no longer
need to be visible outside of attribute_stats.c. Thoughts on both making
the new function, and maybe sending a few of these statatts back to
static-land?
Hmm. I am not sure, that depends. How much additional information
would these extra parameters bring to the errors generated in
statatt_build_stavalues()? We could also set an error context
callback (ErrorContextCallback) within import_pg_statistic() or in the
loop that calls the routine, with some data based on the counter of
"numexprs" to provide more context about where an error is happening.
I have used that in the past to avoid complicating functions across
multiple levels of a stack (for example, see ReindexPartitions() in
indexcmds.c with its ReindexErrorInfo).
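For reference, the shape of such a callback would be roughly this (the
struct and function names are illustrative, not from any patch):

```c
/* Sketch of an ErrorContextCallback for the exprs restore loop. */
typedef struct ExprsErrorInfo
{
    int         exprnum;    /* which expression is being restored */
    const char *key;        /* JSON key currently being processed */
} ExprsErrorInfo;

static void
exprs_error_callback(void *arg)
{
    ExprsErrorInfo *info = (ExprsErrorInfo *) arg;

    errcontext("expression %d, parameter \"%s\"", info->exprnum, info->key);
}

/* In the caller, around the per-expression work:
 *
 *     ErrorContextCallback errcallback;
 *
 *     errcallback.callback = exprs_error_callback;
 *     errcallback.arg = &errinfo;
 *     errcallback.previous = error_context_stack;
 *     error_context_stack = &errcallback;
 *     ... call statatt_build_stavalues() etc ...
 *     error_context_stack = errcallback.previous;
 */
```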
--
Michael
Mild change of subject, it seems that we can't get the expression fake
attnum context into the errors we re-throw in statatt_build_stavalues - it
might make sense to bring a version of that function into
extended_stats_funcs where we can add the extra parameters for context (and
avoid the need for a text datum version of some longish strings that we've
already just converted from json-string to c-string). If I did
make a new function, then that'd be 2 statatt_* functions that no longer
need to be visible outside of attribute_stats.c. Thoughts on both making
the new function, and maybe sending a few of these statatts back to
static-land?
Hmm. I am not sure, that depends. How much additional information
would these extra parameter bring to the errors generated in
statatt_build_stavalues().
Just the index of which expression and the name of the key that fed in the
offending value string.
We could also set an error context
callback (ErrorContextCallback) within import_pg_statistic() or in the
loop that calls the routine, with some data based on the counter of
"numexprs" to provide more context about where an error is happening.
If you think it would help, I'd be game for that. It's been a while since I
used one of those callbacks, though.
I have used that in the past to avoid complicating functions across
multiple levels of a stack (for example, see ReindexPartitions() in
indexcmds.c with its ReindexErrorInfo).
This may be another such case.
On Wed, Feb 04, 2026 at 01:13:05AM -0500, Corey Huinker wrote:
Hmm. I am not sure, that depends. How much additional information
would these extra parameters bring to the errors generated in
statatt_build_stavalues()?
Just the index of which expression and the name of the key that fed in the
offending value string.
Okay. It really sounds to me like an ErrorContextCallback would help
you here. I don't see a need to complicate statatt_build_stavalues(),
but my intuition can also be wrong.
--
Michael
Okay. It really sounds to me like an ErrorContextCallback would help
you here. I don't see a need to complicate statatt_build_stavalues(),
but my intuition can also be wrong.
Sigh. errsave_finish() only calls errfinish() on elevel >= ERROR, and
that's where the context stack is walked. The failed work in progress is
attached too, mostly as a cautionary tale.
v5-0001 includes tests for null expressions on the leading and trailing end.