use of SPI by postgresImportForeignStatistics
Hi,
I'm concerned about the way that postgresImportForeignStatistics is
doing what it does. First, it's using SPI to execute two constant SQL
queries, attimport_sql and attclear_sql. However, it doesn't seem to
set the search_path, and these queries aren't fully robust against a
possibly-unsafe search_path (e.g. the call to
pg_restore_attribute_stats is schema-qualified, but the type names are
not!). Furthermore, the function we're calling takes a schema name and
a relation name as two separate arguments, rather than a single OID
argument. I'm not sure it's completely guaranteed that we'll end up
affecting the same relation that we locked in
postgresImportForeignStatistics, because e.g. what if the containing
schema is being concurrently renamed?
But even if it is absolutely guaranteed that we latch onto the correct
relation, this seems like the fundamentally wrong way to do this kind
of thing. It seems crazy to me that instead of exposing an interface
that would be well-suited to direct use by an FDW, the
statistics-import stuff exposes an interface that can only be
reasonably called via the FunctionCallInfo interface, which then
results in postgres_fdw needing to jump through hoops to construct and
execute SQL statements.
This doesn't look entirely easy to straighten out. A function like
attribute_statistics_update() is just not a good idea -- it mixes
together considerations that only exist when you want to call
something from SQL with general validity checking that's needed
regardless.
But I don't think we should ship this like this. Best case scenario,
it's overly complicated. Medium case scenario, it's also unreliable.
Worst case scenario, there are security vulnerabilities in here.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi Robert,
On Tue, Jun 16, 2026 at 2:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
I'm concerned about the way that postgresImportForeignStatistics is
doing what it does. First, it's using SPI to execute two constant SQL
queries, attimport_sql and attclear_sql. However, it doesn't seem to
set the search_path, and these queries aren't fully robust against a
possibly-unsafe search_path (e.g. the call to
pg_restore_attribute_stats is schema-qualified, but the type names are
not!).
Good catch!
Furthermore, the function we're calling takes a schema name and
a relation name as two separate arguments, rather than a single OID
argument. I'm not sure it's completely guaranteed that we'll end up
affecting the same relation that we locked in
postgresImportForeignStatistics, because e.g. what if the containing
schema is being concurrently renamed?
Ugh. As pg_restore_attribute_stats/pg_restore_relation_stats are
volatile functions, SPI executes the queries in read-write mode,
causing an error, like "schema "foo" does not exist", or other
unexpected results in such a case. Not sure what to do about this
issue right now. Will think about it some more.
But even if it is absolutely guaranteed that we latch onto the correct
relation, this seems like the fundamentally wrong way to do this kind
of thing. It seems crazy to me that instead of exposing an interface
that would be well-suited to direct use by an FDW, the
statistics-import stuff exposes an interface that can only be
reasonably called via the FunctionCallInfo interface, which then
results in postgres_fdw needing to jump through hoops to construct and
execute SQL statements.
I thought it would be a good idea to use
pg_restore_attribute_stats/pg_restore_relation_stats, because future
changes in attribute/relation stats would be absorbed by these
functions, which would lower the maintenance cost of this feature.
Thanks for the comments!
Best regards,
Etsuro Fujita
On Tue, Jun 16, 2026 at 8:04 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
I thought it would be a good idea to use
pg_restore_attribute_stats/pg_restore_relation_stats, because future
changes in attribute/relation stats would be absorbed by these
functions, which would lower the maintenance cost of this feature.
I agree that we want to reuse code, but this isn't the right way to do
it. For example, when we want to look up the OID of a relation from
SQL, we can say 'whatever'::regclass::oid. But when we want to do the
same thing from C, we don't construct a SELECT statement and execute
it via SPI. Instead, we have functions like RangeVarGetRelidExtended()
which provide access to the same underlying functionality more
directly.
Another example is converting strings to integers. The user calls
int4in(), which then hands off the call to pg_strtoint32_safe(), which
can also be called via pg_strtoint32(). Hence, C code should prefer to
use pg_strtoint32(), while SQL will go through int4in(). Both
ultimately reach the underlying pg_strtoint32_safe() function, but the
interfaces are different, so that we can have it be suitable both for
SQL access and for C access.
This needs to work more like that. The underlying code that ingests
and updates the stats should be shared, but the stuff that is specific
to a FunctionCallInfo interface needs to be separated out so that we
don't need to go through that when calling from C.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Jun 16, 2026 at 10:34 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jun 16, 2026 at 8:04 AM Etsuro Fujita <etsuro.fujita@gmail.com>
wrote:I thought it would be a good idea to use
pg_restore_attribute_stats/pg_restore_relation_stats, because future
changes in attribute/relation stats would be absorbed by these
functions, which would lower the maintenance cost of this feature.I agree that we want to reuse code, but this isn't the right way to do
it. For example, when we want to look up the OID of a relation from
SQL, we can say 'whatever'::regclass::oid. But when we want to do the
same thing from C, we don't construct a SELECT statement and execute
it via SPI. Instead, we have functions like RangeVarGetRelidExtended()
which provide access to the same underlying functionality more
directly.Another example is converting strings to integers. The user calls
int4in(), which then hands off the call to pg_strtoint32_safe(), which
can also be called via pg_strtoint32(). Hence, C code should prefer to
use pg_strtoint32(), while SQL will go through int4in(). Both
ultimately reach the underlying pg_strtoint32_safe() function, but the
interfaces are different, so that we can have it be suitable both for
SQL access and for C access.This needs to work more like that. The underlying code that ingests
and updates the stats should be shared, but the stuff that is specific
to a FunctionCallInfo interface needs to be separated out so that we
don't need to go through that when calling from C.
Back when pg_restore_attribute_stats and pg_restore_relation_stats were
being converted from positional to variadic functions, there was a patchset
or two (possibly un-posted, because I couldn't find it just now) where the
variadic functions re-marshalled the arguments into a call to a
fixed-parameter function. If that model were revived, we could have a more
conventional DirectFunctonCall() interface.
Is it possible that we never built a direct function call interface to a
variadic because we never needed one? Perhaps that time is now.
On Tue, Jun 16, 2026 at 2:08 PM Corey Huinker <corey.huinker@gmail.com>
wrote:
On Tue, Jun 16, 2026 at 10:34 AM Robert Haas <robertmhaas@gmail.com>
wrote:On Tue, Jun 16, 2026 at 8:04 AM Etsuro Fujita <etsuro.fujita@gmail.com>
wrote:I thought it would be a good idea to use
pg_restore_attribute_stats/pg_restore_relation_stats, because future
changes in attribute/relation stats would be absorbed by these
functions, which would lower the maintenance cost of this feature.I agree that we want to reuse code, but this isn't the right way to do
it. For example, when we want to look up the OID of a relation from
SQL, we can say 'whatever'::regclass::oid. But when we want to do the
same thing from C, we don't construct a SELECT statement and execute
it via SPI. Instead, we have functions like RangeVarGetRelidExtended()
which provide access to the same underlying functionality more
directly.Another example is converting strings to integers. The user calls
int4in(), which then hands off the call to pg_strtoint32_safe(), which
can also be called via pg_strtoint32(). Hence, C code should prefer to
use pg_strtoint32(), while SQL will go through int4in(). Both
ultimately reach the underlying pg_strtoint32_safe() function, but the
interfaces are different, so that we can have it be suitable both for
SQL access and for C access.This needs to work more like that. The underlying code that ingests
and updates the stats should be shared, but the stuff that is specific
to a FunctionCallInfo interface needs to be separated out so that we
don't need to go through that when calling from C.Back when pg_restore_attribute_stats and pg_restore_relation_stats were
being converted from positional to variadic functions, there was a patchset
or two (possibly un-posted, because I couldn't find it just now) where the
variadic functions re-marshalled the arguments into a call to a
fixed-parameter function. If that model were revived, we could have a more
conventional DirectFunctonCall() interface.Is it possible that we never built a direct function call interface to a
variadic because we never needed one? Perhaps that time is now.
Obviously, even that wouldn't get us to the no-FunctionCallInfo-at-all
goal, but it would get us out of the SPI situation. If we did have a
non-fcinfo function API, that API would either have to pass char *params
almost exclusively, or else each caller would have to do the translation or
float-array strings to double[] and such, which would be a lot of work for
the caller.
Would you be ok with a function like attribute_statistics_update, but
purely with cstring args? Obviously the callers would have to modify their
calls each time a new stat param is added, but that is seemingly preferable
for you than the current situation.
On Tue, Jun 16, 2026 at 3:11 PM Corey Huinker <corey.huinker@gmail.com> wrote:
Obviously, even that wouldn't get us to the no-FunctionCallInfo-at-all goal, but it would get us out of the SPI situation. If we did have a non-fcinfo function API, that API would either have to pass char *params almost exclusively, or else each caller would have to do the translation or float-array strings to double[] and such, which would be a lot of work for the caller.
Would you be ok with a function like attribute_statistics_update, but purely with cstring args? Obviously the callers would have to modify their calls each time a new stat param is added, but that is seemingly preferable for you than the current situation.
I feel pretty strongly that we want a way to identify the relation by
OID rather than by schema and name. I'm not sure whether attributes
should be identified by name or by number in this context. Apart from
those things, if passing strings and leaving it up to the called
function to interpret them makes most sense, that's OK with me.
I agree with you that getting rid of SPI is a much higher priority
than getting rid of the FunctionCallInfo stuff altogether. I could
live with DirectFunctionCall if we don't have a better option. I'm a
bit suspicious that this will lead to code bloat in postgres_fdw or
elsewhere, but maybe not.
--
Robert Haas
EDB: http://www.enterprisedb.com
Would you be ok with a function like attribute_statistics_update, but
purely with cstring args? Obviously the callers would have to modify their
calls each time a new stat param is added, but that is seemingly preferable
for you than the current situation.I feel pretty strongly that we want a way to identify the relation by
OID rather than by schema and name.
Previous versions of pg_set_attribute_stats() were like that, but it went
away as the use-cases folded into pg_restore_attribute_stats(). So it's
do-able, but will make sharing code with the existing function harder in
the near term. We may have to settle for something that takes the oid, but
then plugs into the existing functions until such time as we can refactor
the existing callers.
I'm not sure whether attributes
should be identified by name or by number in this context. Apart from
those things, if passing strings and leaving it up to the called
function to interpret them makes most sense, that's OK with me.
Identifying attributes by attnum is mostly for index expressions, so
attname would be the preference here.
I agree with you that getting rid of SPI is a much higher priority
than getting rid of the FunctionCallInfo stuff altogether. I could
live with DirectFunctionCall if we don't have a better option.
Ok, good to know that I have that as a fallback position.
I think the right first step is to create the functions
import_relation_stats() and import_attribute_stats() which take the oid of
the destination relation and the rest of the parameters are const char
pointers, which is the ideal case for a bunch of values being fetched from
PQgetValue(). Those functions will for now just make the call to
attribute_statistics_update() / relation_statistics_update(), but at least
it will be localized inside relation_stats.c and attribute_stats.c where we
can make internal refactors without disturbing anyone else.
Now, we *could* make these 2 new functions take the oid/oid+attname
followed by a variadic array of keyname+value string pairs like the
existing pg_restore_*_stats() functions, so the functions call signature
would be somewhat future-proofed. Let me know if that violates your vision
of what this should be.
I'm a
bit suspicious that this will lead to code bloat in postgres_fdw or
elsewhere, but maybe not.
postgres_fdw will get slimmer for not having SPI in it. relation_stats.c
and attribute_stats.c will have some temporary bloat until things can be
refactored.
On Tue, Jun 16, 2026 at 6:50 PM Corey Huinker <corey.huinker@gmail.com> wrote:
postgres_fdw will get slimmer for not having SPI in it. relation_stats.c and attribute_stats.c will have some temporary bloat until things can be refactored.
Whatever ends up in core can (indeed must) be reused by every FDW. So
making things simpler for the FDW and more complex for core seems
likely to be the right tradeoff in general.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi Corey,
On Wed, Jun 17, 2026 at 7:50 AM Corey Huinker <corey.huinker@gmail.com> wrote:
I think the right first step is to create the functions import_relation_stats() and import_attribute_stats() which take the oid of the destination relation and the rest of the parameters are const char pointers, which is the ideal case for a bunch of values being fetched from PQgetValue(). Those functions will for now just make the call to attribute_statistics_update() / relation_statistics_update(), but at least it will be localized inside relation_stats.c and attribute_stats.c where we can make internal refactors without disturbing anyone else.
+1
Now, we *could* make these 2 new functions take the oid/oid+attname followed by a variadic array of keyname+value string pairs like the existing pg_restore_*_stats() functions, so the functions call signature would be somewhat future-proofed. Let me know if that violates your vision of what this should be.
IMO I don't think we need to go that far, because 1) the new functions
are provided for FDW authors only, and 2) we don't make changes to the
stats that often.
Thanks!
Best regards,
Etsuro Fujita
IMO I don't think we need to go that far, because 1) the new functions
are provided for FDW authors only, and 2) we don't make changes to the
stats that often.
That would be simpler, to be sure. However, in the past I've received
pushback on functions that had a large number of parameters, and this would
definitely be a large number of parameters, approximately 17, so I thought
I should at least offer the variadic way.
I'm proceeding with the many-parameters model.
On Tue, Jun 16, 2026 at 6:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jun 16, 2026 at 6:50 PM Corey Huinker <corey.huinker@gmail.com>
wrote:postgres_fdw will get slimmer for not having SPI in it. relation_stats.c
and attribute_stats.c will have some temporary bloat until things can be
refactored.Whatever ends up in core can (indeed must) be reused by every FDW. So
making things simpler for the FDW and more complex for core seems
likely to be the right tradeoff in general.
I had assumed we wanted a generic C API to these two functions, but if we
want something that is specific to FDWs, that might change where the
functions land in the header files. It's an easy change to make if we
change our minds, but it would be good to know if we do or do not want
something specific to FDWs. Personally, I think FDWs will be 90-95% of the
usages outside of the existing SQL-defined functions, but 95% is not 100%,
so I'd be inclined to leave them in a statistics/stats utils header.
On Wed, Jun 17, 2026 at 12:03 PM Corey Huinker <corey.huinker@gmail.com> wrote:
I had assumed we wanted a generic C API to these two functions, but if we want something that is specific to FDWs, that might change where the functions land in the header files. It's an easy change to make if we change our minds, but it would be good to know if we do or do not want something specific to FDWs. Personally, I think FDWs will be 90-95% of the usages outside of the existing SQL-defined functions, but 95% is not 100%, so I'd be inclined to leave them in a statistics/stats utils header.
I don't want it to be specific to FDWs, just convenient for FDWs.
Agree they should stay in a statistics header.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jun 18, 2026 at 1:00 AM Corey Huinker <corey.huinker@gmail.com> wrote:
IMO I don't think we need to go that far, because 1) the new functions
are provided for FDW authors only, and 2) we don't make changes to the
stats that often.That would be simpler, to be sure. However, in the past I've received pushback on functions that had a large number of parameters, and this would definitely be a large number of parameters, approximately 17, so I thought I should at least offer the variadic way.
I'm proceeding with the many-parameters model.
I think that that's acceptable, considering that
heap_create_with_catalog() has 21 parameters.
Thanks!
Best regards,
Etsuro Fujita
On Thu, Jun 18, 2026 at 6:43 AM Etsuro Fujita <etsuro.fujita@gmail.com>
wrote:
On Thu, Jun 18, 2026 at 1:00 AM Corey Huinker <corey.huinker@gmail.com>
wrote:IMO I don't think we need to go that far, because 1) the new functions
are provided for FDW authors only, and 2) we don't make changes to the
stats that often.That would be simpler, to be sure. However, in the past I've received
pushback on functions that had a large number of parameters, and this would
definitely be a large number of parameters, approximately 17, so I thought
I should at least offer the variadic way.I'm proceeding with the many-parameters model.
I think that that's acceptable, considering that
heap_create_with_catalog() has 21 parameters.Thanks!
Best regards,
Etsuro Fujita
Attached are two patches to remove SPI from the postgres_fdw.c, replaced by
local C functions. The division into two patches is entirely for
readability. Separately, each one represents a half-measure that would
benefit no one. The first does the minimal changes needed to get the
relation-stats to stop using SPI, and the second does the same for
attribute stats, removes all vestiges of SPI from postgres_fdw.c, and adds
some tests to make sure that a foreign table has the exact stats of the
source loopback table.
The function import_attribute_stats() is a bit parameter-happy, but that's
mostly necessary. I left the non-key parameters as all type char* because
that avoids even more "bool foo_isnull" parameters, and that allows the
caller to not have to rework the various stat types into their
corresponding C types (float, int4, float[], and the anyarrays that
basically aren't cast-able by any client program anyway). I'm open to
requiring the caller to use the right datatypes for some of the parameters,
but as I said there is no way to "win" with the anyarrays, and even the
pg_restore_attribute_stats() function falls back to type text for those.
To be clear, this solves the SPI problem, but questions about the design of
attribute_statistics_update() and relation_statistics_update() remain, but
those concerns are now isolated within their respective files
attribute_stats.c and relation_stats.c. The inefficiencies therein aren't
really in a critical path, so if we wanted to leave them be until v20 they
could, but if time allows I'd at least like try unwinding some of that. But
first, let's get SPI out of postgres_fdw.c.