Import Statistics in postgres_fdw before resorting to sampling.
Attached is my current work on adding remote fetching of statistics to
postgres_fdw, and opening the possibility of doing so to other foreign data
wrappers.
This involves adding two new options to postgres_fdw at the server and
table level.
The first option, fetch_stats, defaults to true at both levels. If enabled,
it will cause an ANALYZE of a postgres_fdw foreign table to first attempt
to fetch relation and attribute statistics from the remote table. If this
succeeds, then those statistics are imported into the local foreign table,
allowing us to skip a potentially expensive sampling operation.
The second option, remote_analyze, defaults to false at both levels, and
only comes into play if the first fetch succeeds but no attribute
statistics (i.e. the stats from pg_stats) are found. If enabled then the
function will attempt to ANALYZE the remote table, and if that is
successful then a second attempt at fetching remote statistics will be made.
If no statistics were fetched, then the operation will fall back to the
normal sampling operation per settings.
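As a sketch of how the two options would be set, they follow postgres_fdw's usual OPTIONS mechanism; the server and table names here are hypothetical:

```sql
-- fetch_stats defaults to true and remote_analyze to false,
-- so only overrides need to be spelled out.
ALTER SERVER remote_pg
    OPTIONS (ADD remote_analyze 'true');

-- A per-table override, e.g. for a foreign table pointing at a remote
-- view, where fetching stats would be wasted effort:
ALTER FOREIGN TABLE remote_view_ft
    OPTIONS (ADD fetch_stats 'false');
```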
Note patches 0001 and 0002 are already a part of a separate thread
/messages/by-id/CADkLM=cpUiJ3QF7aUthTvaVMmgQcm7QqZBRMDLhBRTR+gJX-Og@mail.gmail.com
regarding a bug (0001) and a nitpick (0002) that came about as a
side-effect of this effort, but I expect those to be resolved one way
or another soon. Any feedback on those two can be handled there.
Attachments:
v1-0001-Fix-remote-sampling-tests-in-postgres_fdw.patch (text/x-patch, +12 -13)
v1-0002-Use-psql-vars-to-eliminate-the-need-for-DO-blocks.patch (text/x-patch, +29 -62)
v1-0003-Add-remote-statistics-fetching-to-postgres_fdw.patch (text/x-patch, +746 -3)
On Tue, Aug 12, 2025 at 10:33 PM Corey Huinker <corey.huinker@gmail.com> wrote:
Attached is my current work on adding remote fetching of statistics to postgres_fdw, and opening the possibility of doing so to other foreign data wrappers.
This involves adding two new options to postgres_fdw at the server and table level.
The first option, fetch_stats, defaults to true at both levels. If enabled, it will cause an ANALYZE of a postgres_fdw foreign table to first attempt to fetch relation and attribute statistics from the remote table. If this succeeds, then those statistics are imported into the local foreign table, allowing us to skip a potentially expensive sampling operation.
The second option, remote_analyze, defaults to false at both levels, and only comes into play if the first fetch succeeds but no attribute statistics (i.e. the stats from pg_stats) are found. If enabled then the function will attempt to ANALYZE the remote table, and if that is successful then a second attempt at fetching remote statistics will be made.
If no statistics were fetched, then the operation will fall back to the normal sampling operation per settings.
Note patches 0001 and 0002 are already a part of a separate thread /messages/by-id/CADkLM=cpUiJ3QF7aUthTvaVMmgQcm7QqZBRMDLhBRTR+gJX-Og@mail.gmail.com regarding a bug (0001) and a nitpick (0002) that came about as a side-effect of this effort, but I expect those to be resolved one way or another soon. Any feedback on those two can be handled there.
I think this is very useful to avoid fetching rows from foreign server
and analyzing them locally.
This isn't a full review. I looked at the patches mainly to find out
how it fits into the current method of analysing a foreign table.
Right now, do_analyze_rel() is called with FDW specific acquireFunc,
which collects a sample of rows. The sample is passed to attribute
specific compute_stats which fills VacAttrStats for that attribute.
VacAttrStats for all the attributes is passed to update_attstats(),
which updates pg_statistic. The patch changes that to fetch the
statistics from the foreign server and call pg_restore_attribute_stats
for each attribute. Instead I was expecting that after fetching the
stats from the foreign server, it would construct VacAttrStats and
call update_attstats(). That might be marginally faster since it
avoids SPI call and updates stats for all the attributes. Did you
consider this alternate approach and why it was discarded?
If a foreign table points to an inheritance parent on the foreign
server, we will receive two rows for that table - one with inherited =
false and the other with true, in that order. I think the stats with
inherited=true are relevant to the local server, since querying the
parent will fetch rows from children. Since those stats are applied
last, pg_statistic will retain the intended statistics. But why
fetch two rows in the first place and waste computing cycles?
--
Best Wishes,
Ashutosh Bapat
This isn't a full review. I looked at the patches mainly to find out
how it fits into the current method of analysing a foreign table.
Any degree of review is welcome. We're chasing views, reviews, etc.
Right now, do_analyze_rel() is called with FDW specific acquireFunc,
which collects a sample of rows. The sample is passed to attribute
specific compute_stats which fills VacAttrStats for that attribute.
VacAttrStats for all the attributes is passed to update_attstats(),
which updates pg_statistic. The patch changes that to fetch the
statistics from the foreign server and call pg_restore_attribute_stats
for each attribute.
That recap is accurate.
Instead I was expecting that after fetching the
stats from the foreign server, it would construct VacAttrStats and
call update_attstats(). That might be marginally faster since it
avoids SPI call and updates stats for all the attributes. Did you
consider this alternate approach and why it was discarded?
It might be marginally faster, but it would duplicate a lot of the
pair-checking (a most-common-freqs must be accompanied by a
most-common-vals, etc.) and type-checking logic (the vals in a
most-common-vals must all be input-coercible to the correct datatype of
the _destination_ column, etc.), and we've already got that in
pg_restore_attribute_stats. There used to be a non-fcinfo function that
took a long list of Datum/isnull pairs, but that wasn't pretty to look
at and was replaced with the positional fcinfo version we have today.
This use case might be a reason to bring that back, or to expose the
existing positional fcinfo function (presently static), if we want to
avoid SPI badly enough. As it is, the SPI code is fairly future-proof in
that it doesn't need updating as new stat types are created. My first
attempt at this patch tried to make a FunctionCallInvoke() on the
variadic pg_restore_attribute_stats, but that would require a filled-out
fn_expr, and to get that we'd have to duplicate a lot of logic from the
parser, so the infrastructure isn't available to easily call a variadic
function.
If a foreign table points to an inheritance parent on the foreign
server, we will receive two rows for that table - one with inherited =
false and the other with true, in that order. I think the stats with
inherited=true are relevant to the local server, since querying the
parent will fetch rows from children. Since those stats are applied
last, pg_statistic will retain the intended statistics. But why
fetch two rows in the first place and waste computing cycles?
Glad you agree that we're fetching the right statistics.
That was the only way I could think of to do one client fetch and still get
exactly one row back.
Anything else involved fetching the inh=true statistics first and, if
that failed, fetching the inh=false statistics. That adds extra
round-trips, especially given that inherited statistics are rarer than
non-inherited statistics. Moreover, we're making decisions (analyze
yes/no, fall back to sampling yes/no) based on whether or not we got
statistics back from the foreign server at all, and having to consider
the result of two fetches instead of one makes that logic more
complicated.
If, however, you were referring to the work we're handing the remote
server, I'm open to queries that you think would be more lightweight.
However, the pg_stats view is a security barrier view, so we reduce
overhead by passing through that barrier as few times as possible.
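To make the single-fetch behavior concrete, the remote query is conceptually along these lines (an illustrative sketch, not necessarily the patch's actual SQL; schema and table names are placeholders):

```sql
-- Illustrative only: fetch both inherited = false and inherited = true
-- rows in one round trip; sorting false before true means the inherited
-- stats, if any, arrive last and are the ones that stick locally.
SELECT schemaname, tablename, attname, inherited,
       null_frac, avg_width, n_distinct,
       most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'remote_tab'
ORDER BY attname, inherited;
```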
Rebased.
Attachments:
v2-0001-Add-remote-statistics-fetching-to-postgres_fdw.patch (text/x-patch, +746 -3)
On Fri, Sep 12, 2025 at 1:17 AM Corey Huinker <corey.huinker@gmail.com>
wrote:
Rebased.
Rebased again.
Attachments:
v3-0001-Add-remote-statistics-fetching-to-postgres_fdw.patch (text/x-patch, +746 -3)
On Sat, Oct 18, 2025 at 08:32:24PM -0400, Corey Huinker wrote:
Rebased again.
Hearing an opinion from Fujita-san would be interesting here, so I am
adding him in CC. I have been looking a little bit at this patch.
+ ImportStatistics_function ImportStatistics;
All FDW callbacks are documented in fdwhandler.sgml. This new one is
not.
I am a bit uncomfortable regarding the design you are using here,
where the ImportStatistics callback, if defined, takes priority over
the existing AnalyzeForeignTable, especially regarding the fact that
both callbacks attempt to retrieve the same data, except that the
existing callback has a different idea of the timing to use to
retrieve reltuples and relpages. The original callback
AnalyzeForeignTable immediately retrieves the total number of pages
via SQL to feed ANALYZE; reltuples is then fed to ANALYZE from a
callback function that's defined in AnalyzeForeignTable().
My point is: rather than trying to implement a second solution with a
new callback, shouldn't we try to rework the existing callback so as
it would fit better with what you are trying to do here: feed data
that ANALYZE would then be in charge of inserting? Relying on
pg_restore_relation_stats() for the job feels incorrect to me knowing
that ANALYZE is the code path in charge of updating the stats of a
relation. The new callback is a shortcut that bypasses what ANALYZE
does, so the problem, at least it seems to me, is that we want to
retrieve all the data in a single step, like your new callback, not in
two steps, something that only the existing callback allows. Hence,
wouldn't it be more natural to retrieve the total number of pages and
reltuples from one callback, meaning that for local relations we
should delay RelationGetNumberOfBlocks() inside the existing
"acquire_sample_rows" callback (renaming it would make sense)? This
requires some redesign of AnalyzeForeignTable and the internals of
analyze.c, but it would let FDW extensions know about the potential
efficiency gain.
There is also a performance concern to be made with the patch, but as
it's an ANALYZE path that may be acceptable: if we fail the first
callback, then we may call the second callback.
Fujita-san, what do you think?
--
Michael
On Mon, Oct 20, 2025 at 03:45:14PM +0900, Michael Paquier wrote:
On Sat, Oct 18, 2025 at 08:32:24PM -0400, Corey Huinker wrote:
Rebased again.
Hearing an opinion from Fujita-san would be interesting here, so I am
adding him in CC. I have been looking a little bit at this patch.
By the way, as far as I can see this patch is still in the past commit
fest:
https://commitfest.postgresql.org/patch/5959/
You may want to move it if you are planning to continue working on
that.
--
Michael
I am a bit uncomfortable regarding the design you are using here,
where the ImportStatistics callback, if defined, takes priority over
the existing AnalyzeForeignTable, especially regarding the fact that
both callbacks attempt to retrieve the same data, except that the
existing callback has a different idea of the timing to use to
retrieve reltuples and relpages. The original callback
AnalyzeForeignTable immediately retrieves the total number of pages
via SQL to feed ANALYZE; reltuples is then fed to ANALYZE from a
callback function that's defined in AnalyzeForeignTable().
They don't try to retrieve the same data. AnalyzeForeignTable tries to
retrieve a table sample, which it feeds into the normal ANALYZE process.
If that sample is going to be any good, it has to be a lot of rows, and
that's a lot of network traffic.
ImportStatistics just grabs the results that ANALYZE computed for the
remote table, using a far better sample than we'd want to pull across the
wire.
My point is: rather than trying to implement a second solution with a
new callback, shouldn't we try to rework the existing callback so as
it would fit better with what you are trying to do here: feed data
that ANALYZE would then be in charge of inserting?
To do that, we would have to somehow generate fake data locally from the
pg_stats data that we did pull over the wire, just to have ANALYZE compute
it back down to the data we already had.
Relying on
pg_restore_relation_stats() for the job feels incorrect to me knowing
that ANALYZE is the code path in charge of updating the stats of a
relation.
That sounds like an argument for expanding ANALYZE to have syntax that will
digest pre-computed rows, essentially taking over what
pg_restore_relation_stats and pg_restore_attribute_stats already do, and
that idea was dismissed fairly early on in development, though I wasn't
against it at the time.
As it is, those two functions were developed to match what ANALYZE does.
pg_restore_relation_stats even briefly had inplace updates because that's
what ANALYZE did.
The new callback is a shortcut that bypasses what ANALYZE
does, so the problem, at least it seems to me, is that we want to
retrieve all the data in a single step, like your new callback, not in
two steps, something that only the existing callback allows. Hence,
wouldn't it be more natural to retrieve the total number of pages and
reltuples from one callback, meaning that for local relations we
should delay RelationGetNumberOfBlocks() inside the existing
"acquire_sample_rows" callback (renaming it would make sense)? This
requires some redesign of AnalyzeForeignTable and the internals of
analyze.c, but it would let FDW extensions know about the potential
efficiency gain.
I wanted to make minimal disruption to the existing callbacks, but that may
have been misguided.
One problem I do see, however, is that the queries for fetching relation
(pg_class) stats should never fail and always return one row, even if the
values returned are meaningless. It's the query against pg_stats that lets
us know if 1) the database on the other side is a real-enough postgres and
2) the remote table is itself analyzed. Only once we're happy that we have
good attribute stats should we bother with the relation stats. All of this
logic is specific to postgres, so it was confined to postgres_fdw. I don't
know that other FDWs would be that much different, but minimizing the
generic impact was a goal.
I'll look at this again, but I'm not sure I'm going to come up with much
different.
There is also a performance concern to be made with the patch, but as
it's an ANALYZE path that may be acceptable: if we fail the first
callback, then we may call the second callback.
I think the big win is the network traffic savings.
Fujita-san, what do you think?
Very interested to know as well.
Hi Corey,
This is an important feature. Thanks for working on it.
On Mon, Nov 3, 2025 at 2:26 PM Corey Huinker <corey.huinker@gmail.com> wrote:
My point is: rather than trying to implement a second solution with a
new callback, shouldn't we try to rework the existing callback so as
it would fit better with what you are trying to do here: feed data
that ANALYZE would then be in charge of inserting?
To do that, we would have to somehow generate fake data locally from the pg_stats data that we did pull over the wire, just to have ANALYZE compute it back down to the data we already had.
Relying on
pg_restore_relation_stats() for the job feels incorrect to me knowing
that ANALYZE is the code path in charge of updating the stats of a
relation.
That sounds like an argument for expanding ANALYZE to have syntax that will digest pre-computed rows, essentially taking over what pg_restore_relation_stats and pg_restore_attribute_stats already do, and that idea was dismissed fairly early on in development, though I wasn't against it at the time.
As it is, those two functions were developed to match what ANALYZE does. pg_restore_relation_stats even briefly had inplace updates because that's what ANALYZE did.
To me it seems like a good idea to rely on pg_restore_relation_stats
and pg_restore_attribute_stats, rather than doing some rework on
analyze.c; IMO I don't think it's a good idea to do such a thing for
something rather special like this.
The new callback is a shortcut that bypasses what ANALYZE
does, so the problem, at least it seems to me, is that we want to
retrieve all the data in a single step, like your new callback, not in
two steps, something that only the existing callback allows. Hence,
wouldn't it be more natural to retrieve the total number of pages and
reltuples from one callback, meaning that for local relations we
should delay RelationGetNumberOfBlocks() inside the existing
"acquire_sample_rows" callback (renaming it would make sense)? This
requires some redesign of AnalyzeForeignTable and the internals of
analyze.c, but it would let FDW extensions know about the potential
efficiency gain.
I wanted to make minimal disruption to the existing callbacks, but that may have been misguided.
+1 for the minimal disruption.
Other initial comments:
The commit message says:
This is managed via two new options, fetch_stats and remote_analyze,
both are available at the server level and table level. If fetch_stats
is true, then the ANALYZE command will first attempt to fetch statistics
from the remote table and import those statistics locally.
If remote_analyze is true, and if the first attempt to fetch remote
statistics found no attribute statistics, then an attempt will be made
to ANALYZE the remote table before a second and final attempt to fetch
remote statistics.
If no statistics are found, then ANALYZE will fall back to the normal
behavior of sampling and local analysis.
I think the first step assumes that the remote stats are up-to-date;
if they aren't, it would cause a regression. (If the remote relation
is a plain table, they are likely to be up-to-date, but for example,
if it is a foreign table, it's possible that they are stale.) So how
about making it the user's responsibility to make them up-to-date? If
doing so, we wouldn't need to do the second and third steps anymore,
making the patch simple.
On the other hand:
This operation will only work on remote relations that can have stored
statistics: tables, partitioned tables, and materialized views. If the
remote relation is a view then remote fetching/analyzing is just wasted
effort and the user is better of setting fetch_stats to false for that
table.
I'm not sure the wasted effort is acceptable; IMO, if the remote table
is a view, I think that the system should detect that in some way and
then just do the normal ANALYZE processing.
That's it for now.
My apologies for the delayed response.
Best regards,
Etsuro Fujita
Other initial comments:
The commit message says:
This is managed via two new options, fetch_stats and remote_analyze,
both are available at the server level and table level. If fetch_stats
is true, then the ANALYZE command will first attempt to fetch
statistics
from the remote table and import those statistics locally.
If remote_analyze is true, and if the first attempt to fetch remote
statistics found no attribute statistics, then an attempt will be made
to ANALYZE the remote table before a second and final attempt to fetch
remote statistics.
If no statistics are found, then ANALYZE will fall back to the normal
behavior of sampling and local analysis.
I think the first step assumes that the remote stats are up-to-date;
if they aren't, it would cause a regression. (If the remote relation
is a plain table, they are likely to be up-to-date, but for example,
if it is a foreign table, it's possible that they are stale.) So how
about making it the user's responsibility to make them up-to-date? If
doing so, we wouldn't need to do the second and third steps anymore,
making the patch simple.
Obviously there is no way to know the quality/freshness of remote stats if
they are found.
The analyze option was born of feedback from other postgres hackers
while brainstorming on what this option might look like. I don't think
we *need* this extra option for the feature to be a success, but its
relative simplicity did make me want to put it out there to see who else
liked it.
On the other hand:
This operation will only work on remote relations that can have stored
statistics: tables, partitioned tables, and materialized views. If the
remote relation is a view then remote fetching/analyzing is just wasted
effort and the user is better of setting fetch_stats to false for that
table.
I'm not sure the wasted effort is acceptable; IMO, if the remote table
is a view, I think that the system should detect that in some way and
then just do the normal ANALYZE processing.
The stats fetch query is pretty light, but I can see fetching the relkind
along with the relstats, and making decisions on whether to continue from
there, only applying the relstats after attrstats have been successfully
applied.
That's it for now.
I'll see what I can do to make that work.
My apologies for the delayed response.
Valuable responses are worth waiting for.
My apologies for the delayed response.
Valuable responses are worth waiting for.
I've reorganized some things a bit, mostly to make resource cleanup
simpler.
Attachments:
v4-0001-Add-remote-statistics-fetching-to-postgres_fdw.patch (text/x-patch, +864 -3)
On Nov 23, 2025, at 15:27, Corey Huinker <corey.huinker@gmail.com> wrote:
My apologies for the delayed response.
Valuable responses are worth waiting for.
I've reorganized some things a bit, mostly to make resource cleanup simpler.
<v4-0001-Add-remote-statistics-fetching-to-postgres_fdw.patch>
Few comments:
1 - commit message
```
effort and the user is better of setting fetch_stats to false for that
```
I guess “of” should be “off”
2 - postgres-fdw.sgml
```
+ server, determines wheter an <command>ANALYZE</command> on a foreign
```
Typo: wheter => whether
3 - postgres-fdw.sgml
```
+ data sampling if no statistics were availble. This option is only
```
Typo: availble => available
4 - option.c
```
+ /* fetch_size is available on both server and table */
+ {"fetch_stats", ForeignServerRelationId, false},
+ {"fetch_stats", ForeignTableRelationId, false},
```
In the comment, I guess fetch_size should be fetch_stats.
5 - analyze.c
```
+ * XXX: Should this be it's own fetch type? If not, then there might be
```
Typo: “it’s own” => “its own”
6 - analyze.c
```
+ case FDW_IMPORT_STATS_NOTFOUND:
+ ereport(INFO,
+ errmsg("Found no remote statistics for \"%s\"",
+ RelationGetRelationName(onerel)));
```
`Found no remote statistics for \"%s\"` could be rephrased as `No remote statistics found for foreign table \"%s\"`, which sounds like better wording in the server log.
Also, I wonder if this message at INFO level is too noisy?
7 - postgres_fdw.c
```
+ default:
+ ereport(INFO,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("Remote table %s.%s does not support statistics.",
+ quote_identifier(remote_schemaname),
+ quote_identifier(remote_relname)),
+ errdetail("Remote relation if of relkind \"%c\"", relkind));
```
I think “if of” should be “is of”.
8 - postgres_fdw.c
```
+ errmsg("Attribute statistics found for %s.%s but no columns matched",
+ quote_identifier(schemaname),
+ quote_identifier(relname))));
```
Instead of using "%s.%s" and calling quote_identifier() twice, there is a simpler function to use: quote_qualified_identifier(schemaname, relname).
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Sat, Nov 22, 2025 at 6:31 AM Corey Huinker <corey.huinker@gmail.com> wrote:
Other initial comments:
The commit message says:
This is managed via two new options, fetch_stats and remote_analyze,
both are available at the server level and table level. If fetch_stats
is true, then the ANALYZE command will first attempt to fetch statistics
from the remote table and import those statistics locally.
If remote_analyze is true, and if the first attempt to fetch remote
statistics found no attribute statistics, then an attempt will be made
to ANALYZE the remote table before a second and final attempt to fetch
remote statistics.
If no statistics are found, then ANALYZE will fall back to the normal
behavior of sampling and local analysis.
I think the first step assumes that the remote stats are up-to-date;
if they aren't, it would cause a regression. (If the remote relation
is a plain table, they are likely to be up-to-date, but for example,
if it is a foreign table, it's possible that they are stale.) So how
about making it the user's responsibility to make them up-to-date? If
doing so, we wouldn't need to do the second and third steps anymore,
making the patch simple.
Obviously there is no way to know the quality/freshness of remote stats if they are found.
The analyze option was born of feedback from other postgres hackers while brainstorming on what this option might look like. I don't think we *need* this extra option for the feature to be a success, but its relative simplicity did make me want to put it out there to see who else liked it.
Actually, I have some concerns about the ANALYZE and fall-back
options. As for the former, if the remote user didn't have the
MAINTAIN privilege on the remote table, remote ANALYZE would just be
wasted effort. As for the latter, imagine the situation where a user
ANALYZEs a foreign table whose remote table is significantly large.
When the previous attempts fail, the user might want to retry
importing remote stats after ANALYZEing the remote table on the remote
side in some way, rather than postgres_fdw automatically falling back
to the normal lengthy processing. I think just throwing an error if
the first attempt fails would make the system not only simple but
reliable, while providing some flexibility to users.
On the other hand:
This operation will only work on remote relations that can have stored
statistics: tables, partitioned tables, and materialized views. If the
remote relation is a view then remote fetching/analyzing is just wasted
effort and the user is better of setting fetch_stats to false for that
table.
I'm not sure the wasted effort is acceptable; IMO, if the remote table
is a view, I think that the system should detect that in some way and
then just do the normal ANALYZE processing.
The stats fetch query is pretty light, but I can see fetching the relkind along with the relstats, and making decisions on whether to continue from there, only applying the relstats after attrstats have been successfully applied.
Good idea! I would vote for throwing an error if the relkind is view,
making the user set fetch_stats to false for the foreign table.
Best regards,
Etsuro Fujita
On Sun, Nov 23, 2025 at 4:28 PM Corey Huinker <corey.huinker@gmail.com> wrote:
I've reorganized some things a bit, mostly to make resource cleanup simpler.
Thanks for updating the patch! I will look into it.
Best regards,
Etsuro Fujita
On Thu, Nov 27, 2025 at 7:48 AM Etsuro Fujita <etsuro.fujita@gmail.com>
wrote:
On Sun, Nov 23, 2025 at 4:28 PM Corey Huinker <corey.huinker@gmail.com>
wrote:
I've reorganized some things a bit, mostly to make resource cleanup
simpler.
Thanks for updating the patch! I will look into it.
Best regards,
Etsuro Fujita
Rebase, no changes.
Attachments:
v5-0001-Add-remote-statistics-fetching-to-postgres_fdw.patch (text/x-patch, +864 -3)
On Dec 12, 2025, at 05:59, Corey Huinker <corey.huinker@gmail.com> wrote:
On Thu, Nov 27, 2025 at 7:48 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Sun, Nov 23, 2025 at 4:28 PM Corey Huinker <corey.huinker@gmail.com> wrote:
I've reorganized some things a bit, mostly to make resource cleanup simpler.
Thanks for updating the patch! I will look into it.
Best regards,
Etsuro Fujita
Rebase, no changes.
<v5-0001-Add-remote-statistics-fetching-to-postgres_fdw.patch>
A kind reminder: I don't see that my comments have been addressed:
/messages/by-id/F9C73EF2-F977-46E4-9F61-B6CF72BF1A69@gmail.com
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, Nov 27, 2025 at 9:46 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Sat, Nov 22, 2025 at 6:31 AM Corey Huinker <corey.huinker@gmail.com> wrote:
Other initial comments:
The commit message says:
This is managed via two new options, fetch_stats and remote_analyze,
both are available at the server level and table level. If fetch_stats
is true, then the ANALYZE command will first attempt to fetch statistics
from the remote table and import those statistics locally.
If remote_analyze is true, and if the first attempt to fetch remote
statistics found no attribute statistics, then an attempt will be made
to ANALYZE the remote table before a second and final attempt to fetch
remote statistics.
If no statistics are found, then ANALYZE will fall back to the normal
behavior of sampling and local analysis.
I think the first step assumes that the remote stats are up-to-date;
if they aren't, it would cause a regression. (If the remote relation
is a plain table, they are likely to be up-to-date, but for example,
if it is a foreign table, it's possible that they are stale.) So how
about making it the user's responsibility to make them up-to-date? If
doing so, we wouldn't need to do the second and third steps anymore,
making the patch simple.
Obviously there is no way to know the quality/freshness of remote stats if they are found.
The analyze option was born of feedback from other postgres hackers while brainstorming on what this option might look like. I don't think we *need* this extra option for the feature to be a success, but its relative simplicity did make me want to put it out there to see who else liked it.
Actually, I have some concerns about the ANALYZE and fall-back
options. As for the former, if the remote user didn't have the
MAINTAIN privilege on the remote table, remote ANALYZE would be just a
waste effort. As for the latter, imagine the situation where a user
ANALYZEs a foreign table whose remote table is significantly large.
When the previous attempts fail, the user might want to re-try to
import remote stats after ANALYZEing the remote table in the remote
side in some way, rather than postgres_fdw automatically falling back
to the normal lengthy processing. I think just throwing an error if
the first attempt fails would make the system not only simple but
reliable while providing some flexibility to users.
I think I was narrow-minded here; as for the ANALYZE option, if the
remote user has the privilege, it would work well, so +1 for it, but I
don't think it's a must-have option, so I'd vote for making it a
separate patch. As for the fall-back behavior, however, sorry, I
still think it reduces flexibility.
Best regards,
Etsuro Fujita
On Fri, Dec 12, 2025 at 6:59 AM Corey Huinker <corey.huinker@gmail.com> wrote:
Rebase, no changes.
Thanks for rebasing! I reviewed the changes made to the core:
@@ -196,13 +196,56 @@ analyze_rel(Oid relid, RangeVar *relation,
{
/*
* For a foreign table, call the FDW's hook function to see whether it
- * supports analysis.
+ * supports statistics import and/or analysis.
*/
FdwRoutine *fdwroutine;
bool ok = false;
fdwroutine = GetFdwRoutineForRelation(onerel, false);
+ if (fdwroutine->ImportStatistics != NULL)
+ {
+ FdwImportStatsResult res;
+
+ /*
+ * Fetching pre-existing remote stats is not guaranteed to be a quick
+ * operation.
+ *
+ * XXX: Should this be it's own fetch type? If not, then there might be
+ * confusion when a long stats-fetch fails, followed by a regular analyze,
+ * which would make it look like the table was analyzed twice.
+ */
+ pgstat_progress_start_command(PROGRESS_COMMAND_ANALYZE,
+ RelationGetRelid(onerel));
+
+ res = fdwroutine->ImportStatistics(onerel);
+
+ pgstat_progress_end_command();
+
+ /*
+ * If we were able to import statistics, then there is no need to collect
+ * samples for local analysis.
+ */
+ switch(res)
+ {
+ case FDW_IMPORT_STATS_OK:
+ relation_close(onerel, ShareUpdateExclusiveLock);
+ return;
+ case FDW_IMPORT_STATS_DISABLED:
+ break;
+ case FDW_IMPORT_STATS_NOTFOUND:
+ ereport(INFO,
+ errmsg("Found no remote statistics for \"%s\"",
+ RelationGetRelationName(onerel)));
+ break;
+ case FDW_IMPORT_STATS_FAILED:
+ default:
+ ereport(INFO,
+ errmsg("Fetching remote statistics from \"%s\" failed",
+ RelationGetRelationName(onerel)));
+ }
+ }
Returning in the FDW_IMPORT_STATS_OK case isn't 100% correct; if the
foreign table is an inheritance parent, we would fail to do inherited
stats.
IIUC, the FDW needs to do two things within the ImportStatistics
callback function: check the importability, and if ok, do the work. I
think that would make the API complicated. To avoid that, how about
1) splitting the callback function into the two functions shown below
and then 2) rewriting the above to something like the attached? The
attached addresses the returning issue mentioned above as well.
bool
StatisticsAreImportable(Relation relation)
Checks the importability, and if ok, returns true, in which case the
following callback function is called. In the postgres_fdw case, we
could implement this to check if the fetch_stats flag for the foreign
table is set to true, and if so, return true.
void
ImportStatistics(Relation relation, List *va_cols, int elevel)
Imports the stats for the foreign table from the foreign server. As
mentioned in previous emails, I don't think it's a good idea to fall
back to the normal processing when the attempt to import the stats
fails, in which case I think we should just throw an error (or
warning) so that the user can re-try to import the stats after fixing
the foreign side in some way. So I re-defined this as a void
function. Note that this re-definition removes the concern mentioned
in the comment starting with "XXX:". In the postgres_fdw case, as
mentioned in a previous email, I think it would be good to implement
this so that it checks whether the remote table is a view or not when
importing the relation stats from the remote server, and if so, just
throws an error (or warning), making the user reset the fetch_stats
flag.
I added two arguments to the callback function: va_cols, for
supporting the column list in the ANALYZE command, and elevel, for
proper logging. I'm not sure if VacuumParams should be added as well.
That's it for now. I'll continue to review the patch.
Best regards,
Etsuro Fujita
Attachments:
FDW-API-for-importing-stats.patch (application/octet-stream) +43 -18
A kind reminder: I don't see that my comments have been addressed:
My apologies. Will get into the next rev.
CCing Jonathan Katz and Nathan Bossart as they had been sounding boards for
me when I started designing this feature.
Returning in the FDW_IMPORT_STATS_OK case isn't 100% correct; if the
foreign table is an inheritance parent, we would fail to do inherited
stats.
Perhaps I'm not understanding completely, but I believe what we're doing
now should be ok.
The local table of type 'f' can be a partition of a partitioned table, but
can't be partitioned itself, so whatever stats we get for it, we're storing
them as inh = false.
On the remote side, the table could be an inheritance parent, in which case
we ONLY want the inh=true stats, but we will still store them locally as
inh = false.
The DISTINCT ON (a.attname) ... ORDER BY a.attname, s.inherited DESC part of
the query ensures that we get the inh=true stats, if they're there, in
preference to the inh=false stats.
I will grant you that in an old-style inheritance (i.e. not formal
partitions) situation we'd probably want some weighted mix of the inherited
and non-inherited stats, but 1) very few people use old-style inheritance
anymore, 2) few of those export tables via an FDW, and 3) there's no way to
do that weighting, so we should fall back to sampling anyway.
None of this takes away from your suggestions down below.
IIUC, the FDW needs to do two things within the ImportStatistics
callback function: check the importability, and if ok, do the work. I
think that would make the API complicated. To avoid that, how about
1) splitting the callback function into the two functions shown below
and then 2) rewriting the above to something like the attached? The
attached addresses the returning issue mentioned above as well.

bool
StatisticsAreImportable(Relation relation)

Checks the importability, and if ok, returns true, in which case the
following callback function is called. In the postgres_fdw case, we
could implement this to check if the fetch_stats flag for the foreign
table is set to true, and if so, return true.
+1
void
ImportStatistics(Relation relation, List *va_cols, int elevel)

Imports the stats for the foreign table from the foreign server. As
mentioned in previous emails, I don't think it's a good idea to fall
back to the normal processing when the attempt to import the stats
fails, in which case I think we should just throw an error (or
warning) so that the user can re-try to import the stats after fixing
the foreign side in some way. So I re-defined this as a void
function. Note that this re-definition removes the concern mentioned
in the comment starting with "XXX:". In the postgres_fdw case, as
mentioned in a previous email, I think it would be good to implement
this so that it checks whether the remote table is a view or not when
importing the relation stats from the remote server, and if so, just
throws an error (or warning), making the user reset the fetch_stats
flag.
I think I'm ok with this design decision, as it still leaves open the
fdw-specific options of how to handle initially finding no remote stats.

I can still see a situation where a local table expects the remote table to
eventually have proper statistics on it, but until that happens it will
fall back to table samples. This design decision means that either the user
lives without any statistics for a while, or alters the foreign table
options and hopefully remembers to set them back. While I understand the
desire to first implement something very simple, I think that adding the
durability that fallback allows for will be harder to implement if we don't
build it in from the start. I'm interested to hear what Nathan and/or
Jonathan have to say about that.
I added two arguments to the callback function: va_cols, for
supporting the column list in the ANALYZE command, and elevel, for
proper logging. I'm not sure if VacuumParams should be added as well.
Good catch, I forgot about that one.
Going to think some more on this before I work incorporating your