Is there value in having optimizer stats for joins/foreignkeys?

Started by Corey Huinker · 5 months ago · 17 messages · hackers
#1 Corey Huinker
corey.huinker@gmail.com

Threads like [1] and [2] have gotten me thinking that there may be some
value in storing statistics about joins.

For the sake of argument, assume a table t1 with a column t2id which
references the pk of table t2 that has columns t2.t2id, t2c1, t2c2, t2c3.
In such a situation I can envision the following statistics being collected:

* The % of rows in t2 that are referenced at least once in t1
* The attribute stats (i.e. pg_statistic stats) for t2c1, t2c2, t2c3, but
associated with t1 and weighted according to the frequency of that row
being referenced, which means that values of unreferenced rows are filtered
out entirely.
* That's about it for direct statistics, but I could see creating extended
statistics for correlations between a local column value and a remote
column, or expressions on the remote columns, etc.
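
To make the "weighted" idea concrete, here is a minimal sketch (toy data
and names made up; not a proposed implementation) of how the referenced
fraction and the weighted attribute stats would fall out:

```python
from collections import Counter

# Hypothetical sample: t1.t2id values (with repeats) and t2 rows keyed by pk.
t1_t2id = [1, 1, 1, 2, 2, 5]                               # fk values in t1
t2 = {1: "red", 2: "red", 3: "blue", 4: None, 5: "green"}  # t2id -> t2c1

# % of t2 rows referenced at least once in t1
referenced = set(t1_t2id) & set(t2)
ref_frac = len(referenced) / len(t2)

# t2c1 stats weighted by reference frequency: each t1 row contributes its
# joined t2c1 value, so unreferenced t2 rows drop out entirely.
weighted = Counter(t2[k] for k in t1_t2id)
mcv = weighted.most_common()
print(ref_frac)  # 0.6
print(mcv)       # [('red', 5), ('green', 1)]
```

Note how 'blue' (and the NULL row) vanish from the weighted MCV because no
t1 row references them.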

The storage feels like it would be identical to pg_statistic but with a
"starefrelid" field that identifies the referencing table.

That much seems straightforward. A bigger problem is how we'd manage to
collect these statistics. We could (as Jeff Davis has suggested) keep our
tablesamples, but that wouldn't necessarily help in this case because the
set of rows referenced, and their relative weightings, would change since
the last sampling. In a worst-case scenario, we would have to sample the
joined-to tables as well, and that's an additional burden on an already
IO-intensive operation.

In theory, we could do some of this without any additional stats
collection. If the ndistinct of t1.t2id is, say, at least 75% of the
ndistinct of t2.t2id, we could just peek at the attribute stats on t2 and
use them for estimates. However, that makes some assumptions that the stats
on t2 are approximately as fresh as the stats on t1, and I don't think that
will be the case most of the time.

CCing people who have wondered out loud about this topic within earshot of
me.

Thoughts?

[1]: /messages/by-id/6fdc4dc5-8881-4987-9858-a9b484953185@joeconway.com
[2]: /messages/by-id/tencent_3018762E7D4C9BC470C821C829C1BF2F650A@qq.com

#2 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Corey Huinker (#1)
Re: Is there value in having optimizer stats for joins/foreignkeys?

On 12/1/25 21:10, Corey Huinker wrote:

Threads like [1] and [2] have gotten me thinking that there may be some
value in storing statistics about joins.

For the sake of argument, assume a table t1 with a column t2id which
references the pk of table t2 that has columns t2.t2id, t2c1, t2c2,
t2c3. In such a situation I can envision the following statistics being
collected:

* The % of rows in t2 that are referenced at least once in t1
* The attribute stats (i.e. pg_statistic stats) for t2c1, t2c2, t2c3,
but associated with t1 and weighted according to the frequency of that
row being referenced, which means that values of unreferenced rows are
filtered out entirely.
* That's about it for direct statistics, but I could see creating
extended statistics for correlations between a local column value and a
remote column, or expressions on the remote columns, etc.

Do I understand correctly you propose to collect such stats for every
foreign key? I recall something like that was proposed in the past, and
the argument against was that for many joins it'd be a waste because the
estimates are good enough. And for OLTP systems that's probably true.

Of course, it also depends on how expensive this would be. Maybe it's
cheap enough? No idea.

But I always assumed we'd have a way to explicitly enable such stats for
certain joins only, and the extended stats were designed to make that
possible.

FWIW I'm not entirely sure what stats you propose to collect exactly. I
mean, what does

... associated with t1 and weighted according to the frequency of
that row being referenced, which means that values of unreferenced
rows are filtered out entirely.

mean? Are you suggesting to "do the join" and build the regular stats as
if that was a regular table? I think that'd work, and it's mostly how I
envisioned to handle joins in extended stats, restricted to joins of two
relations.

The storage feels like it would be identical to pg_statistic but with a
"starefrelid" field that identifies the referencing table.

That much seems straightforward. A bigger problem is how we'd manage to
collect these statistics. We could (as Jeff Davis has suggested) keep
our tablesamples, but that wouldn't necessarily help in this case
because the set of rows referenced, and their relative weightings, would
change since the last sampling. In a worst-case scenario, we would have to
sample the joined-to tables as well, and that's an additional burden on
an already IO-intensive operation.

Combining independent per-table samples does not work, unless the
samples are huge. There's a nice paper [1] on how to do index-based join
sampling efficiently.

In theory, we could do some of this without any additional stats
collection. If the ndistinct of t1.t2id is, say, at least 75% of the
ndistinct of t2.t2id, we could just peek at the attribute stats on t2
and use them for estimates. However, that makes some assumptions that
the stats on t2 are approximately as fresh as the stats on t1, and I
don't think that will be the case most of the time.

CCing people who have wondered out loud about this topic within earshot
of me.

Thoughts?

I think adding joins to extended stats would not be all that hard
(famous last words, I know). For me the main challenge was figuring out
how to store the join definition in the catalog, I always procrastinated
and never gave that a serious try.

FWIW I think we might start by actually using per-table extended stats
on the joined tables. Just like we combine the scalar MCVs on joined
columns, we could combine multicolumn MCVs.
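
As a rough illustration (made-up frequencies; not PostgreSQL code),
combining two per-column MCV lists for an equijoin amounts to summing the
products of frequencies for values present in both lists, which is the
part a multicolumn analogue would generalize:

```python
# Roughly the MCV-vs-MCV part of equijoin selectivity estimation:
# sum f1(v) * f2(v) over values v that appear in both MCV lists.
mcv_t1 = {"red": 0.5, "blue": 0.3, "green": 0.1}   # value -> freq in t1.c
mcv_t2 = {"red": 0.4, "green": 0.2, "black": 0.1}  # value -> freq in t2.c

sel = sum(f1 * mcv_t2[v] for v, f1 in mcv_t1.items() if v in mcv_t2)
print(sel)  # 0.5*0.4 + 0.1*0.2 = 0.22
```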

regards

[1]: https://www.cidrdb.org/cidr2017/papers/p9-leis-cidr17.pdf

--
Tomas Vondra

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#2)
Re: Is there value in having optimizer stats for joins/foreignkeys?

Tomas Vondra <tomas@vondra.me> writes:

On 12/1/25 21:10, Corey Huinker wrote:

Threads like [1] and [2] have gotten me thinking that there may be some
value in storing statistics about joins.

Do I understand correctly you propose to collect such stats for every
foreign key? I recall something like that was proposed in the past, and
the argument against was that for many joins it'd be a waste because the
estimates are good enough. And for OLTP systems that's probably true.

Yeah, I think that automated choices about this are unlikely to work
well. We chose the syntax for CREATE STATISTICS with an eye to
allowing users to declaratively tell us to collect stats about
specific joins, and I still think that's a more promising approach.
But nobody's yet worked out any details.

regards, tom lane

#4 Corey Huinker
corey.huinker@gmail.com
In reply to: Tomas Vondra (#2)
Re: Is there value in having optimizer stats for joins/foreignkeys?

Do I understand correctly you propose to collect such stats for every
foreign key? I recall something like that was proposed in the past, and
the argument against was that for many joins it'd be a waste because the
estimates are good enough. And for OLTP systems that's probably true.

Not every foreign key, they'd be declared like CREATE STATISTICS, but would
be anchored to the constraint, not to the table.

But I always assumed we'd have a way to explicitly enable such stats for
certain joins only, and the extended stats were designed to make that
possible.

That's the intention, but the stats stored don't quite "fit" in the buckets
that extended stats create. Attribute statistics seem a much better fit:
this isn't about combinations of columns (there's only ever the one
combination), but about what can be known about the attributes in the far
table before doing the actual join.

FWIW I'm not entirely sure what stats you propose to collect exactly. I
mean, what does

... associated with t1 and weighted according to the frequency of
that row being referenced, which means that values of unreferenced
rows are filtered out entirely.

mean? Are you suggesting to "do the join" and build the regular stats as
if that was a regular table? I think that'd work, and it's mostly how I
envisioned to handle joins in extended stats, restricted to joins of two
relations.

Right. We'd do the join from t1 to t2 as described earlier, and then we'd
compute the null_frac, mcv, etc. for each column of t2 (as defined by the
scope of the stats declaration) according to the join. More commonly
referenced values would show up as more frequent, hence "weighted".

Just so I have an example to refer to later, say we have a table of colors:

CREATE TABLE color(id bigint primary key, color_name text unique,
color_family text null);

and there are hundreds of colors in the table that are color_family='red'
('fire engine red', 'candy apple red', 'popular muppet red', etc). Some
colors don't belong to any color_family.

And we have a table of toys:

CREATE TABLE toy(id bigint primary key, min_child_age integer, name text,
color_id bigint REFERENCES color);

And we declare a join stat on toy->color for the color_family attribute.
We'd sample rows from the toy table, left join those to color, and then
calculate the attribute stats of color_family as if it were a column in
toys. Some toys might not have a color_id, and some color_ids might not
belong to a color_family, so we'd want the null_frac to reflect those
combined conditions. For the values that do join, and the colors that do
belong to a family, we'd want to see regular MCV stats showing "red" as the
most common color_family.
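
A tiny simulation of this example (hypothetical sample; not the actual
ANALYZE code) shows how the combined null_frac and the weighted MCV would
come out:

```python
from collections import Counter

# Miniature of the toy/color example: sample rows from toy, LEFT JOIN to
# color, then compute attribute stats for color_family as if it were a
# column of toy.
color = {  # id -> (color_name, color_family)
    1: ("fire engine red", "red"),
    2: ("candy apple red", "red"),
    3: ("ecru", None),  # belongs to no color_family
}
toy_color_id = [1, 1, 2, 3, None, None]  # sampled toy.color_id values

joined = [color[cid][1] if cid is not None else None for cid in toy_color_id]
null_frac = joined.count(None) / len(joined)  # NULL fk OR NULL family
mcv = Counter(v for v in joined if v is not None).most_common()
print(null_frac)  # 0.5  (2 NULL fks + 1 familyless color, out of 6)
print(mcv)        # [('red', 3)]
```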

But those stats aren't really a correlation or a dependency, they're just
plain old attribute stats.

I understand wanting to know the correlation between toys.min_child_age and
colors.color_family, so that makes perfect sense for extended statistics,
but color_family on its own just doesn't fit. Am I missing something?

Combining independent per-table samples does not work, unless the
samples are huge. There's a nice paper [1] on how to do index-based join
sampling efficiently.

Thanks, now I've got some light reading for the flight home.

I think adding joins to extended stats would not be all that hard
(famous last words, I know). For me the main challenge was figuring out
how to store the join definition in the catalog, I always procrastinated
and never gave that a serious try.

I envisioned keying the stats off the foreign key constraint id, or adding
"starefrelid" (relation oid of the referencing table) to pg_statistic or a
table roughly the same shape as pg_statistic.

FWIW I think we might start by actually using per-table extended stats
on the joined tables. Just like we combine the scalar MCVs on joined
columns, we could combine multicolumn MCVs.

That's the other half of this - if the stats existed, do we have an obvious
way to put them to use?

#5 Corey Huinker
corey.huinker@gmail.com
In reply to: Tom Lane (#3)
Re: Is there value in having optimizer stats for joins/foreignkeys?

Yeah, I think that automated choices about this are unlikely to work
well. We chose the syntax for CREATE STATISTICS with an eye to
allowing users to declaratively tell us to collect stats about
specific joins, and I still think that's a more promising approach.
But nobody's yet worked out any details.

Per other response, no, I didn't envision stats on all possible joins or
even all possible foreign keys, just the ones we declare as interesting,
and even then only for the attributes that we say are interesting on the
far side of the join.

#6 Alexandra Wang
alexandra.wang.oss@gmail.com
In reply to: Corey Huinker (#5)
Re: Is there value in having optimizer stats for joins/foreignkeys?

Hi there,

Thanks for raising this topic! I am currently working on a POC patch that
adds extended statistics for joins. I am polishing the patch now and will
post it soon with performance numbers, since there is interest!

On Mon, Dec 1, 2025 at 7:16 PM Corey Huinker <corey.huinker@gmail.com>
wrote:

On Mon, Dec 1, 2025 at 1:02 PM Tomas Vondra <tomas@vondra.me> wrote:

I think adding joins to extended stats would not be all that hard
(famous last words, I know). For me the main challenge was figuring out
how to store the join definition in the catalog, I always procrastinated
and never gave that a serious try.

I envisioned keying the stats off the foreign key constraint id, or adding
"starefrelid" (relation oid of the referencing table) to pg_statistic or a
table roughly the same shape as pg_statistic.

On Mon, Dec 1, 2025 at 1:02 PM Tomas Vondra <tomas@vondra.me> wrote:

FWIW I think we might start by actually using per-table extended stats
on the joined tables. Just like we combine the scalar MCVs on joined
columns, we could combine multicolumn MCVs.

That's the other half of this - if the stats existed, do we have an
obvious way to put them to use?

I have indeed started by implementing MCV statistics for joins,
because I have not found a case for joins that would benefit only from
ndistinct or functional dependency stats that MCV stats wouldn't help.

In my POC patch, I've made the following catalog changes:
- Add stxotherrel (oid) and stxjoinkeys (int2vector) fields to
  pg_statistic_ext
- Use the existing stxkeys (int2vector) to store the stats object
  attributes of stxotherrel
- Create pg_statistic_ext_otherrel_index on (stxrelid, stxotherrel)
- Add stxdjoinmcv (pg_join_mcv_list) to pg_statistic_ext_data

To use them, we can let the planner detect patterns like this:

/*
* JoinStatsMatch - Information about a detected join pattern
* Used internally to track what was matched in a join+filter pattern
*/
typedef struct JoinStatsMatch
{
    Oid         target_rel;        /* table OID of the estimation target */
    AttrNumber  targetrel_joinkey; /* target_rel's join column */
    Oid         other_rel;         /* table OID of the filter source */
    AttrNumber  otherrel_joinkey;  /* other_rel's join column */
    List       *filter_attnums;    /* AttrNumbers of filter columns in
                                    * other_rel */
    List       *filter_values;     /* Datum constant values being filtered */
    Oid         collation;         /* collation for comparisons */

    /* Additional info to avoid duplicate work */
    List       *join_rinfos;       /* join clause RestrictInfos */
    RestrictInfo *filter_rinfo;    /* the filter clause RestrictInfo */
} JoinStatsMatch;

and add the detection logic in clauselist_selectivity_ext() and
get_foreign_key_join_selectivity().

Statistics collection indeed needs the most thinking. For the
purpose of a POC, I added MCV join stats collection as part of ANALYZE
of one table (stxrelid in pg_statistic_ext). I can do this because MCV
join stats are somewhat asymmetric. It allows me to have a target
table (referencing table for foreign key join) to ANALYZE, and we can
use the already collected MCVs of the joinkey column on the target
table to query the rows in the other table. This greatly mitigates
performance impact compared to actually joining two tables. However,
if we are to support more complex joins or other types of join stats
such as ndistinct or functional dependency, I found it hard to define
who's the target table (referencing table) and who's the other table
(referenced table) outside of the foreign key join scenario. So I
think for those more complex cases eventually we may as well
perform the joins and collect the join stats separately. Alvaro
Herrera suggested offline that we could have a dedicated autovacuum
command option for collecting the join statistics.

I have experimented with two ways to define the join statistics:

1. Use CREATE STATISTICS:

CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ] [ ( mcv ) ]
    ON { table_name1.column_name1 }, { table_name1.column_name2 } [, ...]
    FROM table_name1 JOIN table_name2
        ON table_name1.column_name3 = table_name2.column_name4

Examples:
-- Create join MCV statistics on a single filter column (keyword)
CREATE STATISTICS movie_keyword_keyword_join_stats (mcv)
ON k.keyword
FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

-- Create join MCV statistics on multiple filter columns (keyword +
phonetic_code):
CREATE STATISTICS movie_keyword_keyword_multicols_join_stats (mcv)
ON k.keyword, k.phonetic_code
FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

2. Auto join stats creation for Foreign Key + Functional Dependency stats

Initially, I did not implement the CREATE STATISTICS command to
create the join stats. Instead, I’ve implemented logic in ANALYZE to
detect functional dependency stats on the referenced table through FKs
and create join statistics implicitly for those cases.

I've been using the Join Order Benchmark (JOB) [1] to measure
performance gain. I will post the POC patch and performance numbers in
a followup email.

[1]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf

Best,
Alex

--
Alexandra Wang
EDB: https://www.enterprisedb.com

#7 Alexandra Wang
alexandra.wang.oss@gmail.com
In reply to: Alexandra Wang (#6)
Re: Is there value in having optimizer stats for joins/foreignkeys?

Hi hackers,

As promised in my previous email, I'm sharing a proof-of-concept patch
exploring join statistics for correlated columns across relations.
This is a POC at this point, but I hope the performance numbers below
give a better idea of both the potential usefulness of join statistics
and the complexity of implementing them.

Join Order Benchmark (JOB)
---------------------------------------

I got interested in this work after reading a 2015 VLDB paper [1] that
introduced the Join Order Benchmark (JOB) [2]. JOB uses the IMDB
dataset, with queries having 3-16 joins (average 8). Unlike standard
benchmarks such as TPC-H which use randomly generated uniform and
independent data, JOB uses real-world data with correlations and
skewed distributions, which make cardinality estimation much harder.

The paper tested PostgreSQL among other databases. I reran the
benchmark on current PG 19devel to see where we stand today. (The
original data source is gone, but CedarDB hosts a mirror [3]; I
created a fork [4] and added some scripts. The schema DDLs and queries
are unchanged.)

Test setup:
MacBook, Apple M3 Pro
Postgres master branch (19devel), built with -O3
cold and warm runs, 3 runs each

Results (average of 3 runs):

Postgres 19devel branch:
Total execution time cold run: 140.75 s (±2.12 s)
Total execution time warm run: 117.72 s (±0.76 s)

Postgres 19devel branch, SET enable_nestloop = off:
Total execution time cold run: 87.16 s (±0.17 s)
Total execution time warm run: 86.68 s (±0.42 s)

These results suggest that the planner is frequently choosing nested
loop joins where hash or merge joins would be better.

(As a side note: with nestloops disabled, plans are dominated by hash
joins, so cold vs warm runs are very similar due to limited cache
reuse.)

Problem
--------

The slowest query in this workload is 16b.sql (attached). Its plan
contains 7 nested loop joins.

The innermost join is estimated as 34 rows but actually produces
41,840 rows — a ~1230x underestimation:

Nested Loop  (cost=6.80..3813.14 rows=34 width=4)
             (actual time=2.336..304.160 rows=41840 loops=1)
  ->  Seq Scan on keyword k  (cost=0.00..2685.11 rows=1 width=4)
                             (actual time=0.402..6.161 rows=1.00 loops=1)
        Filter: (keyword = 'character-name-in-title')
  ->  Bitmap Heap Scan on movie_keyword mk  (cost=6.80..1124.98 rows=305 width=8)
                                            (actual time=1.933..295.493 rows=41840.00 loops=1)
        Recheck Cond: (k.id = keyword_id)

This underestimation cascades to the outermost join: ~3K estimated vs
~3.7M actual rows.

The problem can be reduced to the following query:

SELECT * FROM movie_keyword mk, keyword k
WHERE k.keyword = 'character-name-in-title'
AND k.id = mk.keyword_id;

The tables:

keyword(id PK, keyword, phonetic_code)
movie_keyword(id PK, movie_id, keyword_id)

These tables are strongly correlated via keyword_id -> keyword.id. We
can even add:

ALTER TABLE movie_keyword
ADD CONSTRAINT movie_keyword_keyword_id_fkey
FOREIGN KEY (keyword_id) REFERENCES keyword(id);
CREATE STATISTICS keyword_stats (dependencies)
ON id, keyword FROM keyword;
ANALYZE keyword, movie_keyword;

However, even with the FK constraint and extended statistics on the
keyword table, the estimate remains unchanged:

SELECT * FROM check_estimated_rows(
'SELECT * FROM movie_keyword mk, keyword k
WHERE k.keyword = ''character-name-in-title''
AND k.id = mk.keyword_id');
estimated | actual
-----------+--------
34 | 41840

This indicates that the planner cannot connect the skewed distribution
of the join key on the referencing table with the filter selectivity
on the referenced table as it applies to the join result. This is the
gap that join statistics aim to fill.
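
For intuition, the default estimate is roughly the average per-key fanout
under an independence assumption. Using approximate JOB table sizes
(illustrative figures), that lands almost exactly on the 34-row estimate
shown above:

```python
# Back-of-the-envelope reproduction of the misestimate: without join stats
# the planner treats the filter and the join as independent, so the
# expected match count for any single keyword is the average fanout, not
# the fanout of this particular, heavily skewed key.
n_movie_keyword = 4_523_930  # movie_keyword rows (approximate JOB size)
n_keyword = 134_170          # keyword rows / distinct ids (approximate)

avg_fanout = n_movie_keyword / n_keyword
print(round(avg_fanout))  # 34 -- matching the estimate; actual is 41840
```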

Proposed Solution
------------------

I've attached two patches:

0001: Extend CREATE STATISTICS for join MCV statistics
0002: Automatically create join MCV statistics from FK constraints

0001 is the core patch. 0002 is an optional convenience feature
that detects FK relationships during ANALYZE and auto-creates
join stats.

Syntax:

CREATE STATISTICS <name> (mcv)
ON <other_table>.<filter_col> [, ...]
FROM <primary_table> alias
JOIN <other_table> alias ON (<join_condition>);

Example -- single filter column:

CREATE STATISTICS mk_keyword_stats (mcv)
ON k.keyword
FROM movie_keyword mk
JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

Example -- multiple filter columns:

CREATE STATISTICS mk_multi_stats (mcv)
ON k.keyword, k.phonetic_code
FROM movie_keyword mk
JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

Catalog changes:

pg_statistic_ext:
- New stats kind 'c' (join MCV) in stxkind
- New field stxotherrel (Oid): the other table in the join
- New field stxjoinkeys (int2vector): join column pair [primary_joinkey,
other_joinkey]
- Existing stxkeys stores the filter column(s) on stxotherrel
- New index pg_statistic_ext_otherrel_index on (stxrelid, stxotherrel)

pg_statistic_ext_data:
- New field stxdjoinmcv (pg_join_mcv_list): serialized join MCV data

How stats are collected:

Join statistics are collected during ANALYZE of the primary table
(stxrelid). The current approach assumes a dependency relationship
between the join key column and the filter column(s) on the other
table. Specifically, during ANALYZE, we reuse the already-computed
single-column MCV stats for the primary table's join key column. For
each MCV value, we look up the matching row in the other table via the
join key and extract the filter column values, carrying over the
primary-side MCV frequency. This is not accurate in all cases, but
works reasonably well for the POC, especially for foreign-key-like
joins where the primary table's join key distribution is
representative of the join result.
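
A minimal sketch of that collection loop (hypothetical data; the real code
works on sampled rows and catalog MCV entries):

```python
# Reuse the primary table's join-key MCVs and probe the referenced table
# once per MCV value, carrying the primary-side frequency over to the
# filter column's value.
keyword = {101: "character-name-in-title", 102: "independent-film"}  # id -> keyword

# Already-computed MCVs of movie_keyword.keyword_id from plain ANALYZE:
joinkey_mcvs = [(101, 0.06), (102, 0.03)]  # (value, frequency)

join_mcv = []
for keyid, freq in joinkey_mcvs:
    row = keyword.get(keyid)  # PK lookup on the other table, not a full join
    if row is not None:
        join_mcv.append((row, freq))
print(join_mcv)  # [('character-name-in-title', 0.06), ('independent-film', 0.03)]
```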

For example, the catalog contents look like:

stxrelid | stxotherrel | stxjoinkeys | stxkeys | stxkind
--------------+-------------+-------------+---------+--------
movie_keyword | keyword | 2 1 | 2 | {c}

index | values | nulls | frequency
------+--------------+-------+----------
0 | {keyword_1} | {f} | 0.06
1 | {keyword_2} | {f} | 0.06
2 | {keyword_3} | {f} | 0.06
... | ... | ... | ...

How the planner uses join stats:

I added the join pattern detection logic in
clauselist_selectivity_ext() and get_foreign_key_join_selectivity().
The planner looks for a pattern of:

- An equijoin clause between two relations (the join key pair), and
- Equality or IN filter clauses on columns of one relation (the filter
columns)

When this pattern is found and matching join MCV stats exist, the
planner compares the filter values against the stored MCV items to
compute the join selectivity, replacing the default estimate.
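
Schematically (made-up frequencies and fallback; not the patch's actual
data structures), that estimation step is a straight lookup:

```python
# A filter value on the referenced table is looked up in the stored
# join-MCV list, and the stored frequency is used directly as the
# selectivity of join+filter on the referencing table.
join_mcv = {("character-name-in-title",): 0.00925}  # filter values -> freq
DEFAULT_SEL = 0.0001  # hypothetical fallback for non-MCV filter values

def join_selectivity(filter_values):
    return join_mcv.get(tuple(filter_values), DEFAULT_SEL)

mk_rows = 4_523_930  # movie_keyword row count (approximate)
est = mk_rows * join_selectivity(["character-name-in-title"])
print(round(est))  # close to the actual 41840, instead of the default 34
```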

JOB Benchmark Results
---------------------

I created a single join statistics object for the (movie_keyword,
keyword) pair and reran the JOB benchmark:

CREATE STATISTICS movie_keyword_keyword_stats (mcv)
ON k.keyword
FROM movie_keyword mk
JOIN keyword k ON (mk.keyword_id = k.id);

Total execution times (average of 3 runs):

Cold run Warm run
Default (baseline) 140.75s (+/-2.1) 117.72s (+/-0.8)
enable_nestloop=off 87.16s (+/-0.2) 86.68s (+/-0.4)
With join stats 85.81s (+/-3.4) 65.24s (+/-0.5)

With a single join statistics object:
- Cold run: -39% vs baseline (140.75s -> 85.81s),
comparable to enable_nestloop=off
- Warm run: -45% vs baseline (117.72s -> 65.24s),
-25% vs enable_nestloop=off (86.68s -> 65.24s)

Per-query times for the 10 slowest (out of 113) queries on baseline
(master branch, enable_nestloop = on):

Query | Baseline | No Nestloop | Join Stats | Best
------+-----------+-------------+------------+----------
16b | 10,673 ms | 2,789 ms | 2,775 ms | 3.8x JS
17e | 7,709 ms | 2,260 ms | 2,149 ms | 3.6x JS
17f | 7,506 ms | 2,304 ms | 2,460 ms | 3.3x NL
17a | 7,288 ms | 1,453 ms | 2,388 ms | 5.0x NL
17b | 6,490 ms | 1,580 ms | 5,413 ms | 4.1x NL
17c | 6,240 ms | 1,095 ms | 5,268 ms | 5.7x NL
17d | 6,261 ms | 1,263 ms | 5,291 ms | 5.0x NL
6d | 6,234 ms | 988 ms | 1,565 ms | 6.3x NL
25c | 6,133 ms | 1,848 ms | 1,738 ms | 3.5x JS
6f | 5,728 ms | 1,785 ms | 1,556 ms | 3.7x JS

(JS = join stats fastest, NL = no-nestloop fastest)

All 10 queries improved over baseline. For 16b, 3 nested loop joins
were replaced with 2 hash joins and 1 merge join. Some queries (e.g.
17b) kept the same plan shape and joins but gained a Memoize node from
better cardinality estimates.

Current Limitations
-------------------

This is a proof of concept. Known limitations include:

1. The current catalog design is not ideal. It is asymmetric (a
"primary" and an "other" table), which is natural for FK-like joins,
but less intuitive for other joins.

2. Stats collection piggybacks on ANALYZE of the primary table and
uses its single-column MCV for the join key. This can be inaccurate
when the MCV values on the "primary" side don't cover the important
values on the other side, or when the filter column isn't fully
dependent on the join key. A more accurate approach would execute the
actual join during collection, which could also decouple join stats
collection from single-table ANALYZE.

3. Currently limited to: equality join clauses, equality and IN filter
clauses, simple Var stats objects (no expressions), inner joins only,
and two-way joins only. Some of these are easier to extend; others may
be harder or unnecessary (like n-way joins).

4. Patch 0002 (auto-creation from FK constraints) should probably be
gated behind a GUC. I'm not strongly attached to this patch, but kept
it because FK joins seem like a natural and common use case.

Conclusion
----------

Even with all the current limitations, I think this experiment
suggests that join-level statistics can significantly improve plans,
as a single join stats object can materially improve workload
performance.

If there's interest, I'm happy to continue iterating on the design.

In particular, I'd welcome feedback on:
- whether this is a direction worth pursuing,
- the catalog design,
- the stats collection approach,
- the planner integration strategy,
- and scope (what kinds of joins / predicates are worth supporting).

Best,
Alex

[1]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf
[2]: https://github.com/gregrahn/join-order-benchmark
[3]: https://cedardb.com/docs/example_datasets/job/
[4]: https://github.com/l-wang/join-order-benchmark

--
Alexandra Wang
EDB: https://www.enterprisedb.com

Attachments:

16b_nl.txt (text/plain)
v1-0002-Automatically-create-join-MCV-statistics-from-FK-.patch (+894 -5)
v1-0001-Extend-CREATE-STATISTICS-syntax-for-join-MCV-stat.patch (+2943 -194)
16b.sql
16b_js.txt (text/plain)
16b_baseline.txt (text/plain)
#8 Corey Huinker
corey.huinker@gmail.com
In reply to: Alexandra Wang (#6)
Re: Is there value in having optimizer stats for joins/foreignkeys?

I have indeed started by implementing MCV statistics for joins,
because I have not found a case for joins that would benefit only from
ndistinct or functional dependency stats that MCV stats wouldn't help.

That was a big question I had, and I agree that we should only add
statistics as uses for them become apparent.

In my POC patch, I've made the following catalog changes:
- Add *stxotherrel (oid) *and *stxjoinkeys (int2vector)* fields to
*pg_statistic_ext*
- Use the existing *stxkeys (int2vector)* to store the stats object
attributes of *stxotherrel*
- Create *pg_statistic_ext_otherrel_index* on *(stxrelid, stxotherrel)*
- Add stxdjoinmcv* (pg_join_mcv_list)* to *pg_statistic_ext_data*

I like all these changes. Maybe "outer" rel rather than "other" rel, but it
really doesn't matter this early on.

To use them, we can let the planner detect patterns like this:

/*
* JoinStatsMatch - Information about a detected join pattern
* Used internally to track what was matched in a join+filter pattern
*/
typedef struct JoinStatsMatch
{
Oid target_rel; /* table OID of the estimation target */
AttrNumber targetrel_joinkey; /* target_rel's join column */
Oid other_rel; /* table OID of the filter source */
AttrNumber otherrel_joinkey; /* other_rel's join column */
List *filter_attnums; /* list of AttrNumbers for filter columns in other_rel */
List *filter_values; /* list of Datum constant values being filtered */
Oid collation; /* collation for comparisons */

/* Additional info to avoid duplicate work */
List *join_rinfos; /* list of join clause RestrictInfos */
RestrictInfo *filter_rinfo; /* the filter clause RestrictInfo */
} JoinStatsMatch;

and add the detection logic in clauselist_selectivity_ext() and
get_foreign_key_join_selectivity().

Statistics collection indeed needs the most thinking. For the
purpose of a POC, I added MCV join stats collection as part of ANALYZE
of one table (stxrelid in pg_statistic_ext). I can do this because MCV
join stats are somewhat asymmetric. It allows me to have a target
table (referencing table for foreign key join) to ANALYZE, and we can
use the already collected MCVs of the joinkey column on the target
table to query the rows in the other table. This greatly mitigates
performance impact compared to actually joining two tables. However,
if we are to support more complex joins or other types of join stats
such as ndistinct or functional dependency, I found it hard to define
who's the target table (referencing table) and who's the other table
(referenced table) outside of the foreign key join scenario. So I
think for those more complex cases eventually we may as well
perform the joins and collect the join stats separately. Alvaro
Herrera suggested offline that we could have a dedicated autovacuum
command option for collecting the join statistics.

I agree, we have to perform the joins, as the rows collected are inherently
dependent and skewed, and yes, it's going to be a maintenance overhead hit.

Initially I thought this could be mitigated somewhat by retaining row
samples for each table, but those samples would be independent of the
joins, and the values we need are inherently dependent.

I have experimented with two ways to define the join statistics:

1. Use CREATE STATISTICS:

CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ] [ ( mcv ) ] ON {
table_name1.column_name1 }, { table_name1.column_name2 } [, ...] FROM
table_name1 JOIN table_name2 ON table_name1.column_name3 =
table_name2.column_name4

We'll need to support aliases because there could be a self-join :(

Examples:
-- Create join MCV statistics on a single filter column (keyword)
CREATE STATISTICS movie_keyword_keyword_join_stats (mcv)
ON k.keyword
FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

-- Create join MCV statistics on multiple filter columns (keyword +
phonetic_code):
CREATE STATISTICS movie_keyword_keyword_multicols_join_stats (mcv)
ON k.keyword, k.phonetic_code
FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

This is where the existing CREATE STATISTICS syntax does not serve our
purposes well. We definitely want MCV stats for both of those k.* columns,
with the skew of values that join to movie_keyword on that defined foreign
key, but we'd end up getting the _combinations_ of keyword and
phonetic_code, which we don't necessarily care about.

We might want an alternate syntax

CREATE STATISTICS movie_keyword_keyword_multicols_join_stats (mcv)
ON keyword, phonetic_code
[FROM movie_keyword]
USING movie_keyword_fk;

In this case, the FROM clause is redundant and therefore optional; the
columns listed must exist on the confrelid, and the object is keyed to the
conrelid. Having said that, I think people don't really use constraint
names, so the join syntax will likely be used more often.

2. Auto join stats creation for Foreign Key + Functional Dependency stats

Initially, I did not implement the CREATE STATISTICS command to
create the join stats. Instead, I've implemented logic in ANALYZE to
detect functional dependency stats on the referenced table through FKs
and create join statistics implicitly for those cases.

I'm not excited about this, and others have expressed concern that it would
lead to an explosion of mediocre statistics objects.

#9Corey Huinker
corey.huinker@gmail.com
In reply to: Alexandra Wang (#7)
Re: Is there value in having optimizer stats for joins/foreignkeys?

Current Limitations
-------------------

This is a proof of concept. Known limitations include:

I really like this proof of concept.

1. The current catalog design is not ideal. It is asymmetric (a
"primary" and an "other" table), which is natural for FK-like joins,
but less intuitive for other joins

I think the asymmetry comes with the territory, and we will be creating the
join statistics that prove useful. If that means that we create one object
ON b.c1, b.c2 FROM a JOIN b... and another ON a.c3, a.c4 FROM b JOIN a...
then so be it.


2. Stats collection piggybacks on ANALYZE of the primary table and
uses its single-column MCV for the join key. This can be inaccurate
when the MCV values on the "primary" side don't cover the important
values on the other side, or when the filter column isn't fully
dependent on the join key. A more accurate approach would execute the
actual join during collection, which could also decouple join stats
collection from single-table ANALYZE.

Unfortunately, I think we will have to join the remote table_b to the row
sample on table_a to get accurate join statistics, and the best time to do
that is when we already have the row sample from table_a. We can further
batch up statistics objects that happen to join table_a to table_b by the
same join criteria to avoid rescans.

Will what you have work when we want to do an MCV on a mix of local and
remote columns, or will that require more work?

3. Currently limited to: equality join clauses, equality and IN filter
clauses, simple Var stats objects (no expressions), inner joins only,
and two-way joins only. Some of these are easier to extend; others may
be harder or unnecessary (like n-way joins).

I suspect that n-way joins would have very limited utility and could be
adequately covered with multiple join-stats objects.

4. Patch 0002 (auto-creation from FK constraints) should probably be
gated behind a GUC. I'm not strongly attached to this patch, but kept
it because FK joins seem like a natural and common use case.

I think this is like indexing, just because you can make all possible
columns indexed doesn't mean you should. Tooling will emerge to determine
what join stats objects are worth their weight, and create only those
objects.

If there's interest, I'm happy to continue iterating on the design.

In particular, I'd welcome feedback on:
- whether this is a direction worth pursuing,

Yes. Very much so.

- the catalog design,

I had somehow gotten the impression that you were going to take the
extended statistics format, but store individual columns in a modified
pg_statistic. That's working for now, but I wonder if that will still be
the case when we start to try this for columns that are arrays, ranges,
multiranges, tsvectors, etc. pg_statistic has the infrastructure to handle
those, but there may be good reason to keep pg_statistic focused on local
attributes and instead just keep adding new kinds to extended statistics
and let it be the grab-bag it was perhaps always meant to be.

In my own musings on how to implement this (which you have far exceeded
with this proof-of-concept), I had wondered how to derive stats for
k.keyword and k.phonetic_code individually from the definition
of movie_keywords2_multi_stats, but looking at what you've done I think
we're better off defining each statistics object very narrowly; if that
means we define one object per remote column and one object per interesting
combination of columns, then so be it. So long as we can calculate them all
from the same join of the two tables, we'll avoid the nasty overhead.

- and scope (what kinds of joins / predicates are worth supporting).

Between-ish clauses (x >= y AND x < z), etc., are what immediately come to
mind.

element_mcv for array types might be next, but that's well down the road.

#10Andrei Lepikhov
lepihov@gmail.com
In reply to: Alexandra Wang (#7)
Re: Is there value in having optimizer stats for joins/foreignkeys?

On 29/1/26 06:04, Alexandra Wang wrote:

Hi hackers,

As promised in my previous email, I'm sharing a proof-of-concept patch
exploring join statistics for correlated columns across relations.
This is a POC at this point, but I hope the performance numbers below
give a better idea of both the potential usefulness of join statistics
and the complexity of implementing them.

I wonder why you chose the JOIN operator only?

It seems to me that any relational operator produces relational output
that can be treated as a table. The extended statistics code may be
adapted to such relations.
I think it may be a VIEW that you can declare (manually or
automatically) and allow Postgres to build statistics on this 'virtual'
table. So, the main focus may shift to the question: how to provably
match a query subtree to a specific statistic.

--
regards, Andrei Lepikhov,
pgEdge

#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexandra Wang (#7)
Re: Is there value in having optimizer stats for joins/foreignkeys?

Hi Alexandra!

On 1/29/26 06:04, Alexandra Wang wrote:

Hi hackers,

As promised in my previous email, I'm sharing a proof-of-concept patch
exploring join statistics for correlated columns across relations.
This is a POC at this point, but I hope the performance numbers below
give a better idea of both the potential usefulness of join statistics
and the complexity of implementing them.

Great! Thanks for doing this, I think it's a very worthy improvement and
I'm grateful you're interested in working on this.

Join Order Benchmark (JOB)
---------------------------------------

I got interested in this work after reading a 2015 VLDB paper [1] that
introduced the Join Order Benchmark (JOB) [2].  JOB uses the IMDB
dataset, with queries having 3-16 joins (average 8). Unlike standard
benchmarks such as TPC-H which use randomly generated uniform and
independent data, JOB uses real-world data with correlations and
skewed distributions, which make cardinality estimation much harder.

The paper tested PostgreSQL among other databases. I reran the
benchmark on current PG 19devel to see where we stand today. (The
original data source is gone, but CedarDB hosts a mirror [3]; I
created a fork [4] and added some scripts. The schema DDLs and queries
are unchanged.)

Test setup:
MacBook, Apple M3 Pro
Postgres master branch (19devel), built with -O3
cold and warm runs, 3 runs each

Results (average of 3 runs):

Postgres 19devel branch:
Total execution time cold run: 140.75 s (±2.12 s)
Total execution time warm run: 117.72 s (±0.76 s)

Postgres 19devel branch, SET enable_nestloop = off:
Total execution time cold run: 87.16 s (±0.17 s)
Total execution time warm run: 86.68 s (±0.42 s)

These results suggest that the planner is frequently choosing nested
loop joins where hash or merge joins would be better.

(As a side note: with nestloops disabled, plans are dominated by hash
joins, so cold vs warm runs are very similar due to limited cache
reuse.)

Right. I have not investigated plans for the JOB queries, but a common
issue is that we pick a nested loop very early in the plan due to a poor
estimate, and that leads to a sequence of nested loops. Which just
amplifies the problem.

Is it possible to measure the accuracy of the estimates, not just the
execution time? I think it would be useful to know how much "closer" we
get to the actual cardinality. There may well be queries where the
execution time did not change much, but in a slightly different setting
it could easily be an issue.
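One standard way to quantify that (used in the cardinality-estimation
literature around JOB) is the q-error: the factor by which an estimate is
off, regardless of direction. A minimal sketch:

```python
# q-error: max(est/act, act/est). A perfect estimate scores 1;
# over- and underestimates of the same magnitude score alike.

def q_error(estimated, actual):
    estimated = max(estimated, 1)   # clamp to avoid division by zero
    actual = max(actual, 1)
    return max(estimated / actual, actual / estimated)

print(q_error(34, 41840))   # the innermost join of 16b: ~1230x off
print(q_error(100, 100))    # perfect estimate: 1.0
```

Reporting, say, median and maximum q-error per query across the benchmark
would show the estimation improvement independently of plan changes.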

Problem
--------

The slowest query in this workload is 16b.sql (attached). Its plan
contains 7 nested loop joins.

The innermost join is estimated as 34 rows but actually produces
41,840 rows — a ~1230x underestimation:

Nested Loop  (cost=6.80..3813.14 rows=34 width=4) (actual
time=2.336..304.160 rows=41840 loops=1)
  -> Seq Scan on keyword k (cost=0.00..2685.11 rows=1 width=4) (actual
time=0.402..6.161 rows=1.00 loops=1)
     Filter: (keyword = 'character-name-in-title')
  -> Bitmap Heap Scan on movie_keyword mk (cost=6.80..1124.98 rows=305
width=8) (actual time=1.933..295.493 rows=41840.00 loops=1)
     Recheck Cond: (k.id = keyword_id)

This underestimation cascades to the outermost join: ~3K estimated vs
~3.7M actual rows.

The problem can be reduced to the following query:

SELECT * FROM movie_keyword mk, keyword k
WHERE k.keyword = 'character-name-in-title'
AND k.id = mk.keyword_id;

The tables:

keyword(id PK, keyword, phonetic_code)
movie_keyword(id PK, movie_id, keyword_id)

These tables are strongly correlated via keyword_id -> keyword.id. We
can even add:

ALTER TABLE movie_keyword
  ADD CONSTRAINT movie_keyword_keyword_id_fkey
  FOREIGN KEY (keyword_id) REFERENCES keyword(id);
CREATE STATISTICS keyword_stats (dependencies)
  ON id, keyword FROM keyword;
ANALYZE keyword, movie_keyword;

However, even with the FK constraint and extended statistics on the
keyword table, the estimate remains unchanged:

SELECT * FROM check_estimated_rows(
  'SELECT * FROM movie_keyword mk, keyword k
   WHERE k.keyword = ''character-name-in-title''
     AND k.id = mk.keyword_id');
 estimated | actual
-----------+--------
        34 |  41840

This indicates that the planner cannot connect the skewed distribution
of the join key on the referencing table with the filter selectivity
on the referenced table as it applies to the join result. This is the
gap that join statistics aim to fill.

Agreed, I've seen various cases like this too.
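For what it's worth, the 34-row estimate can be reproduced by hand: the
planner applies the filter selectivity to the join result as if it were
independent of the join key. A sketch, using approximate IMDB table sizes
(treat the exact numbers as illustrative assumptions):

```python
# Back-of-the-envelope reproduction of the 34-row estimate
# (table sizes are approximate IMDB figures).

mk_rows = 4_523_930        # movie_keyword
k_rows  = 134_170          # keyword
filter_sel = 1 / k_rows    # one keyword row matches the equality filter

# FK semantics: every movie_keyword row joins to exactly one keyword
# row, so the join has mk_rows rows; the filter selectivity is then
# applied as if independent of the join key:
estimate = mk_rows * filter_sel
print(round(estimate))     # ~34, vs 41,840 actual: the skew is invisible
```

The 41,840 actual rows mean this one keyword is referenced ~1200x more
often than the average keyword, which is exactly the information a
join-level MCV captures and per-table stats cannot.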

Proposed Solution
------------------

I've attached two patches:

0001: Extend CREATE STATISTICS for join MCV statistics

+1 to do this (I'm not sure about restricting it to MCV, but that's a
detail we can discuss later)

0002: Automatically create join MCV statistics from FK constraints

I think this is very unlikely, for reasons I already explained earlier
in this thread. Essentially, building and processing the stats is a
cost, and most FKs probably don't actually need it. We don't build
indexes on FKs for similar reasons. And it's not clear to me how we
would know which columns to define the statistics on.

Of course, it's up to you to keep working on this, and try to convince
us it's a good idea. I personally think it's unlikely to get accepted,
but perhaps I'm wrong.

0001 is the core patch. 0002 is an optional convenience feature
that detects FK relationships during ANALYZE and auto-creates
join stats.

Syntax:

CREATE STATISTICS <name> (mcv)
  ON <other_table>.<filter_col> [, ...]
  FROM <primary_table> alias
  JOIN <other_table> alias ON (<join_condition>);

Example -- single filter column:

CREATE STATISTICS mk_keyword_stats (mcv)
  ON k.keyword
  FROM movie_keyword mk
  JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

Example -- multiple filter columns:

CREATE STATISTICS mk_multi_stats (mcv)
  ON k.keyword, k.phonetic_code
  FROM movie_keyword mk
  JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

Catalog changes:

pg_statistic_ext:
- New stats kind 'c' (join MCV) in stxkind
- New field stxotherrel (Oid): the other table in the join
- New field stxjoinkeys (int2vector): join column pair [primary_joinkey,
  other_joinkey]
- Existing stxkeys stores the filter column(s) on stxotherrel
- New index pg_statistic_ext_otherrel_index on (stxrelid, stxotherrel)

- Why do we need a new stats kind? I'd have expected we'd use the same
'm' value as for a single-relation stats, simply because a join result
is just a relation too.

- For a PoC it's OK to have stxotherrel, but the final patch should not
be restricted to two relations. I think there needs to be an array of
the joined rel OIDs, or something like that.

FWIW I recognize figuring out the catalog changes is not easy - in fact
I always found it so intimidating that I ended up procrastinating and not
making any progress on join statistics. So I really appreciate you
taking a stab at this.

pg_statistic_ext_data:
- New field stxdjoinmcv (pg_join_mcv_list): serialized join MCV data

I don't understand why we need this. AFAICS we should just treat the
join result as a relation, and use the current pg_statistic_ext_data fields.

How stats are collected:

Join statistics are collected during ANALYZE of the primary table
(stxrelid). The current approach assumes a dependency relationship
between the join key column and the filter column(s) on the other
table. Specifically, during ANALYZE, we reuse the already-computed
single-column MCV stats for the primary table's join key column. For
each MCV value, we look up the matching row in the other table via the
join key and extract the filter column values, carrying over the
primary-side MCV frequency. This is not accurate in all cases, but
works reasonably well for the POC, especially for foreign-key-like
joins where the primary table's join key distribution is
representative of the join result.

For the PoC it's good enough, but I don't think we can rely on the
per-column MCV lists like this.

For example, the catalog contents look like:

stxrelid      | stxotherrel | stxjoinkeys | stxkeys | stxkind
--------------+-------------+-------------+---------+--------
movie_keyword | keyword     | 2 1         | 2       | {c}

index | values       | nulls | frequency
------+--------------+-------+----------
    0 | {keyword_1}  | {f}   |      0.06
    1 | {keyword_2}  | {f}   |      0.06
    2 | {keyword_3}  | {f}   |      0.06
  ... | ...          | ...   |       ...

How the planner uses join stats:

I added the join pattern detection logic in
clauselist_selectivity_ext() and get_foreign_key_join_selectivity().
The planner looks for a pattern of:

- An equijoin clause between two relations (the join key pair), and
- Equality or IN filter clauses on columns of one relation (the filter
  columns)

When this pattern is found and matching join MCV stats exist, the
planner compares the filter values against the stored MCV items to
compute the join selectivity, replacing the default estimate.
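In rough pseudocode (hypothetical Python, not the actual planner C), the
estimation step amounts to summing the stored MCV frequencies for the
matched filter values, falling back to the default estimate on a miss:

```python
# Sketch of join-MCV selectivity lookup (names and frequencies are
# hypothetical illustrations).

def join_mcv_selectivity(mcv_items, filter_values):
    """mcv_items: list of (values_tuple, frequency) from the join MCV.
    filter_values: tuple of constants from the equality/IN clauses.
    Returns the matched frequency as a selectivity, or None when
    nothing matched and the caller should use the default estimate."""
    sel = sum(freq for vals, freq in mcv_items if vals == filter_values)
    return sel if sel > 0 else None

mcv = [(("character-name-in-title",), 0.00925),
       (("sequel",), 0.00310)]

print(join_mcv_selectivity(mcv, ("character-name-in-title",)))
print(join_mcv_selectivity(mcv, ("no-such-keyword",)))  # None -> fallback
```

Because the frequencies were computed over the join result, multiplying
the returned selectivity by the join's row count directly yields the
skew-aware cardinality.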

JOB Benchmark Results
---------------------

I created a single join statistics object for the (movie_keyword,
keyword) pair and reran the JOB benchmark:

CREATE STATISTICS movie_keyword_keyword_stats (mcv)
  ON k.keyword
  FROM movie_keyword mk
  JOIN keyword k ON (mk.keyword_id = k.id);

Total execution times (average of 3 runs):

                      Cold run          Warm run
Default (baseline)  140.75s (+/-2.1)  117.72s (+/-0.8)
enable_nestloop=off   87.16s (+/-0.2)   86.68s (+/-0.4)
With join stats       85.81s (+/-3.4)   65.24s (+/-0.5)

With a single join statistics object:
- Cold run: -39% vs baseline (140.75s -> 85.81s),
  comparable to enable_nestloop=off
- Warm run: -45% vs baseline (117.72s -> 65.24s),
  -25% vs enable_nestloop=off (86.68s -> 65.24s)

Per-query times for the 10 slowest (out of 113) queries on baseline
(master branch, enable_nestloop = on):

  Query | Baseline  | No Nestloop | Join Stats | Best
  ------+-----------+-------------+------------+----------
  16b   | 10,673 ms |   2,789 ms  |  2,775 ms  | 3.8x JS
  17e   |  7,709 ms |   2,260 ms  |  2,149 ms  | 3.6x JS
  17f   |  7,506 ms |   2,304 ms  |  2,460 ms  | 3.3x NL
  17a   |  7,288 ms |   1,453 ms  |  2,388 ms  | 5.0x NL
  17b   |  6,490 ms |   1,580 ms  |  5,413 ms  | 4.1x NL
  17c   |  6,240 ms |   1,095 ms  |  5,268 ms  | 5.7x NL
  17d   |  6,261 ms |   1,263 ms  |  5,291 ms  | 5.0x NL
  6d    |  6,234 ms |     988 ms  |  1,565 ms  | 6.3x NL
  25c   |  6,133 ms |   1,848 ms  |  1,738 ms  | 3.5x JS
  6f    |  5,728 ms |   1,785 ms  |  1,556 ms  | 3.7x JS

  (JS = join stats fastest, NL = no-nestloop fastest)

All 10 queries improved over baseline. For 16b, 3 nested loop joins
were replaced with 2 hash joins and 1 merge join. Some queries (e.g.
17b) kept the same plan shape and joins but gained a Memoize node from
better cardinality estimates.

Encouraging results, considering the limitations of the PoC patch. As
mentioned earlier, I'm curious how much the estimates improved, not just
the execution time.

Current Limitations
-------------------

This is a proof of concept. Known limitations include:

1. The current catalog design is not ideal. It is asymmetric (a
"primary" and an "other" table), which is natural for FK-like joins,
but less intuitive for other joins.

Yeah, that's not ideal. I think we'd like a catalog struct that works
for joins of more than 2 tables too.

2. Stats collection piggybacks on ANALYZE of the primary table and
uses its single-column MCV for the join key.  This can be inaccurate
when the MCV values on the "primary" side don't cover the important
values on the other side, or when the filter column isn't fully
dependent on the join key. A more accurate approach would execute the
actual join during collection, which could also decouple join stats
collection from single-table ANALYZE.

Agreed. I don't think we can piggyback on the single-column MCV lists
for the reasons you mention.

3. Currently limited to: equality join clauses, equality and IN filter
clauses, simple Var stats objects (no expressions), inner joins only,
and two-way joins only. Some of these are easier to extend; others may
be harder or unnecessary (like n-way joins).

4. Patch 0002 (auto-creation from FK constraints) should probably be
gated behind a GUC. I'm not strongly attached to this patch, but kept
it because FK joins seem like a natural and common use case.

Conclusion
----------

Even with all the current limitations, I think this experiment
suggests that join-level statistics can significantly improve plans,
as a single join stats object can materially improve workload
performance.

If there's interest, I'm happy to continue iterating on the design.

In particular, I'd welcome feedback on:
- whether this is a direction worth pursuing,

Yes, please!

- the catalog design,

I think we want a catalog that works for joins of more than 2 rels, etc.

- the stats collection approach,

The best approach I'm aware of is the "Index-Based Join Sampling" paper
I mentioned earlier in this thread. I think it's OK to allow building
statistics only for joins compatible with that sampling approach.

It's not clear to me what should trigger building the stats. I assume
it'd be done by ANALYZE, but that's per-table. So would it be triggered
by ANALYZE on any of the joined tables, or just the "main" one? Would it
require a new ANALYZE option, or something else? Not sure.

- the planner integration strategy,

I don't have a clear opinion on this yet. clauselist_selectivity_ext
seems like the right place, probably. I vaguely recall the challenge was
we either saw the per-relation restriction clauses or join clauses, but
not both at the same time?

- and scope (what kinds of joins / predicates are worth supporting).

IMHO it's fine to keep the scope rather narrow, at least for v1. As long
as we can later expand it to other cases, it's fine. The extended stats
initially did not support expressions, for example.

It's a bit orthogonal, but do you have an opinion on improving estimates
for some joins by considering per-table extended statistics in the
existing selectivity estimators (e.g. in eqjoinsel_inner)? I think it
might help cases with multi-clause join clauses, or cases where we have
a multi-column MCV list on the join key + WHERE conditions.

It's separate from what this patch aims to do, but I have imagined we'd
do that as the first step, before doing the proper join stats. I'm not
suggesting you have to do that too (or before this patch).

regards

--
Tomas Vondra

#12Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Corey Huinker (#9)
Re: Is there value in having optimizer stats for joins/foreignkeys?

On 1/30/26 08:29, Corey Huinker wrote:

Current Limitations
-------------------

This is a proof of concept. Known limitations include:

I really like this proof of concept.

 

1. The current catalog design is not ideal. It is asymmetric (a
"primary" and an "other" table), which is natural for FK-like joins,
but less intuitive for other joins

I think the asymmetry comes with the territory, and we will be creating
the join statistics that prove useful. If that means that we create one
object ON b.c1, b.c2 FROM a JOIN b... and another ON a.c3, a.c4 FROM b
JOIN a...  then so be it.
 


2. Stats collection piggybacks on ANALYZE of the primary table and
uses its single-column MCV for the join key.  This can be inaccurate
when the MCV values on the "primary" side don't cover the important
values on the other side, or when the filter column isn't fully
dependent on the join key. A more accurate approach would execute the
actual join during collection, which could also decouple join stats
collection from single-table ANALYZE.

Unfortunately, I think we will have to join the remote table_b to the
row sample on table_a to get accurate join statistics, and the best time
to do that when we already have the row sample from table_a. We can
further batch up statistics objects that happen to join table_a to
table_b by the same join criteria to avoid rescans.

IIRC the index-based sampling (described in the paper I mentioned) works
something like this - sample leading table, then use indexes to lookup
data from the other tables to build a statistically correct sample of
the whole join.
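A toy sketch of that idea (nowhere near the paper's full algorithm; just
uniform sampling of the leading table plus index lookups, with
illustrative names):

```python
import random

def index_join_sample(table_a, index_b, join_key, n, seed=0):
    """Sample n rows of table_a uniformly, then use an index (here a
    plain dict keyed on the join column) to fetch the matching rows
    of table_b, yielding a sample drawn from the join result itself."""
    rng = random.Random(seed)
    sample = rng.sample(table_a, min(n, len(table_a)))
    joined = []
    for row_a in sample:
        for row_b in index_b.get(row_a[join_key], []):
            joined.append({**row_a, **row_b})
    return joined

# Toy tables: 100 referencing rows, an "index" on the referenced side
a = [{"keyword_id": i % 3} for i in range(100)]
idx_b = {0: [{"keyword": "x"}], 1: [{"keyword": "y"}]}  # key 2 unmatched

print(len(index_join_sample(a, idx_b, "keyword_id", 10)))
```

The key property is that the resulting rows come from the join output,
so any stats built on them reflect the dependent, skewed distribution
rather than the product of two independent per-table samples.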

Will what you have work when we want to do an MCV on a mix of local and
remote columns, or will that require more work?
 

3. Currently limited to: equality join clauses, equality and IN filter
clauses, simple Var stats objects (no expressions), inner joins only,
and two-way joins only. Some of these are easier to extend; others may
be harder or unnecessary (like n-way joins).

I suspect that n-way joins would have very limited utility and could be
adequately covered with multiple join-stats objects.
 

I see no clear reason why that would be the case, and it's not that hard
to construct joins on 3+ tables where stats on 2-way joins won't be good
enough. Whether those cases are sufficiently common I don't know, but I
also don't see a good reason to not address them - it should not be much
more complex than 2-way joins (famous last words, I know).

4. Patch 0002 (auto-creation from FK constraints) should probably be
gated behind a GUC. I'm not strongly attached to this patch, but kept
it because FK joins seem like a natural and common use case.

I think this is like indexing, just because you can make all possible
columns indexed doesn't mean you should. Tooling will emerge to
determine what join stats objects are worth their weight, and create
only those objects.

Agreed.

If there's interest, I'm happy to continue iterating on the design.

In particular, I'd welcome feedback on:
- whether this is a direction worth pursuing,

Yes. Very much so.
 

- the catalog design,

I had somehow gotten the impression that you were going to take the
extended statistics format, but store individual columns in a modified
pg_statistic. That's working for now, but I wonder if that will still be
the case when we start to try this for columns that are arrays, ranges,
multiranges, tsvectors, etc. pg_statistic has the infrastructure to
handle those, but there may be good reason to keep pg_statistic focused
on local attributes and instead just keep adding new kinds to extended
statistics and let it be the grab-bag it was perhaps always meant to be.

In my own musings on how to implement this (which you have far exceeded
with this proof-of-concept), I had wondered how to derive stats for
k.keyword and k.phonetic_code individually from the definition
of movie_keywords2_multi_stats, but looking at what you've done I think
we're better off defining each statistics object very narrowly; if that
means we define one object per remote column and one object per
interesting combination of columns, then so be it. So long as we can
calculate them all from the same join of the two tables, we'll avoid the
nasty overhead.
 

Not sure I understand this. Why would this use pg_statistic at all? I
imagined we'd just treat the join as a relation, and store the stats in
pg_statistic_data_ext. The only difference is we need to store info
about the join itself (which rels / conditions), which would go into
pg_statistic_ext.

- and scope (what kinds of joins / predicates are worth supporting).

between-ish clauses ( x>= y AND x < z), etc is what immediately comes to
mind.

element_mcv for array types might be next, but that's well down the road.

Maybe. In v1 we should focus on mimicking what we have for per-relation
stats, and only then try doing something more.

regards

--
Tomas Vondra

#13Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andrei Lepikhov (#10)
Re: Is there value in having optimizer stats for joins/foreignkeys?

On 1/31/26 12:18, Andrei Lepikhov wrote:

On 29/1/26 06:04, Alexandra Wang wrote:

Hi hackers,

As promised in my previous email, I'm sharing a proof-of-concept patch
exploring join statistics for correlated columns across relations.
This is a POC at this point, but I hope the performance numbers below
give a better idea of both the potential usefulness of join statistics
and the complexity of implementing them.

I wonder why you chose the JOIN operator only?

It seems to me that any relational operator produces relational output
that can be treated as a table. The extended statistics code may be
adapted to such relations.
I think it may be a VIEW that you can declare (manually or
automatically) and allow Postgres to build statistics on this 'virtual'
table. So, the main focus may shift to the question: how to provably
match a query subtree to a specific statistic.

Because for each "supported" operator we need to know two things:

(1) how to sample it efficiently

(2) how to apply it in selectivity estimation

We can't add support for everything at once, and for some cases we may
not even know answers to (1) and/or (2).

We can't simply store an opaque VIEW and build the stats by executing it
(and sampling the results). The whole premise of extended stats is that
people define them to fix incorrect estimates. And with incorrect
estimates the plan may be terrible, and the VIEW may not even complete.

regards

--
Tomas Vondra

#14Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Corey Huinker (#8)
Re: Is there value in having optimizer stats for joins/foreignkeys?

On 1/30/26 07:26, Corey Huinker wrote:

I have indeed started by implementing MCV statistics for joins,
because I have not found a case for joins that would benefit only from
ndistinct or functional dependency stats that MCV stats wouldn't help.

That was a big question I had, and I agree that we should only add
statistics as uses for them become apparent.
 

I don't have a clear opinion on this, and maybe we don't need to build
some of the stats. But I think there's a subtle point here - these stats
may not be useful for estimation of the join itself, but could be useful
for estimating the upper part of the query.

For example, imagine you have a join, with an aggregation on top.

SELECT t1.a, t2.b, COUNT(*) FROM t1 JOIN t2 ON (...) GROUP BY 1, 2;

How would you estimate the number of groups for the aggregate without
having the ndistinct estimate on the join result?
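With only per-table stats, about the best available fallback is to
multiply the per-column ndistinct values and clamp by the join size, which
can badly overestimate when the grouping columns are correlated through
the join. A rough sketch of why that goes wrong:

```python
# Group-count estimation for GROUP BY t1.a, t2.b over a join
# (simplified sketch of the independence-based fallback, not the
# actual estimate_num_groups() logic).

def groups_independent(ndistinct_a, ndistinct_b, join_rows):
    # Without join-level stats: assume the grouping columns vary
    # independently, then clamp by the number of input rows.
    return min(ndistinct_a * ndistinct_b, join_rows)

# If t2.b is functionally determined by t1.a through the join key,
# the true group count is just ndistinct(a) = 1,000 -- information
# only a join-level ndistinct statistic can provide.
print(groups_independent(1_000, 1_000, 500_000))  # 500000: clamped, still 500x off
```

So even where MCVs fix the join cardinality itself, a join-level
ndistinct may be what saves the aggregation above it.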

In my POC patch, I've made the following catalog changes:
- Add stxotherrel (oid) and stxjoinkeys (int2vector) fields to
pg_statistic_ext
- Use the existing stxkeys (int2vector) to store the stats object
attributes of stxotherrel
- Create pg_statistic_ext_otherrel_index on (stxrelid, stxotherrel)
- Add stxdjoinmcv (pg_join_mcv_list) to pg_statistic_ext_data

I like all these changes. Maybe "outer" rel rather than "other" rel, but
it really doesn't matter this early on.

Already commented on this earlier - I don't think we want to restrict
the joins to 2-way joins. So this "outerrel" thing won't work.

To use them, we can let the planner detect patterns like this:

/*
* JoinStatsMatch - Information about a detected join pattern
* Used internally to track what was matched in a join+filter pattern
*/
typedef struct JoinStatsMatch
{
Oid target_rel; /* table OID of the estimation target */
AttrNumber targetrel_joinkey; /* target_rel's join column */
Oid other_rel; /* table OID of the filter source */
AttrNumber otherrel_joinkey; /* other_rel's join column */
List *filter_attnums; /* list of AttrNumbers for filter columns in
other_rel */
List *filter_values; /* list of Datum constant values being filtered */
Oid collation; /* collation for comparisons */

/* Additional info to avoid duplicate work */
List *join_rinfos; /* list of join clause RestrictInfos */
RestrictInfo *filter_rinfo; /* the filter clause RestrictInfo */
} JoinStatsMatch;

and add the detection logic in clauselist_selectivity_ext() and
get_foreign_key_join_selectivity(). 

Statistics collection indeed needs the most thinking. For the
purpose of a POC, I added MCV join stats collection as part of ANALYZE
of one table (stxrel in pg_statistic_ext). I can do this because MCV
join stats are somewhat asymmetric. It allows me to have a target
table (referencing table for foreign key join) to ANALYZE, and we can
use the already collected MCVs of the joinkey column on the target
table to query the rows in the other table. This greatly mitigates
performance impact compared to actually joining two tables. However,
if we are to support more complex joins or other types of join stats
such as ndistinct or functional dependencies, I found it hard to define
which table is the target (referencing) table and which is the other
(referenced) table outside of the foreign key join scenario. So I
think for those more complex cases we may eventually have to
perform the joins and collect the join stats separately. Alvaro
Herrera suggested offline that we could have a dedicated autovacuum
command option for collecting the join statistics.

I agree, we have to perform the joins, as the rows collected are
inherently dependent and skewed, and yes, it's going to be a maintenance
overhead hit.

Initially I thought this could be mitigated somewhat by retaining
rowsamples for each table, but those row samples would be independent of
the joins, and the values we need are inherently dependent.
 

Joining per-table samples doesn't work (which doesn't mean having samples
would not be useful, just not for joins). But I already mentioned the
paper I think describes how to build a sample for a join. Now that I
think of it I might have even posted a PoC patch implementing it using
SPI or something like that - but that'd be years ago, at this point.

regards

--
Tomas Vondra

#15Corey Huinker
corey.huinker@gmail.com
In reply to: Tomas Vondra (#11)
Re: Is there value in having optimizer stats for joins/foreignkeys?

Right. I have not investigated plans for the JOB queries, but a common
issue is that we pick a nested loop very early in the plan due to a poor
estimate, and that leads to a sequence of nested loops. Which just
amplifies the problem.

Even if the plan still uses nested loops, there are other things that
benefit from better row estimates. For instance, I've seen a case where the
planner decided to apply JIT to the query because it expected 40 million
rows of results... and it got 4.

Catalog changes:

pg_statistic_ext:
- New stats kind 'c' (join MCV) in stxkind
- New field stxotherrel (Oid): the other table in the join
- New field stxjoinkeys (int2vector): join column pair [primary_joinkey,
other_joinkey]
- Existing stxkeys stores the filter column(s) on stxotherrel
- New index pg_statistic_ext_otherrel_index on (stxrelid, stxotherrel)

- Why do we need a new stats kind? I'd have expected we'd use the same
'm' value as for a single-relation stats, simply because a join result
is just a relation too.

- For a PoC it's OK to have stxotherrel, but the final patch should not
be restricted to two relations. I think there needs to be an array of
the joined rel OIDs, or something like that.

A parallel array of joined rel oids could lead to some refetching of
relation tuples unless we insisted that the keys be sorted the way we
currently sort stxkeys.

However, a multi-relation extended statistic object runs into a practical
issue when we fetch the remote rows that join to the local rowsample. We
could do that with an EphemeralNamedRelation and SPI (yuck), or more likely
we would leverage the index defined for the foreign key that we're
requiring (we already do this for referential integrity lookups on
inserts). Fishing those foreign keys out of a multi-table select is
doable, but doesn't sound like fun. If we go that route then the
inability to find a supporting foreign key (and the index that goes with
it) should fail the statistic object creation.

So an array of RI constraint Oids (InvalidOid meaning "the attnum in the same
array position represents a local column/expression") would give us the
ability to do MCV combinations of columns from N tables along with the
local table, provided that there are fk constraints for each of the joins.

FWIW I recognize figuring out the catalog changes is not easy - in fact
I always found it so intimidating that I ended up procrastinating and not
making any progress on join statistics. So I really appreciate you
taking a stab at this.

+1

pg_statistic_ext_data:
- New field stxdjoinmcv (pg_join_mcv_list): serialized join MCV data

I don't understand why we need this. AFAICS we should just treat the
join result as a relation, and use the current pg_statistic_ext_data
fields.

As I said above, we'd need a parallel array of constraint keys, rel ids, or
both to make that work.

How stats are collected:

Join statistics are collected during ANALYZE of the primary table
(stxrelid). The current approach assumes a dependency relationship
between the join key column and the filter column(s) on the other
table. Specifically, during ANALYZE, we reuse the already-computed
single-column MCV stats for the primary table's join key column. For
each MCV value, we look up the matching row in the other table via the
join key and extract the filter column values, carrying over the
primary-side MCV frequency. This is not accurate in all cases, but
works reasonably well for the POC, especially for foreign-key-like
joins where the primary table's join key distribution is
representative of the join result.

For the PoC it's good enough, but I don't think we can rely on the
per-column MCV lists like this.

Agreed. We must go to the source.

The best approach I'm aware of is the "Index-Based Join Sampling" paper
I mentioned earlier in this thread. I think it's OK to allow building
statistics only for joins compatible with that sampling approach.

It's not clear to me what should trigger building the stats. I assume
it'd be done by ANALYZE, but that's per-table. So would it be triggered
by ANALYZE on any of the joined tables, or just the "main" one? Would it
require a new ANALYZE option, or something else? Not sure.

Just the main one. Madness lies in the alternative.

#16Andrei Lepikhov
lepihov@gmail.com
In reply to: Tomas Vondra (#13)
Re: Is there value in having optimizer stats for joins/foreignkeys?

On 1/2/26 17:39, Tomas Vondra wrote:

We can't simply store an opaque VIEW, and build the stats by simply
executing it (and sampling the results). The whole premise of extended
stats is that people define them to fix incorrect estimates. And with
incorrect estimates the plan may be terrible, and the VIEW may not even
complete.

Ok, I got the point.
I think linking to a join or foreign key seems restrictive. In my mind,
extended statistics may go the following way:

CREATE STATISTICS abc_stat ON (t1.x,t2.y,t3.z) FROM t1,t2,t3;

Suppose t1.x,t2.y, and t3.z have a common equality operator.

Here we can build statistics on (t1.x = t2.y), (t1.x = t3.z), (t2.y =
t3.z), and potentially (t1.x = t2.y = t3.z).

But I don't frequently detect problems with JOIN estimation using a
single join clause. Usually, we have problems with (I) join trees
(clauses spread across joins) and (II) a single multi-clause join.
We can't solve (I) here (kinda statistics on a VIEW might help, I
think), but may ease (II) using:

CREATE STATISTICS abc_stat ON ((t1.x=t2.x),(t1.y=t2.y)) FROM t1,t2;

or even more bravely:

CREATE STATISTICS abc_stat ON ((t1.x=t2.x),(t1.y=t2.y)) FROM t1,t2
WHERE (t1.z <> t2.z);

--
regards, Andrei Lepikhov,
pgEdge

#17Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andrei Lepikhov (#16)
Re: Is there value in having optimizer stats for joins/foreignkeys?

On 2/2/26 10:53, Andrei Lepikhov wrote:

On 1/2/26 17:39, Tomas Vondra wrote:

We can't simply store an opaque VIEW, and build the stats by simply
executing it (and sampling the results). The whole premise of extended
stats is that people define them to fix incorrect estimates. And with
incorrect estimates the plan may be terrible, and the VIEW may not even
complete.

Ok, I got the point.
I think linking to a join or foreign key seems restrictive. In my mind,
extended statistics may go the following way:

I agree we don't need to restrict to joins on foreign keys. I assume the
PoC patch requires f-keys because it makes building the join sample much
simpler / easier to think about.

The "sampling done right" paper I mentioned explains how to sample
general joins, as long as there are appropriate indexes. Foreign keys
always have those, but the constraint itself is not needed.

FWIW I think it's perfectly acceptable to allow extended stats only on
joins with appropriate indexes. Because without that we can't do the
sampling efficiently (or possibly at all).

But I'd leave this for later. For now it's perfectly fine to limit the
scope to FK joins, and then maybe expand the scope once we figure out
the other pieces.

CREATE STATISTICS abc_stat ON (t1.x,t2.y,t3.z) FROM t1,t2,t3;

Suppose t1.x,t2.y, and t3.z have a common equality operator.

Here we can build statistics on (t1.x = t2.y), (t1.x = t3.z), (t2.y =
t3.z), and potentially (t1.x = t2.y = t3.z).

If I understand correctly you suggest we generate all "possible" joins
joining on the columns specified in the ON clause.

I think we shouldn't do that, as it confuses the purpose of the
ON clause. That's meant to specify the list of columns on which to build
the extended statistic, but now it would also be generating join clauses.

It has to be possible to have non-join attributes in the ON clause,
because that's how we can track correlation between the tables. Which is
the whole point, I think. It's not about the join clause selectivity, or
at least not just about it.

Moreover, wouldn't it be rather inefficient? Imagine you have a join
with two tables and two join clauses. (t1.a = t2.a) AND (t1.b = t2.b).
But with your syntax it'd be just

CREATE STATISTICS s ON (t1.a, t1.b, t2.a, t2.b) FROM t1, t2;

And we'd have to build stats for (at least) 2 joins, because we have no
idea if t1.a joins to t2.a or t2.b.

So -1 to this, IMHO we need the "full" syntax with

CREATE STATISTICS s ON (t1.c, t2.d)
FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b);

We may need some additional statistics to track the selectivity of the
join clauses, in addition to the existing MCV stats built on the join
result.

But I don't frequently detect problems with JOIN estimation using a
single join clause. Usually, we have problems with (I) join trees
(clauses spread across joins) and (II) a single multi-clause join.
We can't solve (I) here (kinda statistics on a VIEW might help, I
think), but may ease (II) using:

CREATE STATISTICS abc_stat ON ((t1.x=t2.x),(t1.y=t2.y)) FROM t1,t2;

or even more bravely:

CREATE STATISTICS abc_stat ON ((t1.x=t2.x),(t1.y=t2.y)) FROM t1,t2
WHERE (t1.z <> t2.z);

I honestly don't see why this would be better / simpler than the CREATE
STATISTICS grammar that simply allows joins in the FROM part.

regards

--
Tomas Vondra