Is there value in having optimizer stats for joins/foreignkeys?

Started by Corey Huinker · 5 months ago · 17 messages · hackers
#1 Corey Huinker
corey.huinker@gmail.com

Threads like [1] and [2] have gotten me thinking that there may be some
value in storing statistics about joins.

For the sake of argument, assume a table t1 with a column t2id which
references the pk of table t2 that has columns t2.t2id, t2c1, t2c2, t2c3.
In such a situation I can envision the following statistics being collected:

* The % of rows in t2 that are referenced at least once in t1
* The attribute stats (i.e. pg_statistic stats) for t2c1, t2c2, t2c3, but
associated with t1 and weighted according to the frequency of that row
being referenced, which means that values of unreferenced rows are filtered
out entirely.
* That's about it for direct statistics, but I could see creating extended
statistics for correlations between a local column value and a remote
column, or expressions on the remote columns, etc.
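
To make the "weighted" idea concrete, here is a minimal sketch (toy data
and names made up; not a proposed implementation) of how the referenced
fraction and the weighted attribute stats would fall out:

```python
from collections import Counter

# Hypothetical sample: t1.t2id values (with repeats) and t2 rows keyed by pk.
t1_t2id = [1, 1, 1, 2, 2, 5]                               # fk values in t1
t2 = {1: "red", 2: "red", 3: "blue", 4: None, 5: "green"}  # t2id -> t2c1

# % of t2 rows referenced at least once in t1
referenced = set(t1_t2id) & set(t2)
ref_frac = len(referenced) / len(t2)

# t2c1 stats weighted by reference frequency: each t1 row contributes its
# joined t2c1 value, so unreferenced t2 rows drop out entirely.
weighted = Counter(t2[k] for k in t1_t2id)
mcv = weighted.most_common()
print(ref_frac)  # 0.6
print(mcv)       # [('red', 5), ('green', 1)]
```

Note how 'blue' (and the NULL row) vanish from the weighted MCV because no
t1 row references them.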

The storage feels like it would be identical to pg_statistic but with a
"starefrelid" field that identifies the referencing table.

That much seems straightforward. A bigger problem is how we'd manage to
collect these statistics. We could (as Jeff Davis has suggested) keep our
tablesamples, but that wouldn't necessarily help in this case because the
set of rows referenced, and their relative weightings, would change since
the last sampling. In a worst-case scenario, we would have to sample the
joined-to tables as well, and that's an additional burden on an already
IO-intensive operation.

In theory, we could do some of this without any additional stats
collection. If the ndistinct of t1.t2id is, say, at least 75% of the
ndistinct of t2.t2id, we could just peek at the attribute stats on t2 and
use them for estimates. However, that makes some assumptions that the stats
on t2 are approximately as fresh as the stats on t1, and I don't think that
will be the case most of the time.

CCing people who have wondered out loud about this topic within earshot of
me.

Thoughts?

[1]: /messages/by-id/6fdc4dc5-8881-4987-9858-a9b484953185@joeconway.com
[2]: /messages/by-id/tencent_3018762E7D4C9BC470C821C829C1BF2F650A@qq.com

#2 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Corey Huinker (#1)
Re: Is there value in having optimizer stats for joins/foreignkeys?

On 12/1/25 21:10, Corey Huinker wrote:

Threads like [1] and [2] have gotten me thinking that there may be some
value in storing statistics about joins.

For the sake of argument, assume a table t1 with a column t2id which
references the pk of table t2 that has columns t2.t2id, t2c1, t2c2,
t2c3. In such a situation I can envision the following statistics being
collected:

* The % of rows in t2 that are referenced at least once in t1
* The attribute stats (i.e. pg_statistic stats) for t2c1, t2c2, t2c3,
but associated with t1 and weighted according to the frequency of that
row being referenced, which means that values of unreferenced rows are
filtered out entirely.
* That's about it for direct statistics, but I could see creating
extended statistics for correlations between a local column value and a
remote column, or expressions on the remote columns, etc.

Do I understand correctly you propose to collect such stats for every
foreign key? I recall something like that was proposed in the past, and
the argument against was that for many joins it'd be a waste because the
estimates are good enough. And for OLTP systems that's probably true.

Of course, it also depends on how expensive this would be. Maybe it's
cheap enough? No idea.

But I always assumed we'd have a way to explicitly enable such stats for
certain joins only, and the extended stats were designed to make that
possible.

FWIW I'm not entirely sure what stats you propose to collect exactly. I
mean, what does

... associated with t1 and weighted according to the frequency of
that row being referenced, which means that values of unreferenced
rows are filtered out entirely.

mean? Are you suggesting to "do the join" and build the regular stats as
if that was a regular table? I think that'd work, and it's mostly how I
envisioned to handle joins in extended stats, restricted to joins of two
relations.

The storage feels like it would be identical to pg_statistic but with a
"starefrelid" field that identifies the referencing table.

That much seems straightforward. A bigger problem is how we'd manage to
collect these statistics. We could (as Jeff Davis has suggested) keep
our tablesamples, but that wouldn't necessarily help in this case
because the set of rows referenced, and their relative weightings, would
change since the last sampling. In a worst-case scenario, we would have to
sample the joined-to tables as well, and that's an additional burden on
an already IO-intensive operation.

Combining independent per-table samples does not work, unless the
samples are huge. There's a nice paper [1] on how to do index-based join
sampling efficiently.

In theory, we could do some of this without any additional stats
collection. If the ndistinct of t1.t2id is, say, at least 75% of the
ndistinct of t2.t2id, we could just peek at the attribute stats on t2
and use them for estimates. However, that makes some assumptions that
the stats on t2 are approximately as fresh as the stats on t1, and I
don't think that will be the case most of the time.

CCing people who have wondered out loud about this topic within earshot
of me.

Thoughts?

I think adding joins to extended stats would not be all that hard
(famous last words, I know). For me the main challenge was figuring out
how to store the join definition in the catalog, I always procrastinated
and never gave that a serious try.

FWIW I think we might start by actually using per-table extended stats
on the joined tables. Just like we combine the scalar MCVs on joined
columns, we could combine multicolumn MCVs.
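
As a rough illustration (made-up frequencies; not PostgreSQL code),
combining two per-column MCV lists for an equijoin amounts to summing the
products of frequencies for values present in both lists, which is the
part a multicolumn analogue would generalize:

```python
# Roughly the MCV-vs-MCV part of equijoin selectivity estimation:
# sum f1(v) * f2(v) over values v that appear in both MCV lists.
mcv_t1 = {"red": 0.5, "blue": 0.3, "green": 0.1}   # value -> freq in t1.c
mcv_t2 = {"red": 0.4, "green": 0.2, "black": 0.1}  # value -> freq in t2.c

sel = sum(f1 * mcv_t2[v] for v, f1 in mcv_t1.items() if v in mcv_t2)
print(sel)  # 0.5*0.4 + 0.1*0.2 = 0.22
```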

regards

[1]: https://www.cidrdb.org/cidr2017/papers/p9-leis-cidr17.pdf

--
Tomas Vondra

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#2)
Re: Is there value in having optimizer stats for joins/foreignkeys?

Tomas Vondra <tomas@vondra.me> writes:

On 12/1/25 21:10, Corey Huinker wrote:

Threads like [1] and [2] have gotten me thinking that there may be some
value in storing statistics about joins.

Do I understand correctly you propose to collect such stats for every
foreign key? I recall something like that was proposed in the past, and
the argument against was that for many joins it'd be a waste because the
estimates are good enough. And for OLTP systems that's probably true.

Yeah, I think that automated choices about this are unlikely to work
well. We chose the syntax for CREATE STATISTICS with an eye to
allowing users to declaratively tell us to collect stats about
specific joins, and I still think that's a more promising approach.
But nobody's yet worked out any details.

regards, tom lane

#4 Corey Huinker
corey.huinker@gmail.com
In reply to: Tomas Vondra (#2)
Re: Is there value in having optimizer stats for joins/foreignkeys?

Do I understand correctly you propose to collect such stats for every
foreign key? I recall something like that was proposed in the past, and
the argument against was that for many joins it'd be a waste because the
estimates are good enough. And for OLTP systems that's probably true.

Not every foreign key, they'd be declared like CREATE STATISTICS, but would
be anchored to the constraint, not to the table.

But I always assumed we'd have a way to explicitly enable such stats for
certain joins only, and the extended stats were designed to make that
possible.

That's the intention, but the stats stored don't quite "fit" in the buckets
that extended stats create. Attribute statistics seem a much better fit:
this isn't about combinations of columns (there's only ever the one
combination), but about what can be known about the attributes in the far
table before doing the actual join.

FWIW I'm not entirely sure what stats you propose to collect exactly. I
mean, what does

... associated with t1 and weighted according to the frequency of
that row being referenced, which means that values of unreferenced
rows are filtered out entirely.

mean? Are you suggesting to "do the join" and build the regular stats as
if that was a regular table? I think that'd work, and it's mostly how I
envisioned to handle joins in extended stats, restricted to joins of two
relations.

Right. We'd do the join from t1 to t2 as described earlier, and then we'd
compute the null_frac, mcv, etc. for each column of t2 (as defined by the
scope of the stats declaration) according to the join. More commonly
referenced values would show up as more frequent, hence "weighted".

Just so I have an example to refer to later, say we have a table of colors:

CREATE TABLE color(id bigint primary key, color_name text unique,
color_family text null);

and there are hundreds of colors in the table that are color_family='red'
('fire engine red', 'candy apple red', 'popular muppet red', etc). Some
colors don't belong to any color_family.

And we have a table of toys:

CREATE TABLE toy(id bigint primary key, min_child_age integer, name text,
color_id bigint REFERENCES color);

And we declare a join stat on toy->color for the color_family attribute.
We'd sample rows from the toy table, left join those to color, and then
calculate the attribute stats of color_family as if it were a column in
toys. Some toys might not have a color_id, and some color_ids might not
belong to a color_family, so we'd want the null_frac to reflect those
combined conditions. For the values that do join, and the colors that do
belong to a family, we'd want to see regular MCV stats showing "red" as the
most common color_family.
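
A tiny simulation of this example (hypothetical sample; not the actual
ANALYZE code) shows how the combined null_frac and the weighted MCV would
come out:

```python
from collections import Counter

# Miniature of the toy/color example: sample rows from toy, LEFT JOIN to
# color, then compute attribute stats for color_family as if it were a
# column of toy.
color = {  # id -> (color_name, color_family)
    1: ("fire engine red", "red"),
    2: ("candy apple red", "red"),
    3: ("ecru", None),  # belongs to no color_family
}
toy_color_id = [1, 1, 2, 3, None, None]  # sampled toy.color_id values

joined = [color[cid][1] if cid is not None else None for cid in toy_color_id]
null_frac = joined.count(None) / len(joined)  # NULL fk OR NULL family
mcv = Counter(v for v in joined if v is not None).most_common()
print(null_frac)  # 0.5  (2 NULL fks + 1 familyless color, out of 6)
print(mcv)        # [('red', 3)]
```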

But those stats aren't really a correlation or a dependency, they're just
plain old attribute stats.

I understand wanting to know the correlation between toys.min_child_age and
colors.color_family, so that makes perfect sense for extended statistics,
but color_family on its own just doesn't fit. Am I missing something?

Combining independent per-table samples does not work, unless the
samples are huge. There's a nice paper [1] on how to do index-based join
sampling efficiently.

Thanks, now I've got some light reading for the flight home.

I think adding joins to extended stats would not be all that hard
(famous last words, I know). For me the main challenge was figuring out
how to store the join definition in the catalog, I always procrastinated
and never gave that a serious try.

I envisioned keying the stats off the foreign key constraint id, or adding
"starefrelid" (relation oid of the referencing table) to pg_statistic or a
table roughly the same shape as pg_statistic.

FWIW I think we might start by actually using per-table extended stats
on the joined tables. Just like we combine the scalar MCVs on joined
columns, we could combine multicolumn MCVs.

That's the other half of this - if the stats existed, do we have an obvious
way to put them to use?

#5 Corey Huinker
corey.huinker@gmail.com
In reply to: Tom Lane (#3)
Re: Is there value in having optimizer stats for joins/foreignkeys?

Yeah, I think that automated choices about this are unlikely to work
well. We chose the syntax for CREATE STATISTICS with an eye to
allowing users to declaratively tell us to collect stats about
specific joins, and I still think that's a more promising approach.
But nobody's yet worked out any details.

Per other response, no, I didn't envision stats on all possible joins or
even all possible foreign keys, just the ones we declare as interesting,
and even then only for the attributes that we say are interesting on the
far side of the join.

#6 Alexandra Wang
alexandra.wang.oss@gmail.com
In reply to: Corey Huinker (#5)
Re: Is there value in having optimizer stats for joins/foreignkeys?

Hi there,

Thanks for raising this topic! I am currently working on a POC patch that
adds extended statistics for joins. I am polishing the patch now and will
post it soon with performance numbers, since there is interest!

On Mon, Dec 1, 2025 at 7:16 PM Corey Huinker <corey.huinker@gmail.com>
wrote:

On Mon, Dec 1, 2025 at 1:02 PM Tomas Vondra <tomas@vondra.me> wrote:

I think adding joins to extended stats would not be all that hard
(famous last words, I know). For me the main challenge was figuring out
how to store the join definition in the catalog, I always procrastinated
and never gave that a serious try.

I envisioned keying the stats off the foreign key constraint id, or adding
"starefrelid" (relation oid of the referencing table) to pg_statistic or a
table roughly the same shape as pg_statistic.

On Mon, Dec 1, 2025 at 1:02 PM Tomas Vondra <tomas@vondra.me> wrote:

FWIW I think we might start by actually using per-table extended stats
on the joined tables. Just like we combine the scalar MCVs on joined
columns, we could combine multicolumn MCVs.

That's the other half of this - if the stats existed, do we have an
obvious way to put them to use?

I have indeed started by implementing MCV statistics for joins,
because I have not found a case for joins that would benefit only from
ndistinct or functional dependency stats that MCV stats wouldn't help.

In my POC patch, I've made the following catalog changes:
- Add stxotherrel (oid) and stxjoinkeys (int2vector) fields to
  pg_statistic_ext
- Use the existing stxkeys (int2vector) to store the stats object
  attributes of stxotherrel
- Create pg_statistic_ext_otherrel_index on (stxrelid, stxotherrel)
- Add stxdjoinmcv (pg_join_mcv_list) to pg_statistic_ext_data

To use them, we can let the planner detect patterns like this:

/*
* JoinStatsMatch - Information about a detected join pattern
* Used internally to track what was matched in a join+filter pattern
*/
typedef struct JoinStatsMatch
{
    Oid         target_rel;        /* table OID of the estimation target */
    AttrNumber  targetrel_joinkey; /* target_rel's join column */
    Oid         other_rel;         /* table OID of the filter source */
    AttrNumber  otherrel_joinkey;  /* other_rel's join column */
    List       *filter_attnums;    /* AttrNumbers of filter columns in
                                    * other_rel */
    List       *filter_values;     /* Datum constant values being filtered */
    Oid         collation;         /* collation for comparisons */

    /* Additional info to avoid duplicate work */
    List       *join_rinfos;       /* join clause RestrictInfos */
    RestrictInfo *filter_rinfo;    /* the filter clause RestrictInfo */
} JoinStatsMatch;

and add the detection logic in clauselist_selectivity_ext() and
get_foreign_key_join_selectivity().

Statistics collection indeed needs the most thinking. For the
purpose of a POC, I added MCV join stats collection as part of ANALYZE
of one table (stxrelid in pg_statistic_ext). I can do this because MCV
join stats are somewhat asymmetric. It allows me to have a target
table (referencing table for foreign key join) to ANALYZE, and we can
use the already collected MCVs of the joinkey column on the target
table to query the rows in the other table. This greatly mitigates
performance impact compared to actually joining two tables. However,
if we are to support more complex joins or other types of join stats
such as ndistinct or functional dependency, I found it hard to define
who's the target table (referencing table) and who's the other table
(referenced table) outside of the foreign key join scenario. So I
think for those more complex cases eventually we may as well
perform the joins and collect the join stats separately. Alvaro
Herrera suggested offline that we could have a dedicated autovacuum
command option for collecting the join statistics.

I have experimented with two ways to define the join statistics:

1. Use CREATE STATISTICS:

CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ] [ ( mcv ) ]
    ON { table_name1.column_name1 }, { table_name1.column_name2 } [, ...]
    FROM table_name1 JOIN table_name2
        ON table_name1.column_name3 = table_name2.column_name4

Examples:
-- Create join MCV statistics on a single filter column (keyword)
CREATE STATISTICS movie_keyword_keyword_join_stats (mcv)
ON k.keyword
FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

-- Create join MCV statistics on multiple filter columns (keyword +
phonetic_code):
CREATE STATISTICS movie_keyword_keyword_multicols_join_stats (mcv)
ON k.keyword, k.phonetic_code
FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

2. Auto join stats creation for Foreign Key + Functional Dependency stats

Initially, I did not implement the CREATE STATISTICS command to
create the join stats. Instead, I’ve implemented logic in ANALYZE to
detect functional dependency stats on the referenced table through FKs
and create join statistics implicitly for those cases.

I've been using the Join Order Benchmark (JOB) [1] to measure
performance gain. I will post the POC patch and performance numbers in
a followup email.

[1]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf

Best,
Alex

--
Alexandra Wang
EDB: https://www.enterprisedb.com

#7 Alexandra Wang
alexandra.wang.oss@gmail.com
In reply to: Alexandra Wang (#6)
Re: Is there value in having optimizer stats for joins/foreignkeys?

Hi hackers,

As promised in my previous email, I'm sharing a proof-of-concept patch
exploring join statistics for correlated columns across relations.
This is a POC at this point, but I hope the performance numbers below
give a better idea of both the potential usefulness of join statistics
and the complexity of implementing them.

Join Order Benchmark (JOB)
---------------------------------------

I got interested in this work after reading a 2015 VLDB paper [1] that
introduced the Join Order Benchmark (JOB) [2]. JOB uses the IMDB
dataset, with queries having 3-16 joins (average 8). Unlike standard
benchmarks such as TPC-H which use randomly generated uniform and
independent data, JOB uses real-world data with correlations and
skewed distributions, which make cardinality estimation much harder.

The paper tested PostgreSQL among other databases. I reran the
benchmark on current PG 19devel to see where we stand today. (The
original data source is gone, but CedarDB hosts a mirror [3]; I
created a fork [4] and added some scripts. The schema DDLs and queries
are unchanged.)

Test setup:
MacBook, Apple M3 Pro
Postgres master branch (19devel), built with -O3
cold and warm runs, 3 runs each

Results (average of 3 runs):

Postgres 19devel branch:
Total execution time cold run: 140.75 s (±2.12 s)
Total execution time warm run: 117.72 s (±0.76 s)

Postgres 19devel branch, SET enable_nestloop = off:
Total execution time cold run: 87.16 s (±0.17 s)
Total execution time warm run: 86.68 s (±0.42 s)

These results suggest that the planner is frequently choosing nested
loop joins where hash or merge joins would be better.

(As a side note: with nestloops disabled, plans are dominated by hash
joins, so cold vs warm runs are very similar due to limited cache
reuse.)

Problem
--------

The slowest query in this workload is 16b.sql (attached). Its plan
contains 7 nested loop joins.

The innermost join is estimated as 34 rows but actually produces
41,840 rows — a ~1230x underestimation:

Nested Loop  (cost=6.80..3813.14 rows=34 width=4)
             (actual time=2.336..304.160 rows=41840 loops=1)
  ->  Seq Scan on keyword k  (cost=0.00..2685.11 rows=1 width=4)
                             (actual time=0.402..6.161 rows=1.00 loops=1)
        Filter: (keyword = 'character-name-in-title')
  ->  Bitmap Heap Scan on movie_keyword mk  (cost=6.80..1124.98 rows=305 width=8)
                                            (actual time=1.933..295.493 rows=41840.00 loops=1)
        Recheck Cond: (k.id = keyword_id)

This underestimation cascades to the outermost join: ~3K estimated vs
~3.7M actual rows.

The problem can be reduced to the following query:

SELECT * FROM movie_keyword mk, keyword k
WHERE k.keyword = 'character-name-in-title'
AND k.id = mk.keyword_id;

The tables:

keyword(id PK, keyword, phonetic_code)
movie_keyword(id PK, movie_id, keyword_id)

These tables are strongly correlated via keyword_id -> keyword.id. We
can even add:

ALTER TABLE movie_keyword
ADD CONSTRAINT movie_keyword_keyword_id_fkey
FOREIGN KEY (keyword_id) REFERENCES keyword(id);
CREATE STATISTICS keyword_stats (dependencies)
ON id, keyword FROM keyword;
ANALYZE keyword, movie_keyword;

However, even with the FK constraint and extended statistics on the
keyword table, the estimate remains unchanged:

SELECT * FROM check_estimated_rows(
'SELECT * FROM movie_keyword mk, keyword k
WHERE k.keyword = ''character-name-in-title''
AND k.id = mk.keyword_id');
estimated | actual
-----------+--------
34 | 41840

This indicates that the planner cannot connect the skewed distribution
of the join key on the referencing table with the filter selectivity
on the referenced table as it applies to the join result. This is the
gap that join statistics aim to fill.
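
For intuition, the default estimate is roughly the average per-key fanout
under an independence assumption. Using approximate JOB table sizes
(illustrative figures), that lands almost exactly on the 34-row estimate
shown above:

```python
# Back-of-the-envelope reproduction of the misestimate: without join stats
# the planner treats the filter and the join as independent, so the
# expected match count for any single keyword is the average fanout, not
# the fanout of this particular, heavily skewed key.
n_movie_keyword = 4_523_930  # movie_keyword rows (approximate JOB size)
n_keyword = 134_170          # keyword rows / distinct ids (approximate)

avg_fanout = n_movie_keyword / n_keyword
print(round(avg_fanout))  # 34 -- matching the estimate; actual is 41840
```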

Proposed Solution
------------------

I've attached two patches:

0001: Extend CREATE STATISTICS for join MCV statistics
0002: Automatically create join MCV statistics from FK constraints

0001 is the core patch. 0002 is an optional convenience feature
that detects FK relationships during ANALYZE and auto-creates
join stats.

Syntax:

CREATE STATISTICS <name> (mcv)
ON <other_table>.<filter_col> [, ...]
FROM <primary_table> alias
JOIN <other_table> alias ON (<join_condition>);

Example -- single filter column:

CREATE STATISTICS mk_keyword_stats (mcv)
ON k.keyword
FROM movie_keyword mk
JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

Example -- multiple filter columns:

CREATE STATISTICS mk_multi_stats (mcv)
ON k.keyword, k.phonetic_code
FROM movie_keyword mk
JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

Catalog changes:

pg_statistic_ext:
- New stats kind 'c' (join MCV) in stxkind
- New field stxotherrel (Oid): the other table in the join
- New field stxjoinkeys (int2vector): join column pair [primary_joinkey,
other_joinkey]
- Existing stxkeys stores the filter column(s) on stxotherrel
- New index pg_statistic_ext_otherrel_index on (stxrelid, stxotherrel)

pg_statistic_ext_data:
- New field stxdjoinmcv (pg_join_mcv_list): serialized join MCV data

How stats are collected:

Join statistics are collected during ANALYZE of the primary table
(stxrelid). The current approach assumes a dependency relationship
between the join key column and the filter column(s) on the other
table. Specifically, during ANALYZE, we reuse the already-computed
single-column MCV stats for the primary table's join key column. For
each MCV value, we look up the matching row in the other table via the
join key and extract the filter column values, carrying over the
primary-side MCV frequency. This is not accurate in all cases, but
works reasonably well for the POC, especially for foreign-key-like
joins where the primary table's join key distribution is
representative of the join result.
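
A minimal sketch of that collection loop (hypothetical data; the real code
works on sampled rows and catalog MCV entries):

```python
# Reuse the primary table's join-key MCVs and probe the referenced table
# once per MCV value, carrying the primary-side frequency over to the
# filter column's value.
keyword = {101: "character-name-in-title", 102: "independent-film"}  # id -> keyword

# Already-computed MCVs of movie_keyword.keyword_id from plain ANALYZE:
joinkey_mcvs = [(101, 0.06), (102, 0.03)]  # (value, frequency)

join_mcv = []
for keyid, freq in joinkey_mcvs:
    row = keyword.get(keyid)  # PK lookup on the other table, not a full join
    if row is not None:
        join_mcv.append((row, freq))
print(join_mcv)  # [('character-name-in-title', 0.06), ('independent-film', 0.03)]
```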

For example, the catalog contents look like:

stxrelid | stxotherrel | stxjoinkeys | stxkeys | stxkind
--------------+-------------+-------------+---------+--------
movie_keyword | keyword | 2 1 | 2 | {c}

index | values | nulls | frequency
------+--------------+-------+----------
0 | {keyword_1} | {f} | 0.06
1 | {keyword_2} | {f} | 0.06
2 | {keyword_3} | {f} | 0.06
... | ... | ... | ...

How the planner uses join stats:

I added the join pattern detection logic in
clauselist_selectivity_ext() and get_foreign_key_join_selectivity().
The planner looks for a pattern of:

- An equijoin clause between two relations (the join key pair), and
- Equality or IN filter clauses on columns of one relation (the filter
columns)

When this pattern is found and matching join MCV stats exist, the
planner compares the filter values against the stored MCV items to
compute the join selectivity, replacing the default estimate.
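
Schematically (made-up frequencies and fallback; not the patch's actual
data structures), that estimation step is a straight lookup:

```python
# A filter value on the referenced table is looked up in the stored
# join-MCV list, and the stored frequency is used directly as the
# selectivity of join+filter on the referencing table.
join_mcv = {("character-name-in-title",): 0.00925}  # filter values -> freq
DEFAULT_SEL = 0.0001  # hypothetical fallback for non-MCV filter values

def join_selectivity(filter_values):
    return join_mcv.get(tuple(filter_values), DEFAULT_SEL)

mk_rows = 4_523_930  # movie_keyword row count (approximate)
est = mk_rows * join_selectivity(["character-name-in-title"])
print(round(est))  # close to the actual 41840, instead of the default 34
```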

JOB Benchmark Results
---------------------

I created a single join statistics object for the (movie_keyword,
keyword) pair and reran the JOB benchmark:

CREATE STATISTICS movie_keyword_keyword_stats (mcv)
ON k.keyword
FROM movie_keyword mk
JOIN keyword k ON (mk.keyword_id = k.id);

Total execution times (average of 3 runs):

Cold run Warm run
Default (baseline) 140.75s (+/-2.1) 117.72s (+/-0.8)
enable_nestloop=off 87.16s (+/-0.2) 86.68s (+/-0.4)
With join stats 85.81s (+/-3.4) 65.24s (+/-0.5)

With a single join statistics object:
- Cold run: -39% vs baseline (140.75s -> 85.81s),
comparable to enable_nestloop=off
- Warm run: -45% vs baseline (117.72s -> 65.24s),
-25% vs enable_nestloop=off (86.68s -> 65.24s)

Per-query times for the 10 slowest (out of 113) queries on baseline
(master branch, enable_nestloop = on):

Query | Baseline | No Nestloop | Join Stats | Best
------+-----------+-------------+------------+----------
16b | 10,673 ms | 2,789 ms | 2,775 ms | 3.8x JS
17e | 7,709 ms | 2,260 ms | 2,149 ms | 3.6x JS
17f | 7,506 ms | 2,304 ms | 2,460 ms | 3.3x NL
17a | 7,288 ms | 1,453 ms | 2,388 ms | 5.0x NL
17b | 6,490 ms | 1,580 ms | 5,413 ms | 4.1x NL
17c | 6,240 ms | 1,095 ms | 5,268 ms | 5.7x NL
17d | 6,261 ms | 1,263 ms | 5,291 ms | 5.0x NL
6d | 6,234 ms | 988 ms | 1,565 ms | 6.3x NL
25c | 6,133 ms | 1,848 ms | 1,738 ms | 3.5x JS
6f | 5,728 ms | 1,785 ms | 1,556 ms | 3.7x JS

(JS = join stats fastest, NL = no-nestloop fastest)

All 10 queries improved over baseline. For 16b, 3 nested loop joins
were replaced with 2 hash joins and 1 merge join. Some queries (e.g.
17b) kept the same plan shape and joins but gained a Memoize node from
better cardinality estimates.

Current Limitations
-------------------

This is a proof of concept. Known limitations include:

1. The current catalog design is not ideal. It is asymmetric (a
"primary" and an "other" table), which is natural for FK-like joins,
but less intuitive for other joins.

2. Stats collection piggybacks on ANALYZE of the primary table and
uses its single-column MCV for the join key. This can be inaccurate
when the MCV values on the "primary" side don't cover the important
values on the other side, or when the filter column isn't fully
dependent on the join key. A more accurate approach would execute the
actual join during collection, which could also decouple join stats
collection from single-table ANALYZE.

3. Currently limited to: equality join clauses, equality and IN filter
clauses, simple Var stats objects (no expressions), inner joins only,
and two-way joins only. Some of these are easier to extend; others may
be harder or unnecessary (like n-way joins).

4. Patch 0002 (auto-creation from FK constraints) should probably be
gated behind a GUC. I'm not strongly attached to this patch, but kept
it because FK joins seem like a natural and common use case.

Conclusion
----------

Even with all the current limitations, I think this experiment
suggests that join-level statistics can significantly improve plans,
as a single join stats object can materially improve workload
performance.

If there's interest, I'm happy to continue iterating on the design.

In particular, I'd welcome feedback on:
- whether this is a direction worth pursuing,
- the catalog design,
- the stats collection approach,
- the planner integration strategy,
- and scope (what kinds of joins / predicates are worth supporting).

Best,
Alex

[1]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf
[2]: https://github.com/gregrahn/join-order-benchmark
[3]: https://cedardb.com/docs/example_datasets/job/
[4]: https://github.com/l-wang/join-order-benchmark

--
Alexandra Wang
EDB: https://www.enterprisedb.com

Attachments:

16b_nl.txt (text/plain)
v1-0002-Automatically-create-join-MCV-statistics-from-FK-.patch (+894 -5)
v1-0001-Extend-CREATE-STATISTICS-syntax-for-join-MCV-stat.patch (+2943 -194)
16b.sql
16b_js.txt (text/plain)
16b_baseline.txt (text/plain)
#8 Corey Huinker
corey.huinker@gmail.com
In reply to: Alexandra Wang (#6)
Re: Is there value in having optimizer stats for joins/foreignkeys?

I have indeed started by implementing MCV statistics for joins,
because I have not found a case for joins that would benefit only from
ndistinct or functional dependency stats that MCV stats wouldn't help.

That was a big question I had, and I agree that we should only add
statistics as uses for them become apparent.

In my POC patch, I've made the following catalog changes:
- Add *stxotherrel (oid) *and *stxjoinkeys (int2vector)* fields to
*pg_statistic_ext*
- Use the existing *stxkeys (int2vector)* to store the stats object
attributes of *stxotherrel*
- Create *pg_statistic_ext_otherrel_index* on *(stxrelid, stxotherrel)*
- Add stxdjoinmcv* (pg_join_mcv_list)* to *pg_statistic_ext_data*

I like all these changes. Maybe "outer" rel rather than "other" rel, but it
really doesn't matter this early on.

To use them, we can let the planner detect patterns like this:

/*
* JoinStatsMatch - Information about a detected join pattern
* Used internally to track what was matched in a join+filter pattern
*/
typedef struct JoinStatsMatch
{
Oid target_rel; /* table OID of the estimation target */
AttrNumber targetrel_joinkey; /* target_rel's join column */
Oid other_rel; /* table OID of the filter source */
AttrNumber otherrel_joinkey; /* other_rel's join column */
List *filter_attnums; /* list of AttrNumbers for filter columns in other_rel */
List *filter_values; /* list of Datum constant values being filtered */
Oid collation; /* collation for comparisons */

/* Additional info to avoid duplicate work */
List *join_rinfos; /* list of join clause RestrictInfos */
RestrictInfo *filter_rinfo; /* the filter clause RestrictInfo */
} JoinStatsMatch;

and add the detection logic in clauselist_selectivity_ext() and
get_foreign_key_join_selectivity().

Statistics collection indeed needs the most thinking. For the
purpose of a POC, I added MCV join stats collection as part of ANALYZE
of one table (stxrelid in pg_statistic_ext). I can do this because MCV
join stats are somewhat asymmetric. It allows me to have a target
table (referencing table for foreign key join) to ANALYZE, and we can
use the already collected MCVs of the joinkey column on the target
table to query the rows in the other table. This greatly mitigates
performance impact compared to actually joining two tables. However,
if we are to support more complex joins or other types of join stats
such as ndistinct or functional dependency, I found it hard to define
who's the target table (referencing table) and who's the other table
(referenced table) outside of the foreign key join scenario. So I
think for those more complex cases eventually we may as well
perform the joins and collect the join stats separately. Alvaro
Herrera suggested offline that we could have a dedicated autovacuum
command option for collecting the join statistics.

I agree, we have to perform the joins, as the rows collected are inherently
dependent and skewed, and yes, it's going to be a maintenance overhead hit.

Initially I thought this could be mitigated somewhat by retaining row
samples for each table, but those samples would be independent of the
joins, and the values we need are inherently dependent.

I have experimented with two ways to define the join statistics:

1. Use CREATE STATISTICS:

CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ] [ ( mcv ) ] ON {
table_name1.column_name1 }, { table_name1.column_name2 } [, ...] FROM
table_name1 JOIN table_name2 ON table_name1.column_name3 =
table_name2.column_name4

We'll need to support aliases because there could be a self-join :(

Examples:
-- Create join MCV statistics on a single filter column (keyword)
CREATE STATISTICS movie_keyword_keyword_join_stats (mcv)
ON k.keyword
FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

-- Create join MCV statistics on multiple filter columns (keyword +
phonetic_code):
CREATE STATISTICS movie_keyword_keyword_multicols_join_stats (mcv)
ON k.keyword, k.phonetic_code
FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

This is where the existing CREATE STATISTICS syntax does not serve our
purposes well. We definitely want MCV stats for both of those k.* columns,
with the skew of values that join to movie_keyword on that defined foreign
key, but we'd end up getting the _combinations_ of keyword and
phonetic_code, which we don't necessarily care about.

We might want an alternate syntax

CREATE STATISTICS movie_keyword_keyword_multicols_join_stats (mcv)
ON keyword, phonetic_code
[FROM movie_keyword]
USING movie_keyword_fk;

In this case, the FROM clause is redundant and therefore optional; the
columns listed must exist on the confrelid, and the object is keyed to the
conrelid. Having said that, I think people don't really use constraint
names, so the join syntax will likely be used more often.

2. Auto join stats creation for Foreign Key + Functional Dependency stats

Initially, I did not implement the CREATE STATISTICS command to
create the join stats. Instead, I've implemented logic in ANALYZE to
detect functional dependency stats on the referenced table through FKs
and create join statistics implicitly for those cases.

I'm not excited about this, and others have expressed concern that it would
lead to an explosion of mediocre statistics objects.

#9Corey Huinker
corey.huinker@gmail.com
In reply to: Alexandra Wang (#7)
Re: Is there value in having optimizer stats for joins/foreignkeys?

Current Limitations
-------------------

This is a proof of concept. Known limitations include:

I really like this proof of concept.

1. The current catalog design is not ideal. It is asymmetric (a
"primary" and an "other" table), which is natural for FK-like joins,
but less intuitive for other joins

I think the asymmetry comes with the territory, and we will be creating the
join statistics that prove useful. If that means that we create one object
ON b.c1, b.c2 FROM a JOIN b... and another ON a.c3, a.c4 FROM b JOIN a...
then so be it.


2. Stats collection piggybacks on ANALYZE of the primary table and
uses its single-column MCV for the join key. This can be inaccurate
when the MCV values on the "primary" side don't cover the important
values on the other side, or when the filter column isn't fully
dependent on the join key. A more accurate approach would execute the
actual join during collection, which could also decouple join stats
collection from single-table ANALYZE.

Unfortunately, I think we will have to join the remote table_b to the row
sample on table_a to get accurate join statistics, and the best time to do
that is when we already have the row sample from table_a. We can further
batch up statistics objects that happen to join table_a to table_b by the
same join criteria to avoid rescans.

Will what you have work when we want to do an MCV on a mix of local and
remote columns, or will that require more work?

3. Currently limited to: equality join clauses, equality and IN filter
clauses, simple Var stats objects (no expressions), inner joins only,
and two-way joins only. Some of these are easier to extend; others may
be harder or unnecessary (like n-way joins).

I suspect that n-way joins would have very limited utility and could be
adequately covered with multiple join-stats objects.

4. Patch 0002 (auto-creation from FK constraints) should probably be
gated behind a GUC. I'm not strongly attached to this patch, but kept
it because FK joins seem like a natural and common use case.

I think this is like indexing, just because you can make all possible
columns indexed doesn't mean you should. Tooling will emerge to determine
what join stats objects are worth their weight, and create only those
objects.

If there's interest, I'm happy to continue iterating on the design.

In particular, I'd welcome feedback on:
- whether this is a direction worth pursuing,

Yes. Very much so.

- the catalog design,

I had somehow gotten the impression that you were going to take the
extended statistics format, but store individual columns in a modified
pg_statistic. That's working for now, but I wonder if that will still be
the case when we start to try this for columns that are arrays, ranges,
multiranges, tsvectors, etc. pg_statistic has the infrastructure to handle
those, but there may be good reason to keep pg_statistic focused on local
attributes and instead just keep adding new kinds to extended statistics
and let it be the grab-bag it was perhaps always meant to be.

In my own musings on how to implement this (which you have far exceeded
with this proof-of-concept), I had wondered how to derive stats for
k.keyword and k.phonetic_code individually from the definition
of movie_keywords2_multi_stats, but looking at what you've done I think
we're better off defining each statistics object very narrowly; if that
means we define one object per remote column and one object per interesting
combination of columns, then so be it. So long as we can calculate them all
from the same join of the two tables, we'll avoid the nasty overhead.

- and scope (what kinds of joins / predicates are worth supporting).

Between-ish clauses (x >= y AND x < z), etc., are what immediately come to
mind.

element_mcv for array types might be next, but that's well down the road.

#10Andrei Lepikhov
lepihov@gmail.com
In reply to: Alexandra Wang (#7)
Re: Is there value in having optimizer stats for joins/foreignkeys?

On 29/1/26 06:04, Alexandra Wang wrote:

Hi hackers,

As promised in my previous email, I'm sharing a proof-of-concept patch
exploring join statistics for correlated columns across relations.
This is a POC at this point, but I hope the performance numbers below
give a better idea of both the potential usefulness of join statistics
and the complexity of implementing them.

I wonder why you chose the JOIN operator only?

It seems to me that any relational operator produces relational output
that can be treated as a table. The extended statistics code may be
adapted to such relations.
I think it may be a VIEW that you can declare (manually or
automatically) and allow Postgres to build statistics on this 'virtual'
table. So, the main focus may shift to the question: how to provably
match a query subtree to a specific statistic.

--
regards, Andrei Lepikhov,
pgEdge

#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexandra Wang (#7)
Re: Is there value in having optimizer stats for joins/foreignkeys?

Hi Alexandra!

On 1/29/26 06:04, Alexandra Wang wrote:

Hi hackers,

As promised in my previous email, I'm sharing a proof-of-concept patch
exploring join statistics for correlated columns across relations.
This is a POC at this point, but I hope the performance numbers below
give a better idea of both the potential usefulness of join statistics
and the complexity of implementing them.

Great! Thanks for doing this, I think it's a very worthy improvement and
I'm grateful you're interested in working on this.

Join Order Benchmark (JOB)
---------------------------------------

I got interested in this work after reading a 2015 VLDB paper [1] that
introduced the Join Order Benchmark (JOB) [2].  JOB uses the IMDB
dataset, with queries having 3-16 joins (average 8). Unlike standard
benchmarks such as TPC-H which use randomly generated uniform and
independent data, JOB uses real-world data with correlations and
skewed distributions, which make cardinality estimation much harder.

The paper tested PostgreSQL among other databases. I reran the
benchmark on current PG 19devel to see where we stand today. (The
original data source is gone, but CedarDB hosts a mirror [3]; I
created a fork [4] and added some scripts. The schema DDLs and queries
are unchanged.)

Test setup:
MacBook, Apple M3 Pro
Postgres master branch (19devel), built with -O3
cold and warm runs, 3 runs each

Results (average of 3 runs):

Postgres 19devel branch:
Total execution time cold run: 140.75 s (±2.12 s)
Total execution time warm run: 117.72 s (±0.76 s)

Postgres 19devel branch, SET enable_nestloop = off:
Total execution time cold run: 87.16 s (±0.17 s)
Total execution time warm run: 86.68 s (±0.42 s)

These results suggest that the planner is frequently choosing nested
loop joins where hash or merge joins would be better.

(As a side note: with nestloops disabled, plans are dominated by hash
joins, so cold vs warm runs are very similar due to limited cache
reuse.)

Right. I have not investigated plans for the JOB queries, but a common
issue is that we pick a nested loop very early in the plan due to a poor
estimate, and that leads to a sequence of nested loops. Which just
amplifies the problem.

Is it possible to measure the accuracy of the estimates, not just the
execution time? I think it would be useful to know how much "closer" we
get to the actual cardinality. There may well be queries where the
execution time did not change much, but in a slightly different setting
it could easily be an issue.
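One standard way to quantify that (used in the cardinality-estimation
literature around JOB) is the q-error: the factor by which an estimate is
off, regardless of direction. A minimal sketch:

```python
# q-error: max(est/act, act/est). A perfect estimate scores 1;
# over- and underestimates of the same magnitude score alike.

def q_error(estimated, actual):
    estimated = max(estimated, 1)   # clamp to avoid division by zero
    actual = max(actual, 1)
    return max(estimated / actual, actual / estimated)

print(q_error(34, 41840))   # the innermost join of 16b: ~1230x off
print(q_error(100, 100))    # perfect estimate: 1.0
```

Reporting, say, median and maximum q-error per query across the benchmark
would show the estimation improvement independently of plan changes.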

Problem
--------

The slowest query in this workload is 16b.sql (attached). Its plan
contains 7 nested loop joins.

The innermost join is estimated as 34 rows but actually produces
41,840 rows — a ~1230x underestimation:

Nested Loop  (cost=6.80..3813.14 rows=34 width=4) (actual
time=2.336..304.160 rows=41840 loops=1)
  -> Seq Scan on keyword k (cost=0.00..2685.11 rows=1 width=4) (actual
time=0.402..6.161 rows=1.00 loops=1)
     Filter: (keyword = 'character-name-in-title')
  -> Bitmap Heap Scan on movie_keyword mk (cost=6.80..1124.98 rows=305
width=8) (actual time=1.933..295.493 rows=41840.00 loops=1)
     Recheck Cond: (k.id = keyword_id)

This underestimation cascades to the outermost join: ~3K estimated vs
~3.7M actual rows.

The problem can be reduced to the following query:

SELECT * FROM movie_keyword mk, keyword k
WHERE k.keyword = 'character-name-in-title'
AND k.id = mk.keyword_id;

The tables:

keyword(id PK, keyword, phonetic_code)
movie_keyword(id PK, movie_id, keyword_id)

These tables are strongly correlated via keyword_id -> keyword.id. We
can even add:

ALTER TABLE movie_keyword
  ADD CONSTRAINT movie_keyword_keyword_id_fkey
  FOREIGN KEY (keyword_id) REFERENCES keyword(id);
CREATE STATISTICS keyword_stats (dependencies)
  ON id, keyword FROM keyword;
ANALYZE keyword, movie_keyword;

However, even with the FK constraint and extended statistics on the
keyword table, the estimate remains unchanged:

SELECT * FROM check_estimated_rows(
  'SELECT * FROM movie_keyword mk, keyword k
   WHERE k.keyword = ''character-name-in-title''
     AND k.id = mk.keyword_id');
 estimated | actual
-----------+--------
        34 |  41840

This indicates that the planner cannot connect the skewed distribution
of the join key on the referencing table with the filter selectivity
on the referenced table as it applies to the join result. This is the
gap that join statistics aim to fill.

Agreed, I've seen various cases like this too.
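For what it's worth, the 34-row estimate can be reproduced by hand: the
planner applies the filter selectivity to the join result as if it were
independent of the join key. A sketch, using approximate IMDB table sizes
(treat the exact numbers as illustrative assumptions):

```python
# Back-of-the-envelope reproduction of the 34-row estimate
# (table sizes are approximate IMDB figures).

mk_rows = 4_523_930        # movie_keyword
k_rows  = 134_170          # keyword
filter_sel = 1 / k_rows    # one keyword row matches the equality filter

# FK semantics: every movie_keyword row joins to exactly one keyword
# row, so the join has mk_rows rows; the filter selectivity is then
# applied as if independent of the join key:
estimate = mk_rows * filter_sel
print(round(estimate))     # ~34, vs 41,840 actual: the skew is invisible
```

The 41,840 actual rows mean this one keyword is referenced ~1200x more
often than the average keyword, which is exactly the information a
join-level MCV captures and per-table stats cannot.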

Proposed Solution
------------------

I've attached two patches:

0001: Extend CREATE STATISTICS for join MCV statistics

+1 to do this (I'm not sure about restricting it to MCV, but that's a
detail we can discuss later)

0002: Automatically create join MCV statistics from FK constraints

I think this is very unlikely, for reasons I already explained earlier
in this thread. Essentially, building and processing the stats is a
cost, and most FKs probably don't actually need it. We don't build
indexes on FKs for similar reasons. And it's not clear to me how we
would know which columns to define the statistics on.

Of course, it's up to you to keep working on this, and try to convince
us it's a good idea. I personally think it's unlikely to get accepted,
but perhaps I'm wrong.

0001 is the core patch. 0002 is an optional convenience feature
that detects FK relationships during ANALYZE and auto-creates
join stats.

Syntax:

CREATE STATISTICS <name> (mcv)
  ON <other_table>.<filter_col> [, ...]
  FROM <primary_table> alias
  JOIN <other_table> alias ON (<join_condition>);

Example -- single filter column:

CREATE STATISTICS mk_keyword_stats (mcv)
  ON k.keyword
  FROM movie_keyword mk
  JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

Example -- multiple filter columns:

CREATE STATISTICS mk_multi_stats (mcv)
  ON k.keyword, k.phonetic_code
  FROM movie_keyword mk
  JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

Catalog changes:

pg_statistic_ext:
- New stats kind 'c' (join MCV) in stxkind
- New field stxotherrel (Oid): the other table in the join
- New field stxjoinkeys (int2vector): join column pair [primary_joinkey,
  other_joinkey]
- Existing stxkeys stores the filter column(s) on stxotherrel
- New index pg_statistic_ext_otherrel_index on (stxrelid, stxotherrel)

- Why do we need a new stats kind? I'd have expected we'd use the same
'm' value as for a single-relation stats, simply because a join result
is just a relation too.

- For a PoC it's OK to have stxotherrel, but the final patch should not
be restricted to two relations. I think there needs to be an array of
the joined rel OIDs, or something like that.

FWIW I recognize figuring out the catalog changes is not easy - in fact
I always found it so intimidating that I ended up procrastinating and not
making any progress on join statistics. So I really appreciate you
taking a stab at this.

pg_statistic_ext_data:
- New field stxdjoinmcv (pg_join_mcv_list): serialized join MCV data

I don't understand why we need this. AFAICS we should just treat the
join result as a relation, and use the current pg_statistic_ext_data fields.

How stats are collected:

Join statistics are collected during ANALYZE of the primary table
(stxrelid). The current approach assumes a dependency relationship
between the join key column and the filter column(s) on the other
table. Specifically, during ANALYZE, we reuse the already-computed
single-column MCV stats for the primary table's join key column. For
each MCV value, we look up the matching row in the other table via the
join key and extract the filter column values, carrying over the
primary-side MCV frequency. This is not accurate in all cases, but
works reasonably well for the POC, especially for foreign-key-like
joins where the primary table's join key distribution is
representative of the join result.

For the PoC it's good enough, but I don't think we can rely on the
per-column MCV lists like this.

For example, the catalog contents look like:

stxrelid      | stxotherrel | stxjoinkeys | stxkeys | stxkind
--------------+-------------+-------------+---------+--------
movie_keyword | keyword     | 2 1         | 2       | {c}

index | values       | nulls | frequency
------+--------------+-------+----------
    0 | {keyword_1}  | {f}   |      0.06
    1 | {keyword_2}  | {f}   |      0.06
    2 | {keyword_3}  | {f}   |      0.06
  ... | ...          | ...   |       ...

How the planner uses join stats:

I added the join pattern detection logic in
clauselist_selectivity_ext() and get_foreign_key_join_selectivity().
The planner looks for a pattern of:

- An equijoin clause between two relations (the join key pair), and
- Equality or IN filter clauses on columns of one relation (the filter
  columns)

When this pattern is found and matching join MCV stats exist, the
planner compares the filter values against the stored MCV items to
compute the join selectivity, replacing the default estimate.
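In rough pseudocode (hypothetical Python, not the actual planner C), the
estimation step amounts to summing the stored MCV frequencies for the
matched filter values, falling back to the default estimate on a miss:

```python
# Sketch of join-MCV selectivity lookup (names and frequencies are
# hypothetical illustrations).

def join_mcv_selectivity(mcv_items, filter_values):
    """mcv_items: list of (values_tuple, frequency) from the join MCV.
    filter_values: tuple of constants from the equality/IN clauses.
    Returns the matched frequency as a selectivity, or None when
    nothing matched and the caller should use the default estimate."""
    sel = sum(freq for vals, freq in mcv_items if vals == filter_values)
    return sel if sel > 0 else None

mcv = [(("character-name-in-title",), 0.00925),
       (("sequel",), 0.00310)]

print(join_mcv_selectivity(mcv, ("character-name-in-title",)))
print(join_mcv_selectivity(mcv, ("no-such-keyword",)))  # None -> fallback
```

Because the frequencies were computed over the join result, multiplying
the returned selectivity by the join's row count directly yields the
skew-aware cardinality.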

JOB Benchmark Results
---------------------

I created a single join statistics object for the (movie_keyword,
keyword) pair and reran the JOB benchmark:

CREATE STATISTICS movie_keyword_keyword_stats (mcv)
  ON k.keyword
  FROM movie_keyword mk
  JOIN keyword k ON (mk.keyword_id = k.id);

Total execution times (average of 3 runs):

                      Cold run          Warm run
Default (baseline)  140.75s (+/-2.1)  117.72s (+/-0.8)
enable_nestloop=off   87.16s (+/-0.2)   86.68s (+/-0.4)
With join stats       85.81s (+/-3.4)   65.24s (+/-0.5)

With a single join statistics object:
- Cold run: -39% vs baseline (140.75s -> 85.81s),
  comparable to enable_nestloop=off
- Warm run: -45% vs baseline (117.72s -> 65.24s),
  -25% vs enable_nestloop=off (86.68s -> 65.24s)

Per-query times for the 10 slowest (out of 113) queries on baseline
(master branch, enable_nestloop = on):

  Query | Baseline  | No Nestloop | Join Stats | Best
  ------+-----------+-------------+------------+----------
  16b   | 10,673 ms |   2,789 ms  |  2,775 ms  | 3.8x JS
  17e   |  7,709 ms |   2,260 ms  |  2,149 ms  | 3.6x JS
  17f   |  7,506 ms |   2,304 ms  |  2,460 ms  | 3.3x NL
  17a   |  7,288 ms |   1,453 ms  |  2,388 ms  | 5.0x NL
  17b   |  6,490 ms |   1,580 ms  |  5,413 ms  | 4.1x NL
  17c   |  6,240 ms |   1,095 ms  |  5,268 ms  | 5.7x NL
  17d   |  6,261 ms |   1,263 ms  |  5,291 ms  | 5.0x NL
  6d    |  6,234 ms |     988 ms  |  1,565 ms  | 6.3x NL
  25c   |  6,133 ms |   1,848 ms  |  1,738 ms  | 3.5x JS
  6f    |  5,728 ms |   1,785 ms  |  1,556 ms  | 3.7x JS

  (JS = join stats fastest, NL = no-nestloop fastest)

All 10 queries improved over baseline. For 16b, 3 nested loop joins
were replaced with 2 hash joins and 1 merge join. Some queries (e.g.
17b) kept the same plan shape and joins but gained a Memoize node from
better cardinality estimates.

Encouraging results, considering the limitations of the PoC patch. As
mentioned earlier, I'm curious how much the estimates improved, not just
the execution time.

Current Limitations
-------------------

This is a proof of concept. Known limitations include:

1. The current catalog design is not ideal. It is asymmetric (a
"primary" and an "other" table), which is natural for FK-like joins,
but less intuitive for other joins.

Yeah, that's not ideal. I think we'd like a catalog struct that works
for joins of more than 2 tables too.

2. Stats collection piggybacks on ANALYZE of the primary table and
uses its single-column MCV for the join key.  This can be inaccurate
when the MCV values on the "primary" side don't cover the important
values on the other side, or when the filter column isn't fully
dependent on the join key. A more accurate approach would execute the
actual join during collection, which could also decouple join stats
collection from single-table ANALYZE.

Agreed. I don't think we can piggyback on the single-column MCV lists
for the reasons you mention.

3. Currently limited to: equality join clauses, equality and IN filter
clauses, simple Var stats objects (no expressions), inner joins only,
and two-way joins only. Some of these are easier to extend; others may
be harder or unnecessary (like n-way joins).

4. Patch 0002 (auto-creation from FK constraints) should probably be
gated behind a GUC. I'm not strongly attached to this patch, but kept
it because FK joins seem like a natural and common use case.

Conclusion
----------

Even with all the current limitations, I think this experiment
suggests that join-level statistics can significantly improve plans,
as a single join stats object can materially improve workload
performance.

If there's interest, I'm happy to continue iterating on the design.

In particular, I'd welcome feedback on:
- whether this is a direction worth pursuing,

Yes, please!

- the catalog design,

I think we want a catalog that works for joins of more than 2 rels, etc.

- the stats collection approach,

The best approach I'm aware of is the "Index-Based Join Sampling" paper
I mentioned earlier in this thread. I think it's OK to allow building
statistics only for joins compatible with that sampling approach.

It's not clear to me what should trigger building the stats. I assume
it'd be done by ANALYZE, but that's per-table. So would it be triggered
by ANALYZE on any of the joined tables, or just the "main" one? Would it
require a new ANALYZE option, or something else? Not sure.

- the planner integration strategy,

I don't have a clear opinion on this yet. clauselist_selectivity_ext
seems like the right place, probably. I vaguely recall the challenge was
we either saw the per-relation restriction clauses or join clauses, but
not both at the same time?

- and scope (what kinds of joins / predicates are worth supporting).

IMHO it's fine to keep the scope rather narrow, at least for v1. As long
as we can later expand it to other cases, it's fine. The extended stats
initially did not support expressions, for example.

It's a bit orthogonal, but do you have an opinion on improving estimates
for some joins by considering per-table extended statistics in the
existing selectivity estimators (e.g. in eqjoinsel_inner)? I think it
might help cases with multi-clause join clauses, or cases where we have
a multi-column MCV list on the join key + WHERE conditions.

It's separate from what this patch aims to do, but I have imagined we'd
do that as the first step, before doing the proper join stats. I'm not
suggesting you have to do that too (or before this patch).

regards

--
Tomas Vondra

#12Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Corey Huinker (#9)
Re: Is there value in having optimizer stats for joins/foreignkeys?

On 1/30/26 08:29, Corey Huinker wrote:

Current Limitations
-------------------

This is a proof of concept. Known limitations include:

I really like this proof of concept.

 

1. The current catalog design is not ideal. It is asymmetric (a
"primary" and an "other" table), which is natural for FK-like joins,
but less intuitive for other joins

I think the asymmetry comes with the territory, and we will be creating
the join statistics that prove useful. If that means that we create one
object ON b.c1, b.c2 FROM a JOIN b... and another ON a.c3, a.c4 FROM b
JOIN a...  then so be it.
 


2. Stats collection piggybacks on ANALYZE of the primary table and
uses its single-column MCV for the join key.  This can be inaccurate
when the MCV values on the "primary" side don't cover the important
values on the other side, or when the filter column isn't fully
dependent on the join key. A more accurate approach would execute the
actual join during collection, which could also decouple join stats
collection from single-table ANALYZE.

Unfortunately, I think we will have to join the remote table_b to the
row sample on table_a to get accurate join statistics, and the best time
to do that when we already have the row sample from table_a. We can
further batch up statistics objects that happen to join table_a to
table_b by the same join criteria to avoid rescans.

IIRC the index-based sampling (described in the paper I mentioned) works
something like this - sample leading table, then use indexes to lookup
data from the other tables to build a statistically correct sample of
the whole join.
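A toy sketch of that idea (nowhere near the paper's full algorithm; just
uniform sampling of the leading table plus index lookups, with
illustrative names):

```python
import random

def index_join_sample(table_a, index_b, join_key, n, seed=0):
    """Sample n rows of table_a uniformly, then use an index (here a
    plain dict keyed on the join column) to fetch the matching rows
    of table_b, yielding a sample drawn from the join result itself."""
    rng = random.Random(seed)
    sample = rng.sample(table_a, min(n, len(table_a)))
    joined = []
    for row_a in sample:
        for row_b in index_b.get(row_a[join_key], []):
            joined.append({**row_a, **row_b})
    return joined

# Toy tables: 100 referencing rows, an "index" on the referenced side
a = [{"keyword_id": i % 3} for i in range(100)]
idx_b = {0: [{"keyword": "x"}], 1: [{"keyword": "y"}]}  # key 2 unmatched

print(len(index_join_sample(a, idx_b, "keyword_id", 10)))
```

The key property is that the resulting rows come from the join output,
so any stats built on them reflect the dependent, skewed distribution
rather than the product of two independent per-table samples.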

Will what you have work when we want to do an MCV on a mix of local and
remote columns, or will that require more work?
 

3. Currently limited to: equality join clauses, equality and IN filter
clauses, simple Var stats objects (no expressions), inner joins only,
and two-way joins only. Some of these are easier to extend; others may
be harder or unnecessary (like n-way joins).

I suspect that n-way joins would have very limited utility and could be
adequately covered with multiple join-stats objects.
 

I see no clear reason why that would be the case, and it's not that hard
to construct joins on 3+ tables where stats on 2-way joins won't be good
enough. Whether those cases are sufficiently common I don't know, but I
also don't see a good reason to not address them - it should not be much
more complex than 2-way joins (famous last words, I know).

4. Patch 0002 (auto-creation from FK constraints) should probably be
gated behind a GUC. I'm not strongly attached to this patch, but kept
it because FK joins seem like a natural and common use case.

I think this is like indexing, just because you can make all possible
columns indexed doesn't mean you should. Tooling will emerge to
determine what join stats objects are worth their weight, and create
only those objects.

Agreed.

If there's interest, I'm happy to continue iterating on the design.

In particular, I'd welcome feedback on:
- whether this is a direction worth pursuing,

Yes. Very much so.
 

- the catalog design,

I had somehow gotten the impression that you were going to take the
extended statistics format, but store individual columns in a modified
pg_statistic. That's working for now, but I wonder if that will still be
the case when we start to try this for columns that are arrays, ranges,
multiranges, tsvectors, etc. pg_statistic has the infrastructure to
handle those, but there may be good reason to keep pg_statistic focused
on local attributes and instead just keep adding new kinds to extended
statistics and let it be the grab-bag it was perhaps always meant to be.

In my own musings on how to implement this (which you have far exceeded
with this proof-of-concept), I had wondered how to derive stats for
k.keyword and k.phonetic_code individually from the definition
of movie_keywords2_multi_stats, but looking at what you've done I think
we're better off defining each statistics object very narrowly; if that
means we define one object per remote column and one object per
interesting combination of columns, then so be it. So long as we can
calculate them all from the same join of the two tables, we'll avoid the
nasty overhead.
 

Not sure I understand this. Why would this use pg_statistic at all? I
imagined we'd just treat the join as a relation, and store the stats in
pg_statistic_data_ext. The only difference is we need to store info
about the join itself (which rels / conditions), which would go into
pg_statistic_ext.

- and scope (what kinds of joins / predicates are worth supporting).

between-ish clauses ( x>= y AND x < z), etc is what immediately comes to
mind.

element_mcv for array types might be next, but that's well down the road.

Maybe. In v1 we should focus on mimicking what we have for per-relation
stats, and only then try doing something more.

regards

--
Tomas Vondra

#13Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andrei Lepikhov (#10)
Re: Is there value in having optimizer stats for joins/foreignkeys?

On 1/31/26 12:18, Andrei Lepikhov wrote:

On 29/1/26 06:04, Alexandra Wang wrote:

Hi hackers,

As promised in my previous email, I'm sharing a proof-of-concept patch
exploring join statistics for correlated columns across relations.
This is a POC at this point, but I hope the performance numbers below
give a better idea of both the potential usefulness of join statistics
and the complexity of implementing them.

I wonder why you chose the JOIN operator only?

It seems to me that any relational operator produces relational output
that can be treated as a table. The extended statistics code may be
adapted to such relations.
I think it may be a VIEW that you can declare (manually or
automatically) and allow Postgres to build statistics on this 'virtual'
table. So, the main focus may shift to the question: how to provably
match a query subtree to a specific statistic.

Because for each "supported" operator we need to know two things:

(1) how to sample it efficiently

(2) how to apply it in selectivity estimation

We can't add support for everything at once, and for some cases we may
not even know answers to (1) and/or (2).

We can't simply store an opaque VIEW and build the stats by executing it
(and sampling the results). The whole premise of extended stats is that
people define them to fix incorrect estimates. And with incorrect
estimates the plan may be terrible, and the VIEW may not even complete.

regards

--
Tomas Vondra

#14Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Corey Huinker (#8)
Re: Is there value in having optimizer stats for joins/foreignkeys?

On 1/30/26 07:26, Corey Huinker wrote:

I have indeed started by implementing MCV statistics for joins,
because I have not found a case for joins that would benefit only from
ndistinct or functional dependency stats that MCV stats wouldn't help.

That was a big question I had, and I agree that we should only add
statistics as uses for them become apparent.
 

I don't have a clear opinion on this, and maybe we don't need to build
some of the stats. But I think there's a subtle point here - these stats
may not be useful for estimation of the join itself, but could be useful
for estimating the upper part of the query.

For example, imagine you have a join, with an aggregation on top.

SELECT t1.a, t2.b, COUNT(*) FROM t1 JOIN t2 ON (...) GROUP BY 1, 2;

How would you estimate the number of groups for the aggregate without
having the ndistinct estimate on the join result?
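With only per-table stats, about the best available fallback is to
multiply the per-column ndistinct values and clamp by the join size, which
can badly overestimate when the grouping columns are correlated through
the join. A rough sketch of why that goes wrong:

```python
# Group-count estimation for GROUP BY t1.a, t2.b over a join
# (simplified sketch of the independence-based fallback, not the
# actual estimate_num_groups() logic).

def groups_independent(ndistinct_a, ndistinct_b, join_rows):
    # Without join-level stats: assume the grouping columns vary
    # independently, then clamp by the number of input rows.
    return min(ndistinct_a * ndistinct_b, join_rows)

# If t2.b is functionally determined by t1.a through the join key,
# the true group count is just ndistinct(a) = 1,000 -- information
# only a join-level ndistinct statistic can provide.
print(groups_independent(1_000, 1_000, 500_000))  # 500000: clamped, still 500x off
```

So even where MCVs fix the join cardinality itself, a join-level
ndistinct may be what saves the aggregation above it.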

In my POC patch, I've made the following catalog changes:
- Add stxotherrel (oid) and stxjoinkeys (int2vector) fields to
pg_statistic_ext
- Use the existing stxkeys (int2vector) to store the stats object
attributes of stxotherrel
- Create pg_statistic_ext_otherrel_index on (stxrelid, stxotherrel)
- Add stxdjoinmcv (pg_join_mcv_list) to pg_statistic_ext_data

I like all these changes. Maybe "outer" rel rather than "other" rel, but
it really doesn't matter this early on.

Already commented on this earlier - I don't think we want to restrict
the joins to 2-way joins. So this "outerrel" thing won't work.

To use them, we can let the planner detect patterns like this:

/*
* JoinStatsMatch - Information about a detected join pattern
* Used internally to track what was matched in a join+filter pattern
*/
typedef struct JoinStatsMatch
{
Oid target_rel; /* table OID of the estimation target */
AttrNumber targetrel_joinkey; /* target_rel's join column */
Oid other_rel; /* table OID of the filter source */
AttrNumber otherrel_joinkey; /* other_rel's join column */
List *filter_attnums; /* list of AttrNumbers for filter columns in
other_rel */
List *filter_values; /* list of Datum constant values being filtered */
Oid collation; /* collation for comparisons */

/* Additional info to avoid duplicate work */
List *join_rinfos; /* list of join clause RestrictInfos */
RestrictInfo *filter_rinfo; /* the filter clause RestrictInfo */
} JoinStatsMatch;

and add the detection logic in clauselist_selectivity_ext() and
get_foreign_key_join_selectivity(). 

Statistics collection indeed needs the most thinking. For the
purpose of a POC, I added MCV join stats collection as part of ANALYZE
of one table (stxrel in pg_statistic_ext). I can do this because MCV
join stats are somewhat asymmetric. It allows me to have a target
table (referencing table for foreign key join) to ANALYZE, and we can
use the already collected MCVs of the joinkey column on the target
table to query the rows in the other table. This greatly mitigates
performance impact compared to actually joining two tables. However,
if we are to support more complex joins or other types of join stats
such as ndistinct or functional dependencies, I found it hard to define
which table is the target (referencing) table and which is the other
(referenced) table outside of the foreign key join scenario. So I
think for those more complex cases we may eventually have to
perform the joins and collect the join stats separately. Alvaro
Herrera suggested offline that we could have a dedicated autovacuum
command option for collecting the join statistics.

I agree, we have to perform the joins, as the rows collected are
inherently dependent and skewed, and yes, it's going to be a maintenance
overhead hit.

Initially I thought this could be mitigated somewhat by retaining
rowsamples for each table, but those row samples would be independent of
the joins, and the values we need are inherently dependent.
 

Joining per-table samples doesn't work (which doesn't mean having samples
would not be useful, just not for joins). But I already mentioned the
paper I think describes how to build a sample for a join. Now that I
think of it I might have even posted a PoC patch implementing it using
SPI or something like that - but that'd be years ago, at this point.

regards

--
Tomas Vondra

#15Corey Huinker
corey.huinker@gmail.com
In reply to: Tomas Vondra (#11)
Re: Is there value in having optimizer stats for joins/foreignkeys?

Right. I have not investigated plans for the JOB queries, but a common
issue is that we pick a nested loop very early in the plan due to a poor
estimate, and that leads to a sequence of nested loops. Which just
amplifies the problem.

Even if the plan still uses nested loops, there are other things that
benefit from better row estimates. For instance, I've seen a case where the
planner decided to apply JIT to the query because it expected 40 million
rows of results... and it got 4.

Catalog changes:

pg_statistic_ext:
- New stats kind 'c' (join MCV) in stxkind
- New field stxotherrel (Oid): the other table in the join
- New field stxjoinkeys (int2vector): join column pair [primary_joinkey,
other_joinkey]
- Existing stxkeys stores the filter column(s) on stxotherrel
- New index pg_statistic_ext_otherrel_index on (stxrelid, stxotherrel)

- Why do we need a new stats kind? I'd have expected we'd use the same
'm' value as for a single-relation stats, simply because a join result
is just a relation too.

- For a PoC it's OK to have stxotherrel, but the final patch should not
be restricted to two relations. I think there needs to be an array of
the joined rel OIDs, or something like that.

A parallel array of joined rel oids could lead to some refetching of
relation tuples unless we insisted that the keys be sorted the way we
currently sort stxkeys.

However, a multi-relation extended statistic object runs into a practical
issue when we fetch the remote rows that join to the local rowsample. We
could do that with an EphemeralNamedRelation and SPI (yuck), or more likely
we would leverage the index defined for the foreign key that we're
requiring (we already do this for referential integrity lookups on
inserts). Fishing those foreign keys out of a multi-table select is
doable, but doesn't sound like fun. If we go that route then the
inability to find a supporting foreign key (and the index that goes with
it) should fail the statistic object creation.

So an array of RI constraint Oids (InvalidOid meaning "the attnum in the same
array position represents a local column/expression") would give us the
ability to do MCV combinations of columns from N tables along with the
local table, provided that there are fk constraints for each of the joins.

FWIW I recognize figuring out the catalog changes is not easy - in fact
I always found it so intimidating that I ended up procrastinating and not
making any progress on join statistics. So I really appreciate you
taking a stab at this.

+1

pg_statistic_ext_data:
- New field stxdjoinmcv (pg_join_mcv_list): serialized join MCV data

I don't understand why we need this. AFAICS we should just treat the
join result as a relation, and use the current pg_statistic_ext_data
fields.

As I said above, we'd need a parallel array of constraint keys, rel ids, or
both to make that work.

How stats are collected:

Join statistics are collected during ANALYZE of the primary table
(stxrelid). The current approach assumes a dependency relationship
between the join key column and the filter column(s) on the other
table. Specifically, during ANALYZE, we reuse the already-computed
single-column MCV stats for the primary table's join key column. For
each MCV value, we look up the matching row in the other table via the
join key and extract the filter column values, carrying over the
primary-side MCV frequency. This is not accurate in all cases, but
works reasonably well for the POC, especially for foreign-key-like
joins where the primary table's join key distribution is
representative of the join result.

For the PoC it's good enough, but I don't think we can rely on the
per-column MCV lists like this.

Agreed. We must go to the source.

The best approach I'm aware of is the "Index-Based Join Sampling" paper
I mentioned earlier in this thread. I think it's OK to allow building
statistics only for joins compatible with that sampling approach.

It's not clear to me what should trigger building the stats. I assume
it'd be done by ANALYZE, but that's per-table. So would it be triggered
by ANALYZE on any of the joined tables, or just the "main" one? Would it
require a new ANALYZE option, or something else? Not sure.

Just the main one. Madness lies in the alternative.

#16Andrei Lepikhov
lepihov@gmail.com
In reply to: Tomas Vondra (#13)
Re: Is there value in having optimizer stats for joins/foreignkeys?

On 1/2/26 17:39, Tomas Vondra wrote:

We can't simply store an opaque VIEW, and build the stats by simply
executing it (and sampling the results). The whole premise of extended
stats is that people define them to fix incorrect estimates. And with
incorrect estimates the plan may be terrible, and the VIEW may not even
complete.

Ok, I got the point.
I think linking to a join or foreign key seems restrictive. In my mind,
extended statistics may go the following way:

CREATE STATISTICS abc_stat ON (t1.x,t2.y,t3.z) FROM t1,t2,t3;

Suppose t1.x,t2.y, and t3.z have a common equality operator.

Here we can build statistics on (t1.x = t2.y), (t1.x = t3.z), (t2.y =
t3.z), and potentially (t1.x = t2.y = t3.z).

But I don't frequently detect problems with JOIN estimation using a
single join clause. Usually, we have problems with (I) join trees
(clauses spread across joins) and (II) a single multi-clause join.
We can't solve (I) here (kinda statistics on a VIEW might help, I
think), but may ease (II) using:

CREATE STATISTICS abc_stat ON ((t1.x=t2.x),(t1.y=t2.y)) FROM t1,t2;

or even more bravely:

CREATE STATISTICS abc_stat ON ((t1.x=t2.x),(t1.y=t2.y)) FROM t1,t2
WHERE (t1.z <> t2.z);

--
regards, Andrei Lepikhov,
pgEdge

#17Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andrei Lepikhov (#16)
Re: Is there value in having optimizer stats for joins/foreignkeys?

On 2/2/26 10:53, Andrei Lepikhov wrote:

On 1/2/26 17:39, Tomas Vondra wrote:

We can't simply store an opaque VIEW, and build the stats by simply
executing it (and sampling the results). The whole premise of extended
stats is that people define them to fix incorrect estimates. And with
incorrect estimates the plan may be terrible, and the VIEW may not even
complete.

Ok, I got the point.
I think linking to a join or foreign key seems restrictive. In my mind,
extended statistics may go the following way:

I agree we don't need to restrict to joins on foreign keys. I assume the
PoC patch requires f-keys because it makes building the join sample much
simpler / easier to think about.

The "sampling done right" paper I mentioned explains how to sample
general joins, as long as there are appropriate indexes. Foreign keys
always have those, but the constraint itself is not needed.

FWIW I think it's perfectly acceptable to allow extended stats only on
joins with appropriate indexes. Because without that we can't do the
sampling efficiently (or possibly at all).

But I'd leave this for later. For now it's perfectly fine to limit the
scope to FK joins, and then maybe expand the scope once we figure out
the other pieces.

CREATE STATISTICS abc_stat ON (t1.x,t2.y,t3.z) FROM t1,t2,t3;

Suppose t1.x,t2.y, and t3.z have a common equality operator.

Here we can build statistics on (t1.x = t2.y), (t1.x = t3.z), (t2.y =
t3.z), and potentially (t1.x = t2.y = t3.z).

If I understand correctly you suggest we generate all "possible" joins
joining on the columns specified in the ON clause.

I think we shouldn't do that, as it confuses the purpose of the
ON clause. That's meant to specify the list of columns on which to build
the extended statistic, but now it would also be generating join clauses.

It has to be possible to have non-join attributes in the ON clause,
because that's how we can track correlation between the tables. Which is
the whole point, I think. It's not about the join clause selectivity, or
at least not just about it.

Moreover, wouldn't it be rather inefficient? Imagine you have a join
with two tables and two join clauses. (t1.a = t2.a) AND (t1.b = t2.b).
But with your syntax it'd be just

CREATE STATISTICS s ON (t1.a, t1.b, t2.a, t2.b) FROM t1, t2;

And we'd have to build stats for (at least) 2 joins, because we have no
idea if t1.a joins to t2.a or t2.b.

So -1 to this, IMHO we need the "full" syntax with

CREATE STATISTICS s ON (t1.c, t2.d)
FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b);

We may need some additional statistics to track the selectivity of the
join clauses, in addition to the existing MCV stats built on the join
result.

But I don't frequently detect problems with JOIN estimation using a
single join clause. Usually, we have problems with (I) join trees
(clauses spread across joins) and (II) a single multi-clause join.
We can't solve (I) here (kinda statistics on a VIEW might help, I
think), but may ease (II) using:

CREATE STATISTICS abc_stat ON ((t1.x=t2.x),(t1.y=t2.y)) FROM t1,t2;

or even more bravely:

CREATE STATISTICS abc_stat ON ((t1.x=t2.x),(t1.y=t2.y)) FROM t1,t2
WHERE (t1.z <> t2.z);

I honestly don't see why this would be better / simpler than the CREATE
STATISTICS grammar that simply allows joins in the FROM part.

regards

--
Tomas Vondra