should we have a fast-path planning for OLTP starjoins?
Hi,
While running benchmarks for my "performance over 20 years" talk [1]https://www.postgresql.eu/events/pgconfeu2024/schedule/session/5585-performance-archaeology/,
I've been been also looking for common cases that don't perform well
(and thus might be a good topic for optimization, with significant
speedup helping a lot of deployments).
One such simple example that I ran into is "OLTP starjoin". You're
probably familiar with star schema in the DSS field [2]https://en.wikipedia.org/wiki/Star_schema, as a large fact
table with many small-ish dimensions. The OLTP variant is exactly the
same thing, but with selective WHERE conditions on the fact table.
So you can imagine it as a query of this shape:
SELECT * FROM fact_table f
JOIN dim1 ON (f.id1 = dim1.id)
JOIN dim2 ON (f.id2 = dim2.id)
JOIN dim3 ON (f.id3 = dim3.id)
...
WHERE f.id = 2398723;
This is a surprisingly common query pattern in OLTP applications, thanks
to normalization. For example the "fact" may be a table of transactions
with some basic common details, dimensions are additional "details" for
special types of transactions. When loading info about a transaction of
unknown type, this allows you to load everything at once.
Or maybe the fact table is "users" and the dimensions have all kinds of
info about the user (address, primary e-mail address, balance, ...).
Anyway, this pattern is quite common, yet it performs quite poorly.
Let's join a fact table with 10 dimensions - see the attached create
script to build such schema, and the test.sql script for pgbench.
On my new ryzen machine, this peaks at about ~16k tps with 16 clients.
The machine can easily do 1M tps in read-only pgbench, for example. And
if you increase the join_collapse_limit to 12 (because the default 8 is
not enough for the 10 dimensions), the throughput drops to ~2k tps.
That's not great.
AFAIK this is a consequence of the star joins allowing arbitrary join
order of the dimensions - those only have join conditions to the fact
relation, so it allows many join orders. So exploring them takes a lot
of time, of course.
But for starjoins, a lot of this is not really needed. In the simplest
case (no conditions on dimensions etc) it does not really matter in what
order we join those, and filters on dimensions make it only a little bit
more complicated (join the most selective first).
So I've been wondering how difficult would it be to have a special
fast-path mode for starjoins, completely skipping most of this. I
cobbled together a WIP/PoC patch (attached) on the way from FOSDEM, and
it seems to help quite a bit.
I definitely don't claim the patch is correct for all interesting cases,
just for the example query. And I'm sure there's plenty of things to fix
or improve (e.g. handling of outer joins, more complex joins, ...).
But these are the rough results for 1 and 16 clients:
build 1 16
--------------------------------------
master 1600 16000
patched 4400 46000
So that about triples the throughput. If you bump join_collapse_limit to
12, it gets even clearer
build 1 16
--------------------------------------
master 200 2000
patched 4500 48000
That's a 20x improvement - not bad. Sure, this is on a tiny dataset, and
with larger data sets it might need to do I/O, diminishing the benefits.
It's just an example to demonstrate the benefits.
If you want to try the patch, there's a new GUC enable_starjoin to
enable this optimization (off by default).
The patch does roughly this:
1) It tries to detect a "star join" before doing the full join order
search. It simply looks for the largest relation (not considering the
conditions), and assumes it's a fact. And then it searches for relations
that only join to the fact - those are the dimensions.
2) With the relations found in (1) it just builds the join relations
directly (one per level), without exploring all the possibilities. This
is where the speedup comes from.
3) If there are additional relations, those are then left to the regular
join order search algorithm.
There's a lot of stuff that could / should be improved on the current
patch. For (1) we might add support for more complex cases with
snowflake schemas [3]https://en.wikipedia.org/wiki/Snowflake_schema or with multiple fact tables. At the same time (1)
needs to be very cheap, so that it does not regress every non-starjoin
query.
For (2) it might pick a particular order we join the dimensions (by
size, selectivity, ...), and it might consider whether to join them
before/after the other relations.
FWIW I suspect there's a fair amount of research papers looking at
starjoins and what is the optimal plan for such queries, but I didn't
have time to look at that yet. Pointers welcome!
But the bigger question is whether it makes sense to have such fast-path
modes for certain query shapes. The patch "hard-codes" the planning for
starjoin queries, but we clearly can't do that for every possible join
shape (because then why have dynamic join search at all?).
I do think starjoins might be sufficiently unique / special to justify
this, but maybe it would be possible to instead improve the regular join
order search to handle this case better? I don't have a very clear idea
what would that look like, though :-(
I did check what do some other databases do, and they often have some
sort of "hint" to nudge the let the optimizer know this is a starjoin.
I also looked at what are the main bottlenecks with the simpler starjoin
planning enabled - see the attached flamegraphs. The optimizations seem
to break the stacktraces a bit, so there's a svg for "-O0 -ggdb3" too,
that doesn't have this issue (the shape is different, but the conclusion
are about the same).
In both cases about 40% of the time is spent in initial_cost_mergejoin,
which seems like a lot - and yes, disabling mergejoin doubles the
throughput. And most of the cost is in get_actual_variable_range,
looking up the range in the btrees. That seems like a lot, considering
the indexes are perfectly clean (we used to have problems with deleted
tuples, but this is not the case). I wonder if maybe we could start
caching this kind of info somewhere.
regards
[1]: https://www.postgresql.eu/events/pgconfeu2024/schedule/session/5585-performance-archaeology/
https://www.postgresql.eu/events/pgconfeu2024/schedule/session/5585-performance-archaeology/
[2]: https://en.wikipedia.org/wiki/Star_schema
[3]: https://en.wikipedia.org/wiki/Snowflake_schema
--
Tomas Vondra
Attachments:
0001-WIP-simplified-planning-of-starjoins.patchtext/x-patch; charset=UTF-8; name=0001-WIP-simplified-planning-of-starjoins.patchDownload
From f049875acde5d4959c2dfef65313ad2b3ab6a8d4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 3 Feb 2025 16:45:24 +0100
Subject: [PATCH] WIP: simplified planning of starjoins
---
src/backend/optimizer/path/allpaths.c | 118 ++++++++++++++++++-
src/backend/optimizer/path/joinrels.c | 157 +++++++++++++++++++++++++-
src/backend/utils/misc/guc_tables.c | 10 ++
src/include/optimizer/paths.h | 5 +
4 files changed, 288 insertions(+), 2 deletions(-)
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 1115ebeee29..6848055dbcc 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -77,6 +77,7 @@ typedef enum pushdown_safe_type
/* These parameters are set by GUC */
bool enable_geqo = false; /* just in case GUC doesn't set it */
+bool enable_starjoin = false;
int geqo_threshold;
int min_parallel_table_scan_size;
int min_parallel_index_scan_size;
@@ -3389,6 +3390,114 @@ make_rel_from_joinlist(PlannerInfo *root, List *joinlist)
}
}
+static int
+starjoin_join_search(PlannerInfo *root, List *initial_rels, int level)
+{
+ if (!enable_starjoin)
+ return level;
+
+ {
+ ListCell *lc;
+ List *rels = plan_star_join(root, initial_rels);
+ RelOptInfo *fact = NULL;
+ RelOptInfo *rel = NULL;
+
+ /*
+ * add the dimensions one by one, and adjust the start level
+ *
+ * XXX The first element is the fact table.
+ */
+ foreach(lc, rels)
+ {
+ ListCell *lc2;
+ RelOptInfo *old_rel = NULL;
+
+ rel = lfirst(lc);
+
+ /* us the first element as fact table, jump to the next one,
+ * which is the first dimension */
+ if (fact == NULL)
+ {
+ fact = rel;
+ continue;
+ }
+
+ /* we're adding join for the first dimension, so set the level */
+ root->join_cur_level = level;
+
+ /*
+ * XXX Subset of join_search_one_level. The main difference is
+ * we don't need to walk any of the lower levels, because for
+ * level=2 we already have the fact table, and for higher
+ * levels there should be only a single joinrel.
+ */
+
+ if (level == 2)
+ old_rel = fact;
+ else
+ old_rel = (RelOptInfo *) linitial(root->join_rel_level[level - 1]);
+
+ /* there should be no join relation yet */
+ Assert(root->join_rel_level[level] == NIL);
+
+ make_rel_by_clause_joins(root, old_rel, rel);
+
+ /*
+ * If everything went fine, we should have exactly one join relation
+ * for the current level.
+ *
+ * XXX This could happen if the current starjoin logic fails to
+ * consider something that prevents creating the join, e.g. some
+ * sort of join restriction. Not sure if that should be treated
+ * as a bug, or something expected (in which case we could just
+ * fallback to the regular planning).
+ */
+ Assert(root->join_rel_level[startlev] != NIL);
+ Assert(list_length(root->join_rel_level[startlev]) == 1);
+
+ /* generate/set paths for the join relation we just created */
+
+ /*
+ * Run generate_partitionwise_join_paths() and
+ * generate_useful_gather_paths() for each just-processed joinrel. We
+ * could not do this earlier because both regular and partial paths
+ * can get added to a particular joinrel at multiple times within
+ * join_search_one_level.
+ *
+ * After that, we're done creating paths for the joinrel, so run
+ * set_cheapest().
+ */
+ foreach(lc2, root->join_rel_level[level])
+ {
+ rel = (RelOptInfo *) lfirst(lc2);
+
+ /* Create paths for partitionwise joins. */
+ generate_partitionwise_join_paths(root, rel);
+
+ /*
+ * Except for the topmost scan/join rel, consider gathering
+ * partial paths. We'll do the same for the topmost scan/join rel
+ * once we know the final targetlist (see grouping_planner's and
+ * its call to apply_scanjoin_target_to_paths).
+ */
+ if (!bms_equal(rel->relids, root->all_query_rels))
+ generate_useful_gather_paths(root, rel, false);
+
+ /* Find and save the cheapest paths for this rel */
+ set_cheapest(rel);
+
+ #ifdef OPTIMIZER_DEBUG
+ pprint(rel);
+ #endif
+ }
+
+ level++;
+ }
+ }
+
+ return level;
+}
+
/*
* standard_join_search
* Find possible joinpaths for a query by successively finding ways
@@ -3422,6 +3531,7 @@ RelOptInfo *
standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
{
int lev;
+ int startlev = 2;
RelOptInfo *rel;
/*
@@ -3445,7 +3555,13 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
root->join_rel_level[1] = initial_rels;
- for (lev = 2; lev <= levels_needed; lev++)
+ /*
+ * Try simplified planning for starjoin. If it succeeds, we should
+ * continue at level startlev.
+ */
+ startlev = starjoin_join_search(root, initial_rels, 2);
+
+ for (lev = startlev; lev <= levels_needed; lev++)
{
ListCell *lc;
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index c2eb300ea9b..e604f863078 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -224,7 +224,6 @@ join_search_one_level(PlannerInfo *root, int level)
foreach(r, joinrels[level - 1])
{
RelOptInfo *old_rel = (RelOptInfo *) lfirst(r);
-
make_rels_by_clauseless_joins(root,
old_rel,
joinrels[1]);
@@ -255,6 +254,19 @@ join_search_one_level(PlannerInfo *root, int level)
}
}
+void
+make_rel_by_clause_joins(PlannerInfo *root,
+ RelOptInfo *old_rel,
+ RelOptInfo *other_rel)
+{
+ if (!bms_overlap(old_rel->relids, other_rel->relids) &&
+ (have_relevant_joinclause(root, old_rel, other_rel) ||
+ have_join_order_restriction(root, old_rel, other_rel)))
+ {
+ (void) make_join_rel(root, old_rel, other_rel);
+ }
+}
+
/*
* make_rels_by_clause_joins
* Build joins between the given relation 'old_rel' and other relations
@@ -1963,3 +1975,146 @@ get_matching_part_pairs(PlannerInfo *root, RelOptInfo *joinrel,
*parts2 = lappend(*parts2, child_rel2);
}
}
+
+/*
+ * Try to identify a starjoin in the list of relations. Pick the largest
+ * relation, and the smaller dimensions.
+ *
+ * Happens in two steps. First, we find the largest relation and consider
+ * it to be the "fact" of the star schema. Then we walk the rest of the
+ * relations and check which can be treated as dimensions for the fact.
+ * This is possible only if the relation has join clause only to the fact
+ * and no other relations.
+ *
+ * XXX It can happen the largest table is not the fact, in which case we
+ * should just try the second largest one, etc. Or maybe there are multiple
+ * facts, in which case we detect the should try to build a group for each
+ * fact (fact + dimensions).
+ *
+ * Returns the list of relations to join, in the join order, with the fact
+ * table as the first element, followed by the dimensions.
+ */
+List *
+plan_star_join(PlannerInfo *root, List *rels)
+{
+ ListCell *lc;
+ RelOptInfo *fact = NULL;
+ List *dimensions = NIL;
+
+ /*
+ * We need at least 3 relations for a star join, to have a chance to
+ * gain anything by simpler join order planning.
+ */
+ if (list_length(rels) < 3)
+ return NIL;
+
+ /*
+ * Find the largest relation, we'll try to use it as "fact" table.
+ */
+ foreach(lc, rels)
+ {
+ RelOptInfo *rel = (RelOptInfo *) lfirst(lc);
+
+ /* first relation */
+ if (fact == NULL)
+ {
+ fact = rel;
+ continue;
+ }
+
+ /*
+ * We look at total relation sizes, not the estimated cardinality
+ * with conditions applied.
+ */
+ if (fact->tuples < rel->tuples)
+ {
+ fact = rel;
+ continue;
+ }
+ }
+
+ /*
+ * If the "fact" does not have any join clauses, we're done.
+ *
+ * XXX Seems has_join_restriction is not what we want to require for the
+ * fact table - it checks for restrictions on join order, but that's not
+ * what we want for the fact. Maybe we should do the exact opposite, i.e.
+ * require that a fact table does not have that? Although, if we want to
+ * support multiple "partial star joins" (query on multiple fact tables,
+ * each with it's own dimensions).
+ */
+ //if (!has_join_restriction(root, fact))
+ // return NIL;
+
+ /* the fact must have no restrictions */
+ if (has_join_restriction(root, fact))
+ return NIL;
+
+ /*
+ * Now go and try to detect dimensions, i.e. relations that have a join
+ * with the fact table, and no other relations. We will order them by
+ * selectivity (rows / tuples), because we prefer to reduce the join
+ * size early.
+ */
+ foreach(lc, rels)
+ {
+ RelOptInfo *rel = (RelOptInfo *) lfirst(lc);
+
+ /* ignore the fact table */
+ if (rel == fact)
+ continue;
+
+ // elog(WARNING, "> has_join_restriction %d", has_join_restriction(root, rel));
+ // elog(WARNING, "> has_legal_joinclause %d", has_legal_joinclause(root, rel));
+
+ /* ignore rels without any join clause */
+ // if (!has_join_restriction(root, rel))
+ // continue;
+
+ /*
+ * XXX Do not allow join restrictions for dimensions either, just like
+ * for fact, although for dims we should be able to allow this ...
+ */
+ if (has_join_restriction(root, rel))
+ continue;
+
+ /*
+ * Must have join clause with the fact table. This is a subset of
+ * has_legal_joinclause for a single (fact) table. We always look
+ * at initial rels, so the relids overlap checks are not needed.
+ */
+ if (have_relevant_joinclause(root, fact, rel))
+ {
+ Relids joinrelids;
+ SpecialJoinInfo *sjinfo;
+ bool reversed;
+
+ /* join_is_legal needs relids of the union */
+ joinrelids = bms_union(fact->relids, rel->relids);
+
+ if (join_is_legal(root, fact, rel, joinrelids,
+ &sjinfo, &reversed))
+ {
+ /* Yes, this will work */
+ // bms_free(joinrelids);
+
+ // FIXME this should also check the rel does not have join
+ // clauses to any other relation;
+ dimensions = lappend(dimensions, rel);
+ }
+
+ bms_free(joinrelids);
+ }
+ }
+
+ /*
+ * repeat the check we actually found a star join with at least 3 rels
+ * (so two dimensions)
+ */
+ if (list_length(dimensions) < 2)
+ return NIL;
+
+ /* FIXME sort the dimensions by selectivity */
+
+ return list_concat(list_make1(fact), dimensions);
+}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 71448bb4fdd..941545d9125 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1018,6 +1018,16 @@ struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_starjoin", PGC_USERSET, QUERY_TUNING_GEQO,
+ gettext_noop("Enables starjoin optimization."),
+ gettext_noop("This algorithm attempts to do faster planning for star joins."),
+ GUC_EXPLAIN
+ },
+ &enable_starjoin,
+ false,
+ NULL, NULL, NULL
+ },
{
/*
* Not for general use --- used by SET SESSION AUTHORIZATION and SET
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 46955d128f0..2eed48b560e 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -21,6 +21,7 @@
* allpaths.c
*/
extern PGDLLIMPORT bool enable_geqo;
+extern PGDLLIMPORT bool enable_starjoin;
extern PGDLLIMPORT int geqo_threshold;
extern PGDLLIMPORT int min_parallel_table_scan_size;
extern PGDLLIMPORT int min_parallel_index_scan_size;
@@ -111,6 +112,10 @@ extern bool have_dangerous_phv(PlannerInfo *root,
extern void mark_dummy_rel(RelOptInfo *rel);
extern void init_dummy_sjinfo(SpecialJoinInfo *sjinfo, Relids left_relids,
Relids right_relids);
+extern List *plan_star_join(PlannerInfo *root, List *rels);
+extern void make_rel_by_clause_joins(PlannerInfo *root,
+ RelOptInfo *old_rel,
+ RelOptInfo *other_rel);
/*
* equivclass.c
--
2.47.1
starjoin-optim.pngimage/png; name=starjoin-optim.pngDownload
�PNG
IHDR � $dA<