Quadratic planning time for ordered paths over partitioned tables

Started by Alexander Kuzmenkov12 months ago9 messages
#1Alexander Kuzmenkov
akuzmenkov@timescale.com
1 attachment(s)

Hi hackers,

There's currently an unfortunate CPU sink in the planning for
partitioned tables. It happens both for the declarative partitioning,
and for the partitioning through inheritance like we use in
TimescaleDB.

The gist of it is that there's a linear search of the EM corresponding
to the given child relation in find_ec_member_matching_pathkeys(). For
a partitioned table, this adds up to a time quadratic in the number of
partitions, which can make some queries unusable after you hit 1k
partitions or so. The usual callers are prepare_sort_from_pathkeys()
or generate_join_implied_equalities().

There's a very simple fix for this, that makes use of the fact that
these EMs are normally looked up in the partitioning order. We can
cache the position of the previously found EM in a static variable in
this function, which makes this search constant-time for the typical
scenarios.

This is arguably a hack, but a simple one. The problem is really
unfortunate and is a subject of many complaints from our customers.
There was another attempt to fix this in [1], but it's been going on
for years without reaching a committable shape. This patch proposes a
quick and dirty alternative that can be adopted until the more
principled solution is worked out.

I did a test with 10k partitions, and this patch makes planning there
3.5 times faster (2.4 s -> 0.7 s):

create table t(x int) partition by range(x);

select format('create table t_%s partition of t for values from (%s)
to (%s)', x, x * 10000, (x + 1) * 10000) from generate_series(0, 9999)
x \gexec

insert into t select generate_series(0, 10000 * 9999);
vacuum analyze t;

set max_parallel_workers_per_gather = 0;
set enable_partitionwise_aggregate = 1;
set enable_hashagg = 0;
insert into t select generate_series(0, 10000 * 9999);

explain select count(*) from t group by x;

The query is supposed to generate a partitionwise aggregation over
sort like this:
Append
-> GroupAggregate
Group Key: t.x
-> Sort
Sort Key: t.x
-> Seq Scan on t_0 t
-> GroupAggregate
Group Key: t_1.x
-> Sort
Sort Key: t_1.x
-> Seq Scan on t_1
...

Would be glad to hear your opinions on this. CCing some committers who
were recently involved with partitioning.

References:
1: /messages/by-id/CAJ2pMkZNCgoUKSE+_5LthD+KbXKvq6h2hQN8Esxpxd+cxmgomg@mail.gmail.com

---
Alexander Kuzmenkov
Timescale

Attachments:

v1-0001-Fix-quadratic-equivalence-member-search-for-parti.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Fix-quadratic-equivalence-member-search-for-parti.patchDownload
From 4bc0c6d0ef14006b2dea8db871ba0e4a247bec71 Mon Sep 17 00:00:00 2001
From: Alexander Kuzmenkov <36882414+akuzm@users.noreply.github.com>
Date: Wed, 22 Jan 2025 16:33:40 +0100
Subject: [PATCH v1] Fix quadratic equivalence member search for partitioned
 tables

Currently the find_ec_member_matching_expr() uses a linear search. When
creating ordered paths for a partitioned table, the total time taken by
this linear search becomes effectively quadratic in the number of
partitions, which starts seriously affecting the planning performance
after we hit 1k partitions. Ideally, ec_members should be a hash table,
but as a quick fix, we can make use of the fact that this search usually
looks up the either the same EM as before, or the next one, and start
the search from the previous position. This speeds up the common case.
---
 src/backend/optimizer/path/equivclass.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 7cafaca33c..9d8fb5e40d 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -764,16 +764,26 @@ find_ec_member_matching_expr(EquivalenceClass *ec,
 							 Expr *expr,
 							 Relids relids)
 {
-	ListCell   *lc;
-
 	/* We ignore binary-compatible relabeling on both ends */
 	while (expr && IsA(expr, RelabelType))
 		expr = ((RelabelType *) expr)->arg;
 
-	foreach(lc, ec->ec_members)
+	/*
+	 * When creating ordered paths for a partitioned table, the total time taken
+	 * by this linear search becomes effectively quadratic in the number of
+	 * partitions, which starts seriously affecting the planning performance
+	 * after we hit 1k partitions. Ideally, ec_members should be a hash table,
+	 * but as a quick fix, we can make use of the fact that this search usually
+	 * looks up the either the same EM as before, or the next one, and start the
+	 * search from the previous position. This speeds up the common case.
+	 */
+	static int previous_em_found_at = 0;
+	int n = list_length(ec->ec_members);
+	for(int search_offset = 0; search_offset < n; search_offset++)
 	{
-		EquivalenceMember *em = (EquivalenceMember *) lfirst(lc);
-		Expr	   *emexpr;
+		const int current_position = (search_offset + previous_em_found_at) % n;
+		EquivalenceMember *em = list_nth_node(EquivalenceMember, ec->ec_members,
+			current_position);
 
 		/*
 		 * We shouldn't be trying to sort by an equivalence class that
@@ -792,12 +802,15 @@ find_ec_member_matching_expr(EquivalenceClass *ec,
 		/*
 		 * Match if same expression (after stripping relabel).
 		 */
-		emexpr = em->em_expr;
+		Expr *emexpr = em->em_expr;
 		while (emexpr && IsA(emexpr, RelabelType))
 			emexpr = ((RelabelType *) emexpr)->arg;
 
 		if (equal(emexpr, expr))
+		{
+			previous_em_found_at = current_position;
 			return em;
+		}
 	}
 
 	return NULL;
-- 
2.34.1

#2Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Alexander Kuzmenkov (#1)
Re: Quadratic planning time for ordered paths over partitioned tables

Hello,

On 2025-Jan-22, Alexander Kuzmenkov wrote:

Hi hackers,

There's currently an unfortunate CPU sink in the planning for
partitioned tables. It happens both for the declarative partitioning,
and for the partitioning through inheritance like we use in
TimescaleDB.

The gist of it is that there's a linear search of the EM corresponding
to the given child relation in find_ec_member_matching_pathkeys().

I think this is closely related to the work Yuya Watari has been doing
at
/messages/by-id/CAJ2pMkZZHrhgQ5UV0y+STKqx7XVGzENMhL98UbKM-OArvK9dmA@mail.gmail.com
Perhaps you could contribute by reviewing that patch series.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Los cuentos de hadas no dan al niño su primera idea sobre los monstruos.
Lo que le dan es su primera idea de la posible derrota del monstruo."
(G. K. Chesterton)

#3Alexander Kuzmenkov
akuzmenkov@timescale.com
In reply to: Alvaro Herrera (#2)
Re: Quadratic planning time for ordered paths over partitioned tables

On Wed, Jan 22, 2025 at 5:36 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

I think this is closely related to the work Yuya Watari has been doing
at
/messages/by-id/CAJ2pMkZZHrhgQ5UV0y+STKqx7XVGzENMhL98UbKM-OArvK9dmA@mail.gmail.com
Perhaps you could contribute by reviewing that patch series.

Yeah, I'm referencing this one in my email, but it's a series of
patches that does a major refactoring changing thousands of lines. I'm
not sure when or if it's going to land, do you think applying a quick
fix first would make sense? It makes trivial changes in one function.

#4Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Alexander Kuzmenkov (#3)
Re: Quadratic planning time for ordered paths over partitioned tables

On 2025-Jan-22, Alexander Kuzmenkov wrote:

On Wed, Jan 22, 2025 at 5:36 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

I think this is closely related to the work Yuya Watari has been doing
at
/messages/by-id/CAJ2pMkZZHrhgQ5UV0y+STKqx7XVGzENMhL98UbKM-OArvK9dmA@mail.gmail.com
Perhaps you could contribute by reviewing that patch series.

Yeah, I'm referencing this one in my email, but it's a series of
patches that does a major refactoring changing thousands of lines. I'm
not sure when or if it's going to land, do you think applying a quick
fix first would make sense? It makes trivial changes in one function.

I think it might go faster if you try it out and review it.

Which problems did you notice that make you think it's not committable?
Maybe you can propose solutions for those problems.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Kuzmenkov (#3)
Re: Quadratic planning time for ordered paths over partitioned tables

Alexander Kuzmenkov <akuzmenkov@timescale.com> writes:

Yeah, I'm referencing this one in my email, but it's a series of
patches that does a major refactoring changing thousands of lines. I'm
not sure when or if it's going to land, do you think applying a quick
fix first would make sense? It makes trivial changes in one function.

That "quick fix" seems extremely horrid: since there's only one static
cache variable for all ECs, it's going to help only one very specific
call pattern, which could easily get broken by unrelated changes.
Also, maybe my performance instincts are rooted in old hardware, but
I don't trust integer division with a variable divisor to be cheap.
So ISTM this could as easily make things worse as better. If you
offered something that was less obviously a kluge, maybe we could
use it.

Really I'd think the right place to be fixing this is at a higher
level. Where are the repetitive find_ec_member_matching_expr calls
coming from?

regards, tom lane

#6Alexander Kuzmenkov
akuzmenkov@timescale.com
In reply to: Tom Lane (#5)
Re: Quadratic planning time for ordered paths over partitioned tables

On Wed, Jan 22, 2025 at 6:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Really I'd think the right place to be fixing this is at a higher
level. Where are the repetitive find_ec_member_matching_expr calls
coming from?

I'm not aware of the repeated calls for the same child, and I mostly
see it called once for every partition from
prepare_sort_from_pathkeys(). For this case, fixing it at the high
level would require something like MergeAppend building a mapping of
Pathkey -> EM for all its children upfront, then passing this down to
the prepare_sort_from_pathkeys().

The other patch series focuses on this being called from the
generate_join_implied_equalities(). That call site could probably use
the same approach with building a mapping before generating the
partitionwise join paths.

#7Aleksander Alekseev
aleksander@timescale.com
In reply to: Alexander Kuzmenkov (#3)
Re: Quadratic planning time for ordered paths over partitioned tables

Hi,

I think this is closely related to the work Yuya Watari has been doing
at
/messages/by-id/CAJ2pMkZZHrhgQ5UV0y+STKqx7XVGzENMhL98UbKM-OArvK9dmA@mail.gmail.com
Perhaps you could contribute by reviewing that patch series.

Yeah, I'm referencing this one in my email, but it's a series of
patches that does a major refactoring changing thousands of lines. I'm
not sure when or if it's going to land, do you think applying a quick
fix first would make sense? It makes trivial changes in one function.

It looks like the author keeps the patch set up to date. Although he
proposes a complicated refactoring this is better than "quick and
dirty fix" as you put it, IMO. So I suggest focusing on this patch
set. On top of that somehow I doubt we will find a committer willing
to be responsible for merging anything quick and dirty anyway.

Did you consider checking if the referenced patchset addresses the
issue you described?

--
Best regards,
Aleksander Alekseev

#8Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Aleksander Alekseev (#7)
Re: Quadratic planning time for ordered paths over partitioned tables

On 2025-Jan-24, Aleksander Alekseev wrote:

Did you consider checking if the referenced patchset addresses the
issue you described?

I ran Kuzmenkov's test case with Watari-san's patch. Planning time goes
from 2700ms to 600ms or so.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

#9Alexander Kuzmenkov
akuzmenkov@timescale.com
In reply to: Alvaro Herrera (#8)
Re: Quadratic planning time for ordered paths over partitioned tables

On Fri, Jan 24, 2025 at 2:37 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

I ran Kuzmenkov's test case with Watari-san's patch. Planning time goes
from 2700ms to 600ms or so.

Thank you, good to know that it helps this case as well.