[PATCH] Teach planner to further optimize sort in distinct

Started by Ankit Kumar Pandeyalmost 3 years ago5 messages

itsankitkp@gmail.com

almost 3 years ago

1 attachment(s)

Hi, this is extension of `teach planner to evaluate multiple windows in
the optimal order` work applied to distinct operation.

Based on discussions before
(/messages/by-id/CAApHDvr7rSCVXzGfVa1L9pLpkKj6-s8NynK8o+98X9sKjejnQQ@mail.gmail.com
,

All I imagine you need to do for it
is to invent a function in pathkeys.c which is along the lines of what
pathkeys_count_contained_in() does, but returns a List of pathkeys
which are in keys1 but not in keys2 and NIL if keys2 has a pathkey
that does not exist as a pathkey in keys1. In
create_final_distinct_paths(), you can then perform an incremental
sort on any input_path which has a non-empty return list and in
create_incremental_sort_path(), you'll pass presorted_keys as the
number of pathkeys in the path, and the required pathkeys the
input_path->pathkeys + the pathkeys returned from the new function.

There is bit confusion in wording here:

"returns a List of pathkeys
which are in keys1 but not in keys2 and NIL if keys2 has a pathkey
that does not exist as a pathkey in keys1."

You mean extract common keys without ordering right?

Example: keys1 = (a,b,c), keys2 = (b,a)

returns (a,b)

and

keys1 = (a,b,c), keys = (d)

returns = ()

which translates to

needed_pathkeys = (a,b,c) = key2

input_pathkeys = (b,a) key1

returns (b,a) = common_keys

new needed_pathkeys = unique(common_keys + old needed_pathkeys)

=> new needed_pathkeys = (b,a,c)

The new needed_pathkeys matches input_pathkeys.

This is what I implemented in the patch.

The patched version yields the following plans:

set enable_hashagg=0;
set enable_seqscan=0;

explain (costs off) select distinct relname,relkind,count(*) over
(partition by
relkind) from pg_Class;
                       QUERY PLAN
---------------------------------------------------------
Unique
   -> Incremental Sort
         Sort Key: relkind, relname, (count(*) OVER (?))
         Presorted Key: relkind
         -> WindowAgg
               -> Sort
                     Sort Key: relkind
                     -> Seq Scan on pg_class
(8 rows)

explain (costs off) select distinct a, b, count(*) over (partition by b,
a) from abcd;
                       QUERY PLAN
--------------------------------------------------------
Unique
   -> Incremental Sort
         Sort Key: b, a, (count(*) OVER (?))
         Presorted Key: b, a
         -> WindowAgg
               -> Incremental Sort
                     Sort Key: b, a
                     Presorted Key: b
                     -> Index Scan using b_idx on abcd
(9 rows)

explain (costs off) select distinct a, b, count(*) over (partition by c,
d) from abcd;
                       QUERY PLAN
--------------------------------------------------------
Unique
   -> Sort
         Sort Key: a, b, (count(*) OVER (?))
         -> WindowAgg
               -> Incremental Sort
                     Sort Key: c, d
                     Presorted Key: c
                     -> Index Scan using c_idx on abcd
(8 rows)

Issue with index path still remains as pathkeys get purged by
truncate_useless_pathkeys

and hence are not available in create_final_distinct_paths for the above
optimizations.

I have attached a patch for the reference.

Thanks,

Ankit

Attachments:

distinct_sort_optimizations.patchtext/x-patch; charset=UTF-8; name=distinct_sort_optimizations.patchDownload

diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 609df93dc9..13f6006577 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -1968,3 +1968,32 @@ has_useful_pathkeys(PlannerInfo *root, RelOptInfo *rel)
 		return true;			/* might be able to use them for ordering */
 	return false;				/* definitely useless */
 }
+
+/*
+ * extract_common_pathkeys
+ *		returns a List of pathkeys
+ *	which are in keys1 but not in keys2 and NIL if keys2 has a pathkey
+ * that does not exist as a pathkey in keys1 
+ */
+List *
+extract_common_pathkeys(List* keys1, List *keys2)
+{
+	List *new_pk = NIL;
+	ListCell	*l1;
+	ListCell	*l2;
+	foreach(l1, keys1)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(l1);
+		foreach(l2, keys2)
+		{
+			PathKey    *pathkey2 = (PathKey *) lfirst(l2);
+			if (pathkey1 == pathkey2)
+			{
+				new_pk = lappend(new_pk, pathkey1);
+				break;
+			}
+		}
+	}
+	return new_pk;
+
+}
\ No newline at end of file
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 044fb24666..1802d28e75 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4844,11 +4844,28 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
 			Path	   *sorted_path;
 			bool		is_sorted;
 			int			presorted_keys;
+			List		*common_keys;
 
 			is_sorted = pathkeys_count_contained_in(needed_pathkeys,
 													input_path->pathkeys,
 													&presorted_keys);
 
+			/*
+			 * Check if there are common pathkeys (regardless of ordering)
+			 */
+			common_keys = extract_common_pathkeys(input_path->pathkeys, needed_pathkeys);
+			
+			if (common_keys)
+			{
+				/*
+				 * Now that we have common keys, we can add these to path
+				 */
+				needed_pathkeys = list_concat_unique(common_keys, needed_pathkeys);
+				is_sorted = pathkeys_count_contained_in(needed_pathkeys,
+													input_path->pathkeys,
+													&presorted_keys);
+			}
+
 			if (is_sorted)
 				sorted_path = input_path;
 			else
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 65a3c35611..b1b700e067 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -248,6 +248,7 @@ extern List *truncate_useless_pathkeys(PlannerInfo *root,
 									   RelOptInfo *rel,
 									   List *pathkeys);
 extern bool has_useful_pathkeys(PlannerInfo *root, RelOptInfo *rel);
+extern List *extract_common_pathkeys(List* keys1, List *keys2);
 extern List *append_pathkeys(List *target, List *source);
 extern PathKey *make_canonical_pathkey(PlannerInfo *root,
 									   EquivalenceClass *eclass, Oid opfamily,

David Rowley

dgrowleyml@gmail.com

almost 3 years ago

In reply to: Ankit Kumar Pandey (#1)

Re: [PATCH] Teach planner to further optimize sort in distinct

On Wed, 18 Jan 2023 at 08:27, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:

There is bit confusion in wording here:

"returns a List of pathkeys
which are in keys1 but not in keys2 and NIL if keys2 has a pathkey
that does not exist as a pathkey in keys1."

You mean extract common keys without ordering right?

I think you should write a function like:

bool pathkeys_count_contained_in_unordered(List *keys1, List *keys2,
List **reorderedkeys, int *n_common)

which works very similarly to pathkeys_count_contained_in, but
populates *reorderedkeys so it contains all of the keys in keys1, but
put the matching ones in the same order as they are in keys2. If you
find a keys2 that does not exist in keys1 then just add the additional
unmatched keys1 keys to *reorderedkeys. Set *n_common to the number
of common keys excluding any that come after a key2 key that does not
exist as a key1 key.

You can just switch to using that function in
create_final_distinct_paths(). You'll need to consider if the query is
a DISTINCT ON query and not try the unordered version of the function
in that case.

I also just noticed that in build_index_paths() we'll leave the index
path's pathkeys empty if we deem the pathkeys as useless. I'm not
sure what the repercussions of setting those to the return value of
build_index_pathkeys() if useful_pathkeys is otherwise empty. It's
possible that truncate_useless_pathkeys() needs to be modified to
check if the pathkeys might be useful for DISTINCT, but now that I see
we don't populate the IndexPath's pathkeys when we deem them not
useful makes me wonder if this entire patch is a good idea. When I
thought about it I assumed that we always set IndexPath's pathkeys to
whatever (if any) sort order that the index provides.

David

Ankit Kumar Pandey

itsankitkp@gmail.com

almost 3 years ago

In reply to: David Rowley (#2)

1 attachment(s)

Re: [PATCH] Teach planner to further optimize sort in distinct

On 19/01/23 18:49, David Rowley wrote:

I think you should write a function like:

bool pathkeys_count_contained_in_unordered(List *keys1, List *keys2,
List **reorderedkeys, int *n_common)

which works very similarly to pathkeys> _count_contained_in, but
populates *reorderedkeys so it contains all of the keys in keys1, but
put the matching ones in the same order as they are in keys2. If you
find a keys2 that does not exist in keys1 then just add the additional
unmatched keys1 keys to *reorderedkeys. Set *n_common to the number
of common keys excluding any that come after a key2 key that does not
exist as a key1 key.

You can just switch to using that function in
create_final_distinct_paths(). You'll need to consider if the query is
a DISTINCT ON query and not try the unordered version of the function
in that case.

Tried this, it worked as expected. Tests are green as well.

I also just noticed that in build_index_paths() we'll leave the index
path's pathkeys empty if we deem the pathkeys as useless. I'm not
sure what the repercussions of setting those to the return value of
build_index_pathkeys() if useful_pathkeys is otherwise empty.

This is very rigid indeed.

It's possible that truncate_useless_pathkeys() needs to be modified to
check if the pathkeys might be useful for DISTINCT

We have pathkeys_useful_for_merging and pathkeys_useful_for_ordering.

Can we not have pathkeys_useful_for_distinct?

Also, pathkeys_useful_for_ordering calls pathkeys_count_contained_in.

We can add code path on similar lines.

When I
thought about it I assumed that we always set IndexPath's pathkeys to
whatever (if any) sort order that the index provides.

Can we not added original path keys in IndexPath? It could be useful

at other places as well. Atleast, I can see it useful in sorting cases.

makes me wonder if this entire patch is a good idea.

We are still getting some benefit even without index paths for now.

I have attached patch with pathkeys_count_contained_in_unordered

and corresponding changes in create_final_distinct_paths for reference.

Thanks,

Ankit

Attachments:

better-distinct-v2.patchtext/x-patch; charset=UTF-8; name=better-distinct-v2.patchDownload

diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index d2e241c983..f0bc977eff 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -2014,3 +2014,52 @@ has_useful_pathkeys(PlannerInfo *root, RelOptInfo *rel)
 		return true;			/* might be able to use them for ordering */
 	return false;				/* definitely useless */
 }
+
+/*
+ * pathkeys_count_contained_in_unordered
+ *		reorders keys1 to match keys2
+ * Populates *reorderedkeys so it contains all of the keys in keys1, but
+ * put the matching ones in the same order as they are in keys2.  If keys
+ * in keys1 which do not match keys2 are appended in the end.
+ * n_common denotes number of matched keys.
+ */
+bool
+pathkeys_count_contained_in_unordered(List *keys1, List *keys2,
+									  List **reorderedkeys, int *n_common)
+{
+	ListCell*	lc1;
+	ListCell*	lc2;
+	int			n = 0;
+	List*		unmatched_keys = NIL;
+
+	foreach(lc1, keys1)
+	{
+		bool matched = false;
+		PathKey    *pathkey1 = (PathKey *) lfirst(lc1);
+		foreach(lc2, keys2)
+		{
+			PathKey    *pathkey2 = (PathKey *) lfirst(lc2);
+			if (pathkey1 == pathkey2)
+			{
+				*reorderedkeys = lappend(*reorderedkeys, pathkey2);
+				n++;
+				matched = true;
+				break;
+			}
+
+		}
+		/*
+		 * If no match is found, collect path key for appending it later
+		 */
+		if (!matched)
+			unmatched_keys = lappend(unmatched_keys, pathkey1);
+	}
+
+	Assert(n == list_length(*reorderedkeys));
+
+	*reorderedkeys = list_concat_unique(*reorderedkeys, unmatched_keys);
+	Assert(list_length(*reorderedkeys) == list_length(keys1));
+	*n_common = n;
+
+	return *n_common == list_length(keys1);
+}
\ No newline at end of file
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 05f44faf6e..d9904b046a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4936,10 +4936,28 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
 			Path	   *sorted_path;
 			bool		is_sorted;
 			int			presorted_keys;
+			List*		reordered_keys = NIL;
 
-			is_sorted = pathkeys_count_contained_in(needed_pathkeys,
+			/*
+			 * Attempt to optimize non distinct-on queries
+			 * if sorted keys are present but not in required order.
+			 * We can modify needed_path keys as per as input path key
+			 * ordering and reuse input sort.
+			 */
+			if (!parse->hasDistinctOn)
+			{
+				is_sorted = pathkeys_count_contained_in_unordered(needed_pathkeys,
 													input_path->pathkeys,
+													&reordered_keys,
 													&presorted_keys);
+				needed_pathkeys = reordered_keys;
+			}
+			else
+			{
+				is_sorted = pathkeys_count_contained_in(needed_pathkeys,
+													input_path->pathkeys,
+													&presorted_keys);
+			}
 
 			if (is_sorted)
 				sorted_path = input_path;
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9b38627efd..532174aeb5 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -253,6 +253,10 @@ extern List *truncate_useless_pathkeys(PlannerInfo *root,
 									   RelOptInfo *rel,
 									   List *pathkeys);
 extern bool has_useful_pathkeys(PlannerInfo *root, RelOptInfo *rel);
+bool pathkeys_count_contained_in_unordered(List *keys1,
+										   List *keys2,
+									       List **reorderedkeys,
+										   int *n_common);
 extern List *append_pathkeys(List *target, List *source);
 extern PathKey *make_canonical_pathkey(PlannerInfo *root,
 									   EquivalenceClass *eclass, Oid opfamily,

David Rowley

dgrowleyml@gmail.com

almost 3 years ago

In reply to: Ankit Kumar Pandey (#3)

Re: [PATCH] Teach planner to further optimize sort in distinct

On Fri, 20 Jan 2023 at 08:26, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:

On 19/01/23 18:49, David Rowley wrote:
You can just switch to using that function in
create_final_distinct_paths(). You'll need to consider if the query is
a DISTINCT ON query and not try the unordered version of the function
in that case.

Tried this, it worked as expected. Tests are green as well.

Looking at the patch, you've not added any additional tests. If the
existing tests are all passing then that just tells me that either the
code is not functioning as intended or we have no tests that look at
the EXPLAIN output which can make use of this optimization. If the
former is true, then the patch needs to be fixed. If it's the latter
then you need to write new tests.

It's possible that truncate_useless_pathkeys() needs to be modified to
check if the pathkeys might be useful for DISTINCT

We have pathkeys_useful_for_merging and pathkeys_useful_for_ordering.

Can we not have pathkeys_useful_for_distinct?

I don't know all the repercussions. If you look at add_path() then
you'll see we do a pathkey comparison when the costs are not fuzzily
different from an existing path so that we try to keep a path with the
best pathkeys. If we start keeping paths around with other weird
pathkeys that are not along the lines of the query_pathkeys requires,
then add_path might start throwing away paths that are actually good
for something. It seems probable that could cause some regressions.

I have attached patch with pathkeys_count_contained_in_unordered
and corresponding changes in create_final_distinct_paths for reference.

Does this patch actually work? I tried:

create table ab (a int, b int);
insert into ab select a,b from generate_Series(1,1000)
a,generate_series(1,1000) b;
analyze ab;
create index on ab(a);
set enable_hashagg=0;
explain select distinct b,a from ab where a < 10;
QUERY PLAN
------------------------------------------------------------------------------------
Unique (cost=729.70..789.67 rows=7714 width=8)
-> Sort (cost=729.70..749.69 rows=7996 width=8)
Sort Key: b, a
-> Index Scan using ab_a_idx on ab (cost=0.42..211.36
rows=7996 width=8)
Index Cond: (a < 10)
(5 rows)

I'd have expected an incremental sort here. I don't see that you're
adjusting IndexPath's pathkeys anywhere. The nested loop in
pathkeys_count_contained_in_unordered() seems to be inside out. The
reordered_pathkeys must have the common pathkeys in the order they
appear in keys2. In your patch, they'll be ordered according to the
keys1 list, which is wrong. Testing would tell you this, so all the
more reason to make the patch work and write some queries to ensure it
does actually work, then some tests to ensure that remains in a
working state.

Feel free to take the proper time to write a working patch which
contains new tests to ensure it's functioning as intended. It's
disheartening to review patches that don't seem to work. If it wasn't
meant to work, then you didn't make that clear. I'll likely not be
able to do any further reviews until the March commitfest, so it might
be better to only post if you're stuck. Please don't rush out the
next patch. Take your time and study the code and see if you can build
up your own picture for what the repercussions might be of IndexPaths
having additional pathkeys when they were previously empty. If you're
uncertain of aspects of the patch you've written feel free to leave
XXX type comments to indicate this. That way the reviewer will know
you might need more guidance on that and you'll not forget yourself
when you come back and look again after a few weeks.

David

Ankit Kumar Pandey

itsankitkp@gmail.com

almost 3 years ago

In reply to: David Rowley (#4)

Re: [PATCH] Teach planner to further optimize sort in distinct

On 20/01/23 06:07, David Rowley wrote:

Looking at the patch, you've not added any additional tests. If the
existing tests are all passing then that just tells me that either the
code is not functioning as intended or we have no tests that look at
the EXPLAIN output which can make use of this optimization. If the
former is true, then the patch needs to be fixed. If it's the latter
then you need to write new tests.

I definitely need to add tests because this scenario is missing.

I don't know all the repercussions. If you look at add_path() then
you'll see we do a pathkey comparison when the costs are not fuzzily
different from an existing path so that we try to keep a path with the
best pathkeys. If we start keeping paths around with other weird
pathkeys that are not along the lines of the query_pathkeys requires,
then add_path might start throwing away paths that are actually good
for something. It seems probable that could cause some regressions.

Okay, in that case I think it is better idea to store original pathkeys
(apart from what get assigned by useful_pathkeys). I need to dig deeper for this.

Does this patch actually work? I tried:
I don't see that you're
adjusting IndexPath's pathkeys anywhere.

I had removed the changes for indexPath (it was in v1) because I hadn't investigated
repercussions. But I failed to mention this.

The nested loop in
pathkeys_count_contained_in_unordered() seems to be inside out. The
reordered_pathkeys must have the common pathkeys in the order they
appear in keys2. In your patch, they'll be ordered according to the
keys1 list, which is wrong. Testing would tell you this, so all the
more reason to make the patch work and write some queries to ensure it
does actually work, then some tests to ensure that remains in a
working state.
Feel free to take the proper time to write a working patch which
contains new tests to ensure it's functioning as intended. It's
disheartening to review patches that don't seem to work. If it wasn't
meant to work, then you didn't make that clear.
Please don't rush out the next patch. Take your time and study the code
and see if you can build up your own picture for what the repercussions
might be of IndexPaths having additional pathkeys when they were previously empty.
If you're uncertain of aspects of the patch you've written feel free to leave
XXX type comments to indicate this. That way the reviewer will know
you might need more guidance on that and you'll not forget yourself
when you come back and look again after a few weeks.

I deeply regret this. I will be mindful of my patches and ensure that they are
complete by themselves.
Thanks for your pointers as well, I can see errors in my approach which I will address.

I'll likely not be
able to do any further reviews until the March commitfest, so it might
be better to only post if you're stuck.

Yes sure, I will work on patches and limit posts to discussion only (if blocked).

Thanks,
Ankit