Sort optimizations: Making in-memory sort cache-aware

Started by Ankit Kumar Pandey · almost 3 years ago · 4 messages
#1 Ankit Kumar Pandey
itsankitkp@gmail.com

Hi all,

While working on sort optimizations for window functions, it was seen that the
performance of sorts where all tuples are in memory was bad when the number of
tuples was very large [1].

E.g.: work_mem = 4 GB, sort on 4 int columns on a table having 10 million
tuples.

The issues we saw were as follows:

1. The comparetup function re-compares the first key in the case of a
tie-break.

2. Frequent cache misses.

Issue #1 is being addressed in a separate patch. I am currently looking at #2.

A possible solution is to batch tuples into groups (which can fit into the L3
cache) before pushing them to the sort function.

After looking at different papers on this (multi-Quicksort, memory-tuned
quicksort, Samplesort and various distributed sorts), which look promising
(especially samplesort), I would like to get more input, as the changes look
a bit too steep and may or may not be in scope for solving the actual problem
at hand.

Please let me know your opinions: do we really need to rethink quicksort for
this use case, or can we optimize without a major change in the core sorting
algorithm? Are we open to trying new algorithms for sort?

Any suggestions to narrow down the search space for this problem are welcome.

[1]: /messages/by-id/CAApHDvqh+qOHk4sbvvy=Qr2NjPqAAVYf82oXY0g=Z2hRpC2Vmg@mail.gmail.com

Thanks,

Ankit

#2 Andres Freund
andres@anarazel.de
In reply to: Ankit Kumar Pandey (#1)
Re: Sort optimizations: Making in-memory sort cache-aware

Hi,

On 2023-02-11 17:49:02 +0530, Ankit Kumar Pandey wrote:

2. Frequent cache misses.

Issue #1 is being addressed in a separate patch. I am currently looking at #2.

A possible solution is to batch tuples into groups (which can fit into the L3
cache) before pushing them to the sort function.

After looking at different papers on this (multi-Quicksort, memory-tuned
quicksort, Samplesort and various distributed sorts), which look promising
(especially samplesort), I would like to get more input, as the changes look
a bit too steep and may or may not be in scope for solving the actual problem
at hand.

Please let me know your opinions: do we really need to rethink quicksort for
this use case, or can we optimize without a major change in the core sorting
algorithm? Are we open to trying new algorithms for sort?

I think it'll require some experimentation to know what we can and should
do. Clearly we're not going to do anything fundamental if the gains are a few
percent. But it's not hard to imagine that the gains will be substantially
larger.

I believe that a significant part of the reason we have low cache hit ratios
once the input gets larger, is that we kind of break the fundamental benefit
of qsort:

The reason quicksort is a good sorting algorithm, despite plenty of downsides, is
that it has pretty decent locality, due to its divide and conquer
approach. However, tuplesort.c completely breaks that for > 1 column
sorts. While spatial locality for accesses to the ->memtuples array is decent
during sorting, due to qsort's subdividing of the problem, the locality for
access to the tuples is *awful*.

The memtuples array is reordered while sorting, but the tuples themselves
aren't. Unless the input data is vaguely presorted, the access pattern for the
tuples has practically zero locality.

The only reason that doesn't completely kill us is that SortTuple contains
datum1 inline and that abbreviated keys reduce the cost of by-reference datums
in the first column.
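
For reference, each element of that array looks roughly like this (condensed
from src/include/utils/tuplesort.h):

typedef struct
{
	void	   *tuple;			/* the tuple proper; a tie-break has to chase
								 * this pointer, which after reordering can
								 * point anywhere in the tuple storage */
	Datum		datum1;			/* value (or abbreviation) of the first key
								 * column, compared without leaving the array */
	bool		isnull1;		/* is the first key column NULL? */
	int			srctape;		/* source tape number */
} SortTuple;

So single-key and abbreviated-key comparisons stay within the memtuples array;
it's the tie-breaks (and the final pass emitting tuples) that pay for the lost
locality.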

There are things we could do to improve upon this that don't require swapping
out our sorting implementation wholesale.

One idea is to keep track of the distinctness of the first column sorted and
to behave differently if it's significantly lower than the number of to be
sorted tuples. E.g. by doing a first sort solely on the first column, then
reorder the MinimalTuples in memory, and then continue normally.

There's two main problems with that idea:
1) It's hard to re-order the tuples in memory, without needing substantial
amounts of additional memory
2) If the second column also is not very distinct, it doesn't buy you much, if
anything.

But it might provide sufficient benefits regardless. And a naive version,
requiring additional memory, should be quick to hack up.
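
To make it concrete, the naive version could look something like this (a
sketch only; the first-key-only sort helper and the fresh context are invented
names, and the copy is where the additional memory goes):

static void
presort_and_reorder(Tuplesortstate *state, MemoryContext newcontext)
{
	MemoryContext oldcontext;

	/* pass 1: order memtuples by the first key only, ignoring tie-breaks */
	sort_memtuples_on_first_key(state);	/* hypothetical helper */

	/*
	 * pass 2: copy every MinimalTuple, in the new memtuples order, into a
	 * fresh context, so that storage order matches sort order
	 */
	oldcontext = MemoryContextSwitchTo(newcontext);
	for (int i = 0; i < state->memtupcount; i++)
		state->memtuples[i].tuple = (void *)
			heap_copy_minimal_tuple((MinimalTuple) state->memtuples[i].tuple);
	MemoryContextSwitchTo(oldcontext);

	/*
	 * The old context still holds the original copies until it is reset,
	 * which is the temporary memory overhead.  After this, the normal
	 * multi-key qsort runs over tuples that are laid out sequentially.
	 */
}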

I have *not* looked at a whole lot of papers of cache optimized sorts, and the
little I did was not recently. Partially because I am not sure that they are
that applicable to our scenarios: Most sorting papers don't discuss
variable-width data, nor a substantial amount of cache-polluting work while
gathering the data that needs to be sorted.

I think:

A possible solution is to batch tuples into groups (which can fit into the L3
cache) before pushing them to the sort function.

is the most general solution to the issue outlined above. I wouldn't try to
implement this via a full new sorting algorithm though.

My suggestion would be to collect a roughly ~L3 sized amount of tuples, sort
just those using the existing code, allocate new memory for all the
corresponding MinimalTuples in one allocation, and copy the MinimalTuples into
that, obviously in ->memtuples order.

Even if we just use the existing code for the overall sort after that, I'd
expect that to yield noticeable benefits.
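
In very rough pseudo-C (block size and function shape invented; error
handling, by-reference datum1 values, and releasing the old tuple copies are
all ignored):

static void
sort_in_cache_sized_blocks(Tuplesortstate *state)
{
#define BLOCK_NTUPLES 32768		/* pick so one block's tuples fit in L3 */
	for (int base = 0; base < state->memtupcount; base += BLOCK_NTUPLES)
	{
		int		n = Min(BLOCK_NTUPLES, state->memtupcount - base);
		Size	total = 0;
		char   *buf;

		/* sort just this block with the existing code */
		qsort_tuple(&state->memtuples[base], n, state);

		/* one allocation for all of the block's MinimalTuples */
		for (int i = 0; i < n; i++)
			total += MAXALIGN(((MinimalTuple) state->memtuples[base + i].tuple)->t_len);
		buf = palloc(total);

		/* copy them in ->memtuples order, so the block becomes sequential */
		for (int i = 0; i < n; i++)
		{
			MinimalTuple src = (MinimalTuple) state->memtuples[base + i].tuple;

			memcpy(buf, src, src->t_len);
			state->memtuples[base + i].tuple = buf;
			buf += MAXALIGN(src->t_len);
		}
	}
}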

It's very likely we can do better than just doing a plain sort of everything
after that.

You effectively end up with a bounded number of pre-sorted blocks, so the most
obvious thing to try is to build a heap of those blocks and effectively do a
heapsort between the presorted blocks.
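
The merge itself is just a k-way merge driven by a small binary heap of
per-block cursors, so each comparison only touches the current heads of the
blocks. A toy standalone version, with ints standing in for SortTuples:

#include <stdio.h>

typedef struct { const int *v; int pos, len; } Cursor;

static void
sift_down(Cursor *h, int n, int i)
{
	for (;;)
	{
		int		l = 2 * i + 1, r = l + 1, s = i;

		if (l < n && h[l].v[h[l].pos] < h[s].v[h[s].pos]) s = l;
		if (r < n && h[r].v[h[r].pos] < h[s].v[h[s].pos]) s = r;
		if (s == i) break;
		{ Cursor tmp = h[i]; h[i] = h[s]; h[s] = tmp; }
		i = s;
	}
}

int
main(void)
{
	int		b0[] = {1, 4, 9}, b1[] = {2, 3, 10}, b2[] = {5, 6, 7};
	Cursor	heap[] = {{b0, 0, 3}, {b1, 0, 3}, {b2, 0, 3}};
	int		n = 3;

	for (int i = n / 2 - 1; i >= 0; i--)	/* heapify the block heads */
		sift_down(heap, n, i);

	while (n > 0)
	{
		printf("%d ", heap[0].v[heap[0].pos++]);	/* emit smallest head */
		if (heap[0].pos == heap[0].len)				/* block exhausted */
			heap[0] = heap[--n];
		sift_down(heap, n, 0);
	}
	putchar('\n');
	return 0;
}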

A related, but separate, improvement is to reduce / remove the memory
allocation overhead. The switch to GenerationContext helped some, but still
leaves a bunch of overhead. And it's not used for bounded sorts right now.

We don't palloc/pfree individual tuples during normal sorts, but we do have
some, for bounded sorts. I think with a reasonable amount of work we could
avoid that for all tuples in ->tuplecontext. And switch to a trivial bump
allocator, getting rid of all allocator overhead.
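
For illustration, a bump allocator really is this trivial (generic sketch, not
an existing memory context type): no per-chunk header, no individual pfree(),
just a pointer that moves forward and a wholesale reset.

#include <stddef.h>

typedef struct
{
	char   *base;
	size_t	used;
	size_t	size;
} BumpArena;

static void *
bump_alloc(BumpArena *a, size_t len)
{
	void   *p;

	len = (len + 7) & ~(size_t) 7;	/* keep 8-byte alignment */
	if (a->used + len > a->size)
		return NULL;				/* caller grows the arena or spills */
	p = a->base + a->used;
	a->used += len;
	return p;
}

static void
bump_reset(BumpArena *a)
{
	a->used = 0;					/* releases everything at once */
}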

The biggest source of individual pfree()s in the bounded case is that we
unconditionally copy the tuple into base->tuplecontext during puttuple. Even
though quite likely we'll immediately free it in the "case TSS_BOUNDED" block.

We could instead pre-check that the tuple won't immediately be discarded,
before copying it into tuplecontext. Only in the TSS_BOUNDED case, of
course.
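
Something along these lines, at the spot where we currently copy the incoming
tuple into tuplecontext (the helper is invented, and "<= 0" is meant to match
the reversed comparison direction the bounded heap already maintains):

	if (state->status == TSS_BOUNDED &&
		compare_input_with_heap_top(state, slot) <= 0)	/* assumed helper */
	{
		/* puttuple_common() would discard it immediately: skip the copy */
		return;
	}

	/* otherwise copy into base->tuplecontext and proceed as today */
	tuple = ExecCopySlotMinimalTuple(slot);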

I think we also can replace the individual freeing of tuples in
tuplesort_heap_replace_top(), by allowing a number of dead tuples to
accumulate (up to work_mem/2 maybe), and then copying the still living tuples
into new memory context, freeing the old one.
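
Roughly like this, in tuplesort_heap_replace_top() (the dead-byte accounting
field and the helper creating the replacement context are invented; only the
MemoryContext and heap_copy_minimal_tuple() calls are real APIs):

	state->deadBytes += GetMemoryChunkSpace(state->memtuples[0].tuple);	/* invented field */
	if (state->deadBytes > state->allowedMem / 2)
	{
		MemoryContext newcontext = make_replacement_tuple_context(state);	/* assumed helper */
		MemoryContext oldcontext = MemoryContextSwitchTo(newcontext);

		/* copy each still-live tuple; locality improves as a side effect */
		for (int i = 0; i < state->memtupcount; i++)
			state->memtuples[i].tuple = (void *)
				heap_copy_minimal_tuple((MinimalTuple) state->memtuples[i].tuple);

		MemoryContextSwitchTo(oldcontext);
		MemoryContextDelete(state->base.tuplecontext);	/* drops all dead tuples at once */
		state->base.tuplecontext = newcontext;
		state->deadBytes = 0;
	}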

While that doesn't sound cheap, in bounded sorts the number of tuples commonly
is quite limited, the pre-check before copying the tuple will prevent this
from occurring too often, the copy will result in higher locality, and, most
importantly, the reduced palloc() overhead (~25% or so with aset.c) will
result in a considerably higher cache hit ratio / lower memory usage.

Greetings,

Andres Freund

#3 Ankit Kumar Pandey
itsankitkp@gmail.com
In reply to: Andres Freund (#2)
Re: Sort optimizations: Making in-memory sort cache-aware

On 12/02/23 01:59, Andres Freund wrote:

However, tuplesort.c completely breaks that for > 1 column
sorts. While spatial locality for accesses to the ->memtuples array is decent
during sorting, due to qsort's subdividing of the problem, the locality for
access to the tuples is *awful*.

The memtuples array is reordered while sorting, but the tuples themselves
aren't. Unless the input data is vaguely presorted, the access pattern for the
tuples has practically zero locality.

This probably explains the issue.

One idea is to keep track of the distinctness of the first column sorted and
to behave differently if it's significantly lower than the number of to be
sorted tuples. E.g. by doing a first sort solely on the first column, then
reorder the MinimalTuples in memory, and then continue normally.

There's two main problems with that idea:
1) It's hard to re-order the tuples in memory, without needing substantial
amounts of additional memory
2) If the second column also is not very distinct, it doesn't buy you much, if
anything.

But it might provide sufficient benefits regardless. And a naive version,
requiring additional memory, should be quick to hack up.

I get the second point (to reorder MinimalTuples by sorting on the first column),
but I am not sure how we can keep track of the distinctness of the first column?

Most sorting papers don't discuss
variable-width data, nor a substantial amount of cache-polluting work while
gathering the data that needs to be sorted.

I definitely agree with this.

My suggestion would be to collect a roughly ~L3 sized amount of tuples, sort
just those using the existing code, allocate new memory for all the
corresponding MinimalTuples in one allocation, and copy the MinimalTuples into
that, obviously in ->memtuples order.

This should be an easy hack, and we can easily profile the benefits from it.

It's very likely we can do better than just doing a plain sort of everything
after that.
You effectively end up with a bounded number of pre-sorted blocks, so the most
obvious thing to try is to build a heap of those blocks and effectively do a
heapsort between the presorted blocks.

This is very interesting. It is actually what a few papers had suggested: sort
in blocks and then merge (while sorting) the presorted blocks.
I am a bit fuzzy on the implementation of this (if it is the same as what I am
thinking), but this is definitely what I was looking for.

A related, but separate, improvement is to reduce / remove the memory
allocation overhead. The switch to GenerationContext helped some, but still
leaves a bunch of overhead. And it's not used for bounded sorts right now.
We don't palloc/pfree individual tuples during normal sorts, but we do have
some, for bounded sorts. I think with a reasonable amount of work we could
avoid that for all tuples in ->tuplecontext. And switch to a trivial bump
allocator, getting rid of all allocator overhead.

The biggest source of individual pfree()s in the bounded case is that we
unconditionally copy the tuple into base->tuplecontext during puttuple. Even
though quite likely we'll immediately free it in the "case TSS_BOUNDED" block.

We could instead pre-check that the tuple won't immediately be discarded,
before copying it into tuplecontext. Only in the TSS_BOUNDED case, of
course.

This looks doable; I will try this.

I think we also can replace the individual freeing of tuples in
tuplesort_heap_replace_top(), by allowing a number of dead tuples to
accumulate (up to work_mem/2 maybe), and then copying the still living tuples
into new memory context, freeing the old one.

While that doesn't sound cheap, in bounded sorts the number of tuples commonly
is quite limited, the pre-check before copying the tuple will prevent this
from occurring too often, the copy will result in higher locality, and, most
importantly, the reduced palloc() overhead (~25% or so with aset.c) will
result in a considerably higher cache hit ratio / lower memory usage.

I would try this as well.

There are things we could do to improve upon this that don't require swapping
out our sorting implementation wholesale.

Thanks a lot, Andres, these are lots of pointers to work on (without a major
overhaul of the sort implementation and with a potentially good amount of
improvement). I will give these a try and see if I can get some performance gains.

Thanks,
Ankit

#4 Ankit Kumar Pandey
itsankitkp@gmail.com
In reply to: Ankit Kumar Pandey (#3)
7 attachment(s)
Re: Sort optimizations: Making in-memory sort cache-aware

Hi Andres,

I took a stab at a naive version of this but have been stuck for some time now.

I have added logic to sort on the first column in a first pass, then reallocate
all tuples and do the full sort in a second pass, but I am not seeing any
benefit (it is actually regressing) at all.

I tried the above both in bulk and in chunks of data.

You effectively end up with a bounded number of pre-sorted blocks, so the most
obvious thing to try is to build a heap of those blocks and effectively do a
heapsort between the presorted blocks.

I am not very clear about the implementation of this. How can we do a heapsort
between the presorted blocks? Do we keep changing state->bound = i, i+n, i+2n,
something like this, and keep calling make_bounded_heap/sort_bounded_heap?

A related, but separate, improvement is to reduce / remove the memory
allocation overhead.

This is still pending from my side.

I have attached some benchmarking results, along with the script and POC
patches (which include a GUC switch to enable the optimization, for ease of
testing).

Tested with work_mem = 3 GB on 1 million and 10 million rows of data.

Please let me know what I can fix and re-attempt.

Thanks,

Ankit

Attachments:

master(1mil) and patched(1mil).png
master(10mil) and patched(10mil).png
master(1mil) and patched(1mil) for BATCHES.png
master(10mil) and patched(10mil) for BATCHES.png
bench_sort_optimization.sh.txt

inmem-sort-opt-batch.patch:
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..6585164ca4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -846,6 +846,16 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_sort_optimization", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the sort optimizations."),
+			NULL,
+			GUC_EXPLAIN
+		},
+		&enable_sort_optimization,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_incremental_sort", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of incremental sort steps."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d06074b86f..1b49f48457 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -389,6 +389,7 @@
 #enable_presorted_aggregate = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_sort_optimization = on
 #enable_tidscan = on
 
 # - Planner Cost Constants -
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 9ca9835aab..cbaf54437c 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -127,6 +127,8 @@
 bool		trace_sort = false;
 #endif
 
+bool		enable_sort_optimization = true;
+
 #ifdef DEBUG_BOUNDED_SORT
 bool		optimize_bounded_sort = true;
 #endif
@@ -404,6 +406,7 @@ struct Sharedsort
 #define SERIAL(state)		((state)->shared == NULL)
 #define WORKER(state)		((state)->shared && (state)->worker != -1)
 #define LEADER(state)		((state)->shared && (state)->worker == -1)
+#define CACHESIZE 2*1024*1024L
 
 /*
  * NOTES about on-tape representation of tuples:
@@ -2725,6 +2728,8 @@ tuplesort_sort_memtuples(Tuplesortstate *state)
 		{
 			if (state->base.sortKeys[0].comparator == ssup_datum_unsigned_cmp)
 			{
+				SortSupport oldonlyKey = state->base.onlyKey;
+				state->base.onlyKey = state->base.sortKeys;
 				qsort_tuple_unsigned(state->memtuples,
 									 state->memtupcount,
 									 state);
@@ -2740,11 +2745,63 @@ tuplesort_sort_memtuples(Tuplesortstate *state)
 			}
 #endif
 			else if (state->base.sortKeys[0].comparator == ssup_datum_int32_cmp)
-			{
-				qsort_tuple_int32(state->memtuples,
+			{	
+				if (enable_sort_optimization)
+				{
+					/* backup onlyKey and set it as NON NULL so that tie breaker won't be called
+					 * aim is to sort on first col only now 
+					 */
+					SortSupport oldonlyKey = state->base.onlyKey;
+					state->base.onlyKey = state->base.sortKeys;
+					qsort_tuple_int32(state->memtuples,
+									state->memtupcount,
+									state);
+					int BATCHSIZE = ceil(CACHESIZE/sizeof(state->memtuples[0].tuple));
+					if (state->memtupcount<BATCHSIZE)
+					{
+						BATCHSIZE = state->memtupcount;
+					}
+					int i=0;
+					int j=0;
+					while(i<state->memtupcount)
+					{
+
+						if ( (i + BATCHSIZE) > state->memtupcount)
+						{
+							BATCHSIZE = state->memtupcount - i;
+						}
+						SortTuple  *memtuples = &state->memtuples[i];
+						qsort_tuple_int32(memtuples,
+									BATCHSIZE,
+									state);
+						SortTuple  *tempmemtuples = (SortTuple *) palloc(BATCHSIZE * sizeof(SortTuple));
+						j=i;
+						for (; j < i+BATCHSIZE; j++)
+						{
+							void* temp = state->memtuples[j].tuple;
+							state->memtuples[j].tuple = heap_copy_minimal_tuple((MinimalTuple) state->memtuples[j].tuple);
+							pfree(temp);
+						}
+
+
+							i += BATCHSIZE;
+					}
+					/* revert onlyKey for pass 2 */
+					state->base.onlyKey = oldonlyKey;
+
+					qsort_tuple_int32(state->memtuples,
+									state->memtupcount,
+									state);
+					return;
+				}
+				else
+				{
+					qsort_tuple_int32(state->memtuples,
 								  state->memtupcount,
 								  state);
 				return;
+
+				}
 			}
 		}
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ba89d013e6..2ae1c90420 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -281,6 +281,7 @@ extern PGDLLIMPORT int tcp_keepalives_idle;
 extern PGDLLIMPORT int tcp_keepalives_interval;
 extern PGDLLIMPORT int tcp_keepalives_count;
 extern PGDLLIMPORT int tcp_user_timeout;
+extern PGDLLIMPORT bool enable_sort_optimization;
 
 #ifdef TRACE_SORT
 extern PGDLLIMPORT bool trace_sort;
inmem-sort-opt.patch:
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..6585164ca4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -846,6 +846,16 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_sort_optimization", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the sort optimizations."),
+			NULL,
+			GUC_EXPLAIN
+		},
+		&enable_sort_optimization,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_incremental_sort", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of incremental sort steps."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d06074b86f..1b49f48457 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -389,6 +389,7 @@
 #enable_presorted_aggregate = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_sort_optimization = on
 #enable_tidscan = on
 
 # - Planner Cost Constants -
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 9ca9835aab..33b8f804dd 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -127,6 +127,8 @@
 bool		trace_sort = false;
 #endif
 
+bool		enable_sort_optimization = true;
+
 #ifdef DEBUG_BOUNDED_SORT
 bool		optimize_bounded_sort = true;
 #endif
@@ -404,6 +406,7 @@ struct Sharedsort
 #define SERIAL(state)		((state)->shared == NULL)
 #define WORKER(state)		((state)->shared && (state)->worker != -1)
 #define LEADER(state)		((state)->shared && (state)->worker == -1)
+#define CACHESIZE 2*1024*1024L
 
 /*
  * NOTES about on-tape representation of tuples:
@@ -2725,6 +2728,8 @@ tuplesort_sort_memtuples(Tuplesortstate *state)
 		{
 			if (state->base.sortKeys[0].comparator == ssup_datum_unsigned_cmp)
 			{
+				SortSupport oldonlyKey = state->base.onlyKey;
+				state->base.onlyKey = state->base.sortKeys;
 				qsort_tuple_unsigned(state->memtuples,
 									 state->memtupcount,
 									 state);
@@ -2740,11 +2745,51 @@ tuplesort_sort_memtuples(Tuplesortstate *state)
 			}
 #endif
 			else if (state->base.sortKeys[0].comparator == ssup_datum_int32_cmp)
-			{
-				qsort_tuple_int32(state->memtuples,
+			{	
+				if (enable_sort_optimization)
+				{
+					/* backup onlyKey and set it as NON NULL so that tie breaker won't be called
+					 * aim is to sort on first col only at first pass
+					 */
+					SortSupport oldonlyKey = state->base.onlyKey;
+					state->base.onlyKey = state->base.sortKeys;
+					qsort_tuple_int32(state->memtuples,
+									state->memtupcount,
+									state);
+
+					/* revert onlyKey for pass 2 */
+					state->base.onlyKey = oldonlyKey;
+					/* realloc tuples */
+					SortTuple  *tempmemtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
+					int i;
+					for (i = 0; i < state->memtupcount; i++)
+					{	
+						tempmemtuples[i].datum1 = state->memtuples[i].datum1;
+						/*
+						 * could have just copied realloc temp tuple to memtuples.tuple
+						 * instead of allocating full memtuples but then have to call
+						 * pfree for each tuple, in current approach whole block can be freed in one go
+						 */					
+						tempmemtuples[i].tuple = heap_copy_minimal_tuple((MinimalTuple) state->memtuples[i].tuple);
+						tempmemtuples[i].isnull1 = state->memtuples[i].isnull1;
+						tempmemtuples[i].srctape = state->memtuples[i].srctape;
+					}
+					pfree(state->memtuples);
+					state->memtuples = tempmemtuples;
+
+					qsort_tuple_int32(state->memtuples,
+									state->memtupcount,
+									state);
+					return;
+				}
+				else
+				{
+					qsort_tuple_int32(state->memtuples,
 								  state->memtupcount,
 								  state);
 				return;
+
+				}
 			}
 		}
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ba89d013e6..2ae1c90420 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -281,6 +281,7 @@ extern PGDLLIMPORT int tcp_keepalives_idle;
 extern PGDLLIMPORT int tcp_keepalives_interval;
 extern PGDLLIMPORT int tcp_keepalives_count;
 extern PGDLLIMPORT int tcp_user_timeout;
+extern PGDLLIMPORT bool enable_sort_optimization;
 
 #ifdef TRACE_SORT
 extern PGDLLIMPORT bool trace_sort;