branch-free tuplesort partitioning

Started by John Naylorover 1 year ago4 messageshackers
Jump to latest
#1John Naylor
john.naylor@enterprisedb.com

Attached is a very rough and limited proof of concept of $subject when
tuplesort uses abbreviated keys. It only works with int64:

Demo:

--setup
drop table if exists test;
create table test (a bigint);
insert into test select (1_000_000_000 * random())::bigint from
generate_series(1,1_000_000,1) i;
vacuum freeze test;
create extension if not exists pg_prewarm ;
select pg_prewarm('test');

set work_mem = '64MB';
\timing

select * from test order by a offset 1_000_000_000;
Time: 238.925 ms

set debug_branchless_sort = 'on'; select * from test order by a offset
1_000_000_000;
Time: 202.977 ms

With the patch, this case is probably held back by our simple
insertion sort, which I haven't changed.

The algorithm with pythonic pseudocode is from [1]https://orlp.net/blog/branchless-lomuto-partitioning/. There is also Rust
code under a permissive license by the same authors elsewhere.

To evaluate this technique further, it'll need some work to handle
duplicates well. Then we can run tests on various input distributions.
If that's promising, we'll need to proceed with an idea I had years
ago to separate null/not-null into separate arrays as a prerequisite.
Then, we can probably simplify tuplesort.c some, including removing
each separate template invocation for integer-like keys. I'm also not
using the sort template in the PoC, to simplify development, and it's
not clear this work can be folded back into there without adding a lot
of complexity.

[1]: https://orlp.net/blog/branchless-lomuto-partitioning/

--
John Naylor
Amazon Web Services

Attachments:

v1-0001-Branchless-Lomuto-partitioning.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Branchless-Lomuto-partitioning.patchDownload+184-4
In reply to: John Naylor (#1)
Re: branch-free tuplesort partitioning

On Mon, Nov 25, 2024 at 7:14 AM John Naylor <johncnaylorls@gmail.com> wrote:

To evaluate this technique further, it'll need some work to handle
duplicates well.

I suggest using a test program for this that Tom wrote nearly 20 years
ago to validate changes that were made to the Bentley & McIlroy qsort,
available from here:

/messages/by-id/18732.1142967137@sss.pgh.pa.us

It generates most of the standardized inputs described by the B&M
paper. For example, it will generate "Sawtooth" inputs. (Though I
don't see "organ pipe" input -- that one was a more adversarial case,
described by their paper, which might also be interesting.)

--
Peter Geoghegan

#3John Naylor
john.naylor@enterprisedb.com
In reply to: Peter Geoghegan (#2)
Re: branch-free tuplesort partitioning

On Mon, Nov 25, 2024 at 10:20 PM Peter Geoghegan <pg@bowt.ie> wrote:

I suggest using a test program for this that Tom wrote nearly 20 years
ago to validate changes that were made to the Bentley & McIlroy qsort,
available from here:

/messages/by-id/18732.1142967137@sss.pgh.pa.us

It generates most of the standardized inputs described by the B&M
paper. For example, it will generate "Sawtooth" inputs. (Though I
don't see "organ pipe" input -- that one was a more adversarial case,
described by their paper, which might also be interesting.)

Thanks for that reference, that will be useful!

--
John Naylor
Amazon Web Services

#4John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#1)
Re: branch-free tuplesort partitioning

I benchmarked this a couple weeks ago, and just now got around to
putting the results in a presentable form.

I took some of the more interesting cases from the B&M test Peter
shared upthread (best and worst performers in a quick test) and ran
them with a range of "m" values.

This is exploratory work, so the attached script doesn't quite
correspond to what's in the spreadsheet, but it gives some idea of the
details. I used 1 million records for everything, with prewarm and
turbo off.

-Random with modulus
This is the best case, beating our current sort on all but the lowest
cardinality inputs (which are only slightly slower).

-Sawtooth
This is the worst case, worse for all saw sizes. The case with 30 saws
is particularly bad. I tried adding a recursion stopping heap sort,
but that made things worse. I had hoisted the presorted check out of
the recursion steps because it used the full comparator (for
simplicity), and adding back a specialized check for every recursion
helped somewhat, but not enough to matter.

-Stagger [value = (i*multiplier + i) % num_records]
No clear winner here.

-Miscellaneous cases
Random and ascending are repeated in the spreadsheet for reference.
-Zipfian is from the numpy random number generator using 1.01 as the
parameter. Patch looks good here.
-"s95" is 95% ascending with 5% random tail. This regresses a bit.
-"p5" is a best case for our current sort: 95% zeros, 2.5% random
negative values, and 2.5% random positive values. The first pivot is
pretty much guaranteed to be zero, so in master all zeros are moved to
the middle partition upon first recursion. The patch takes two passes
to bucket all zeros together (because branchless comparisons with
single machine instructions must be 2-way and not 3-way), but in the
grand scheme of things it amounts to only a small regression.
-Organ regresses substantially.
-Descending regresses but not as much as I expected since descending
input is supposed to be a bad case for Lomuto partitioning schemes.

These are mixed results, so I'm not inclined to pursue this further
this cycle. There is one thing which could tip the scales in its favor
in the future: The planner doesn't yet know about the specialized
comparators using single machine instructions for the first key. The
patch has a fairly tight range for (almost) unique inputs. Whether
it's faster or slower than master, these inputs very often measure
about the same. This is not surprising because there are no branches
to mispredict during partitioning. Having predictable worst cases
would good for the planner.

One thing I didn't test is multiple keys (or resolving ties of an
abreviated first key). When the patch buckets duplicates together, it
must (if necessary) recurse to that partition to resolve the
tiebreaks. That was simple to code, but untested for either
correctness or performance. It's a bit like the multi-key sort
proposal in this regard, but with only two distinct steps:

/messages/by-id/PH7P220MB1533DA211DF219996760CBB7D9EB2@PH7P220MB1533.NAMP220.PROD.OUTLOOK.COM

--
John Naylor
Amazon Web Services

Attachments:

branchless-lomuto-20250210.odsapplication/vnd.oasis.opendocument.spreadsheet; name=branchless-lomuto-20250210.odsDownload
BL-perf-test-examples.shapplication/x-shellscript; name=BL-perf-test-examples.shDownload
v2-0001-Branchless-Lomuto-partitioning.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Branchless-Lomuto-partitioning.patchDownload+274-4