tuple radix sort

Started by John Naylor · 6 months ago · 46 messages · hackers
#1 John Naylor
john.naylor@enterprisedb.com

First, a quick demonstration of what this PoC can do on 1 million
random not-NULL bigints:

set wip_radix_sort = 'off'; select * from test order by a offset 1_000_000_000;
240ms

set wip_radix_sort = 'on'; select * from test order by a offset 1_000_000_000;
140ms

Background: Peter Geoghegan recently mentioned to me off-list an
interesting set of techniques for sorting in the context of databases.
I'm not yet sure how to approach certain aspects of that architecture,
so I won't go into the full picture at this point. However, there is
one piece that already fits well within our existing architecture, and
that is using radix sort on datum1. The basic sequence is:

1. Partition tuples on first key NULL and not-NULL, according to NULLS
FIRST or NULLS LAST.
2. Do normal qsort on the NULL partition using the tiebreak comparator.
3. Create a "conditioned" or "normalized" datum that encodes datum1
such that unsigned comparison is order-preserving, accounting for ASC
/ DESC as well. I've reused space now unused during in-memory not-NULL
sorts:

typedef struct
{
    void       *tuple;          /* the tuple itself */
    Datum       datum1;         /* value of first key column */

    union
    {
        struct
        {
            bool        isnull1;    /* is first key column NULL? */
            int         srctape;    /* source tape number */
        };
        Datum       cond_datum1;    /* sort key for radix sort */
    };
} SortTuple;

4. Radix sort on cond_datum1. For the PoC I've based it on the
implementation in "ska sort" [1] (C++, Boost license). For
medium-sized sorts it uses "American flag sort" (there is a paper [3]
co-authored by M. D. McIlroy, same as in the paper we reference for
quicksort). For larger sorts it's similar, but performs multiple
passes, which takes better advantage of modern CPUs. Upon recursion,
sorts on small partitions divert to quicksort. Any necessary tiebreaks
are handled by quicksort, either after the end of radix sort, or when
diverting to small quicksort.
5. Reset isnull1 to "false" before returning to the caller. This also
must be done when diverting to quicksort.
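As a rough illustration of step 3 (a sketch under assumptions, not the patch's actual code; the function name is made up), conditioning a signed 64-bit first key might look like:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch of step 3: encode a signed 64-bit datum1 as an
 * unsigned key on which plain unsigned comparison preserves sort order.
 * Flipping the sign bit maps INT64_MIN..INT64_MAX monotonically onto
 * 0..UINT64_MAX; inverting all bits afterwards reverses the order for
 * DESC.
 */
uint64_t
condition_int64(int64_t datum1, bool descending)
{
    uint64_t    key = (uint64_t) datum1 ^ UINT64_C(0x8000000000000000);

    return descending ? ~key : key;
}
```

Unsigned comparison of the conditioned keys then agrees with signed comparison of the original values, which is what lets a byte-wise radix sort produce the right order.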

Next steps: Try to find regressions (help welcome here). The v1 patch
has some optimizations, but in other ways things are simple and/or
wasteful. Exactly how things fit together will be informed by what, if
anything, has to be done to avoid regressions. I suspect the challenge
will be multikey sorts when the first key has low cardinality. This is
because tiebreaks are necessarily postponed rather than taken care of
up front. I'm optimistic, since low cardinality cases can be even
faster than our B&M qsort, so we have some headroom:

drop table if exists test;
create unlogged table test (a bigint);
insert into test select
(1_000_000_000 * random())::bigint % 8 -- mod
-- (1_000_000_000 * random())::bigint -- random, for the case at the top
from generate_series(1,1_000_000,1) i;
vacuum freeze test;

select pg_prewarm('test');
set work_mem = '64MB';

set wip_radix_sort = 'off'; select * from test order by a offset 1_000_000_000;
95ms

set wip_radix_sort = 'on'; select * from test order by a offset 1_000_000_000;
84ms
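For readers unfamiliar with the technique, here is a simplified single-byte sketch of the in-place "American flag" distribution from step 4 (an illustration under assumptions, not the patch's or ska_sort's actual code): count bucket sizes, turn the counts into bucket boundaries, then cycle each key into its home bucket.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch of one "American flag" distribution pass on the
 * most significant byte of already-conditioned uint64 keys.
 */
void
radix_pass_msb(uint64_t *keys, size_t n)
{
    size_t      count[256] = {0};
    size_t      start[256];
    size_t      end[256];
    size_t      total = 0;

    /* histogram of the most significant byte */
    for (size_t i = 0; i < n; i++)
        count[keys[i] >> 56]++;

    /* prefix sums give each bucket's [start, end) range */
    for (int b = 0; b < 256; b++)
    {
        start[b] = total;
        total += count[b];
        end[b] = total;
    }

    /* cycle keys into their home buckets in place */
    for (int b = 0; b < 256; b++)
    {
        while (start[b] < end[b])
        {
            uint64_t    x = keys[start[b]];
            int         d = (int) (x >> 56);

            while (d != b)
            {
                /* place x in bucket d's next free slot, take its occupant */
                uint64_t    tmp = keys[start[d]];

                keys[start[d]] = x;
                start[d]++;
                x = tmp;
                d = (int) (x >> 56);
            }
            keys[start[b]] = x;
            start[b]++;
        }
    }
}
```

A real implementation would recurse into each non-trivial bucket on the next byte and divert small buckets to quicksort, as described above.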

[1]: https://github.com/skarupke/ska_sort/tree/master
[2]: https://probablydance.com/2016/12/27/i-wrote-a-faster-sorting-algorithm/
[3]: http://static.usenix.org/publications/compsystems/1993/win_mcilroy.pdf

--
John Naylor
Amazon Web Services

Attachments:

v1-0001-Use-radix-sort-when-datum1-is-an-integer-type.patch (application/x-patch, +641 -21)
#2 Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#1)
Re: tuple radix sort

On Oct 29, 2025, at 14:28, John Naylor <johncnaylorls@gmail.com> wrote:

I suspect the challenge
will be multikey sorts when the first key has low cardinality.

As you predicted, when the first key has very low cardinality, radix is a little bit slower. I built a test that proves that:

```
evantest=# drop table if exists test_multi;
evantest=# create unlogged table test_multi (category int, name text);
— first column has only 4 distinct values
evantest=# insert into test_multi select (random() * 4)::int as category, md5(random()::text) || md5(random()::text) as name from generate_series(1, 1000000);
evantest=# vacuum freeze test_multi;
evantest=# select count(*) from test_multi;
evantest=# set work_mem = '64MB’;

evantest-# \timing on
Timing is on.
evantest=# set wip_radix_sort = 'off';
Time: 0.403 ms
evantest=# \o /dev/null
evantest=# select * from test_multi order by category, name;
Time: 5607.336 ms (00:05.607)
evantest=# select * from test_multi order by category, name;
Time: 5703.555 ms (00:05.704)
evantest=# select * from test_multi order by category, name;
Time: 5692.644 ms (00:05.693)

evantest=# set wip_radix_sort = 'on';
Time: 0.859 ms
evantest=# select * from test_multi order by category, name;
Time: 5822.979 ms (00:05.823)
evantest=# select * from test_multi order by category, name;
Time: 5881.256 ms (00:05.881)
evantest=# select * from test_multi order by category, name;
Time: 5976.351 ms (00:05.976)
```

Roughly 5% slower for this corner case.

However, when I recreate the test table with a high-cardinality first column, wip_radix_sort still seems slower:

```
evantest=# \o
evantest=# drop table if exists test_multi;
DROP TABLE
evantest=# create unlogged table test_multi (category int, name text);
CREATE TABLE
evantest=# insert into test_multi
evantest-# select (random() * 1000000)::int as category, md5(random()::text) || md5(random()::text) as name from generate_series(1, 1000000);
INSERT 0 1000000
evantest=# vacuum freeze test_multi;
VACUUM
evantest=# select count(*) from test_multi;
count
---------
1000000
(1 row)

evantest=# select * from test_multi limit 5;
category | name
----------+------------------------------------------------------------------
607050 | c555126a5afea9f5ffe3880248c89944d211bc378f8c3b6d125b4360fe8619b7
843579 | 69b5a1dba76f52ff238566a3f88315a7425116d5d271fb54701b6e49d4afd8ce
106298 | a96e8674db219e12463ecdbb405b99c631767972e489093045c97976c17c6561
621860 | 5e6739ea9f533f9cdb0b8db76e3d4ce63be6b2b612c8aff06c4b80451f8f2edc
290110 | 56944320e5abd3a854fffdd185b969727e8d414448d440725a392cda4c6355c4
(5 rows)

evantest=# \timing on
Timing is on.

evantest=# \o /dev/null
evantest=# set wip_radix_sort = 'off';
Time: 0.904 ms
evantest=# select * from test_multi limit 5;
Time: 0.983 ms
evantest=# select * from test_multi order by category, name;
Time: 593.578 ms
evantest=# select * from test_multi order by category, name;
Time: 597.329 ms
evantest=# select * from test_multi order by category, name;
Time: 600.050 ms

evantest=# set wip_radix_sort = 'on';
Time: 0.737 ms
evantest=# select * from test_multi order by category, name;
Time: 611.604 ms
evantest=# select * from test_multi order by category, name;
Time: 613.115 ms
evantest=# select * from test_multi order by category, name;
Time: 615.003 ms
```

This seems like a real regression.

Then I tried sorting on only the first column, and now radix is indeed faster:

```
evantest=# set wip_radix_sort = 'off’;
evantest=# select * from test_multi order by category;
Time: 445.498 ms
evantest=# select * from test_multi order by category;
Time: 451.834 ms
evantest=# select * from test_multi order by category;
Time: 454.531 ms

evantest=# set wip_radix_sort = 'on';
Time: 0.329 ms
evantest=# select * from test_multi order by category;
Time: 402.829 ms
evantest=# select * from test_multi order by category;
Time: 408.014 ms
evantest=# select * from test_multi order by category;
Time: 415.340 ms
evantest=# select * from test_multi order by category;
Time: 413.969 ms
```

Hope the test helps. (The test was run on a MacBook M4.)

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#3 John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#2)
Re: tuple radix sort

On Wed, Oct 29, 2025 at 3:25 PM Chao Li <li.evan.chao@gmail.com> wrote:

On Oct 29, 2025, at 14:28, John Naylor <johncnaylorls@gmail.com> wrote:

I suspect the challenge
will be multikey sorts when the first key has low cardinality.

As you predicted, when the first key has very low cardinality, radix is a little bit slower. I built a test that proves that:

```
evantest=# drop table if exists test_multi;
evantest=# create unlogged table test_multi (category int, name text);
— first column has only 4 distinct values

Thanks for testing. Note it's actually 5 because of rounding. Your
text also seems to have em-dashes and unicode apostrophes where it
should have dashes / single quotes. That's not great if you expect
others to try to reproduce. I'm also not thrilled about having to
remove your psql prompt.

drop table if exists test_multi;
create unlogged table test_multi (category int, name text);
insert into test_multi select (random() * 4)::int as category,
md5(random()::text) || md5(random()::text) as name from
generate_series(1, 1000000);
vacuum freeze test_multi;

Anyway, because this table is larger than my first example, the input
no longer fits into 64MB of work_mem and it switches to an external
merge sort. Normally I set work_mem to 1GB for testing sorts so I
don't have to think about it, but neglected to in my first email. I
don't know if that explains the disparity, but I've made that change
for my quick tests below.

evantest=# \o /dev/null
evantest=# select * from test_multi order by category, name;

[...]

Roughly 5% slower for this corner case.

Seems fine for me on this old Intel laptop (min of 5 runs):

set wip_radix_sort = 'off';
2368.369

set wip_radix_sort = 'on';
2290.654

It's close enough that I'll want to test more closely at a range of
low-cardinality inputs. I haven't done any rigorous scripted testing
yet, so take this with a grain of salt.

However, when I recreate the test table with a high-cardinality first column, wip_radix_sort still seems slower:

drop table if exists test_multi;
create unlogged table test_multi (category int, name text);
insert into test_multi select (random() * 1000000)::int as category,
md5(random()::text) || md5(random()::text) as name from
generate_series(1, 1000000);
vacuum freeze test_multi;

evantest=# set wip_radix_sort = 'off';
Time: 0.904 ms

evantest=# select * from test_multi order by category, name;
Time: 593.578 ms
evantest=# select * from test_multi order by category, name;
Time: 597.329 ms
evantest=# select * from test_multi order by category, name;
Time: 600.050 ms

evantest=# set wip_radix_sort = 'on';
Time: 0.737 ms
evantest=# select * from test_multi order by category, name;
Time: 611.604 ms
evantest=# select * from test_multi order by category, name;
Time: 613.115 ms
evantest=# select * from test_multi order by category, name;
Time: 615.003 ms
```

This seems like a real regression.

It's better for me here (min of 5 again), although the time scanning
the table probably dominates:

off:
536.257

on:
471.345

--
John Naylor
Amazon Web Services

#4 Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#3)
Re: tuple radix sort

On Oct 29, 2025, at 19:29, John Naylor <johncnaylorls@gmail.com> wrote:

On Wed, Oct 29, 2025 at 3:25 PM Chao Li <li.evan.chao@gmail.com> wrote:

On Oct 29, 2025, at 14:28, John Naylor <johncnaylorls@gmail.com> wrote:

I suspect the challenge
will be multikey sorts when the first key has low cardinality.

As you predicted, when the first key has very low cardinality, radix is a little bit slower. I built a test that proves that:

```
evantest=# drop table if exists test_multi;
evantest=# create unlogged table test_multi (category int, name text);
— first column has only 4 distinct values

Thanks for testing. Note it's actually 5 because of rounding.

Yes, 0-4, so 5 in total.

Your
text also seems to have em-dashes and unicode apostrophes where it
should have dashes / single quotes. That's not great if you expect
others to try to reproduce.

I just copied the content from psql (running in iTerm). I did a Google search and found that this was because of Mac Mail's "smart quotes" substitution. It looks like even if I manually type a pair of single quotes, it still does the substitution. I will try to see how to disable that, but I don't want to switch to another mail app.

I'm also not thrilled about having to
remove your psql prompt.

I just wanted to show my entire test process, so I simply copied everything from psql. In the future, I will remove psql prompts from the reproduction steps.

drop table if exists test_multi;
create unlogged table test_multi (category int, name text);
insert into test_multi select (random() * 4)::int as category,
md5(random()::text) || md5(random()::text) as name from
generate_series(1, 1000000);
vacuum freeze test_multi;

Anyway, because this table is larger than my first example, the input
no longer fits into 64MB of work_mem and it switches to an external
merge sort. Normally I set work_mem to 1GB for testing sorts so I
don't have to think about it, but neglected to in my first email.

I changed work_mem to 1GB and reran the test. As the high-cardinality data was still there, I reran that case first:

```
evantest=# set work_mem = '1GB';
Time: 0.301 ms
evantest=#
evantest=# select * from test_multi order by category, name;
Time: 575.247 ms
evantest=# select * from test_multi order by category, name;
Time: 554.351 ms
evantest=# select * from test_multi order by category, name;
Time: 565.100 ms
evantest=#
evantest=# set wip_radix_sort = 'on';
Time: 0.752 ms
evantest=# select * from test_multi order by category, name;
Time: 558.057 ms
evantest=# select * from test_multi order by category, name;
Time: 565.542 ms
evantest=# select * from test_multi order by category, name;
Time: 559.973 ms
```

With radix_sort on and off, execution times are almost the same.

Then I restored the low-cardinality data; off is still faster than on:
```
evantest=# set wip_radix_sort = ‘off';
Time: 0.549 ms
evantest=# select * from test_multi order by category, name;
Time: 5509.075 ms (00:05.509)
evantest=# select * from test_multi order by category, name;
Time: 5553.566 ms (00:05.554)
evantest=# select * from test_multi order by category, name;
Time: 5598.595 ms (00:05.599)
evantest=# set wip_radix_sort = ‘on';
Time: 0.786 ms
evantest=#
evantest=# select * from test_multi order by category, name;
Time: 5770.964 ms (00:05.771)
evantest=# select * from test_multi order by category, name;
Time: 5779.755 ms (00:05.780)
evantest=# select * from test_multi order by category, name;
Time: 5851.134 ms (00:05.851)
evantest=#
evantest=# set work_mem = '2GB’; # increasing work_mem to 2GB doesn’t help
Time: 0.404 ms
evantest=#
evantest=# select * from test_multi order by category, name;
Time: 5781.005 ms (00:05.781)
evantest=# select * from test_multi order by category, name;
Time: 5826.025 ms (00:05.826)
evantest=# select * from test_multi order by category, name;
Time: 5937.919 ms (00:05.938)
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#5 Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#1)
Re: tuple radix sort

On Oct 29, 2025, at 14:28, John Naylor <johncnaylorls@gmail.com> wrote:

<v1-0001-Use-radix-sort-when-datum1-is-an-integer-type.patch>

I just went quickly through the code change. I guess I need more time to understand the entire logic, but I found a typo that might affect the tests:

```
+ Assert(last = first);
```

"=" should be "=="

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#6 John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#5)
Re: tuple radix sort

On Thu, Oct 30, 2025 at 8:56 AM Chao Li <li.evan.chao@gmail.com> wrote:

I changed work_mem to 1GB and reran the test. As the high-cardinality data was still there, I reran that case first:

With radix_sort on and off, execution times are almost the same.

Are you by chance running with asserts on? It's happened before, so I
have to make sure. That makes a big difference here because I disabled
diversion thresholds in assert builds so that regression tests (few
cases with large inputs) cover the paths I want, in addition to my
running a standalone stress test.

Speaking of tests, I forgot to mention that regression tests will fail
since in-place radix sort is an unstable sort, as qsort is as well,
but in a different way -- this is expected. In assert builds, the
patch verifies the sort after the fact with the standard comparator,
and will throw an error if it's wrong.

On Thu, Oct 30, 2025 at 9:19 AM Chao Li <li.evan.chao@gmail.com> wrote:

I just went quickly through the code change. I guess I need more time to understand the entire logic, but I found a typo that might affect the tests:

```
+ Assert(last = first);
```

"=" should be "=="

Yes, you're quite right, thanks for spotting! I reran my stress test
that has randomly distributed NULLs and the assert still holds, so
nothing further to fix yet. The NULL partitioning part of the code
hasn't been well tested in its current form, and I may arrange things
so that that step and the datum conditioning step happen in a single
pass. I'm not yet sure if it's important enough to justify the
additional complexity.

--
John Naylor
Amazon Web Services

#7 Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#6)
Re: tuple radix sort

On Oct 30, 2025, at 11:40, John Naylor <johncnaylorls@gmail.com> wrote:

On Thu, Oct 30, 2025 at 8:56 AM Chao Li <li.evan.chao@gmail.com> wrote:

I changed work_mem to 1GB and reran the test. As the high-cardinality data was still there, I reran that case first:

With radix_sort on and off, execution times are almost the same.

Are you by chance running with asserts on? It's happened before, so I
have to make sure. That makes a big difference here because I disabled
diversion thresholds in assert builds so that regression tests (few
cases with large inputs) cover the paths I want, in addition to my
running a standalone stress test.

Yes, assert is always enabled in my sandbox. I can disable assert and rerun the test later.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#8 David Rowley
dgrowleyml@gmail.com
In reply to: Chao Li (#7)
Re: tuple radix sort

On Thu, 30 Oct 2025 at 16:46, Chao Li <li.evan.chao@gmail.com> wrote:

On Oct 30, 2025, at 11:40, John Naylor <johncnaylorls@gmail.com> wrote:
Are you by chance running with asserts on? It's happened before, so I
have to make sure. That makes a big difference here because I disabled
diversion thresholds in assert builds so that regression tests (few
cases with large inputs) cover the paths I want, in addition to my
running a standalone stress test.

Yes, assert is always enabled in my sandbox. I can disable assert and rerun the test later.

Never expect anything meaningful to come from running performance
tests on Assert builds. You should always be rebuilding without
Asserts before doing performance tests.

David

#9 Chao Li
li.evan.chao@gmail.com
In reply to: David Rowley (#8)
Re: tuple radix sort

On Oct 30, 2025, at 13:01, David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 30 Oct 2025 at 16:46, Chao Li <li.evan.chao@gmail.com> wrote:

On Oct 30, 2025, at 11:40, John Naylor <johncnaylorls@gmail.com> wrote:
Are you by chance running with asserts on? It's happened before, so I
have to make sure. That makes a big difference here because I disabled
diversion thresholds in assert builds so that regression tests (few
cases with large inputs) cover the paths I want, in addition to my
running a standalone stress test.

Yes, assert is always enabled in my sandbox. I can disable assert and rerun the test later.

Never expect anything meaningful to come from running performance
tests on Assert builds. You should always be rebuilding without
Asserts before doing performance tests.

Sure, good to learn. Actually I am very new to PG development, so any guidance is greatly appreciated.

I just did a distclean, then configured without any parameters. The overall execution time is now ~10% lower than with asserts. With the low-cardinality data, off and on are very close:

```
evantest=# set wip_radix_sort = 'off';
Time: 0.206 ms
evantest=# select * from test_multi order by category, name;
Time: 5070.277 ms (00:05.070)
evantest=# select * from test_multi order by category, name;
Time: 5158.748 ms (00:05.159)
evantest=# select * from test_multi order by category, name;
Time: 5072.708 ms (00:05.073)

evantest=# set wip_radix_sort = 'on';
Time: 0.177 ms
evantest=# select * from test_multi order by category, name;
Time: 4992.516 ms (00:04.993)
evantest=# select * from test_multi order by category, name;
Time: 5145.361 ms (00:05.145)
evantest=# select * from test_multi order by category, name;
Time: 5101.800 ms (00:05.102)

evantest=# \o
evantest=# show work_mem;
work_mem
----------
1GB
(1 row)

Time: 0.186 ms
evantest=# explain select * from test_multi order by category, name;
QUERY PLAN
---------------------------------------------------------------------------
Sort (cost=122003.84..124503.84 rows=1000000 width=69)
Sort Key: category, name
-> Seq Scan on test_multi (cost=0.00..22346.00 rows=1000000 width=69)
(3 rows)
```

And with the high-cardinality test data, on has a big win:
```
evantest=# set wip_radix_sort = 'off';
Time: 0.174 ms
evantest=# select * from test_multi order by category, name;
Time: 353.702 ms
evantest=# select * from test_multi order by category, name;
Time: 375.549 ms
evantest=# select * from test_multi order by category, name;
Time: 367.967 ms
evantest=# set wip_radix_sort = 'on';
Time: 0.147 ms
evantest=# select * from test_multi order by category, name;
Time: 279.537 ms
evantest=# select * from test_multi order by category, name;
Time: 278.114 ms
evantest=# select * from test_multi order by category, name;
Time: 284.273 ms
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#10 John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#1)
Re: tuple radix sort

I wrote:

The v1 patch
has some optimizations, but in other ways things are simple and/or
wasteful. Exactly how things fit together will be informed by what, if
anything, has to be done to avoid regressions.

In v1, radix sort diverts to qsort_tuple for small partitions (similar
to how quicksort diverts to insertion sort), but qsort_tuple is
inefficient because the comparator is called via a function pointer.

I also thought having two different radix sorts was too complex, so I
wondered if it'd be better to get rid of the smaller radix sort (whose
control flow I find harder to understand, even ignoring the unsightly
goto) and have the larger sort divert to a new quicksort
specialization that compares on the conditioned datum. That allows
skipping branches for NULL comparisons and order reversal. I've done
this in v2. It makes sense to replace the three current
integer-comparison quicksorts with one.

v1 was careful to restore isnull1 to false when diverting to quicksort
for the tiebreak. v2 doesn't bother, since the only tiebreak in core
that looks at isnull1 is comparetup_datum_tiebreak, which is not
reachable by radix sort, requiring a pass-by-value datum that compares
like an integer. This is a bit of a risk, since it's possible a third
party fork could be doing something weird. Seems unlikely, but
something to keep in mind.

I used a standalone program (attached) to microbenchmark this new
fallback qsort vs. a pass of radix sort on one byte to get a decent
threshold value. This is not quite fair, since the quicksort will then
be finished, but the radix sort could still need to recurse to the
next byte(s), so these numbers could underestimate the threshold. This
is just to get an idea.
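For concreteness, the kind of cycle reading involved might look like the following sketch (an assumption on my part, not necessarily what the attached program does; serious measurement would also serialize with RDTSCP/CPUID and take the minimum over many runs):

```c
#include <stdint.h>

/*
 * Read the x86-64 time-stamp counter (GCC/Clang inline assembly).
 * Ticks per element is then (rdtsc() after - rdtsc() before) / n
 * around the sort being measured.
 */
uint64_t
rdtsc(void)
{
    uint32_t    lo,
                hi;

    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t) hi << 32) | lo;
}
```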

The numbers are in RDTSC ticks per element sorted.

cardinality: 256
number of elements: 100 qsort: 35.4 radix: 49.2
number of elements: 200 qsort: 34.9 radix: 38.1
number of elements: 400 qsort: 42.4 radix: 34.4
number of elements: 800 qsort: 95.0 radix: 29.2
number of elements: 1600 qsort: 115.0 radix: 22.4
number of elements: 3200 qsort: 125.5 radix: 19.4
number of elements: 6400 qsort: 128.1 radix: 17.6

With the highest cardinality possible on a single byte, radix sort is
actually not bad at low inputs. Notice that the time per element is
consistently going down with larger inputs. Smaller inputs have large
constant overheads, made worse by my unrolling the counting step.

cardinality: 2
number of elements: 100 qsort: 09.2 radix: 28.0
number of elements: 200 qsort: 09.1 radix: 19.5
number of elements: 400 qsort: 10.4 radix: 15.7
number of elements: 800 qsort: 10.1 radix: 14.5
number of elements: 1600 qsort: 10.4 radix: 13.7
number of elements: 3200 qsort: 15.8 radix: 13.6
number of elements: 6400 qsort: 22.2 radix: 13.8

This is an extreme best case for B&M quicksort, which is basically
O(n) -- the point at which the per-element time goes up seems purely
due to exceeding L1 cache. Radix sort takes a big input to catch up,
but it doesn't seem awful, either.

cardinality: 16
number of elements: 100 qsort: 19.5 radix: 34.5
number of elements: 200 qsort: 18.7 radix: 22.6
number of elements: 400 qsort: 18.5 radix: 17.2
number of elements: 800 qsort: 25.0 radix: 14.8
number of elements: 1600 qsort: 43.8 radix: 13.8
number of elements: 3200 qsort: 51.2 radix: 13.2
number of elements: 6400 qsort: 59.0 radix: 12.8

This is still low cardinality, but behaves more like the high cardinality case.

I've set the threshold to 400 for now, but I'm not claiming that's the
end story. In addition to the underestimation mentioned above,
unrolling the counting step is a factor. Unrolling makes smaller
inputs worse (which we can reach by recursing from larger inputs), but
unrolling seems important for large inputs with low cardinality (a few
percent, but I haven't shared numbers yet). We've found that a large
input with only 4-5 distinct values just barely wins with radix sort.
I'll be curious to see if unrolling is actually needed to prevent
regressions there.

Other things to consider:

- I don't quite like how the NULL partitioning step looks, and it
could be costly when the distribution of NULL is not predictable, so
I'm thinking of turning part of that into a branch-free cyclic
permutation, similar to
/messages/by-id/CANWCAZbAmaZ7P+ARjS97sJLXsBB5CPZyzFgqNDiqe-L+BqXzug@mail.gmail.com

- The quicksort on the NULL partition still compares isnull1 -- the
branches are predictable but perhaps it's worth it to add a
specialization that skips that.

--
John Naylor
Amazon Web Services

Attachments:

v2-0001-Use-radix-sort-when-datum1-is-an-integer-type.patch (text/x-patch, +506 -21)
test-ska-byte-sort-threshold.c (text/x-csrc)
#11 John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#10)
Re: tuple radix sort

I wrote:

I've set the threshold to 400 for now, but I'm not claiming that's the
end story. In addition to the underestimation mentioned above,
unrolling the counting step is a factor. Unrolling makes smaller
inputs worse (which we can reach by recursing from larger inputs), but
unrolling seems important for large inputs with low cardinality (a few
percent, but I haven't shared numbers yet). We've found that a large
input with only 4-5 distinct values just barely wins with radix sort.
I'll be curious to see if unrolling is actually needed to prevent
regressions there.

Looking more closely at my development history, it turns out I added
loop unrolling before adding common prefix detection. Most real-world
non-negative integers have the upper bytes the same, especially since
the datum is 8 bytes regardless of underlying type. For those bytes,
the radix sort finds only one unique byte and continues on to the next
byte. By detecting the common prefix as we condition the datums, it
matters less how fast we can count since we simply skip some useless
work. (This is not as relevant when we have an abbreviated datum)

Repeating part of the microbenchmark from last time with no unrolling,
a threshold of 200 works for all but the lowest cardinality inputs:

cardinality: 256
number of elements: 100 qsort: 34.8 radix: 38.3
number of elements: 200 qsort: 34.9 radix: 29.7
number of elements: 400 qsort: 40.8 radix: 29.2

cardinality: 16
number of elements: 100 qsort: 19.3 radix: 26.2
number of elements: 200 qsort: 18.9 radix: 18.2
number of elements: 400 qsort: 18.5 radix: 14.5

cardinality: 2
number of elements: 100 qsort: 09.3 radix: 21.6
number of elements: 200 qsort: 08.9 radix: 15.4
number of elements: 400 qsort: 10.3 radix: 14.0

To test further, I dug up a test from [1] that stresses low
cardinality on multiple sort keys (attached in a form to allow turning
radix sort on and off via a command line argument), and found no
regression with or without loop unrolling:

V2:
off:
4 ^ 8: latency average = 101.070 ms
5 ^ 8: latency average = 704.862 ms
6 ^ 8: latency average = 3651.015 ms
7 ^ 8: latency average = 15141.412 ms

on:
4 ^ 8: latency average = 99.939 ms
5 ^ 8: latency average = 683.018 ms
6 ^ 8: latency average = 3545.626 ms
7 ^ 8: latency average = 14095.677 ms

V3:
off:
4 ^ 8: latency average = 99.486 ms
5 ^ 8: latency average = 693.434 ms
6 ^ 8: latency average = 3607.940 ms
7 ^ 8: latency average = 14602.325 ms

on:
4 ^ 8: latency average = 97.664 ms
5 ^ 8: latency average = 678.752 ms
6 ^ 8: latency average = 3361.765 ms
7 ^ 8: latency average = 14121.190 ms

So v3 gets rid of loop unrolling and reduces the threshold to 200.

[1]: /messages/by-id/CAApHDvqkHZsT2gaAWFM7D=7qyQ=eKXQvvn+BkwCn4Rvj1w4EKQ@mail.gmail.com

TODOs:
- adding a "sorted pre-check" to keep up with our qsort for ascending inputs
- further performance validation
- possibly doing NULL partitioning differently
- possibly specializing qsort on the NULL partition
- code cleanup

--
John Naylor
Amazon Web Services

Attachments:

bench_cartesiansort.sh (application/x-sh)
v3-0001-Use-radix-sort-when-datum1-is-an-integer-type.patch (application/x-patch, +487 -22)
#12 John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#10)
Re: tuple radix sort

On Mon, Nov 3, 2025 at 8:24 PM I wrote:

v1 was careful to restore isnull1 to false when diverting to quicksort
for the tiebreak. v2 doesn't bother, since the only tiebreak in core
that looks at isnull1 is comparetup_datum_tiebreak, which is not
reachable by radix sort, requiring a pass-by-value datum that compares
like an integer. This is a bit of a risk, since it's possible a third
party fork could be doing something weird. Seems unlikely, but
something to keep in mind.

I decided I wasn't quite comfortable with the full normalized datum
sharing space in SortTuple with isnull1. There's too much of a
cognitive burden involved in deciding when we do or don't need to
reset isnull1, and there's a non-zero risk of difficult-to-detect
bugs. For v4 I've instead used one byte of padding space in SortTuple
to store only the byte used for the current pass. That means we must
compute the normalized datum on every pass. That's actually better
than it sounds, since that one byte can now be used directly during
the "deal" step, rather than having to extract the byte from the
normalized datum by shifting and masking. That extraction step might
add significant cycles in cases where a pass requires multiple
iterations through the "deal" loop. It doesn't seem to make much
difference in practice, performance-wise, even with the following
pessimization:

I had to scrap the qsort specialization on the normalized datum for
small sorts, since it's no longer stored. It could still be worth it
to compute the "next byte of the normalized datum" and perform a qsort
on that (falling back to the comparator function in the usual way),
but I haven't felt the need to resort to that yet. For v4, I just
divert to qsort_tuple in non-assert builds, with a threshold of 40.
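To make the per-pass byte computation concrete, here is a minimal sketch (function names invented, not from the patch) for a signed 64-bit key sorted ASC: flipping the sign bit makes unsigned byte-wise comparison order-preserving, and the byte for any given pass can then be recomputed on the fly instead of being stored:

```c
#include <stdint.h>

/*
 * Hypothetical sketch: turn a signed 64-bit key into an unsigned
 * value whose byte-wise comparison is order-preserving, by flipping
 * the sign bit.  (For DESC, the whole value could be inverted too.)
 */
static inline uint64_t
normalize_int64(int64_t key)
{
	return (uint64_t) key ^ ((uint64_t) 1 << 63);
}

/*
 * Extract the radix byte for a given pass, most-significant byte
 * first (level 0 = MSB).  Computing this per pass avoids storing the
 * full normalized datum, at the cost of redoing the normalization.
 */
static inline uint8_t
radix_byte(int64_t key, int level)
{
	return (uint8_t) (normalize_int64(key) >> (56 - 8 * level));
}
```

The sign-bit flip maps INT64_MIN to 0 and INT64_MAX to UINT64_MAX, so unsigned ordering of the normalized values matches signed ordering of the originals.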

- I don't quite like how the NULL partitioning step looks, and it
could be costly when the distribution of NULL is not predictable, so
I'm thinking of turning part of that into a branch-free cyclic
permutation, similar to
/messages/by-id/CANWCAZbAmaZ7P+ARjS97sJLXsBB5CPZyzFgqNDiqe-L+BqXzug@mail.gmail.com

This is done. Even though the inner loop is mysterious at first
glance, it's really quite elegant.
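The patch itself uses an in-place cyclic permutation; as a much simpler illustration of the branch-free idea, here is a hypothetical out-of-place two-way partition (NULLS FIRST) whose inner loop selects the destination index arithmetically, with no data-dependent branch (all names are invented for the sketch):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch only, not the patch's in-place cyclic
 * permutation: partition keys by a NULL flag, NULLS FIRST, into a
 * scratch array.
 */
static void
partition_nulls_first(const int64_t *keys, const bool *isnull,
					  size_t n, int64_t *out)
{
	size_t		nulls = 0;

	/* one counting pass to find the partition boundary */
	for (size_t i = 0; i < n; i++)
		nulls += isnull[i];

	size_t		null_pos = 0;		/* next slot in the NULL partition */
	size_t		notnull_pos = nulls;	/* next slot in the not-NULL partition */

	for (size_t i = 0; i < n; i++)
	{
		size_t		is_null = isnull[i];

		/* select the destination arithmetically, without branching */
		size_t		dst = null_pos * is_null + notnull_pos * (1 - is_null);

		out[dst] = keys[i];
		null_pos += is_null;
		notnull_pos += 1 - is_null;
	}
}
```

The point of avoiding the branch is that a data-dependent branch on an unpredictable NULL distribution would suffer mispredictions, which is the cost concern mentioned above.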

I made an attempt at clean-up, but it's still under-commented. The
common prefix detection has moved to a separate patch (v4-0004).

I've been forcing all eligible sorts to use radix sort in assert
builds, even when small enough that qsort would be faster. Since both
qsort and in-place radix sort are unstable, it's expected that some
regression tests need adjustment (v4-0002). One thing surprised me,
however: The pg_upgrade TAP test that runs regression tests on the old
cluster showed additional failures that I can't explain. I haven't
seen this before, but it's possible I never ran TAP tests when testing
new sort algorithms previously. This doesn't happen if you change the
current insertion sort threshold, so I haven't been able to reproduce
it aside from this patch. For that reason I can't completely rule out
an actual bug, although I actually have more confidence in the
verification of correct sort order in v4, since isnull1 now never
changes, just as in master. I found that changing some tests to have
additional sort keys seems to fix it (v4-0003). I did this in a rather
quick and haphazard fashion. There's probably a longer conversation to
be had about making test output more deterministic while still
covering the intended executor paths.

Aside from that, this seems like a good place to settle down, so I'm
going to create a CF entry for this. I'll start more rigorous
performance testing in the near future.

--
John Naylor
Amazon Web Services

Attachments:

v4-0004-Detect-common-prefix-to-avoid-wasted-work-during-.patch (text/x-patch, +52 −2)
v4-0002-WIP-Adjust-regression-tests.patch (text/x-patch, +300 −302)
v4-0003-WIP-make-some-regression-tests-sort-order-more-de.patch (text/x-patch, +55 −54)
v4-0001-Use-radix-sort-when-datum1-is-an-integer-type.patch (text/x-patch, +389 −20)
#13David Geier
geidav.pg@gmail.com
In reply to: John Naylor (#1)
Re: tuple radix sort

On 29.10.2025 07:28, John Naylor wrote:

Next steps: Try to find regressions (help welcome here). The v1 patch
has some optimizations, but in other ways things are simple and/or
wasteful. Exactly how things fit together will be informed by what, if
anything, has to be done to avoid regressions. I suspect the challenge
will be multikey sorts when the first key has low cardinality. This is
because tiebreaks are necessarily postponed rather than taken care of
up front. I'm optimistic, since low cardinality cases can be even
faster than our B&M qsort, so we have some headroom:

Hi John,

I've also been looking into radix sort the last days to accelerate GIN
index builds. Ordering and removing duplicates requires a fast sort in
generate_trgm(). My own implementation (likely slower than the
algorithms you used) also showed a decent speedup.

Beyond that there are many more places in the code base that could be
changed to use radix sort instead of qsort.

What would be great is if we could build a generic radix sort
implementation, similar to sort_template.h, that can be used in other
places. We would have to think a bit about the interface because instead
of a comparator we would require some radix extraction callback.

If you're open to that idea I could give abstracting the code a try.

--
David Geier

#14John Naylor
john.naylor@enterprisedb.com
In reply to: David Geier (#13)
Re: tuple radix sort

On Wed, Nov 12, 2025 at 9:28 PM David Geier <geidav.pg@gmail.com> wrote:

I've also been looking into radix sort the last days to accelerate GIN
index builds. Ordering and removing duplicates requires a fast sort in
generate_trgm().

If that's the case then I suggest first seeing if dfd8e6c73ee made
things any worse. A simpler possible improvement is to use a similar
normalization step for the chars, if needed, then do the sort and
qunique with a specialization for unsigned chars. (We don't yet
specialize qunique, but that can be remedied). If you're interested,
please start a separate thread for that.

What would be great is if we could build a generic radix sort
implementation, similarly to sort_template.h that can be used in other
places. We would have to think a bit about the interface because instead
of a comparator we would require some radix extraction callback.

That's moving the goalposts too far IMO. I want to get to a place
where I feel comfortable with the decisions made, and that already
requires a lot of testing. Also, I don't want to risk introducing
abstractions that make future improvements to tuplesort more
cumbersome.

--
John Naylor
Amazon Web Services

#15David Geier
geidav.pg@gmail.com
In reply to: John Naylor (#14)
Re: tuple radix sort

Hi John!

On 13.11.2025 05:01, John Naylor wrote:

If that's the case then I suggest first seeing if dfd8e6c73ee made
things any worse. A simpler possible improvement is to use a similar
normalization step for the chars, if needed, then do the sort and
quinique with a specialization for unsigned chars. (We don't yet
specialize qunique, but that can be remedied). If you're interested,
please start a separate thread for that.

It did but only a bit. I worked around it by having two sort
specializations, one for signed and one for unsigned. I also wanted to
try to use a hash table to filter out duplicates and then only sort the
remaining unique trigrams, which are, most of the time, far fewer.

Generally speaking, the GIN code is death by a thousand cuts. I've got a
patch coming up that cuts CREATE INDEX runtime in half for columns with
relatively short strings and yields even better results for columns with
longer strings. But that's not only changing the sort but requires a few
changes in a couple of places. More details in the upcoming thread.

I thought qunique() was already pretty optimal because it's defined in
a header file. I believe that even the comparator gets inlined. What
would be useful, though, is if qunique() used an equality comparator
which only returns true/false instead of a sort comparator. In the GIN
code this also shaved off a few percent. I'll take a closer look at
qunique() and open a thread with the findings / ideas for changes.
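A minimal sketch of that idea (illustrative names, not the actual qunique.h interface): a dedup pass over a sorted array driven by an equality predicate instead of a three-way comparator:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* equality predicate: returns true/false, not -1/0/+1 */
static inline bool
int32_eq(int32_t a, int32_t b)
{
	return a == b;
}

/*
 * Remove adjacent duplicates from a sorted array in place; returns
 * the new element count.  Hypothetical sketch, not qunique.h.
 */
static size_t
unique_int32(int32_t *elements, size_t nelems)
{
	if (nelems <= 1)
		return nelems;

	size_t		write = 0;

	for (size_t read = 1; read < nelems; read++)
	{
		/* keep the element only if it differs from the last kept one */
		if (!int32_eq(elements[read], elements[write]))
			elements[++write] = elements[read];
	}
	return write + 1;
}
```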

Anyways. In this context GIN was just one example for where a generic
radix sort would be useful and there are certainly more.

That's moving the goalposts too far IMO. I want to get to a place
where I feel comfortable with the decisions made, and that already
requires a lot of testing. Also, I don't want to risk introducing
abstractions that make future improvements to tuplesort more
cumbersome.

On a quick glance it looks like you didn't specialize much. So the
testing seems related to whether the new algorithm introduces
regressions, not whether the abstraction would cause problems. So it
should be possible to extract the code fairly easily without
invalidating your existing benchmark results.

I understand that you want to make progress with the use case at hand
but I feel like we're missing out on a lot of opportunity where the
introduced code would also be very beneficial. Beyond that we could
nicely test the new sort code in the spirit of test_rbtree.c and
friends. Maybe you want to give it a 2nd thought.

--
David Geier

#16John Naylor
john.naylor@enterprisedb.com
In reply to: David Geier (#15)
Re: tuple radix sort

On Sat, Nov 15, 2025 at 1:05 AM David Geier <geidav.pg@gmail.com> wrote:

I understand that you want to make progress with the use case at hand
but I feel like we're missing out on a lot of opportunity where the
introduced code would also be very beneficial.

The patch is independently beneficial, but is also just a stepping
stone toward something larger, and I don't yet know exactly how it's
going to look. Premature abstractions are just going to get in the
way. I'd be open to hearing proposals for possible wider application
after the dust settles, but that's not going to happen during the PG19
cycle.

--
John Naylor
Amazon Web Services


#17David Geier
geidav.pg@gmail.com
In reply to: John Naylor (#16)
Re: tuple radix sort

Hi John!

On 15.11.2025 03:47, John Naylor wrote:

On Sat, Nov 15, 2025 at 1:05 AM David Geier <geidav.pg@gmail.com> wrote:

I understand that you want to make progress with the use case at hand
but I feel like we're missing out on a lot of opportunity where the
introduced code would also be very beneficial.

The patch is independently beneficial, but is also just a stepping
stone toward something larger, and I don't yet know exactly how it's
going to look. Premature abstractions are just going to get in the
way. I'd be open to hear proposals for possible wider application
after the dust settles, but that's not going to happen during the PG19
cycle.

That sounds like a good compromise. Let's see what else can profit from
the new sorting code once we've got the tuple sort in.

--
David Geier

#18John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#12)
Re: tuple radix sort

I wrote:

Aside from that, this seems like a good place to settle down, so I'm
going to create a CF entry for this. I'll start more rigorous
performance testing in the near future.

Here are the first systematic test results, with scripts. Overall, I'm
very pleased. With extremely low cardinality, it's close enough to our
B&M quicksort that any difference (a hair slower or faster) can be
discarded as insignificant. It's massively faster with most other
inputs, so I'll just highlight the exceptions:

"ascending" - Our qsort runs a "presorted check" before every
partitioning step, and I hadn't done this for radix sort yet because I
wanted to see what the "natural" difference is. I'm inclined to put in
a single precheck at the beginning (people have come to expect that to
be there), but not one at every recursion because I don't think that's
useful. (Aside: that precheck at every recursion should be replaced by
something that detects ascending/descending runs at the very start,
but that's a separate thread)
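A single upfront precheck could look something like this minimal sketch (hypothetical name, one linear scan before any partitioning rather than a check at every recursion):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch of a single upfront presorted check for the
 * radix sort path: bail out early if the input is already ordered.
 */
static bool
is_sorted_int64(const int64_t *a, size_t n)
{
	for (size_t i = 1; i < n; i++)
	{
		if (a[i - 1] > a[i])
			return false;
	}
	return true;
}
```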

"stagger" with multiplier = no. records / 2 - This seems to be a case
where the qsort's presorted check happens to get lucky. As I said
above, we should actually detect more sorted runs with something more
comprehensive.

"p5" - This is explicitly designed to favor the B&M qsort. The input
is 95% zeros, 2.5% negative numbers, and 2.5% positive numbers. The
first qsort pivot is pretty much guaranteed to be zero, and the first
partitioning step completes very quickly. Radix sort must do a lot
more work, but it is no different from the amount of work it does with
other patterns -- it's much less sensitive to the input distribution
than qsort. In this case, there's a mix of negative and positive
bigints. That defeats common prefix detection, and the first iteration
deals into two piles: negative and non-negative. Then a few recursions
happen where there is only a single distinct byte, so no useful work
happens. I suppose I could try common prefix detection at every
recursion, but I don't think that's widely beneficial for integers.
Maybe the single-byte-plus comparator small qsort would help a little,
and I'm considering adding that anyway. In one sense this is the most
worrying, since there doesn't seem to be a widely-useful mitigation,
but in another sense it's the least worrying, since this case is
deliberately constructed to be at a disadvantage.
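To illustrate why this input defeats the upfront detection, here is a sketch of common-prefix detection over normalized keys (hypothetical code, not the patch's, assuming a GCC/Clang-style __builtin_clzll):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * OR together the XOR of every key with the first; leading zero bytes
 * of the result are bytes all keys share, so those radix passes can
 * be skipped.  Illustrative sketch only.
 */
static int
common_prefix_bytes(const uint64_t *keys, size_t n)
{
	uint64_t	diff = 0;

	for (size_t i = 1; i < n; i++)
		diff |= keys[0] ^ keys[i];

	if (diff == 0)
		return 8;				/* all keys identical: nothing to sort on */
	return __builtin_clzll(diff) / 8;
}
```

With p5-style input, normalized negative keys have their sign bit clear and non-negative keys have it set, so the keys differ already in the first byte: the common prefix is zero bytes, and the first pass deals into just two piles.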

--
John Naylor
Amazon Web Services

Attachments:

BM-perf-test-rand-stag-sawtooth-20251120.sh (application/x-shellscript)
BM-perf-test-misc-20251120.sh (application/x-shellscript)
v4-test-stagger.png
v4-test-sawtooth.png
v4-test-misc.png
v4-test-random.png
#19Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: John Naylor (#12)
Re: tuple radix sort

On 2025-Nov-12, John Naylor wrote:

+/*
+ * Based on implementation in https://github.com/skarupke/ska_sort (Boost license),
+ * with the following noncosmetic change:
+ *  - count sorted partitions in every pass, rather than maintaining a
+ *    list of unsorted partitions
+ */
+static void
+radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *state)

I think given https://www.boost.org/LICENSE_1_0.txt you should include a
copy of the Boost license in this comment, as well as the copyright
statement from the hpp file,

// Copyright Malte Skarupke 2016.
// Distributed under the Boost Software License, Version 1.0.
// (See http://www.boost.org/LICENSE_1_0.txt)

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#20John Naylor
john.naylor@enterprisedb.com
In reply to: Alvaro Herrera (#19)
Re: tuple radix sort

On Thu, Nov 20, 2025 at 6:13 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:

I think given https://www.boost.org/LICENSE_1_0.txt you should include a
copy of the Boost license in this comment, as well as the copyright
statement from the hpp file,

Will do, next time I do some polishing.

(While thinking about it, I need to sprinkle in some
CHECK_FOR_INTERRUPTS(), too).

--
John Naylor
Amazon Web Services

#21Chengpeng Yan
chengpeng_yan@Outlook.com
In reply to: John Naylor (#12)
#22John Naylor
john.naylor@enterprisedb.com
In reply to: Chengpeng Yan (#21)
#23John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#22)
#24John Naylor
john.naylor@enterprisedb.com
In reply to: Chengpeng Yan (#21)
#25Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#24)
#26John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#25)
#27Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#26)
#28John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#27)
#29Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#28)
#30Chao Li
li.evan.chao@gmail.com
In reply to: Chao Li (#29)
#31John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#30)
#32Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#31)
#33John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#32)
#34zengman
zengman@halodbtech.com
In reply to: John Naylor (#33)
#35John Naylor
john.naylor@enterprisedb.com
In reply to: zengman (#34)
#36zengman
zengman@halodbtech.com
In reply to: John Naylor (#35)
#37cca5507
cca5507@qq.com
In reply to: zengman (#36)
#38John Naylor
john.naylor@enterprisedb.com
In reply to: cca5507 (#37)
#39cca5507
cca5507@qq.com
In reply to: John Naylor (#38)
#40John Naylor
john.naylor@enterprisedb.com
In reply to: cca5507 (#39)
#41John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#40)
#42cca5507
cca5507@qq.com
In reply to: John Naylor (#41)
#43John Naylor
john.naylor@enterprisedb.com
In reply to: cca5507 (#42)
#44cca5507
cca5507@qq.com
In reply to: John Naylor (#43)
#45John Naylor
john.naylor@enterprisedb.com
In reply to: cca5507 (#44)
#46cca5507
cca5507@qq.com
In reply to: John Naylor (#45)