Summary of Sort Improvement Proposals
Hello,
In light of multiple threads [1-6] discussing sorting improvements, I'd like to consolidate the old (+some new) ideas as a starting point.
It might make sense to brain storm on a few of these ideas and maybe even identify some that are worth implementing and testing.
1. Simple algorithmic ideas:
- Use single-assignment insertion-sort instead of swapping
- Increase insertion-sort threshold to at least 8 (possibly 10+), to be determined empirically based on current hardware
- Make insertion-sort threshold customizable via template based on sort element size
2. More complex/speculative algorithmic ideas:
- Try counting insertion-sort loop iterations and bail after a certain limit (include presorted check in insertion-sort loop and continue presorted check from last position in separate loop after bailout)
- Try binary search for presorted check (outside of insertion-sort-code)
- Try binary insertion sort (if comparison costs are high)
- Try partial insertion sort (include presorted check)
- Try presorted check only at top-level, not on every recursive step, or if on every level than at least only for n > some threshold
- Try asymmetric quick-sort partitioning
- Try dual pivot quick-sort
- Try switching to heap-sort dependent on recursion depth (might allow ripping out median-of-median)
3. TupleSort ideas:
- Use separate sort partition for NULL values to avoid null check on every comparison and to make nulls first/last trivial
- Pass down non-nullness info to avoid null check and/or null-partition creation (should ideally be determined by planner)
- Skip comparison of first sort key on subsequent full tuple tie-breaker comparison (unless abbreviated key)
- Encode NULL directly in abbreviated key (only if no null-partitioning)
4. Planner ideas:
- Use pg_stats.correlation to inform sort algorithm selection for sort keys that come from sequential-scans/bitmap-heap-scans
- Use n_distinct to inform sort algorithm selection (many tie-breaker comparisons necessary on multi-key sort)
- Improve costing of sorts in planner considering tuple size, distribution and n_distinct
[1]: /messages/by-id/ddc4e498740a8e411c59@zeyos.com
[2]: /messages/by-id/CAFBsxsHanJTsX9DNJppXJxwg3bU+YQ6pnmSfPM0uvYUaFdwZdQ@mail.gmail.com
[3]: /messages/by-id/CAApHDvoTTtoQYfp3d0kTPF6y1pjexgLwquzKmjzvjC9NCw4RGw@mail.gmail.com
[4]: /messages/by-id/CAEYLb_Xn4-6f1ofsf2qduf24dDCVHbQidt7JPpdL_RiT1zBJ6A@mail.gmail.com
[5]: /messages/by-id/CAEYLb_W++UhrcWprzG9TyBVF7Sn-c1s9oLbABvAvPGdeP2DFSQ@mail.gmail.com
[6]: /messages/by-id/683635b8-381b-5b08-6069-d6a45de19a12@enterprisedb.com
--
Benjamin Coutu
http://www.zeyos.com