Test coverage for external sorting
PostgreSQL uses two different sorting algorithms, qsort and the external
sorting method in tuplesort.c. There are some possible improvements in
external sorting, so I'd like to check on the solidity of the testing
mechanisms.
Whether external sorting can be improved upon is a different debate,
though I do have reason to believe it is possible. Initially, I am
interested in proving correctness of any change, though the eventual
goal would be performance.
I'm looking through the regression tests, but can't find anything that
explicitly tests both types of sort.
If you run the regression tests against an existing database instance it
would be possible to run the tests with various values of work_mem so as
to force the sorts to either be internal or external.
...only problem is that the largest regression test table, tenk, doesn't
occupy as much as 1 MB of space in total, and work_mem cannot be set
lower than 1 MB.
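For reference, forcing the external path by hand might look like this (a sketch only; the table and column names are illustrative, and because of the 1 MB floor the table sorted would have to be larger than anything in the regression data):

```sql
-- Hypothetical session: pin work_mem at its floor, then sort a table
-- bigger than 1 MB so tuplesort.c has to spill to its external path.
SET work_mem = '1MB';
SELECT * FROM some_big_table ORDER BY some_col;
```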
Could anybody comment on whether the current tests appropriately cover
the correctness of the external sorting algorithms? Will any changes to
the external sort algorithms be appropriately tested by what is there
currently?
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
Could anybody comment on whether the current tests appropriately cover
the correctness of the external sorting algorithms?
It's highly unlikely that the regression tests stress external sorts
much, or that anyone would hold still for making them run long enough
to do so ;-)
It's not hard to create a stress test: just load a bunch of random
numbers into a table and create a b-tree index on it. To check the
correctness of the sort, you could CLUSTER on the index and then read
out the table to see if it were now in sorted order.
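Tom's recipe might look roughly like this (a sketch; the table name and row count are illustrative, and the CLUSTER/generate_series spellings are the modern forms):

```sql
-- Load a bunch of random numbers, build a b-tree on them, CLUSTER on it.
CREATE TABLE sort_stress AS
    SELECT (random() * 1000000)::int AS k
    FROM generate_series(1, 1000000) g;
CREATE INDEX sort_stress_k_idx ON sort_stress (k);
CLUSTER sort_stress USING sort_stress_k_idx;
-- After CLUSTER, a plain front-to-back read of the table should return k
-- in sorted order; comparing that against an explicit ORDER BY scan is
-- one way to verify the sort was correct.
```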
BTW, as for your original question about performance, the current
external sort algorithm is mainly designed to conserve disk space,
not to be as fast as possible. It could probably be a good bit faster
if we didn't mind taking twice as much space (mainly because the
physical disk access pattern would be a lot less random). But I know
we will get push-back if we try to revert to doing that.
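The tradeoff described here is the classic run-then-merge scheme: spend extra temporary space holding whole runs in exchange for strictly sequential reads during the merge. A toy sketch of that general technique (not PostgreSQL's tuplesort, and with in-memory lists standing in for tape files):

```python
import heapq


def external_sort(records, memory_size):
    """Toy run formation plus a single k-way merge pass.

    Each sorted run is kept whole (so peak temp space is roughly twice
    the data size when runs live in separate files), but every run is
    then consumed strictly sequentially during the merge.
    """
    runs = []
    buf = []
    for rec in records:
        buf.append(rec)
        if len(buf) == memory_size:
            runs.append(sorted(buf))  # one sorted run "written to disk"
            buf = []
    if buf:
        runs.append(sorted(buf))
    # k-way merge: each run is read front to back, never revisited.
    return list(heapq.merge(*runs))
```

The real engineering question is what happens when runs share files: reusing freed space saves disk but scatters the merge's read pattern, which is the randomness Tom refers to.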
regards, tom lane
On Tue, 2005-04-12 at 10:04 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
Could anybody comment on whether the current tests appropriately cover
the correctness of the external sorting algorithms?
It's highly unlikely that the regression tests stress external sorts
much, or that anyone would hold still for making them run long enough
to do so ;-)
OK
It's not hard to create a stress test: just load a bunch of random
numbers into a table and create a b-tree index on it. To check the
correctness of the sort, you could CLUSTER on the index and then read
out the table to see if it were now in sorted order.
Just checking. No point starting anything until a test is in place. Yes,
they're fairly straightforward to do - I just didn't want to do it...
BTW, as for your original question about performance, the current
external sort algorithm is mainly designed to conserve disk space,
not to be as fast as possible. It could probably be a good bit faster
if we didn't mind taking twice as much space (mainly because the
physical disk access pattern would be a lot less random). But I know
we will get push-back if we try to revert to doing that.
That's roughly what I'm looking into now: just scoping for the time
being. Anything submitted would take the status quo as default and
present other functionality as an option only.
There's also some research into improved replacement selection
algorithms that may soon be submitted/submittable.
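For context, replacement selection is the classic run-formation scheme (Knuth's algorithm 5.4.1R) that tuplesort.c's external path is based on; with random input it produces runs averaging about twice the memory size. A minimal Python sketch of the basic idea, not of any proposed improvement:

```python
import heapq

_DONE = object()  # sentinel marking an exhausted input


def replacement_selection(records, memory_size):
    """Form sorted runs via replacement selection.

    Records smaller than the last value output cannot join the current
    run, so they are set aside to seed the next one.
    """
    it = iter(records)
    heap = [rec for _, rec in zip(range(memory_size), it)]
    heapq.heapify(heap)
    runs, current, next_run = [], [], []
    while heap:
        smallest = heapq.heappop(heap)
        current.append(smallest)
        rec = next(it, _DONE)
        if rec is not _DONE:
            if rec >= smallest:
                heapq.heappush(heap, rec)   # can still join this run
            else:
                next_run.append(rec)        # must wait for the next run
        if not heap:                        # current run is complete
            runs.append(current)
            current = []
            heap = next_run
            next_run = []
            heapq.heapify(heap)
    return runs
```

Because each pop is matched by one new record entering either the heap or the next-run pool, memory stays full until the input is exhausted, which is what yields the longer-than-memory runs.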
Best Regards, Simon Riggs
Tom,
BTW, as for your original question about performance, the current
external sort algorithm is mainly designed to conserve disk space,
not to be as fast as possible. It could probably be a good bit faster
if we didn't mind taking twice as much space (mainly because the
physical disk access pattern would be a lot less random). But I know
we will get push-back if we try to revert to doing that.
We could do it as a compile-time option or a GUC. I know I wouldn't mind
taking extra disk space if my sorts ran faster on very large tables.
Currently, building a new btree index on a 100GB table takes about 2 hours on
a v40z. And that's not because of I/O; disk bandwidth is less than 20% used.
--
Josh Berkus
Aglio Database Solutions
San Francisco