B-Tree support function number 3 (strxfrm() optimization)
I have thought a little further about my proposal to have inner B-Tree
pages store strxfrm() blobs [1]; my earlier conclusion was that this
could not be trusted when equality was indicated [2] due to the
"Hungarian problem" [3]. As I noted earlier, that actually doesn't
really matter, because we may have missed what's really so useful
about strxfrm() blobs: We *can* trust a non-zero result as a proxy for
what we would have gotten with a proper bttextcmp(), even if a
strcmp() only looks at the first byte of a blob for each of two text
strings being compared. Here is what the strxfrm() blob looks like for
some common English words when put through glibc's strxfrm() with the
en_US.UTF-8 collation (I wrote a wrapper function a couple of years
ago that does this and outputs the result as bytea):
[local]/postgres=# select strxfrm_test('Yellow');
strxfrm_test
--------------------------------------------
\x241017171a220109090909090901100909090909
(1 row)
[local]/postgres=# select strxfrm_test('Red');
strxfrm_test
--------------------------
\x1d100f0109090901100909
(1 row)
[local]/postgres=# select strxfrm_test('Orange');
strxfrm_test
--------------------------------------------
\x1a1d0c1912100109090909090901100909090909
(1 row)
[local]/postgres=# select strxfrm_test('Green');
strxfrm_test
--------------------------------------
\x121d101019010909090909011009090909
(1 row)
[local]/postgres=# select strxfrm_test('White');
strxfrm_test
--------------------------------------
\x2213141f10010909090909011009090909
(1 row)
[local]/postgres=# select strxfrm_test('Black');
strxfrm_test
--------------------------------------
\x0d170c0e16010909090909011009090909
(1 row)
Obviously, all of these blobs, while much larger than the original
strings, still differ in their first byte. It's almost as if they're
intended to be truncated.
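A standalone C sketch of the idea (names and the 8-byte proxy width are my own, not PostgreSQL code): a non-zero memcmp() of truncated strxfrm() blobs can stand in for strcoll(), and only a tie forces the authoritative comparison.

```c
#include <string.h>
#include <stdlib.h>

#define PROXY_LEN 8             /* sizeof(Datum) on a 64-bit system */

/* Build a fixed-size proxy key: the first PROXY_LEN bytes of the full
 * strxfrm() blob, zero-padded.  Because strxfrm() output compares with
 * plain memcmp(), any non-zero comparison of two proxies agrees with
 * strcoll() on the original strings; only a zero result is inconclusive. */
static void
make_proxy(char proxy[PROXY_LEN], const char *s)
{
    size_t  needed = strxfrm(NULL, s, 0);   /* full blob length */
    char   *blob = malloc(needed + 1);

    strxfrm(blob, s, needed + 1);           /* transform fully... */
    memset(proxy, 0, PROXY_LEN);
    memcpy(proxy, blob, needed < PROXY_LEN ? needed : PROXY_LEN); /* ...then truncate */
    free(blob);
}

static int
proxy_compare(const char *a, const char *b)
{
    char pa[PROXY_LEN], pb[PROXY_LEN];
    int  cmp;

    make_proxy(pa, a);
    make_proxy(pb, b);
    cmp = memcmp(pa, pb, PROXY_LEN);
    if (cmp != 0)
        return cmp;             /* trustworthy: proxies differ */
    return strcoll(a, b);       /* tie: fall back to the full comparison */
}
```

In any locale, proxy_compare() returns the same sign as strcoll(); the win is that most comparisons of real-world strings never reach the strcoll() call.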
The API I envisage is a new support function 3 that operator class
authors may optionally provide. The support function is passed a text
varlena argument, or whatever type the opclass happens to relate to,
and returns a Datum to the core system. When a support function 3 is
provided, that Datum is always what is actually passed to the sort
support routine (B-Tree support function number 2); you must provide
number 2 if you provide number 3, but not vice versa. For opclasses
providing support function 3, the core system is entitled to take the
sort support return value as a proxy for what a proper support
function 1 call would indicate iff the sort support routine returns
non-zero for the two support-function-3 blobs (typically strxfrm()
blobs truncated at 8 bytes for convenient storage as pass-by-value
Datums). Otherwise, a proper call to support function 1, with
fully-formed text arguments, is required.
I see at least two compelling things we can do with these blobs:
1. Store them as pseudo-columns of inner B-Tree IndexTuples (not leaf
pages). Naturally, inner pages are very heterogeneous, so only having
8 bytes is very probably an excellent trade-off there. Typically 1-3%
of B-Tree pages are inner pages, so the bloat risk seems acceptable.
2. When building a SortTuple array within TupleSort, we can store far
more of these truncated blobs in memory than we can proper strings. If
SortTuple.datum1 (the first column to sort on among tuples being
sorted, which is currently stored in memory as an optimization) was
just an 8 byte truncated strxfrm() blob, and not a pointer to a text
string in memory, that would be pretty great for performance for
several reasons. So just as with B-Tree inner pages, for SortTuples
there can be a pseudo leading key - we need only compare additional
sort keys/heap_getattr() when we need a tie-breaker, when those 8
bytes aren't enough to reach a firm conclusion.
It doesn't just stop with strxfrm() blobs, though. Why couldn't we
create blobs that can be used as reliable proxies for numerics, that
are just integers? Sure, you need to reserve a bit to indicate an
inability to represent the original value, and possibly work out other
details like that, but the only negative thing you can say about that,
or indeed applying these techniques to any operator class is that it
might not help in certain worst cases (mostly contrived cases). Still,
the overhead of doing that bit of extra work is surely quite low
anyway - at worst, a few extra instructions wasted per comparison -
making these techniques likely quite profitable for the vast majority
of real-world applications.
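To illustrate, here is one possible shape for such a proxy, with the "can't represent the original value" indication the paragraph mentions modeled as an explicit flag for clarity (all names are hypothetical; this is not numeric's actual representation):

```c
#include <stdint.h>
#include <stdbool.h>

/* A fixed-width proxy for an arbitrary-precision value.  "exact"
 * records whether the proxy fully captures the original; in a real
 * pass-by-value Datum this would be the reserved bit. */
typedef struct
{
    int64_t key;     /* ordered prefix of the value */
    bool    exact;   /* does the proxy fully represent the original? */
} NumericProxy;

/* Returns -1/0/1.  Sets *authoritative to false only when the proxies
 * compare equal but at least one was inexact, in which case the caller
 * must fall back to the full comparator on the original values. */
static int
numeric_proxy_cmp(NumericProxy a, NumericProxy b, bool *authoritative)
{
    if (a.key != b.key)
    {
        *authoritative = true;          /* non-zero results always hold */
        return a.key < b.key ? -1 : 1;
    }
    *authoritative = a.exact && b.exact;
    return 0;
}
```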
Does anyone else want to pick this idea up and run with it? I don't
think I'll have time for it.
[1]: /messages/by-id/CAM3SWZTcXrdDZSpA11qZXiyo4_jtxwjaNdZpnY54yjzq7d64=A@mail.gmail.com
[2]: /messages/by-id/CAM3SWZS7wewrBmRGCi9_yCX49Ug6UgqN2xNGJG3Zq5v8LbDU4g@mail.gmail.com
[3]: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Mar 26, 2014 at 8:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
The API I envisage is a new support function 3 that operator class
authors may optionally provide.
I've built a prototype patch, attached, that extends SortSupport and
tuplesort to support "poor man's normalized keys". All the regression
tests pass, so while it's just a proof of concept, it is reasonably
well put together for one. The primary shortcoming of the prototype
(the main reason why I'm calling it a prototype rather than just a
patch) is that it isn't sufficiently generalized (i.e. it only works
for the cases currently covered by SortSupport - not B-Tree index
builds, or B-Tree scanKeys). There is no B-Tree support function
number 3 in the patch. I didn't spend too long on this.
I'm pretty happy with the results for in-memory sorting of text (my
development system uses 'en_US.UTF8', so please assume that any costs
involved are for runs that use that collation). With the dellstore2
sample database [1] restored to my local development instance, the
following example demonstrates just how much the technique can help
performance.
With master:
pg@hamster:~/sort-tests$ cat sort.sql
select * from (select * from customers order by firstname offset 100000) d;
pg@hamster:~/sort-tests$ pgbench -f sort.sql -n -T 100
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 819
latency average: 122.100 ms
tps = 8.186197 (including connections establishing)
tps = 8.186522 (excluding connections establishing)
With patch applied (requires initdb for new text SortSupport pg_proc entry):
pg@hamster:~/sort-tests$ cat sort.sql
select * from (select * from customers order by firstname offset 100000) d;
pg@hamster:~/sort-tests$ pgbench -f sort.sql -n -T 100
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 2525
latency average: 39.604 ms
tps = 25.241723 (including connections establishing)
tps = 25.242447 (excluding connections establishing)
It looks like this technique is very valuable indeed, at least in the
average or best case. We're not just benefiting from following the
advice of the standard, and using strxfrm() for sorting to amortize
the cost of the strxfrm() transformation that strcoll() must do
anyway. It stands to reason that there is also a lot of benefit from
sorting tightly-packed keys. Quicksort is cache oblivious, and having
it sort tightly-packed binary data, as opposed to going through all of
that dereferencing and deserialization indirection, is probably also very
helpful. A tool like Cachegrind might offer some additional insights,
but I haven't gone to the trouble of trying that out.
(By the way, my earlier recollection about how memory-frugal
MinimalTuple/memtuple building is within tuplesort was incorrect, so
there are no savings in memory to be had here).
As I mentioned, something like a SortSupport for numeric, with poor
man's normalized keys might also be compelling. I suggest we focus on
how this technique can be further generalized, though. This prototype
patch is derivative of Robert's abandoned SortSupport for text patch.
If he wanted to take this off my hands, I'd have no objections - I
don't think I'm going to have time to take this as far as I'd like.
[1]: http://pgfoundry.org/forum/forum.php?forum_id=603
--
Peter Geoghegan
Attachments:
poorman-prototype.2014_03_30.patch (text/x-patch)
On 31 March 2014 06:51, Peter Geoghegan <pg@heroku.com> wrote:
> I'm pretty happy with the results for in-memory sorting of text
As another data point, I ran the same benchmark, but I don't appear to
get the same positive result. An initdb was done for each rebuild,
my system uses en_GB.UTF-8 (if that's relevant) and I used your same
sort.sql...
With master:
thom@swift ~/Development $ pgbench -f sort.sql -n -T 100 ds2
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 421479
latency average: 0.237 ms
tps = 4214.769601 (including connections establishing)
tps = 4214.906079 (excluding connections establishing)
With patch applied:
thom@swift ~/Development $ pgbench -f sort.sql -n -T 100 ds2
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 412405
latency average: 0.242 ms
tps = 4124.047237 (including connections establishing)
tps = 4124.177437 (excluding connections establishing)
And with 4 runs (TPS):
Master: 4214.906079 / 4564.532623 / 4152.784608 / 4152.578297 (avg: 4271)
Patched: 4124.177437 / 3777.561869 / 3777.561869 / 2484.220475 (avg: 3481)
I'm not sure what's causing the huge variation. I ran 5 minute benchmarks too:
Master:
thom@swift ~/Development $ pgbench -f sort.sql -n -T 300 ds2
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 1092221
latency average: 0.275 ms
tps = 3640.733002 (including connections establishing)
tps = 3640.784628 (excluding connections establishing)
Patched:
thom@swift ~/Development $ pgbench -f sort.sql -n -T 300 ds2
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 1068239
latency average: 0.281 ms
tps = 3560.794946 (including connections establishing)
tps = 3560.835076 (excluding connections establishing)
And per-second results for the first 30 seconds:
Master:
progress: 1.0 s, 2128.8 tps, lat 0.464 ms stddev 0.084
progress: 2.0 s, 2138.9 tps, lat 0.464 ms stddev 0.015
progress: 3.0 s, 2655.6 tps, lat 0.374 ms stddev 0.151
progress: 4.0 s, 2214.0 tps, lat 0.448 ms stddev 0.080
progress: 5.0 s, 2171.1 tps, lat 0.457 ms stddev 0.071
progress: 6.0 s, 2131.6 tps, lat 0.466 ms stddev 0.035
progress: 7.0 s, 3811.2 tps, lat 0.260 ms stddev 0.177
progress: 8.0 s, 2139.6 tps, lat 0.464 ms stddev 0.017
progress: 9.0 s, 7989.7 tps, lat 0.124 ms stddev 0.091
progress: 10.0 s, 8509.7 tps, lat 0.117 ms stddev 0.062
progress: 11.0 s, 3131.3 tps, lat 0.317 ms stddev 0.177
progress: 12.0 s, 9362.1 tps, lat 0.106 ms stddev 0.006
progress: 13.0 s, 5831.0 tps, lat 0.170 ms stddev 0.137
progress: 14.0 s, 4949.3 tps, lat 0.201 ms stddev 0.156
progress: 15.0 s, 2136.9 tps, lat 0.464 ms stddev 0.028
progress: 16.0 s, 3918.3 tps, lat 0.253 ms stddev 0.177
progress: 17.0 s, 4102.7 tps, lat 0.242 ms stddev 0.122
progress: 18.0 s, 2997.6 tps, lat 0.331 ms stddev 0.151
progress: 19.0 s, 2139.1 tps, lat 0.464 ms stddev 0.034
progress: 20.0 s, 3189.5 tps, lat 0.311 ms stddev 0.173
progress: 21.0 s, 2120.7 tps, lat 0.468 ms stddev 0.030
progress: 22.0 s, 3197.7 tps, lat 0.311 ms stddev 0.182
progress: 23.0 s, 2115.3 tps, lat 0.469 ms stddev 0.034
progress: 24.0 s, 2129.0 tps, lat 0.466 ms stddev 0.031
progress: 25.0 s, 2190.7 tps, lat 0.453 ms stddev 0.106
progress: 26.0 s, 2118.6 tps, lat 0.468 ms stddev 0.031
progress: 27.0 s, 2136.8 tps, lat 0.464 ms stddev 0.018
progress: 28.0 s, 5160.7 tps, lat 0.193 ms stddev 0.156
progress: 29.0 s, 2312.5 tps, lat 0.429 ms stddev 0.107
progress: 30.0 s, 2145.9 tps, lat 0.463 ms stddev 0.038
progress: 31.0 s, 2107.6 tps, lat 0.471 ms stddev 0.071
Patched:
progress: 1.0 s, 2136.2 tps, lat 0.463 ms stddev 0.084
progress: 2.0 s, 2153.3 tps, lat 0.461 ms stddev 0.035
progress: 3.0 s, 2336.0 tps, lat 0.425 ms stddev 0.112
progress: 4.0 s, 2144.8 tps, lat 0.463 ms stddev 0.037
progress: 5.0 s, 2171.7 tps, lat 0.457 ms stddev 0.041
progress: 6.0 s, 2161.9 tps, lat 0.459 ms stddev 0.036
progress: 7.0 s, 2143.1 tps, lat 0.463 ms stddev 0.019
progress: 8.0 s, 2148.4 tps, lat 0.462 ms stddev 0.032
progress: 9.0 s, 2142.1 tps, lat 0.463 ms stddev 0.028
progress: 10.0 s, 2133.6 tps, lat 0.465 ms stddev 0.032
progress: 11.0 s, 2138.3 tps, lat 0.464 ms stddev 0.020
progress: 12.0 s, 2578.7 tps, lat 0.385 ms stddev 0.149
progress: 13.0 s, 2455.6 tps, lat 0.404 ms stddev 0.119
progress: 14.0 s, 2909.5 tps, lat 0.341 ms stddev 0.170
progress: 15.0 s, 2133.7 tps, lat 0.465 ms stddev 0.025
progress: 16.0 s, 2876.9 tps, lat 0.345 ms stddev 0.160
progress: 17.0 s, 2167.1 tps, lat 0.458 ms stddev 0.038
progress: 18.0 s, 3623.4 tps, lat 0.274 ms stddev 0.181
progress: 19.0 s, 2476.3 tps, lat 0.401 ms stddev 0.137
progress: 20.0 s, 2166.9 tps, lat 0.458 ms stddev 0.030
progress: 21.0 s, 3150.0 tps, lat 0.315 ms stddev 0.176
progress: 22.0 s, 2220.3 tps, lat 0.447 ms stddev 0.084
progress: 23.0 s, 2131.3 tps, lat 0.466 ms stddev 0.030
progress: 24.0 s, 2154.2 tps, lat 0.461 ms stddev 0.047
progress: 25.0 s, 2122.5 tps, lat 0.468 ms stddev 0.040
progress: 26.0 s, 2202.2 tps, lat 0.451 ms stddev 0.079
progress: 27.0 s, 2150.6 tps, lat 0.461 ms stddev 0.039
progress: 28.0 s, 2156.1 tps, lat 0.460 ms stddev 0.027
progress: 29.0 s, 2152.1 tps, lat 0.461 ms stddev 0.028
progress: 30.0 s, 2235.4 tps, lat 0.444 ms stddev 0.071
--
Thom
On 31 March 2014 11:32, Thom Brown <thom@linux.com> wrote:
> And with 4 runs (TPS):
> Master: 4214.906079 / 4564.532623 / 4152.784608 / 4152.578297 (avg: 4271)
> Patched: 4124.177437 / 3777.561869 / 3777.561869 / 2484.220475 (avg: 3481)
While this seems to indicate some kind of regression with the patch,
I've just realised that the data files weren't loaded, as the csv
files were in the wrong path, which explains the very high tps.
*facepalm* So here are the *right* benchmarks:
Master:
thom@swift ~/Development $ pgbench -f sort.sql -n -T 100 ds2
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 2093
latency average: 47.778 ms
tps = 20.920833 (including connections establishing)
tps = 20.921719 (excluding connections establishing)
(a repeat shows 20.737619)
With patch:
thom@swift ~/Development $ pgbench -f sort.sql -n -T 100 ds2
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 6148
latency average: 16.265 ms
tps = 61.477152 (including connections establishing)
tps = 61.479736 (excluding connections establishing)
(a repeat shows 62.252024)
So clearly a 3-fold improvement in this case.
--
Thom
On 03/31/2014 08:51 AM, Peter Geoghegan wrote:
+ #ifdef HAVE_LOCALE_T
+     if (tss->locale)
+         strxfrm_l(pres, tss->buf1, Min(sizeof(Datum), len), tss->locale);
+     else
+ #endif
+         strxfrm(pres, tss->buf1, Min(sizeof(Datum), len));
+
+     pres[Min(sizeof(Datum) - 1, len)] = '\0';
I'm afraid that trick isn't 100% reliable. The man page for strxfrm says:
RETURN VALUE
The strxfrm() function returns the number of bytes required to store
the transformed string in dest excluding the terminating null byte
('\0'). If the value returned is n or more, the contents of dest are
indeterminate.
Note the last sentence. To avoid the undefined behavior, you have to
pass a buffer that's large enough to hold the whole result, and then
truncate the result.
- Heikki
On Mon, Mar 31, 2014 at 8:08 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Note the last sentence. To avoid the undefined behavior, you have to pass a
buffer that's large enough to hold the whole result, and then truncate the
result.
I was working off the glibc documentation, which says:
"The return value is the length of the entire transformed string. This
value is not affected by the value of size, but if it is greater or
equal than size, it means that the transformed string did not entirely
fit in the array to. In this case, only as much of the string as
actually fits was stored. To get the whole transformed string, call
strxfrm again with a bigger output array."
It looks like this may be a glibc-ism. I'm not sure whether or not
it's worth attempting to benefit from glibc's apparent additional
guarantees here. Obviously that is something that would have to be
judged by weighing any actual additional benefit. FWIW, I didn't
actually imagine that this was a trick I was exploiting. I wouldn't be
surprised if this wasn't very important in practice.
--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
On Mon, Mar 31, 2014 at 8:08 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:Note the last sentence. To avoid the undefined behavior, you have to pass a
buffer that's large enough to hold the whole result, and then truncate the
result.
I was working off the glibc documentation, which says:
"The return value is the length of the entire transformed string. This
value is not affected by the value of size, but if it is greater or
equal than size, it means that the transformed string did not entirely
fit in the array to. In this case, only as much of the string as
actually fits was stored. To get the whole transformed string, call
strxfrm again with a bigger output array."
It looks like this may be a glibc-ism.
Possibly worth noting is that on my RHEL6 and Fedora machines,
"man strxfrm" says the same thing as the POSIX spec. Likewise on OS X.
I don't think we can rely on what you suggest.
regards, tom lane
On Mon, Mar 31, 2014 at 1:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Possibly worth noting is that on my RHEL6 and Fedora machines,
"man strxfrm" says the same thing as the POSIX spec. Likewise on OS X.
I don't think we can rely on what you suggest.
Okay. Attached revision only trusts strxfrm() blobs (as far as that
goes) when the buffer passed to strxfrm() was sufficiently large that
the blob could fully fit. We're now managing two buffers as part of
the text sort support state (for leading poor man's keys, rather than
just for non-leading keys as before) - one for the original string (to
put in a NULL byte, since like strcoll(), strxfrm() expects that), and
the other to temporarily store the blob before memcpy()ing over to the
pass-by-value poor man's normalized key Datum. These are the same two
persistent sortsupport-state buffers used when sorting a non-leading
text key, where there is one for each string that needs to have a NULL
byte appended.
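A minimal standalone sketch of the buffer-sizing discipline this revision follows (my own code, not the patch's): per POSIX, if strxfrm()'s return value is greater than or equal to the buffer size, the buffer contents are indeterminate, so we ask for the full length first, transform into a certainly-large-enough buffer, and only then truncate to the Datum-sized proxy.

```c
#include <string.h>
#include <stdlib.h>

#define PROXY_LEN 8             /* pass-by-value Datum width assumed */

/* Transform src fully, then truncate to PROXY_LEN bytes (zero-padded).
 * Passing a too-small buffer directly to strxfrm() would leave its
 * contents indeterminate, which is exactly what we must avoid. */
static void
make_truncated_key(char dst[PROXY_LEN], const char *src)
{
    size_t  needed = strxfrm(NULL, src, 0);  /* length query only */
    char   *full = malloc(needed + 1);

    strxfrm(full, src, needed + 1);          /* full, well-defined transform */
    memset(dst, 0, PROXY_LEN);
    memcpy(dst, full, needed < PROXY_LEN ? needed : PROXY_LEN);
    free(full);
}
```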
This appears to perform at least as well as the prior revision, and
possibly even appreciably better. I guess that glibc was internally
managing a buffer to do much the same thing, so perhaps we gain
something more from mostly avoiding that with a longer lived buffer.
Perhaps you could investigate the performance of this new revision too, Thom?
I'd really like to see this basic approach expanded to store pseudo
leading keys in internal B-Tree pages too, per my earlier proposals.
At the very least, B-Tree index builds should benefit from this. I
think I'm really only scratching the surface here, both in terms of
the number of places in the code where a poor man's normalized key
could usefully be exploited, as well as in terms of the number of
B-Tree operator classes that could be made to provide one.
--
Peter Geoghegan
Attachments:
poorman.v2.2014_03_31.patch (text/x-patch)
On Mon, Mar 31, 2014 at 7:35 PM, Peter Geoghegan <pg@heroku.com> wrote:
Okay. Attached revision only trusts strxfrm() blobs (as far as that
goes) when the buffer passed to strxfrm() was sufficiently large that
the blob could fully fit.
Attached revision has been further polished. I've added two additional
optimizations:
* Squeeze the last byte out of each Datum, so that on a 64-bit system,
the full 8 bytes are available to store strxfrm() blobs.
* Figure out, when the strcoll() bttextfastcmp_locale() comparator is
called, whether it was called because a poor man's comparison required
it (and *not* because it's a non-leading key in the traditional sense,
which implies there are no poorman's normalized keys in respect of
this attribute at all). This allows us to try and get away with a
straight memcmp if and when the lengths of the original text strings
match, on the assumption that when the initial poorman's comparison
didn't work out, and when the string lengths match, there is a very
good chance that both are equal, and on average it's a win to avoid
doing a strcoll() (along with the attendant copying around of buffers
for NULL-termination) entirely. Given that memcmp() is so much cheaper
than strcoll() anyway, this seems like a good trade-off.
--
Peter Geoghegan
Attachment: poorman.v3.2014_04_03.patch (text/x-patch, +475/-72)
On 3 April 2014 17:52, Peter Geoghegan <pg@heroku.com> wrote:
On Mon, Mar 31, 2014 at 7:35 PM, Peter Geoghegan <pg@heroku.com> wrote:
Okay. Attached revision only trusts strxfrm() blobs (as far as that
goes) when the buffer passed to strxfrm() was sufficiently large that
the blob could fully fit.
Attached revision has been further polished. I've added two additional
optimizations:
* Squeeze the last byte out of each Datum, so that on a 64-bit system,
the full 8 bytes are available to store strxfrm() blobs.
* Figure out, when the strcoll()-based bttextfastcmp_locale() comparator
is called, whether it was called because a poor man's comparison required
it (and *not* because it's a non-leading key in the traditional sense,
which would imply there are no poor man's normalized keys for this
attribute at all). This allows us to try to get away with a
straight memcmp() when the lengths of the original text strings
match, on the assumption that when the initial poor man's comparison
was inconclusive and the string lengths match, there is a very
good chance that both strings are equal, and on average it's a win to
avoid doing a strcoll() (along with the attendant copying around of
buffers for NUL-termination) entirely. Given that memcmp() is so much
cheaper than strcoll() anyway, this seems like a good trade-off.
I'm getting an error when building this:
In file included from printtup.c:23:0:
../../../../src/include/utils/memdebug.h:21:31: fatal error:
valgrind/memcheck.h: No such file or directory
compilation terminated.
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -fexcess-precision=standard -I../../../src/include
-D_GNU_SOURCE -c -o analyze.o analyze.c -MMD -MP -MF
.deps/analyze.Po
make[4]: *** [printtup.o] Error 1
make[4]: Leaving directory
`/home/thom/Development/postgresql/src/backend/access/common'
make[3]: *** [common-recursive] Error 2
make[3]: Leaving directory
`/home/thom/Development/postgresql/src/backend/access'
make[2]: *** [access-recursive] Error 2
make[2]: *** Waiting for unfinished jobs....
--
Thom
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Apr 3, 2014 at 1:23 PM, Thom Brown <thom@linux.com> wrote:
I'm getting an error when building this:
Sorry. I ran this through Valgrind, and forgot to reset where the
relevant macro is define'd before submission. Attached revision should
build without issue.
--
Peter Geoghegan
Attachment: poorman.v3-revised.2014_04_03.patch (text/x-patch, +473/-70)
On 3 April 2014 19:05, Peter Geoghegan <pg@heroku.com> wrote:
On Thu, Apr 3, 2014 at 1:23 PM, Thom Brown <thom@linux.com> wrote:
I'm getting an error when building this:
Sorry. I ran this through Valgrind, and forgot to reset where the
relevant macro is define'd before submission. Attached revision should
build without issue.
Looking good:
-T 100 -n -f sort.sql
Master: 21.670467 / 21.718653 (avg: 21.69456)
Patch: 66.888756 / 66.888756 (avg: 66.888756)
3.08x speedup (a 208% increase in throughput)
-c 80 -j 80 -T 100 -n -f sort.sql
Master: 38.450082 / 37.440701 (avg: 37.9453915)
Patch: 153.321946 / 145.004726 (avg: 149.163336)
3.93x speedup (a 293% increase in throughput)
--
Thom
On Thu, Apr 3, 2014 at 3:19 PM, Thom Brown <thom@linux.com> wrote:
Master: 38.450082 / 37.440701 (avg: 37.9453915)
Patch: 153.321946 / 145.004726 (avg: 149.163336)
I think that those are objectively very large reductions in a cost
that figures prominently in most workloads. Based solely on those
facts, but also on the fairly low complexity of the patch, it may be
worth considering committing this before 9.4 goes into feature freeze,
purely as something that just adds a SortSupport function for the
default text opclass, with more or less the same cases accelerated as
with the existing SortSupport-supplying opclasses. There have been only
very modest expansions to the SortSupport and tuplesort code. I
haven't generalized this to work with other areas where a normalized
key could be put to good use, but I see no reason to block on that.
Obviously I've missed the final commitfest deadline by a wide margin. I
don't suggest this lightly, and my suggesting it at all has a lot to do
with the patch being simple and easy to reason about. The patch could
almost (but not quite) be written as part of a third party text
operator class's sort support routine. I think that if an individual
committer were willing to commit this at their own discretion before
feature freeze, outside of the formal commitfest process, that would
not be an unreasonable thing in these particular, somewhat unusual
circumstances.
I defer entirely to the judgement of others here - this is not an
issue that I feel justified in expressing a strong opinion on, and in
fact I don't have such an opinion anyway. However, by actually looking
at the risks and the benefits here, I think everyone will at least
understand why I'd feel justified in broaching the topic. This is a
very simple idea, and a rather old one at that.
--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
I think that those are objectively very large reductions in a cost
that figures prominently in most workloads. Based solely on those
facts, but also on the fairly low complexity of the patch, it may be
worth considering committing this before 9.4 goes into feature freeze,
Personally, I have paid no attention to this thread and have no intention
of doing so before feature freeze. There are three dozen patches at
https://commitfest.postgresql.org/action/commitfest_view?id=21
that have moral priority for consideration for 9.4. Not all of them are
going to get in, certainly, and I'm already feeling a lot of guilt about
the small amount of time I've been able to devote to reviewing/committing
patches this cycle. Spending time now on patches that didn't even exist
at the submission deadline feels quite unfair to me.
Perhaps I shouldn't lay my own guilt trip on other committers --- but
I think it would be a bad precedent to not deal with the existing patch
queue first.
regards, tom lane
On Thu, Apr 3, 2014 at 11:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Personally, I have paid no attention to this thread and have no intention
of doing so before feature freeze.
Anything that you missed was likely musings on how to further
generalize SortSupport. The actual additions to SortSupport and
tuplesort proposed are rather small. A simple abstraction to allow for
semi-reliable normalized keys, and a text sort support function to use
it.
Perhaps I shouldn't lay my own guilt trip on other committers --- but
I think it would be a bad precedent to not deal with the existing patch
queue first.
I think that that's a matter of personal priorities for committers. I
am not in a position to tell anyone what their priorities ought to be,
and giving extra attention to this patch may be unfair. It doesn't
have to be a zero-sum game, though. Attention from a committer to,
say, this patch does not necessarily have to come at the expense of
another, if for example this patch piques somebody's interest and
causes them to put extra work into it on top of what they'd already
planned to look at. Again, under these somewhat unusual circumstances,
that seems like something that some committer might be inclined to do,
without it being altogether unreasonable. Then again, perhaps not.
--
Peter Geoghegan
On Thu, Apr 03, 2014 at 11:44:46PM -0400, Tom Lane wrote:
Peter Geoghegan <pg@heroku.com> writes:
I think that those are objectively very large reductions in a cost
that figures prominently in most workloads. Based solely on those
facts, but also on the fairly low complexity of the patch, it may be
worth considering committing this before 9.4 goes into feature freeze,
Personally, I have paid no attention to this thread and have no intention
of doing so before feature freeze. There are three dozen patches at
https://commitfest.postgresql.org/action/commitfest_view?id=21
that have moral priority for consideration for 9.4. Not all of them are
going to get in, certainly, and I'm already feeling a lot of guilt about
the small amount of time I've been able to devote to reviewing/committing
patches this cycle. Spending time now on patches that didn't even exist
at the submission deadline feels quite unfair to me.
Perhaps I shouldn't lay my own guilt trip on other committers --- but
I think it would be a bad precedent to not deal with the existing patch
queue first.
+1
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
On Fri, Apr 4, 2014 at 12:13 PM, Noah Misch <noah@leadboat.com> wrote:
On Thu, Apr 03, 2014 at 11:44:46PM -0400, Tom Lane wrote:
Peter Geoghegan <pg@heroku.com> writes:
I think that those are objectively very large reductions in a cost
that figures prominently in most workloads. Based solely on those
facts, but also on the fairly low complexity of the patch, it may be
worth considering committing this before 9.4 goes into feature freeze,
Personally, I have paid no attention to this thread and have no intention
of doing so before feature freeze. There are three dozen patches at
https://commitfest.postgresql.org/action/commitfest_view?id=21
that have moral priority for consideration for 9.4. Not all of them are
going to get in, certainly, and I'm already feeling a lot of guilt about
the small amount of time I've been able to devote to reviewing/committing
patches this cycle. Spending time now on patches that didn't even exist
at the submission deadline feels quite unfair to me.
Perhaps I shouldn't lay my own guilt trip on other committers --- but
I think it would be a bad precedent to not deal with the existing patch
queue first.
+1
+1
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Apr 4, 2014 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Perhaps I shouldn't lay my own guilt trip on other committers --- but
I think it would be a bad precedent to not deal with the existing patch
queue first.
+1
+1
I don't think we have to promise a strict priority queue and preempt all
other development. But I agree loosely that processing patches that have
been around should be a higher priority.
I've been meaning to do more review for a while and just took a skim
through the queue. There are only a couple I feel I can contribute to, so
I'm going to work on those, and then if it's still before the feature
freeze I would like to go ahead with Peter's patch. I think it's generally a good
patch.
Two questions I have:
1) Would it make more sense to use a floating point instead of an integer?
I saw a need for a function like this when I was looking into doing GPU
sorts. But GPUs expect floating point values.
2) I would want to see a second data type, probably numeric, before
committing, to be sure we had a reasonably generic API. But it should be
pretty simple to do so.
--
greg
On Fri, Apr 4, 2014 at 5:29 PM, Greg Stark <stark@mit.edu> wrote:
Two questions I have:
1) Would it make more sense to use a floating point instead of an integer? I
saw a need for a function like this when I was looking into doing GPU sorts.
But GPUs expect floating point values.
In the context of this patch, I don't think you want to add
uncertainty to the != 0 or == 0 case (which is what FP would do).
On Fri, Apr 4, 2014 at 4:29 PM, Greg Stark <stark@mit.edu> wrote:
1) Would it make more sense to use a floating point instead of an integer? I
saw a need for a function like this when I was looking into doing GPU sorts.
But GPUs expect floating point values.
There is no reason to assume that the normalized keys within the
memtuples array need to be compared as integers, floats, or anything
else. That's a matter for the opclass author. Right now, it happens to
be the case that some of our earlier aspirations for SortSupport, such
as alternative user-defined sorting algorithms, are not supported. This
is presumably only because no one came up with a compelling
alternative (i.e. no one followed through with trying to figure out if
radix sort added much value). It wouldn't be very hard to make
SortSupport and tuplesort care about that, but that's a separate
issue.
2) I would want to see a second data type, probably numeric, before
committing, to be sure we had a reasonably generic API. But it should be
pretty simple to do so.
I could pretty easily write a numeric proof of concept. I don't think
it would prove anything about the interface, though - I've only had
tuplesort and SortSupport assume that normalized keys are generated as
tuples are initially copied over (where applicable), as well as what
to do when a non-reliable comparison returns 0 (where applicable).
That is surely the kernel of the general idea, and nothing more, so I
don't see that numeric support proves the suitability of the interface
(or at least the essential idea of the interface, as opposed to
exactly how the mechanism works in the proposed patch, where the
SortSupport struct is slightly extended to express this idea).
A large majority of new code is the new SortSupport routine for text.
--
Peter Geoghegan