tweaking perfect hash multipliers
Hi all,
While playing around with Peter E.'s unicode normalization patch [1]/messages/by-id/c1909f27-c269-2ed9-12f8-3ab72c8caf7a@2ndquadrant.com,
I found that HEAD failed to build a perfect hash function for any of
the four sets of 4-byte keys ranging from 1k to 17k in number. It
probably doesn't help that codepoints have nul bytes and often cluster
into consecutive ranges. In addition, I found that a couple of the
candidate hash multipliers don't compile to shift-and-add
instructions, although they were chosen with that intent in mind. It
seems compilers will only do that if the number is exactly 2^n +/- 1.
Using the latest gcc and clang, I tested all prime numbers up to 5000
(plus 8191 for good measure), and found a handful that are compiled
into non-imul instructions. Dialing back the version, gcc 4.8 and
clang 7.0 are the earliest I found that have the same behavior as
newer ones. For reference:
https://gcc.godbolt.org/z/bxcXHu
In addition to shift-and-add, there are also a few using lea,
lea-and-add, or 2 leas.
Then I used the attached program to measure various combinations of
compiled instructions using two constant multipliers iterating over
bytes similar to a generated hash function.
<cc> -O2 -Wall test-const-mult.c test-const-mult-2.c
./a.out
Median of 3 with clang 10:
lea, lea 0.181s
lea, lea+add 0.248s
lea, shift+add 0.251s
lea+add, shift+add 0.273s
shift+add, shift+add 0.276s
2 leas, 2 leas 0.290s
shift+add, imul 0.329s
Taking this with a grain of salt, it nonetheless seems plausible that
a single lea could be faster than any two instructions here. The only
primes that compile to a single lea are 3 and 5, but I've found those
multipliers can build hash functions for all our keyword lists, as
demonstration. None of the others we didn't have already are
particularly interesting from a performance point of view.
With the unicode quick check, I found that the larger sets need (257,
8191) as multipliers to build the hash table, and none of the smaller
special primes I tested will work.
Keeping these two properties in mind, I came up with the scheme in the
attached patch that tries adjacent pairs in this array:
(3, 5, 17, 31, 127, 257, 8191)
so that we try (3,5) first, next (5,17), and then all the pure
shift-and-adds with (257,8191) last.
The main motivation is to be able to build the unicode quick check
tables, but if we ever use this functionality in a hot code path, we
may as well try to shave a few more cycles while we're at it.
[1]: /messages/by-id/c1909f27-c269-2ed9-12f8-3ab72c8caf7a@2ndquadrant.com
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
tweak-perfect-hash-gen.patchapplication/x-patch; name=tweak-perfect-hash-gen.patchDownload
diff --git a/src/tools/PerfectHash.pm b/src/tools/PerfectHash.pm
index 74fb1f2ef6..4a029621b2 100644
--- a/src/tools/PerfectHash.pm
+++ b/src/tools/PerfectHash.pm
@@ -78,18 +78,25 @@ sub generate_hash_function
# Try different hash function parameters until we find a set that works
# for these keys. The multipliers are chosen to be primes that are cheap
- # to calculate via shift-and-add, so don't change them without care.
+ # to calculate via lea or shift, so don't change them without care.
+ # The list of multipliers is designed such that by iterating through
+ # adjacent pairs, we first try a couple combinations where at least one
+ # multiplier is compiled to a single lea (3 or 5). Also, we eventually
+ # want to include the largest numbers in our search for the sake of
+ # finicky key sets such as a large number of unicode codepoints.
# (Commonly, random seeds are tried, but we want reproducible results
# from this program so we don't do that.)
- my $hash_mult1 = 31;
+ my @HASH_MULTS = (3, 5, 17, 31, 127, 257, 8191);
+ my $hash_mult1;
my $hash_mult2;
my $hash_seed1;
my $hash_seed2;
my @subresult;
FIND_PARAMS:
- foreach (127, 257, 521, 1033, 2053)
+ foreach my $i (0 .. $#HASH_MULTS - 1)
{
- $hash_mult2 = $_; # "foreach $hash_mult2" doesn't work
+ $hash_mult1 = $HASH_MULTS[$i];
+ $hash_mult2 = $HASH_MULTS[$i+1];
for ($hash_seed1 = 0; $hash_seed1 < 10; $hash_seed1++)
{
for ($hash_seed2 = 0; $hash_seed2 < 10; $hash_seed2++)
Hi,
On 2020-03-30 21:33:14 +0800, John Naylor wrote:
Then I used the attached program to measure various combinations of
compiled instructions using two constant multipliers iterating over
bytes similar to a generated hash function.
It looks like you didn't attach the program?
<cc> -O2 -Wall test-const-mult.c test-const-mult-2.c
./a.out
Median of 3 with clang 10:lea, lea 0.181s
lea, lea+add 0.248s
lea, shift+add 0.251slea+add, shift+add 0.273s
shift+add, shift+add 0.276s2 leas, 2 leas 0.290s
shift+add, imul 0.329sTaking this with a grain of salt, it nonetheless seems plausible that
a single lea could be faster than any two instructions here.
It's a bit complicated by the fact that there's more execution ports to
execute shift/add than there ports to compute some form of leas. And
some of that won't easily be measurable in a micro-benchmark, because
there'll be dependencies between the instruction preventing any
instruction level parallelism.
I think the form of lea generated here is among the ones that can only
be executed on port 1. Whereas e.g. an register+register/immediate add
can be executed on four different ports.
There's also a significant difference in latency that you might not see
in your benchmark. E.g. on coffee lake the relevant form of lea has a
latency of three cycles, but one independent lea can be "started" per
cycle (agner calls this "reciprocal throughput). Whereas a shift has a
latency of 1 cycle and a reciprocal throughput of 0.5 (lower is better),
add has a latency o 1 and a reciprocal throughput of 0.25.
See the tables in https://www.agner.org/optimize/instruction_tables.pdf
I'm not really sure my musings above matter terribly much, but I just
wanted to point out why I'd not take too much stock in the above timings
in isolation. Even a very high latency wouldn't necessarily be penalized
in a benchmark with one loop iteration independent from each other, but
would matter in the real world.
Cool work!
Greetings,
Andres Freund
On Tue, Mar 31, 2020 at 2:31 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2020-03-30 21:33:14 +0800, John Naylor wrote:
Then I used the attached program to measure various combinations of
compiled instructions using two constant multipliers iterating over
bytes similar to a generated hash function.It looks like you didn't attach the program?
Funny, I did, but then decided to rename the files. Here they are. I
tried to make the loop similar to how it'd be in the actual hash
function, but leaving out the post-loop modulus and array access. Each
loop iteration is dependent on the last one's result.
It's a bit complicated by the fact that there's more execution ports to
execute shift/add than there ports to compute some form of leas. And
some of that won't easily be measurable in a micro-benchmark, because
there'll be dependencies between the instruction preventing any
instruction level parallelism.I think the form of lea generated here is among the ones that can only
be executed on port 1. Whereas e.g. an register+register/immediate add
can be executed on four different ports.
That's interesting, I'll have to look into that.
Thanks for the info!
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Mar 31, 2020 at 2:31 AM Andres Freund <andres@anarazel.de> wrote:
I think the form of lea generated here is among the ones that can only
be executed on port 1. Whereas e.g. an register+register/immediate add
can be executed on four different ports.
I looked into slow vs. fast leas, and I think the above are actually
fast because they have 2 operands.
leal (%rdi,%rdi,2), %eax
A 3-op lea would look like this:
leal 42(%rdi,%rdi,8), %ecx
In other words, the scale doesn't count as an operand. Although I've
seen in a couple places say that a non-1 scale adds a cycle of latency
for some AMD chips.
Some interesting discussion in these LLVM commits and discussion from
2017 about avoiding slow leas:
https://reviews.llvm.org/D32277
https://reviews.llvm.org/D32352
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Mar 31, 2020 at 4:05 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
On Tue, Mar 31, 2020 at 2:31 AM Andres Freund <andres@anarazel.de> wrote:
I think the form of lea generated here is among the ones that can only
be executed on port 1. Whereas e.g. an register+register/immediate add
can be executed on four different ports.I looked into slow vs. fast leas, and I think the above are actually
fast because they have 2 operands.
No, scratch that, it seems the two forms of lea are:
leal (,%rdx,8), %ecx
leal (%rdx,%rdx,8), %ecx
The first operand in both is the implicit zero, so with 3 and 5 we do
get the slow lea on some architectures. So I've only kept the
shift-and-add multipliers in v2. I also changed the order of iteration
of the parameters, for speed. Before, it took over 30 seconds to build
the unicode quick check tables, now it takes under 2 seconds.
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v2-tweak-perfect-hash-gen.patchapplication/octet-stream; name=v2-tweak-perfect-hash-gen.patchDownload
diff --git a/src/tools/PerfectHash.pm b/src/tools/PerfectHash.pm
index 74fb1f2ef6..0ad619eea8 100644
--- a/src/tools/PerfectHash.pm
+++ b/src/tools/PerfectHash.pm
@@ -79,19 +79,26 @@ sub generate_hash_function
# Try different hash function parameters until we find a set that works
# for these keys. The multipliers are chosen to be primes that are cheap
# to calculate via shift-and-add, so don't change them without care.
+ # The largest multipliers in the array below are needed for finicky key
+ # sets such as a large number of short keys with multiple nul bytes.
# (Commonly, random seeds are tried, but we want reproducible results
# from this program so we don't do that.)
- my $hash_mult1 = 31;
+ my @HASH_MULTS = (17, 31, 127, 257, 8191);
+ my $hash_mult1;
my $hash_mult2;
my $hash_seed1;
my $hash_seed2;
my @subresult;
FIND_PARAMS:
- foreach (127, 257, 521, 1033, 2053)
+ # For speed, move on the next pair of multipliers after trying all
+ # the seed2 values. Only start varying seed1 after all previous
+ # combinations have been tried.
+ for ($hash_seed1 = 0; $hash_seed1 < 10; $hash_seed1++)
{
- $hash_mult2 = $_; # "foreach $hash_mult2" doesn't work
- for ($hash_seed1 = 0; $hash_seed1 < 10; $hash_seed1++)
+ foreach my $i (0 .. $#HASH_MULTS - 1)
{
+ $hash_mult1 = $HASH_MULTS[$i];
+ $hash_mult2 = $HASH_MULTS[$i+1];
for ($hash_seed2 = 0; $hash_seed2 < 10; $hash_seed2++)
{
@subresult = _construct_hash_table(