General purpose hashing func in pgbench
Hi hackers,
Following up the recent discussion on zipfian distribution I was trying
to reproduce some YCSB-like workloads. As this paper [1]http://www.brianfrankcooper.net/home/publications/ycsb.pdf describes, YCSB
uses zipfian distribution to generate keys in order simulate intensive
load on small number of records as it happens in real world applications
(e.g. blogs). One problem is that most popular records keys are
clustered together. To scatter them across the keyspace authors use
hashing, the FNV-1a hash function in particular [2]https://en.wikipedia.org/wiki/Fowler–Noll–Vo_hash_function.
I've read Fabien Coelho's thread on additional operators and functions.
Generally it could be possible to implement some basic hashing
algorithms right in a pgbench script using just bitwise and arithmetic
operators. But should we probably provide users with some general
purpose hash function?
The attached patch introduces hash() function which implements FNV-1a as
an example of such hashing algorithm. There are also couple of images in
the attachement that I have got from visualizing original zipfian
distribution and the hashed one. Usage example:
In psql:
create table abc as select generate_series(0, 999) as a, 0 as b;
pgbench script:
\set rnd random_zipfian(0, 1000000, 0.99)
\set key abs(hash(:rnd)) % 1000
begin;
update abc set b = b + 1 where a = :key;
end;
Any thoughts or suggestions?
[1]: http://www.brianfrankcooper.net/home/publications/ycsb.pdf
[2]: https://en.wikipedia.org/wiki/Fowler–Noll–Vo_hash_function
Thanks!
--
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hello Ildar,
Following up the recent discussion on zipfian distribution I was trying
to reproduce some YCSB-like workloads. As this paper [1] describes, YCSB
uses zipfian distribution to generate keys in order simulate intensive
load on small number of records as it happens in real world applications
(e.g. blogs). One problem is that most popular records keys are
clustered together. To scatter them across the keyspace authors use
hashing, the FNV-1a hash function in particular [2].
Yes, clustering is an issue for realistic load tests and may influence the
resulting measured performance unduly.
I've read Fabien Coelho's thread on additional operators and functions.
Generally it could be possible to implement some basic hashing
algorithms right in a pgbench script using just bitwise and arithmetic
operators. But should we probably provide users with some general
purpose hash function?
The problem I see with hash functions is that it generates collisions,
which is an undesirable effect when dealing with keys as it biases the
initial randomization.
Thus I intend to submit a parametric pseudo-random permutation function,
some day. As I foresaw that it might require some arguing I did not
include the function in the operators and functions collection I
submitted, so as not to slow down the inclusion process. Given that the
patch was submitted over 20 months ago and is still stuck in the CF, that
was a misjudgement:-)
This is no no way a point against including one/some hash functions,
especially if they actually appear in a benchmark.
The attached patch introduces hash() function which implements FNV-1a as
an example of such hashing algorithm. There are also couple of images in
the attachement that I have got from visualizing original zipfian
distribution and the hashed one. Usage example:
In psql:
create table abc as select generate_series(0, 999) as a, 0 as b;pgbench script:
\set rnd random_zipfian(0, 1000000, 0.99)
\set key abs(hash(:rnd)) % 1000
begin;
update abc set b = b + 1 where a = :key;
end;Any thoughts or suggestions?
As there may be several hash functions included in the long run. I'd
suggest that the hash function should be named more precisely, eg
"hash_fnv1a".
The image looks like the distribution is more regularly scattered than
actually randomized... Maybe this is because the first highest 256 values
are really scattered by the process multiply/modulo process. Or maybe this
is an optical effect?
ISTM that there are undesired utf8 chars in a comment. Should be kept
ASCII.
I would put the actual hash computation in a separate function rather than
inlined in the evaluator.
Add the submission to the next CF?
--
Fabien.
Hello Fabien,
20/12/2017 10:36, Fabien COELHO пишет:
As there may be several hash functions included in the long run. I'd
suggest that the hash function should be named more precisely, eg
"hash_fnv1a".
Done. Added "hash_murmur2" too, see below.
The image looks like the distribution is more regularly scattered than
actually randomized... Maybe this is because the first highest 256
values are really scattered by the process multiply/modulo process. Or
maybe this is an optical effect?
After your comment I searched the internet for different hashing
algorithms comparison wrt randomness and found an interesting post at
stackexchange [1]https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed. According to author's research the murmur2 algorithm
has the best randomness rate (among those he tested). So I implemented
it (using original code by Austin Appleby as a reference [2]https://github.com/aappleby/smhasher/blob/master/src/MurmurHash2.cpp) and
conducted few experiments. Results are in attachement. Indeed, comparing
to murmur2 the FNV distribution seems pretty regular.
ISTM that there are undesired utf8 chars in a comment. Should be kept
ASCII.
Oops, I copy-pasted the algorithm name from wikipedia, didn't notice
there were some fancy unicode hyphens.
I would put the actual hash computation in a separate function rather
than inlined in the evaluator.
Done.
Add the submission to the next CF?
I think it is not commitfest ready yet -- I need to add some
documentation and tests first.
[1]: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
[2]: https://github.com/aappleby/smhasher/blob/master/src/MurmurHash2.cpp
--
Ildar Musin
Postgres Professional:
http://www.postgrespro.com
Russian Postgres Company
Add the submission to the next CF?
I think it is not commitfest ready yet -- I need to add some
documentation and tests first.
It just helps to that the thread is referenced, and the review process has
started anyway.
--
Fabien.
21/12/2017 15:44, Fabien COELHO пишет:
Add the submission to the next CF?
I think it is not commitfest ready yet -- I need to add some
documentation and tests first.It just helps to that the thread is referenced, and the review process
has started anyway.
You are right, I've submitted the patch to upcoming commitfest.
Thanks!
--
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
I think it is not commitfest ready yet -- I need to add some
documentation and tests first.
Yes, doc & test are missing.
From your figures, the murmur2 algorithm output looks way better. I'm
wondering whether it makes sense to provide a bad hash function if a
good/better one is available, unless the bad one actually appears in some
benchmark... So I would suggest to remove fnv1a.
One implementation put constants in defines, the other one uses "const
int". The practice in pgbench seems to use defines (eg
MIN_GAUSSIAN_PARAM...), so I would suggest to stick to this style.
I'm wondering whether "hash" should be a shorthand for one hash functions,
as a provided default chosen for its quality and efficiency.
--
Fabien.
21/12/2017 18:26, Fabien COELHO пишет:
I think it is not commitfest ready yet -- I need to add some
documentation and tests first.Yes, doc & test are missing.
From your figures, the murmur2 algorithm output looks way better. I'm
wondering whether it makes sense to provide a bad hash function if a
good/better one is available, unless the bad one actually appears in
some benchmark... So I would suggest to remove fnv1a.
Actually the "bad" one appears in YCSB. But if we should choose the only
one I would stick to murmur too given it provides better results while
having similar computational complexity.
One implementation put constants in defines, the other one uses "const
int". The practice in pgbench seems to use defines (eg
MIN_GAUSSIAN_PARAM...), so I would suggest to stick to this style.
That had been my intention from the start until I coded it that way and
it looked ugly and hard to read (IMHO), like:
k *= MURMUR2_M;
k ^= k >> MURMUR2_R;
k *= MURMUR2_M;
result ^= k;
result *= MURMUR2_M;
...etc. And besides it is not a parameter to be tuned and not something
someone would ever want to change; it appears in just a single function.
So I'd better leave it the way it is. Actually I was thinking to do the
same to fnv1a too : )
--
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hello Ildar,
Actually the "bad" one appears in YCSB.
Fine. Then it must be kept, whatever its quality.
But if we should choose the only one I would stick to murmur too given
it provides better results while having similar computational
complexity.
No. Keep both as there is a justification for the bad one. Just make
"hash()" default to a good one.
One implementation put constants in defines, the other one uses "const
int". [...][...] it looked ugly and hard to read (IMHO), like:
k *= MURMUR2_M;
k ^= k >> MURMUR2_R;
k *= MURMUR2_M;
result ^= k;
result *= MURMUR2_M;
Yep. The ugliness is significantly linked to the choice of name. With
MM2_MUL and MM2_ROT ISTM that it is more readable:
k *= MM2_MUL;
k ^= k >> MM2_ROT;
k *= MM2_MUL;
result ^= k;
result *= MM2_MUL;
[...] So I'd better leave it the way it is. Actually I was thinking to
do the same to fnv1a too : )
I think that the implementation style should be homogeneous, so I'd
suggest at least to stick to one style.
I noticed from the source of all human knowledege (aka Wikipedia:-) that
there seems to be a murmur3 successor. Have you considered it? One good
reason to skip it would be that the implementation is long and complex.
I'm not sure about a 8-byte input simplified version.
Just a question: Have you looked at SipHash24?
https://en.wikipedia.org/wiki/SipHash
The interesting point is that it can use a key and seems somehow
cryptographically secure, for a similar cost. However the how to decide
for/control the key is unclear.
--
Fabien.
Hello Fabien,
24/12/2017 11:12, Fabien COELHO пишет:
Yep. The ugliness is significantly linked to the choice of name. With
MM2_MUL and MM2_ROT ISTM that it is more readable:k *= MM2_MUL;
k ^= k >> MM2_ROT;
k *= MM2_MUL;
result ^= k;
result *= MM2_MUL;
Ok, will do.
[...] So I'd better leave it the way it is. Actually I was thinking
to do the same to fnv1a too : )I think that the implementation style should be homogeneous, so I'd
suggest at least to stick to one style.I noticed from the source of all human knowledege (aka Wikipedia:-)
that there seems to be a murmur3 successor. Have you considered it?
One good reason to skip it would be that the implementation is long
and complex. I'm not sure about a 8-byte input simplified version.
Murmur2 naturally supports 8-byte data. Murmur3 has 32- and 128-bit
versions. So to handle int64 I could
1) split input value into two halfs and combine somehow the results of
32 bit version or
2) use 128-bit version and discard higher bytes.
Btw, postgres core already has a 32bit murmur3 implementation, but it
only uses the finalization part of algorithm (see murmurhash32). As my
colleague Yura Sokolov told me in a recent conversation it alone
provides pretty good randomization. I haven't tried it yet though.
Just a question: Have you looked at SipHash24?
https://en.wikipedia.org/wiki/SipHash
The interesting point is that it can use a key and seems somehow
cryptographically secure, for a similar cost. However the how to
decide for/control the key is unclear.
Not yet. As I can understand from the wiki its main feature is to
prevent attacks with crafted input data. How can it be useful in
benchmarking? Unless it shows superior performance and randomization.
--
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hello,
I noticed from the source of all human knowledege (aka Wikipedia:-)
that there seems to be a murmur3 successor. Have you considered it?
One good reason to skip it would be that the implementation is long
and complex. I'm not sure about a 8-byte input simplified version.Murmur2 naturally supports 8-byte data. Murmur3 has 32- and 128-bit
versions.
Ok.
So it would make sense to keep the 64 bit murmur2 version.
Pgbench ints are 64 bits, seems logical to keep them that way, so 32 bits
versions do not look too interesting.
Just a question: Have you looked at SipHash24?
Not yet. As I can understand from the wiki its main feature is to
prevent attacks with crafted input data. How can it be useful in
benchmarking?
No and yes:-)
The attack prevention is pretty useless in the context.
However, the key can be used if controlled so that different values do not
have the same randomization in different part of the script, so as to
avoid using the same patterns for different things if not desirable.
For the pgbench pseudo-random permutation I'm looking at, the key can be
specified to control whether the same pattern or not should be used.
--
Fabien.
25/12/2017 17:12, Fabien COELHO пишет:
However, the key can be used if controlled so that different values do
not have the same randomization in different part of the script, so as
to avoid using the same patterns for different things if not desirable.
Original murmur algorithm accepts seed as a parameter, which can be used
for this purpose. I used value itself as a seed in the patch, but I can
turn it into a function argument.
For the pgbench pseudo-random permutation I'm looking at, the key can
be specified to control whether the same pattern or not should be used.
Just curious, which algorithm are you intended to choose?
--
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hello,
However, the key can be used if controlled so that different values do
not have the same randomization in different part of the script, so as
to avoid using the same patterns for different things if not desirable.Original murmur algorithm accepts seed as a parameter, which can be used
for this purpose. I used value itself as a seed in the patch, but I can
turn it into a function argument.
Hmmm. Possibly. However it needs a default, something like
x = hash(v[, key])
with key some pseudo-random value?
For the pgbench pseudo-random permutation I'm looking at, the key can
be specified to control whether the same pattern or not should be used.Just curious, which algorithm are you intended to choose?
I have found nothing convincing, so I'm inventing.
Most "permutation" functions are really cryptographic cyphers which are
quite expensive, and require powers of two, which is not what is needed.
ISTM that there are some constructs to deal with arbitrary sizes based on
cryptographic functions, but that would make it too expensive for the
purpose.
Currently I use the key as a seed for a cheap a prng, two rounds of
overlapping (to deal with non power of two) xor (from the prng output) and
prime-multiply-modulo to scatter the value. See attached work in progress.
If you have a pointer for a good and cheap generic (any size) keyed
permutation, feel free to share it.
--
Fabien.
Attachments:
test_hash_2.ctext/x-csrc; name=test_hash_2.cDownload
Fabien COELHO wrote:
Most "permutation" functions are really cryptographic cyphers which are
quite expensive, and require powers of two, which is not what is needed.
ISTM that there are some constructs to deal with arbitrary sizes based on
cryptographic functions, but that would make it too expensive for the
purpose.
FWIW, you might have a look at the permuteseq extension:
https://pgxn.org/dist/permuteseq
It permutes an arbitrary range of 64-bit integers into itself,
with a user-supplied key as the seed.
Outputs are coerced into the desired range by using the
smallest possible power of two for the Feistel cypher's
block size, and then cycle-walking over the results.
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
Bonjour Daniel,
Most "permutation" functions are really cryptographic cyphers which are
quite expensive, and require powers of two, which is not what is needed.
ISTM that there are some constructs to deal with arbitrary sizes based on
cryptographic functions, but that would make it too expensive for the
purpose.FWIW, you might have a look at the permuteseq extension:
https://pgxn.org/dist/permuteseq
It permutes an arbitrary range of 64-bit integers into itself,
with a user-supplied key as the seed.
Outputs are coerced into the desired range by using the
smallest possible power of two for the Feistel cypher's
block size, and then cycle-walking over the results.
Thanks for the pointer.
I must admit that I do not like much the iterative "cycle walking"
approach because it can be much more expensive for some values, and it
makes the cost non uniform. Without that point, the overall encryption
looks quite costly.
For testing purposes, I'm looking for "pretty cheap" and "good enough",
and for that I'm ready to forsake "cryptographic":-)
Thanks again, it made an interesting read!
--
Fabien.
Greetings Ildar,
* Fabien COELHO (coelho@cri.ensmp.fr) wrote:
I noticed from the source of all human knowledege (aka Wikipedia:-)
that there seems to be a murmur3 successor. Have you considered it?
One good reason to skip it would be that the implementation is long
and complex. I'm not sure about a 8-byte input simplified version.Murmur2 naturally supports 8-byte data. Murmur3 has 32- and 128-bit
versions.Ok.
So it would make sense to keep the 64 bit murmur2 version.
Pgbench ints are 64 bits, seems logical to keep them that way, so 32 bits
versions do not look too interesting.
This sounds like it was settled and the patch needs updating with these
changes, and with documentation and tests added.
While the commitfest is still pretty early on, I would suggest trying to
find time soon to update the patch and resubmit it, so it can get
another review and hopefully be included during this commitfest.
Also, I've not checked myself, but the patch appears to be failing for
http://commitfest.cputube.org/ so it likely also needs a rebase.
Thanks!
Stephen
Hello Fabien,
25/12/2017 19:17, Fabien COELHO пишет:
However, the key can be used if controlled so that different values do
not have the same randomization in different part of the script, so as
to avoid using the same patterns for different things if not desirable.Original murmur algorithm accepts seed as a parameter, which can be used
for this purpose. I used value itself as a seed in the patch, but I can
turn it into a function argument.Hmmm. Possibly. However it needs a default, something like
x = hash(v[, key])
with key some pseudo-random value?
Sorry for a long delay. I've added hash() function which is just an
alias for murmur2. I've also utilized variable arguments feature from
least()/greates() functions to make optional seed parameter, but I
consider this as a hack. Should we probably add some infrastructure for
optional arguments? Docs and tests are on their way.
Thanks!
--
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
pgbench_hash_v3.patchtext/plain; charset=UTF-8; name=pgbench_hash_v3.patch; x-mac-creator=0; x-mac-type=0Download+98-2
09/01/2018 19:22, Ildar Musin пишет:
Hello Fabien,
25/12/2017 19:17, Fabien COELHO пишет:
However, the key can be used if controlled so that different values do
not have the same randomization in different part of the script, so as
to avoid using the same patterns for different things if not desirable.Original murmur algorithm accepts seed as a parameter, which can be used
for this purpose. I used value itself as a seed in the patch, but I can
turn it into a function argument.Hmmm. Possibly. However it needs a default, something like
x = hash(v[, key])
with key some pseudo-random value?
Sorry for a long delay. I've added hash() function which is just an
alias for murmur2. I've also utilized variable arguments feature from
least()/greates() functions to make optional seed parameter, but I
consider this as a hack. Should we probably add some infrastructure for
optional arguments? Docs and tests are on their way.Thanks!
I forgot to attach couple of fresh charts that show how distribution
changes with respect to seed.
--
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hello Ildar,
Sorry for a long delay. I've added hash() function which is just an
alias for murmur2. I've also utilized variable arguments feature from
least()/greates() functions to make optional seed parameter, but I
consider this as a hack.
Patch needs a rebase after Teodor push for a set of pgbench functions.
Should we probably add some infrastructure for optional arguments?
You can look at the handling of "CASE" which may or may not have an "ELSE"
clause.
I'd suggest you use a new negative argument with the special meaning for
hash, and create the seed value when missing when building the function,
so as to simplify the executor code.
Instead of 0, I'd consider providing a random default so that the hashing
behavior is not the same from one run to the next. What do you think?
Like the previous version, murmur2 with seed looks much more random than
fnv with seed...
--
Fabien.
09/01/2018 23:11, Fabien COELHO пишет:
Hello Ildar,
Sorry for a long delay. I've added hash() function which is just an
alias for murmur2. I've also utilized variable arguments feature from
least()/greates() functions to make optional seed parameter, but I
consider this as a hack.Patch needs a rebase after Teodor push for a set of pgbench functions.
Done. Congratulations on your patch finally being committed : )
Should we probably add some infrastructure for optional arguments?
You can look at the handling of "CASE" which may or may not have an
"ELSE" clause.I'd suggest you use a new negative argument with the special meaning
for hash, and create the seed value when missing when building the
function, so as to simplify the executor code.
Added a new nargs option -3 for hash functions and moved arguments check
to parser. It's starting to look a bit odd and I'm thinking about
replacing bare numbers (-1, -2, -3) with #defined macros. E.g.:
#define PGBENCH_NARGS_VARIABLE (-1)
#define PGBENCH_NARGS_CASE (-2)
#define PGBENCH_NARGS_HASH (-3)
Instead of 0, I'd consider providing a random default so that the
hashing behavior is not the same from one run to the next. What do you
think?
Makes sence since each thread is also initializes its random_state with
random numbers before start. So I added global variable 'hash_seed' and
initialize it with random() before threads spawned.
Like the previous version, murmur2 with seed looks much more random
than fnv with seed...
--
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
10/01/2018 16:35, Ildar Musin пишет:
09/01/2018 23:11, Fabien COELHO пишет:
Hello Ildar,
Sorry for a long delay. I've added hash() function which is just an
alias for murmur2. I've also utilized variable arguments feature from
least()/greates() functions to make optional seed parameter, but I
consider this as a hack.Patch needs a rebase after Teodor push for a set of pgbench functions.
Done. Congratulations on your patch finally being committed : )
Should we probably add some infrastructure for optional arguments?
You can look at the handling of "CASE" which may or may not have an
"ELSE" clause.I'd suggest you use a new negative argument with the special meaning
for hash, and create the seed value when missing when building the
function, so as to simplify the executor code.Added a new nargs option -3 for hash functions and moved arguments check
to parser. It's starting to look a bit odd and I'm thinking about
replacing bare numbers (-1, -2, -3) with #defined macros. E.g.:#define PGBENCH_NARGS_VARIABLE (-1)
#define PGBENCH_NARGS_CASE (-2)
#define PGBENCH_NARGS_HASH (-3)Instead of 0, I'd consider providing a random default so that the
hashing behavior is not the same from one run to the next. What do you
think?Makes sence since each thread is also initializes its random_state with
random numbers before start. So I added global variable 'hash_seed' and
initialize it with random() before threads spawned.Like the previous version, murmur2 with seed looks much more random
than fnv with seed...
Sorry, here is the patch
--
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company