gaussian distribution pgbench

Started by KONDO Mitsumasaover 12 years ago61 messageshackers
Jump to latest
#1KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp

Hello,

I create gaussinan distribution pgbench patch that can access records with
gaussian frequency. And I submit this commit fest.

* Purpose this patch
In the general transaction situation, clients access for all records equally is
hard to happen. I think gaussian distribution access patterns are most of
transaction petterns in general. My patch realizes neary this access pattern.

I think that not only it can simulate a general access pattern as an effect of
this patch, but also it is useful for new development features such as effective
use and free of shared_buffers, the readahead optimization in the OS, and the
speed-up of the tuple level lock.

* Usage
It is easy to use, only put -g with standard deviation threshold parameter.
If we set larger standard deviation threshold, pgbench access patern limited
more specific records. Min standard deviation threshold is 2.

Execution example command is here.

[mitsu-ko@localhost postgresql]$ bin/pgbench -g 10 -c 16 -j 8 -T 300
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
standard deviation threshold: 10.00000
access probability of top 20%, 10% and 5% records: 0.95450 0.68269 0.38292
query mode: simple
number of clients: 16
number of threads: 8
duration: 300 s
number of transactions actually processed: 566367
tps = 1887.821409 (including connections establishing)
tps = 1887.949390 (excluding connections establishing)

"access probability" indicates top N access probability in this benchmark.
If we set larger standard deviation threshold parameter, it become more large.

Attached png files which are "gausian_2.png" and "gaussian_10.png" indicate
gaussian distribution access patern by my patch. "no_gaussian.png" is not with -g
option (normal). I think my patch realize gaussian distribution access patern.

* Approach
It replaces uniform random number generator to gaussian distribution random
number generator using by box-muller tansform method. Then, I use standard
deviation threshold parameter for mapping a normal distribution access pattern in
each record and normalization. It is linear mappping method that is a floating
point to an integer value.

* other
I also create another patches that can get more accurate benchmark result in
pgbench, and will submit them this commit fest. They are like that I submitted
checkpoint patch in the past. They are all right, too!

Any question?

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachments:

gaussian_pgbench_v0.patchtext/x-diff; name=gaussian_pgbench_v0.patchDownload+58-3
gaussian_2.pngimage/png; name=gaussian_2.pngDownload
gaussian_10.pngimage/png; name=gaussian_10.pngDownload
no_gaussian.pngimage/png; name=no_gaussian.pngDownload
#2Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: KONDO Mitsumasa (#1)
Re: gaussian distribution pgbench

KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp> wrote:

I create gaussinan distribution pgbench patch that can access
records with gaussian frequency. And I submit this commit fest.

Thanks!

I have moved this to the Open CommitFest, though.

https://commitfest.postgresql.org/action/commitfest_view/open

You had accidentally added to the CF In Progress.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Fabien COELHO
coelho@cri.ensmp.fr
In reply to: KONDO Mitsumasa (#1)
Re: gaussian distribution pgbench

Hello Mitsumasa,

In the general transaction situation, clients access for all records equally is
hard to happen. I think gaussian distribution access patterns are most of
transaction petterns in general. My patch realizes neary this access pattern.

That is great! I was just looking for something like that!

I have not looked at the patch yet, but from the plots you sent, it seems
that it is a gaussian distribution over the keys. However this pattern
induces stronger cache effects which are maybe not too realistic, because
neighboring keys in the middle are more likely to be chosen.

It seems to me that this is not desirable.

Have you considered adding a "randomization" layer, that is once you have
a key in [1 .. n] centered around n/2, then you perform a pseudo-random
transformation into the same domain so that key values are scattered over
the whole domain?

--
Fabien.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Mitsumasa KONDO
kondo.mitsumasa@gmail.com
In reply to: Fabien COELHO (#3)
Re: gaussian distribution pgbench

You had accidentally added to the CF In Progress.

Oh, I had completely mistook this CF schedule :-)

Maybe, Horiguchi-san is same situation...

However, because of your moving, I become first submitter in next CF.

Thank you for moving :-)

--

Mitsumasa KONDO

#5Mitsumasa KONDO
kondo.mitsumasa@gmail.com
In reply to: Mitsumasa KONDO (#4)
Re: gaussian distribution pgbench

However this pattern induces stronger cache effects which are maybe not

too realistic,

because neighboring keys in the middle are more likely to be chosen.

I think that your opinion is right. However, in effect, it is a
paseudo-benchmark, so that I think that such a simple mechanism is also
necessary.

Have you considered adding a "randomization" layer, that is once you have

a key in [1 .. > n] centered around n/2, then you perform a pseudo-random
transformation into the same > domain so that key values are scattered over
the whole domain?

Yes. I also consider this patch. It can realize by adding linear mapping
array which is created by random generator. However, current erand48
algorithm is not high accuracy and fossil algorithm, I do not know whether
it works well. If we realize it, we may need more accurate random generator
algorithm which is like Mersenne Twister*.*

Regards,

--

Mitsumasa KONDO

#6Peter Eisentraut
peter_e@gmx.net
In reply to: KONDO Mitsumasa (#1)
Re: gaussian distribution pgbench

On 9/20/13 2:42 AM, KONDO Mitsumasa wrote:

I create gaussinan distribution pgbench patch that can access records with
gaussian frequency. And I submit this commit fest.

This patch no longer applies.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Kevin Grittner (#2)
Re: gaussian distribution pgbench

Sorry for my delay reply.
Since I have had vacation last week, I replyed from gmail.
However, it was stalled post to pgsql-hackers:-(

(2013/09/21 6:05), Kevin Grittner wrote:

You had accidentally added to the CF In Progress.

Oh, I had completely mistook this CF schedule :-)
Maybe, Horiguchi-san is same situation...

However, because of your moving, I become first submitter in next CF.
Thank you for moving !
--
Mitsumasa KONDO
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Fabien COELHO (#3)
Re: gaussian distribution pgbench

Sorry for my delay reply.
Since I have had vacation last week, I replied from gmail.
However, it was stalled post to pgsql-hackers:-(

(2013/09/21 7:54), Fabien COELHO wrote:

However this pattern induces stronger cache effects which are maybe not too realistic,
because neighboring keys in the middle are more likely to be chosen.

I think that your opinion is right. However, in effect, it is a
paseudo-benchmark, so that I think that such a simple mechanism is also necessary.

Have you considered adding a "randomization" layer, that is once you have a key in [1 .. > n] centered around n/2, then you perform a pseudo-random transformation into the same > domain so that key values are scattered over the whole domain?

Yes. I also consider this patch. It can realize by adding linear mapping array
which is created by random generator. However, current erand48 algorithm is not
high accuracy and fossil algorithm, I do not know whether it works well. If we
realize it, we may need more accurate random generator algorithm which is like
Mersenne Twister.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Peter Eisentraut (#6)
Re: gaussian distribution pgbench

(2013/09/27 5:29), Peter Eisentraut wrote:

This patch no longer applies.

I will try to create this patch in next commit fest.
If you have nice idea, please send me!

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: KONDO Mitsumasa (#9)
Re: gaussian distribution pgbench

On 30.09.2013 07:12, KONDO Mitsumasa wrote:

(2013/09/27 5:29), Peter Eisentraut wrote:

This patch no longer applies.

I will try to create this patch in next commit fest.
If you have nice idea, please send me!

A few thoughts on this:

1. DBT-2 uses a non-uniform distribution. You can use that instead of
pgbench.

2. Do we really want to add everything and the kitchen sink to pgbench?
Every addition is small when considered alone, but we'll soon end with a
monster. So I'm inclined to reject this patch on those grounds.

3. That said, this could be handy. But it would be even more handy if
you could get Gaussian random numbers with \setrandom, so that you could
use this with custom scripts. And once you implement that, do we
actually need the -g flag anymore? If you want TPC-B transactions with
gaussian distribution, you can write a custom script to do that. The
documentation includes a full script that corresponds to the built-in
TPC-B script.

So what I'd actually like to see is \setgaussian, for use in custom scripts.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Heikki Linnakangas (#10)
Re: gaussian distribution pgbench

3. That said, this could be handy. But it would be even more handy if you
could get Gaussian random numbers with \setrandom, so that you could use this
with custom scripts. And once you implement that, do we actually need the -g
flag anymore? If you want TPC-B transactions with gaussian distribution, you
can write a custom script to do that. The documentation includes a full
script that corresponds to the built-in TPC-B script.

So what I'd actually like to see is \setgaussian, for use in custom scripts.

Indeed, great idea! That looks pretty elegant! It would be something like:

\setgauss var min max sigma

I'm not sure whether sigma should be relative to max-min, or absolute.
I would say relative is better...

A concerned I raised is that what one should really want is a "pseudo
randomized" (discretized) gaussian, i.e. you want the probability of each
value along a gaussian distribution, *but* no direct frequency correlation
between neighbors. Otherwise, you may have unwanted/unrealistic positive
cache effects. Maybe this could be achieved by an independent built-in,
say either:

\randomize var min max [parameter ?]
\randomize var min max val [parameter]

Which would mean take variable var which must be in [min,max], and apply a
pseudo-random transformation which results is also in [min,max].

From a probabilistic point of view, it seems to me that a randomized
(discretized) exponential would be more significant to model a server
load.

\setexp var min max lambda...

--
Fabien.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Heikki Linnakangas (#10)
Re: gaussian distribution pgbench

On Thu, Nov 21, 2013 at 9:13 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

So what I'd actually like to see is \setgaussian, for use in custom scripts.

+1. I'd really like to be able to run a benchmark with a Gaussian and
uniform distribution side-by-side for comparative purposes - we need
to know that we're not optimizing one at the expense of the other.
Sure, DBT-2 gets you a non-uniform distribution, but it has serious
baggage from it being a tool primarily intended for measuring the
relative performance of different database systems. pgbench would be
pretty worthless for measuring the relative strengths and weaknesses
of different database systems, but it is not bad at informing the
optimization efforts of hackers. pgbench is a defacto standard for
that kind of thing, so we should make it incrementally better for that
kind of thing. No standard industry benchmark is likely to replace it
for this purpose, because such optimizations require relatively narrow
focus.

Sometimes I want to maximally pessimize the number of FPIs generated.
Other times I do not. Getting a sense of how something affects a
variety of distributions would be very valuable, not least since
normal distributions abound in nature.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Peter Geoghegan (#12)
Re: gaussian distribution pgbench

On 20/12/13 09:36, Peter Geoghegan wrote:

On Thu, Nov 21, 2013 at 9:13 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

So what I'd actually like to see is \setgaussian, for use in custom scripts.

+1. I'd really like to be able to run a benchmark with a Gaussian and
uniform distribution side-by-side for comparative purposes - we need
to know that we're not optimizing one at the expense of the other.
Sure, DBT-2 gets you a non-uniform distribution, but it has serious
baggage from it being a tool primarily intended for measuring the
relative performance of different database systems. pgbench would be
pretty worthless for measuring the relative strengths and weaknesses
of different database systems, but it is not bad at informing the
optimization efforts of hackers. pgbench is a defacto standard for
that kind of thing, so we should make it incrementally better for that
kind of thing. No standard industry benchmark is likely to replace it
for this purpose, because such optimizations require relatively narrow
focus.

Sometimes I want to maximally pessimize the number of FPIs generated.
Other times I do not. Getting a sense of how something affects a
variety of distributions would be very valuable, not least since
normal distributions abound in nature.

Curious, wouldn't the common usage pattern tend to favour a skewed
distribution, such as the Poisson Distribution (it has been over 40
years since I studied this area, so there may be better candidates).

Just that gut feeling & experience tends to make me think that the
"Normal" distribution may often not be the best for database access
simulation.

Cheers,
Gavin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Gregory Smith
gregsmithpgsql@gmail.com
In reply to: Gavin Flower (#13)
Re: gaussian distribution pgbench

On 12/19/13 5:52 PM, Gavin Flower wrote:

Curious, wouldn't the common usage pattern tend to favour a skewed
distribution, such as the Poisson Distribution (it has been over 40
years since I studied this area, so there may be better candidates).

Some people like database load testing with a "Pareto principle"
distribution, where 80% of the activity hammers 20% of the rows such
that locking becomes important. (That's one specific form of Pareto
distribution) The standard pgbench load indirectly gets you quite a bit
of that due to all the contention on the branches table. Targeting all
of that at a single table can be more realistic.

My last round of reviewing a pgbench change left me pretty worn out with
wanting to extend that code much further. Adding in some new
probability distributions would be fine though, that's a narrow change.
We shouldn't get too excited about pgbench remaining a great tool for
too much longer though. pgbench is fast approaching a wall nowadays,
where it's hard for any single client server to fully overload today's
larger server. You basically need a second large server to generate
load, whereas what people really want is a bunch of coordinated small
clients. (That sort of wall was in early versions too, it just got
pushed upward a lot by the multi-worker changes in 9.0 coming around the
same time desktop core counts really skyrocketed)

pgbench started as a clone of a now abandoned Java project called
JDBCBench. I've been seriously considering a move back toward that
direction lately. Nowadays spinning up ten machines to run load
generation is trivial. The idea of extending pgbench's C code to
support multiple clients running at the same time and collating all of
their results is not a project I'd be excited about. It should remain a
perfectly fine tool for PostgreSQL developers to find code hotspots, but
that's only so useful.

(At this point someone normally points out Tsung solved all of those
problems years ago if you'd only give it a chance. I think it's kind of
telling that work on sysbench is rewriting the whole thing so you can
use Lua for your test scripts.)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Gregory Smith (#14)
Re: gaussian distribution pgbench

Hi,

I revise my gaussian pgbench patch which wss requested from community.

* Changes
- Support custom script.
- "\setgaussian" is generating gaussian distribute random number.
- ex) \setgaussian [var] [min] [max] [stddev_threshold]
- We can use mixture model in multiple custom scripts.
- Delete short option "-g", and add long options ”--gaussian"
- Refactoring getrand() interface

- getrand(TState *thread, int64 min, int64 max)
+ getrand(TState *thread, int64 min, int64 max, DistType dist_type, double value1)

- We can easy to add other random distribution algorithms. Please see detail
design in attached patch.

Febien COELHO wrote:

From a probabilistic point of view, it seems to me that a randomized

(discretized) exponential would be more significant to model a server load.

\setexp var min max lambda...

I can create randomized exponential distribution under following. It is very easy.
double rand_exp( double lambda ){
return -log(Uniform(0,1))/lambda;
}
If community wants this, I will add this function in my patch.

Gavin Flower wrote:

Curious, wouldn't the common usage pattern tend to favour a skewed distribution,
such as the Poisson Distribution (it has been over 40 years since I studied
this area, so there may be better candidates).

The difference between Poisson distribution and Gaussian distribution is discrete
or not.
In my gaussian algorithm, first generating continuos gaussian distribution, next
projection to integer values which are each record, it will be discrete value.
Therefore, it will be almost simular with Poisson distribution. And when we set
larger standard deviations(higher 10), it will be created better approximation of
Poisson distribution.

Attached sql files are for custom scripts which are different distribution. It
realize mixture distribuion benchmark. And attached graph is the result.
[example command]
$pgbench -f file1.sql file2.sql

If you have more some comment, please send me.
Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachments:

gaussian_pgbench_v3.patchtext/x-diff; name=gaussian_pgbench_v3.patchDownload+316-66
file1.sqltext/plain; charset=Shift_JIS; name=file1.sqlDownload
file2.sqltext/plain; charset=Shift_JIS; name=file2.sqlDownload
graph.pngimage/png; name=graph.pngDownload
#16Fabien COELHO
coelho@cri.ensmp.fr
In reply to: KONDO Mitsumasa (#15)
Re: gaussian distribution pgbench

Hello,

I revise my gaussian pgbench patch which wss requested from community.

With a lot of delay for which I apologise, please find hereafter the
review.

Gaussian Pgbench v3 patch by Mitsumasa KONDO review

* The purpose of the patch is to allow a pgbench script to draw from normally
distributed integer values instead of uniformly distributed.

This is a valuable contribution to enable pgbench to generate more realistic
loads, which is seldom uniform in practice.

However, ISTM that other distributions such an exponantial one would make
more sense, and also the values should be further randomized so that
neighboring values are not more likely to be drawn. The latest point is non
trivial.

* Compilation

The patch applies and compiles against current head. It works as expected,
although there is few feedback from the script to show that.

* Mathematical soundness

We want to derive a discrete normal distribution from a uniform one.
Well, normal distributions are for continuous variables... Anyway, this is
done by computing a continuous normal distribution which is then projected
onto integers. I'm basically fine with that.

The system uses a Box-Muller transform (1958) to do this transformation.
The Ziggurat method seems to be prefered for this purpose, *but* it would
require precalculated tables which depends on the target values. So I'm
fine with the Box-Muller transform for pgbench.

The BM method uses 2 uniformly distributed numbers to derive 2 normally
distributed numbers. The implementation computes one of these, and loops
over till one match a threshold criterion.

More explanations, at least in comments, are needed about this threshold
and its meaning. It is required to be more than 2. I guess is that it allows
to limit the number of iterations of the while loop, but in what proportion
is unclear. The documentation does not also help the user to understand
this value and its meaning.

What I think it is: it is the deviation for the FURTHEST point around the
mean, that is the actual deviation associated to the "min" and "max" target
values. The 2 minimum value induces that there is a least 4 stddev lengths
between min & max, with the most likely mean in the middle.

If the threshold test fails, one of the 2 uniform number is redrawn, a new
candidate value is tested. I'm not at ease about why only 1 value is redrawn
and not both, some explanations would be welcome. Also, on the other hand,
why not test the other possible value (with cos) if the first one fails?

Also, as suggested above, I would like some explanations about how much this
while loop may iterate without success, say with the expected average number
of iterations with its explanation in a comment.

* Implementation

Random values :
double rand1 = 1.0 - rand; // instead of the LONG_MAX computation & limits.h
rand2 should be in (0, 1], but it is in [0, 1), use "1.0 - ..." as well?!

What is called "stdev*" in getrand() is really the chosen deviation from
the target mean, so it would make more sense to name it "dev".

I do not think that the getrand refactoring was such a good idea. I'm sorry
if I may have suggested that in a previous comment.
The new getrand possibly ignores its parameters, hmmmm. ISTM that it would
be much simpler in the code to have a separate and clean "getrand_normal"
or "getrand_gauss" called for "\setgaussian", and that's it. This would
allow to get rid of DistType and all of getrand changes in the code.

There are heavy constants computations (sqrt(log()) within the while
loop which would be moved out of the loop.

ISTM that the while condition would be easier to read as:

while ( dev < - threshold || threshold < dev )

Maybe the \\setgaussian argument handling may be transformed into a function,
so that it could be used easily later for some other distribution (say some
setexp:-)

* Options

ISTM that the test options would be better if made orthogonal, i.e. not to
have three --gaussian* options. I would suggest to have only one
--gaussian=NUM which would trigger gaussian tests with this threshold,
and --gaussian=3.5 --select-only would use the select-only variant,
and so on.

* Typos

gausian -> gaussian
patern -> pattern

* Conclusion :

- this is a valuable patch to help create more realistic load and make pgbench
a more useful tool. I'm greatly in favor of having such a functionality.

- it seems to me that the patch should be further improved before being
committed, in particular I would suggest:

(1) improve the explanations in the code and in the documentation, especially
about what is the "deviation threshold" and its precise link to generated
values.

(2) simplify the code with a separate gaussian getrand, and simpler or
more efficient code here and there, see comments above.

(3) use only one option to trigger gaussian tests.

(bonus) \setexp would be a nice:-)

--
Fabien.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Fabien COELHO (#16)
Re: gaussian distribution pgbench

Hi Febien,

Thank you very much for your very detail and useful comments!
I read your comment, I agree most of your advice:)

Attached patch is fixed for your comment. That are...
- Remove redundant long-option.
- We can use "--gaussian=NUM -S" or "--gaussian=NUMN -N" options.
- Add sentence in document
- Separate two random generate function which are uniform and gaussian.
- getGaussianrand() is created.
- Fix ranged random number more strictly, ex. (0,1) or [0,1).
- Please see comment of source code in detail:).
- Fix typo.
- Use cos() and sin() function when we generate gaussian random number.
- Add fast sqrt calculation algorithm.
- Reuse sqrt result and pre generate random number for reducing calculation cost.
- Experience of this method is under following. It will be little-bit faster
than non-reuse method. And distribution of gaussian is still good.

* Settings
shared_buffers = 1024MB

* Test script
pgbench -i -s 1
pgbench --gaussian=2 -T 30 -S -c8 -j4 -n
pgbench --gaussian=2 -T 30 -S -c8 -j4 -n
pgbench --gaussian=2 -T 30 -S -c8 -j4 -n

* Result
method | try1 | try2 | try3 |
--------------------------------------------|
reuse method | 44189 | 44453 | 44013 |
non-reuse method | 43567 | 43635 | 43508 |

(2014/02/09 21:32), Fabien COELHO wrote:

This is a valuable contribution to enable pgbench to generate more realistic
loads, which is seldom uniform in practice.

Thanks!

However, ISTM that other distributions such an exponantial one would make
more sense,

I can easy to create exponential distribution. Here, I assume exponential
distribution that is f(x) = lambda * exp^(-lambda * x) in general.
What do you think under following interface?

custom script: \setexp [varname] min max threshold
command : --exponential=NUM(threshold)

I don't want to use lambda variable for simple implementation. So lambda is
always 1. Because it can enough to control distribution by threshold. Threshold
parameter is f(x) value. And using created distribution projects to 'aid' by same
method. If you think OK, I will impliment under followings tomorrow, and also
create parseing part of this function...

do
{
rand = 1.0 - pg_erand48(thread->random_state);
rand = -log(rand);
}while( rand > exp_threshold)

return rand / exp_threshold;

and also the values should be further randomized so that
neighboring values are not more likely to be drawn. The latest point is non
trivial.

That's right, but I worry about gaussian randomness and benchmark reproducibility
might be disappeared when we re-randomized access pattern, because Postgres
storage method manages records by each pages and it is difficult to realize
access randomness in whole pages, not record. If we solve this problem, we have
to need algorithm for smart shuffule projection function that is still having
gaussian randomized. I think it will be difficult, and it have to impement in
another patch in the future.

* Mathematical soundness

We want to derive a discrete normal distribution from a uniform one.
Well, normal distributions are for continuous variables... Anyway, this is
done by computing a continuous normal distribution which is then projected
onto integers. I'm basically fine with that.

The system uses a Box-Muller transform (1958) to do this transformation.
The Ziggurat method seems to be prefered for this purpose, *but* it would
require precalculated tables which depends on the target values. So I'm
fine with the Box-Muller transform for pgbench.

Yes, that's right. I selected simple and relatively faster algorithm, that is
Box-Muller transform.

The BM method uses 2 uniformly distributed numbers to derive 2 normally
distributed numbers. The implementation computes one of these, and loops
over till one match a threshold criterion.

More explanations, at least in comments, are needed about this threshold
and its meaning. It is required to be more than 2. I guess is that it allows
to limit the number of iterations of the while loop,

Yes. This loop could not almost go on, because min stdev_threshold is 2.
The possibility of retry-loop is under 4 percent. It might not be problem.

but in what proportion
is unclear. The documentation does not also help the user to understand
this value and its meaning.

Yes, it is huristic method. So I added the comments in document.

What I think it is: it is the deviation for the FURTHEST point around the
mean, that is the actual deviation associated to the "min" and "max" target
values. The 2 minimum value induces that there is a least 4 stddev lengths
between min & max, with the most likely mean in the middle.

Correct!

If the threshold test fails, one of the 2 uniform number is redrawn, a new
candidate value is tested. I'm not at ease about why only 1 value is redrawn
and not both, some explanations would be welcome. Also, on the other hand,
why not test the other possible value (with cos) if the first one fails?

Yes, I think so too. So I fixed this partan and it will be better. Past
implementations are not good:(

Also, as suggested above, I would like some explanations about how much this
while loop may iterate without success, say with the expected average number
of iterations with its explanation in a comment.

I add my comments in source code.

* Implementation

Random values :
double rand1 = 1.0 - rand; // instead of the LONG_MAX computation & limits.h
rand2 should be in (0, 1], but it is in [0, 1), use "1.0 - ..." as well?!

It's more smart method. I change to this method.

What is called "stdev*" in getrand() is really the chosen deviation from
the target mean, so it would make more sense to name it "dev".

Hmm, I like stdev*. Short variable makes us more confuse:( And it's not big problem.

I do not think that the getrand refactoring was such a good idea. I'm sorry
if I may have suggested that in a previous comment.
The new getrand possibly ignores its parameters, hmmmm. ISTM that it would
be much simpler in the code to have a separate and clean "getrand_normal"
or "getrand_gauss" called for "\setgaussian", and that's it. This would
allow to get rid of DistType and all of getrand changes in the code.

I separate two function that are getrand() and getGaussianrand(), it becomes more
clear I think.

There are heavy constants computations (sqrt(log()) within the while
loop which would be moved out of the loop.

ISTM that the while condition would be easier to read as:

while ( dev < - threshold || threshold < dev )

OK, fixed.

Maybe the \\setgaussian argument handling may be transformed into a function,
so that it could be used easily later for some other distribution (say some
setexp:-)

* Options

ISTM that the test options would be better if made orthogonal, i.e. not to
have three --gaussian* options. I would suggest to have only one
--gaussian=NUM which would trigger gaussian tests with this threshold,
and --gaussian=3.5 --select-only would use the select-only variant,
and so on.

Agreed. Fixed.

* Typos

gausian -> gaussian
patern -> pattern

Oh, fixed.

* Conclusion :

- this is a valuable patch to help create more realistic load and make pgbench
a more useful tool. I'm greatly in favor of having such a functionality.

- it seems to me that the patch should be further improved before being
committed, in particular I would suggest:

(1) improve the explanations in the code and in the documentation, especially
about what is the "deviation threshold" and its precise link to generated
values.

(2) simplify the code with a separate gaussian getrand, and simpler or
more efficient code here and there, see comments above.

(3) use only one option to trigger gaussian tests.

(bonus) \setexp would be a nice:-)

Thank you for your comments. They make my patch more polished:)
I think my patch is fixed for supporting all your comments, but it might not be fixed
as you think. And if you notice other part, please send me.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachments:

gaussian_pgbench_v5.patchtext/x-diff; name=gaussian_pgbench_v5.patchDownload+329-36
#18KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: KONDO Mitsumasa (#17)
Re: gaussian distribution pgbench

Sorry, previos attached patch has small bug.
Please use latest one.

134 - return min + (int64) (max - min + 1) * rand;
134 + return min + (int64)((max - min + 1) * rand);

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachments:

gaussian_pgbench_v5-1.patchtext/x-diff; name=gaussian_pgbench_v5-1.patchDownload+329-36
#19Mitsumasa KONDO
kondo.mitsumasa@gmail.com
In reply to: KONDO Mitsumasa (#18)
Re: gaussian distribution pgbench

I add exponential distribution random generator (and little bit
refactoring:) ).
I use inverse transform method to create its distribution. It's very
simple method that is
created by - log (rand()). We can control slope of distribution using
threshold parameter.
It is same as gaussian threshold.

usage example
pgbench --exponential=NUM -S

Attached graph is created with exponential threshold = 5. We can see
exponential
distribution in the graphs. It supports -S, -N options and custom script.
So we set
"¥setexponential [var] [min] [max] [threshold]" in a transaction pattern
file,
it appear distribution we want.

We have no time to fix its very much... But I think almost part of patch
have been completed.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachments:

gaussian_and_exponential_pgbench_v6.patchapplication/octet-stream; name=gaussian_and_exponential_pgbench_v6.patchDownload+453-68
exponential=5.pngimage/png; name="exponential=5.png"Download
gnuplot.shapplication/x-sh; name=gnuplot.shDownload
#20Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Mitsumasa KONDO (#19)
Re: gaussian distribution pgbench

Gaussian Pgbench v6 patch by Mitsumasa KONDO review & patch v7.

* The purpose of the patch is to allow a pgbench script to draw from normally
distributed or exponentially distributed integer values instead of uniformly
distributed.

This is a valuable contribution to enable pgbench to generate more realistic
loads, which is seldom uniform in practice.

* Changes

I have updated the patch (v7) based on Mitsumasa latest v6:
- some code simplifications & formula changes.
- I've added explicit looping probability computations in comments
to show the (low) looping probability of the iterative search.
- I've tried to clarify the sgml documentation.
- I've removed the 5.0 default value as it was not used anymore.
- I've renamed some variables to match the naming style around.

* Compilation

The patch applies and compiles against current head. It works as expected,
although there is few feedback from the script to show that. By looking
at the "aid" distribution in the "pgbench_history" table after a run, I
could check that the aid values are indeed skewed, depending on the parameters.

* Mathematical soundness

I've checked again the mathematical soundness for the methods involved.

After further thoughts, I'm not that sure that there is not a bias induced
by taking the second value based on "cos" when the first based on "sin"
as failed the test. So I removed the cos computation for the gaussian version,
and simplified the code accordingly. This mean that it may be a little
less efficient, but I'm more confident that there is no bias.

* Conclusion

If Mitsumasa-san is okay with the changes I have made, I would suggest
to accept this patch.

--
Fabien.

Attachments:

gaussian_and_exponential_pgbench_v7.patchtext/x-diff; name=gaussian_and_exponential_pgbench_v7.patchDownload+420-15
#21KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Fabien COELHO (#20)
#22Fabien COELHO
coelho@cri.ensmp.fr
In reply to: KONDO Mitsumasa (#21)
#23Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Fabien COELHO (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#23)
#25Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Tom Lane (#24)
#26KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Fabien COELHO (#25)
#27Fabien COELHO
coelho@cri.ensmp.fr
In reply to: KONDO Mitsumasa (#26)
#28KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Fabien COELHO (#27)
#29Fabien COELHO
coelho@cri.ensmp.fr
In reply to: KONDO Mitsumasa (#28)
#30KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Fabien COELHO (#29)
#31KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: KONDO Mitsumasa (#30)
#32KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: KONDO Mitsumasa (#31)
#33Fabien COELHO
coelho@cri.ensmp.fr
In reply to: KONDO Mitsumasa (#31)
#34KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Fabien COELHO (#33)
#35Fujii Masao
masao.fujii@gmail.com
In reply to: KONDO Mitsumasa (#34)
#36Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#35)
#37Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#36)
#38Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Fujii Masao (#35)
#39KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Fabien COELHO (#38)
#40KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Fujii Masao (#37)
#41Fabien COELHO
coelho@cri.ensmp.fr
In reply to: KONDO Mitsumasa (#40)
#42Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#37)
#43Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Heikki Linnakangas (#42)
#44Mitsumasa KONDO
kondo.mitsumasa@gmail.com
In reply to: Fabien COELHO (#43)
#45Mitsumasa KONDO
kondo.mitsumasa@gmail.com
In reply to: Mitsumasa KONDO (#44)
#46Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Mitsumasa KONDO (#44)
#47Mitsumasa KONDO
kondo.mitsumasa@gmail.com
In reply to: Fabien COELHO (#46)
#48KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Fabien COELHO (#43)
#49KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: KONDO Mitsumasa (#48)
#50Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fabien COELHO (#43)
#51Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: KONDO Mitsumasa (#49)
#52KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Heikki Linnakangas (#50)
#53KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Heikki Linnakangas (#51)
#54Fujii Masao
masao.fujii@gmail.com
In reply to: KONDO Mitsumasa (#53)
#55Tom Lane
tgl@sss.pgh.pa.us
In reply to: KONDO Mitsumasa (#53)
#56Robert Haas
robertmhaas@gmail.com
In reply to: Mitsumasa KONDO (#44)
#57KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Tom Lane (#55)
#58KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Robert Haas (#56)
#59KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: KONDO Mitsumasa (#57)
#60Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: KONDO Mitsumasa (#59)
#61Fabien COELHO
coelho@cri.ensmp.fr
In reply to: KONDO Mitsumasa (#49)