Re: gaussian distribution pgbench

Started by Mitsumasa KONDO, almost 12 years ago, 40 messages, hackers
#1 Mitsumasa KONDO
kondo.mitsumasa@gmail.com

Hello Fabien-san,

I have checked your v13 patch and tested the new exponential distribution
generating algorithm. It works fine, with little or no overhead compared
with the previous version.
Great work! And I agree with your proposal.

I'm also interested in your "decile percents" output, as shown below:

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=20
~
decile percents: 86.5% 11.7% 1.6% 0.2% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
~
[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
~
decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
~
[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=5
~
decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
~
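
For readers who want to check these figures by hand, here is a minimal sketch in Python. It assumes the option draws from an exponential density truncated to the key range, so that the share of each tenth of the range follows from the difference of two exponentials. That density and the function name `exponential_decile_percents` are assumptions made for illustration, not code taken from the patch.

```python
import math

def exponential_decile_percents(threshold: float) -> list[float]:
    """Share of draws (in %) falling in each tenth of the key range,
    assuming a truncated exponential density with the given threshold."""
    norm = 1.0 - math.exp(-threshold)  # total mass kept after truncation
    percents = []
    for i in range(10):
        upper_tail = math.exp(-threshold * i / 10)        # mass beyond decile start
        lower_tail = math.exp(-threshold * (i + 1) / 10)  # mass beyond decile end
        percents.append(100.0 * (upper_tail - lower_tail) / norm)
    return percents

for t in (20, 10, 5):
    line = " ".join(f"{p:.1f}%" for p in exponential_decile_percents(t))
    print(f"--exponential={t}: {line}")
```

Under that assumption the printed rows should agree with the three "decile percents" lines quoted above.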

I think this output makes the exponential distribution easy to understand
for a given exponential parameter, so I agree with it. I have therefore
added the same decile percents output for the gaussian distribution.
Here are some examples.

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --gaussian=20
~
decile percents: 0.0% 0.0% 0.0% 0.0% 50.0% 50.0% 0.0% 0.0% 0.0% 0.0%
~
[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --gaussian=10
~
decile percents: 0.0% 0.0% 0.0% 2.3% 47.7% 47.7% 2.3% 0.0% 0.0% 0.0%
~
[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --gaussian=5
~
decile percents: 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1% 0.1% 0.0%
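
The gaussian figures can be checked the same way. The sketch below assumes that the threshold is the number of standard deviations kept on each side of the mean, and that the interval from -threshold to +threshold is mapped linearly onto the key range. Both points are assumptions inferred from the numbers above, and `gaussian_decile_percents` is a hypothetical helper name.

```python
import math

def phi(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_decile_percents(threshold: float) -> list[float]:
    """Share of draws (in %) per tenth of the range, assuming a standard
    normal truncated to [-threshold, +threshold]."""
    norm = phi(threshold) - phi(-threshold)  # mass kept after truncation
    width = 2.0 * threshold / 10             # width of one decile in sigmas
    percents = []
    for i in range(10):
        lo = -threshold + i * width
        percents.append(100.0 * (phi(lo + width) - phi(lo)) / norm)
    return percents

for t in (20, 10, 5):
    line = " ".join(f"{p:.1f}%" for p in gaussian_decile_percents(t))
    print(f"--gaussian={t}: {line}")
```

With a threshold of 5, each decile is one standard deviation wide, which is why the two middle deciles each come out at about 34.1%.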

I think this is easier to understand than before. The decile percents sum
to exactly 100%.

However, I don't like the "highest/lowest percentage" output, because
users will confuse it with the decile percentages, and nobody will
understand these numbers.

Here is an example with exponential=5:

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=5
~
decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
highest/lowest percent of the range: 4.9% 0.0%
~

I could not understand "4.9% 0.0%" when I first saw it.
Only after checking the source code did I understand it :( That's not good design...
#Why does this parameter use 100?
So I'd like to remove it, if you agree. That would be simpler.
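
For reference, the two numbers on the "highest/lowest percent of the range" line are consistent with the share of draws landing in the first 1% and in the last 1% of the key range. The sketch below computes them under the assumption of a truncated exponential density; `exponential_first_last_percent` is a hypothetical name, and the split into 100 slices is the "100" asked about above.

```python
import math

def exponential_first_last_percent(threshold: float) -> tuple[float, float]:
    """Share of draws (in %) in the first and in the last 1% of the range,
    assuming a truncated exponential density with the given threshold."""
    norm = 1.0 - math.exp(-threshold)
    first = (1.0 - math.exp(-threshold / 100)) / norm
    last = (math.exp(-threshold * 99 / 100) - math.exp(-threshold)) / norm
    return 100.0 * first, 100.0 * last

for t in (5, 10):
    first, last = exponential_first_last_percent(t)
    print(f"--exponential={t}: {first:.1f}% {last:.1f}%")
```

With a threshold of 5 this gives roughly 4.9% and 0.0%, matching the line quoted above.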

The attached patch is the fixed version; please check it.
#Of course, the World Cup is being held now, so I'm in no hurry at all.

Best regards,
--
Mitsumasa KONDO

Attachments:

gaussian_and_exponential_pgbench_v14.patch (text/x-diff; charset=US-ASCII), +470 -150

#2 Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Mitsumasa KONDO (#1)

Hello Mitsumasa-san,

And I'm also interested in your "decile percents" output like under
followings,
decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%

Sure, I'm really fine with that.

I think that it is easier than before. Sum of decile percents is just 100%.

That's a good property:-)

However, I don't prefer "highest/lowest percentage" because it will be
confused with decile percentage for users, and anyone cannot understand
this digits. I cannot understand "4.9%, 0.0%" when I see the first time.
Then, I checked the source code, I understood it:( It's not good
design... #Why this parameter use 100?

What else? People have ten fingers and like powers of 10, and are used to
percents?

So I'd like to remove it if you like. It will be more simple.

I think that for the exponential distribution it helps, especially for
high threshold, to have the lowest/highest percent density. For low
thresholds, the decile is also definitely useful. So I'm fine with both
outputs as you have put them.

I have just updated the wording so that it may be clearer:

decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
probability of fist/last percent of the range: 11.3% 0.0%

Attached patch is fixed version, please confirm it.

Attached a v15 which just fixes a typo and the above wording update. I'm
validating it for committers.

#Of course, World Cup is being held now. I'm not hurry at all.

I'm not a soccer kind of person, so it does not influence my
availability. :-)

Suggested commit message:

Add drawing random integers with a Gaussian or truncated exponential
distribution to pgbench.

Test variants with these distributions are also provided and triggered
with options "--gaussian=..." and "--exponential=...".

Have a nice day/night,

--
Fabien.

Attachments:

gaussian_and_exponential_pgbench_v15.patch (text/x-diff), +451 -19

#3 Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Fabien COELHO (#2)

I have just updated the wording so that it may be clearer:

Oops, I have sent the wrong patch, without the wording fix. Here is the
real updated version, which I tested.

probability of fist/last percent of the range: 11.3% 0.0%

--
Fabien.

Attachments:

gaussian_and_exponential_pgbench_v15b.patch (text/x-diff), +451 -19

#4 Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Fabien COELHO (#2)

On 02/07/14 21:05, Fabien COELHO wrote:

Hello Mitsumasa-san,

And I'm also interested in your "decile percents" output like under
followings,
decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%

Sure, I'm really fine with that.

I think that it is easier than before. Sum of decile percents is just
100%.

That's a good property:-)

However, I don't prefer "highest/lowest percentage" because it will
be confused with decile percentage for users, and anyone cannot
understand this digits. I cannot understand "4.9%, 0.0%" when I see
the first time. Then, I checked the source code, I understood it:(
It's not good design... #Why this parameter use 100?

What else? People have ten fingers and like powers of 10, and are used
to percents?

So I'd like to remove it if you like. It will be more simple.

I think that for the exponential distribution it helps, especially for
high threshold, to have the lowest/highest percent density. For low
thresholds, the decile is also definitely useful. So I'm fine with
both outputs as you have put them.

I have just updated the wording so that it may be clearer:

decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
probability of fist/last percent of the range: 11.3% 0.0%

Attached patch is fixed version, please confirm it.

Attached a v15 which just fixes a typo and the above wording update.
I'm validating it for committers.

#Of course, World Cup is being held now. I'm not hurry at all.

I'm not a soccer kind of person, so it does not influence my
availibility.:-)

Suggested commit message:

Add drawing random integers with a Gaussian or truncated exponentional
distributions to pgbench.

Test variants with these distributions are also provided and triggered
with options "--gaussian=..." and "--exponential=...".

Have a nice day/night,

I would suggest that probabilities should NEVER be expressed in
percentages! As a percentage probability looks weird, and is never used
for serious statistical work - in my experience at least.

I think probabilities should be expressed in the range 0 ... 1 - i.e.
0.35 rather than 35%.

Cheers,
Gavin

#5 Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Gavin Flower (#4)

Hello Gavin,

decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
probability of fist/last percent of the range: 11.3% 0.0%

I would suggest that probabilities should NEVER be expressed in percentages!
As a percentage probability looks weird, and is never used for serious
statistical work - in my experience at least.

I think probabilities should be expressed in the range 0 ... 1 - i.e. 0.35
rather than 35%.

I could agree about the mathematics, but ISTM that "11.5%" is more
readable and intuitive than "0.115".

I could change "probability" and replace it with "frequency" or maybe
"occurrence"; what would you think about that?

--
Fabien.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6 Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Fabien COELHO (#5)

On 03/07/14 20:58, Fabien COELHO wrote:

Hello Gavin,

decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
probability of fist/last percent of the range: 11.3% 0.0%

I would suggest that probabilities should NEVER be expressed in
percentages! As a percentage probability looks weird, and is never
used for serious statistical work - in my experience at least.

I think probabilities should be expressed in the range 0 ... 1 - i.e.
0.35 rather than 35%.

I could agree about the mathematics, but ISTM that "11.5%" is more
readable and intuitive than "0.115".

I could change "probability" and replace it with "frequency" or maybe
"occurence", what would you think about that?

You may well be hitting a situation, where you meet opposition whatever
you do! :-)

"frequency" implies a positive integer (though "relative frequency"
might be okay) - and if you use "occurrence", someone else is bound to
complain...

Though I'd opt for "relative frequency" if percentages are used and you
can't express probabilities as values in the range 0 ... 1, so long as it
does not generate a flame war.

I suspect it may not be worth the grief to change.

Cheers,
Gavin


#7 Fujii Masao
masao.fujii@gmail.com
In reply to: Fabien COELHO (#2)

On Wed, Jul 2, 2014 at 6:05 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello Mitsumasa-san,

And I'm also interested in your "decile percents" output like under
followings,
decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%

Sure, I'm really fine with that.

I think that it is easier than before. Sum of decile percents is just
100%.

That's a good property:-)

However, I don't prefer "highest/lowest percentage" because it will be
confused with decile percentage for users, and anyone cannot understand this
digits. I cannot understand "4.9%, 0.0%" when I see the first time. Then, I
checked the source code, I understood it:( It's not good design... #Why this
parameter use 100?

What else? People have ten fingers and like powers of 10, and are used to
percents?

So I'd like to remove it if you like. It will be more simple.

I think that for the exponential distribution it helps, especially for high
threshold, to have the lowest/highest percent density. For low thresholds,
the decile is also definitely useful. So I'm fine with both outputs as you
have put them.

I have just updated the wording so that it may be clearer:

decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
probability of fist/last percent of the range: 11.3% 0.0%

Attached patch is fixed version, please confirm it.

Attached a v15 which just fixes a typo and the above wording update. I'm
validating it for committers.

#Of course, World Cup is being held now. I'm not hurry at all.

I'm not a soccer kind of person, so it does not influence my
availibility.:-)

Suggested commit message:

Add drawing random integers with a Gaussian or truncated exponentional
distributions to pgbench.

Test variants with these distributions are also provided and triggered
with options "--gaussian=..." and "--exponential=...".

IIRC we have not reached consensus on whether pgbench should support
such options; several hackers were against supporting them.
OTOH, we have almost reached consensus on supporting the gaussian
and exponential options in \setrandom. So I think that you should
separate those two features into two patches, and we should apply
the \setrandom one first. Then we can discuss whether the other patch
should be applied or not.

Regards,

--
Fujii Masao


#8 Andres Freund
andres@anarazel.de
In reply to: Fujii Masao (#7)

On 2014-07-03 21:27:53 +0900, Fujii Masao wrote:

Add drawing random integers with a Gaussian or truncated exponentional
distributions to pgbench.

Test variants with these distributions are also provided and triggered
with options "--gaussian=..." and "--exponential=...".

IIRC we've not reached consensus about whether we should support
such options in pgbench. Several hackers disagreed to support them.

Yea. I certainly disagree with the patch in its current state because
it copies the same 15 lines several times with a two word
difference. Independent of whether we want those options, I don't think
that's going to fly.

OTOH, we've almost reached the consensus that supporting gaussian
and exponential options in \setrandom. So I think that you should
separate those two features into two patches, and we should apply
the \setrandom one first. Then we can discuss whether the other patch
should be applied or not.

Sounds like a good plan.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#9 Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Andres Freund (#8)

Yea. I certainly disagree with the patch in it's current state because
it copies the same 15 lines several times with a two word difference.
Independent of whether we want those options, I don't think that's going
to fly.

I liked a simple static string for the different variants, which means
replication. Factorizing out the (large) common part will mean malloc &
sprintf. Well, why not.

OTOH, we've almost reached the consensus that supporting gaussian
and exponential options in \setrandom. So I think that you should
separate those two features into two patches, and we should apply
the \setrandom one first. Then we can discuss whether the other patch
should be applied or not.

Sounds like a good plan.

Sigh. I'll do that as it seems to be a blocker...

The caveat that I have is that without these options there is:

(1) no feedback in the final summary about the actual distribution, which
depends on the threshold value, and

(2) no built-in means to test the feature, so the first patch is less
meaningful if the feature cannot be used simply and requires a custom
script.

--
Fabien.


#10 Andres Freund
andres@anarazel.de
In reply to: Fabien COELHO (#9)

On 2014-07-04 11:59:23 +0200, Fabien COELHO wrote:

Yea. I certainly disagree with the patch in it's current state because it
copies the same 15 lines several times with a two word difference.
Independent of whether we want those options, I don't think that's going
to fly.

I liked a simple static string for the different variants, which means
replication. Factorizing out the (large) common part will mean malloc &
sprintf. Well, why not.

It sucks from a maintenance POV. And I don't see the overhead of malloc
being relevant here...

OTOH, we've almost reached the consensus that supporting gaussian
and exponential options in \setrandom. So I think that you should
separate those two features into two patches, and we should apply
the \setrandom one first. Then we can discuss whether the other patch
should be applied or not.

Sounds like a good plan.

Sigh. I'll do that as it seems to be a blocker...

I think we also need documentation about the actual mathematical
behaviour of the randomness generators.

+     <para>
+      With the gaussian option, the larger the <replaceable>threshold</>,
+      the more frequently values close to the middle of the interval are drawn,
+      and the less frequently values close to the <replaceable>min</> and
+      <replaceable>max</> bounds.
+      In other worlds, the larger the <replaceable>threshold</>,
+      the narrower the access range around the middle.
+      the smaller the threshold, the smoother the access pattern
+      distribution. The minimum threshold is 2.0 for performance.
+     </para>

The only way to actually understand the distribution here is to create a
table, insert random values, and then look at the result. That's not a
good thing.
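
A quick way to look at the shape without creating a table is to simulate the draw. The sketch below is only an illustration: it assumes the gaussian variant rejects standard normal values outside plus or minus the threshold and scales the rest linearly onto the key range. The function names, the key range of 1 to 100000 and the sample size are all hypothetical choices, not values taken from the patch.

```python
import random
from collections import Counter

def draw_gaussian(rng: random.Random, lo: int, hi: int, threshold: float) -> int:
    """Draw one key in [lo, hi] from a truncated normal (assumed method)."""
    while True:
        z = rng.gauss(0.0, 1.0)
        if -threshold <= z < threshold:  # reject values outside the cut-off
            break
    frac = (z + threshold) / (2.0 * threshold)  # position within the range
    return min(hi, lo + int((hi - lo + 1) * frac))

def decile_histogram(threshold: float, n: int = 100_000, seed: int = 1) -> list[float]:
    """Percentage of simulated draws landing in each tenth of the range."""
    rng = random.Random(seed)
    lo, hi = 1, 100_000
    counts = Counter()
    for _ in range(n):
        key = draw_gaussian(rng, lo, hi, threshold)
        counts[(key - lo) * 10 // (hi - lo + 1)] += 1
    return [100.0 * counts[d] / n for d in range(10)]

print(" ".join(f"{p:.1f}%" for p in decile_histogram(5.0)))
```

Being a simulation, the result only approximates the exact decile shares, and it varies slightly with the seed and the sample size.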

The caveat that I have is that without these options there is:

(1) no return about the actual distributions in the final summary, which
depend on the threshold value, and

(2) no included mean to test the feature, so the first patch is less
meaningful if the feature cannot be used simply and require a custom script.

I personally agree that we likely want that as an additional
feature. Even if just because it makes the results easier to compare.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#11 Mitsumasa KONDO
kondo.mitsumasa@gmail.com
In reply to: Andres Freund (#10)

Hi,

2014-07-04 19:05 GMT+09:00 Andres Freund <andres@2ndquadrant.com>:

On 2014-07-04 11:59:23 +0200, Fabien COELHO wrote:

Yea. I certainly disagree with the patch in it's current state because it
copies the same 15 lines several times with a two word difference.
Independent of whether we want those options, I don't think that's going
to fly.

I liked a simple static string for the different variants, which means
replication. Factorizing out the (large) common part will mean malloc &
sprintf. Well, why not.

It sucks from a maintenance POV. And I don't see the overhead of malloc
being relevant here...

OTOH, we've almost reached the consensus that supporting gaussian
and exponential options in \setrandom. So I think that you should
separate those two features into two patches, and we should apply
the \setrandom one first. Then we can discuss whether the other patch
should be applied or not.

Sounds like a good plan.

Sigh. I'll do that as it seems to be a blocker...

I still agree with Fabien-san. I cannot understand why our logical proposal
isn't accepted...

I think we also need documentation about the actual mathematical
behaviour of the randomness generators.

+     <para>
+      With the gaussian option, the larger the <replaceable>threshold</>,
+      the more frequently values close to the middle of the interval are drawn,
+      and the less frequently values close to the <replaceable>min</> and
+      <replaceable>max</> bounds.
+      In other worlds, the larger the <replaceable>threshold</>,
+      the narrower the access range around the middle.
+      the smaller the threshold, the smoother the access pattern
+      distribution. The minimum threshold is 2.0 for performance.
+     </para>

The only way to actually understand the distribution here is to create a
table, insert random values, and then look at the result. That's not a
good thing.

That's right. Therefore, we created a command line option that makes the
parametrized Gaussian distribution easy to understand.
When you want to see the effect of the distribution parameter, you can use
the command line option as follows.

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.00000
decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=5
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 5.00000
decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
highest/lowest percent of the range: 4.9% 0.0%

If you have a better method than ours, please share it with us.

The caveat that I have is that without these options there is:

(1) no return about the actual distributions in the final summary, which
depend on the threshold value, and

(2) no included mean to test the feature, so the first patch is less
meaningful if the feature cannot be used simply and require a custom script.

I personally agree that we likely want that as an additional
feature. Even if just because it makes the results easier to compare.

If we can have a positive and logical discussion, I will agree with the
proposal to separate the patches.
However, I think that the most opposed hacker decided based on his feelings...
Actually, he didn't respond to our proposal for making the parametrized
distribution understandable...
So I also think it is a blocker. The command line feature is also needed.
Besides, is there another good method? Please share it with us.

Best regards,
--
Mitsumasa KONDO

#12 Robert Haas
robertmhaas@gmail.com
In reply to: Mitsumasa KONDO (#11)

On Sun, Jul 13, 2014 at 2:27 AM, Mitsumasa KONDO
<kondo.mitsumasa@gmail.com> wrote:

I still agree with Fabien-san. I cannot understand why our logical proposal
isn't accepted...

Well, I think the feedback has been pretty clear, honestly. Here's
what I'm unhappy about: I can't understand what these options are
actually doing.

And this isn't helping me a bit:

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.00000

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

I don't have a clue what that means. None.

Here is an example of an explanation that would make sense to me.
This is not the actual behavior of your patch, I'm quite sure, so this
is just an example of the *kind* of explanation that I think is
needed:

The --exponential option causes pgbench to select lower-numbered
account IDs exponentially more frequently than higher-numbered account
IDs. The argument to --exponential controls the strength of the
preference for lower-numbered account IDs, with a smaller value
indicating a stronger preference. Specifically, it is the percentage
of the total number of account IDs which will receive half the total
accesses. For example, with --exponential=10, half the accesses will
be to the smallest 10 percent of the account IDs; half the remaining
accesses will be to the next-smallest 10 percent of account IDs, and
so on. --exponential=50 therefore represents a completely flat
distribution; larger values are not allowed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#13 Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Robert Haas (#12)

Hello Robert,

Well, I think the feedback has been pretty clear, honestly. Here's
what I'm unhappy about: I can't understand what these options are
actually doing.

We can try to improve the documentation, once more!

However, ISTM that it is not the purpose of pgbench documentation to be a
primer about what is an exponential or gaussian distribution, so the idea
would yet be to have a relatively compact explanation, and that the
interested but clueless reader would document h..self from wikipedia or a
text book or a friend or a math teacher (who could be a friend as well:-).

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.00000

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

I don't have a clue what that means. None.

Maybe we could add in front of the decile/percent

"distribution of increasing account key values selected by pgbench:"

Here is an example of an explanation that would make sense to me.
This is not the actual behavior of your patch, I'm quite sure, so this
is just an example of the *kind* of explanation that I think is
needed:

This is more or less the approximate behavior of the patch, but for 1% of
the range, not 50%. However I'm not sure that the current documentation is
so bad.

The --exponential option causes pgbench to select lower-numbered
account IDs exponentially more frequently than higher-numbered account
IDs. The argument to --exponential controls the strength of the
preference for lower-numbered account IDs, with a smaller value
indicating a stronger preference. Specifically, it is the percentage
of the total number of account IDs which will receive half the total
accesses. For example, with --exponential=10, half the accesses will
be to the smallest 10 percent of the account IDs; half the remaining
accesses will be to the next-smallest 10 percent of account IDs, and
so on. --exponential=50 therefore represents a completely flat
distribution; larger values are not allowed.

--
Fabien.


#14 Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Andres Freund (#10)
Re: gaussian distribution pgbench -- part 1/2

pgbench with gaussian & exponential, part 1 of 2.

This patch is a subset of the previous patch which only adds the two
new \setrandom gaussian and exponential variants, but not the
adapted pgbench test cases, as suggested by Fujii Masao.
There is no new code nor code changes.

The corresponding documentation has been yet again extended with respect
to the initial patch, so that what is achieved is hopefully unambiguous
(there are two mathematical formulas, tasty!), in answer to Andres Freund's
comments, and partly to Robert Haas's comments as well.

This patch also provides several sql/pgbench scripts and a README, so
that the feature can be tested. I do not know whether these scripts
should make it to postgresql. I would say yes, otherwise there is no way
to test...

Part 2, which provides the adapted pgbench test cases, will come later.

--
Fabien.

Attachments:

gauss_A_17.patch (text/x-diff), +298 -13

#15 Robert Haas
robertmhaas@gmail.com
In reply to: Fabien COELHO (#13)

On Wed, Jul 16, 2014 at 12:57 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Well, I think the feedback has been pretty clear, honestly. Here's
what I'm unhappy about: I can't understand what these options are
actually doing.

We can try to improve the documentation, once more!

However, ISTM that it is not the purpose of pgbench documentation to be a
primer about what is an exponential or gaussian distribution, so the idea
would yet be to have a relatively compact explanation, and that the
interested but clueless reader would document h..self from wikipedia or a
text book or a friend or a math teacher (who could be a friend as well:-).

Well, I think it's a balance. I agree that the pgbench documentation
shouldn't try to substitute for a text book or a math teacher, but I
also think that you shouldn't necessarily need to refer to a text book
or a math teacher in order to figure out how to use pgbench. Saying
"it's complicated, so we don't have to explain it" would be a cop out;
we need to *make* it simple. And if there's no way to do that, then
IMHO we should reject the patch in favor of some future patch that
implements something that will be easy for users to understand.

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.00000

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

I don't have a clue what that means. None.

Maybe we could add in front of the decile/percent

"distribution of increasing account key values selected by pgbench:"

I still wouldn't know what that meant. And it misses the point
anyway: if the documentation is good, this will be unnecessary. If
the documentation is bad, a printout that tries to illustrate it by
example is not an acceptable substitute.

Here is an example of an explanation that would make sense to me.
This is not the actual behavior of your patch, I'm quite sure, so this
is just an example of the *kind* of explanation that I think is
needed:

This is more or less the approximate behavior of the patch, but for 1% of
the range, not 50%. However I'm not sure that the current documentation is
so bad.

I think it isn't, because in the system I described, a larger value
indicates a flatter distribution, but in the documentation, a smaller
value indicates a flatter distribution. That having been said, I
agree the current documentation for the exponential distribution is
not too bad. But this part does not make sense:

+      A crude approximation of the distribution is that the most frequent 1%
+      values are drawn <replaceable>threshold</>% of the time.
+      The closer to 0.0 the threshold, the flatter (more uniform) the access
+      distribution.

Given the first statement, I'd expect the lowest possible threshold to
be 0.01, not 0.
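
One way to see the tension in the quoted sentences is to write the share of draws landing in the first 1% of the range as a function of the threshold t. The expression below assumes the truncated exponential density that matches the decile figures quoted earlier in this thread; it is an assumption, not text from the patch.

```latex
P_{1\%}(t) = \frac{1 - e^{-t/100}}{1 - e^{-t}},
\qquad
P_{1\%}(t) \approx \frac{t}{100} \ \text{for moderate } t \ (\text{e.g. } t = 5 \text{ or } 10),
\qquad
\lim_{t \to 0} P_{1\%}(t) = \frac{1}{100}.
```

Under this assumption the "threshold% of the time" rule only holds for mid-range thresholds: as the threshold approaches 0, the first 1% of the values is drawn about 1% of the time, which is a uniform distribution, rather than 0% of the time.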

The documentation for the Gaussian distribution is in somewhat worse
shape. Unlike the documentation for exponential, it makes no attempt
at all to give the user a clear idea what the distribution actually
looks like. The closest it comes is this:

+      In other worlds, the larger the <replaceable>threshold</>,
+      the narrower the access range around the middle.

But that's not really very close - there's no way for a user to judge
what impact the threshold parameter actually has except to try it.
Unlike the discussion of exponential, which contains a fairly-precise
mathematical characterization of the behavior, the Gaussian stuff has
nothing except a hand-wavy explanation that a higher threshold skews
the distribution more. (Also, the English expression is "in other
words" not "in other worlds" - but in fact the phrase has no business
in that sentence at all, because it is not reiterating the contents of
the previous sentence in different language, but rather making a new
point entirely. And the following sentence does not start with a
capital letter, though maybe that's because it was intended to be
incorporated into this sentence somehow.)

I think that you also need to consider which instances of the words
"gaussian" and "exponential" are referring to the option and which are
referring to the abstract mathematical concept. When you're talking
about the option, you should use all lower-case (as you've done) but
with <literal> tags or similar. When you're referring to the
mathematical distribution, Gaussian should be capitalized.

BTW, I agree with both Heikki's suggestion that we make these options
to setrandom only and not expose command-line options for them, and
with Andres's critique that the documentation of those options is far
too repetitive.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#16 Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Robert Haas (#15)

However, ISTM that it is not the purpose of pgbench documentation to be a
primer about what is an exponential or gaussian distribution, so the idea
would yet be to have a relatively compact explanation, and that the
interested but clueless reader would document h..self from wikipedia or a
text book or a friend or a math teacher (who could be a friend as well:-).

Well, I think it's a balance. I agree that the pgbench documentation
shouldn't try to substitute for a text book or a math teacher, but I
also think that you shouldn't necessarily need to refer to a text book
or a math teacher in order to figure out how to use pgbench. Saying
"it's complicated, so we don't have to explain it" would be a cop out;
we need to *make* it simple. And if there's no way to do that, then
IMHO we should reject the patch in favor of some future patch that
implements something that will be easy for users to understand.

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.00000

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

I don't have a clue what that means. None.

Maybe we could add in front of the decile/percent

"distribution of increasing account key values selected by pgbench:"

I still wouldn't know what that meant. And it misses the point
anyway: if the documentation is good, this will be unnecessary. If
the documentation is bad, a printout that tries to illustrate it by
example is not an acceptable substitute.

The decile description is quite classic when discussing statistics.

Here is an example of an explanation that would make sense to me.
This is not the actual behavior of your patch, I'm quite sure, so this
is just an example of the *kind* of explanation that I think is
needed:

This is more or less the approximate behavior of the patch, but for 1% of
the range, not 50%. However I'm not sure that the current documentation is
so bad.

I think it isn't, because in the system I described, a larger value
indicates a flatter distribution, but in the documentation, a smaller
value indicates a flatter distribution.

Ok. But the general thrust was ok.

That having been said, I agree the current documentation for the
exponential distribution is not too bad. But this part does not make
sense:

+      A crude approximation of the distribution is that the most frequent 1%
+      values are drawn <replaceable>threshold</>% of the time.

I'm trying to be nice to the reader by providing some intuition. I do
not seem to succeed:-) I'm attempting to say that when you draw from a
range, say 1 to 1000, the first 1%, i.e. values 1 to 10, are drawn about
"threshold"% of the time.

If I draw from one hundred values:

\setrandom x 1 100 exponential 10.0

The 1 will be drawn about 10% of the time, and the 99 next values will
share the remaining 90%.
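
As a rough numerical check of the claim above, here is a minimal Python sketch. It assumes (this is not taken from the patch itself) that the option behaves like an exponential density truncated to the range, so that the share of draws landing between fractions lo and hi of the range is (exp(-t*lo) - exp(-t*hi)) / (1 - exp(-t)) for threshold t. The function names are hypothetical. Under that assumption the numbers agree with the pgbench output quoted in this thread.

```python
import math


def exponential_share(lo, hi, threshold):
    """Share of draws falling in [lo, hi], both given as fractions (0..1)
    of the range, for an exponential density truncated to the range.
    ASSUMPTION: this closed form is an illustration, not the patch's code."""
    norm = 1.0 - math.exp(-threshold)
    return (math.exp(-threshold * lo) - math.exp(-threshold * hi)) / norm


def exponential_decile_percents(threshold):
    """Percentage of draws in each tenth of the range, lowest keys first."""
    return [100.0 * exponential_share(k / 10, (k + 1) / 10, threshold)
            for k in range(10)]


if __name__ == "__main__":
    t = 10.0
    first_percent = 100.0 * exponential_share(0.0, 0.01, t)
    # For threshold 10 this is about 9.5%, i.e. roughly "threshold"%.
    print("first 1%% of the range: %.1f%%" % first_percent)
    print("deciles:", " ".join("%.1f%%" % p
                               for p in exponential_decile_percents(t)))
```

For small thresholds the first 1% of the range receives close to "threshold"% of the draws, because 1 - exp(-t/100) is approximately t/100. The approximation gets cruder as the threshold grows.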

+      The closer to 0.0 the threshold, the flatter (more uniform) the access
+      distribution.

Given the first statement, I'd expect the lowest possible threshold to
be 0.01, not 0.

This is in the sense of "epsilon", a small number close to 0 but different
from 0. The lowest possible threshold is the smallest strictly positive
value representable as a "double".

The documentation for the Gaussian distribution is in somewhat worse
shape. Unlike the documentation for exponential, it makes no attempt
at all to give the user a clear idea what the distribution actually
looks like. The closest it comes is this:

+      In other worlds, the larger the <replaceable>threshold</>,
+      the narrower the access range around the middle.

But that's not really very close - there's no way for a user to judge
what impact the threshold parameter actually has except to try it.
Unlike the discussion of exponential, which contains a fairly-precise
mathematical characterization of the behavior,

I have now added a precise formula for Gaussian. Even with the formula in
hand, you may still want to see the deciles to get some intuition.

I think that we assumed that the reader would know that a Gaussian
distribution is the classic bell-shaped distribution, and if not, s/he
would not be interested anyway.

the Gaussian stuff has
nothing except a hand-wavy explanation that a higher threshold skews
the distribution more. (Also, the English expression is "in other
words" not "in other worlds" - but in fact the phrase has no business
in that sentence at all, because it is not reiterating the contents of
the previous sentence in different language, but rather making a new
point entirely. And the following sentence does not start with a
capital letter, though maybe that's because it was intended to be
incorporated into this sentence somehow.)

I think that you also need to consider which instances of the words
"gaussian" and "exponential" are referring to the option and which are
referring to the abstract mathematical concept. When you're talking
about the option, you should use all lower-case (as you've done) but
with <literal> tags or similar. When you're referring to the
mathematical distribution, Gaussian should be capitalized.

BTW, I agree with both Heikki's suggestion that we make these options
to setrandom only and not expose command-line options for them, and
with Andres's critique that the documentation of those options is far
too repetitive.

I'll have yet another go at trying to improve the documentation, especially
the Gaussian part. However, you must allow that this is mathematics, and a
user who wants to use these distributions will be expected to understand
somehow beforehand what they are.

Moreover, I cannot make it precise, intuitive and very short.
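
For a feel of how the Gaussian threshold shapes the access pattern, here is a minimal Python sketch. It assumes (this is a guess at the behavior, not a quote from the patch) that the range is mapped onto the interval from -threshold to +threshold standard deviations of a normal distribution truncated to that interval. The function names are hypothetical. Under that assumption the result agrees with the Gaussian decile figures posted earlier in this thread, for example 34.1% in each of the two middle deciles for a threshold of 5.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gaussian_decile_percents(threshold):
    """Percentage of draws in each tenth of the range.
    ASSUMPTION: the range spans [-threshold, +threshold] standard
    deviations of a truncated normal; illustration only."""
    norm = normal_cdf(threshold) - normal_cdf(-threshold)
    edges = [-threshold + 2.0 * threshold * k / 10 for k in range(11)]
    return [100.0 * (normal_cdf(edges[k + 1]) - normal_cdf(edges[k])) / norm
            for k in range(10)]


if __name__ == "__main__":
    for t in (5.0, 10.0, 20.0):
        line = " ".join("%.1f%%" % p for p in gaussian_decile_percents(t))
        print("threshold %4.1f: %s" % (t, line))
```

The larger the threshold, the more of the bell curve's tails fit inside the range, so the accesses concentrate in the middle deciles.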

--
Fabien.

#17Mitsumasa KONDO
kondo.mitsumasa@gmail.com
In reply to: Fabien COELHO (#16)

2014-07-18 5:13 GMT+09:00 Fabien COELHO <coelho@cri.ensmp.fr>:

However, ISTM that it is not the purpose of pgbench documentation to be a
primer about what is an exponential or gaussian distribution, so the idea
would yet be to have a relatively compact explanation, and that the
interested but clueless reader would document h..self from wikipedia or a
text book or a friend or a math teacher (who could be a friend as well:-).

Well, I think it's a balance. I agree that the pgbench documentation
shouldn't try to substitute for a text book or a math teacher, but I
also think that you shouldn't necessarily need to refer to a text book
or a math teacher in order to figure out how to use pgbench. Saying
"it's complicated, so we don't have to explain it" would be a cop out;
we need to *make* it simple. And if there's no way to do that, then
IMHO we should reject the patch in favor of some future patch that
implements something that will be easy for users to understand.

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.00000

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

I don't have a clue what that means. None.

Maybe we could add in front of the decile/percent

"distribution of increasing account key values selected by pgbench:"

I still wouldn't know what that meant. And it misses the point
anyway: if the documentation is good, this will be unnecessary. If
the documentation is bad, a printout that tries to illustrate it by
example is not an acceptable substitute.

The decile description is quite classic when discussing statistics.

Yes, perhaps. Fabien-san and I find it hard to believe that he does not
know what a decile percentage is.
However, I think a fuller description of deciles is needed.

For example, when we set the number of transactions to 10,000 (-t 10000),
the range of aid is 100,000, and --exponential is 10, the decile percents
are as follows, as you know.

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

They mean the following.
# number of accesses per range of aid (from the decile percents):
1 to 10,000 => 6,320 times
10,001 to 20,000 => 2,330 times
20,001 to 30,000 => 860 times
...
90,001 to 100,000 => 0 times

# number of accesses per range of aid (from the highest/lowest percent of
# the range):
1 to 1,000 => 950 times
...
99,001 to 100,000 => 0 times

that's all.
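
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: it assumes that every transaction accesses exactly one aid, and the helper names are made up. It turns the printed decile percents into expected access counts per block of aid values.

```python
def decile_ranges(aid_count):
    """Split aid values 1..aid_count into ten consecutive (first, last) blocks."""
    size = aid_count // 10
    return [(k * size + 1, (k + 1) * size) for k in range(10)]


def expected_accesses(percents, transactions):
    """Expected number of accesses per block, given the percentage of draws
    in each block. ASSUMPTION: one aid access per transaction."""
    return [round(p / 100.0 * transactions) for p in percents]


if __name__ == "__main__":
    # Decile percents printed by pgbench for --exponential=10 in this thread.
    deciles = [63.2, 23.3, 8.6, 3.1, 1.2, 0.4, 0.2, 0.1, 0.0, 0.0]
    counts = expected_accesses(deciles, 10_000)
    for (first, last), n in zip(decile_ranges(100_000), counts):
        print("%7d to %7d => %5d times" % (first, last), n)
```

The same helper applies to the "highest/lowest percent of the range" line: 9.5% of 10,000 transactions gives 950 accesses to aid values 1 to 1,000.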

This information makes the distribution of access probability easy to
understand, doesn't it?
Perhaps Fabien-san and I have some background in mathematics, so we think
of decile percentages as common knowledge.
But if they are not, I agree with adding this explanation to the
documentation.

Best regards,
--
Mitsumasa KONDO

#18Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Mitsumasa KONDO (#17)

For example, when we set the number of transactions to 10,000 (-t 10000),
the range of aid is 100,000, and --exponential is 10, the decile percents
are as follows, as you know.

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

They mean the following.
# number of accesses per range of aid (from the decile percents):
1 to 10,000 => 6,320 times
10,001 to 20,000 => 2,330 times
20,001 to 30,000 => 860 times
...
90,001 to 100,000 => 0 times

# number of accesses per range of aid (from the highest/lowest percent of
# the range):
1 to 1,000 => 950 times
...
99,001 to 100,000 => 0 times

that's all.

This information makes the distribution of access probability easy to
understand, doesn't it?
Perhaps Fabien-san and I have some background in mathematics, so we think
of decile percentages as common knowledge.
But if they are not, I agree with adding this explanation to the
documentation.

What we are talking about is the "summary" at the end of the run, which is
expected to be compact, hence the terse few lines.

I'm not sure how to make it explicit without extending the summary too
much, so it would not be a summary anymore:-)

My initial assumption was that anyone interested enough in changing the
default uniform distribution for a test would know about deciles, but that
seems to be optimistic.

Maybe it would be okay to keep a terse summary but to expand the
documentation to explain what it means, as you suggested above...

--
Fabien.

#19Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Robert Haas (#15)

Please find attached 2 patches, which are a split of the patch discussed
in this thread.

(A) add gaussian & exponential options to pgbench \setrandom
the patch includes sql test files.

There is no change in the *code* from previous already reviewed
submissions, so I do not think that it needs another review on that
account.

However, I have (yet again) reworked the *documentation* (for Andres Freund
& Robert Haas); in particular, both descriptions now follow the same
structure (introduction, formula, intuition, rule of thumb and
constraint). I have differentiated the concept and the option by putting
the latter in <literal> tags, and added links to the corresponding
Wikipedia pages.

Please bear in mind that:
1. English is not my native language.
2. this is not easy reading... this is maths, to read slowly:-)
3. wordsmithing contributions are welcome.

I assume somehow that a user interested in gaussian & exponential
distributions must know a little bit about probabilities...

(B) add pgbench test variants with gauss & exponential.

I have reworked the patch so as to avoid copy-pasting the 3 test cases, as
requested by Andres Freund; thus this is new, although quite simple, code.
I have also added explanations in the documentation about how to interpret
the "decile" outputs, so as to hopefully address Robert Haas's comments.

--
Fabien.

Attachments:

gauss_A_2.patch (text/x-diff, +315 -13)
gauss_B_2.patch (text/x-diff, +205 -37)

#20Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Fabien COELHO (#19)

Please find attached 2 patches, which are a split of the patch discussed in
this thread.

Please find attached a very minor improvement, which applies a code
(variable name) simplification directly in patch A so as to avoid a change
in patch B. The cumulative patch is the same as the previous one.

(A) add gaussian & exponential options to pgbench \setrandom
the patch includes sql test files.

There is no change in the *code* from previous already reviewed submissions,
so I do not think that it needs another review on that account.

However, I have (yet again) reworked the *documentation* (for Andres Freund &
Robert Haas); in particular, both descriptions now follow the same structure
(introduction, formula, intuition, rule of thumb and constraint). I have
differentiated the concept and the option by putting the latter in <literal>
tags, and added links to the corresponding Wikipedia pages.

Please bear in mind that:
1. English is not my native language.
2. this is not easy reading... this is maths, to read slowly:-)
3. wordsmithing contributions are welcome.

I assume somehow that a user interested in gaussian & exponential
distributions must know a little bit about probabilities...

(B) add pgbench test variants with gauss & exponential.

I have reworked the patch so as to avoid copy-pasting the 3 test cases, as
requested by Andres Freund; thus this is new, although quite simple, code. I
have also added explanations in the documentation about how to interpret the
"decile" outputs, so as to hopefully address Robert Haas's comments.

--
Fabien.

Attachments:

gauss_A_3.patch (text/x-diff, +315 -13)
gauss_B_3.patch (text/x-diff, +195 -27)

#21Robert Haas
robertmhaas@gmail.com
In reply to: Fabien COELHO (#14)
#22Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Robert Haas (#21)
#23Mitsumasa KONDO
kondo.mitsumasa@gmail.com
In reply to: Fabien COELHO (#22)
#24Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Mitsumasa KONDO (#23)
#25Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Fabien COELHO (#24)
#26Mitsumasa KONDO
kondo.mitsumasa@gmail.com
In reply to: Fabien COELHO (#24)
#27Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fabien COELHO (#16)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Fabien COELHO (#22)
#29Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Robert Haas (#28)
#30Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Robert Haas (#28)
#31Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Fabien COELHO (#30)
#32Robert Haas
robertmhaas@gmail.com
In reply to: Fabien COELHO (#31)
#33Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Robert Haas (#32)
#34Mitsumasa KONDO
kondo.mitsumasa@gmail.com
In reply to: Fabien COELHO (#33)
#35Robert Haas
robertmhaas@gmail.com
In reply to: Fabien COELHO (#33)
#36Robert Haas
robertmhaas@gmail.com
In reply to: Mitsumasa KONDO (#34)
#37Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Robert Haas (#36)
#38Robert Haas
robertmhaas@gmail.com
In reply to: Fabien COELHO (#37)
#39Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Robert Haas (#38)
#40Mitsumasa KONDO
kondo.mitsumasa@gmail.com
In reply to: Fabien COELHO (#39)