pgbench randomness initialization

Started by Andres Freundabout 10 years ago34 messageshackers
Jump to latest
#1Andres Freund
andres@anarazel.de

Hi,

pondering
http://archives.postgresql.org/message-id/CA%2BTgmoZJdA6K7-17K4A48rVB0UPR98HVuaNcfNNLrGsdb1uChg%40mail.gmail.com
et al I was wondering why it's a good idea for pgbench to do
INSTR_TIME_SET_CURRENT(start_time);
srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));
to initialize randomness and then
for (i = 0; i < nthreads; i++)
thread->random_state[0] = random();
thread->random_state[1] = random();
thread->random_state[2] = random();
to initialize the individual thread random state which is then used by
pg_erand48().

To me it seems better to instead initialize srandom() with a known value
(say, uh, 0). Or even better don't use random() at all, and fill a
global pg_erand48() with a known state; and use pg_erand48() to
initialize the thread states.

Obviously that doesn't make pgbench entirely reproducible, but it seems
a lot better than now. Individual threads would do work in a
reproducible order.

I see very little reason to have the current behaviour, or at the very
least not by default.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Andres Freund (#1)
Re: pgbench randomness initialization

Hello Andres,

et al I was wondering why it's a good idea for pgbench to do
INSTR_TIME_SET_CURRENT(start_time);
srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));
to initialize randomness and then
for (i = 0; i < nthreads; i++)
thread->random_state[0] = random();
thread->random_state[1] = random();
thread->random_state[2] = random();
to initialize the individual thread random state which is then used by
pg_erand48().

To me it seems better to instead initialize srandom() with a known value
(say, uh, 0). Or even better don't use random() at all, and fill a
global pg_erand48() with a known state; and use pg_erand48() to
initialize the thread states.

Obviously that doesn't make pgbench entirely reproducible, but it seems
a lot better than now. Individual threads would do work in a
reproducible order.

I see very little reason to have the current behaviour, or at the very
least not by default.

I think that it depends on what you want, which may vary:

(1) "exactly" reproducible runs, but one run may hit a particular
steady state not representative of what happens in general.

(2) runs which really vary from one to the next, so as
to have an idea about how much it may vary, what is the
performance stability.

Currently pgbench focusses on (2), which may or may not be fine depending
on what you are doing. From a personal point of view I think that (2) is
more significant to collect performance data, even if the results are more
unstable: that simply reflects reality and its intrinsic variations, so
I'm fine that as the default.

Now for those interested in (1) for some reason, I would suggest to rely a
PGBENCH_RANDOM_SEED environment variable or --random-seed option which
could be used to have a oxymoronic "deterministic randomness", if desired.
I do not think that it should be the default, though.

--
Fabien.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Andres Freund
andres@anarazel.de
In reply to: Fabien COELHO (#2)
Re: pgbench randomness initialization

On 2016-04-07 11:56:12 +0200, Fabien COELHO wrote:

(2) runs which really vary from one to the next, so as
to have an idea about how much it may vary, what is the
performance stability.

I don't think this POV makes all that much sense. If you do something
non-comparable, then the results aren't, uh, comparable. Which also
means there's a lower chance to reproduce observed problems.

Currently pgbench focusses on (2), which may or may not be fine depending on
what you are doing. From a personal point of view I think that (2) is more
significant to collect performance data, even if the results are more
unstable: that simply reflects reality and its intrinsic variations, so I'm
fine that as the default.

Uh, and what's the benefit of that variability? pgbench isn't a reality
simulation tool, it's a benchmarking tool. And benchmarks with intrisinc
variability are bad benchmarks.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Andres Freund (#3)
Re: pgbench randomness initialization

(2) runs which really vary from one to the next, so as
to have an idea about how much it may vary, what is the
performance stability.

I don't think this POV makes all that much sense. If you do something
non-comparable, then the results aren't, uh, comparable. Which also
means there's a lower chance to reproduce observed problems.

That also means that you are likely not to hit them if you always do the
very same run...

Moreover, the Monte Carlo method requires randomness for its convergence
result.

Currently pgbench focusses on (2), which may or may not be fine depending on
what you are doing. From a personal point of view I think that (2) is more
significant to collect performance data, even if the results are more
unstable: that simply reflects reality and its intrinsic variations, so I'm
fine that as the default.

Uh, and what's the benefit of that variability? pgbench isn't a reality
simulation tool, it's a benchmarking tool. And benchmarks with intrisinc
variability are bad benchmarks.

From a statistical perspective, one run does not mean anything. If you do
the exact same run over and over again, then all mathematical results
about (slow) convergence towards the average are lost. This is like trying
to survey a population by asking the questions to the same person over and
over: the result will be biased.

Now when you develop, which is the use case you probably have in mind, you
want to compare two pg version and check for the performance impact, so
having the exact same run seems like a proxy to quickly check for that.

However, from a stastistical perspective this is just heresy: you may do a
change which improves one given run at the expense of all possible others
and you would not know it: Say for instance that there are two different
behaviors depending on something, then you will check against one of them
only.

So I have no mathematical doubt that changing the seed is the right
default setting, thus I think that the current behavior is fine. However
I'm okay if someone wants to control the randomness for some reason (maybe
having "less sure" results, but quickly), so it could be allowed somehow.

--
Fabien.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Andres Freund
andres@anarazel.de
In reply to: Fabien COELHO (#4)
Re: pgbench randomness initialization

On 2016-04-07 12:25:58 +0200, Fabien COELHO wrote:

(2) runs which really vary from one to the next, so as
to have an idea about how much it may vary, what is the
performance stability.

I don't think this POV makes all that much sense. If you do something
non-comparable, then the results aren't, uh, comparable. Which also
means there's a lower chance to reproduce observed problems.

That also means that you are likely not to hit them if you always do the
very same run...

If you run the test for longer... Or explicitly iterate over IVs. At the
very least we need to make pgbench output the IV used, to have some
chance of repeating tests.

Uh, and what's the benefit of that variability? pgbench isn't a reality
simulation tool, it's a benchmarking tool. And benchmarks with intrisinc
variability are bad benchmarks.

From a statistical perspective, one run does not mean anything. If you do
the exact same run over and over again, then all mathematical results about
(slow) convergence towards the average are lost. This is like trying to
survey a population by asking the questions to the same person over and
over: the result will be biased.

That comparison pretty much invalidates any point you're making, it's
that bad.

Now when you develop, which is the use case you probably have in mind, you
want to compare two pg version and check for the performance impact, so
having the exact same run seems like a proxy to quickly check for that.

It's not about "quickly" checking for something. If you look at the
results in thread mentioned in the OP, the order of operations
drastically and *PERSISTENTLY* changes the observations. Causing *days*
of work lost.

However, from a stastistical perspective this is just heresy: you may do a
change which improves one given run at the expense of all possible others
and you would not know it: Say for instance that there are two different
behaviors depending on something, then you will check against one of them
only.

Meh. That assumes that we're doing a huge number of pgbench runs; but
usually people do maybe a handful. Tops. If you're trying to defend
against scenarios like that you need to design your tests so that you'll
encounter such problems by running longer.

So I have no mathematical doubt that changing the seed is the right default
setting, thus I think that the current behavior is fine. However I'm okay if
someone wants to control the randomness for some reason (maybe having "less
sure" results, but quickly), so it could be allowed somehow.

There might be some statistics arguments, but I think they're pretty
ignoring reality.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Andres Freund (#5)
Re: pgbench randomness initialization

Hello Andres,

If you run the test for longer... Or explicitly iterate over IVs. At the
very least we need to make pgbench output the IV used, to have some
chance of repeating tests.

Note that I'm not against providing a way to repeat tests "exactly", and I
have suggested two means: environment variable and/or option.

[...] That comparison pretty much invalidates any point you're making,
it's that bad.

At least it is simple, if simplistic.

Here is another one: I knew a financial institution which needed to
evaluate the VAR of exotic financial products every night. They relied on
MC for that. Alas, it was not converging quickly enough, results were
unstable, so they took your advice: they froze the seed. Day after day the
results were mostly the same, the VAR was stable one morning to the other,
the management is happy, the risks were under control... That was in the
mid 2000s:-)

However, from a stastistical perspective this is just heresy: you may do a
change which improves one given run at the expense of all possible others
and you would not know it: Say for instance that there are two different
behaviors depending on something, then you will check against one of them
only.

Meh. That assumes that we're doing a huge number of pgbench runs;

A number of, not necessarily "huge". Or averaging a lot of intermediate
values and having a hard look at the distribution, not just the final tps
number.

but usually people do maybe a handful. Tops. If you're trying to defend
against scenarios like that you need to design your tests so that you'll
encounter such problems by running longer.

People usually do a lot of things, does not mean that it is "right".

So I have no mathematical doubt that changing the seed is the right
default setting, thus I think that the current behavior is fine.
However I'm okay if someone wants to control the randomness for some
reason (maybe having "less sure" results, but quickly), so it could be
allowed somehow.

There might be some statistics arguments,

Yep, there is.

but I think they're pretty ignoring reality.

Hmmm. If reality wants to ignore mathematics, usually it looses, so this
will not be with my blessing. Note that as a committer you do not need me
to freeze the seed. I'm just providing an opinion backed by mathematical
proofs.

--
Fabien.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Robert Haas
robertmhaas@gmail.com
In reply to: Fabien COELHO (#2)
Re: pgbench randomness initialization

On Thu, Apr 7, 2016 at 5:56 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

I think that it depends on what you want, which may vary:

(1) "exactly" reproducible runs, but one run may hit a particular
steady state not representative of what happens in general.

(2) runs which really vary from one to the next, so as
to have an idea about how much it may vary, what is the
performance stability.

Currently pgbench focusses on (2), which may or may not be fine depending on
what you are doing. From a personal point of view I think that (2) is more
significant to collect performance data, even if the results are more
unstable: that simply reflects reality and its intrinsic variations, so I'm
fine that as the default.

Now for those interested in (1) for some reason, I would suggest to rely a
PGBENCH_RANDOM_SEED environment variable or --random-seed option which could
be used to have a oxymoronic "deterministic randomness", if desired.
I do not think that it should be the default, though.

I agree entirely. If performance is erratic, that's actually
something you want to discover during benchmarking. If different
pgbench runs (of non-trivial length) are producing substantially
different results, then that's really a problem we need to fix, not
just adjust pgbench to cover it up.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#7)
Re: pgbench randomness initialization

On 2016-04-07 08:58:16 -0400, Robert Haas wrote:

On Thu, Apr 7, 2016 at 5:56 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

I think that it depends on what you want, which may vary:

(1) "exactly" reproducible runs, but one run may hit a particular
steady state not representative of what happens in general.

(2) runs which really vary from one to the next, so as
to have an idea about how much it may vary, what is the
performance stability.

Currently pgbench focusses on (2), which may or may not be fine depending on
what you are doing. From a personal point of view I think that (2) is more
significant to collect performance data, even if the results are more
unstable: that simply reflects reality and its intrinsic variations, so I'm
fine that as the default.

Now for those interested in (1) for some reason, I would suggest to rely a
PGBENCH_RANDOM_SEED environment variable or --random-seed option which could
be used to have a oxymoronic "deterministic randomness", if desired.
I do not think that it should be the default, though.

I agree entirely. If performance is erratic, that's actually
something you want to discover during benchmarking. If different
pgbench runs (of non-trivial length) are producing substantially
different results, then that's really a problem we need to fix, not
just adjust pgbench to cover it up.

It's not about "covering it up"; it's about actually being able to take
action based on benchmark results, and about practically being able to
run benchmarks. The argument above means essentially that we need to run
a significant number of pgbench runs for *anything*, because running
them 3-5 times before/after just isn't meaningful enough.

It means that you can't separate between OS caused, and pgbench order
caused performance differences.

I agree that it's a horrid problem that we can get half the throughput
dependent on large machines, dependant on the ordering. But without
running queries in the same order before/after a patch there's no way to
validate whether $patch caused the problem. And no way to reliably
trigger problematic scenarios.

I also agree that it's important to be able to vary workloads. But if
you do so, you should do so in the same order, both pre/post a
patch. Afaics the prime use of pgbench is validation of the performance
effects of patches; therefore it should be usable for that, and it's
not.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#8)
Re: pgbench randomness initialization

On Thu, Apr 7, 2016 at 9:15 AM, Andres Freund <andres@anarazel.de> wrote:

It's not about "covering it up"; it's about actually being able to take
action based on benchmark results, and about practically being able to
run benchmarks. The argument above means essentially that we need to run
a significant number of pgbench runs for *anything*, because running
them 3-5 times before/after just isn't meaningful enough.

It means that you can't separate between OS caused, and pgbench order
caused performance differences.

I'm not objecting to adding an option for this; but I think Fabien is
right that it shouldn't be the default.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#5)
Re: pgbench randomness initialization

Andres Freund <andres@anarazel.de> writes:

On 2016-04-07 12:25:58 +0200, Fabien COELHO wrote:

So I have no mathematical doubt that changing the seed is the right default
setting, thus I think that the current behavior is fine. However I'm okay if
someone wants to control the randomness for some reason (maybe having "less
sure" results, but quickly), so it could be allowed somehow.

There might be some statistics arguments, but I think they're pretty
ignoring reality.

Sorry, but I think Fabien is right and you are wrong. There is no
point in having randomness in there at all if the thing is constrained
to generate the same "random" sequence every time.

I don't object to having an option to force the initial seed, but
it should not be the default, and it most certainly should not be
the only behavior.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#10)
Re: pgbench randomness initialization

On 2016-04-07 09:46:27 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2016-04-07 12:25:58 +0200, Fabien COELHO wrote:

So I have no mathematical doubt that changing the seed is the right default
setting, thus I think that the current behavior is fine. However I'm okay if
someone wants to control the randomness for some reason (maybe having "less
sure" results, but quickly), so it could be allowed somehow.

There might be some statistics arguments, but I think they're pretty
ignoring reality.

Sorry, but I think Fabien is right and you are wrong.

Given that it's 3:1 so far, you might be right...

There is no point in having randomness in there at all if the thing is
constrained to generate the same "random" sequence every time.

but that argument seems pretty absurd. It's obviously different to query
for different rows over a run, rather than querying the same row again
and again, in all backends. The reason we use randomness is to avoid
easily discernible patterns in querying. Without randomness, how would
you do that?

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Robert Haas (#9)
Re: pgbench randomness initialization

It means that you can't separate between OS caused, and pgbench order
caused performance differences.

I'm not objecting to adding an option for this; but I think Fabien is
right that it shouldn't be the default.

Yep.

Andres, attached is a simple POC with an option & environment variable
(whereas I should rather have looked at the current checkpointer/vacuum
issue which I have reproduced:-().

While testing it I had a funny pattern, something like:

pgbench --random-seed=123 -M prepared -T 3 -P 1 -S
1.0: 600 tps
2.0: 600 tps
3.0: 600 tps

First rerun just after:

pgbench --random-seed=123 -M prepared -T 3 -P 1 -S
1.0: 1800 tps
2.0: 600 tps
3.0: 600 tps

The first rerun hits the same pages, so the first 1800 transactions are
run in one second, and then it is new pages which are loaded so the
performance goes down.

Second rerun just after:

pgbench --random-seed=123 -M prepared -T 3 -P 1 -S
1.0: 1800 tps
2.0: 1400 tps
3.0: 600 tps

The second redun hits the same 3000 transactions than the previous one in
about 1.7 seconds, then goes back to 600 tps for new pages...

After more iterations the performance is 1800 tps during the 3 seconds.

This clearly illustrates that it should be used with caution.

--
Fabien.

Attachments:

pgbench-seed-1.patchtext/x-diff; charset=us-ascii; name=pgbench-seed-1.patchDownload+52-1
#13Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Fabien COELHO (#12)
Re: pgbench randomness initialization

Fabien COELHO wrote:

While testing it I had a funny pattern, something like:

pgbench --random-seed=123 -M prepared -T 3 -P 1 -S
1.0: 600 tps
2.0: 600 tps
3.0: 600 tps

The output should include the random seed used, whether it was passed
with --random-seed, environment variable or randomly determined. That
way, the user that later wants to verify why a particular run caused
some particular misbehavior knows what seed to use to reproduce that
run.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Alvaro Herrera (#13)
Re: [HACKERS] pgbench randomness initialization

Hello Alvaro,

I revive this patch because controlling the seed is useful for tap testing
pgbench.

The output should include the random seed used, whether it was passed
with --random-seed, environment variable or randomly determined. That
way, the user that later wants to verify why a particular run caused
some particular misbehavior knows what seed to use to reproduce that
run.

Yep.

Here is a new version which output use used seed when a seed is
explicitely set with an option or from the environment.

However, the default (current) behavior remains silent, so no visible
changes unless tinkering with it.

The patch also allows to use a "strong" random for seeding the PRNG,
thanks to pg_strong_random().

The tests assume that stdlib random/srandom behavior is standard thus
deterministic between platform.

--
Fabien.

Attachments:

pgbench-seed-2.patchtext/x-diff; name=pgbench-seed-2.patchDownload+91-7
#15Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Fabien COELHO (#14)
Re: [HACKERS] pgbench randomness initialization

Here is a new version which output use used seed when a seed is explicitely
set with an option or from the environment.

It is even better without xml typos, with simpler coding and the doc in
the right place... Sorry for the noise.

--
Fabien.

Attachments:

pgbench-seed-3.patchtext/x-diff; name=pgbench-seed-3.patchDownload+88-7
#16Chapman Flack
chap@anastigmatix.net
In reply to: Fabien COELHO (#15)
Re: Re: [HACKERS] pgbench randomness initialization

On 01/02/18 05:57, Fabien COELHO wrote:

Here is a new version which output use used seed when a seed is
explicitely set with an option or from the environment.

This is a simple patch that does what it says on the tin. I ran into
trouble with the pgbench TAP test *even before applying the patch*, but
only because I was doing a VPATH build as a user without 'write'
on the source tree (001_pgbench_with_server.pl tried to make pgbench
create log files there). Bad me. Oddly, that was the only test in the
whole tree to have such an issue, so here I add a pre-patch to fix that.
Now my review needs a review. :)

With that taken care of, the added tests all pass for me. I wonder, though:

The tests assume that stdlib random/srandom behavior is standard thus
deterministic between platform.

Is the behavior of srandom() and the system generator really so precisely
specified that seed 5432 will produce the same values hardcoded in the
tests on all platforms? It does on mine, but could it produce spurious
test failures for others? I guess the (or a) spec would be here:

http://pubs.opengroup.org/onlinepubs/7908799/xsh/initstate.html

It specifies a "non-linear additive feedback random-number generator
employing a default state array size of 31 long integers", but it does
not pin down parameters or claim only one candidate exists.

To have the test run pgbench twice with the same seed and compare the
results sounds safer.

This revised pgbench-seed-4.patch changes the code not at all, and
the TAP test only in whitespace. I did some wordsmithing of the doc,
which I hope was not presumptuous of me, only as a conversation starter.
I expanded the second sentence because on my first reading I wasn't quite
sure of its meaning. Once I had looked at the code, I could see that the
sentence was economical and precise already, but I tried to find a version
that would also have been clear to the me-before-I-looked.

The documentation doesn't say that --random-seed= (or PGBENCH_RANDOM_SEED=)
will have the same effect as 'time', and indeed, I really think it should
be unset (defaulting to 'time'), or 'time', or 'rand', or an integer,
or an error. The error, right now, says only "expecting an unsigned
integer"; it should also mention time and rand. Should 'rand' be something
that conveys more about its meaning, 'strong' perhaps?

The documentation doesn't mention the allowed range of the unsigned
integer (which I get, because 'unsigned int' is exactly the signature
for srandom, but somebody might read "unsigned integer" in a more
generic sense). I'm not sure what would be a better way to say it.
The base (only decimal, as now in the code) isn't specified either.

Maybe the documentation should mention that the output now includes the
random seed being used, so that (even without planning ahead) a session
can be re-run with the same seed if necessary. It could just say "an
unsigned integer in decimal, as it is shown in pgbench's output" or
something like that.

Something more may need to be said in the doc about reproducibility. I think
the real story is that a run can be reproduced if the number of clients is
equal to the number of threads. Although each thread has its own generator
state, each client does not (if there is more than one per thread), and as
soon as real select() delays start happening in CSTATE_WAIT_RESULT, the
clients dealt out to a given thread may not be hitting that thread's
generator in a deterministic order.

-Chap

Attachments:

pgbench-TAP-VPATH-1.patchtext/x-patch; name=pgbench-TAP-VPATH-1.patchDownload+5-1
pgbench-seed-4.patchtext/x-patch; name=pgbench-seed-4.patchDownload+90-8
#17Chapman Flack
chap@anastigmatix.net
In reply to: Chapman Flack (#16)
Re: pgbench randomness initialization

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested

Initial review is the message this is in reply to.

The new status of this patch is: Waiting on Author

#18Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Chapman Flack (#16)
Re: Re: [HACKERS] pgbench randomness initialization

Hello Chapman,

Thanks for the review,

The tests assume that stdlib random/srandom behavior is standard thus
deterministic between platform.

Is the behavior of srandom() and the system generator really so precisely
specified that seed 5432 will produce the same values hardcoded in the
tests on all platforms? [...]

Good question.

I'm hoping that in practice it would be the same, or that their would be
very few cases (eg linux vs windows vs macos...). I was counting on the
the buildfarm to tell me if I'm wrong, and fix it if needed.

To have the test run pgbench twice with the same seed and compare the
results sounds safer.

Interesting idea. The test script would be more complicated to do that,
though. I would prefer to bet on "random" determinism (:-) and resort to
such a solution only if it is not the case. Or maybe just put back some
"\d+" to keep it simple.

This is a debatable strategy.

I did some wordsmithing of the doc, which I hope was not presumptuous of
me, only as a conversation starter. [...]

Thanks for the doc improvements.

The documentation doesn't say that --random-seed= (or PGBENCH_RANDOM_SEED=)
will have the same effect as 'time', and indeed, I really think it should
be unset (defaulting to 'time'), or 'time', or 'rand', or an integer,
or an error.

Ok, done.

The error, right now, says only "expecting an unsigned integer"; it
should also mention time and rand.

Ok, done.

Should 'rand' be something that conveys more about its meaning, 'strong'
perhaps?

Hmmm. "Random means random":-). I have no opinion about whether it would
be better. ISTM that "strong" would require some explanations. I let is as
"rand" for now.

The documentation doesn't mention the allowed range of the unsigned
integer (which I get, because 'unsigned int' is exactly the signature
for srandom, but somebody might read "unsigned integer" in a more
generic sense).

Ok. I extended so that it works with octal, decimal and hexadecimal, and
updated the doc accordingly. I did not add range information though.

I'm not sure what would be a better way to say it.
The base (only decimal, as now in the code) isn't specified either.

Sure.

Maybe the documentation should mention that the output now includes the
random seed being used, so that (even without planning ahead) [...]

It does so only if the seed is explicitely set, otherwise it keeps the
previous behavior of being silent. I added a sentence about that.

Something more may need to be said in the doc about reproducibility. I think
the real story is that a run can be reproduced if the number of clients is
equal to the number of threads.

Yes, good point. I tried to hide the issue under the "as far as random
numbers are concerned". I've tried to improve this point in the doc.

Although each thread has its own generator state, each client does not
(if there is more than one per thread), and as soon as real select()
delays start happening in CSTATE_WAIT_RESULT, the clients dealt out to a
given thread may not be hitting that thread's generator in a
deterministic order.

Yep. This may evolve, for instance the error handling patch needs to
restart transactions so it adds a state into the client.

--
Fabien.

Attachments:

pgbench-seed-5.patchtext/x-diff; name=pgbench-seed-5.patchDownload+107-7
#19Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Chapman Flack (#16)
Re: Re: [HACKERS] pgbench randomness initialization

This is a simple patch that does what it says on the tin. I ran into
trouble with the pgbench TAP test *even before applying the patch*, but
only because I was doing a VPATH build as a user without 'write'
on the source tree (001_pgbench_with_server.pl tried to make pgbench
create log files there). Bad me. Oddly, that was the only test in the
whole tree to have such an issue, so here I add a pre-patch to fix that.
Now my review needs a review. :)

Yep. I find the multiple chdir solution a little bit too extreme.

ISTM that it should rather add the correct path to --log-prefix by
prepending $node->basedir, like the pgbench function does for -f scripts.

See attached.

--
Fabien.

Attachments:

pgbench-vpath-2.patchtext/x-diff; name=pgbench-vpath-2.patchDownload+6-4
#20Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Fabien COELHO (#18)
Re: Re: [HACKERS] pgbench randomness initialization

Here is a rebase, plus some more changes:

I have improved the error message to tell from where the value was
provided.

I have removed the test to the exact values produced from the expression
test run.

I have added a test which run from the same seed value several times
and checks that the output values are the same.

--
Fabien.

Attachments:

pgbench-seed-6.patchtext/x-diff; name=pgbench-seed-6.patchDownload+148-10
#21Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Fabien COELHO (#20)
#22Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Fabien COELHO (#21)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fabien COELHO (#19)
#24Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Tom Lane (#23)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fabien COELHO (#24)
#26Chapman Flack
chap@anastigmatix.net
In reply to: Fabien COELHO (#22)
#27Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Chapman Flack (#26)
#28Chapman Flack
chap@anastigmatix.net
In reply to: Fabien COELHO (#27)
#29Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Chapman Flack (#28)
#30Chapman Flack
chap@anastigmatix.net
In reply to: Fabien COELHO (#29)
#31Teodor Sigaev
teodor@sigaev.ru
In reply to: Fabien COELHO (#29)
#32Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Teodor Sigaev (#31)
#33Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Fabien COELHO (#32)
#34Teodor Sigaev
teodor@sigaev.ru
In reply to: Fabien COELHO (#33)