[PATCH] random_normal function
Just a utility function to generate random numbers from a normal
distribution. I find myself doing this several times a year, and I am
sure I must not be the only one.
random_normal(stddev float8 DEFAULT 1.0, mean float8 DEFAULT 0.0)
Attachments:
random_normal_01.patchapplication/octet-stream; name=random_normal_01.patchDownload+104-8
On Thu, Dec 8, 2022 at 2:53 PM Paul Ramsey <pramsey@cleverelephant.ca>
wrote:
random_normal(stddev float8 DEFAULT 1.0, mean float8 DEFAULT 0.0)
Any particular justification for placing stddev before mean? A brief
survey seems to indicate other libraries, as well as (at least for me)
learned convention, has the mean be supplied first, then the standard
deviation. The implementation/commentary seems to use that convention as
well.
Some suggestions:
/* Apply optional user parameters */ - that isn't important or even what is
happening though, and the body of the function shouldn't care about the
source of the values for the variables it uses.
Instead:
/* Transform the normal standard variable (z) using the target normal
distribution parameters */
Personally I'd probably make that even more explicit:
+ float8 z
...
* z = pg_prng_double_normal(&drandom_seed)
+ /* ... */
* result = (stddev * z) + mean
And a possible micro-optimization...
+ bool rescale = true
+ if (PG_NARGS() = 0)
+ rescale = false
...
+ if (rescale)
... result = (stddev * z) + mean
+ else
+ result = z
David J.
On Thu, Dec 08, 2022 at 01:53:23PM -0800, Paul Ramsey wrote:
Just a utility function to generate random numbers from a normal
distribution. I find myself doing this several times a year, and I am
sure I must not be the only one.random_normal(stddev float8 DEFAULT 1.0, mean float8 DEFAULT 0.0)
+++ b/src/backend/catalog/system_functions.sql
@@ -620,6 +620,13 @@ CREATE OR REPLACE FUNCTION
STABLE PARALLEL SAFE
AS 'sql_localtimestamp';
+CREATE OR REPLACE FUNCTION
+ random_normal(stddev float8 DEFAULT 1.0, mean float8 DEFAULT 0.0)
+RETURNS float8
+LANGUAGE INTERNAL
+STRICT VOLATILE PARALLEL SAFE
+AS 'make_interval';
I guess make_interval is a typo ?
This is causing it to fail tests:
http://cfbot.cputube.org/paul-ramsey.html
BTW you can run the same tests as CFBOT does from your own github
account; see:
/messages/by-id/20221116232507.GO26337@telsasoft.com
--
Justin
On Dec 8, 2022, at 2:40 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
On Thu, Dec 8, 2022 at 2:53 PM Paul Ramsey <pramsey@cleverelephant.ca> wrote:
random_normal(stddev float8 DEFAULT 1.0, mean float8 DEFAULT 0.0)
Any particular justification for placing stddev before mean? A brief survey seems to indicate other libraries, as well as (at least for me) learned convention, has the mean be supplied first, then the standard deviation. The implementation/commentary seems to use that convention as well.
No, I'm not sure what was going through my head, but I'm sure it made sense at the time (maybe something like "people will tend to jimmy with the stddev more frequently than the mean"). I've reversed the order
Some suggestions:
Thanks! Taken :)
And a possible micro-optimization...
+ bool rescale = true + if (PG_NARGS() = 0) + rescale = false ... + if (rescale) ... result = (stddev * z) + mean + else + result = z
Feels a little too micro to me (relative to the overall cost of the transform from uniform to normal distribution). I'm going to leave it out unless you violently want it.
Revised patch attached.
Thanks!
P
Attachments:
random_normal_02.patchapplication/octet-stream; name=random_normal_02.patch; x-unix-mode=0644Download+103-8
On Dec 8, 2022, at 2:46 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
I guess make_interval is a typo ?
This is causing it to fail tests:
http://cfbot.cputube.org/paul-ramsey.html
Yep, dumb typo, thanks! This bot is amazeballs, thank you!
P.
Revised patch attached.
And again, because I think I missed one change in the last one.
Attachments:
random_normal_03.patchapplication/octet-stream; name=random_normal_03.patch; x-unix-mode=0644Download+105-8
On Dec 8, 2022, at 3:25 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
Revised patch attached.
And again, because I think I missed one change in the last one.
<random_normal_03.patch>
Final tme, with fixes from cirrusci.
Attachments:
random_normal_04.patchapplication/octet-stream; name=random_normal_04.patch; x-unix-mode=0644Download+106-8
On Thu, Dec 08, 2022 at 04:44:56PM -0800, Paul Ramsey wrote:
Final tme, with fixes from cirrusci.
Well, why not. Seems like you would use that a lot with PostGIS.
#include <math.h> /* for ldexp() */
+#include <float.h> /* for DBL_EPSILON */
And be careful with the order here.
+static void
+drandom_check_default_seed()
We always use (void) rather than empty parenthesis sets.
I would not leave that unchecked, so I think that you should add
something in ramdom.sql. Or would you prefer switching some of
the regression tests be switched so as they use the new normal
function?
(Ahem. Bonus points for a random_string() returning a bytea, based on
pg_strong_random().)
--
Michael
On Dec 8, 2022, at 8:29 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Dec 08, 2022 at 04:44:56PM -0800, Paul Ramsey wrote:
Final tme, with fixes from cirrusci.
Well, why not. Seems like you would use that a lot with PostGIS.
#include <math.h> /* for ldexp() */
+#include <float.h> /* for DBL_EPSILON */
And be careful with the order here.
Should be ... alphabetical?
+static void +drandom_check_default_seed() We always use (void) rather than empty parenthesis sets.
OK
I would not leave that unchecked, so I think that you should add
something in ramdom.sql. Or would you prefer switching some of
the regression tests be switched so as they use the new normal
function?
Reading through those tests... seems like they will (rarely) fail. Is that... OK?
The tests seem to be mostly worried that random() starts returning constants, which seems like a good thing to test for (is the random number generating returning randomness).
An obvious test for this function is that the mean and stddev converge on the supplied parameters, given enough inputs, which is actually kind of the opposite test. I use the same random number generator as the uniform distribution, so that aspect is already covered by the existing tests.
(Ahem. Bonus points for a random_string() returning a bytea, based on
pg_strong_random().)
Would love to. Separate patch of bundled into this one?
P
Show quoted text
--
Michael
On Dec 8, 2022, at 1:53 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
Just a utility function to generate random numbers from a normal
distribution. I find myself doing this several times a year, and I am
sure I must not be the only one.
Thanks for the patch. What do you think about these results?
+-- The semantics of a negative stddev are not well defined
+SELECT random_normal(mean := 0, stddev := -1);
+ random_normal
+---------------------
+ -1.0285744583010896
+(1 row)
+
+SELECT random_normal(mean := 0, stddev := '-Inf');
+ random_normal
+---------------
+ Infinity
+(1 row)
+
+-- This result may be defensible...
+SELECT random_normal(mean := '-Inf', stddev := 'Inf');
+ random_normal
+---------------
+ -Infinity
+(1 row)
+
+-- but if so, why is this NaN?
+SELECT random_normal(mean := 'Inf', stddev := 'Inf');
+ random_normal
+---------------
+ NaN
+(1 row)
+
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Dec 9, 2022, at 10:39 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
On Dec 8, 2022, at 1:53 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
Just a utility function to generate random numbers from a normal
distribution. I find myself doing this several times a year, and I am
sure I must not be the only one.Thanks for the patch. What do you think about these results?
Angels on pins time! :)
+-- The semantics of a negative stddev are not well defined +SELECT random_normal(mean := 0, stddev := -1); + random_normal +--------------------- + -1.0285744583010896 +(1 row)
Question is does a negative stddev make enough sense? It is functionally using fabs(stddev),
SELECT avg(random_normal(mean := 0, stddev := -1)) from generate_series(1,1000);
avg
---------------------
0.03156106778729526
So could toss an invalid parameter on negative? Not sure if that's more helpful than just being mellow about this input.
+ +SELECT random_normal(mean := 0, stddev := '-Inf'); + random_normal +--------------- + Infinity +(1 row)
The existing logic around means and stddevs and Inf is hard to tease out:
SELECT avg(v),stddev(v) from (VALUES ('Inf'::float8, '-Inf'::float8)) a(v);
avg | stddev
----------+--------
Infinity |
The return of NULL of stddev would seem to argue that null in this case means "does not compute" at some level. So return NULL on Inf stddev?
+ +-- This result may be defensible... +SELECT random_normal(mean := '-Inf', stddev := 'Inf'); + random_normal +--------------- + -Infinity +(1 row) + +-- but if so, why is this NaN? +SELECT random_normal(mean := 'Inf', stddev := 'Inf'); + random_normal +--------------- + NaN +(1 row)
An Inf mean only implies that one value in the distribution is Inf, but running the function in reverse (generating values) and only generating one value from the distribution implies we have to always return Inf (except in this case stddev is also Inf, so I'd go with NULL, assuming we accept the NULL premise above.
How do you read the tea leaves?
P.
On 12/9/22 13:51, Paul Ramsey wrote:
On Dec 9, 2022, at 10:39 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
On Dec 8, 2022, at 1:53 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
Just a utility function to generate random numbers from a normal
distribution. I find myself doing this several times a year, and I am
sure I must not be the only one.Thanks for the patch. What do you think about these results?
Angels on pins time! :)
I just noticed this thread -- what is lacking in the normal_rand()
function in the tablefunc contrib?
https://www.postgresql.org/docs/current/tablefunc.html#id-1.11.7.52.5
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Dec 9, 2022, at 11:01 AM, Joe Conway <mail@joeconway.com> wrote:
On 12/9/22 13:51, Paul Ramsey wrote:
On Dec 9, 2022, at 10:39 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
On Dec 8, 2022, at 1:53 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
Just a utility function to generate random numbers from a normal
distribution. I find myself doing this several times a year, and I am
sure I must not be the only one.Thanks for the patch. What do you think about these results?
Angels on pins time! :)
I just noticed this thread -- what is lacking in the normal_rand() function in the tablefunc contrib?
https://www.postgresql.org/docs/current/tablefunc.html#id-1.11.7.52.5
Simplicity I guess mostly. random_normal() has a direct analogue in random() which is also a core function. I mean it could equally be pointed out that a user can implement their own Box-Muller calculation pretty trivially. Part of this submission is a personal wondering to what extent the community values convenience vs composibility. The set-returning nature of normal_rand() may be a bit of a red herring to people who just want one value (even though normal_rand (1, 0.0, 1.0) does exactly what they want).
P.
On Dec 9, 2022, at 11:10 AM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
On Dec 9, 2022, at 11:01 AM, Joe Conway <mail@joeconway.com> wrote:
On 12/9/22 13:51, Paul Ramsey wrote:
On Dec 9, 2022, at 10:39 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
On Dec 8, 2022, at 1:53 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
Just a utility function to generate random numbers from a normal
distribution. I find myself doing this several times a year, and I am
sure I must not be the only one.Thanks for the patch. What do you think about these results?
Angels on pins time! :)
I just noticed this thread -- what is lacking in the normal_rand() function in the tablefunc contrib?
https://www.postgresql.org/docs/current/tablefunc.html#id-1.11.7.52.5
Simplicity I guess mostly. random_normal() has a direct analogue in random() which is also a core function. I mean it could equally be pointed out that a user can implement their own Box-Muller calculation pretty trivially. Part of this submission is a personal wondering to what extent the community values convenience vs composibility. The set-returning nature of normal_rand() may be a bit of a red herring to people who just want one value (even though normal_rand (1, 0.0, 1.0) does exactly what they want).
No related to the "reason to exist", but normal_rand() has some interesting behaviour under Mark's test cases!
select normal_rand (1, 'Inf', 'Inf'), a from generate_series(1,2) a;
normal_rand | a
-------------+---
NaN | 1
Infinity | 2
(2 rows)
On Dec 9, 2022, at 9:17 AM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
On Dec 8, 2022, at 8:29 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Dec 08, 2022 at 04:44:56PM -0800, Paul Ramsey wrote:
Final tme, with fixes from cirrusci.
Well, why not. Seems like you would use that a lot with PostGIS.
#include <math.h> /* for ldexp() */
+#include <float.h> /* for DBL_EPSILON */
And be careful with the order here.Should be ... alphabetical?
+static void +drandom_check_default_seed() We always use (void) rather than empty parenthesis sets.OK
I would not leave that unchecked, so I think that you should add
something in ramdom.sql. Or would you prefer switching some of
the regression tests be switched so as they use the new normal
function?Reading through those tests... seems like they will (rarely) fail. Is that... OK?
The tests seem to be mostly worried that random() starts returning constants, which seems like a good thing to test for (is the random number generating returning randomness).
An obvious test for this function is that the mean and stddev converge on the supplied parameters, given enough inputs, which is actually kind of the opposite test. I use the same random number generator as the uniform distribution, so that aspect is already covered by the existing tests.(Ahem. Bonus points for a random_string() returning a bytea, based on
pg_strong_random().)Would love to. Separate patch of bundled into this one?
Here's the original with suggestions applied and a random_string that applies on top of it.
Thanks!
P
On Dec 9, 2022, at 3:20 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
On Dec 9, 2022, at 9:17 AM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
On Dec 8, 2022, at 8:29 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Dec 08, 2022 at 04:44:56PM -0800, Paul Ramsey wrote:
Final tme, with fixes from cirrusci.
Well, why not. Seems like you would use that a lot with PostGIS.
#include <math.h> /* for ldexp() */
+#include <float.h> /* for DBL_EPSILON */
And be careful with the order here.Should be ... alphabetical?
+static void +drandom_check_default_seed() We always use (void) rather than empty parenthesis sets.OK
I would not leave that unchecked, so I think that you should add
something in ramdom.sql. Or would you prefer switching some of
the regression tests be switched so as they use the new normal
function?Reading through those tests... seems like they will (rarely) fail. Is that... OK?
The tests seem to be mostly worried that random() starts returning constants, which seems like a good thing to test for (is the random number generating returning randomness).
An obvious test for this function is that the mean and stddev converge on the supplied parameters, given enough inputs, which is actually kind of the opposite test. I use the same random number generator as the uniform distribution, so that aspect is already covered by the existing tests.(Ahem. Bonus points for a random_string() returning a bytea, based on
pg_strong_random().)Would love to. Separate patch of bundled into this one?
Here's the original with suggestions applied and a random_string that applies on top of it.
Thanks!
P
Clearing up one CI failure.
On Tue, Dec 13, 2022 at 03:51:11PM -0800, Paul Ramsey wrote:
Clearing up one CI failure.
+-- normal values converge on stddev == 2.0
+SELECT round(stddev(random_normal(2, 2)))
+ FROM generate_series(1, 10000);
I am not sure that it is a good idea to make a test based on a random
behavior that should tend to a normalized value. This is costly in
cycles, requiring a lot of work just for generate_series(). You could
do the same kind of thing as random() a few lines above?
+SELECT bool_and(random_string(16) != random_string(16)) AS same
+ FROM generate_series(1,8);
That should be fine in terms of impossible chances :)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("return size must be non-negative")))
This could have a test, same for 0.
+#ifndef M_PI
+#define M_PI 3.14159265358979323846
+#endif
Postgres' float.h includes one version of that.
--
Michael
On Dec 14, 2022, at 9:17 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Dec 13, 2022 at 03:51:11PM -0800, Paul Ramsey wrote:
Clearing up one CI failure.
+-- normal values converge on stddev == 2.0 +SELECT round(stddev(random_normal(2, 2))) + FROM generate_series(1, 10000);I am not sure that it is a good idea to make a test based on a random
behavior that should tend to a normalized value. This is costly in
cycles, requiring a lot of work just for generate_series(). You could
do the same kind of thing as random() a few lines above?+SELECT bool_and(random_string(16) != random_string(16)) AS same + FROM generate_series(1,8); That should be fine in terms of impossible chances :)+ ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("return size must be non-negative"))) This could have a test, same for 0.+#ifndef M_PI +#define M_PI 3.14159265358979323846 +#endif Postgres' float.h includes one version of that.
Thanks again!
P
Hello Paul,
My 0.02€ about the patch:
Patch did not apply with "git apply", I had to "patch -p1 <" and there was
a bunch of warnings.
Patches compile and make check is okay.
The first patch adds empty lines at the end of "sql/random.sql", I think
that it should not.
# About random_normal:
I'm fine with providing a random_normal implementation at prng and SQL
levels.
There is already such an implementation in "pgbench.c", which outputs
integers, I'd suggest that it should also use the new prng function, there
should not be two box-muller transformations in pg.
# About pg_prng_double_normal:
On the comment, I'd write "mean + stddev * val" instead of starting with
the stddev part.
Usually there is an empty line between the variable declarations and the
first statement.
There should be a comment about why it needs u1
larger than some epsilon. This constraint seems to generate a small bias?
I'd suggest to add the UNLIKELY() compiler hint on the loop.
# About random_string:
Should the doc about random_string tell that the output bytes are expected
to be uniformly distributed? Does it return "random values" or "pseudo
random values"?
I do not understand why the "drandom_string" function is in "float.c", as
it is not really related to floats. Also it does not return a string but a
bytea, so why call it "_string" in the first place? I'm do not think that
it should use pg_strong_random which may be very costly on some platform.
Also, pg_strong_random does not use the seed, so I do not understand why
it needs to be checked. I'd suggest that the output should really be
uniform pseudo-random, possibly based on the drandom() state, or maybe
not.
Overall, I think that there should be a clearer discussion and plan about
which random functionS postgres should provide to complement the standard
instead of going there… randomly:-)
--
Fabien.
On Sat, Dec 17, 2022 at 05:49:15PM +0100, Fabien COELHO wrote:
Overall, I think that there should be a clearer discussion and plan about
which random functionS postgres should provide to complement the standard
instead of going there… randomly:-)
So, what does the specification tells about seeds, normal and random
functions? A bunch of DBMSs implement RAND, sometimes RANDOM, SEED or
even NORMAL using from time to time specific SQL keywords to do the
work.
Note that SQLValueFunction made the addition of more returning data
types a bit more complicated (not much, still) than the new
COERCE_SQL_SYNTAX by going through a mapping function, so the
keyword/function mapping is straight-forward.
--
Michael