Random not so random

Started by Arnau Rebassa, over 21 years ago, 26 messages, general

Arnau Rebassa

arebassa@hotmail.com

over 21 years ago

Hi everybody,

I'm doing the following query:

select * from messages order by random() limit 1;

In the messages table I have more than 200 messages, and very often the
message retrieved is the same. Does anybody know how I could get a more
"random" random?

Thank you very much

--
Arnau

Jean-Luc Lachance

jllachan@sympatico.ca

over 21 years ago

In reply to: Arnau Rebassa (#1)

Re: Random not so random

Use a SERIAL id on messages, then

Select * from messages
where id = int8( random() * currval({sequence_name}));


Bruce Momjian

bruce@momjian.us

over 21 years ago

In reply to: Arnau Rebassa (#1)

Re: Random not so random

"Arnau Rebassa" <arebassa@hotmail.com> writes:

select * from messages order by random() limit 1;

in the table messages I have more than 200 messages and a lot of times, the
message retrieved is the same. Anybody knows how I could do a more "random"
random?

What OS is this? Postgres is just using your OS's random()/srandom() calls. On
some platforms these may be poorly implemented and not very random.

However of the various choices available I think random/srandom are a good
choice. I'm surprised you're finding it not very random.

Incidentally, are you reconnecting every time or is it that multiple calls in
a single session are returning the same record? It ought not make a difference
as Postgres is careful to seed the random number generator with something
reasonable though.

In a quick test of my own on Linux with glibc 2.3.2.ds1 (no, I have no idea
what the ds1 means), it seems fairly random to me:

test=> create table test4 as (select (select case when b.b then a else a end from test order by random() limit 1) as b from b limit 1000);
SELECT
test=> select count(*),b from test4 group by b;
 count | b
-------+---
   210 | 5
   195 | 4
   183 | 3
   203 | 2
   209 | 1
(5 rows)

And the same thing holds if I test just the low order bits too:

--
greg

Arnau Rebassa

arebassa@hotmail.com

over 21 years ago

In reply to: Bruce Momjian (#3)

Re: Random not so random

Hi Greg,

What OS is this? Postgres is just using your OS's random()/srandom() calls.
On some platforms these may be poorly implemented and not very random.

I'm using a debian linux as OS with a 2.4 kernel running on it.

Incidentally, are you reconnecting every time or is it that multiple calls
in a single session are returning the same record?

I'm reconnecting each time I want to retrieve a message. The idea is that I
have a library of messages and I want to pick one of them at random. I don't
know whether it is possible to seed the random number generator
manually. Does anybody know?

Thanks to all

--
Arnau

Neil Conway

neilc@samurai.com

over 21 years ago

In reply to: Arnau Rebassa (#4)

Re: Random not so random

Arnau Rebassa wrote:

I don't know if there is the possibility to seed the random number
generator manually, anybody knows it?

setseed()

-Neil

Tom Lane

tgl@sss.pgh.pa.us

over 21 years ago

In reply to: Arnau Rebassa (#4)

Re: Random not so random

"Arnau Rebassa" <arebassa@hotmail.com> writes:

I'm using a debian linux as OS with a 2.4 kernel running on it.

Incidentally, are you reconnecting every time or is it that multiple calls
in a single session are returning the same record?

I'm reconnecting each time I want to retrieve a message.

Hmm. postmaster.c does this during startup of each backend process:

gettimeofday(&now, &tz);
srandom((unsigned int) now.tv_usec);

which would ordinarily be fairly good at mixing things up. On some
platforms I might worry that the microseconds part of gettimeofday
might only have a few bits of accuracy, but I don't think that's an
issue on Linux.

It occurs to me that you might be seeing predictability as an indirect
result of something else you are doing that somehow tends to synchronize
the backend start times. Are you connecting from a cron script that
would tend to be launched at the same relative instant within a second?

It might improve matters to make the code do something like

srandom((unsigned int) (now.tv_sec ^ now.tv_usec));

regards, tom lane
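
A minimal sketch of the difference between the two seeds discussed above. It is not the postmaster.c code: the function names `seed_usec_only` and `seed_sec_xor_usec` are made up for illustration, and both take the timestamp as a parameter so the effect can be checked without depending on the clock. Two backends started at the same microsecond offset in different seconds (as a cron-driven client might do) get the same seed from the first form and different seeds from the second.

```c
#include <sys/time.h>

/* Seed as quoted from postmaster.c in this thread: microseconds only,
 * so there are at most 1,000,000 distinct seeds. */
unsigned int seed_usec_only(const struct timeval *now)
{
    return (unsigned int) now->tv_usec;
}

/* Suggested variant: fold the seconds into the seed as well. */
unsigned int seed_sec_xor_usec(const struct timeval *now)
{
    return (unsigned int) (now->tv_sec ^ now->tv_usec);
}
```

Calling both functions on two `struct timeval` values that share `tv_usec` but differ in `tv_sec` shows the collision in the first form and its absence in the second.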

D. Stimits

stimits@comcast.net

over 21 years ago

In reply to: Tom Lane (#6)

Re: Random not so random

Tom Lane wrote:

"Arnau Rebassa" <arebassa@hotmail.com> writes:

I'm using a debian linux as OS with a 2.4 kernel running on it.

Incidentally, are you reconnecting every time or is it that multiple calls
in a single session are returning the same record?

I'm reconnecting each time I want to retrieve a message.

Hmm. postmaster.c does this during startup of each backend process:

gettimeofday(&now, &tz);
srandom((unsigned int) now.tv_usec);

If it uses the same seed from the connection, then all randoms within a
connect that has not reconnected will use the same seed. Which means the
same sequence will be generated each time, which is why it is
pseudo-random and not random. For it to be random not just the first
call of a new connection, but among all calls of new connection, it
would have to seed it based on time at the moment of query and not at
the moment of connect. A pseudo-random generator using the same seed
will generate the same sequence.

D. Stimits, stimits AT comcast DOT net

Bruno Wolff III

bruno@wolff.to

over 21 years ago

In reply to: Tom Lane (#6)

Re: Random not so random

On Mon, Oct 04, 2004 at 10:14:19 -0400,
Tom Lane <tgl@sss.pgh.pa.us> wrote:

It occurs to me that you might be seeing predictability as an indirect
result of something else you are doing that somehow tends to synchronize
the backend start times. Are you connecting from a cron script that
would tend to be launched at the same relative instant within a second?

It might improve matters to make the code do something like

srandom((unsigned int) (now.tv_sec ^ now.tv_usec));

Using /dev/urandom, where available, might be another option. However, some
people may not want their entropy pool getting 4 bytes used up on every
connection start up.
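
A rough sketch of what seeding from /dev/urandom "where available" could look like, with a clock-based fallback. This is an illustration only, not PostgreSQL code; `seed_from_urandom` and `choose_seed` are hypothetical names. It uses unbuffered `read(2)` so that only the four bytes mentioned above are taken from the device.

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/time.h>

/* Returns 1 and stores sizeof(unsigned int) bytes from /dev/urandom in
 * *seed on success; returns 0 if the device cannot be opened or the
 * read comes up short. */
int seed_from_urandom(unsigned int *seed)
{
    int fd = open("/dev/urandom", O_RDONLY);
    ssize_t got;

    if (fd < 0)
        return 0;
    got = read(fd, seed, sizeof *seed);
    close(fd);
    return got == (ssize_t) sizeof *seed;
}

/* Prefer /dev/urandom; otherwise fall back to the time-based seed. */
unsigned int choose_seed(void)
{
    unsigned int seed;
    struct timeval now;

    if (seed_from_urandom(&seed))
        return seed;
    gettimeofday(&now, NULL);
    return (unsigned int) (now.tv_sec ^ now.tv_usec);
}
```

A process would call `choose_seed()` once at startup and pass the result to `srandom()`.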

Marco Colombo

pgsql@esiway.net

over 21 years ago

In reply to: Tom Lane (#6)

Re: Random not so random

On Mon, 4 Oct 2004, Tom Lane wrote:

"Arnau Rebassa" <arebassa@hotmail.com> writes:

I'm using a debian linux as OS with a 2.4 kernel running on it.

Incidentally, are you reconnecting every time or is it that multiple calls
in a single session are returning the same record?

I'm reconnecting each time I want to retrieve a message.

Hmm. postmaster.c does this during startup of each backend process:

gettimeofday(&now, &tz);
srandom((unsigned int) now.tv_usec);

which would ordinarily be fairly good at mixing things up. On some
platforms I might worry that the microseconds part of gettimeofday
might only have a few bits of accuracy, but I don't think that's an
issue on Linux.

It occurs to me that you might be seeing predictability as an indirect
result of something else you are doing that somehow tends to synchronize
the backend start times. Are you connecting from a cron script that
would tend to be launched at the same relative instant within a second?

It might improve matters to make the code do something like

srandom((unsigned int) (now.tv_sec ^ now.tv_usec));

How about reading from /dev/urandom on platforms that support it?

Actually, that should be done each time the random() function
is evaluated. (I have no familiarity with the code, so please
bear with me if the suggestion is unsound). I'd even add a parameter
for "really" random data to be provided, by reading /dev/random
instead of /dev/urandom (but read(2) may block).

How about the following:
random() = random(0) = traditional random()
random(1) = best effort random() via /dev/urandom
random(2) = wait for really random bits via /dev/random

.TM.
--
Marco Colombo
Technical Manager
ESI s.r.l.
Colombo@ESI.it

#10

Bruno Wolff III

bruno@wolff.to

over 21 years ago

In reply to: Marco Colombo (#9)

Re: Random not so random

On Mon, Oct 04, 2004 at 18:58:41 +0200,
Marco Colombo <pgsql@esiway.net> wrote:

Actually, that should be done each time the random() function
is evaluated. (I have no familiarity with the code, so please

That may be overkill, since I don't think that random has been advertised
as a secure or even particularly strong random number generator.

bear with me if the suggestion is unsound). I'd even add a parameter
for "really" random data to be provided, by reading /dev/random
instead of /dev/urandom (but read(2) may block).

You don't want to use /dev/random. You aren't going to get better random
numbers that way and blocking reads is a big problem.

How about the following:
random() = random(0) = traditional random()
random(1) = best effort random() via /dev/urandom
random(2) = wait for really random bits via /dev/random

It might be nice to have a secure random function available in postgres.
Just using /dev/urandom is probably good enough to provide this service.

#11

Tom Lane

tgl@sss.pgh.pa.us

over 21 years ago

In reply to: D. Stimits (#7)

Re: Random not so random

"D. Stimits" <stimits@comcast.net> writes:

Tom Lane wrote:

Hmm. postmaster.c does this during startup of each backend process:

gettimeofday(&now, &tz);
srandom((unsigned int) now.tv_usec);

If it uses the same seed from the connection, then all randoms within a
connect that has not reconnected will use the same seed. Which means the
same sequence will be generated each time, which is why it is
pseudo-random and not random. For it to be random not just the first
call of a new connection, but among all calls of new connection, it
would have to seed it based on time at the moment of query and not at
the moment of connect. A pseudo-random generator using the same seed
will generate the same sequence.

Did you read what I said? Or experiment?

regards, tom lane

#12

Marco Colombo

marco@esi.it

over 21 years ago

In reply to: Bruno Wolff III (#10)

Re: Random not so random

On Mon, 4 Oct 2004, Bruno Wolff III wrote:

On Mon, Oct 04, 2004 at 18:58:41 +0200,
Marco Colombo <pgsql@esiway.net> wrote:

Actually, that should be done each time the random() function
is evaluated. (I have no familiarity with the code, so please

That may be overkill, since I don't think that random has been advertised
as a secure or even particularly strong random number generator.

bear with me if the suggestion is unsound). I'd even add a parameter
for "really" random data to be provided, by reading /dev/random
instead of /dev/urandom (but read(2) may block).

You don't want to use /dev/random. You aren't going to get better random
numbers that way and blocking reads is a big problem.

Sure you are. As long as the entropy pool isn't empty, /dev/random
won't block, and thus there's no difference in behaviour.
When you're short of random bits, /dev/random blocks, /dev/urandom
falls back to a PRNG + hash (I think SHA1). Under these conditions,
/dev/urandom output has 0 "entropy" at all: an attacker can predict
the output after short observation provided that he can break SHA1.
That is, anything that uses /dev/urandom (when the kernel pool is
empty) is just as safe as SHA1 is.

I agree that for a general purpose 'good' random() function,
/dev/urandom is enough (as opposed to a plain-old PRNG).
In some applications, you may need the extra security provided
by /dev/random: its output (_when_ it is available) is always
truly random (as long as you trust the kernel, of course - there
have been bugs in the past in Linux about overestimating the randomness
of certain sources, but they've been corrected AFAIK).

How about the following:
random() = random(0) = traditional random()
random(1) = best effort random() via /dev/urandom
random(2) = wait for really random bits via /dev/random

It might be nice to have a secure random function available in postgres.
Just using /dev/urandom is probably good enough to provide this service.

Why not all of them? The problem is how to handle a potentially
blocking read in /dev/random (actually _any_ disk read may block
as well). Just warn people not to use random(2) unless they really
know what they're doing...

I don't think the read syscall overhead is noticeable (in Linux at least).
But for sure we can't afford to _open_ /dev/urandom each time...
backends will have to keep an extra fd open just for /dev/urandom... hmm...
I can't think of any better way of doing that.

.TM.
--
Marco Colombo
Technical Manager
ESI s.r.l.
Colombo@ESI.it

#13

Harald Fuchs

hf0722x@protecting.net

over 21 years ago

In reply to: Arnau Rebassa (#4)

Re: Random not so random

In article <20041004155742.GA8488@wolff.to>,
Bruno Wolff III <bruno@wolff.to> writes:

On Mon, Oct 04, 2004 at 10:14:19 -0400,
Tom Lane <tgl@sss.pgh.pa.us> wrote:

It occurs to me that you might be seeing predictability as an indirect
result of something else you are doing that somehow tends to synchronize
the backend start times. Are you connecting from a cron script that
would tend to be launched at the same relative instant within a second?

It might improve matters to make the code do something like

srandom((unsigned int) (now.tv_sec ^ now.tv_usec));

Using /dev/urandom, where available, might be another option. However, some
people may not want their entropy pool getting 4 bytes used up on every
connection start up.

I think we don't need the randomness provided by /dev/[u]random. How
about XORing in getpid?

#14

Michael Fuhr

mike@fuhr.org

over 21 years ago

In reply to: Harald Fuchs (#13)

Re: Random not so random

On Tue, Oct 05, 2004 at 02:39:13PM +0200, Harald Fuchs wrote:

I think we don't need the randomness provided by /dev/[u]random. How
about XORing in getpid?

What about making the seeding mechanism and perhaps random()'s
behavior configurable?

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

#15

Tom Lane

tgl@sss.pgh.pa.us

over 21 years ago

In reply to: Harald Fuchs (#13)

Re: Random not so random

Harald Fuchs <hf0722x@protecting.net> writes:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

It might improve matters to make the code do something like
srandom((unsigned int) (now.tv_sec ^ now.tv_usec));

I think we don't need the randomness provided by /dev/[u]random. How
about XORing in getpid?

That sounds like a fine compromise --- it'll ensure a reasonable-size
set of possible seeds, it's at least marginally less predictable than
now.tv_sec, and it's perfectly portable. No one in their right mind
expects random(3) to be cryptographically secure anyway, so doing more
doesn't seem warranted.

The various proposals to create a more-secure, less-portable variant
of random() don't seem appropriate to me for beta. But I'd not object
to someone whipping up a contrib module for 8.1 or beyond.

regards, tom lane
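
A short sketch of the compromise described above: XOR the process ID into the time-based seed. It is an illustration under assumptions, not the change that was committed; `mix_seed` and `seed_backend` are hypothetical names. The mixing step is kept as a pure function so its effect is easy to check: two processes started at exactly the same instant still get different seeds because their PIDs differ.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600   /* exposes srandom() under strict -std=c99 */
#endif
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Combine seconds, microseconds and PID into one seed value. */
unsigned int mix_seed(long sec, long usec, long pid)
{
    return (unsigned int) (sec ^ usec ^ pid);
}

/* Seed the process-wide generator once, at process start. */
void seed_backend(void)
{
    struct timeval now;

    gettimeofday(&now, NULL);
    srandom(mix_seed((long) now.tv_sec, (long) now.tv_usec, (long) getpid()));
}
```

As noted in the thread, this widens the set of possible seeds but does not make random() suitable for cryptographic use.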

#16

D. Stimits

stimits@comcast.net

over 21 years ago

In reply to: Arnau Rebassa (#4)

Re: Random not so random

Vivek Khera wrote:

"DS" == D Stimits <stimits@comcast.net> writes:

DS> If it uses the same seed from the connection, then all randoms within
DS> a connect that has not reconnected will use the same seed. Which means
DS> the same sequence will be generated each time, which is why it is
DS> pseudo-random and not random. For it to be random not just the first
DS> call of a new connection, but among all calls of new connection, it
DS> would have to seed it based on time at the moment of query and not at
DS> the moment of connect. A pseudo-random generator using the same seed
DS> will generate the same sequence.

You clearly demonstrate you do not understand the purpose of a seed in
a PRNG, nor how PRNG's in general work. You want to seed it once and
only once per process, not every time you issue a query. And nobody
said to use the same seed every time, either.

Sorry, at the time I don't believe PRNG was part of the conversation,
that came after (maybe I missed that). What I saw were functions based
on one-way hashes...is that incorrect? For one-way hashes and
pseudo-random generators using some form of hash a different seed should
be used or else the pattern will be the same when using the same data.
From the man page on srandom():

The srandom() function sets its argument as the seed for a new sequence
of pseudo-random integers to be returned by random(). These sequences
are repeatable by calling srandom() with the same seed value. If no seed
value is provided, the random() function is automatically seeded with a
value of 1.

The srandom() caught my eye in the earlier email, not PRNG. You are
welcome to re-use srandom() without a new seed if you want.

D. Stimits, stimits AT comcast DOT net

#17

Dann Corbit

DCorbit@connx.com

over 21 years ago

In reply to: D. Stimits (#16)

Re: Random not so random

A better way would be to seed a Mersenne Twister PRNG at server startup
time and then use the same generator for all subsequent calls.
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html

The period is exceptionally long, and it has many excellent properties.
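
For readers unfamiliar with the generator, here is a compact sketch of the 32-bit Mersenne Twister, assuming the standard MT19937 parameters. It is a condensed illustration rather than the reference code linked above; the names `mt_seed` and `mt_next` are chosen here. Like any PRNG it is fully determined by its seed, so how the seed is chosen remains a separate concern.

```c
#include <stdint.h>

#define MT_N 624
#define MT_M 397

static uint32_t mt_state[MT_N];
static int mt_index = MT_N;

/* Initialise the state from a 32-bit seed. Must be called before mt_next(). */
void mt_seed(uint32_t seed)
{
    mt_state[0] = seed;
    for (int i = 1; i < MT_N; i++) {
        uint32_t prev = mt_state[i - 1];
        mt_state[i] = 1812433253u * (prev ^ (prev >> 30)) + (uint32_t) i;
    }
    mt_index = MT_N;   /* force a twist on the first draw */
}

/* Regenerate all 624 state words in place. */
static void mt_twist(void)
{
    for (int i = 0; i < MT_N; i++) {
        uint32_t y = (mt_state[i] & 0x80000000u)
                   | (mt_state[(i + 1) % MT_N] & 0x7fffffffu);
        uint32_t next = mt_state[(i + MT_M) % MT_N] ^ (y >> 1);
        if (y & 1u)
            next ^= 0x9908b0dfu;
        mt_state[i] = next;
    }
    mt_index = 0;
}

/* Return the next 32-bit output, applying the tempering transform. */
uint32_t mt_next(void)
{
    uint32_t y;

    if (mt_index >= MT_N)
        mt_twist();
    y = mt_state[mt_index++];
    y ^= y >> 11;
    y ^= (y << 7) & 0x9d2c5680u;
    y ^= (y << 15) & 0xefc60000u;
    y ^= y >> 18;
    return y;
}
```

Seeding with 5489, the conventional default seed for MT19937, gives 3499211612 as the first output, which is a convenient check that the constants were copied correctly.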

#18

Dann Corbit

DCorbit@connx.com

over 21 years ago

In reply to: Dann Corbit (#17)

Re: Random not so random

Here is a tarball for MT:
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/CODES/mt19937ar.tgz

It has a BSD license. What could be better?

#19

Vivek Khera

khera@kcilink.com

over 21 years ago

In reply to: Arnau Rebassa (#4)

Re: Random not so random

"DS" == D Stimits <stimits@comcast.net> writes:

DS> If it uses the same seed from the connection, then all randoms within
DS> a connect that has not reconnected will use the same seed. Which means
DS> the same sequence will be generated each time, which is why it is
DS> pseudo-random and not random. For it to be random not just the first
DS> call of a new connection, but among all calls of new connection, it
DS> would have to seed it based on time at the moment of query and not at
DS> the moment of connect. A pseudo-random generator using the same seed
DS> will generate the same sequence.

You clearly demonstrate you do not understand the purpose of a seed in
a PRNG, nor how PRNG's in general work. You want to seed it once and
only once per process, not every time you issue a query. And nobody
said to use the same seed every time, either.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D. Khera Communications, Inc.
Internet: khera@kciLink.com Rockville, MD +1-301-869-4449 x806
AIM: vivekkhera Y!: vivek_khera http://www.khera.org/~vivek/

#20

Harald Fuchs

hf0722x@protecting.net

over 21 years ago

In reply to: Dann Corbit (#17)

Re: Random not so random

In article <D425483C2C5C9F49B5B7A41F89441547055523@postal.corporate.connx.com>,
"Dann Corbit" <DCorbit@connx.com> writes:

A better way would be to seed a Mersenne Twister PRNG at server startup
time and then use the same generator for all subsequent calls.
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html

The period is exceptionally long, and it has many excellent properties.

I think you're slightly missing the point. Every PRNG always returns
the same output sequence for the same seed, and that's the problem
with the current implementation: it might happen that two backends get
the same seed.
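
The point can be demonstrated with a small sketch. It is a simplification: `first_pick` is a hypothetical helper, and taking `random() % nrows` stands in for the effect of `order by random() limit 1`, which actually sorts on one random value per row. What it shows is that the first draw of a freshly seeded process depends only on the seed, so two backends that start with the same seed make the same first choice.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600   /* exposes srandom()/random() under strict -std=c99 */
#endif
#include <stdlib.h>

/* Row index that a newly started process would pick on its first draw,
 * given the seed it was started with. */
long first_pick(unsigned int seed, long nrows)
{
    srandom(seed);
    return random() % nrows;
}
```

With a client that reconnects for every query, each connection only ever makes this first draw, which is why the quality of the per-backend seed matters more here than the quality of the generator.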

#21

Bruno Wolff III

bruno@wolff.to

over 21 years ago

In reply to: Marco Colombo (#12)

#22

Michael Fuhr

mike@fuhr.org

over 21 years ago

In reply to: Michael Fuhr (#14)

#23

Marco Colombo