ORDER BY random() LIMIT 1 slowness

Started by Gavin M. Royover 23 years ago17 messagesgeneral

gmr@justsportsusa.com

over 23 years ago

I have a query where i just want to randomly pick out one row of the
table. The query as I am running it looks like:

SELECT * FROM poetry ORDER BY random() LIMIT 1;

There are only roughly 35,000 rows of data and there is no way that I
have found to specify what is randomly being ordered, I assume it's
picking the primary key. The explain output looks like:

QUERY PLAN:
Limit (cost=49279.75..49279.75 rows=1 width=1062) (actual
time=8503.83..8503.84 rows=1 loops=1)
-> Sort (cost=49279.75..49365.68 rows=34375 width=1062) (actual
time=8503.82..8503.83 rows=2 loops=1)
Sort Key: random()
-> Seq Scan on poetry (cost=0.00..5029.75 rows=34375
width=1062) (actual time=0.12..2503.35 rows=34376 loops=1)
Total runtime: 8526.31 msec

I have different variations on the runtime, all between 5000 and 15000
msec. Anyone have any ideas on how to speed this up? I've thought
about doing a count query to get the total count and then use limit 1
offset #, where offset # is a number supplied by my program, but it
would be preferred to do this in query.

Thanks,

Gavin

Tom Lane

tgl@sss.pgh.pa.us

over 23 years ago

In reply to: Gavin M. Roy (#1)

Re: ORDER BY random() LIMIT 1 slowness

"Gavin M. Roy" <gmr@justsportsusa.com> writes:

SELECT * FROM poetry ORDER BY random() LIMIT 1;
[ is slow for 35000 rows ]

Yeah. Basically this query is implemented as
(a) select all 35000 rows of "poetry";
(b) compute a random() value for each row;
(c) sort by the random() values;
(d) take the first row, discard the rest.

The good news: this gives you a pretty-durn-random selection. The bad
news: you didn't really care about choosing a random ordering of the
other 34999 rows, but it computed one anyway.

This problem's been discussed before, but I've not seen any really
ideal answer. SQL is not designed to offer unpredictable results ;-)

If you have a reasonably-uniformly-distributed index key on the rows,
you can try something like

select * from poetry where key > (scale * random() + min)
order by key limit 1;

(where scale and min are chosen to make the result cover the range of
index keys). Given an index on "key" this should pick a result quickly
via an indexscan+limit plan. However, if the distribution of keys isn't
real uniform then you won't get real random results --- keys just above
larger gaps will be selected with unfairly high probability. You also
have to know the min and max keys accurately.

I recall some folks have suggested generating an actual random column
(ie, something initialized with "default random()" and suitably indexed)
but this seems like a lot of overhead ... you'd have to be mighty
concerned with the exactness of your random selection to put up with
that, I think.

There are some other suggestions in the archives, IIRC.

regards, tom lane

Jean-Luc Lachance

jllachan@nsd.ca

over 23 years ago

In reply to: Gavin M. Roy (#1)

Re: ORDER BY random() LIMIT 1 slowness

Gavin,

Assuming that you have a serial column rand on poetry and you did not
delete any row,
here is my suggestion:

CREATE TABLE poetry ( rand SERIAL, ... );

SELECT * FROM poetry WHERE rand = (
SELECT int8( curval( 'poetry_rand_seq') * random()));

JLL

Tom Lane wrote:

Show quoted text

"Gavin M. Roy" <gmr@justsportsusa.com> writes:

SELECT * FROM poetry ORDER BY random() LIMIT 1;
[ is slow for 35000 rows ]

Yeah. Basically this query is implemented as
(a) select all 35000 rows of "poetry";
(b) compute a random() value for each row;
(c) sort by the random() values;
(d) take the first row, discard the rest.

The good news: this gives you a pretty-durn-random selection. The bad
news: you didn't really care about choosing a random ordering of the
other 34999 rows, but it computed one anyway.

This problem's been discussed before, but I've not seen any really
ideal answer. SQL is not designed to offer unpredictable results ;-)

If you have a reasonably-uniformly-distributed index key on the rows,
you can try something like

select * from poetry where key > (scale * random() + min)
order by key limit 1;

(where scale and min are chosen to make the result cover the range of
index keys). Given an index on "key" this should pick a result quickly
via an indexscan+limit plan. However, if the distribution of keys isn't
real uniform then you won't get real random results --- keys just above
larger gaps will be selected with unfairly high probability. You also
have to know the min and max keys accurately.

I recall some folks have suggested generating an actual random column
(ie, something initialized with "default random()" and suitably indexed)
but this seems like a lot of overhead ... you'd have to be mighty
concerned with the exactness of your random selection to put up with
that, I think.

There are some other suggestions in the archives, IIRC.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Bruce Momjian

bruce@momjian.us

over 23 years ago

In reply to: Tom Lane (#2)

Re: ORDER BY random() LIMIT 1 slowness

Tom Lane <tgl@sss.pgh.pa.us> writes:

"Gavin M. Roy" <gmr@justsportsusa.com> writes:

SELECT * FROM poetry ORDER BY random() LIMIT 1;
[ is slow for 35000 rows ]

Yeah. Basically this query is implemented as
(a) select all 35000 rows of "poetry";
(b) compute a random() value for each row;
(c) sort by the random() values;
(d) take the first row, discard the rest.

If you can generate a random value from your application layer you could do

select * from poetry LIMIT 1 OFFSET <random value>

Can offset values be placeholders in prepared queries? If not then this has
that disadvantage.

--
greg

SZŰCS Gábor

surrano@mailbox.hu

over 23 years ago

In reply to: Gavin M. Roy (#1)

Re: ORDER BY random() LIMIT 1 slowness

----- Original Message -----
From: "Jean-Luc Lachance" <jllachan@nsd.ca>
Sent: Tuesday, December 17, 2002 5:04 PM

Gavin,

Assuming that you have a serial column rand on poetry and you did not
delete any row,
here is my suggestion:

CREATE TABLE poetry ( rand SERIAL, ... );

SELECT * FROM poetry WHERE rand = (
SELECT int8( curval( 'poetry_rand_seq') * random()));

Mmmm... It usually doesn't work for me. Isn't currval (NOTE: with two r's)
bound to session and has no meaning before the first call to nextval()?
7.2.1 says the following; has it changed in 7.3(.*)?

---------------------------- cut here ------------------------------
tir=> create sequence test_seq;
CREATE
tir=> select currval('test_seq');
ERROR: test_seq.currval is not yet defined in this session
tir=> select nextval('test_seq');
nextval
---------
1
(1 row)

tir=> select currval('test_seq');
currval
---------
1
(1 row)
---------------------------- cut here ------------------------------

G.
--
while (!asleep()) sheep++;

---------------------------- cut here ------------------------------

Tom Lane

tgl@sss.pgh.pa.us

over 23 years ago

In reply to: SZŰCS Gábor (#5)

Re: ORDER BY random() LIMIT 1 slowness

=?iso-8859-1?Q?SZUCS_G=E1bor?= <surrano@mailbox.hu> writes:

CREATE TABLE poetry ( rand SERIAL, ... );

SELECT * FROM poetry WHERE rand = (
SELECT int8( curval( 'poetry_rand_seq') * random()));

Mmmm... It usually doesn't work for me.

Yeah ... better would be

SELECT * FROM poetry WHERE rand = (
SELECT int8( (select last_value from poetry_rand_seq) * random()));

Personally though, I'd skip the sequence entirely and do

create table poetry (...,
rand float8 default random());
create index on poetry.rand

select * from poetry where rand > random() order by rand limit 1;

A difficulty with either of these approaches is that the system won't
optimize comparisons involving random() into indexscans. To get around
that, you'd have to hide the random() call inside a user-defined
function that is (bogusly) marked cachable (or in 7.3, "stable" would be
the best choice). At the moment I think it'd also work to stick the
random() call inside a subselect, but the UDF approach is less likely to
get broken by future changes.

regards, tom lane

Jean-Luc Lachance

jllachan@nsd.ca

over 23 years ago

In reply to: Gavin M. Roy (#1)

Re: ORDER BY random() LIMIT 1 slowness

Gabor,

You are right about the missing 'r', but I think you missed my point.
You should modify your table so that it has a serial field and reload
it.

JLL

P.S. I run 7.2 so ALTER TABLE ADD rand SERIAL; does not work, but it may
work under 7.3

SZUCS Gábor wrote:

Show quoted text

----- Original Message -----
From: "Jean-Luc Lachance" <jllachan@nsd.ca>
Sent: Tuesday, December 17, 2002 5:04 PM

Gavin,

Assuming that you have a serial column rand on poetry and you did not
delete any row,
here is my suggestion:

CREATE TABLE poetry ( rand SERIAL, ... );

SELECT * FROM poetry WHERE rand = (
SELECT int8( curval( 'poetry_rand_seq') * random()));

Mmmm... It usually doesn't work for me. Isn't currval (NOTE: with two r's)
bound to session and has no meaning before the first call to nextval()?
7.2.1 says the following; has it changed in 7.3(.*)?

---------------------------- cut here ------------------------------
tir=> create sequence test_seq;
CREATE
tir=> select currval('test_seq');
ERROR: test_seq.currval is not yet defined in this session
tir=> select nextval('test_seq');
nextval
---------
1
(1 row)

tir=> select currval('test_seq');
currval
---------
1
(1 row)
---------------------------- cut here ------------------------------

G.
--
while (!asleep()) sheep++;

---------------------------- cut here ------------------------------

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

SZŰCS Gábor

surrano@mailbox.hu

over 23 years ago

In reply to: Gavin M. Roy (#1)

Re: ORDER BY random() LIMIT 1 slowness

Dear Jean-Luc,

I don't think my simplified example missed any of your solution's features.
The essence, in my eyes, is that it has nothing to do with tables. It's only
related to sequences.

In short, you _cannot_ use currval() in any single _session_ until you use
nextval() in the same session, even if you created the sequence in the very
same session. Using a serial field in a table or using the sequence directly
is indifferent.

Or I'm missing something here.

As for Tom's solution:

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
Sent: Wednesday, December 18, 2002 4:56 PM

Personally though, I'd skip the sequence entirely and do

create table poetry (...,
rand float8 default random());
create index on poetry.rand

select * from poetry where rand > random() order by rand limit 1;

I'm not sure it's as flat as a random number should be. I have some relation
to mathematics but can't see it clearly right now. I fear it's more likely a
normal distribution, not linear (or whatsits called). But if I needed
something like this, I'd be happy with this solution anyway.

G.
--
while (!asleep()) sheep++;

---------------------------- cut here ------------------------------
----- Original Message -----
From: "Jean-Luc Lachance" <jllachan@nsd.ca>
Sent: Wednesday, December 18, 2002 5:55 PM

Gabor,

You are right about the missing 'r', but I think you missed my point.
You should modify your table so that it has a serial field and reload
it.

JLL

P.S. I run 7.2 so ALTER TABLE ADD rand SERIAL; does not work, but it may
work under 7.3

Jean-Luc Lachance

jllachan@nsd.ca

over 23 years ago

In reply to: Gavin M. Roy (#1)

Re: ORDER BY random() LIMIT 1 slowness

OK Gabor,

I'm the one who misunderstood.

To me, it seem to be a bug (or at least a mis-feature) that one cannot
call currval() before calling nextval().

Does anyone know why it should be like this?

JLL

SZUCS Gábor wrote:

Show quoted text

Dear Jean-Luc,

I don't think my simplified example missed any of your solution's features.
The essence, in my eyes, is that it has nothing to do with tables. It's only
related to sequences.

In short, you _cannot_ use currval() in any single _session_ until you use
nextval() in the same session, even if you created the sequence in the very
same session. Using a serial field in a table or using the sequence directly
is indifferent.

Or I'm missing something here.

As for Tom's solution:

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
Sent: Wednesday, December 18, 2002 4:56 PM

Personally though, I'd skip the sequence entirely and do

create table poetry (...,
rand float8 default random());
create index on poetry.rand

select * from poetry where rand > random() order by rand limit 1;

I'm not sure it's as flat as a random number should be. I have some relation
to mathematics but can't see it clearly right now. I fear it's more likely a
normal distribution, not linear (or whatsits called). But if I needed
something like this, I'd be happy with this solution anyway.

G.
--
while (!asleep()) sheep++;

---------------------------- cut here ------------------------------
----- Original Message -----
From: "Jean-Luc Lachance" <jllachan@nsd.ca>
Sent: Wednesday, December 18, 2002 5:55 PM

Gabor,

You are right about the missing 'r', but I think you missed my point.
You should modify your table so that it has a serial field and reload
it.

JLL

P.S. I run 7.2 so ALTER TABLE ADD rand SERIAL; does not work, but it may
work under 7.3

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#10

Alvaro Herrera

alvherre@dcc.uchile.cl

over 23 years ago

In reply to: Jean-Luc Lachance (#9)

Re: ORDER BY random() LIMIT 1 slowness

On Wed, Dec 18, 2002 at 02:09:42PM -0500, Jean-Luc Lachance wrote:

OK Gabor,

I'm the one who misunderstood.

To me, it seem to be a bug (or at least a mis-feature) that one cannot
call currval() before calling nextval().

Does anyone know why it should be like this?

It doesn't make sense to call currval() if you haven't called nextval()
before. The meaning of currval() is "the value that was last assigned
to you". If you haven't called nextval(), there isn't a value assigned
to you.

If you want to know what was the last value the sequence gave to anyway,
SELECT last_value FROM sequence. But be aware that this is
non-transaction safe, non-isolatable, non-anything.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Entristecido, Wutra
echa a Freyr a rodar
y a nosotros al mar" (cancion de Las Barreras)

#11

Jean-Luc Lachance

jllachan@nsd.ca

over 23 years ago

In reply to: Gavin M. Roy (#1)

Re: ORDER BY random() LIMIT 1 slowness

Alvara,

But instead of returning an error, currval() should return last_value if
nextval() was not called (with all the caveat of couse). I think it
would be more usefull that way.

JLL

Alvaro Herrera wrote:

Show quoted text

On Wed, Dec 18, 2002 at 02:09:42PM -0500, Jean-Luc Lachance wrote:

OK Gabor,

I'm the one who misunderstood.

To me, it seem to be a bug (or at least a mis-feature) that one cannot
call currval() before calling nextval().

Does anyone know why it should be like this?

It doesn't make sense to call currval() if you haven't called nextval()
before. The meaning of currval() is "the value that was last assigned
to you". If you haven't called nextval(), there isn't a value assigned
to you.

If you want to know what was the last value the sequence gave to anyway,
SELECT last_value FROM sequence. But be aware that this is
non-transaction safe, non-isolatable, non-anything.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Entristecido, Wutra
echa a Freyr a rodar
y a nosotros al mar" (cancion de Las Barreras)

#12

scott.marlowe

scott.marlowe@ihs.com

over 23 years ago

In reply to: Jean-Luc Lachance (#9)

Re: ORDER BY random() LIMIT 1 slowness

On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:

OK Gabor,

I'm the one who misunderstood.

To me, it seem to be a bug (or at least a mis-feature) that one cannot
call currval() before calling nextval().

Does anyone know why it should be like this?

First, read this page:

http://developer.postgresql.org/docs/postgres/functions-sequence.html

which answers a bit of that question.

the real issue with sequences is that in order to be transactionally safe,
they have to live outside of all transactions.

The purpose of the sequence manipulation functions is to interact with
sequence's in ways that ensure that the same sequence number is never used
by two different transactions. Let me illustrate with a pair of
concurrent transactions, A and B:

A: begin;
B: begin;
A: select currval('seq'); <- client stores this value
B: select currval('seq'); <- ditto
A: insert into table (name, id) values ('john',idnum);
B: insert into table (name, id) values ('sue',idnum);
A: commit;
B: commit;

See the problem with the above? It's why you can't use currval to get the
sequence number if you haven't called nextval, setval, or some other
fuction that has changed the sequence, and why using it will cause an
error. Let's fix the above queries:

(seq=20)

A: begin;
B: begin;
A: select nextval('seq'); <- client doesn't store this (but could)
B: select nextval('seq'); <- client stores 22
A: insert into table (name, id) values ('john',currval('seq'));
B: insert into table (name, id) values ('sue',idnum);
A: commit;
B: commit;

All is well.

Note that if A: were to roll back, B would still complete, but we would
have a hole in our sequence for number 21. this is normal. The price we
pay for having sequences be safe in transactions is that they live outside
of transactions, and the functions that provide the interface are what are
transactionally aware.

#13

scott.marlowe

scott.marlowe@ihs.com

over 23 years ago

In reply to: Jean-Luc Lachance (#11)

Re: ORDER BY random() LIMIT 1 slowness

On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:

Alvara,

But instead of returning an error, currval() should return last_value if
nextval() was not called (with all the caveat of couse). I think it
would be more usefull that way.

no, that would be like walking around with a gun pointed at your foot, to
quote Tom Lane.

See my post on transactions and such. Remember that everything in
Postgresql is designed to make transactions safe. currval working without
a nextval or setval before it is dangerous in the exterme to transactions.

#14

Jean-Luc Lachance

jllachan@nsd.ca

over 23 years ago

In reply to: scott.marlowe (#13)

Re: ORDER BY random() LIMIT 1 slowness

Scott,

I understand it all.

If a programmer understand that currval() return the last_(used)_value
and did not himself call nextval() he should be aware of the caveat.

I did not want to make a big fuss of it. I will just use select
last_value myself since I am already aware of the caveat. :)

JLL

"scott.marlowe" wrote:

Show quoted text

On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:

Alvara,

But instead of returning an error, currval() should return last_value if
nextval() was not called (with all the caveat of couse). I think it
would be more usefull that way.

no, that would be like walking around with a gun pointed at your foot, to
quote Tom Lane.

See my post on transactions and such. Remember that everything in
Postgresql is designed to make transactions safe. currval working without
a nextval or setval before it is dangerous in the exterme to transactions.

#15

scott.marlowe

scott.marlowe@ihs.com

over 23 years ago

In reply to: Jean-Luc Lachance (#14)

Re: ORDER BY random() LIMIT 1 slowness

But the problem is that we'd immediately be innundated on the
pgsql-general list by people who were using currval() in an unsafe way but
didn't know it until they went into production and watched their
application develop serious issues. I.e. they have a right to expect the
sequence functions to operate in a transaction safe manner. It's not
something that is likely to ever change, which is, imnsho, a good thing.

On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:

Show quoted text

Scott,

I understand it all.

If a programmer understand that currval() return the last_(used)_value
and did not himself call nextval() he should be aware of the caveat.

I did not want to make a big fuss of it. I will just use select
last_value myself since I am already aware of the caveat. :)

JLL

"scott.marlowe" wrote:

On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:

Alvara,

But instead of returning an error, currval() should return last_value if
nextval() was not called (with all the caveat of couse). I think it
would be more usefull that way.

no, that would be like walking around with a gun pointed at your foot, to
quote Tom Lane.

See my post on transactions and such. Remember that everything in
Postgresql is designed to make transactions safe. currval working without
a nextval or setval before it is dangerous in the exterme to transactions.

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

#16

Simon Mitchell

pgsql@jseb.com

over 23 years ago

In reply to: scott.marlowe (#15)

Re: ORDER BY random() LIMIT 1 slowness

Hi,
If you need the last gobal value of the sequence just query the
sequence directly.

select last_value, increment_by from xyz_seq;
last_value | increment_by
------------+--------------
355 | 1
(1 row)
or just

select last_value from xyz_seq;
last_value
------------
355
(1 row)

I think that sequences are great in postgresql after going from oracle
to mysql to postgresql.

Regards,
Simon

scott.marlowe wrote:

Show quoted text

But the problem is that we'd immediately be innundated on the
pgsql-general list by people who were using currval() in an unsafe way
but didn't know it until they went into production and watched their
application develop serious issues. I.e. they have a right to expect
the sequence functions to operate in a transaction safe manner. It's
not something that is likely to ever change, which is, imnsho, a good
thing.

On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:

Scott,

I understand it all.

If a programmer understand that currval() return the last_(used)_value
and did not himself call nextval() he should be aware of the caveat.

I did not want to make a big fuss of it. I will just use select
last_value myself since I am already aware of the caveat. :)

JLL

"scott.marlowe" wrote:

On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:

Alvara,

But instead of returning an error, currval() should return
last_value if
nextval() was not called (with all the caveat of couse). I think it
would be more usefull that way.

no, that would be like walking around with a gun pointed at your
foot, to
quote Tom Lane.

See my post on transactions and such. Remember that everything in
Postgresql is designed to make transactions safe. currval working
without
a nextval or setval before it is dangerous in the exterme to
transactions.

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

Import Notes

Resolved by subject fallback

#17

Jean-Luc Lachance

jllachan@nsd.ca

over 23 years ago

In reply to: scott.marlowe (#15)

Re: ORDER BY random() LIMIT 1 slowness

Have another suggestion then -- add lastval().

We would have:

nextval() transaction safe
currval() transaction safe (nextval() must be called first)
lastval() transaction unsafe

JLL

"scott.marlowe" wrote:

Show quoted text

But the problem is that we'd immediately be innundated on the
pgsql-general list by people who were using currval() in an unsafe way but
didn't know it until they went into production and watched their
application develop serious issues. I.e. they have a right to expect the
sequence functions to operate in a transaction safe manner. It's not
something that is likely to ever change, which is, imnsho, a good thing.

On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:

Scott,

I understand it all.

If a programmer understand that currval() return the last_(used)_value
and did not himself call nextval() he should be aware of the caveat.

I did not want to make a big fuss of it. I will just use select
last_value myself since I am already aware of the caveat. :)

JLL

"scott.marlowe" wrote:

On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:

Alvara,

But instead of returning an error, currval() should return last_value if
nextval() was not called (with all the caveat of couse). I think it
would be more usefull that way.

no, that would be like walking around with a gun pointed at your foot, to
quote Tom Lane.

See my post on transactions and such. Remember that everything in
Postgresql is designed to make transactions safe. currval working without
a nextval or setval before it is dangerous in the exterme to transactions.

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org