Performance of ORDER BY RANDOM to select random rows?

Started by Victor Hooiover 12 years ago5 messagesgeneral
Jump to latest
#1Victor Hooi
victorhooi@yahoo.com

Hi,

I have a Django application where we need to pull random rows out of a
table.

According to the Django documentation:

https://docs.djangoproject.com/en/dev/ref/models/querysets/#order-by

Note: order_by('?') queries may be expensive and slow, depending on the

database backend you’re using.

My understanding is that order_by('?') in the Django ORM will generate a
query with ORDER BY RANDOM.

This blog post:

http://www.peterbe.com/plog/getting-random-rows-postgresql-django

also seems to suggest that using ORDER BY RANDOM() will perform poorly on
Postgres.

I'm just wondering if this is still the case?

I just ran those benchmarks on my system (Postgres 9.2.4), and using ORDERY
BY RANDOM did not seem substantially to generating random integers in
Python and picking those out (and handling non-existent rows).

Has Postgres's behaviour for ORDER BY RANDOM change sometime recently?

Cheers,
Victor

In reply to: Victor Hooi (#1)
Re: Performance of ORDER BY RANDOM to select random rows?

On Thu, Aug 08, 2013 at 12:01:17PM +1000, Victor Hooi wrote:

I'm just wondering if this is still the case?

Yes. Order by random() is and, most likely, will be slow. Not sure if
there is any engine that could make it fast.

I just ran those benchmarks on my system (Postgres 9.2.4), and using ORDERY
BY RANDOM did not seem substantially to generating random integers in
Python and picking those out (and handling non-existent rows).

I think you accidentally a word.

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Sergey Konoplev
gray.ru@gmail.com
In reply to: Victor Hooi (#1)
Re: Performance of ORDER BY RANDOM to select random rows?

On Wed, Aug 7, 2013 at 7:01 PM, Victor Hooi <victorhooi@yahoo.com> wrote:

also seems to suggest that using ORDER BY RANDOM() will perform poorly on
Postgres.

I'm just wondering if this is still the case?

I just ran those benchmarks on my system (Postgres 9.2.4), and using ORDERY
BY RANDOM did not seem substantially to generating random integers in Python
and picking those out (and handling non-existent rows).

Has Postgres's behaviour for ORDER BY RANDOM change sometime recently?

Unfortunately, It has not. However, there always is a workaround. You
can get a random results fast by WITH RECURSIVE query.

WITH RECURSIVE r AS (
WITH b AS (SELECT min(id), max(id) FROM table1)
(
SELECT id, min, max, array[]::integer[] AS a, 0 AS n
FROM table1, b
WHERE id > min + (max - min) * random()
LIMIT 1
) UNION ALL (
SELECT t.id, min, max, a || t.id, r.n + 1 AS n
FROM table1 AS t, r
WHERE
t.id > min + (max - min) * random() AND
t.id <> all(a) AND
r.n + 1 < 10
LIMIT 1
)
)
SELECT t.id FROM table1 AS t, r WHERE r.id = t.id;

The general idea is that we get a random value between min(id) and
max(id) and then get the first row with id bigger than this value.
Then we repeat until we get 10 of such rows, checking that this id has
not been retrieved earlier.

Surely, the probability of appearing one or another value in the
result depends on the distribution of id values in the table, but in
the most cases I faced it works good.

I had an idea to play with pg_stats.histogram_bounds to work around
the described issue, but it was never so critical for tasks I solved.

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Victor Hooi
victorhooi@yahoo.com
In reply to: Sergey Konoplev (#3)
Re: Performance of ORDER BY RANDOM to select random rows?

Hi,

@Hubert/@Sergey - Thanks for your response.

Hmm, aha, so the ORDER BY RANDOM behaviour hasn't changed - just to confirm
- this means that Postgres will duplicate the table, add a new column,
generate random numbers for every record, then sort by that new column,
right?

I've just read the above anecdotally on the internet, but I'm curious if
the actual implementation is documented somewhere officially apart from the
source? Running the query through EXPLAIN didn't seem to tell me much
additional information.

@Sergey - Thanks for the tip about using WITH RECURSIVE. I'm actually doing
something similar in my application code in Django - basically take the max
id, then generate a random integer between 0 and max id. However, it is
dependent on how evenly distributed the record IDs are - in our case, if we
delete a large number of records, it might affect things.

Are there any other database backends that do things differently?

(I know that SQL Server suggests using NEWID to do things -
http://msdn.microsoft.com/en-us/library/cc441928.aspx).

Cheers,
Victor

On Fri, Aug 9, 2013 at 10:43 AM, Sergey Konoplev <gray.ru@gmail.com> wrote:

Show quoted text

On Wed, Aug 7, 2013 at 7:01 PM, Victor Hooi <victorhooi@yahoo.com> wrote:

also seems to suggest that using ORDER BY RANDOM() will perform poorly on
Postgres.

I'm just wondering if this is still the case?

I just ran those benchmarks on my system (Postgres 9.2.4), and using

ORDERY

BY RANDOM did not seem substantially to generating random integers in

Python

and picking those out (and handling non-existent rows).

Has Postgres's behaviour for ORDER BY RANDOM change sometime recently?

Unfortunately, It has not. However, there always is a workaround. You
can get a random results fast by WITH RECURSIVE query.

WITH RECURSIVE r AS (
WITH b AS (SELECT min(id), max(id) FROM table1)
(
SELECT id, min, max, array[]::integer[] AS a, 0 AS n
FROM table1, b
WHERE id > min + (max - min) * random()
LIMIT 1
) UNION ALL (
SELECT t.id, min, max, a || t.id, r.n + 1 AS n
FROM table1 AS t, r
WHERE
t.id > min + (max - min) * random() AND
t.id <> all(a) AND
r.n + 1 < 10
LIMIT 1
)
)
SELECT t.id FROM table1 AS t, r WHERE r.id = t.id;

The general idea is that we get a random value between min(id) and
max(id) and then get the first row with id bigger than this value.
Then we repeat until we get 10 of such rows, checking that this id has
not been retrieved earlier.

Surely, the probability of appearing one or another value in the
result depends on the distribution of id values in the table, but in
the most cases I faced it works good.

I had an idea to play with pg_stats.histogram_bounds to work around
the described issue, but it was never so critical for tasks I solved.

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com

#5Sergey Konoplev
gray.ru@gmail.com
In reply to: Victor Hooi (#4)
Re: Performance of ORDER BY RANDOM to select random rows?

On Sun, Aug 11, 2013 at 9:59 PM, Victor Hooi <victorhooi@yahoo.com> wrote:

Hmm, aha, so the ORDER BY RANDOM behaviour hasn't changed - just to confirm
- this means that Postgres will duplicate the table, add a new column,
generate random numbers for every record, then sort by that new column,
right?

It doesn't duplicate the table, it sec scans it and uses top-N sort if
we use limit, and memory or disc sort depending on the data size if we
don't use limit.

I've just read the above anecdotally on the internet, but I'm curious if the
actual implementation is documented somewhere officially apart from the
source? Running the query through EXPLAIN didn't seem to tell me much
additional information.

I can not say about official docs, but you will find a good sorting
explanation here
http://www.depesz.com/2013/05/09/explaining-the-unexplainable-part-3/

@Sergey - Thanks for the tip about using WITH RECURSIVE. I'm actually doing
something similar in my application code in Django - basically take the max
id, then generate a random integer between 0 and max id. However, it is
dependent on how evenly distributed the record IDs are - in our case, if we
delete a large number of records, it might affect things.

You can try to look at pg_stats.histogram_bounds to work the issue
around, however it is just my assumption, I have newer tried it.

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general