Re: Selecting random rows efficiently
On Sat, 30 Aug 2003, Richard Jones wrote:
Hi,
i have a table of around 3 million rows from which i regularly (twice a second
at the moment) need to select a random row from.
currently i'm doing "order by rand() limit 1" - but i suspect this is
responsible for the large load on my db server - i guess that PG is doing far
too much work just to pick one row.
If you have an int id (aka serial) column then it is simple - just pick a
random number between 1 and currval('id_seq')...
or offset rand() limit 1 perhaps?
since you want random there is no need to bother with an order and that'll
save a sort.
--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/
i was hoping there was some trickery with sequences that would allow me to
easily pick a random valid sequence number..?
I would suggest renumbering the data.
ALTER SEQUENCE ... RESTART WITH 1;
UPDATE table SET pkey = DEFAULT;
Of course, PostgreSQL may have trouble with that update due to
evaluation of the unique constraint immediately -- so drop the primary
key first, and add it back after.
On Saturday 30 August 2003 1:08 pm, you wrote:
On Sat, 30 Aug 2003, Richard Jones wrote:
Hi,
i have a table of around 3 million rows from which i regularly (twice a
second at the moment) need to select a random row from.
currently i'm doing "order by rand() limit 1" - but i suspect this is
responsible for the large load on my db server - i guess that PG is doing
far too much work just to pick one row.
If you have an int id (aka serial) column then it is simple - just pick a
random number between 1 and currval('id_seq')...
or offset rand() limit 1 perhaps?
since you want random there is no need to bother with an order and that'll
save a sort.
Yes, the pkey is a SERIAL but the problem is that the sequence is rather
sparse
for example, it goes something like 1 -> 5000 then 100000->100000 and then
2000000->upwards
this is due to chunks being deleted etc.
if i pick a random number for the key it will not be a random enough
distribution, because the sequence is sparse.. sometimes it will pick a key
that doesn't exist.
i'm currently reading all the keys into an array and selecting randoms from
there - but this is no good long-term as i need to refresh the array of keys
to take into account newly added rows to the table (daily)
i was hoping there was some trickery with sequences that would allow me to
easily pick a random valid sequence number..?
Thanks,
Rich.
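For illustration, the approach Richard describes (and its usual variant) can be sketched in Python. This is a hypothetical client-side retry loop, not code from the thread: instead of caching all keys in an array, draw a uniform random id up to the maximum and simply draw again on a miss. The names here are invented, and in practice the membership check would be a `SELECT 1 FROM table WHERE pkey = %s` round-trip rather than a Python set lookup.

```python
import random

def pick_random_key(existing_keys, max_key, max_tries=1000):
    """Draw uniform ids in [1, max_key]; retry until one actually exists."""
    for _ in range(max_tries):
        candidate = random.randint(1, max_key)
        if candidate in existing_keys:
            return candidate
    raise RuntimeError("key space too sparse; gave up after max_tries")

# A sparse key space like the one described: a low run, a gap, a high run.
keys = set(range(1, 5001)) | set(range(100000, 105000))
key = pick_random_key(keys, max_key=105000)
```

Because each draw is uniform and misses are simply discarded, every existing key is equally likely; the cost is extra round-trips, and the expected number of retries grows as the key space gets sparser, which is exactly the long-term problem Richard is worried about.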
--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
Richard Jones <rj@last.fm> writes:
i have a table of around 3 million rows from which i regularly (twice a
second at the moment) need to select a random row from
i was hoping there was some trickery with sequences that would allow me to
easily pick a random valid sequence number..?
There is no magic bullet here, but if you expect that requirement to
persist then it is worth your trouble to expend effort on a real
solution. A real solution in my mind would look like
1. Add a column "random_id float8 default random()". The idea here
is that you assign a random ID to each row as it is created.
2. Add an index on the above column.
3. Your query now looks like
SELECT * FROM table WHERE random_id >= random()
ORDER BY random_id LIMIT 1;
This gives you a plan on the order of
Limit (cost=0.00..0.17 rows=1 width=8)
-> Index Scan using fooi on foo (cost=0.00..57.00 rows=334 width=8)
Filter: (random_id >= random())
which is fast and gives a genuinely random result row. At least up
until you have enough rows that there start being duplicate random_ids,
which AFAIK would be 2 billion rows with a decent random()
implementation. If you're concerned about that, you could periodically
re-randomize with
UPDATE table SET random_id = random();
so that any rows that were "hidden" because they had a duplicate
random_id have another shot at being choosable. But with only a few mil
rows I don't think you need to worry.
regards, tom lane
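Tom's three steps can be simulated end to end with sqlite3 standing in for PostgreSQL (a sketch only: the table, column, and index names are invented, sqlite has no `float8 default random()` column default, so the random id is assigned client-side, and real PostgreSQL would need the sub-select or stable-function form from the follow-up message):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, random_id REAL)")

# Step 1: assign a random id to each row as it is created.
conn.executemany(
    "INSERT INTO foo (id, random_id) VALUES (?, ?)",
    [(i, random.random()) for i in range(1, 10001)],
)

# Step 2: index the random column so the lookup is a cheap index scan.
conn.execute("CREATE INDEX fooi ON foo (random_id)")

# Step 3: pick the first row at or above a freshly drawn random value.
def random_row(conn):
    row = conn.execute(
        "SELECT * FROM foo WHERE random_id >= ? ORDER BY random_id LIMIT 1",
        (random.random(),),
    ).fetchone()
    if row is None:
        # Rare edge case: the draw landed above the largest stored
        # random_id, so wrap around to the smallest one.
        row = conn.execute(
            "SELECT * FROM foo ORDER BY random_id LIMIT 1").fetchone()
    return row
```

The wrap-around branch is an assumption of this sketch; with a few million rows the draw almost never exceeds the maximum stored random_id, but a client that must never return zero rows needs some such fallback.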
I said:
3. Your query now looks like
SELECT * FROM table WHERE random_id >= random()
ORDER BY random_id LIMIT 1;
Correction: the above won't give quite the right query because random()
is marked as a volatile function. You can hide the random() call inside
a user-defined function that you (misleadingly) mark stable, or you can
just stick it into a sub-select:
regression=# explain select * from foo WHERE random_id >= (select random())
regression-# ORDER BY random_id LIMIT 1;
QUERY PLAN
-------------------------------------------------------------------------
Limit (cost=0.01..0.15 rows=1 width=8)
InitPlan
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Index Scan using fooi on foo (cost=0.00..45.50 rows=334 width=8)
Index Cond: (random_id >= $0)
(5 rows)
This technique is probably safer against future planner changes,
however:
regression=# create function oneshot_random() returns float8 as
regression-# 'select random()' language sql stable;
CREATE FUNCTION
regression=# explain select * from foo WHERE random_id >= oneshot_random()
regression-# ORDER BY random_id LIMIT 1;
QUERY PLAN
-------------------------------------------------------------------------
Limit (cost=0.00..0.14 rows=1 width=8)
-> Index Scan using fooi on foo (cost=0.00..46.33 rows=334 width=8)
Index Cond: (random_id >= oneshot_random())
(3 rows)
The point here is that an indexscan boundary condition has to use stable
or immutable functions. By marking oneshot_random() stable, you
essentially say that it's okay to evaluate it only once per query,
rather than once at each row.
regards, tom lane
On Sat, 2003-08-30 at 09:01, Rod Taylor wrote:
i was hoping there was some trickery with sequences that would allow me to
easily pick a random valid sequence number..?

I would suggest renumbering the data.

ALTER SEQUENCE ... RESTART WITH 1;
UPDATE table SET pkey = DEFAULT;

Of course, PostgreSQL may have trouble with that update due to
evaluation of the unique constraint immediately -- so drop the primary
key first, and add it back after.
And if there are child tables, they'd all have to be updated, too.
--
-----------------------------------------------------------------
Ron Johnson, Jr. ron.l.johnson@cox.net
Jefferson, LA USA
"Whatever may be the moral ambiguities of the so-called
democratic nations and however serious may be their failure to
conform perfectly to their democratic ideals, it is sheer moral
perversity to equate the inconsistencies of a democratic
civilization with the brutalities which modern tyrannical states
practice."
Reinhold Niebuhr, ca. 1940
Considering that we'd have to index the random field too, it'd be neater in
the long term to re-number the primary key. Although, being a primary key,
that's foreign-keyed from absolutely everywhere, so that'd probably take an
amusingly long time.
...and no we're not from Micronesia, we're from ever so slightly less exotic
London. Though Micronesia might be nice...
Russ (also from last.fm but without the fancy address)
pgsql-performance-owner@postgresql.org wrote:
On Sat, 2003-08-30 at 09:01, Rod Taylor wrote:
i was hoping there was some trickery with sequences that would
allow me to easily pick a random valid sequence number..?

I would suggest renumbering the data.

ALTER SEQUENCE ... RESTART WITH 1;
UPDATE table SET pkey = DEFAULT;

Of course, PostgreSQL may have trouble with that update due to
evaluation of the unique constraint immediately -- so drop the
primary key first, and add it back after.

And if there are child tables, they'd all have to be updated, too.
Can you just create an extra serial column and make sure that one is
always in order with no holes in it (i.e. maintained by a nightly process, etc.)?
If so, then something like this truly flies:
select * from accounts where aid = (select cast(floor(random()*100000)+1 as int));
My times on it on a 100,000 row table are < 1 millisecond.
Note that you have to have a hole free sequence AND know how many rows
there are, but if you can meet those needs, this is screamingly fast.
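The hole-free-column idea can also be sketched with sqlite3 standing in for PostgreSQL (names and the shape of the nightly job are invented for the sketch): a dense ordinal column is rebuilt periodically, after which the random pick is a single equality lookup, as in the `floor(random()*100000)+1` query above.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (pkey INTEGER PRIMARY KEY, aid INTEGER)")

# Simulate a sparse primary key with deleted chunks.
sparse = list(range(1, 5001)) + list(range(100000, 102000))
conn.executemany("INSERT INTO accounts (pkey) VALUES (?)",
                 [(k,) for k in sparse])

def renumber(conn):
    """Nightly job: rewrite aid as a dense 1..N sequence; return N."""
    rows = conn.execute("SELECT pkey FROM accounts ORDER BY pkey").fetchall()
    conn.executemany("UPDATE accounts SET aid = ? WHERE pkey = ?",
                     [(n, pkey) for n, (pkey,) in enumerate(rows, start=1)])
    return len(rows)

n = renumber(conn)

# The fast path: one indexed equality lookup per random pick.
row = conn.execute("SELECT * FROM accounts WHERE aid = ?",
                   (random.randint(1, n),)).fetchone()
```

The tradeoff is the one already noted in the thread: rows inserted after the last renumbering run are invisible to the picker until the next run, so this only fits workloads where daily (or similar) staleness is acceptable.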