random record from small set
I am trying to retrieve a random record, weighted by a "chance"
attribute, from a small set of records. This may eventually be somewhat
of a performance concern, so I'd like to make sure I'm doing this right.
Here's what I have so far:
create table r1 (
    i int,
    chance numeric
);

create or replace function randrec() returns int as $$
    # fetch all rows, then walk the cumulative chances until the
    # running total passes a random number in [0,1)
    $res = spi_exec_query('select i,chance from r1');
    $r = rand;
    $accum = 0;
    $i = 0;
    while ($accum < $r) {
        $accum += $res->{rows}[$i++]->{chance};
    }
    return $res->{rows}[$i-1]->{i};
$$ language plperl;
test=# select * from r1;
i | chance
---+--------
1 | 0.25
2 | 0.20
3 | 0.15
4 | 0.10
5 | 0.30
That seems to work: out of 10k calls, I got the following counts
for each value:
1 2479
2 1959
3 1522
4 950
5 3090
But I have a few questions:
* Am I right to use NUMERIC for the chance attribute?
* Does perl's arithmetic leave me with the chance that those numeric
values don't add up to 1.00 (and in this case that could mean an
infinite loop)?
* In my design I'll need a constraint trigger making sure that the
numbers add up to 1.00. Will that be a performance problem for
operations on the table that don't modify the chance attribute?
* Is there a better way?
* Does spi_exec_query pull the entire result set into memory at once? Is
there a point at which performance could be a serious problem if there
are a large number of items to select among?
Regards,
Jeff Davis
On Mon, Feb 14, 2005 at 06:15:56PM -0800, Jeff Davis wrote:
* Am I right to use NUMERIC for the chance attribute?
I ran tests with numeric, real, and double precision; double precision
was consistently about 10% faster than the others. I used the
sample data you posted and the PL/pgSQL function shown later in
this message.
* Does perl's arithmetic leave me with the chance that those numeric
values don't add up to 1.00 (and in this case that could mean an
infinite loop)?
I'd suggest looping through the records so you can't possibly end
up in an infinite loop.
* In my design I'll need a constraint trigger making sure that the
numbers add up to 1.00.
If the sum must be exactly 1.00 then be careful if you use double
precision -- if you test with the equality operator then the check
might fail because the sum is 0.9999999987.
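For example, a tolerance test instead of strict equality (the 1e-9
threshold here is arbitrary, chosen just for illustration):

-- may unexpectedly return false with double precision:
SELECT sum(chance) = 1.0 FROM r1;
-- comparing against a small tolerance avoids that:
SELECT abs(sum(chance) - 1.0) < 1e-9 FROM r1;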
Will that be a performance problem for operations on the table that
don't modify the chance attribute?
Any trigger that you didn't otherwise need will cause a performance
hit. I'd expect a statement-level AFTER trigger to have the lowest
impact since it would run only once per statement, whereas a row-level
trigger might run multiple times per statement. On the other hand,
if you make a lot of updates that don't modify the chance attribute,
then you might want to try a row-level trigger that skips the check
when NEW.chance = OLD.chance. I'd suggest testing different
methods under expected conditions to see which has the lowest impact.
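As a rough sketch of the statement-level approach (the function and
trigger names and the tolerance test are mine, not something from a
tested setup):

CREATE OR REPLACE FUNCTION check_chance_sum() RETURNS trigger AS $$
BEGIN
    -- fires once per statement, no matter how many rows changed
    IF abs((SELECT sum(chance) FROM r1) - 1.0) > 1e-9 THEN
        RAISE EXCEPTION 'chance values in r1 must sum to 1.00';
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER r1_check_sum AFTER INSERT OR UPDATE OR DELETE ON r1
    FOR EACH STATEMENT EXECUTE PROCEDURE check_chance_sum();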
* Is there a better way?
* Does spi_exec_query pull the entire result set into memory at once?
I think it does. I ran some tests with the following PL/pgSQL
function and got significantly faster times than with PL/Perl,
especially as the data set grew:
CREATE FUNCTION randrec() RETURNS integer AS $$
DECLARE
r double precision := random();
accum double precision := 0.0;
row record;
BEGIN
FOR row IN SELECT * FROM r1 LOOP
accum := accum + row.chance;
IF accum >= r THEN
EXIT;
END IF;
END LOOP;
RETURN row.i;
END;
$$ LANGUAGE plpgsql VOLATILE;
SELECT * FROM r1;
i | chance
---+--------
1 | 0.25
2 | 0.20
3 | 0.15
4 | 0.10
5 | 0.30
SELECT i, count(*)
FROM (SELECT randrec() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER by i;
i | count
---+-------
1 | 2467
2 | 1939
3 | 1536
4 | 1016
5 | 3042
(5 rows)
Time: 3300.710 ms
Here are the results using the PL/Perl function you posted:
SELECT i, count(*)
FROM (SELECT randrec_perl() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER by i;
i | count
---+-------
1 | 2501
2 | 2040
3 | 1463
4 | 994
5 | 3002
(5 rows)
Time: 8765.584 ms
I ran each query several times and those times were typical of both.
With a data set of 100 records, the PL/pgSQL function ran in about
14 seconds, while the PL/Perl function took around 65 seconds.
--
Michael Fuhr
http://www.fuhr.org/~mfuhr/
Thanks very much for the information. I had very similar results on my
machine. I will take your advice and use the double-precision values,
since it doesn't affect the probability significantly anyway. As for
the constraint trigger, I will see if it becomes a problem before I
worry about its performance. As for whether those values add up to
1.0, I'll just check to make sure it's fairly close to 1.0 :)
The only real difference from your results was that I didn't see much
change whether the underlying table's chance attribute was double
precision or numeric. I used your generate_series()-based query.
Perhaps that was because I was using ALTER TABLE to modify r1's "chance"
attribute. One thing I noticed there was that I had to CREATE OR
REPLACE the plpgsql function for the change to take effect. Perhaps it's a bug?
Here's a test case:
test=# create table t1(i int);
CREATE TABLE
test=# insert into t1 values(1);
INSERT 17775 1
test=# create or replace function err() returns int as $$
DECLARE
    accum double precision := 0.0;
    row record;
BEGIN
    FOR row IN SELECT i FROM t1 LOOP
        accum := accum + row.i;
    END LOOP;
    RETURN row.i;
END;
$$ language plpgsql;
CREATE FUNCTION
test=# insert into t1 values(2);
INSERT 17778 1
test=# select err();
err
-----
2
(1 row)
test=# alter table t1 alter column i type numeric;
ALTER TABLE
test=# select err();
err
-----
10
(1 row)
And if you keep playing around with the type and values you can get
other errors like:
ERROR: type "double precision" value out of range: underflow
CONTEXT: PL/pgSQL function "err" line 1 at assignment
Or:
ERROR: invalid memory alloc request size 4294967290
CONTEXT: PL/pgSQL function "randrec" line 7 at assignment
If any more information would be helpful, let me know. It looks a
little like a bug; perhaps we should throw an error when dependent
functions are called after the underlying types have changed? Or I
suppose if we can recognize that, it might as well recompile the
function and proceed without error.
Regards,
Jeff Davis
And what about another data representation like
create table r1 (
i int,
chance_from numeric,
chance_to numeric
)
With that, you can select one random row in a single select, for instance:
select * from r1 where chance_from <= $rnd and chance_to > $rnd;
I see these advantages:
- Only one select.
- Indices can improve performance if r1 has many rows.
and this disadvantage:
- Tricky updates (see the sketch below).
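For illustration, a minimal sketch of that layout (my own example, not
tested; the boundaries are just running totals of the chance values
from the sample data, and the index and the literal 0.42 are only for
demonstration):

create table r1 (
    i int,
    chance_from numeric,
    chance_to numeric
);
-- each row covers [chance_from, chance_to); bounds are cumulative sums
insert into r1 values (1, 0.00, 0.25);
insert into r1 values (2, 0.25, 0.45);
insert into r1 values (3, 0.45, 0.60);
insert into r1 values (4, 0.60, 0.70);
insert into r1 values (5, 0.70, 1.00);
create index r1_chance_idx on r1 (chance_from);

-- e.g. with $rnd = 0.42 this returns row 2:
select * from r1 where chance_from <= 0.42 and chance_to > 0.42;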
Or
create table r1 (
i int,
chance_from numeric
)
and
select * from r1 where chance_from <= $rnd order by chance_from desc
limit 1;
which can be updated more easily...
Just ideas; I have never tested them...
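For what it's worth, here is a sketch of one way the tricky update
might be handled: keep the raw weights in their own column and
regenerate the cumulative bounds after any change. The weight column
and the rebuild_chances() function are hypothetical names of mine,
untested:

create table r1 (
    i int,
    weight numeric,
    chance_from numeric
);

create or replace function rebuild_chances() returns void as $$
DECLARE
    running numeric := 0.0;
    rec record;
BEGIN
    -- walk the rows in a fixed order, accumulating the weights
    FOR rec IN SELECT i, weight FROM r1 ORDER BY i LOOP
        UPDATE r1 SET chance_from = running WHERE i = rec.i;
        running := running + rec.weight;
    END LOOP;
END;
$$ language plpgsql;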
Here's what I have so far:
If you go that route, make sure you check for edge cases, such
as reaching the end of the rows without hitting your number:
while ($accum < $r) {
    die qq{Ran out of rows!\n} if ! defined $res->{rows}[$i];
    $accum += $res->{rows}[$i++]->{chance};
}
Also, your query should be "select i,chance from r1 ORDER BY random()",
else you are getting back the same order each time (until a row is
changed), which certainly reduces the randomness.
Anyway, here's another solution, which shifts as much work as possible
off of the actual random row call, and uses a trigger to keep things
in sync. I switched the 'chance' from 0.25 to 25 (numeric to int) to make
things easier to read.
UPDATE r1 SET chance = chance*100;
ALTER TABLE r1 ALTER COLUMN chance TYPE INTEGER;
CREATE TABLE r2(id integer);
CREATE OR REPLACE FUNCTION r1_cleanup() RETURNS trigger LANGUAGE plpgsql AS
$$
DECLARE
    mychance integer;
BEGIN
    IF TG_OP = 'DELETE' THEN
        DELETE FROM r2 WHERE id = OLD.i;
    ELSE
        IF TG_OP = 'UPDATE' THEN
            DELETE FROM r2 WHERE id = OLD.i OR id = NEW.i;
        END IF;
        -- insert one row into r2 per "percentage point" of chance
        SELECT chance FROM r1 WHERE i = NEW.i INTO mychance;
        LOOP
            mychance := mychance - 1;
            EXIT WHEN mychance < 0;
            INSERT INTO r2 VALUES (NEW.i);
        END LOOP;
    END IF;
    RETURN NULL;
END;
$$;
CREATE TRIGGER r1_trigger AFTER INSERT or UPDATE or DELETE ON r1
FOR EACH ROW EXECUTE PROCEDURE r1_cleanup();
UPDATE r1 SET i=i; -- To initially populate r2
SELECT id FROM r2 ORDER BY random() LIMIT 1; -- repeat as needed
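If you want to sanity-check that the trigger populated r2 correctly,
a quick query (my addition, not part of the recipe above):

SELECT id, count(*) FROM r2 GROUP BY id ORDER BY id;
-- with the sample data this should show 25, 20, 15, 10, and 30
-- rows for ids 1 through 5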
--
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200502152252
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8