tablesample performance

Started by Andy Colson · over 9 years ago · 8 messages · pgsql-general
#1 Andy Colson
andy@squeakycode.net

I wanted to report an awesome performance boost using tablesample.

In my stored function I was getting a random row using:
select one into x from ones order by random() limit 1;

When the table was smaller it worked fine, but the performance has
slowly gotten worse. This morning I was getting around 8 transactions a
second.

I just replaced it with:
select one into x from ones tablesample bernoulli(1) limit 1;

And now I'm getting 376 transactions a second!

Thank you, devs! Thank you, PG 9.5!

-Andy


#2 Francisco Olarte
folarte@peoplecall.com
In reply to: Andy Colson (#1)
Re: tablesample performance

On Tue, Oct 18, 2016 at 5:06 PM, Andy Colson <andy@squeakycode.net> wrote:

I wanted to report an awesome performance boost using tablesample.
In my stored function I was getting a random row using:
select one into x from ones order by random() limit 1;
When the table was smaller it worked fine, but the performance has slowly
gotten worse. This morning I was getting around 8 transactions a second.

Which is not a surprise, as it has to at least read all the rows and
generate a random() for each one and keep track of the minimum.

I just replaced it with:
select one into x from ones tablesample bernoulli(1) limit 1;

This should be faster, but to me it seems to do a different thing: it
selects each row of the table with probability 1% and returns the
first one selected, i.e., something similar to

select one into x from ones where random() < 0.01 limit 1;

Which has the (diminishing with table size) risk of selecting zero
rows, and is going to select one of the first 100 or so rows with high
probability, unless I'm missing something.

I say this because the docs state it returns a 'randomly chosen'
sample, not a 'randomly ORDERED' one, and the straightforward
implementation of sampling returns rows in the primitive scan order.
I suppose it could easily be tested by selecting bernoulli(100), but I
have no server access right now to verify it.
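
A quick way to check this when server access is available (a sketch;
ctid is the row's physical location, as (page, item)):

select ctid, one from ones tablesample bernoulli(1) limit 5;

If the ctids always come back in ascending page order starting near
page 0, then limit 1 will strongly favor rows stored early in the
table.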

With a big table it seems:

select one into x from ones where random() < 0.01 order by random() limit 1;
or
select one into x from ones tablesample bernoulli(1) order by random() limit 1;

is more similar to what you originally did (and the run time should
probably be somewhere in between).

I would recommend you execute the function and verify it does what you
want (as you say it's fast, I would try selecting several thousand
rows and eyeballing the result; if it does what I fear, the grouping
should be obvious).
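
For instance, something like this tally (a sketch; the loop count is
arbitrary, and the text type is an assumption, adjust it to match
ones.one):

do $$
declare
  x text;  -- assumed type; adjust to match ones.one
begin
  create temp table draws (one text);
  for i in 1..5000 loop
    select one into x from ones tablesample bernoulli(1) limit 1;
    insert into draws values (x);
  end loop;
end $$;

select one, count(*) from draws group by one order by count(*) desc limit 20;

If the top counts pile up on a handful of values, the bias is there.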

Maybe you do not mind it, in which case it's OK, but a one-minute run
should let you know exactly what you are doing.

Francisco Olarte.


#3 Andy Colson
andy@squeakycode.net
In reply to: Francisco Olarte (#2)
Re: tablesample performance

On 10/18/2016 11:44 AM, Francisco Olarte wrote:

On Tue, Oct 18, 2016 at 5:06 PM, Andy Colson <andy@squeakycode.net> wrote:

I wanted to report an awesome performance boost using tablesample.
In my stored function I was getting a random row using:
select one into x from ones order by random() limit 1;
When the table was smaller it worked fine, but the performance has slowly
gotten worse. This morning I was getting around 8 transactions a second.

Which is not a surprise, as it has to at least read all the rows and
generate a random() for each one and keep track of the minimum.

I just replaced it with:
select one into x from ones tablesample bernoulli(1) limit 1;

This should be faster, but to me it seems to do a different thing: it
selects each row of the table with probability 1% and returns the
first one selected, i.e., something similar to

select one into x from ones where random() < 0.01 limit 1;

Which has the (diminishing with table size) risk of selecting zero
rows, and is going to select one of the first 100 or so rows with high
probability, unless I'm missing something.

I say this because the docs state it returns a 'randomly chosen'
sample, not a 'randomly ORDERED' one, and the straightforward
implementation of sampling returns rows in the primitive scan order.
I suppose it could easily be tested by selecting bernoulli(100), but I
have no server access right now to verify it.

With a big table it seems:

select one into x from ones where random() < 0.01 order by random() limit 1;
or
select one into x from ones tablesample bernoulli(1) order by random() limit 1;

is more similar to what you originally did (and the run time should
probably be somewhere in between).

I would recommend you execute the function and verify it does what you
want (as you say it's fast, I would try selecting several thousand
rows and eyeballing the result; if it does what I fear, the grouping
should be obvious).

Maybe you do not mind it, in which case it's OK, but a one-minute run
should let you know exactly what you are doing.

Francisco Olarte.

Ah, yes, you're right, there is a bit of a difference there.

Speed wise:
1) select one from ones order by random() limit 1;

about 360ms

2) select one from ones tablesample bernoulli(1) limit 1 ;

about 4ms

3) select one from ones tablesample bernoulli(1) order by random() limit 1;

about 80ms

Using the third option in batch, I'm getting about 15 transactions a second.

Oddly:
select one from ones tablesample bernoulli(0.25) order by random()

takes almost 80ms also.

bernoulli(0.25) returns 3k rows
bernoulli(1) returns 14k rows

Thanks,

-Andy


#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andy Colson (#3)
Re: tablesample performance

Andy Colson <andy@squeakycode.net> writes:

On 10/18/2016 11:44 AM, Francisco Olarte wrote:

This should be faster, but to me it seems it does a different thing.

Ah, yes, you're right, there is a bit of a difference there.

If you don't want to have an implicit bias towards earlier blocks,
I don't think that either standard tablesample method is really what
you want.

The contrib/tsm_system_rows tablesample method is a lot closer, in
that it will start at a randomly chosen block, but if you just do
"tablesample system_rows(1)" then you will always get the first row
in whichever block it lands on, so it's still not exactly unbiased.
Maybe you could select "tablesample system_rows(100)" or so and then
do the order-by-random trick on that sample. This would be a lot
faster than selecting 100 random rows with either built-in sample
method, since the rows it grabs will be consecutive.
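
In SQL, that suggestion looks roughly like this (the extension ships
in contrib and has to be installed once per database; 100 is an
arbitrary sample size):

create extension if not exists tsm_system_rows;

select one from ones tablesample system_rows(100)
order by random() limit 1;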

regards, tom lane


#5 Francisco Olarte
folarte@peoplecall.com
In reply to: Andy Colson (#3)
Re: tablesample performance

Andy:

On Tue, Oct 18, 2016 at 7:17 PM, Andy Colson <andy@squeakycode.net> wrote:

Ah, yes, you're right, there is a bit of a difference there.

Speed wise:
1) select one from ones order by random() limit 1;

about 360ms

2) select one from ones tablesample bernoulli(1) limit 1 ;

about 4ms

3) select one from ones tablesample bernoulli(1) order by random() limit 1;

about 80ms

Expected. It would have been nice if you had provided some table structure / size data.

Using the third option in batch, I'm getting about 15 transactions a second.

Oddly:
select one from ones tablesample bernoulli(0.25) order by random()
takes almost 80ms also.

Mmm, it depends a lot on your total rows and average rows per page.

bernoulli(0.25) returns 3k rows
bernoulli(1) returns 14k rows

This hints at 1.4M rows (14k / 1%). If your rows are small and you have
more than 400 rows per page I would expect that, as even a 0.25% sample
would hit nearly every page.

Tom pointed you at an extension already. Also, if you are in a function
(which can loop) you can do a little trick: instead of bernoulli(1),
use bernoulli(100 * N / table_size). This way you will select very few
rows and speed up the last phase. Anyway, I fear bernoulli must still
read the whole table to be able to discard rows randomly, so you may
not gain anything (I would compare the query time against a simple
'count(one)' query, to get a benchmark of how much time the server
spends reading the table; I would bet on 'about 80 ms').
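
A sketch of that loop trick (hypothetical function and names; it reads
the planner's row estimate from pg_class instead of counting, since a
count on every call would defeat the purpose, and bernoulli takes a
percentage, hence the factor of 100):

create or replace function pick_random_one() returns ones.one%type
language plpgsql as $$
declare
  x ones.one%type;
  pct float8;
begin
  -- aim for roughly 10 candidate rows per draw
  select least(100, greatest(0.01, 100.0 * 10 / greatest(reltuples, 1)))
    into pct
  from pg_class
  where relname = 'ones';
  loop
    select one into x
    from ones tablesample bernoulli(pct)
    order by random()
    limit 1;
    exit when found;  -- retry on the (unlikely) empty sample
  end loop;
  return x;
end $$;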

Francisco Olarte.


#6 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#4)
Re: tablesample performance

On 18 October 2016 at 19:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andy Colson <andy@squeakycode.net> writes:

On 10/18/2016 11:44 AM, Francisco Olarte wrote:

This should be faster, but to me it seems it does a different thing.

Ah, yes, you're right, there is a bit of a difference there.

If you don't want to have an implicit bias towards earlier blocks,
I don't think that either standard tablesample method is really what
you want.

The contrib/tsm_system_rows tablesample method is a lot closer, in
that it will start at a randomly chosen block, but if you just do
"tablesample system_rows(1)" then you will always get the first row
in whichever block it lands on, so it's still not exactly unbiased.

Is there a reason why we can't fix the behaviours of the three methods
mentioned above by making them all start at a random block and a
random item between min and max?

It wasn't ever intended to be biased, and bernoulli in particular
ought to be strictly unbiased.

Happy to patch if we agree.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#6)
Re: tablesample performance

Simon Riggs <simon@2ndquadrant.com> writes:

On 18 October 2016 at 19:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If you don't want to have an implicit bias towards earlier blocks,
I don't think that either standard tablesample method is really what
you want.

The contrib/tsm_system_rows tablesample method is a lot closer, in
that it will start at a randomly chosen block, but if you just do
"tablesample system_rows(1)" then you will always get the first row
in whichever block it lands on, so it's still not exactly unbiased.

Is there a reason why we can't fix the behaviours of the three methods
mentioned above by making them all start at a random block and a
random item between min and max?

The standard tablesample methods are constrained by other requirements,
such as repeatability. I am not sure that loading this one on top of
that is a good idea. The bias I referred to above is *not* the fault
of the sample methods, rather it's the fault of using "LIMIT 1".

It does seem like maybe it'd be nice for tsm_system_rows to start at a
randomly chosen entry in the first block it visits, rather than always
dumping that entire block. Then "tablesample system_rows(1)" would
actually give you a pretty random row, and I think we aren't giving up
any useful properties it has now.

regards, tom lane


#8 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#7)
Re: tablesample performance

On 18 October 2016 at 22:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

On 18 October 2016 at 19:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If you don't want to have an implicit bias towards earlier blocks,
I don't think that either standard tablesample method is really what
you want.

The contrib/tsm_system_rows tablesample method is a lot closer, in
that it will start at a randomly chosen block, but if you just do
"tablesample system_rows(1)" then you will always get the first row
in whichever block it lands on, so it's still not exactly unbiased.

Is there a reason why we can't fix the behaviours of the three methods
mentioned above by making them all start at a random block and a
random item between min and max?

The standard tablesample methods are constrained by other requirements,
such as repeatability. I am not sure that loading this one on top of
that is a good idea. The bias I referred to above is *not* the fault
of the sample methods, rather it's the fault of using "LIMIT 1".

Hmm, yeh, that would make it a little too much of a special case.

It does seem like maybe it'd be nice for tsm_system_rows to start at a
randomly chosen entry in the first block it visits, rather than always
dumping that entire block. Then "tablesample system_rows(1)" would
actually give you a pretty random row, and I think we aren't giving up
any useful properties it has now.

OK, will patch that.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
