tablesample performance
I wanted to report an awesome performance boost using tablesample.
In my stored function I was getting a random row using:
select one into x from ones order by random() limit 1;
When the table was smaller it worked fine, but the performance has
slowly gotten worse. This morning I was getting around 8 transactions a
second.
I just replaced it with:
select one into x from ones tablesample bernoulli(1) limit 1;
And now I'm getting 376 transactions a second!
Thank you, devs! Thank you PG 9.5!
-Andy
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Tue, Oct 18, 2016 at 5:06 PM, Andy Colson <andy@squeakycode.net> wrote:
I wanted to report an awesome performance boost using tablesample.
In my stored function I was getting a random row using:
select one into x from ones order by random() limit 1;
When the table was smaller it worked fine, but the performance has slowly
gotten worse. This morning I was getting around 8 transactions a second.
Which is not a surprise, as it has to at least read all the rows and
generate a random() for each one and keep track of the minimum.
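[Editor's note: the cost Francisco describes can be sketched outside SQL. The following Python toy model (not PostgreSQL's actual executor code) shows why ORDER BY random() LIMIT 1 degrades with table size: it is equivalent to tagging every row with a random key and keeping the minimum, so every pick is a full scan.]

```python
import random

def order_by_random_limit_1(rows, rng):
    """Equivalent of ORDER BY random() LIMIT 1: tag every row with a
    random key and keep the row whose key is smallest. The whole input
    must be scanned, which is why it slows down as the table grows."""
    best_row, best_key = None, float("inf")
    for row in rows:                  # touches every row, every call
        key = rng.random()
        if key < best_key:
            best_row, best_key = row, key
    return best_row

rng = random.Random(1)
# Every row is equally likely; the price is one full scan per pick.
counts = [0] * 10
for _ in range(20_000):
    counts[order_by_random_limit_1(range(10), rng)] += 1
print(counts)  # each bucket near 2000
```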
I just replaced it with:
select one into x from ones tablesample bernoulli(1) limit 1;
This should be faster, but to me it seems it does a different thing.
This seems to select each row of the table with probability 1% and
return the first selected, i.e., something similar to
select one into x from ones where random() < 0.01 limit 1.
Which has the ( diminishing with table size ) risk of selecting zero
rows and is going to select one of the first 100 or so rows with high
probability, unless I'm missing something.
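[Editor's note: Francisco's suspicion is easy to check with a toy model. The sketch below (plain Python, standing in for the SQL; the 1.4M row count is the figure that comes up later in the thread) keeps each row with probability 1% in physical scan order and returns the first kept row, as BERNOULLI(1) plus LIMIT 1 would.]

```python
import random

def bernoulli_then_limit_1(n_rows, pct, rng):
    """Toy model of 'TABLESAMPLE BERNOULLI(pct) ... LIMIT 1' on a table
    scanned in physical order: keep each row with probability pct/100
    and return the first row kept (None if nothing qualifies)."""
    p = pct / 100.0
    for row in range(n_rows):
        if rng.random() < p:
            return row
    return None

rng = random.Random(42)
n_rows = 1_400_000  # roughly the table size hinted at later in the thread
picks = [bernoulli_then_limit_1(n_rows, 1, rng) for _ in range(2_000)]
avg = sum(picks) / len(picks)
# The index of the first kept row is geometric with mean ~100, so the
# "random" row almost always comes from the first few hundred physical rows.
print(avg, max(picks))
```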
I say this because the docs state it returns a 'randomly chosen' sample,
not a 'randomly ORDERED' one, and the straightforward implementation
of sampling returns rows in the primitive scan order. I suppose it
could be easily tested by selecting bernoulli(100), but I don't have
server access right now to verify it.
With a big table, either of:
select one into x from ones where random() < 0.01 order by random() limit 1
or
select one into x from ones tablesample bernoulli(1) order by random() limit 1;
is closer to what you originally did ( and the run time should
probably fall somewhere in between ).
I would recommend executing the function and verifying it does what
you want ( since you say it's fast, I would try selecting several
thousand rows and eyeballing the result; if it does what I fear, the
grouping should be obvious ).
Maybe you do not mind it, in which case it's ok, but a one-minute run
should tell you exactly what you are doing.
Francisco Olarte.
On 10/18/2016 11:44 AM, Francisco Olarte wrote:
On Tue, Oct 18, 2016 at 5:06 PM, Andy Colson <andy@squeakycode.net> wrote:
I wanted to report an awesome performance boost using tablesample.
In my stored function I was getting a random row using:
select one into x from ones order by random() limit 1;
When the table was smaller it worked fine, but the performance has slowly
gotten worse. This morning I was getting around 8 transactions a second.
Which is not a surprise, as it has to at least read all the rows and
generate a random() for each one and keep track of the minimum.
I just replaced it with:
select one into x from ones tablesample bernoulli(1) limit 1;
This should be faster, but to me it seems it does a different thing.
This seems to select each row of the table with probability 1% and
return the first selected, i.e., something similar to
select one into x from ones where random() < 0.01 limit 1.
Which has the ( diminishing with table size ) risk of selecting zero
rows and is going to select one of the first 100 or so rows with high
probability, unless I'm missing something.
I say this because the docs state it returns a 'randomly chosen' sample,
not a 'randomly ORDERED' one, and the straightforward implementation
of sampling returns rows in the primitive scan order. I suppose it
could be easily tested by selecting bernoulli(100), but I don't have
server access right now to verify it.
With a big table, either of:
select one into x from ones where random() < 0.01 order by random() limit 1
or
select one into x from ones tablesample bernoulli(1) order by random() limit 1;
is closer to what you originally did ( and the run time should
probably fall somewhere in between ).
I would recommend executing the function and verifying it does what
you want ( since you say it's fast, I would try selecting several
thousand rows and eyeballing the result; if it does what I fear, the
grouping should be obvious ).
Maybe you do not mind it, in which case it's ok, but a one-minute run
should tell you exactly what you are doing.
Francisco Olarte.
Ah, yes, you're right, there is a bit of a difference there.
Speed wise:
1) select one from ones order by random() limit 1;
about 360ms
2) select one from ones tablesample bernoulli(1) limit 1 ;
about 4ms
3) select one from ones tablesample bernoulli(1) order by random() limit 1;
about 80ms
Using the third option in batch, I'm getting about 15 transactions a second.
Oddly:
select one from ones tablesample bernoulli(0.25) order by random()
takes almost 80ms also.
bernoulli(0.25) returns 3k rows
bernoulli(1) returns 14k rows
Thanks,
-Andy
Andy Colson <andy@squeakycode.net> writes:
On 10/18/2016 11:44 AM, Francisco Olarte wrote:
This should be faster, but to me it seems it does a different thing.
Ah, yes, you're right, there is a bit of a difference there.
If you don't want to have an implicit bias towards earlier blocks,
I don't think that either standard tablesample method is really what
you want.
The contrib/tsm_system_rows tablesample method is a lot closer, in
that it will start at a randomly chosen block, but if you just do
"tablesample system_rows(1)" then you will always get the first row
in whichever block it lands on, so it's still not exactly unbiased.
Maybe you could select "tablesample system_rows(100)" or so and then
do the order-by-random trick on that sample. This would be a lot
faster than selecting 100 random rows with either built-in sample
method, since the rows it grabs will be consecutive.
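[Editor's note: Tom's two-stage trick can also be sketched as a toy model (page-granular starts and wrap-around are simplifying assumptions of this sketch, not claims about the real contrib/tsm_system_rows internals):]

```python
import random

def system_rows_then_shuffle(n_rows, rows_per_page, k, rng):
    """Toy model of the suggestion above: grab k consecutive rows
    starting at a randomly chosen page (roughly what
    contrib/tsm_system_rows does), then pick one of them uniformly
    (the ORDER BY random() LIMIT 1 step)."""
    n_pages = n_rows // rows_per_page
    start = rng.randrange(n_pages) * rows_per_page
    window = [(start + i) % n_rows for i in range(k)]  # consecutive rows
    return rng.choice(window)

rng = random.Random(7)
picks = [system_rows_then_shuffle(1_400_000, 140, 100, rng)
         for _ in range(5_000)]
avg = sum(picks) / len(picks)
# Unlike bernoulli(1) + LIMIT 1, the picks now spread across the whole
# table instead of clustering near the physical start.
print(avg)
```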
regards, tom lane
Andy:
On Tue, Oct 18, 2016 at 7:17 PM, Andy Colson <andy@squeakycode.net> wrote:
Ah, yes, you're right, there is a bit of a difference there.
Speed wise:
1) select one from ones order by random() limit 1;
about 360ms
2) select one from ones tablesample bernoulli(1) limit 1 ;
about 4ms
3) select one from ones tablesample bernoulli(1) order by random() limit 1;
about 80ms
Expected. It would be nice if you had provided some table structure / size data.
Using the third option in batch, I'm getting about 15 transactions a second.
Oddly:
select one from ones tablesample bernoulli(0.25) order by random()
takes almost 80ms also.
Mmm, it depends a lot on your total rows and average rows per page.
bernoulli(0.25) returns 3k rows
bernoulli(1) returns 14k rows
This hints at ~1.4M rows (14k / 1%). If your rows are small and you have
more than 400 rows per page I would expect that, as a 0.25% sample
would hit every page.
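[Editor's note: the arithmetic above checks out directly; integer math is used below to avoid float rounding, and all figures are the thread's, not new measurements.]

```python
# 14k rows from a 1% sample implies ~1.4M rows total, and a 0.25% sample
# keeps roughly one row in 400, so once a page holds about 400 rows
# nearly every page contains a sampled row and the scan still touches
# the whole table.
total_rows = 14_000 * 100          # 14k rows came out of a 1% sample
rows_between_picks = 10_000 // 25  # 0.25% = 1 row kept per ~400 scanned
print(total_rows, rows_between_picks)  # 1400000 400
```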
Tom hinted you at an extension. Also, if you are in a function ( which
can loop ) you can do a little trick: instead of bernoulli(1) use
bernoulli(N/table_size). This way you will select very few rows and
speed up the last phase. Anyway, I fear bernoulli must read the whole
table too, to be able to discard rows randomly, so you may not gain
anything ( I would compare the query time against a simple 'count(one)'
query, to get a benchmark of how much time the server spends reading
the table; I would bet on 'about 80 ms' ).
Francisco Olarte.
On 18 October 2016 at 19:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andy Colson <andy@squeakycode.net> writes:
On 10/18/2016 11:44 AM, Francisco Olarte wrote:
This should be faster, but to me it seems it does a different thing.
Ah, yes, you're right, there is a bit of a difference there.
If you don't want to have an implicit bias towards earlier blocks,
I don't think that either standard tablesample method is really what
you want.
The contrib/tsm_system_rows tablesample method is a lot closer, in
that it will start at a randomly chosen block, but if you just do
"tablesample system_rows(1)" then you will always get the first row
in whichever block it lands on, so it's still not exactly unbiased.
Is there a reason why we can't fix the behaviours of the three methods
mentioned above by making them all start at a random block and a
random item between min and max?
It wasn't ever intended to be biased, and bernoulli in particular ought
to have strictly no bias.
Happy to patch if we agree.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Simon Riggs <simon@2ndquadrant.com> writes:
On 18 October 2016 at 19:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
If you don't want to have an implicit bias towards earlier blocks,
I don't think that either standard tablesample method is really what
you want.
The contrib/tsm_system_rows tablesample method is a lot closer, in
that it will start at a randomly chosen block, but if you just do
"tablesample system_rows(1)" then you will always get the first row
in whichever block it lands on, so it's still not exactly unbiased.
Is there a reason why we can't fix the behaviours of the three methods
mentioned above by making them all start at a random block and a
random item between min and max?
The standard tablesample methods are constrained by other requirements,
such as repeatability. I am not sure that loading this one on top of
that is a good idea. The bias I referred to above is *not* the fault
of the sample methods, rather it's the fault of using "LIMIT 1".
It does seem like maybe it'd be nice for tsm_system_rows to start at a
randomly chosen entry in the first block it visits, rather than always
dumping that entire block. Then "tablesample system_rows(1)" would
actually give you a pretty random row, and I think we aren't giving up
any useful properties it has now.
regards, tom lane
On 18 October 2016 at 22:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
On 18 October 2016 at 19:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
If you don't want to have an implicit bias towards earlier blocks,
I don't think that either standard tablesample method is really what
you want.
The contrib/tsm_system_rows tablesample method is a lot closer, in
that it will start at a randomly chosen block, but if you just do
"tablesample system_rows(1)" then you will always get the first row
in whichever block it lands on, so it's still not exactly unbiased.
Is there a reason why we can't fix the behaviours of the three methods
mentioned above by making them all start at a random block and a
random item between min and max?
The standard tablesample methods are constrained by other requirements,
such as repeatability. I am not sure that loading this one on top of
that is a good idea. The bias I referred to above is *not* the fault
of the sample methods, rather it's the fault of using "LIMIT 1".
Hmm, yeh, that would make it a little too much of a special case.
It does seem like maybe it'd be nice for tsm_system_rows to start at a
randomly chosen entry in the first block it visits, rather than always
dumping that entire block. Then "tablesample system_rows(1)" would
actually give you a pretty random row, and I think we aren't giving up
any useful properties it has now.
OK, will patch that.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services