statistic target and sample rate

Started by Luca Ferrari · almost 5 years ago · 2 messages · general
#1Luca Ferrari
fluca1978@gmail.com

Hi all,
I have a question about the sampling rate used for the statistics target.
Documentation states that:

"The largest statistics target among the columns being analyzed
determines the number of table rows sampled to prepare the statistics.
Increasing the target causes a proportional increase in the time and
space needed to do ANALYZE."

My question is how the "number of table rows sampled" is computed,
so let's take a single-column table as an example:

testdb=> create table stat as select v from generate_series( 1, 1000000 ) v;
SELECT 1000000
testdb=> analyze verbose stat;
INFO: analyzing "public.stat"
INFO: "stat": scanned 4425 of 4425 pages, containing 1000000 live
rows and 0 dead rows; 30000 rows in sample, 1000000 estimated total
rows
ANALYZE

The table starts with one million tuples, and the statistics target at
its default value (100). ANALYZE samples 30k rows, roughly 3% of the
table, while I was expecting 100k tuples (since I assumed that the max
value for the statistics target, 10000, corresponds to the whole table).

I can get a closer number of sampled rows by considering sampling over
the pages of the relation, but I'm not sure this is how it actually
operates:

testdb=> alter table stat alter column v set statistics 500;
ALTER TABLE
testdb=> analyze verbose stat;
INFO: analyzing "public.stat"
INFO: "stat": scanned 4425 of 4425 pages, containing 1000000 live
rows and 0 dead rows; 150000 rows in sample, 1000000 estimated total
rows
ANALYZE
testdb=> select reltuples / relpages::real * attstattarget from
pg_class c join pg_attribute a on a.attrelid = c.oid where c.relname =
'stat' and a.attname = 'v';
?column?
--------------------
112994.34661865234
(1 row)

Therefore my question is how the statistics collector decides
the number of tuples to be sampled.

Thanks,
Luca
-

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Luca Ferrari (#1)
Re: statistic target and sample rate

Luca Ferrari <fluca1978@gmail.com> writes:

Therefore my question is how the statistics collector decides
the number of tuples to be sampled.

It's basically 300 times the largest statistics target:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/analyze.c;h=0c9591415e4b97dd5c5e693af1860294284a1575;hb=HEAD#l1919

Per that comment, there is good math backing this choice for the task
of making a histogram. It's a little shakier for other sorts of
statistics --- notably, for n_distinct estimation, the error can still
be really bad.
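The rule can be sketched as simple arithmetic (an illustration of the formula, not PostgreSQL's code; the constant 300 is the multiplier used in std_typanalyze in analyze.c):

```python
# Illustration of ANALYZE's sample-size rule (a sketch, not PostgreSQL source):
# the target number of sampled rows is 300 times the largest per-column
# statistics target among the columns being analyzed.

def analyze_sample_size(stats_targets):
    """Rows ANALYZE aims to sample, given the per-column statistics targets."""
    return 300 * max(stats_targets)

# Matches the ANALYZE VERBOSE output shown earlier in the thread:
print(analyze_sample_size([100]))  # default target 100 -> 30000 rows in sample
print(analyze_sample_size([500]))  # target 500 -> 150000 rows in sample
```

This also explains why the sample size did not scale with the table's row count: the target is fixed per statistics target, independent of table size.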

regards, tom lane