pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Hi,
I'm currently running some tests on a 3TB TPC-H data set, and I tripped
over a pretty bad n_distinct underestimate, causing OOM in HashAgg
(which somehow illustrates the importance of the memory-bounded hashagg
patch Jeff Davis is working on).
The problem is Q18, particularly this simple subquery:
select l_orderkey
from lineitem
group by l_orderkey
having sum(l_quantity) > 313;
which is planned like this:
QUERY PLAN
---------------------------------------------------------------------------------
HashAggregate (cost=598510163.92..598515393.93 rows=418401 width=12)
Group Key: l_orderkey
Filter: (sum(l_quantity) > '313'::double precision)
-> Seq Scan on lineitem (cost=0.00..508509923.28 rows=18000048128
width=12)
(4 rows)
but sadly, in reality the l_orderkey cardinality looks like this:
tpch=# select count(distinct l_orderkey) from lineitem;
count
------------
4500000000
(1 row)
That's a helluva difference - not the usual one or two orders of
magnitude, but a 10000x underestimate.
The usual thing to do in this case is increasing statistics target, and
while this improves the estimate, the improvement is rather small:
statistics target      estimate      difference
--------------------------------------------------
              100        429491           10000
             1000       4240418            1000
            10000      42913759             100
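(For reference, each row of that table is just the result of bumping the
per-column target and re-analyzing, something like
ALTER TABLE lineitem ALTER COLUMN l_orderkey SET STATISTICS 1000;
ANALYZE lineitem;
SELECT attname, n_distinct FROM pg_stats
 WHERE tablename = 'lineitem' AND attname = 'l_orderkey';
with 100 / 1000 / 10000 as the target.)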
I find the pattern rather strange - every time the statistics target
increases 10x, the difference decreases 10x - maybe that's natural, but
the perfect proportionality is suspicious IMHO.
Also, this is a quite large dataset - the table has ~18 billion rows,
and even with target=10000 we're sampling only 3M rows, which is ~0.02%.
That's a tiny sample, so inaccuracy is naturally expected, but OTOH the
TPC-H dataset is damn uniform - there's pretty much no skew in the
distributions AFAIK. So I'd expect a slightly better result.
With target=10000 the plan switches to GroupAggregate, because the
estimated hash table gets large enough to exceed work_mem (2GB). But it's
still way off, and it's mostly just a lucky coincidence.
So I'm wondering if there's some bug because of the dataset size (an
integer overflow or something like that), so I added a bunch of logging into
the estimator, logging all the parameters computed:
target=100 (samplerows=30000)
-----------------------------
WARNING: attnum=1 attname=l_orderkey f1=27976 ndistinct=28977
nmultiple=1001 toowide_cnt=0 d=28977 numer=869310000.000000
denom=2024.046627 stadistinct=429491.094029
WARNING: ndistinct estimate attnum=1 attname=l_orderkey
current=429491.09 adaptive=443730.00
target=1000 (samplerows=300000)
-------------------------------
WARNING: attnum=1 attname=l_orderkey f1=279513 ndistinct=289644
nmultiple=10131 toowide_cnt=0 d=289644 numer=86893200000.000000
denom=20491.658538 stadistinct=4240418.111618
WARNING: ndistinct estimate attnum=1 attname=l_orderkey
current=4240418.11 adaptive=4375171.00
target=10000 (samplerows=3000000)
---------------------------------
WARNING: attnum=1 attname=l_orderkey f1=2797888 ndistinct=2897799
nmultiple=99911 toowide_cnt=0 d=2897799 numer=8693397000000.000000
denom=202578.313396 stadistinct=42913759.396282
WARNING: ndistinct estimate attnum=1 attname=l_orderkey
current=42913759.40 adaptive=44449882.00
It's totalrows=18000049031 in all cases. The logs also show the estimate
produced by the adaptive estimator (discussed in a separate thread), but
apparently that does not change the estimates much :-(
Any ideas?
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I'm currently running some tests on a 3TB TPC-H data set, and I tripped
over a pretty bad n_distinct underestimate, causing OOM in HashAgg (which
somehow illustrates the importance of the memory-bounded hashagg patch Jeff
Davis is working on). [...] That's a helluva difference - not the usual one
or two orders of magnitude, but a 10000x underestimate.
Is the row order in the table correlated with the value l_orderkey?
Could you create a copy of the table ordered at random, and see if it
exhibits the same estimation issue?
Cheers,
Jeff
On 06/19/2015 08:32 PM, Jeff Janes wrote:
[...]
Is the row order in the table correlated with the value l_orderkey?
The statistics look like this:
attnum | attname | n_distinct | correlation
--------+-----------------+-------------+-------------
1 | l_orderkey | 4.27349e+07 | 0.988631
so yes, it's pretty much perfectly correlated.
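(That's straight from pg_stats, e.g. something like
select a.attnum, s.attname, s.n_distinct, s.correlation
from pg_stats s join pg_attribute a
  on a.attrelid = 'lineitem'::regclass and a.attname = s.attname
where s.tablename = 'lineitem' and s.attname = 'l_orderkey';
if anyone wants to check their own tables.)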
Could you create a copy of the table ordered at random, and see if it
exhibits the same estimation issue?
Sadly no - this is a 2.5TB table, so it's not really easy to do. But I don't
really see why that should matter, because the sample is supposed to be
random.
But I think you might be on to something, because I manually collected a
random sample with 30k rows (by explicitly generating 30k random TIDs),
and I get this:
tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt
from lineitem_sample group by 1) foo group by 1;
cnt | count
-----+-------
1 | 29998
2 | 1
(2 rows)
That's quite different compared to what analyze gets, which effectively
looks something like this (this is derived from the logs, so not
perfectly accurate - I only have f1, ndistinct, nmultiple):
cnt | count
-----+-------
1 | 27976
2 | 976
3 | 24
Am I wrong or is the sample not that random?
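(For completeness, the lineitem_sample table was built from explicit random
TIDs - roughly like the sketch below. The 50 line pointers per page is just
a guess, and TIDs pointing at dead or non-existent tuples simply drop out,
so it over-generates a bit:
create table lineitem_sample as
select l.*
from (select format('(%s,%s)',
                    (random() * 328509441)::bigint,   -- random block number
                    1 + (random() * 50)::int          -- guessed line pointer
             )::tid as tid
      from generate_series(1, 35000)) t,
     lateral (select l_orderkey, l_quantity
              from lineitem where ctid = t.tid) l;
)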
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jun 19, 2015 at 12:27 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
[...]
Am I wrong or is the sample not that random?
The sample is not truly random. The two-stage sampling method causes too
few blocks to have exactly one row chosen from them, and too many to have
either 0 or 2+ rows chosen from them.
When values in the same block are likely to be equal, then it finds too
many duplicates because it too often picks two rows from a single block.
See analysis here:
/messages/by-id/CAMkU=1wRH_jopyCAyUKbdQY4DWhsx1-1e2s0VVgfrryfXDe2SQ@mail.gmail.com
If we assume all the blocks have the same tuple density, then it is easy to
correct this. But without that assumption of constant tuple density, I
don't know how to get a truly random sample while still skipping most of
the table.
Cheers,
Jeff
On 06/19/2015 09:48 PM, Jeff Janes wrote:
[...]
The sample is not truly random. The two-stage sampling method causes
too few blocks to have exactly one row chosen from them, and too many to
have either 0 or 2+ rows chosen from them. When values in the same block
are likely to be equal, then it finds too many duplicates because it too
often picks two rows from a single block.
Yeah, I came to the same conclusion after a bit of experimenting. I've
logged the block numbers for all the 30k sampled tuples (target=100) and
I get these statistics for the number of repetitions:
cnt | count
-----+-------
1 | 11020
2 | 5637
3 | 1800
4 | 450
5 | 94
6 | 6
so 11020 blocks have exactly 1 tuple sampled from them, 5637 blocks have
2 tuples sampled etc.
With truly random sampling (just generating 30k random numbers between 0
and 328509442, the number of pages of this particular table), I get this:
test=# select cnt, count(*) from (select (328509442 * random())::int AS
blockno, count(*) AS cnt from blocks group by 1) foo group by 1 order by 1;
cnt | count
-----+-------
1 | 29994
2 | 3
So yeah, not really random.
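(The same experiment can be repeated without the helper table, e.g.
select cnt, count(*)
from (select (328509442 * random())::int as blockno, count(*) as cnt
      from generate_series(1, 30000) group by 1) foo
group by 1 order by 1;
)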
See analysis here:
/messages/by-id/CAMkU=1wRH_jopyCAyUKbdQY4DWhsx1-1e2s0VVgfrryfXDe2SQ@mail.gmail.com
Thanks.
If we assume all the blocks have the same tuple density, then it is
easy to correct this. But without that assumption of constant tuple
density, I don't know how to get a truly random sample while still
skipping most of the table.
Hmmm, that's probably true. OTOH correlated columns are not all that
uncommon (e.g. table storing time-series data etc.), and this blowup is
quite bad ...
I don't think we need to really assume the density to be constant, maybe
we can verify that while collecting the sample? I mean, we're already
reading the pages, so we can track the density, and either do the
correction or not.
Also, doesn't Vitter make pretty much the same assumption implicitly?
Otherwise it couldn't skip some of the blocks.
regards
Tomas
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jun 19, 2015 at 1:39 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
[...]
Hmmm, that's probably true. OTOH correlated columns are not all that
uncommon (e.g. table storing time-series data etc.), and this blowup is
quite bad ...
True, but we don't know how big a problem the density skew might
be (since the current algorithm isn't sensitive to it). It might be just
as big of a problem. Tom mentioned some ancient history in the
above-mentioned thread that made me think the density skew was enough of a
problem to motivate the current system.
I don't think we need to really assume the density to be constant, maybe
we can verify that while collecting the sample? I mean, we're already
reading the pages, so we can track the density, and either do the
correction or not.
Maybe. I don't know how that would work. We would have to keep two
samples, and dynamically decide which to use. And what if the decision is
that both density skew is a problem and that value clustering is also a
problem?
I wonder if the n_distinct could be tweaked so that it counted any given
value only once for each block it finds it in? So instead of asking "how
many times was this value sampled", ask "in how many blocks was this value
sampled at least once"?
Also, doesn't Vitter do pretty much the same assumption implicitly,
otherwise it couldn't skipping some of the blocks?
Vitter samples an unstructured stream in a single pass, and is unbiased.
The explicit block sampling is not part of Vitter, it is something we
bolted on top of it.
My solution was to just unbolt the block sampling from the top, and let it
sample the rows (still 300 * stats_target of them) from the whole table
rather than a random 300 * 100 blocks of the table. On tables of the size
I was worried about, block sampling was not very helpful anyway. Reading
30,000 blocks out of 250,000 blocks led to no meaningful IO advantage on
my hardware. Any advantage from skipped blocks was made up for (and
sometimes exceeded) by fouling up the read-ahead mechanisms.
With 1000 times more blocks, that probably won't work for you.
But I do wonder, how much time does it take to read a random 1/10, 1/100,
1/1000, 1/10000 of a table of your size, just from an IO perspective? How
much are we gaining by doing the block sample?
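(One crude way to get a feel for that, now that we have TABLESAMPLE: SYSTEM
reads whole random blocks, so timing something like
explain (analyze, buffers)
select count(*) from lineitem tablesample system (0.01);
for a few different percentages (0.01, 0.1, 1, 10) should give a rough idea
of the raw I/O cost of each sampling fraction.)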
Cheers,
Jeff
On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I'm currently running some tests on a 3TB TPC-H data set, and I tripped
over a pretty bad n_distinct underestimate, causing OOM in HashAgg (which
somehow illustrates the importance of the memory-bounded hashagg patch Jeff
Davis is working on). [...]
While better sample/stats is important for choosing a good plan, in this
query, hash agg is really the right plan. If a sort agg is chosen, the
performance will be really, really bad. The patch that Jeff is working on
is critical for a decent TPC-H number (unless you have an unlimited amount
of memory).
Thanks,
On Wed, Jun 17, 2015 at 1:52 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I'm currently running some tests on a 3TB TPC-H data set, and I tripped over
a pretty bad n_distinct underestimate, causing OOM in HashAgg (which somehow
illustrates the importance of the memory-bounded hashagg patch Jeff Davis is
working on).
Stupid question, but why not just override it using ALTER TABLE ...
ALTER COLUMN ... SET (n_distinct = ...)?
I think it's been discussed quite often on previous threads that you
need to sample an awful lot of the table to get a good estimate for
n_distinct. We could support that, but it would be expensive, and it
would have to be done again every time the table is auto-analyzed.
The above syntax supports nailing the estimate to either an exact
value or a percentage of the table, and I'm not sure why that isn't
good enough.
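(For the case at hand the override could look like this - a negative value
means a fraction of reltuples, so -0.25 matches the ~4.5B distinct keys in
the ~18B-row table, and it takes effect at the next ANALYZE:
ALTER TABLE lineitem ALTER COLUMN l_orderkey SET (n_distinct = -0.25);
ANALYZE lineitem;
)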
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 06/20/2015 08:54 AM, Feng Tian wrote:
While better sample/stats is important for choosing a good plan, in
this query, hash agg is really the right plan. If a sort agg is
chosen, the performance will be really really bad. The patch that
Jeff is working on is critical for a decent TPCH number (unless you
have unlimited amount of memory).
I do agree that Jeff's memory-bounded hashagg patch is a very important
feature, and in fact we spent a fair amount of time discussing it in
Ottawa. So I'm looking forward to getting that soon ;-)
But I don't think hashagg is going to be very good in this particular
case. With a 3TB dataset, the query runs out of memory on a machine with
256GB of RAM. So let's assume a complete hash table takes ~256GB. With
work_mem=1GB that means only ~1/256 of the table can be processed in one
batch, so we'll process the first 1/256 of the table, and write out the
remaining 99% into batches. Then we'll read the batches one by one, and
process those. The table is ~2.5TB, so we'll read 2.5TB, write out
~2.49TB into batches, and then read those ~2.49TB again. At least that's
how I understand Jeff's memory-bounded hashagg proposal.
The sort may perform worse in the general case, but in this case there's
an index on the column, and the table is almost perfectly correlated by
that column (due to generating the orders one by one, but it seems
plausible it'd be the same in reality, assuming the orders are numbered
using a sequence). So doing the sort by an indexscan seems rather cheap,
and you only need to scan the table once.
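(That's also easy to test without touching the stats - disabling hash
aggregation for the session forces the planner onto the sort/indexscan path:
set enable_hashagg = off;
explain select l_orderkey from lineitem
 group by l_orderkey having sum(l_quantity) > 313;
)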
regards
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 06/20/2015 04:17 PM, Robert Haas wrote:
On Wed, Jun 17, 2015 at 1:52 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I'm currently running some tests on a 3TB TPC-H data set, and I
tripped over a pretty bad n_distinct underestimate, causing OOM in
HashAgg (which somehow illustrates the importance of the
memory-bounded hashagg patch Jeff Davis is working on).
Stupid question, but why not just override it using ALTER TABLE ...
ALTER COLUMN ... SET (n_distinct = ...)?
Sure, I'll do that, and it's probably the only way to do that at the
moment. But I wasn't really sure why we're producing such a poor estimate
in the first place - even considering how small the sample is, it seemed a
bit too bad, and the proportionality to the statistics target seemed really
strange.
Also, if we could produce better n_distinct estimate, wouldn't that be
awesome? I'd much rather not force users to use the manual override if
possible.
I think it's been discussed quite often on previous threads that you
need to sample an awful lot of the table to get a good estimate for
n_distinct. We could support that, but it would be expensive, and it
would have to be done again every time the table is auto-analyzed.
The above syntax supports nailing the estimate to either an exact
value or a percentage of the table, and I'm not sure why that isn't
good enough.
Not really. Using a larger sample would certainly help in most cases, but
in this case the main problem is that the sample is not as random as
needed. It produces way more duplicate values than it should, which then
utterly confuses the estimator, which assumes a random sample. This happens
because the column is almost perfectly correlated with the physical order
of the rows in the table.
I've lobotomized the sampling a bit to really produce a random set of
blocks first, and that produces way better estimates:
statistics target    estimate (current)       estimate (random blocks)
-----------------------------------------------------------------------
              100       429491 (10000x)           334430766 (14x)
             1000      4240418  (1000x)           439010499 (10x)
Also, the number of sampled blocks is not that different. With target
100, the current sampler reads ~2900 blocks, while a completely random
sampler uses 3000 blocks. So, where's the benefit?
I'm sure there are counter-examples, but that's true for all estimators.
I do realize the random sampler assumes the tuple density is about the
same on all blocks (so that it can sample the blocks directly), but ISTM
the current block sampler has to make basically the same assumption to
skip some of the blocks.
Have to read the previous threads on this topic I guess ...
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jun 20, 2015 at 7:56 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
[...]
The table is ~2.5TB, so we'll read 2.5TB, write out ~2.49TB into batches,
and then read those ~2.49TB again. At least that's how I understand Jeff's
memory-bounded hashagg proposal. [...]
I have not read Jeff's patch, but here is how I think hash agg should work:
hash agg scans the lineitem table, performing the aggregation in memory.
Once work_mem is exhausted, it writes intermediate state to disk, bucket by
bucket. When the lineitem table is finished, it reads all tuples from one
bucket back, combining the intermediate states and finalizing the
aggregation. I saw a quite extensive discussion on combining aggregates on
the dev list, so I assume it will be added.
Assuming that (modulo an efficient size for I/O) work_mem is bigger than
the square root of the data after aggregation, the above algorithm can
finish by writing out and reading back only once.
For TPC-H 3T, the lineitem table has about 20 billion rows and 4 or 5
billion orders. For the simple subquery, one needs to
1. scan the table: 3TB of I/O
2. write out intermediate state: 4 billion * sizeof(key column +
intermediate state ~ 20 bytes) = 80GB
3. read back 80GB.
If sort is used, also assuming work_mem is bigger than the square root of
the data, you need to scan the table, write out about 20B * 20 ~ 400GB, and
read back 400GB. Sort may have to do extra rounds of merge, but let's
ignore that.
Hash agg has better performance because:
1. less I/O
2. hash is a linear algorithm, compared to sort at n*lg(n).
Feng Tian wrote:
I have not read Jeff's patch, but here is how I think hash agg should work,
I think you should discuss that in Jeff's thread, not here.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 06/20/2015 05:29 PM, Feng Tian wrote:
I have not read Jeff's patch, but here is how I think hash agg should work,
Hash agg scan lineitem table, perform aggregation in memory. Once
workmem is exhausted, it write intermediate state to disk, bucket by
bucket. When lineitem table is finished, it reads all tuples from one
bucket back, combining intermediate state and finalize the aggregation.
I saw a quite extensive discussion on combining aggregation on the
dev list, so I assume it will be added.
That's not really how the proposed patch works - partly because we don't
have a good way to serialize/deserialize the aggregate state, etc.
There are also various corner cases where you can end up writing much
more data than you assumed, but let's discuss that in the thread about
the patch, not here.
regards
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 06/20/2015 03:01 AM, Jeff Janes wrote:
Hmmm, that's probably true. OTOH correlated columns are not all that
uncommon (e.g. table storing time-series data etc.), and this blowup
is quite bad ...
True, but we don't know how big a problem the density skew might be
(since the current algorithm isn't sensitive to it). It might be just
as big of a problem. Tom mentioned some ancient history in the
above-mentioned thread that made me think the density skew was enough
of a problem to motivate the current system.
I don't think we need to really assume the density to be constant,
maybe we can verify that while collecting the sample? I mean, we're
already reading the pages, so we can track the density, and either
do the correction or not.
Maybe. I don't know how that would work.
I was thinking about something like
(1) compute average tuples per page
(2) generate random sample of 'samplesize' blocks
(3) for each block, see what is the actual number of tuples on the
page and perform correction (e.g. if there's only 1/2 the average
tuples, toss a coin and decide whether to really sample the block)
(4) repeat until you have sample of the requested size
The first weak spot is that this requires knowledge of reltuples,
which is something we currently estimate based on the sample, so it
would have to use the old values I guess.
Also, I'm not sure this works that great with blocks that have higher
density - you'd like to choose them with a higher probability, but if
you've already selected them there's no point in repeating that.
We would have to keep two samples, and dynamically decide which to
use. And what if the decision is that both density skew is a problem
and that value clustering is also a problem?
I don't see why we would have to keep two samples? The idea is that
sampling the blocks truly randomly solves the clustering issue, and the
correction I mentioned above solves the density skew problem.
I wonder if the n_distinct could be tweaked so that it counted any
given value only once for each block it finds it in? So instead of
asking "how many times was this value sampled", ask "in how many
blocks was this value sampled at least once"?
I don't think that's a really good idea. Imagine a column with a very
low cardinality, so that every block you sample has just a very few
distinct values. That's going to get very nasty very soon, especially
when combined with estimators assuming a truly random sample (which is
exactly the issue with the current ndistinct estimator).
Also, doesn't Vitter make pretty much the same assumption implicitly?
Otherwise it couldn't skip some of the blocks.
Vitter samples an unstructured stream in a single pass, and is
unbiased. The explicit block sampling is not part of Vitter, it is
something we bolted on top of it.
Yeah, you're right of course. Sorry for not being quite precise. We're
using a variant of the S algorithm, based on the idea that we know the
number of blocks at the beginning - but that's pretty much the place
where we assume uniform density of rows per block, no? Because S is
skipping blocks using that assumption implicitly.
My solution was to just unbolt the block sampling from the top, and let
it sample the rows (still 300 * stats_target of them) from the whole
table rather than a random 300 * 100 blocks of the table. On tables of
the size I was worried about, block sampling was not very helpful
anyway. Reading 30,000 blocks out of 250,000 blocks lead to no
meaningful IO advantage on my hardware. Any advantage from skipped
blocks was made up for (and sometimes exceeded) by fouling up the
read-ahead mechanisms.
With 1000 times more blocks, that probably won't work for you.
But I do wonder, how much time does it take to read a random 1/10,
1/100, 1/1000, 1/10000 of a table of your size, just from an IO
perspective? How much are we gaining by doing the block sample?
Well, actually I think it would be even more appropriate for very large
tables. With a 2.5TB table, you don't really care whether analyze
collects 5GB or 8GB sample, the difference is rather minor compared to
I/O generated by the other queries etc. The current sample is already
random enough not to work well with read-ahead, and it scans only a
slightly lower number of blocks.
And if the larger "random" sample results in better plans (especially
plans without OOM) ...
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 6/20/15 12:55 PM, Tomas Vondra wrote:
Well, actually I think it would be even more appropriate for very large
tables. With a 2.5TB table, you don't really care whether analyze
collects 5GB or 8GB sample, the difference is rather minor compared to
I/O generated by the other queries etc. The current sample is already
random enough not to work well with read-ahead, and it scans only a
slightly lower number of blocks.
Have we ever looked at generating new stats as part of a seqscan? I
don't know how expensive the math is but if it's too much to push to a
backend perhaps a bgworker could follow behind the seqscan.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Sat, Jun 20, 2015 at 9:55 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
Hi,
On 06/20/2015 03:01 AM, Jeff Janes wrote:
I don't think we need to really assume the density to be constant,
maybe we can verify that while collecting the sample? I mean, we're
already reading the pages, so we can track the density, and either
do the correction or not.
Maybe. I don't know how that would work.
I was thinking about something like
(1) compute average tuples per page
(2) generate random sample of 'samplesize' blocks
(3) for each block, see what is the actual number of tuples on the
page and perform correction (e.g. if there's only 1/2 the average
tuples, toss a coin and decide whether to really sample the block)
(4) repeat until you have sample of the requested size
Are you producing a block sample or a row sample? If a block is sampled,
then what do you feed it to?
Are you just picking one row out of each block that survives step 3? If
so, that would be similar to my idea of counting a given value only once
per block it was in, except I was thinking of applying that only to
n_distinct, not to the entire stats collection process.
...
My solution was to just unbolt the block sampling from the top, and let
it sample the rows (still 300 * stats_target of them) from the whole
table rather than a random 300 * 100 blocks of the table. On tables of
the size I was worried about, block sampling was not very helpful
anyway. Reading 30,000 blocks out of 250,000 blocks lead to no
meaningful IO advantage on my hardware. Any advantage from skipped
blocks was made up for (and sometimes exceeded) by fouling up the
read-ahead mechanisms.
With 1000 times more blocks, that probably won't work for you.
But I do wonder, how much time does it take to read a random 1/10,
1/100, 1/1000, 1/10000 of a table of your size, just from an IO
perspective? How much are we gaining by doing the block sample?
Well, actually I think it would be even more appropriate for very large
tables. With a 2.5TB table, you don't really care whether analyze collects
5GB or 8GB sample, the difference is rather minor compared to I/O generated
by the other queries etc.
My answer was to take out the block sampling entirely and read the whole
table. That is what probably won't work for you. (Come to think of it, I
was hesitant to deploy custom code to production, so I never actually
deployed that. Instead I cranked up the stats target and let the chips fall
where they may)
I think you want to go back to the table to find new blocks to replace ones
which were included in the original block sample but ended up having no
tuples in the end tuple sample, either because they were low density
blocks, or just through the luck of the draw. But I don't see how that fixes
the problem, unless you also prevent a block from contributing more than 1
tuple. And at that point, I worry about the uneven density again. If a
block has 2 rows selected by pure random chance, I think it would be fine
to keep only one (or, go find another random block to pull one row from as
a replacement). But if it has 2 rows selected because it has more rows
than average, then we would have to pull the replacement from a random
block *of similar density* to avoid swapping one bias for another one.
Cheers,
Jeff
On Sat, Jun 20, 2015 at 8:28 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
Hi Tomas,
I've lobotomized the sampling a bit to really produce a random set of
blocks first, and that produces way better estimates:
statistics target estimate random
-----------------------------------------------------------------
100 429491 (10000x) 334430766 (14x)
1000 4240418 (1000x) 439010499 (10x)
Also, the number of sampled blocks is not that different. With target 100,
the current sampler reads ~2900 blocks, while a completely random sampler
uses 3000 blocks. So, where's the benefit?
I don't know what you did. The block sampling is already random; unless there
is some overlooked bug, it can't be made more random. Could you post the
patch?
As I understand it, with a target of 100, it should be sampling exactly
30,000 blocks. Some of those blocks will end up having no rows chosen from
them (just by chance), but the blocks were still read. If a few thousand
of those blocks end up with no tuples in the final sample, the motivation
for that was not to save IO, it is just an artifact of how the random
sampling works.
Thanks,
Jeff
On 06/22/2015 07:21 AM, Jeff Janes wrote:
[...]
Are you producing a block sample or a row sample? If a block is sampled,
then what do you feed it to?
I'm not sure what the question is. The ultimate goal is to get a random
row sample, but we need to choose which blocks to sample first. So you
can just generate K random numbers (where K is the requested sample
size). If you get a block number once, sample one row; if you get the
block twice, sample two rows, etc.
Of course, this is based on the assumption of uniform tuple density,
which allows you to choose block first and then independently a row from
that block. Hence the correction idea I outlined in the previous message.
Are you just picking one row out of each block that survives step 3?
If so, that would be similar to my idea of counting a given value
only once per block it was in, except I was thinking of applying that
only to n_distinct, not to the entire stats collection process.
No, I'm not doing that, and I think it's a bad idea in general. The
problem with the current sampling is that it does not produce a random
sample of rows, but something skewed, which causes serious trouble in
the estimators processing the sample. I think this 'count only once'
would result in similar issues (e.g. for low-cardinality columns).
My answer was to take out the block sampling entirely and read the
whole table. That is what probably won't work for you. (Come to think
of it, I was hesitant to deploy custom code to production, so I never
actually deployed that. Instead I cranked up the stats target and let
the chips fall where they may)
Yeah, reading the whole table is not going to fly I guess. But the
estimates would be very accurate! ;-)
I think you want to go back to the table to find new blocks to
replace ones which were included in the original block sample but
ended up having no tuples in the end tuple sample, either because
they were low density blocks, or just through the luck of the draw.
But I don't see how that fixes the problem, unless you also prevent a
block from contributing more than 1 tuple. And at that point, I worry
about the uneven density again. If a block has 2 rows selected by
pure random chance, I think it would be fine to keep only one (or, go
find another random block to pull one row from as a replacement). But
if it has 2 rows selected because it has more rows than average, then
we would have to pull the replacement from a random block *of similar
density* to avoid swapping one bias for another one.
Hmmm, maybe limiting the sample to just 1 tuple per block is a good idea,
but I have to think about that a bit more.
The basic idea behind the density correction is that when you get a block
with density significantly different from the average (let's say by more
than 10%?), you can either treat it as a partial block (density below
average), or as a collection of multiple blocks (density above average),
and decide whether it should really be sampled.
Handling the 'partial block' case seems rather simple - if the block has
half the average density, do a random coin toss and either keep or
discard the tuple.
With high density blocks it's probably more complicated, and I think it
really means treating it as multiple 'virtual' blocks, and only keeping
samples from the first one.
And of course, this needs to be repeated if some of the rows get evicted
because of the correction mechanism. Which will lead to slightly larger
samples, but I don't see that as a major problem (and the difference is
not that big, at least not in the case discussed in this thread previously).
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 06/22/2015 07:47 AM, Jeff Janes wrote:
[...]
I don't know what you did. The block sampling is already random; unless
there is some overlooked bug, it can't be made more random. Could you
post the patch?
Attached is an early experimental version of the sampling with density
correction. It breaks the SYSTEM sampling for TABLESAMPLE, because it
changes the signature of one of the BlockSampler methods, but that can be
fixed later. I also suspect it fails with some small tables that would
get sampled completely with the current patch.
The density-correction idea is quite simple, and the sampling works like
this:
(1) randomly sample the blocks
Generate random sequence of block numbers, and track how many
times a particular block was sampled - so if we got block number
K twice, we know we should sample 2 tuples from that block.
(2) sample the tuples
For each of the blocks, sample as many tuples as needed. Also
note the number of tuples on the block.
(3) determine maximum density
Find what the maximum number of tuples per block is.
(4) resample the tuples
For each block, resample the tuples using ratio vs. maximum
density. So if a block has 50% density, each sampled tuple will
be evicted with ~50% probability.
(5) replace the evicted tuples (not implemented)
In step (4) some of the tuples will get evicted from the sample,
so we should replace them with other tuples. Not yet done, so the
sample may get smaller than requested.
And it seems to be working quite fine. I've done various tests with
tables specifically crafted to use different tuple widths for different
values, like for example this:
create table test as
select
(case when i < 500000 then 1 else 2 end) AS id,
(case when i < 500000 then 1 else null end)::bigint as val1,
(case when i < 500000 then 1 else null end)::bigint as val2,
(case when i < 500000 then 1 else null end)::bigint as val3
from generate_series(1,1000000) s(i);
which creates a table with tuples containing 50% id=1 and 50% id=2, with
the tuples id=1 being much wider (as id=2 have NULLs at the end). And it
works fine - the numbers are pretty much the same as current estimates.
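(The obvious sanity check is then to compare the stats against reality,
e.g.
analyze test;
select attname, n_distinct from pg_stats where tablename = 'test';
select count(distinct id) from test;
)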
It also significantly improves the ndistinct estimates - for example on
the TPC-H data set, I do get ndistinct=-1 for the l_orderkey column even
with statistics target=100, which is way better than the 10000x
under-estimate produced by the current algorithm.
I do take this as a demonstration that our ndistinct estimator is not
such crap as some claim it to be, but that we're merely feeding it
skewed samples. I still think the adaptive estimator is a neat idea, but
it can't be fixed while keeping the same skewed sampling. That has to be
addressed first ...
regards
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
sampling-experimental.patch (text/x-diff)
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..8669bd8 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -97,6 +97,8 @@ static void update_attstats(Oid relid, bool inh,
static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
+static int perform_density_correction(int samplerows, HeapTuple *rows,
+ BlockSampler bs);
/*
* analyze_rel() -- analyze one relation
@@ -974,18 +976,15 @@ acquire_sample_rows(Relation onerel, int elevel,
HeapTuple *rows, int targrows,
double *totalrows, double *totaldeadrows)
{
- int numrows = 0; /* # rows now in reservoir */
- double samplerows = 0; /* total # rows collected */
+ int samplerows = 0; /* total # rows collected */
double liverows = 0; /* # live rows seen */
double deadrows = 0; /* # dead rows seen */
- double rowstoskip = -1; /* -1 means not set yet */
BlockNumber totalblocks;
TransactionId OldestXmin;
BlockSamplerData bs;
- ReservoirStateData rstate;
Assert(targrows > 0);
-
+// pg_usleep(15*1000*1000);
totalblocks = RelationGetNumberOfBlocks(onerel);
/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
@@ -993,20 +992,28 @@ acquire_sample_rows(Relation onerel, int elevel,
/* Prepare for sampling block numbers */
BlockSampler_Init(&bs, totalblocks, targrows, random());
- /* Prepare for sampling rows */
- reservoir_init_selection_state(&rstate, targrows);
/* Outer loop over blocks to sample */
while (BlockSampler_HasMore(&bs))
{
- BlockNumber targblock = BlockSampler_Next(&bs);
+ SampledBlock targblock = BlockSampler_Next(&bs);
Buffer targbuffer;
Page targpage;
OffsetNumber targoffset,
maxoffset;
+ HeapTuple *blockrows;
+ int nblockrows;
+
vacuum_delay_point();
+ /* prefetch */
+ {
+ int b;
+ for (b = bs.cblock; (b < Min(bs.nblocks, bs.cblock+16)); b++)
+ PrefetchBuffer(onerel, MAIN_FORKNUM, bs.blocks[b].blockno);
+ }
+
/*
* We must maintain a pin on the target page's buffer to ensure that
* the maxoffset value stays good (else concurrent VACUUM might delete
@@ -1016,12 +1023,15 @@ acquire_sample_rows(Relation onerel, int elevel,
* tuple, but since we aren't doing much work per tuple, the extra
* lock traffic is probably better avoided.
*/
- targbuffer = ReadBufferExtended(onerel, MAIN_FORKNUM, targblock,
+ targbuffer = ReadBufferExtended(onerel, MAIN_FORKNUM, targblock->blockno,
RBM_NORMAL, vac_strategy);
LockBuffer(targbuffer, BUFFER_LOCK_SHARE);
targpage = BufferGetPage(targbuffer);
maxoffset = PageGetMaxOffsetNumber(targpage);
+ blockrows = (HeapTuple*)palloc0(maxoffset * sizeof(HeapTuple));
+ nblockrows = 0;
+
/* Inner loop over all tuples on the selected page */
for (targoffset = FirstOffsetNumber; targoffset <= maxoffset; targoffset++)
{
@@ -1044,7 +1054,7 @@ acquire_sample_rows(Relation onerel, int elevel,
continue;
}
- ItemPointerSet(&targtuple.t_self, targblock, targoffset);
+ ItemPointerSet(&targtuple.t_self, targblock->blockno, targoffset);
targtuple.t_tableOid = RelationGetRelid(onerel);
targtuple.t_data = (HeapTupleHeader) PageGetItem(targpage, itemid);
@@ -1118,66 +1128,49 @@ acquire_sample_rows(Relation onerel, int elevel,
}
if (sample_it)
- {
- /*
- * The first targrows sample rows are simply copied into the
- * reservoir. Then we start replacing tuples in the sample
- * until we reach the end of the relation. This algorithm is
- * from Jeff Vitter's paper (see full citation below). It
- * works by repeatedly computing the number of tuples to skip
- * before selecting a tuple, which replaces a randomly chosen
- * element of the reservoir (current set of tuples). At all
- * times the reservoir is a true random sample of the tuples
- * we've passed over so far, so when we fall off the end of
- * the relation we're done.
- */
- if (numrows < targrows)
- rows[numrows++] = heap_copytuple(&targtuple);
- else
- {
- /*
- * t in Vitter's paper is the number of records already
- * processed. If we need to compute a new S value, we
- * must use the not-yet-incremented value of samplerows as
- * t.
- */
- if (rowstoskip < 0)
- rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
-
- if (rowstoskip <= 0)
- {
- /*
- * Found a suitable tuple, so save it, replacing one
- * old tuple at random
- */
- int k = (int) (targrows * sampler_random_fract(rstate.randstate));
+ blockrows[nblockrows++] = heap_copytuple(&targtuple);
+ }
- Assert(k >= 0 && k < targrows);
- heap_freetuple(rows[k]);
- rows[k] = heap_copytuple(&targtuple);
- }
+ /* Now release the lock and pin on the page */
+ UnlockReleaseBuffer(targbuffer);
- rowstoskip -= 1;
- }
+ /* remember how many tuples were on the block */
+ targblock->ntuples = nblockrows;
- samplerows += 1;
- }
+ /* if we have more tuples than needed, we'll evict some first */
+ while (targblock->ntarget < nblockrows)
+ {
+ int r = pg_lrand48() % nblockrows;
+ heap_freetuple(blockrows[r]);
+ blockrows[r] = blockrows[nblockrows - 1];
+ nblockrows--;
}
- /* Now release the lock and pin on the page */
- UnlockReleaseBuffer(targbuffer);
+ /* now just copy the tuples into the sample */
+ {
+ int i;
+
+ /* never should be able to overflow the sample here */
+ Assert(nblockrows + samplerows <= targrows);
+
+ for (i = 0; i < nblockrows; i++)
+ rows[samplerows++] = blockrows[i];
+ }
}
/*
- * If we didn't find as many tuples as we wanted then we're done. No sort
- * is needed, since they're already in order.
- *
- * Otherwise we need to sort the collected tuples by position
- * (itempointer). It's not worth worrying about corner cases where the
- * tuples are already sorted.
+ * We need to sort the collected tuples by position (itempointer).
+ * It's not worth worrying about corner cases where the tuples are
+ * already sorted.
+ */
+ if (samplerows == targrows)
+ qsort((void *) rows, samplerows, sizeof(HeapTuple), compare_rows);
+
+ /*
+ * Perform tuple-density correction - find the block with the
+ * highest density, scale samples for the less dense blocks.
*/
- if (numrows == targrows)
- qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
+ samplerows = perform_density_correction(samplerows, rows, &bs);
/*
* Estimate total numbers of rows in relation. For live rows, use
@@ -1187,10 +1180,10 @@ acquire_sample_rows(Relation onerel, int elevel,
*/
*totalrows = vac_estimate_reltuples(onerel, true,
totalblocks,
- bs.m,
+ bs.nblocks,
liverows);
- if (bs.m > 0)
- *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
+ if (bs.nblocks > 0)
+ *totaldeadrows = floor((deadrows / bs.nblocks) * totalblocks + 0.5);
else
*totaldeadrows = 0.0;
@@ -1202,11 +1195,13 @@ acquire_sample_rows(Relation onerel, int elevel,
"containing %.0f live rows and %.0f dead rows; "
"%d rows in sample, %.0f estimated total rows",
RelationGetRelationName(onerel),
- bs.m, totalblocks,
+ bs.nblocks, totalblocks,
liverows, deadrows,
- numrows, *totalrows)));
+ samplerows, *totalrows)));
- return numrows;
+ elog(WARNING, "samplerows=%d", samplerows);
+
+ return samplerows;
}
/*
@@ -2675,3 +2670,69 @@ compare_mcvs(const void *a, const void *b)
return da - db;
}
+
+static int
+perform_density_correction(int samplerows, HeapTuple *rows, BlockSampler bs)
+{
+ int i, j;
+ int starttuple = 0;
+ int max_tuples = 0;
+
+ /* walk through blocks, find highest density */
+ for (i = 0; i < bs->nblocks; i++)
+ max_tuples = Max(bs->blocks[i].ntuples, max_tuples);
+
+ /*
+ * Correct the sample - for each block check whether the density
+ * is lower than the maximum density, and discard some of the
+ * tuples using the (density/maxdensity) probability.
+ */
+ for (i = 0; i < bs->nblocks; i++)
+ {
+ double discard_ratio = (bs->blocks[i].ntuples / (double)max_tuples);
+
+ /* not really needed (no tuples will get evicted) */
+ if (bs->blocks[i].ntuples == max_tuples)
+ continue;
+
+ /*
+ * We know both the blocks and tuples are sorted in the
+ * same way (by tid), so we'll use that to efficiently
+ * match those two arrays.
+ */
+ for (j = starttuple; j < samplerows; j++)
+ {
+ int tupleblock = ItemPointerGetBlockNumber(&rows[j]->t_self);
+
+ /* first tuple for the next block, so terminate */
+ if (tupleblock > bs->blocks[i].blockno)
+ break;
+
+ /* remember the next starting position */
+ starttuple++;
+
+ /*
+ * If it's really a tuple for the current block, do the
+ * correction, i.e. discard part of the tuples randomly.
+ * We'll just put NULL there and compact the array in
+ * one go later (although it might probably be done here).
+ */
+ if (tupleblock == bs->blocks[i].blockno)
+ {
+ if (pg_erand48(bs->randstate) > discard_ratio)
+ {
+ heap_freetuple(rows[j]);
+ rows[j] = NULL;
+ }
+ }
+ }
+ }
+
+ /* compact the array of sampled rows (remove the NULLs) */
+ j = 0;
+ for (i = 0; i < samplerows; i++)
+ if (rows[i] != NULL)
+ rows[j++] = rows[i];
+
+ return j;
+}
diff --git a/src/backend/utils/misc/sampling.c b/src/backend/utils/misc/sampling.c
index aaf1d6c..836250a 100644
--- a/src/backend/utils/misc/sampling.c
+++ b/src/backend/utils/misc/sampling.c
@@ -19,6 +19,7 @@
#include "utils/sampling.h"
+static int compare_sampled_blocks(const void *a, const void *b);
/*
* BlockSampler_Init -- prepare for random sampling of blocknumbers
@@ -37,6 +38,8 @@ void
BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize,
long randseed)
{
+ int i;
+
bs->N = nblocks; /* measured table size */
/*
@@ -44,71 +47,56 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize,
* more than samplesize blocks, here is the place to do it.
*/
bs->n = samplesize;
- bs->t = 0; /* blocks scanned so far */
- bs->m = 0; /* blocks selected so far */
sampler_random_init_state(randseed, bs->randstate);
+
+ /* generate random blocks */
+ bs->blocks = NULL;
+ bs->nblocks = 0;
+
+ if (bs->N > 0)
+ {
+ bs->blocks = (SampledBlock)palloc0(sizeof(SampledBlockData) * bs->n);
+
+ for (i = 0; i < bs->n; i++)
+ {
+ bs->blocks[i].blockno = pg_lrand48() % bs->N;
+ bs->blocks[i].nsampled = 0;
+ bs->blocks[i].ntarget = 1;
+ }
+
+ qsort((void *) bs->blocks, bs->n, sizeof(SampledBlockData),
+ compare_sampled_blocks);
+
+ bs->nblocks = 1;
+ for (i = 1; i < bs->n; i++)
+ {
+ /* existing block, so just increment number of target tuples */
+ if (bs->blocks[bs->nblocks-1].blockno == bs->blocks[i].blockno)
+ {
+ bs->blocks[bs->nblocks-1].ntarget += 1;
+ continue;
+ }
+
+ /* otherwise we have a new block number, so move it */
+ bs->nblocks++;
+ bs->blocks[bs->nblocks-1] = bs->blocks[i];
+ }
+ }
+
+ bs->cblock = 0;
}
bool
BlockSampler_HasMore(BlockSampler bs)
{
- return (bs->t < bs->N) && (bs->m < bs->n);
+ return (bs->cblock < bs->nblocks);
}
-BlockNumber
+SampledBlock
BlockSampler_Next(BlockSampler bs)
{
- BlockNumber K = bs->N - bs->t; /* remaining blocks */
- int k = bs->n - bs->m; /* blocks still to sample */
- double p; /* probability to skip block */
- double V; /* random */
-
- Assert(BlockSampler_HasMore(bs)); /* hence K > 0 and k > 0 */
-
- if ((BlockNumber) k >= K)
- {
- /* need all the rest */
- bs->m++;
- return bs->t++;
- }
-
- /*----------
- * It is not obvious that this code matches Knuth's Algorithm S.
- * Knuth says to skip the current block with probability 1 - k/K.
- * If we are to skip, we should advance t (hence decrease K), and
- * repeat the same probabilistic test for the next block. The naive
- * implementation thus requires a sampler_random_fract() call for each
- * block number. But we can reduce this to one sampler_random_fract()
- * call per selected block, by noting that each time the while-test
- * succeeds, we can reinterpret V as a uniform random number in the range
- * 0 to p. Therefore, instead of choosing a new V, we just adjust p to be
- * the appropriate fraction of its former value, and our next loop
- * makes the appropriate probabilistic test.
- *
- * We have initially K > k > 0. If the loop reduces K to equal k,
- * the next while-test must fail since p will become exactly zero
- * (we assume there will not be roundoff error in the division).
- * (Note: Knuth suggests a "<=" loop condition, but we use "<" just
- * to be doubly sure about roundoff error.) Therefore K cannot become
- * less than k, which means that we cannot fail to select enough blocks.
- *----------
- */
- V = sampler_random_fract(bs->randstate);
- p = 1.0 - (double) k / (double) K;
- while (V < p)
- {
- /* skip */
- bs->t++;
- K--; /* keep K == N - t */
-
- /* adjust p to be new cutoff point in reduced range */
- p *= 1.0 - (double) k / (double) K;
- }
-
- /* select */
- bs->m++;
- return bs->t++;
+ return &bs->blocks[bs->cblock++];
}
/*
@@ -283,3 +271,18 @@ anl_get_next_S(double t, int n, double *stateptr)
*stateptr = oldrs.W;
return result;
}
+
+static int
+compare_sampled_blocks(const void *a, const void *b)
+{
+ SampledBlock sa = (SampledBlock)a;
+ SampledBlock sb = (SampledBlock)b;
+
+ if (sa->blockno < sb->blockno)
+ return -1;
+
+ if (sa->blockno > sb->blockno)
+ return 1;
+
+ return 0;
+}
diff --git a/src/include/utils/sampling.h b/src/include/utils/sampling.h
index 1653ed0..9c03190 100644
--- a/src/include/utils/sampling.h
+++ b/src/include/utils/sampling.h
@@ -24,15 +24,29 @@ extern void sampler_random_init_state(long seed,
extern double sampler_random_fract(SamplerRandomState randstate);
/* Block sampling methods */
+typedef struct
+{
+ BlockNumber blockno; /* number of sampled block */
+ int ntarget; /* tuples to sample (in this round) */
+ int nsampled; /* number of sampled tuples */
+ int ntuples; /* number of tuples on block */
+} SampledBlockData;
+
+typedef SampledBlockData* SampledBlock;
/* Data structure for Algorithm S from Knuth 3.4.2 */
typedef struct
{
- BlockNumber N; /* number of blocks, known in advance */
- int n; /* desired sample size */
- BlockNumber t; /* current block number */
- int m; /* blocks selected so far */
- SamplerRandomState randstate; /* random generator state */
+ BlockNumber N; /* number of blocks, known in advance */
+ int n; /* desired sample size (tuples) */
+
+ /* block sample */
+ SampledBlock blocks; /* preallocated to have 'n' blocks */
+ int nblocks; /* number of valid blocks */
+ int nsampled; /* sampled tuples */
+ int cblock; /* current block (index into blocks) */
+
+ SamplerRandomState randstate; /* random generator state */
} BlockSamplerData;
typedef BlockSamplerData *BlockSampler;
@@ -40,14 +54,13 @@ typedef BlockSamplerData *BlockSampler;
extern void BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
int samplesize, long randseed);
extern bool BlockSampler_HasMore(BlockSampler bs);
-extern BlockNumber BlockSampler_Next(BlockSampler bs);
+extern SampledBlock BlockSampler_Next(BlockSampler bs);
/* Reservoir sampling methods */
-
typedef struct
{
- double W;
- SamplerRandomState randstate; /* random generator state */
+ double W;
+ SamplerRandomState randstate; /* random generator state */
} ReservoirStateData;
typedef ReservoirStateData *ReservoirState;