BUG #7853: Incorrect statistics in table with many dead rows.

Started by James Skaggs, about 13 years ago, 11 messages, pgsql-bugs
#1 James Skaggs
jimbob@seagate.com

The following bug has been logged on the website:

Bug reference: 7853
Logged by: James Skaggs
Email address: jimbob@seagate.com
PostgreSQL version: 8.4.14
Operating system: RHEL6
Description:

After "analyze verbose", the table shows 158 million rows. A select count(1)
yields 13.8 million rows.

INFO: analyzing "public.stream_file"
INFO: "stream_file": scanned 30000 of 2123642 pages, containing 184517 live
rows and 2115512 dead rows; 30000 rows in sample, 158702435 estimated total
rows
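
For reference, the two numbers being compared came from something like the
following (the reltuples lookup is shown only for illustration; the 158.7
million figure is the "estimated total rows" from the ANALYZE output):

    select count(1) from stream_file;                              -- ~13.8 million
    select reltuples from pg_class where relname = 'stream_file';  -- ~158.7 million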

Here are the table statistics.

Sequential Scans 81853
Sequential Tuples Read 578848425234
Index Scans 1976513672
Index Tuples Fetched 2183339860
Tuples Inserted 65122575
Tuples Updated 308883671
Tuples Deleted 51238760
Tuples HOT Updated 2242897
Live Tuples 163981972
Dead Tuples 7056493
Heap Blocks Read 43483331819
Heap Blocks Hit 43121456487
Index Blocks Read 134539277
Index Blocks Hit 13606451182
Toast Blocks Read
Toast Blocks Hit
Toast Index Blocks Read
Toast Index Blocks Hit
Last Vacuum 2013-02-04 10:06:44.058743-07
Last Autovacuum 2013-02-04 16:11:34.289823-07
Last Analyze 2013-02-04 14:22:27.848547-07
Last Autoanalyze 2013-02-01 17:37:29.855553-07
Table Size 17 GB
Toast Table Size none
Indexes Size 34 GB
Query returned successfully with no result in 4094 ms.

Bad statistics led to a bad plan. We will cluster the table today to see if
that fixes it, but I think the statistics should be correct regardless of the
state of a table. BTW, the Coverity product requires 8.x, and we'll upgrade to
8.4.15 today. I didn't see anything about better statistics in the 8.4.15
changelog.


#2 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: James Skaggs (#1)
Re: BUG #7853: Incorrect statistics in table with many dead rows.

"jimbob@seagate.com" <jimbob@seagate.com> wrote:

INFO:  analyzing "public.stream_file"
INFO:  "stream_file": scanned 30000 of 2123642 pages, containing
184517 live rows and 2115512 dead rows; 30000 rows in sample,
158702435 estimated total rows

184517 live rows in 30000 randomly sampled pages out of 2123642
total pages means that the statistics predict that a select
count(*) will find about 13 million live rows to count.
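
Spelled out, the extrapolation is just (runnable anywhere, numbers taken
straight from the ANALYZE output above):

    -- live rows per sampled page, scaled up to the whole table
    select 184517.0 / 30000 * 2123642;   -- roughly 13.06 million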

After "analyze verbose", the table shows 158 million rows. A
select count(1) yields 13.8 million rows.

OK, the estimate was 13 million and there were actually 13.8
million, but it is a random sample used to generate estimates.
That seems worse than average, but close enough to be useful.
The 158.7 million total rows includes dead rows, which must be
visited to determine visibility, but will not be counted because
they are not visible to the counting transaction.  Having over 90%
of your table filled with dead rows is a bad situation to be in,
from a performance standpoint.  You should use aggressive
maintenance (like VACUUM FULL or CLUSTER) to fix the existing
extreme bloat, and then review your autovacuum settings and overall
vacuum regimen to prevent future bloat.
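
As a sketch only (the index name and the thresholds here are placeholders,
not a recommendation for your exact workload), that could look like:

    -- one-time recovery: rewrites the table and takes an exclusive lock
    CLUSTER stream_file USING stream_file_pkey;
    ANALYZE stream_file;

    -- then make autovacuum more aggressive on this table so the bloat
    -- cannot build up again (per-table settings are available in 8.4)
    ALTER TABLE stream_file SET (
        autovacuum_vacuum_scale_factor = 0.02,
        autovacuum_analyze_scale_factor = 0.01
    );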

This does not look like a bug from the information provided so far.

-Kevin


#3 James R Skaggs
james.r.skaggs@seagate.com
In reply to: Kevin Grittner (#2)
Re: BUG #7853: Incorrect statistics in table with many dead rows.

Kevin -

Agreed, we shouldn't have so many dead rows. Our autovacuum is enabled but
has default parameters. We are clustering today. This, of course, means
downtime and inconvenience for the users.

Here is the troublesome query:

select
    sum(case when t1.cdlc < 0 then 0 else t1.cdlc end) as code_line_cnt,
    sum(case when t1.cmlc < 0 then 0 else t1.cmlc end) as comment_line_cnt,
    sum(case when t1.bllc < 0 then 0 else t1.bllc end) as blank_line_cnt
  from
    ( select
        max(sf.current_code_line_count) as cdlc,
        max(sf.current_comment_line_count) as cmlc,
        max(sf.current_blank_line_count) as bllc
      from
        stream_file sf
      group by sf.file_path_id, sf.current_source_md5
    ) as t1;

And the explain plan. Against an optimized database, both the plan and the
execution time were better.

Aggregate  (cost=32073953.42..32073953.43 rows=1 width=12) (actual time=77394.354..77394.354 rows=1 loops=1)
  ->  GroupAggregate  (cost=29271854.25..31808889.97 rows=15146482 width=53) (actual time=62490.514..77333.105 rows=168591 loops=1)
        ->  Sort  (cost=29271854.25..29650516.30 rows=151464819 width=53) (actual time=62490.492..73098.009 rows=14403547 loops=1)
              Sort Key: sf.file_path_id, sf.current_source_md5
              Sort Method:  external merge  Disk: 942440kB
              ->  Seq Scan on stream_file sf  (cost=0.00..3514999.19 rows=151464819 width=53) (actual time=0.327..20620.230 rows=14403547 loops=1)
Total runtime: 77836.949 ms

Thanks and Regards,

James Skaggs
IT/FIS Longmont
SeaTel: 8 684 1048
General: +1 612 367 6224


#4 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: James R Skaggs (#3)
Re: BUG #7853: Incorrect statistics in table with many dead rows.

James R Skaggs <james.r.skaggs@seagate.com> wrote:

Agreed, we shouldn't have so many dead rows.  Our autovacuum is
set on but has default parameters.   We are clustering today.
This, of course, means downtime and inconvenience to the users.

Right, which is why it's important to figure out why the bloat
happened.  Sometimes it is unavoidable, like when you delete 90% of
the rows in your table, or when a long-lived "idle in transaction"
connection prevents autovacuum from doing its work normally.  To
prevent further downtime it is important to figure out what happened
and make appropriate changes to your monitoring or vacuuming.
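
One way to look for such sessions (using the 8.4 pg_stat_activity column
names) is to check for old transaction start times:

    -- sessions holding a transaction open, oldest first
    SELECT procpid, usename, xact_start, current_query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_start
    LIMIT 10;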

Here is the troublesome query:

select
    sum(case when t1.cdlc < 0 then 0 else t1.cdlc end) as code_line_cnt,
    sum(case when t1.cmlc < 0 then 0 else t1.cmlc end) as comment_line_cnt,
    sum(case when t1.bllc < 0 then 0 else t1.bllc end) as blank_line_cnt
  from
    ( select
        max(sf.current_code_line_count) as cdlc,
        max(sf.current_comment_line_count) as cmlc,
        max(sf.current_blank_line_count) as bllc
      from
        stream_file sf
      group by sf.file_path_id, sf.current_source_md5
    ) as t1;

Well, I don't see that the planner has a lot of choice there
besides whether to use a sort or a hash to do the inner
aggregation.  Are you saying that prior to the bloat it used a hash
aggregation, and that was faster?  And that you feel that it should
be using that even with the bloat?  That the dead rows seem to be
getting included in the statistics, driving to the slower plan, and
you feel they should be omitted?

Note that I'm not arguing one way or another on these points at the
moment; I'm just trying to understand your point clearly.

-Kevin


#5 James R Skaggs
james.r.skaggs@seagate.com
In reply to: Kevin Grittner (#4)
Re: BUG #7853: Incorrect statistics in table with many dead rows.

Kevin -

Here is the plan after we clustered on the PK index. It was probably that
way before the bloat, but I don't know.

Aggregate  (cost=514557.38..514557.40 rows=1 width=12) (actual time=8823.973..8823.974 rows=1 loops=1)
  ->  HashAggregate  (cost=471029.03..492793.20 rows=1243667 width=53) (actual time=8725.324..8789.926 rows=159369 loops=1)
        ->  Seq Scan on stream_file sf  (cost=0.00..315570.68 rows=12436668 width=53) (actual time=0.013..1152.600 rows=12436753 loops=1)
Total runtime: 8838.395 ms

I guess the statistics should reflect the true costs. So if the "bloated"
statistics are representative of the true cost, I'm all for them.

We are clustering the database as we speak, so I can't force a hash
aggregate to see if that is the right choice or not.
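
(For later, once the table is available again: one rough way to compare the
two plans without touching the statistics is to discourage the sort path for
a single session, e.g.

    set enable_sort = off;   -- penalizes the sort + GroupAggregate plan
    explain analyze <the query above>;
    reset enable_sort;

The planner will still only choose HashAggregate if it believes the hash
table fits in work_mem.)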

Thanks and Regards,

James Skaggs
IT/FIS Longmont
SeaTel: 8 684 1048
General: +1 612 367 6224


#6 Jeff Janes
jeff.janes@gmail.com
In reply to: Kevin Grittner (#2)
Re: BUG #7853: Incorrect statistics in table with many dead rows.

On Tue, Feb 5, 2013 at 2:00 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

"jimbob@seagate.com" <jimbob@seagate.com> wrote:

INFO: analyzing "public.stream_file"
INFO: "stream_file": scanned 30000 of 2123642 pages, containing
184517 live rows and 2115512 dead rows; 30000 rows in sample,
158702435 estimated total rows

184517 live rows in 30000 randomly sampled pages out of 2123642
total pages, means that the statistics predict that a select
count(*) will find about 13 million live rows to count.

After "analyze verbose", the table shows 158 million rows. A
select count(1) yields 13.8 million rows.

OK, the estimate was 13 million and there were actually 13.8
million, but it is a random sample used to generate estimates.
That seems worse than average, but close enough to be useful.
The 158.7 million total rows includes dead rows, which must be
visited to determine visibility, but will not be counted because
they are not visible to the counting transaction.

To clarify here, the 158.7 million estimate does not *intentionally*
include dead rows. As you say, the ANALYZE did get a very good
instantaneous estimate of the number of live rows. However, ANALYZE
doesn't over-write the old estimate, it averages its estimate into the
old one. After the table shape changes dramatically, the ANALYZE
needs to be run repeatedly before the estimate will converge to the
new reality. (Of course a cluster or vacuum full will blow away the
old statistics, so the next analyze after that will solely determine
the new statistics.)
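
A simple way to watch that convergence (table name taken from the report;
the exact blending behavior may differ between versions) is to repeat

    ANALYZE stream_file;
    SELECT reltuples, relpages FROM pg_class WHERE relname = 'stream_file';

and see reltuples move a little toward the true live-row count on each pass.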

I agree, not a bug.

Cheers,

Jeff


#7 James R Skaggs
james.r.skaggs@seagate.com
In reply to: Jeff Janes (#6)
Re: BUG #7853: Incorrect statistics in table with many dead rows.

Okay, I have some more info.

Some background: this one table gets so many changes that I CLUSTER it each
night. However, after I do this the statistics still appear to be incorrect,
even after a "select pg_stat_reset();" followed by three ANALYZE runs with
default_statistics_target set to 1, 10, and 100:

select relname, n_live_tup, n_dead_tup, n_tup_ins, n_tup_del,
n_tup_hot_upd
from pg_stat_all_tables
('stream_file', 109061143L, 103913868L, 0L, 14201L, 0L)

Is it possible that there are still dead tuples after a CLUSTER?

EXPLAIN ANALYZE thinks we have 112M rows (in fact there are about 10M), so it
has to do an on-disk sort even with work_mem=1GB.

"Aggregate (cost=23622814.39..23622814.40 rows=1 width=12) (actual
time=57512.462..57512.462 rows=1 loops=1)"
" -> GroupAggregate (cost=21536989.70..23425506.64 rows=11274728
width=53) (actual time=50825.396..57457.421 rows=202394 loops=1)"
" Output: max(sf.current_code_line_count),
max(sf.current_comment_line_count), max(sf.current_blank_line_count)"
" -> Sort (cost=21536989.70..21818857.90 rows=112747280 width=53)
(actual time=50825.386..55004.916 rows=7683730 loops=1)"
" Output: sf.current_code_line_count,
sf.current_comment_line_count, sf.current_blank_line_count,
sf.file_path_id, sf.current_source_md5"
" Sort Key: sf.file_path_id, sf.current_source_md5"
" Sort Method: external merge Disk: 502288kB"
" -> Seq Scan on stream_file sf (cost=0.00..2604208.80
rows=112747280 width=53) (actual time=0.033..27922.485 rows=7683730
loops=1)"
" Output: sf.current_code_line_count,
sf.current_comment_line_count, sf.current_blank_line_count,
sf.file_path_id, sf.current_source_md5"
"Total runtime: 57693.835 ms"

Now we do many, many ANALYZE VERBOSE, and converge on the correct value,
which is known to be about 10M rows.

INFO: analyzing "public.stream_file"
INFO: "stream_file": scanned 30000 of 1476736 pages, containing 158846
live rows and 2175512 dead rows; 30000 rows in sample, 112747282 estimated
total rows
Query returned successfully with no result in 9172 ms.

... after roughly 200 more ANALYZE VERBOSE runs ...

INFO: analyzing "public.stream_file"
INFO: "stream_file": scanned 30000 of 1480611 pages, containing 158776
live rows and 2170410 dead rows; 30000 rows in sample, 9769236 estimated
total rows
Query returned successfully with no result in 441 ms.

Now the optimizer thinks we can do a hash aggregate in memory, and we get
better performance.

"Aggregate (cost=1734729.12..1734729.14 rows=1 width=12) (actual
time=33816.049..33816.049 rows=1 loops=1)"
" -> HashAggregate (cost=1700534.50..1717631.81 rows=976989 width=53)
(actual time=33535.083..33712.787 rows=202404 loops=1)"
" Output: max(sf.current_code_line_count),
max(sf.current_comment_line_count), max(sf.current_blank_line_count)"
" -> Seq Scan on stream_file sf (cost=0.00..1578410.89
rows=9769889 width=53) (actual time=392.435..26278.143 rows=7710223
loops=1)"
" Output: sf.id, sf.current_blank_line_count,
sf.current_code_line_count, sf.current_comment_line_count,
sf.current_source_md5, sf.component_id, sf.current_file_instance_id,
sf.current_file_state_id, sf.file_path_id, sf.stream_element_id"
"Total runtime: 33822.707 ms"

But later in the day, the statistics revert back to the 100M number! Any
ideas? Is there some kind of cache that is remembering the old statistics?

Thanks and Regards,

James Skaggs
IT/FIS Longmont
SeaTel: 8 684 1048
General: +1 612 367 6224


#8 Jeff Janes
jeff.janes@gmail.com
In reply to: Jeff Janes (#6)
Re: BUG #7853: Incorrect statistics in table with many dead rows.

On Sun, Feb 10, 2013 at 12:10 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Tue, Feb 5, 2013 at 2:00 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

OK, the estimate was 13 million and there were actually 13.8
million, but it is a random sample used to generate estimates.
That seems worse than average, but close enough to be useful.
The 158.7 million total rows includes dead rows, which must be
visited to determine visibility, but will not be counted because
they are not visible to the counting transaction.

To clarify here, the 158.7 million estimate does not *intentionally*
include dead rows. As you say, the ANALYZE did get a very good
instantaneous estimate of the number of live rows. However, ANALYZE
doesn't over-write the old estimate, it averages its estimate into the
old one. After the table shape changes dramatically, the ANALYZE
needs to be run repeatedly before the estimate will converge to the
new reality. (Of course a cluster or vacuum full will blow away the
old statistics, so the next analyze after that will solely determine
the new statistics.)

I was incredibly wrong here. The cluster or vacuum do not blow away the
stats so that the next analyze gets to solely determine them. Rather, they
impose their own idea of live tuples, and then analyze can only update that
incrementally as it averages itself into the old value.

Worse, the two methods have very different ideas of what constitutes a live
tuple. ANALYZE thinks tuples that are visible to a current/recent snapshot
are live. While CLUSTER and VACUUM think tuples that are possibly visible
to anyone are live.

I would say that this is a bug, or at least approaching to being one. It
is not obvious whether reltuples and n_live_tuples should count the
"recently dead", but it should either be one way or the other and not an
unholy mixture of the two.

As it is now, a cluster or simple vacuum will snap n_live_tuples so that it
counts recently dead, then analyze will slowly converge it to exclude
recently dead, and then the next vacuum will snap it back again.

Of course, all of this only comes into play in the presence of very
long-lived transactions that prevent tuples from going away. Otherwise the
number recently dead is small enough not to matter.

create table foo as
  select (random()*1000000)::integer as val
  from generate_series(1,50000000);

In a different session, open a transaction and leave it open:

begin; create temp table adlfkj (x serial);

Back in the main session:

delete from foo where val > 100;

Run this repeatedly and watch the rows estimate slowly decay:

ANALYZE verbose foo; explain select count(*) from foo;

Then run this and watch it instantly spring back:

VACUUM VERBOSE foo; explain select count(*) from foo;

Cheers,

Jeff

#9 jimbob
skaggs.james@gmail.com
In reply to: Jeff Janes (#8)
Re: BUG #7853: Incorrect statistics in table with many dead rows.

So, I have some observations. Is this what you are seeing as well?

So when we CLUSTER a heavily-updated table:

CLUSTER does appear to reset *n_dead_tup*, *n_tup_ins*, *n_tup_del*,
*n_tup_hot_upd*, but NOT *n_live_tup*

pg_stat_reset() truly clears out all the statistics counters. I tried this
because *n_live_tup* is not correct.

A subsequent ANALYZE will update *n_dead_tup* and *n_live_tup* to some
values that could not possibly be based on the newly CLUSTERed table

So, how do we get correct statistics for a heavily updated table? In my
experience we only need the estimate to be in the right order of magnitude,
but we're not even getting that.

BTW, I've upgraded to 8.4.15.


#10 jimbob
skaggs.james@gmail.com
In reply to: jimbob (#9)
Re: BUG #7853: Incorrect statistics in table with many dead rows.

BTW "jimbob" and "James.R.Skaggs" are the same person. I just didn't want to
use my "work" email for this....


#11 Jeff Janes
jeff.janes@gmail.com
In reply to: James R Skaggs (#7)
Re: BUG #7853: Incorrect statistics in table with many dead rows.

On Fri, Feb 22, 2013 at 3:41 PM, James R Skaggs
<james.r.skaggs@seagate.com> wrote:

Okay, I have some more info.

Some background info. This one table gets so many changes, I CLUSTER it
each night. However, after I do this. The statistics still appear to be
incorrect. Even after I do a "select pg_stat_reset();" Followed by 3
ANALYZE at default_statistics_target as 1, 10, and 100

select relname, n_live_tup, n_dead_tup, n_tup_ins, n_tup_del,
n_tup_hot_upd
from pg_stat_all_tables
('stream_file', 109061143L, 103913868L, 0L, 14201L, 0L)

Is it possible that there are still dead tuples after a CLUSTER?

Yes. A cluster must bring along any tuples which are possibly visible to
any open transaction. Your root problem seems to be that you have
long-open transactions which are preventing vacuum from doing its thing,
which leads you to try clustering, but the long-open transaction prevents
that from doing its thing effectively as well.

Perhaps PG could deal with this situation more gracefully, but
fundamentally you have to figure out why you have these ancient
transactions lying around, and fix them or kill them.
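
If one of those sessions turns out to be abandoned, it can be cancelled or
terminated from SQL (the pid below is a placeholder; pg_terminate_backend
exists as of 8.4):

    -- oldest open transactions first (8.4 column names)
    SELECT procpid, xact_start, current_query
    FROM pg_stat_activity
    ORDER BY xact_start
    LIMIT 5;

    SELECT pg_cancel_backend(12345);     -- cancel the current query
    SELECT pg_terminate_backend(12345);  -- or drop the whole connection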

Cheers,

Jeff