[PATCH] Never convert n_distinct < 2 values to a ratio when computing stats

Started by Dan McGee, almost 14 years ago, 3 messages
#1 Dan McGee
dan@archlinux.org

This is a bit of a corner case in all honesty, but if you have a short
table (under 20 rows), the 10% heuristic that decides whether distinct
values scale with the row count will result in rather odd
values for stadistinct in pg_statistic, such as '-0.2' or '-0.666667',
rather than the expected '2'. Additionally, this can cause only one of
{t, f} to appear in the most common values array.

Does this actually affect query planning in any way? Probably not, but
it is extremely odd to look at pg_stats for these columns, and the
solution seems easy.
---

The only other minor changes included here make it clear when we are comparing
float values, by using 0.0 instead of 0.

Example stats output from the database I noticed this on:

archweb=# SELECT c.relname, a.attname, pg_stat_get_live_tuples(c.oid) AS n_live_tup, stadistinct, stanullfrac, stawidth, stavalues1, stanumbers1 FROM pg_statistic s JOIN pg_class c ON c.oid = s.starelid JOIN pg_attribute a ON c.oid = a.attrelid AND a.attnum = s.staattnum LEFT JOIN pg_namespace n ON n.oid = c.relnamespace JOIN pg_type t ON t.oid = a.atttypid WHERE NOT a.attisdropped AND nspname = 'public' AND t.typname = 'bool' ORDER BY stadistinct, n_live_tup;
            relname            |    attname    | n_live_tup | stadistinct | stanullfrac | stawidth | stavalues1 |      stanumbers1
-------------------------------+---------------+------------+-------------+-------------+----------+------------+-----------------------
 mirrors_mirrorprotocol        | is_download   |          3 |   -0.666667 |           0 |        1 | {t}        | {0.666667}
 arches                        | agnostic      |          3 |   -0.666667 |           0 |        1 | {f}        | {0.666667}
 repos                         | staging       |         10 |        -0.2 |           0 |        1 | {f,t}      | {0.7,0.3}
 repos                         | testing       |         10 |        -0.2 |           0 |        1 | {f,t}      | {0.7,0.3}
 devel_pgpsignature            | valid         |        264 |           1 |           0 |        1 | {t}        | {1}
 packages_flagrequest          | is_spam       |        415 |           1 |           0 |        1 | {f}        | {1}
 donors                        | visible       |        716 |           1 |           0 |        1 | {t}        | {1}
 auth_user                     | is_superuser  |         95 |           2 |           0 |        1 | {f,t}      | {0.957895,0.0421053}
 user_profiles                 | notify        |         95 |           2 |           0 |        1 | {t,f}      | {0.957895,0.0421053}
 auth_user                     | is_active     |         95 |           2 |           0 |        1 | {t,f}      | {0.621053,0.378947}
 auth_user                     | is_staff      |         95 |           2 |           0 |        1 | {f,t}      | {0.873684,0.126316}
 releng_iso                    | active        |        158 |           2 |           0 |        1 | {f,t}      | {0.893333,0.106667}
 mirrors_mirror                | isos          |        180 |           2 |           0 |        1 | {t,f}      | {0.972678,0.0273224}
 mirrors_mirror                | active        |        180 |           2 |           0 |        1 | {t,f}      | {0.672131,0.327869}
 mirrors_mirror                | public        |        180 |           2 |           0 |        1 | {t,f}      | {0.978142,0.0218579}
 mirrors_mirrorurl             | has_ipv6      |        379 |           2 |           0 |        1 | {f,t}      | {0.709763,0.290237}
 mirrors_mirrorurl             | has_ipv4      |        379 |           2 |           0 |        1 | {t}        | {0.997361}
 packages_flagrequest          | is_legitimate |        415 |           2 |           0 |        1 | {t,f}      | {0.992754,0.00724638}
 packages_signoffspecification | enabled       |       1130 |           2 |           0 |        1 | {t,f}      | {0.977578,0.0224215}
 packages_signoffspecification | known_bad     |       1130 |           2 |           0 |        1 | {f,t}      | {0.993722,0.00627803}
 mirrors_mirrorlog             | is_success    |      12715 |           2 |           0 |        1 | {t,f}      | {0.953345,0.0466552}
 package_depends               | optional      |      28592 |           2 |           0 |        1 | {f,t}      | {0.880322,0.119678}
 package_files                 | is_directory  |     225084 |           2 |           0 |        1 | {f,t}      | {0.829933,0.170067}
(23 rows)
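
For anyone who wants to see this locally, something along these lines should
reproduce it with default settings (a rough sketch; the table and column names
are made up for illustration, not taken from the database above):

CREATE TABLE tiny_flags (active boolean);
INSERT INTO tiny_flags VALUES (true), (true), (false);
ANALYZE tiny_flags;
-- With 3 rows and 2 distinct values, 2 > 0.1 * 3, so the current code
-- should store stadistinct as a negative ratio (-0.666667) rather than 2,
-- and 'false' (seen only once in the sample) should not make it into the
-- most common values list.
SELECT attname, n_distinct, most_common_vals
FROM pg_stats WHERE tablename = 'tiny_flags';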

src/backend/commands/analyze.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9cd6e67..995ed9d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2110,7 +2110,7 @@ compute_minimal_stats(VacAttrStatsP stats,
 		 * least 2 instances in the sample.
 		 */
 		if (track_cnt < track_max && toowide_cnt == 0 &&
-			stats->stadistinct > 0 &&
+			stats->stadistinct > 0.0 &&
 			track_cnt <= num_mcv)
 		{
 			/* Track list includes all values seen, and all will fit */
@@ -2122,7 +2122,7 @@ compute_minimal_stats(VacAttrStatsP stats,
 			double		avgcount,
 						mincount;

-			if (ndistinct < 0)
+			if (ndistinct < 0.0)
 				ndistinct = -ndistinct * totalrows;
 			/* estimate # of occurrences in sample of a typical value */
 			avgcount = (double) samplerows / ndistinct;
@@ -2434,12 +2434,12 @@ compute_scalar_stats(VacAttrStatsP stats,
 		}

 		/*
-		 * If we estimated the number of distinct values at more than 10% of
-		 * the total row count (a very arbitrary limit), then assume that
-		 * stadistinct should scale with the row count rather than be a fixed
-		 * value.
+		 * If we estimated the number of distinct values at more than 2 total
+		 * values (a boolean) and more than 10% of the total row count (a very
+		 * arbitrary limit), then assume that stadistinct should scale with
+		 * the row count rather than be a fixed value.
 		 */
-		if (stats->stadistinct > 0.1 * totalrows)
+		if (stats->stadistinct > 2.0 && stats->stadistinct > 0.1 * totalrows)
 			stats->stadistinct = -(stats->stadistinct / totalrows);

 		/*
@@ -2457,7 +2457,7 @@ compute_scalar_stats(VacAttrStatsP stats,
 		 * but we prefer to treat such values as MCVs if at all possible.)
 		 */
 		if (track_cnt == ndistinct && toowide_cnt == 0 &&
-			stats->stadistinct > 0 &&
+			stats->stadistinct > 0.0 &&
 			track_cnt <= num_mcv)
 		{
 			/* Track list includes all values seen, and all will fit */
@@ -2470,7 +2470,7 @@ compute_scalar_stats(VacAttrStatsP stats,
 						mincount,
 						maxmincount;
 
-			if (ndistinct < 0)
+			if (ndistinct < 0.0)
 				ndistinct = -ndistinct * totalrows;
 			/* estimate # of occurrences in sample of a typical value */
 			avgcount = (double) samplerows / ndistinct;
--
1.7.9.4

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Dan McGee (#1)
Re: [PATCH] Never convert n_distinct < 2 values to a ratio when computing stats

On Sat, Mar 24, 2012 at 12:17 AM, Dan McGee <dan@archlinux.org> wrote:

> This is a bit of a corner case in all honesty, but if you have a short
> table (under 20 rows), the 10% heuristic that decides whether distinct
> values scale with the row count will result in rather odd
> values for stadistinct in pg_statistic, such as '-0.2' or '-0.666667',
> rather than the expected '2'. Additionally, this can cause only one of
> {t, f} to appear in the most common values array.
>
> Does this actually affect query planning in any way? Probably not, but
> it is extremely odd to look at pg_stats for these columns, and the
> solution seems easy.

But the stats aren't there to be looked at; they're there to guide query
planning. If at execution time there are 100 rows in the table,
should we still assume that there are only 2 distinct values in the
table, or that it's gone up to about 50 distinct values? It's hard to
say, but there's no apparent reason to think that the number of
distinct values will scale up for a large table but not a small table.
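
To put rough numbers on that trade-off (assuming the planner's usual treatment
of a negative stadistinct, where the stored ratio is multiplied by the table's
current row count): a repos.staging-style column stored as -0.2 would be
estimated at 0.2 * 100 = 20 distinct values once the table reaches 100 rows,
and an is_download-style column stored as -0.666667 at about 67, whereas a
stored value of 2 keeps the estimate pinned at 2 no matter how large the table
gets.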

The bit about maybe not getting both t and f as MCVs on a Boolean does
seem a little worrying, but I'm not sure whether it actually affects
query planning in a materially negative way. Can you demonstrate a
case where it matters?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#2)
Re: [PATCH] Never convert n_distinct < 2 values to a ratio when computing stats

Robert Haas <robertmhaas@gmail.com> writes:

> The bit about maybe not getting both t and f as MCVs on a Boolean does
> seem a little worrying, but I'm not sure whether it actually affects
> query planning in a materially negative way. Can you demonstrate a
> case where it matters?

If we were trying to force that to happen it would be wrong anyway.
Consider a column that contains *only* "t", or at least has so few
"f"'s that "f" appears never or only once in the selected sample.
(IIRC there is a clamp that prevents selecting anything as an MCV
unless it appears at least twice in the sample.)
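
For a concrete (entirely hypothetical) sketch of such a column, assuming the
default statistics target:

CREATE TABLE mostly_true (flag boolean);
INSERT INTO mostly_true SELECT true FROM generate_series(1, 1000000);
INSERT INTO mostly_true VALUES (false), (false), (false);
ANALYZE mostly_true;
-- The default 30000-row sample will most likely not see 'false' at all,
-- and almost certainly not twice, so 'false' cannot be an MCV here; and
-- forcing it into the list anyway would misrepresent the column.
SELECT most_common_vals, most_common_freqs
FROM pg_stats WHERE tablename = 'mostly_true';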

Like Robert, I'm not convinced whether or not this is a reasonable
change, but arguing for it on the basis of boolean columns doesn't
seem very sound.

regards, tom lane