Bogus ANALYZE results for an otherwise-unique column with many nulls

Started by Tom Lane over 9 years ago, 12 messages
#1 Tom Lane
tgl@sss.pgh.pa.us
1 attachment(s)

I looked into the problem described at
/messages/by-id/VisenaEmail.26.df42f82acae38a58.156463942b8@tc7-visena
and I believe I've reproduced it: the requirement is that the inner join
column for the antijoin must contain a lot of NULL values, and what isn't
NULL must be unique or nearly so. If ANALYZE doesn't come across any
duplicated values, it will set the column's stadistinct value to "-1",
which causes the planner to believe that each row of the inner table
produces a unique value, resulting in a bogus answer from
get_variable_numdistinct() and thence a bogus join size estimate.

Here's an example in the regression database, making use of the existing
unique column tenk1.unique1:

regression=# create table manynulls as select case when random() < 0.1 then unique1 else null end as unique1 from tenk1;
SELECT 10000
regression=# analyze manynulls;
ANALYZE
regression=# explain analyze select * from tenk1 t where not exists(select 1 from manynulls m where m.unique1 = t.unique1);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=261.00..756.50 rows=1 width=244) (actual time=4.632..14.729 rows=8973 loops=1)
   Hash Cond: (t.unique1 = m.unique1)
   ->  Seq Scan on tenk1 t  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.015..2.683 rows=10000 loops=1)
   ->  Hash  (cost=136.00..136.00 rows=10000 width=4) (actual time=4.553..4.553 rows=1027 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 165kB
         ->  Seq Scan on manynulls m  (cost=0.00..136.00 rows=10000 width=4) (actual time=0.019..2.668 rows=10000 loops=1)
 Planning time: 0.808 ms
 Execution time: 15.670 ms
(8 rows)

So the antijoin size estimate is way off, but it's hardly the planner's
fault because the stats are insane:

regression=# select attname,null_frac,n_distinct from pg_stats where tablename = 'manynulls';
 attname | null_frac | n_distinct
---------+-----------+------------
 unique1 |    0.8973 |         -1
(1 row)

With the patch attached below, ANALYZE produces

regression=# analyze manynulls;
ANALYZE
regression=# select attname,null_frac,n_distinct from pg_stats where tablename = 'manynulls';
 attname | null_frac | n_distinct
---------+-----------+------------
 unique1 |    0.8973 |    -0.1027
(1 row)

and now the join size estimate is dead on:

regression=# explain analyze select * from tenk1 t where not exists(select 1 from manynulls m where m.unique1 = t.unique1);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=261.00..847.69 rows=8973 width=244) (actual time=4.501..13.888 rows=8973 loops=1)
   Hash Cond: (t.unique1 = m.unique1)
   ->  Seq Scan on tenk1 t  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.031..4.959 rows=10000 loops=1)
   ->  Hash  (cost=136.00..136.00 rows=10000 width=4) (actual time=4.404..4.404 rows=1027 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 165kB
         ->  Seq Scan on manynulls m  (cost=0.00..136.00 rows=10000 width=4) (actual time=0.034..2.576 rows=10000 loops=1)
 Planning time: 1.388 ms
 Execution time: 14.542 ms
(8 rows)

What I did in the patch is to scale the formerly fixed "-1.0" stadistinct
estimate to discount the fraction of nulls we found. (Worked through on
the example above: null_frac = 0.8973, so stadistinct = -(1 - 0.8973) =
-0.1027, and get_variable_numdistinct() then computes 0.1027 * 10000 =
1027 distinct values, matching the 1027 non-null rows the Hash node
actually saw.) An alternative possibility might be to decree that a
fractional stadistinct considers only non-nulls, but then all the other
paths through ANALYZE would be wrong. The spec for it in pg_statistic.h
doesn't suggest any such interpretation, either.

Looking around, there are a couple of places outside commands/analyze.c
that are making the same mistake, so this patch isn't complete, but it
illustrates what needs to be done.

This is a bit reminiscent of the nulls-accounting problem we fixed in
commit be4b4dc75, though that tended to result in underestimates not
overestimates of the number of distinct values. We didn't back-patch
that fix, so probably we shouldn't back-patch this either. On the other
hand, it is far more open-and-shut that this is wrong. Thoughts?

regards, tom lane

Attachments:

another-analyze-with-nulls-fix.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5fcedd7..9ac7122 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
*************** compute_distinct_stats(VacAttrStatsP sta
*** 2049,2056 ****
  
  		if (nmultiple == 0)
  		{
! 			/* If we found no repeated values, assume it's a unique column */
! 			stats->stadistinct = -1.0;
  		}
  		else if (track_cnt < track_max && toowide_cnt == 0 &&
  				 nmultiple == track_cnt)
--- 2049,2059 ----
  
  		if (nmultiple == 0)
  		{
! 			/*
! 			 * If we found no repeated non-null values, assume it's a unique
! 			 * column; but be sure to discount for any nulls we found.
! 			 */
! 			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  		}
  		else if (track_cnt < track_max && toowide_cnt == 0 &&
  				 nmultiple == track_cnt)
*************** compute_scalar_stats(VacAttrStatsP stats
*** 2426,2433 ****
  
  		if (nmultiple == 0)
  		{
! 			/* If we found no repeated values, assume it's a unique column */
! 			stats->stadistinct = -1.0;
  		}
  		else if (toowide_cnt == 0 && nmultiple == ndistinct)
  		{
--- 2429,2439 ----
  
  		if (nmultiple == 0)
  		{
! 			/*
! 			 * If we found no repeated non-null values, assume it's a unique
! 			 * column; but be sure to discount for any nulls we found.
! 			 */
! 			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  		}
  		else if (toowide_cnt == 0 && nmultiple == ndistinct)
  		{
*************** compute_scalar_stats(VacAttrStatsP stats
*** 2753,2759 ****
  		else
  			stats->stawidth = stats->attrtype->typlen;
  		/* Assume all too-wide values are distinct, so it's a unique column */
! 		stats->stadistinct = -1.0;
  	}
  	else if (null_cnt > 0)
  	{
--- 2759,2765 ----
  		else
  			stats->stawidth = stats->attrtype->typlen;
  		/* Assume all too-wide values are distinct, so it's a unique column */
! 		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  	}
  	else if (null_cnt > 0)
  	{
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#1)
1 attachment(s)
Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

I wrote:

Looking around, there are a couple of places outside commands/analyze.c
that are making the same mistake, so this patch isn't complete, but it
illustrates what needs to be done.

Here's a more complete patch.

regards, tom lane

Attachments:

analyze-unique-with-nulls-2.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5fcedd7..9ac7122 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
*************** compute_distinct_stats(VacAttrStatsP sta
*** 2049,2056 ****
  
  		if (nmultiple == 0)
  		{
! 			/* If we found no repeated values, assume it's a unique column */
! 			stats->stadistinct = -1.0;
  		}
  		else if (track_cnt < track_max && toowide_cnt == 0 &&
  				 nmultiple == track_cnt)
--- 2049,2059 ----
  
  		if (nmultiple == 0)
  		{
! 			/*
! 			 * If we found no repeated non-null values, assume it's a unique
! 			 * column; but be sure to discount for any nulls we found.
! 			 */
! 			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  		}
  		else if (track_cnt < track_max && toowide_cnt == 0 &&
  				 nmultiple == track_cnt)
*************** compute_scalar_stats(VacAttrStatsP stats
*** 2426,2433 ****
  
  		if (nmultiple == 0)
  		{
! 			/* If we found no repeated values, assume it's a unique column */
! 			stats->stadistinct = -1.0;
  		}
  		else if (toowide_cnt == 0 && nmultiple == ndistinct)
  		{
--- 2429,2439 ----
  
  		if (nmultiple == 0)
  		{
! 			/*
! 			 * If we found no repeated non-null values, assume it's a unique
! 			 * column; but be sure to discount for any nulls we found.
! 			 */
! 			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  		}
  		else if (toowide_cnt == 0 && nmultiple == ndistinct)
  		{
*************** compute_scalar_stats(VacAttrStatsP stats
*** 2753,2759 ****
  		else
  			stats->stawidth = stats->attrtype->typlen;
  		/* Assume all too-wide values are distinct, so it's a unique column */
! 		stats->stadistinct = -1.0;
  	}
  	else if (null_cnt > 0)
  	{
--- 2759,2765 ----
  		else
  			stats->stawidth = stats->attrtype->typlen;
  		/* Assume all too-wide values are distinct, so it's a unique column */
! 		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  	}
  	else if (null_cnt > 0)
  	{
diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c
index 0f851ea..817453c 100644
*** a/src/backend/tsearch/ts_typanalyze.c
--- b/src/backend/tsearch/ts_typanalyze.c
*************** compute_tsvector_stats(VacAttrStats *sta
*** 295,301 ****
  		stats->stawidth = total_width / (double) nonnull_cnt;
  
  		/* Assume it's a unique column (see notes above) */
! 		stats->stadistinct = -1.0;
  
  		/*
  		 * Construct an array of the interesting hashtable items, that is,
--- 295,301 ----
  		stats->stawidth = total_width / (double) nonnull_cnt;
  
  		/* Assume it's a unique column (see notes above) */
! 		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  
  		/*
  		 * Construct an array of the interesting hashtable items, that is,
diff --git a/src/backend/utils/adt/rangetypes_typanalyze.c b/src/backend/utils/adt/rangetypes_typanalyze.c
index fcb71d3..56504fc 100644
*** a/src/backend/utils/adt/rangetypes_typanalyze.c
--- b/src/backend/utils/adt/rangetypes_typanalyze.c
*************** compute_range_stats(VacAttrStats *stats,
*** 203,209 ****
  		/* Do the simple null-frac and width stats */
  		stats->stanullfrac = (double) null_cnt / (double) samplerows;
  		stats->stawidth = total_width / (double) non_null_cnt;
! 		stats->stadistinct = -1.0;
  
  		/* Must copy the target values into anl_context */
  		old_cxt = MemoryContextSwitchTo(stats->anl_context);
--- 203,211 ----
  		/* Do the simple null-frac and width stats */
  		stats->stanullfrac = (double) null_cnt / (double) samplerows;
  		stats->stawidth = total_width / (double) non_null_cnt;
! 
! 		/* Estimate that non-null values are unique */
! 		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  
  		/* Must copy the target values into anl_context */
  		old_cxt = MemoryContextSwitchTo(stats->anl_context);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index cc2a9a1..56943f2 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** double
*** 4738,4743 ****
--- 4738,4744 ----
  get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
  {
  	double		stadistinct;
+ 	double		stanullfrac = 0.0;
  	double		ntuples;
  
  	*isdefault = false;
*************** get_variable_numdistinct(VariableStatDat
*** 4745,4751 ****
  	/*
  	 * Determine the stadistinct value to use.  There are cases where we can
  	 * get an estimate even without a pg_statistic entry, or can get a better
! 	 * value than is in pg_statistic.
  	 */
  	if (HeapTupleIsValid(vardata->statsTuple))
  	{
--- 4746,4753 ----
  	/*
  	 * Determine the stadistinct value to use.  There are cases where we can
  	 * get an estimate even without a pg_statistic entry, or can get a better
! 	 * value than is in pg_statistic.  Grab stanullfrac too if we can find it
! 	 * (otherwise, assume no nulls, for lack of any better idea).
  	 */
  	if (HeapTupleIsValid(vardata->statsTuple))
  	{
*************** get_variable_numdistinct(VariableStatDat
*** 4754,4759 ****
--- 4756,4762 ----
  
  		stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple);
  		stadistinct = stats->stadistinct;
+ 		stanullfrac = stats->stanullfrac;
  	}
  	else if (vardata->vartype == BOOLOID)
  	{
*************** get_variable_numdistinct(VariableStatDat
*** 4777,4783 ****
  			{
  				case ObjectIdAttributeNumber:
  				case SelfItemPointerAttributeNumber:
! 					stadistinct = -1.0; /* unique */
  					break;
  				case TableOidAttributeNumber:
  					stadistinct = 1.0;	/* only 1 value */
--- 4780,4786 ----
  			{
  				case ObjectIdAttributeNumber:
  				case SelfItemPointerAttributeNumber:
! 					stadistinct = -1.0; /* unique (and all non null) */
  					break;
  				case TableOidAttributeNumber:
  					stadistinct = 1.0;	/* only 1 value */
*************** get_variable_numdistinct(VariableStatDat
*** 4799,4808 ****
  	 * If there is a unique index or DISTINCT clause for the variable, assume
  	 * it is unique no matter what pg_statistic says; the statistics could be
  	 * out of date, or we might have found a partial unique index that proves
! 	 * the var is unique for this query.
  	 */
  	if (vardata->isunique)
! 		stadistinct = -1.0;
  
  	/*
  	 * If we had an absolute estimate, use that.
--- 4802,4812 ----
  	 * If there is a unique index or DISTINCT clause for the variable, assume
  	 * it is unique no matter what pg_statistic says; the statistics could be
  	 * out of date, or we might have found a partial unique index that proves
! 	 * the var is unique for this query.  However, we'd better still believe
! 	 * the null-fraction statistic.
  	 */
  	if (vardata->isunique)
! 		stadistinct = -1.0 * (1.0 - stanullfrac);
  
  	/*
  	 * If we had an absolute estimate, use that.
#3 Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Tom Lane (#1)
Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> What I did in the patch is to scale the formerly fixed "-1.0"
Tom> stadistinct estimate to discount the fraction of nulls we found.

This seems quite dubious to me. stadistinct representing only the
non-null values seems to me to be substantially more useful and less
confusing; it should be up to consumers to take stanullfrac into account
(in general they already do) since in many cases we explicitly do _not_
want to count nulls.

--
Andrew (irc:RhodiumToad)


#4 Andreas Joseph Krogh
andreas@visena.com
In reply to: Tom Lane (#2)
Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

On Friday, 5 August 2016 at 01:01:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:

Looking around, there are a couple of places outside commands/analyze.c
that are making the same mistake, so this patch isn't complete, but it
illustrates what needs to be done.

Here's a more complete patch.

regards, tom lane
 
(Sorry for HTML-email, but we live in 2016, and I think it improves
readability)
 
Great!
 
I have tested it (with PG-9.6 from HEAD,
e7caacf733f3ee77a555aa715ab6fbf4434e6b52) and it sure looks like it fixes the
problem for my query.
 
The query:
 
explain analyze
SELECT log.relation_id as company_id
     , sum(log.duration) AS duration
FROM onp_crm_activity_log log
JOIN onp_crm_person logfor
  ON logfor.onp_user_id = log.logged_for AND logfor.is_resource = FALSE
WHERE 1 = 1
  -- Filter out already invoiced
  AND NOT EXISTS(
      SELECT * FROM onp_crm_calendarentry_invoice_membership cem
      JOIN onp_crm_invoice_line il ON cem.invoice_line_id = il.id
      JOIN onp_crm_invoice inv ON il.invoice_id = inv.entity_id
      WHERE cem.calendar_entry_id = log.id
        AND NOT EXISTS(
            SELECT * FROM onp_crm_invoice creditnote
            WHERE il.invoice_id = creditnote.credit_against
        )
  )
GROUP BY log.relation_id;
 
Gave the following explain plan (before the patch), with enable_nestloop=off
(needed to prevent a nested-loop anti-join, which caused this query to execute
in ~26 min.):
 
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=25201.62..25220.06 rows=1475 width=36) (actual time=381.191..381.281 rows=404 loops=1)
   Group Key: log.relation_id
   ->  Hash Anti Join  (cost=4782.34..25148.28 rows=10667 width=12) (actual time=103.683..361.409 rows=148330 loops=1)
         Hash Cond: (log.id = cem.calendar_entry_id)
         ->  Hash Join  (cost=81.46..20312.73 rows=10667 width=20) (actual time=0.100..156.432 rows=380617 loops=1)
               Hash Cond: (log.logged_for = logfor.onp_user_id)
               ->  Seq Scan on onp_crm_activity_log log  (cost=0.00..18696.71 rows=380771 width=24) (actual time=0.005..49.195 rows=380774 loops=1)
               ->  Hash  (cost=25.02..25.02 rows=4515 width=8) (actual time=0.078..0.078 rows=119 loops=1)
                     Buckets: 8192  Batches: 1  Memory Usage: 69kB
                     ->  Index Scan using onp_crm_person_onp_id_idx on onp_crm_person logfor  (cost=0.14..25.02 rows=4515 width=8) (actual time=0.005..0.063 rows=119 loops=1)
                           Filter: (NOT is_resource)
                           Rows Removed by Filter: 8
         ->  Hash  (cost=4700.87..4700.87 rows=1 width=4) (actual time=103.575..103.575 rows=232412 loops=1)
               Buckets: 262144 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 10219kB
               ->  Hash Join  (cost=474.41..4700.87 rows=1 width=4) (actual time=9.724..76.015 rows=232412 loops=1)
                     Hash Cond: (cem.invoice_line_id = il.id)
                     ->  Seq Scan on onp_crm_calendarentry_invoice_membership cem  (cost=0.00..3354.28 rows=232528 width=8) (actual time=0.004..17.626 rows=232528 loops=1)
                     ->  Hash  (cost=474.40..474.40 rows=1 width=4) (actual time=9.710..9.710 rows=11535 loops=1)
                           Buckets: 16384 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 534kB
                           ->  Merge Join  (cost=415.33..474.40 rows=1 width=4) (actual time=4.149..8.467 rows=11535 loops=1)
                                 Merge Cond: (il.invoice_id = inv.entity_id)
                                 ->  Sort  (cost=415.05..415.06 rows=1 width=8) (actual time=4.138..4.701 rows=11535 loops=1)
                                       Sort Key: il.invoice_id
                                       Sort Method: quicksort  Memory: 925kB
                                       ->  Hash Anti Join  (cost=77.13..415.04 rows=1 width=8) (actual time=0.257..2.716 rows=11535 loops=1)
                                             Hash Cond: (il.invoice_id = creditnote.credit_against)
                                             ->  Seq Scan on onp_crm_invoice_line il  (cost=0.00..294.30 rows=11630 width=8) (actual time=0.003..0.995 rows=11630 loops=1)
                                             ->  Hash  (cost=50.38..50.38 rows=2140 width=4) (actual time=0.247..0.247 rows=37 loops=1)
                                                   Buckets: 4096  Batches: 1  Memory Usage: 34kB
                                                   ->  Index Only Scan using origo_invoice_credit_against_idx on onp_crm_invoice creditnote  (cost=0.28..50.38 rows=2140 width=4) (actual time=0.013..0.182 rows=2140 loops=1)
                                                         Heap Fetches: 0
                                 ->  Index Only Scan using origo_invoice_id_status_sent_idx on onp_crm_invoice inv  (cost=0.28..53.98 rows=2140 width=8) (actual time=0.008..1.274 rows=11576 loops=1)
                                       Heap Fetches: 0
 Planning time: 0.955 ms
 Execution time: 381.884 ms
(35 rows)
 
AFAIU, this is the problematic estimate:

-> Hash Anti Join (cost=77.13..415.04 rows=1 width=8) (actual time=0.257..2.716 rows=11535 loops=1)

Right?
 
Now, after the patch (and without needing to disable nested loops), it gives
the following explain plan:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=44409.89..44428.47 rows=1486 width=36) (actual time=366.502..366.594 rows=404 loops=1)
   Group Key: log.relation_id
   ->  Hash Anti Join  (cost=10157.30..43650.11 rows=151956 width=12) (actual time=99.420..347.309 rows=148327 loops=1)
         Hash Cond: (log.id = cem.calendar_entry_id)
         ->  Merge Join  (cost=0.56..26729.34 rows=379984 width=20) (actual time=0.011..150.415 rows=380614 loops=1)
               Merge Cond: (logfor.onp_user_id = log.logged_for)
               ->  Index Scan using onp_crm_person_onp_id_idx on onp_crm_person logfor  (cost=0.14..25.02 rows=4515 width=8) (actual time=0.006..0.030 rows=35 loops=1)
                     Filter: (NOT is_resource)
                     Rows Removed by Filter: 2
               ->  Index Scan using origo_activity_log_logged_for_idx on onp_crm_activity_log log  (cost=0.42..21977.16 rows=380771 width=24) (actual time=0.003..69.406 rows=380623 loops=1)
         ->  Hash  (cost=7300.48..7300.48 rows=228500 width=4) (actual time=99.226..99.226 rows=232412 loops=1)
               Buckets: 262144  Batches: 1  Memory Usage: 10219kB
               ->  Hash Join  (cost=789.13..7300.48 rows=228500 width=4) (actual time=8.043..67.079 rows=232412 loops=1)
                     Hash Cond: (cem.invoice_line_id = il.id)
                     ->  Seq Scan on onp_crm_calendarentry_invoice_membership cem  (cost=0.00..3354.28 rows=232528 width=8) (actual time=0.007..16.230 rows=232528 loops=1)
                     ->  Hash  (cost=646.27..646.27 rows=11429 width=4) (actual time=8.003..8.003 rows=11535 loops=1)
                           Buckets: 16384  Batches: 1  Memory Usage: 534kB
                           ->  Merge Anti Join  (cost=0.85..646.27 rows=11429 width=4) (actual time=0.021..6.393 rows=11535 loops=1)
                                 Merge Cond: (il.invoice_id = creditnote.credit_against)
                                 ->  Merge Join  (cost=0.57..613.99 rows=11630 width=8) (actual time=0.010..4.771 rows=11630 loops=1)
                                       Merge Cond: (inv.entity_id = il.invoice_id)
                                       ->  Index Only Scan using origo_invoice_id_status_sent_idx on onp_crm_invoice inv  (cost=0.28..53.98 rows=2140 width=8) (actual time=0.006..0.318 rows=2140 loops=1)
                                             Heap Fetches: 0
                                       ->  Index Scan using onp_crm_invoice_line_invoice_id_lineno_key on onp_crm_invoice_line il  (cost=0.29..409.28 rows=11630 width=8) (actual time=0.003..2.323 rows=11630 loops=1)
                                 ->  Index Only Scan using onp_crm_invoice_credit_against_key on onp_crm_invoice creditnote  (cost=0.28..47.98 rows=2140 width=4) (actual time=0.011..0.016 rows=38 loops=1)
                                       Heap Fetches: 0
 Planning time: 0.931 ms
 Execution time: 366.710 ms
(28 rows)
 
And, for fun, with max_parallel_workers_per_gather = 6:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=39788.73..39851.89 rows=1486 width=36) (actual time=234.289..234.747 rows=404 loops=1)
   Group Key: log.relation_id
   ->  Sort  (cost=39788.73..39799.88 rows=4458 width=36) (actual time=234.276..234.307 rows=1000 loops=1)
         Sort Key: log.relation_id
         Sort Method: quicksort  Memory: 103kB
         ->  Gather  (cost=39054.15..39518.53 rows=4458 width=36) (actual time=233.012..234.110 rows=1000 loops=1)
               Workers Planned: 3
               Workers Launched: 3
               ->  Partial HashAggregate  (cost=38054.15..38072.73 rows=1486 width=36) (actual time=225.607..225.701 rows=250 loops=4)
                     Group Key: log.relation_id
                     ->  Hash Join  (cost=10238.19..37294.37 rows=151956 width=12) (actual time=163.060..219.652 rows=37082 loops=4)
                           Hash Cond: (log.logged_for = logfor.onp_user_id)
                           ->  Hash Anti Join  (cost=10156.73..28460.48 rows=152271 width=16) (actual time=162.618..212.351 rows=37121 loops=4)
                                 Hash Cond: (log.id = cem.calendar_entry_id)
                                 ->  Parallel Seq Scan on onp_crm_activity_log log  (cost=0.00..16117.29 rows=122829 width=24) (actual time=0.010..15.532 rows=95193 loops=4)
                                 ->  Hash  (cost=7300.48..7300.48 rows=228500 width=4) (actual time=161.270..161.270 rows=232412 loops=4)
                                       Buckets: 262144  Batches: 1  Memory Usage: 10219kB
                                       ->  Hash Join  (cost=789.13..7300.48 rows=228500 width=4) (actual time=18.276..112.034 rows=232412 loops=4)
                                             Hash Cond: (cem.invoice_line_id = il.id)
                                             ->  Seq Scan on onp_crm_calendarentry_invoice_membership cem  (cost=0.00..3354.28 rows=232528 width=8) (actual time=0.045..26.061 rows=232528 loops=4)
                                             ->  Hash  (cost=646.27..646.27 rows=11429 width=4) (actual time=18.116..18.116 rows=11535 loops=4)
                                                   Buckets: 16384  Batches: 1  Memory Usage: 534kB
                                                   ->  Merge Anti Join  (cost=0.85..646.27 rows=11429 width=4) (actual time=0.179..14.505 rows=11535 loops=4)
                                                         Merge Cond: (il.invoice_id = creditnote.credit_against)
                                                         ->  Merge Join  (cost=0.57..613.99 rows=11630 width=8) (actual time=0.113..10.871 rows=11630 loops=4)
                                                               Merge Cond: (inv.entity_id = il.invoice_id)
                                                               ->  Index Only Scan using origo_invoice_id_status_sent_idx on onp_crm_invoice inv  (cost=0.28..53.98 rows=2140 width=8) (actual time=0.063..0.811 rows=2140 loops=4)
                                                                     Heap Fetches: 0
                                                               ->  Index Scan using onp_crm_invoice_line_invoice_id_lineno_key on onp_crm_invoice_line il  (cost=0.29..409.28 rows=11630 width=8) (actual time=0.046..5.213 rows=11630 loops=4)
                                                         ->  Index Only Scan using onp_crm_invoice_credit_against_key on onp_crm_invoice creditnote  (cost=0.28..47.98 rows=2140 width=4) (actual time=0.063..0.071 rows=38 loops=4)
                                                               Heap Fetches: 0
                           ->  Hash  (cost=25.02..25.02 rows=4515 width=8) (actual time=0.325..0.325 rows=119 loops=4)
                                 Buckets: 8192  Batches: 1  Memory Usage: 69kB
                                 ->  Index Scan using onp_crm_person_onp_id_idx on onp_crm_person logfor  (cost=0.14..25.02 rows=4515 width=8) (actual time=0.032..0.270 rows=119 loops=4)
                                       Filter: (NOT is_resource)
                                       Rows Removed by Filter: 8
 Planning time: 2.443 ms
 Execution time: 239.979 ms
(38 rows)
 
All in all, thanks for looking into this, and +1 for backpatching to 9.5.
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andreas@visena.com
www.visena.com

#5 Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Andrew Gierth (#3)
Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

"Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> What I did in the patch is to scale the formerly fixed "-1.0"
Tom> stadistinct estimate to discount the fraction of nulls we found.

Andrew> This seems quite dubious to me. stadistinct representing only
Andrew> the non-null values seems to me to be substantially more useful
Andrew> and less confusing; it should be up to consumers to take
Andrew> stanullfrac into account (in general they already do) since in
Andrew> many cases we explicitly do _not_ want to count nulls.

Hm. I am wrong about this, since it's the fact that consumers are taking
stanullfrac into account that makes the value wrong in the first place.
For example, if a million-row table has stanullfrac=0.9 and
stadistinct=-1, then get_variable_numdistinct is returning 1 million,
and (for example) var_eq_non_const divides 0.1 by that to give a
selectivity of 1 in 10 million, which is obviously wrong.
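
For concreteness, a minimal sketch of that situation (the table and
column names here are made up):

CREATE TABLE bignulls AS
  SELECT CASE WHEN random() < 0.1 THEN g END AS x
  FROM generate_series(1, 1000000) g;  -- ~90% null, the rest unique
ANALYZE bignulls;
SELECT null_frac, n_distinct FROM pg_stats
WHERE tablename = 'bignulls' AND attname = 'x';

Unpatched, that should report null_frac ~ 0.9 and n_distinct = -1, so
get_variable_numdistinct returns 1,000,000 and var_eq_non_const
computes (1 - 0.9) / 1000000 = 1e-7, ten times smaller than the true
~1e-6 chance of matching one of the ~100,000 distinct non-null values.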

But I think the fix is still wrong, because it changes the meaning of
ALTER TABLE ... ALTER col SET (n_distinct=...) in a non-useful way; it
is no longer possible to nail down a useful negative n_distinct value if
the null fraction of the column is variable. Would it not make more
sense to do the adjustment in get_variable_numdistinct, instead?
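
For reference, the override in question looks like this (hypothetical
table/column names; subsequent ANALYZE runs then use the forced value):

ALTER TABLE mytab ALTER COLUMN mycol SET (n_distinct = -0.1);
ANALYZE mytab;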

--
Andrew (irc:RhodiumToad)


#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Gierth (#5)
Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

Hm. I am wrong about this, since it's the fact that consumers are taking
stanullfrac into account that makes the value wrong in the first place.

Also, the way that the value is calculated in the samples-not-all-distinct
case corresponds to the way I have it in the patch. What you want to do
would correspond to leaving these edge cases alone and changing all the
other ANALYZE cases instead (*plus* changing the consumers). I find that
a tad scary.

But I think the fix is still wrong, because it changes the meaning of
ALTER TABLE ... ALTER col SET (n_distinct=...) in a non-useful way; it
is no longer possible to nail down a useful negative n_distinct value if
the null fraction of the column is variable.

I think that argument is bogus. If we change the way that
get_variable_numdistinct (and other consumers) use the value, that will
break all existing custom settings of n_distinct, because they will no
longer mean what they did before. There have been exactly zero field
complaints that people could not get the results they wanted, so I do
not think that's justified.

In short, what you want to do constitutes a redefinition of stadistinct,
while my patch doesn't. That is far more invasive and I fear it will
break things that are not broken today.

regards, tom lane


#7 Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Tom Lane (#6)
Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> Also, the way that the value is calculated in the
Tom> samples-not-all-distinct case corresponds to the way I have it in
Tom> the patch.

Ahh, gotcha. You're referring to this:

    /*
     * If we estimated the number of distinct values at more than 10% of
     * the total row count (a very arbitrary limit), then assume that
     * stadistinct should scale with the row count rather than be a fixed
     * value.
     */
    if (stats->stadistinct > 0.1 * totalrows)
        stats->stadistinct = -(stats->stadistinct / totalrows);

where "totalrows" includes nulls obviously. So this expects negative
stadistinct to be scaled by the total table size, and the all-distinct
case should do the same.
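
(Worked through on the manynulls example from upthread: totalrows =
10000 with 8973 nulls and 1027 distinct non-null values; 1027 > 0.1 *
10000, so this path would set stadistinct = -(1027 / 10000) = -0.1027,
i.e. the same value the patched all-distinct path now produces.)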

Objection withdrawn.

--
Andrew (irc:RhodiumToad)


#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Gierth (#7)
Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

Objection withdrawn.

OK, thanks. What shall we do about Andreas' request to back-patch this?
I'm personally willing to do it, but there is the old bugaboo of "maybe
it will destabilize a plan that someone is happy with".

regards, tom lane


#9 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#8)
Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

On 5 August 2016 at 21:48, Tom Lane <tgl@sss.pgh.pa.us> wrote:

OK, thanks. What shall we do about Andreas' request to back-patch this?
I'm personally willing to do it, but there is the old bugaboo of "maybe
it will destabilize a plan that someone is happy with".

My inclination would be to back-patch it because arguably it's a
bug-fix -- at the very least the old behaviour didn't match the docs
for stadistinct:

The number of distinct nonnull data values in the column.
A value greater than zero is the actual number of distinct values.
A value less than zero is the negative of a multiplier for the number
of rows in the table; for example, a column in which values appear about
twice on the average could be represented by
<structfield>stadistinct</> = -0.5.

Additionally, I think that example is misleading because it's only
really true if there are no null values in the column. Perhaps it
would help to have a more explicit example to illustrate how nulls
affect stadistinct, for example:

... for example, a column in which about 80% of the values are nonnull
and each nonnull value appears about twice on average could be
represented by <structfield>stadistinct</> = -0.4.
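
A quick way to sanity-check that number (a sketch; the table name is
made up):

CREATE TABLE dupnulls AS
  SELECT g % 40000 AS v FROM generate_series(1, 80000) g  -- each value twice
  UNION ALL
  SELECT NULL::int FROM generate_series(1, 20000);        -- 20% nulls
ANALYZE dupnulls;
SELECT null_frac, n_distinct FROM pg_stats WHERE tablename = 'dupnulls';

With the fix in place this should report null_frac = 0.2 and
n_distinct close to -0.4.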

Regards,
Dean


#10 Andreas Joseph Krogh
andreas@visena.com
In reply to: Dean Rasheed (#9)
Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

On Sunday, 7 August 2016 at 09:01:40, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
On 5 August 2016 at 21:48, Tom Lane <tgl@sss.pgh.pa.us> wrote:

OK, thanks.  What shall we do about Andreas' request to back-patch this?
I'm personally willing to do it, but there is the old bugaboo of "maybe
it will destabilize a plan that someone is happy with".

My inclination would be to back-patch it because arguably it's a
bug-fix -- at the very least the old behaviour didn't match the docs
for stadistinct:
[snip]
 
 
Will this then make it into the soon-to-be-released 9.5.4?
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andreas@visena.com
www.visena.com

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#9)
Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On 5 August 2016 at 21:48, Tom Lane <tgl@sss.pgh.pa.us> wrote:

OK, thanks. What shall we do about Andreas' request to back-patch this?
I'm personally willing to do it, but there is the old bugaboo of "maybe
it will destabilize a plan that someone is happy with".

My inclination would be to back-patch it because arguably it's a
bug-fix -- at the very least the old behaviour didn't match the docs
for stadistinct:

Yeah. I suspect that situations like this are pretty rare, or we'd have
recognized the problem sooner. And at least for Andreas, it'd be fixing
a real problem. I'll apply the back-patch unless I hear objections in
the next couple of hours.

Additionally, I think that example is misleading because it's only
really true if there are no null values in the column. Perhaps it
would help to have a more explicit example to illustrate how nulls
affect stadistinct, for example:

Good idea, will do.

regards, tom lane


#12 Andreas Joseph Krogh
andreas@visena.com
In reply to: Tom Lane (#11)
Re: Bogus ANALYZE results for an otherwise-unique column with many nulls

On Sunday, 7 August 2016 at 18:35:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On 5 August 2016 at 21:48, Tom Lane <tgl@sss.pgh.pa.us> wrote:

OK, thanks.  What shall we do about Andreas' request to back-patch this?
I'm personally willing to do it, but there is the old bugaboo of "maybe
it will destabilize a plan that someone is happy with".

My inclination would be to back-patch it because arguably it's a
bug-fix -- at the very least the old behaviour didn't match the docs
for stadistinct:

Yeah.  I suspect that situations like this are pretty rare, or we'd have
recognized the problem sooner.  And at least for Andreas, it'd be fixing
a real problem.  I'll apply the back-patch unless I hear objections in
the next couple of hours.

Additionally, I think that example is misleading because it's only
really true if there are no null values in the column. Perhaps it
would help to have a more explicit example to illustrate how nulls
affect stadistinct, for example:

Good idea, will do.

regards, tom lane
 
 
(Sorry for the HTML formatting; I'm using a web-based client which isn't very
good at converting HTML to plain text.)
 
Just wanted to thank you for fixing this.
 
Take this query (pg-9.5):
 
explain analyze
SELECT com.entity_id AS company_id
     , sum((coalesce(log.adjusted_time, log.duration) / (3600 * 1000)::numeric) * coalesce(log.adjusted_hourly_rate, log.hourly_rate)) as not_invoiced_hours_amount_adjusted
     , sum((log.duration / (3600 * 1000)::numeric) * log.hourly_rate) as not_invoiced_hours_amount_original
     , sum(coalesce(log.adjusted_time, log.duration)) AS not_invoiced_hours_duration_adjusted
     , sum(log.duration) AS not_invoiced_hours_duration_original
FROM onp_crm_relation com
JOIN onp_crm_activity_log log ON com.entity_id = log.relation_id
JOIN onp_crm_person logfor
  ON logfor.onp_user_id = log.logged_for AND logfor.is_resource = FALSE
LEFT OUTER JOIN (
    onp_crm_activity act
    INNER JOIN onp_crm_project proj ON act.project_id = proj.id
) ON log.activity_id = act.id
WHERE 1 = 1
  AND log.entry_type = 'TIMESHEET_ENTRY'
  AND log.start_date IS NOT NULL
  AND log.is_chargable = TRUE
  AND log.status IN (1,2,3,5)
  AND log.start_date <= '2016-06-27'
  -- AND com.entity_id = 2246068
  AND NOT EXISTS(
      SELECT * FROM onp_crm_calendarentry_invoice_membership cem
      JOIN onp_crm_invoice_line il ON cem.invoice_line_id = il.id
      JOIN onp_crm_invoice inv ON il.invoice_id = inv.entity_id
      WHERE cem.calendar_entry_id = log.id
        AND inv.status_key = 'INVOICE_STATUS_INVOICED'
        AND inv.sent_date <= '2016-06-27'
        AND NOT EXISTS(
            SELECT * FROM onp_crm_invoice creditnote
            WHERE il.invoice_id = creditnote.credit_against
              AND creditnote.status_key = 'INVOICE_STATUS_INVOICED'
              AND creditnote.sent_date <= '2016-06-27'
        )
  )
GROUP BY com.entity_id;
 
Before the fix, the plan was like this:
 
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=13874.94..13894.38 rows=972 width=32) (actual time=420188.465..420189.232 rows=462 loops=1)
   Group Key: com.entity_id
   ->  Nested Loop Left Join  (cost=1710.21..13848.21 rows=972 width=32) (actual time=1004.858..420166.736 rows=3882 loops=1)
         ->  Nested Loop Anti Join  (cost=1709.51..12990.02 rows=972 width=36) (actual time=1003.001..419356.749 rows=3882 loops=1)
               Join Filter: (cem.calendar_entry_id = log.id)
               Rows Removed by Join Filter: 3996874113
               ->  Hash Join  (cost=1172.69..10612.53 rows=972 width=44) (actual time=29.085..333.323 rows=89431 loops=1)
                     Hash Cond: (log.relation_id = com.entity_id)
                     ->  Hash Join  (cost=1063.39..10489.85 rows=976 width=40) (actual time=27.598..261.822 rows=89431 loops=1)
                           Hash Cond: (log.logged_for = logfor.onp_user_id)
                           ->  Bitmap Heap Scan on onp_crm_activity_log log  (cost=976.83..10055.99 rows=90010 width=44) (actual time=27.402..223.407 rows=89486 loops=1)
                                 Recheck Cond: ((status = ANY ('{1,2,3,5}'::integer[])) AND is_chargable AND (start_date IS NOT NULL) AND ((entry_type)::text = 'TIMESHEET_ENTRY'::text))
                                 Filter: (start_date <= '2016-06-27'::date)
                                 Rows Removed by Filter: 991
                                 Heap Blocks: exact=6095
                                 ->  Bitmap Index Scan on origo_calendar_entry_status_1_idx  (cost=0.00..954.33 rows=93008 width=0) (actual time=26.339..26.339 rows=101274 loops=1)
                                       Index Cond: (status = ANY ('{1,2,3,5}'::integer[]))
                           ->  Hash  (cost=39.46..39.46 rows=3768 width=8) (actual time=0.148..0.148 rows=36 loops=1)
                                 Buckets: 4096  Batches: 1  Memory Usage: 34kB
                                 ->  Bitmap Heap Scan on onp_crm_person logfor  (cost=3.69..39.46 rows=3768 width=8) (actual time=0.049..0.137 rows=36 loops=1)
                                       Recheck Cond: (onp_user_id IS NOT NULL)
                                       Filter: (NOT is_resource)
                                       Rows Removed by Filter: 5
                                       Heap Blocks: exact=11
                                       ->  Bitmap Index Scan on onp_crm_person_onp_id_idx  (cost=0.00..2.75 rows=41 width=0) (actual time=0.021..0.021 rows=41 loops=1)
                     ->  Hash  (cost=80.66..80.66 rows=2291 width=8) (actual time=1.449..1.449 rows=2292 loops=1)
                           Buckets: 4096  Batches: 1  Memory Usage: 122kB
                           ->  Index Only Scan using onp_crm_relation_pkey on onp_crm_relation com  (cost=0.28..80.66 rows=2291 width=8) (actual time=0.221..1.036 rows=2292 loops=1)
                                 Heap Fetches: 1127
               ->  Materialize  (cost=536.82..2362.91 rows=1 width=4) (actual time=0.000..1.564 rows=44693 loops=89431)
                     ->  Nested Loop  (cost=536.82..2362.91 rows=1 width=4) (actual time=10.480..382.816 rows=85668 loops=1)
                           ->  Nested Loop  (cost=536.53..2362.59 rows=1 width=8) (actual time=10.437..176.495 rows=88731 loops=1)
                                 ->  Hash Anti Join  (cost=536.24..2362.05 rows=1 width=8) (actual time=9.914..44.329 rows=76347 loops=1)
                                       Hash Cond: (il.invoice_id = creditnote.credit_against)
                                       ->  Index Only Scan using origo_invoice_line_id_invoice_idx on onp_crm_invoice_line il  (cost=0.29..1530.95 rows=78707 width=8) (actual time=0.213..20.450 rows=78787 loops=1)
                                             Heap Fetches: 3131
                                       ->  Hash  (cost=390.36..390.36 rows=11647 width=8) (actual time=9.589..9.589 rows=372 loops=1)
                                             Buckets: 16384  Batches: 1  Memory Usage: 143kB
                                             ->  Index Only Scan using origo_invoice_credit_against_status_sent_idx on onp_crm_invoice creditnote  (cost=0.29..390.36 rows=11647 width=8) (actual time=0.205..8.953 rows=11507 loops=1)
                                                   Index Cond: ((status_key = 'INVOICE_STATUS_INVOICED'::text) AND (sent_date <= '2016-06-27'::date))
                                                   Heap Fetches: 999
                                 ->  Index Only Scan using onp_crm_calendarentry_invoice_invoice_line_id_calendar_entr_key on onp_crm_calendarentry_invoice_membership cem  (cost=0.29..0.45 rows=9 width=8) (actual time=0.001..0.001 rows=1 loops=76347)
                                       Index Cond: (invoice_line_id = il.id)
                                       Heap Fetches: 4951
                           ->  Index Only Scan using origo_invoice_id_status_sent_idx on onp_crm_invoice inv  (cost=0.29..0.31 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=88731)
                                 Index Cond: ((entity_id = il.invoice_id) AND (status_key = 'INVOICE_STATUS_INVOICED'::text) AND (sent_date <= '2016-06-27'::date))
                                 Heap Fetches: 9293
         ->  Nested Loop  (cost=0.71..0.87 rows=1 width=4) (actual time=0.207..0.207 rows=1 loops=3882)
               ->  Index Scan using onp_crm_activity_pkey on onp_crm_activity act  (cost=0.42..0.56 rows=1 width=8) (actual time=0.199..0.199 rows=1 loops=3882)
                     Index Cond: (log.activity_id = id)
               ->  Index Only Scan using onp_crm_project_pkey on onp_crm_project proj  (cost=0.29..0.30 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=3882)
                     Index Cond: (id = act.project_id)
                     Heap Fetches: 2782
 Planning time: 33.132 ms
 Execution time: 420195.041 ms
(55 rows)
 
After the fix (using pg-9.5.4), the plan is this:
 
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=44223.20..44269.02 rows=2291 width=32) (actual time=604.480..604.729 rows=462 loops=1)
   Group Key: com.entity_id
   ->  Hash Left Join  (cost=40805.80..42869.54 rows=49224 width=32) (actual time=591.912..597.284 rows=3882 loops=1)
         Hash Cond: (log.activity_id = act.id)
         ->  Hash Join  (cost=24077.93..25498.04 rows=49224 width=36) (actual time=309.816..313.542 rows=3882 loops=1)
               Hash Cond: (log.relation_id = com.entity_id)
               ->  Merge Join  (cost=23968.63..24711.11 rows=49435 width=32) (actual time=304.949..307.474 rows=3882 loops=1)
                     Merge Cond: (logfor.onp_user_id = log.logged_for)
                     ->  Index Scan using onp_crm_person_onp_id_idx on onp_crm_person logfor  (cost=0.14..41.05 rows=3768 width=8) (actual time=0.194..1.321 rows=36 loops=1)
                           Filter: (NOT is_resource)
                           Rows Removed by Filter: 5
                     ->  Sort  (cost=23968.49..24092.29 rows=49520 width=36) (actual time=304.579..305.137 rows=3893 loops=1)
                           Sort Key: log.logged_for
                           Sort Method: quicksort  Memory: 401kB
                           ->  Hash Anti Join  (cost=9997.74..20106.99 rows=49520 width=36) (actual time=147.213..302.609 rows=3893 loops=1)
                                 Hash Cond: (log.id = cem.calendar_entry_id)
                                 ->  Bitmap Heap Scan on onp_crm_activity_log log  (cost=976.83..10055.99 rows=90010 width=44) (actual time=28.286..158.599 rows=89486 loops=1)
                                       Recheck Cond: ((status = ANY ('{1,2,3,5}'::integer[])) AND is_chargable AND (start_date IS NOT NULL) AND ((entry_type)::text = 'TIMESHEET_ENTRY'::text))
                                       Filter: (start_date <= '2016-06-27'::date)
                                       Rows Removed by Filter: 991
                                       Heap Blocks: exact=6095
                                       ->  Bitmap Index Scan on origo_calendar_entry_status_1_idx  (cost=0.00..954.33 rows=93008 width=0) (actual time=27.251..27.251 rows=101274 loops=1)
                                             Index Cond: (status = ANY ('{1,2,3,5}'::integer[]))
                                 ->  Hash  (cost=7964.73..7964.73 rows=84494 width=4) (actual time=118.106..118.106 rows=85668 loops=1)
                                       Buckets: 131072  Batches: 1  Memory Usage: 4036kB
                                       ->  Hash Join  (cost=1069.29..7964.73 rows=84494 width=4) (actual time=15.605..102.721 rows=85668 loops=1)
                                             Hash Cond: (il.invoice_id = inv.entity_id)
                                             ->  Hash Anti Join  (cost=536.55..6265.54 rows=85741 width=8) (actual time=5.467..71.124 rows=88731 loops=1)
                                                   Hash Cond: (il.invoice_id = creditnote.credit_against)
                                                   ->  Merge Join  (cost=0.61..4501.71 rows=89266 width=8) (actual time=0.441..53.014 rows=89301 loops=1)
                                                         Merge Cond: (il.id = cem.invoice_line_id)
                                                         ->  Index Only Scan using origo_invoice_line_id_invoice_idx on onp_crm_invoice_line il  (cost=0.29..1530.95 rows=78707 width=8) (actual time=0.226..15.901 rows=78787 loops=1)
                                                               Heap Fetches: 3131
                                                         ->  Index Only Scan using onp_crm_calendarentry_invoice_invoice_line_id_calendar_entr_key on onp_crm_calendarentry_invoice_membership cem  (cost=0.29..1662.98 rows=89266 width=8) (actual time=0.194..16.489 rows=89301 loops=1)
                                                               Heap Fetches: 5027
                                                   ->  Hash  (cost=390.36..390.36 rows=11647 width=8) (actual time=4.785..4.785 rows=372 loops=1)
                                                         Buckets: 16384  Batches: 1  Memory Usage: 143kB
                                                         ->  Index Only Scan using origo_invoice_credit_against_status_sent_idx on onp_crm_invoice creditnote  (cost=0.29..390.36 rows=11647 width=8) (actual time=0.020..4.213 rows=11507 loops=1)
                                                               Index Cond: ((status_key = 'INVOICE_STATUS_INVOICED'::text) AND (sent_date <= '2016-06-27'::date))
                                                               Heap Fetches: 999
                                             ->  Hash  (cost=387.15..387.15 rows=11647 width=8) (actual time=10.063..10.063 rows=11507 loops=1)
                                                   Buckets: 16384  Batches: 1  Memory Usage: 578kB
                                                   ->  Index Only Scan using origo_invoice_id_status_sent_idx on onp_crm_invoice inv  (cost=0.29..387.15 rows=11647 width=8) (actual time=0.233..8.133 rows=11507 loops=1)
                                                         Index Cond: ((status_key = 'INVOICE_STATUS_INVOICED'::text) AND (sent_date <= '2016-06-27'::date))
                                                         Heap Fetches: 999
               ->  Hash  (cost=80.66..80.66 rows=2291 width=8) (actual time=4.831..4.831 rows=2292 loops=1)
                     Buckets: 4096  Batches: 1  Memory Usage: 122kB
                     ->  Index Only Scan using onp_crm_relation_pkey on onp_crm_relation com  (cost=0.28..80.66 rows=2291 width=8) (actual time=0.235..4.370 rows=2292 loops=1)
                           Heap Fetches: 1127
         ->  Hash  (cost=14012.31..14012.31 rows=217245 width=4) (actual time=281.015..281.015 rows=220147 loops=1)
               Buckets: 262144  Batches: 1  Memory Usage: 9788kB
               ->  Hash Join  (cost=433.74..14012.31 rows=217245 width=4) (actual time=29.243..235.289 rows=220147 loops=1)
                     Hash Cond: (act.project_id = proj.id)
                     ->  Seq Scan on onp_crm_activity act  (cost=0.00..10591.45 rows=217245 width=8) (actual time=0.186..139.016 rows=220147 loops=1)
                     ->  Hash  (cost=262.35..262.35 rows=13711 width=4) (actual time=28.980..28.980 rows=13760 loops=1)
                           Buckets: 16384  Batches: 1  Memory Usage: 612kB
                           ->  Index Only Scan using onp_crm_project_pkey on onp_crm_project proj  (cost=0.29..262.35 rows=13711 width=4) (actual time=0.067..26.575 rows=13760 loops=1)
                                 Heap Fetches: 6146
 Planning time: 41.021 ms
 Execution time: 605.530 ms
(60 rows)
 
That's 420195 / 605 = 695x speedup!
Thanks!
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andreas@visena.com
www.visena.com