Hash index initial size is too large given NULLs or partial indexes

Started by Jeff Janesalmost 7 years ago3 messages
#1Jeff Janes
jeff.janes@gmail.com

Referring to this thread:

https://dba.stackexchange.com/questions/231647/why-are-partial-postgresql-hash-indices-not-smaller-than-full-indices

When a hash index is created on a populated table, it estimates the number
of buckets to start out with based on the number of tuples returned
by estimate_rel_size. But this number ignores both the fact that NULLs are
not stored in hash indexes, and that partial indexes exist. This can lead
to much too large hash indexes. Doing a re-index just repeats the logic,
so doesn't fix anything. Fill factor also can't fix it, as you are not
allowed to increase that beyond 100.

This goes back to when the pre-sizing was implemented in 2008
(c9a1cc694abef737548a2a). It seems to be an oversight, rather than
something that was considered.

Is this a bug that should be fixed? Or if getting a more accurate estimate
is not possible or not worthwhile, add a code comment about that?

Cheers,

Jeff

#2Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeff Janes (#1)
Re: Hash index initial size is too large given NULLs or partial indexes

On 3/8/19 7:14 PM, Jeff Janes wrote:

Referring to this thread:

https://dba.stackexchange.com/questions/231647/why-are-partial-postgresql-hash-indices-not-smaller-than-full-indices

When a hash index is created on a populated table, it estimates the
number of buckets to start out with based on the number of tuples
returned by estimate_rel_size.  But this number ignores both the fact
that NULLs are not stored in hash indexes, and that partial indexes
exist.  This can lead to much too large hash indexes.  Doing a re-index
just repeats the logic, so doesn't fix anything.  Fill factor also can't
fix it, as you are not allowed to increase that beyond 100.

Hmmm :-(

This goes back to when the pre-sizing was implemented in 2008
(c9a1cc694abef737548a2a).  It seems to be an oversight, rather than
something that was considered.

Is this a bug that should be fixed?  Or if getting a more accurate
estimate is not possible or not worthwhile, add a code comment about that?

I'd agree this smells like a bug (or perhaps two). The sizing probably
should consider both null_frac and selectivity of the index predicate.
When those two are redundant (i.e. when there's IS NOT NULL condition on
indexed column), this will result in under-estimate. That means the
index build will do a an extra split, but that's probably better than
having permanently bloated index.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#2)
Re: Hash index initial size is too large given NULLs or partial indexes

On Fri, Mar 8, 2019 at 11:57 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 3/8/19 7:14 PM, Jeff Janes wrote:

This goes back to when the pre-sizing was implemented in 2008
(c9a1cc694abef737548a2a). It seems to be an oversight, rather than
something that was considered.

Is this a bug that should be fixed? Or if getting a more accurate
estimate is not possible or not worthwhile, add a code comment about that?

I'd agree this smells like a bug (or perhaps two). The sizing probably
should consider both null_frac and selectivity of the index predicate.

Like you guys, I also think this area needs improvement. I am not
sure how easy it is to get the selectivity of the predicate in this
code path. If we see how we do it in set_plain_rel_size() during path
generation in the planner, we can get some idea.

Another idea could be that we don't create the buckets till we know
the exact tuples returned by IndexBuildHeapScan. Basically, I think
we need to spool the tuples, create the appropriate buckets and then
insert the tuples. We might want to do this only when some index
predicate is present.

If somebody is interested in doing the leg work, I can help in
reviewing the patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com