Upgrading pg_statistic to handle collation honestly
When we first put in collations support, we basically punted on teaching
ANALYZE, pg_statistic, and the planner selectivity functions about that.
They just use DEFAULT_COLLATION_OID independently of the actual collation
of the data. I tripped over this while investigating making type "name"
collatable: it needs to default to C_COLLATION_OID, and the mismatch
resulted in broken statistics for name columns. So it's time to pay down
that technical debt.
Attached is a draft patch for same. It adds storage to pg_statistic
to record the collation of each statistics "slot". A plausible
alternative design would be to just say "look at the collation of the
underlying column", but that would require extra catcache lookups in
the selectivity functions that need the info. Doing it like this also
makes it theoretically feasible to track stats computed with respect
to different collations for the same column, though I'm not really
convinced that we'd ever do that.
Loose ends:
* I'm not sure what, if anything, needs to be done in the extended
statistics stuff. It looks like the existing types of extended stats
aren't really collation sensitive, so maybe the answer is "nothing".
* There's a remaining use of DEFAULT_COLLATION_OID in array_selfuncs.c's
element_compare(). I'm not sure if it's important to get rid of that,
either; it doesn't seem to be used for anything that relates to
collected statistics, so it might be fine as-is.
* Probably this conflicts to some extent with Peter's "Reorganize
collation lookup" patch, but I haven't studied that yet.
* There's a kluge in get_attstatsslot() that I'd like to get rid of
later, but it's necessary for now because of the weird things that
happen when doing regex operators on "name" columns.
Comments, objections?
regards, tom lane
Attachments:
collatable-statistics-1.patchtext/x-diff; charset=us-ascii; name=collatable-statistics-1.patchDownload+186-120
On 12/12/2018 16:57, Tom Lane wrote:
Attached is a draft patch for same. It adds storage to pg_statistic
to record the collation of each statistics "slot". A plausible
alternative design would be to just say "look at the collation of the
underlying column", but that would require extra catcache lookups in
the selectivity functions that need the info.
That looks like a good approach to me.
Doing it like this also
makes it theoretically feasible to track stats computed with respect
to different collations for the same column, though I'm not really
convinced that we'd ever do that.
It's a good option to keep around. Maybe someday extended statistics
could be used to ask for additional statistics to be collected.
* Probably this conflicts to some extent with Peter's "Reorganize
collation lookup" patch, but I haven't studied that yet.
I've looked it over, and it's nothing that can't be easily fixed up. In
fact, it simplifies a few things, so I'm in favor of moving your patch
along first.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
On 12/12/2018 16:57, Tom Lane wrote:
* Probably this conflicts to some extent with Peter's "Reorganize
collation lookup" patch, but I haven't studied that yet.
I've looked it over, and it's nothing that can't be easily fixed up. In
fact, it simplifies a few things, so I'm in favor of moving your patch
along first.
Thanks for looking! I'll take a closer look at the loose ends I mentioned
and then push it.
regards, tom lane
On 12/12/2018 16:57, Tom Lane wrote:
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index b8445dc..dcbd04c 100644 *** a/src/backend/commands/analyze.c --- b/src/backend/commands/analyze.c *************** examine_attribute(Relation onerel, int a *** 904,914 **** --- 904,917 ---- { stats->attrtypid = exprType(index_expr); stats->attrtypmod = exprTypmod(index_expr); + stats->attrcollid = exprCollation(index_expr); + /* XXX should we consult indcollation instead? */
After looking through this again, I think the answer here is "yes". If
the index definition overrides the collation, then we clearly want to
use that. If it's not overridden, then indcollation is still set, so
it's just as easy to use it.
}
else
{
stats->attrtypid = attr->atttypid;
stats->attrtypmod = attr->atttypmod;
+ stats->attrcollid = attr->attcollation;
}typtuple = SearchSysCacheCopy1(TYPEOID,
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
On 12/12/2018 16:57, Tom Lane wrote:
+ stats->attrcollid = exprCollation(index_expr); + /* XXX should we consult indcollation instead? */
After looking through this again, I think the answer here is "yes".
Yeah, I was leaning towards that as well, but hadn't written the extra
code needed to make it so. Will fix.
regards, tom lane