Making the C collation less inclined to abort abbreviation

Started by Peter Geoghegan over 10 years ago, 2 messages, hackers

The C collation is treated exactly the same as other collations when
considering whether the generation of abbreviated keys for text should
continue. This doesn't make much sense. With text, the big cost that
we are concerned about going to waste should abbreviated keys not
capture sufficient entropy is the cost of n strxfrm() calls. However,
the C collation doesn't use strxfrm() -- it uses memcmp(), which is
far cheaper.
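
To make the cost difference concrete, here is a minimal sketch of the two ways an abbreviated key might be produced. It is not PostgreSQL's code: the function names, the 8-byte key width, and the big-endian packing are all assumptions made for illustration. The C-collation path only copies leading bytes, while the locale-aware path must pay for a full strxfrm() call before it can take a prefix.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: pack up to 8 leading bytes into an integer, most
 * significant byte first, so that unsigned integer comparison orders keys
 * the same way memcmp() orders the zero-padded prefixes.
 */
static uint64_t
pack_prefix(const unsigned char *buf, size_t len)
{
    uint64_t key = 0;
    size_t   n = len < sizeof(key) ? len : sizeof(key);

    for (size_t i = 0; i < sizeof(key); i++)
    {
        key <<= 8;
        if (i < n)
            key |= buf[i];
    }
    return key;
}

/* C collation (illustrative): no transformation, just take the raw prefix. */
uint64_t
abbrev_key_c(const char *s)
{
    return pack_prefix((const unsigned char *) s, strlen(s));
}

/*
 * Other collations (illustrative): transform the whole string with
 * strxfrm() first, then take the prefix of the transformed blob.
 * Returns false if memory for the blob cannot be allocated.
 */
bool
abbrev_key_locale(const char *s, uint64_t *key)
{
    size_t need = strxfrm(NULL, s, 0);   /* length of the transformed string */
    char  *blob = malloc(need + 1);

    if (blob == NULL)
        return false;
    strxfrm(blob, s, need + 1);
    *key = pack_prefix((const unsigned char *) blob, need);
    free(blob);
    return true;
}
```

If abbreviation is later abandoned, the work wasted per tuple in the first function is a short copy; in the second it is a transformation of the entire string.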

With other types, like numeric and now UUID, the cost of generating an
abbreviated key is significantly lower than text when using collations
other than the C collation. Their cost models reflect this, and abort
abbreviation far less aggressively than text's, even though the
trade-off is very similar when text uses the C collation.

Attached patch fixes this inconsistency by making it significantly
less likely that abbreviation will be aborted when the C collation is
in use. The behavior with other collations is unchanged. This should
be backpatched to 9.5 as a bugfix, IMV.
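
The kind of change described above can be sketched as follows. This is not the attached patch: the function name, the parameters, and every threshold are hypothetical, chosen only to show a cost model that demands less entropy from abbreviated keys before giving up when the C collation is in use.

```c
#include <stdbool.h>

/* Hypothetical constants, for illustration only. */
#define MIN_TUPLES_BEFORE_ABORT  100    /* too few tuples to judge */
#define REQUIRED_RATIO_STRXFRM   0.20   /* expensive keys: abort readily */
#define REQUIRED_RATIO_C         0.05   /* cheap keys: abort reluctantly */

/*
 * Decide whether to stop generating abbreviated keys.
 *
 * abbrev_distinct: estimated number of distinct abbreviated keys so far
 * key_distinct:    estimated number of distinct full keys so far
 * ntuples:         number of tuples examined so far
 * collate_c:       true when the C collation is in use
 */
bool
should_abort_abbreviation(double abbrev_distinct, double key_distinct,
                          long ntuples, bool collate_c)
{
    double required;

    /* Never abort on a tiny sample or without a usable estimate. */
    if (ntuples < MIN_TUPLES_BEFORE_ABORT || key_distinct <= 0.0)
        return false;

    required = collate_c ? REQUIRED_RATIO_C : REQUIRED_RATIO_STRXFRM;

    /* Abort only when abbreviated keys resolve too few distinct values. */
    return abbrev_distinct < key_distinct * required;
}
```

With these made-up numbers, 100 distinct abbreviated keys against 1000 distinct full keys would abort under a strxfrm()-based collation but continue under the C collation.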

--
Peter Geoghegan

Attachments:

0001-Abort-C-collation-text-abbreviation-less-frequently.patch (text/x-patch; charset=US-ASCII; +6 -3)

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#1)
Re: Making the C collation less inclined to abort abbreviation

On Sun, Nov 29, 2015 at 4:02 PM, Peter Geoghegan <pg@heroku.com> wrote:

> The C collation is treated exactly the same as other collations when
> considering whether the generation of abbreviated keys for text should
> continue. This doesn't make much sense. With text, the big cost that
> we are concerned about going to waste should abbreviated keys not
> capture sufficient entropy is the cost of n strxfrm() calls. However,
> the C collation doesn't use strxfrm() -- it uses memcmp(), which is
> far cheaper.
>
> With other types, like numeric and now UUID, the cost of generating an
> abbreviated key is significantly lower than text when using collations
> other than the C collation. Their cost models reflect this, and abort
> abbreviation far less aggressively than text's, even though the
> trade-off is very similar when text uses the C collation.
>
> Attached patch fixes this inconsistency by making it significantly
> less likely that abbreviation will be aborted when the C collation is
> in use. The behavior with other collations is unchanged. This should
> be backpatched to 9.5 as a bugfix, IMV.

Could you provide a test case where this change is a winner for the C
locale but a loser for some other locale?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
