Built-in CTYPE provider
CTYPE, which handles character classification and upper/lowercasing
behavior, may be simpler than it first appears. We may be able to get
a net decrease in complexity by just building in most (or perhaps all)
of the functionality.
Unicode offers relatively simple rules for CTYPE-like functionality
based on data files. There are a few exceptions and a few options,
which I'll address below.
(In contrast, collation varies a lot from locale to locale, and has a
lot more options and nuance than ctype.)
=== Proposal ===
Parse some Unicode data files into static lookup tables in .h files
(similar to what we already do for normalization) and provide
functions to perform the right lookups according to Unicode
recommendations [1][2]. Then expose the functionality as either a
specially-named locale for the libc provider, or as part of the
built-in collation provider which I previously proposed [3]. (Provided
patches don't expose the functionality yet; I'm looking for feedback
first.)
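To make the proposal concrete, here's a rough Python sketch (not taken from the patches; the struct name pg_unicode_case_entry and the exact table layout are made up) of what an "update-unicode"-style script does: parse UnicodeData.txt records and emit a static C lookup table for checking in to a .h file:

```python
# Simplified sketch of an "update-unicode"-style generator. Real
# UnicodeData.txt lines have 15 semicolon-separated fields; field 2 is
# the general category and field 12 the simple uppercase mapping.
SAMPLE = """\
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
"""

def parse_records(text):
    """Yield (codepoint, general_category, simple_uppercase) tuples.
    A code point with no simple uppercase mapping maps to itself
    (note U+00DF above: "ss" only exists as a *full* mapping)."""
    for line in text.splitlines():
        fields = line.split(";")
        cp = int(fields[0], 16)
        category = fields[2]
        upper = int(fields[12], 16) if fields[12] else cp
        yield cp, category, upper

def emit_c_table(records):
    """Render records as a static C array, like the checked-in .h files."""
    rows = ",\n".join(
        f"\t{{0x{cp:06X}, \"{cat}\", 0x{up:06X}}}" for cp, cat, up in records
    )
    return ("static const pg_unicode_case_entry case_map[] =\n{\n"
            + rows + "\n};\n")

print(emit_c_table(parse_records(SAMPLE)))
```

Checking the generated table in (rather than generating at build time) is what makes diffs visible at update-unicode time.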
Using libc or ICU for a CTYPE provider would still be supported, but
as I explain below, there's not nearly as much reason to do so as you
might expect. As far as I can tell, using an external provider for
CTYPE functionality is mostly unnecessary complexity and magic.
There's still plenty of reason to use the plain "C" semantics, if
desired, but those semantics are already built-in.
=== Benefits ===
* platform-independent ctype semantics based on Unicode, not tied to
any dependency's implementation
* ability to combine fast memcmp() collation with rich ctype
semantics
* user-visible semantics can be documented and tested
* stability within a PG major version
* transparency of changes: tables would be checked in to .h files,
so whoever runs the "update-unicode" build target would see if
there are unexpected or impactful changes that should be addressed
in the release notes
* the built-in tables themselves can be tested exhaustively by
comparing with ICU so we can detect trivial parsing errors and the
like
=== Character Classification ===
Character classification is used for regexes, e.g. whether a character
is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]"
class. Unicode defines what character properties map into these
classes in TR #18 [1], specifying both a "Standard" variant and a
"POSIX Compatible" variant. The main difference with the POSIX variant
is that symbols count as punctuation.
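For illustration, the TR #18 class definitions can be approximated with Python's unicodedata module (the helper names here are mine, not from the patches):

```python
import unicodedata

def is_digit(ch):
    # TR #18 "digit" (\d): general category Nd (decimal number).
    return unicodedata.category(ch) == "Nd"

def is_punct_standard(ch):
    # "Standard" punct: punctuation only (general category P*).
    return unicodedata.category(ch).startswith("P")

def is_punct_posix(ch):
    # "POSIX Compatible" punct: punctuation *and* symbols (P* or S*).
    return unicodedata.category(ch)[0] in ("P", "S")

assert is_digit("7") and is_digit("٧")   # U+0667 ARABIC-INDIC DIGIT SEVEN
assert is_punct_standard("!")
assert not is_punct_standard("+")        # "+" is Sm (math symbol) ...
assert is_punct_posix("+")               # ... so only POSIX counts it
```

Note these classifications are the same regardless of locale, which is what makes building them in tractable.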
Character classification in Unicode does not vary from locale to
locale. The same character is considered to be a member of the same
classes regardless of locale (in other words, there's no
"tailoring"). There is no strong compatibility guarantee around the
classification of characters, but it doesn't seem to change much in
practice (I could collect more data here if it matters).
In glibc, character classification is not affected by the locale as
far as I can tell -- all non-"C" locales behave like "C.UTF-8"
(perhaps other libc implementations or versions or custom locales
behave differently -- corrections welcome). There are some differences
between "C.UTF-8" and what Unicode seems to recommend, and I'm not
entirely sure why those differences exist or whether those differences
are important for anything other than compatibility.
Note: ICU offers character classification based on Unicode standards,
too, but the fact that it's an external dependency makes it a
difficult-to-test black box that is not tied to a PG major
version. Also, we currently don't use the APIs that Unicode
recommends; so in Postgres today, ICU-based character classification
is further from Unicode than glibc character classification.
=== LOWER()/INITCAP()/UPPER() ===
The LOWER() and UPPER() functions are defined in the SQL spec with
surprising detail, relying on specific Unicode General Category
assignments. How to map characters seems to be left (implicitly) up to
Unicode. If the input string is normalized, the output string must be
normalized, too. Weirdly, there's no room in the SQL spec to localize
LOWER()/UPPER() at all to handle issues like [1]. Also, the standard
specifies one example, which is that "ß" becomes "SS" when folded to
upper case. INITCAP() is not in the SQL spec.
In Unicode, lowercasing and uppercasing behavior is a mapping [2], and
also backed by a strong compatibility guarantee that "case pairs" will
always remain case pairs [4]. The mapping may be "simple"
(context-insensitive, locale-insensitive, not adding any code points),
or "full" (may be context-sensitive, locale-sensitive, and one code
point may turn into 1-3 code points).
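Python's str.upper() happens to implement the full mapping, which makes the simple/full difference easy to demonstrate:

```python
# Full mapping can change the number of code points; simple mapping cannot
# (simple-mapped "ß" stays "ß", since U+00DF has no simple uppercase).
assert "ß".upper() == "SS"      # full uppercase: 1 -> 2 code points
assert "ﬁ".upper() == "FI"      # U+FB01 LATIN SMALL LIGATURE FI -> "FI"
assert len("straße".upper()) == len("straße") + 1
```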
Titlecasing (INITCAP() in Postgres) in Unicode is similar to
upper/lowercasing, except that it has the additional complexity of
finding word boundaries, which have a non-trivial definition. To
simplify, we'd either use the Postgres definition (alphanumeric) or
the "word" character class specified in [1]. If someone wants more
sophisticated word segmentation they could use ICU.
While "full" case mapping sounds more complex, there are actually very
few cases to consider and they are covered in another (small) data
file. That data file covers ~100 code points that convert to multiple
code points when the case changes (e.g. "ß" -> "SS"), 7 code points
that have context-sensitive mappings, and three locales which have
special conversions ("lt", "tr", and "az") for a few code points.
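One of the handful of context-sensitive mappings is Greek capital sigma, which lowercases to final sigma at the end of a word; Python's str.lower(), which also implements the full mapping, shows the behavior:

```python
# Context-sensitive full mapping: Σ (U+03A3) lowercases to final sigma
# ς (U+03C2) in word-final position, ordinary σ (U+03C3) elsewhere.
assert "Σ".lower() == "σ"               # isolated: ordinary small sigma
assert "ΟΔΟΣ".lower() == "οδος"         # word-final: final sigma
assert "ΟΔΟΣ".lower()[-1] == "\u03c2"
```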
ICU can do the simple case mapping (u_tolower(), etc.) or full mapping
(u_strToLower(), etc.). I see one difference in ICU that I can't yet
explain for the full titlecase mapping of a singular U+0345.
glibc in UTF8 (at least in my tests) just does the simple upper/lower
case mapping, extended with simple mappings for the locales with
special conversions (which I think are exactly the same 3 locales
mentioned above). libc doesn't do titlecase. If the resulting character
isn't representable in the server encoding, I think libc just maps the
character to itself, though I should test this assumption.
=== Encodings ===
It's easiest to implement these rules in UTF8, but possible for any
encoding where we can decode to a Unicode code point.
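For example (Python sketch): a byte in another server encoding, such as LATIN-1, decodes to the same code point that the lookup tables are keyed on:

```python
# Any server encoding we can decode to Unicode code points can use the
# same lookup tables; e.g. LATIN-1 byte 0xDF decodes to U+00DF ("ß").
cp = ord(b"\xdf".decode("latin-1"))
assert cp == 0x00DF
```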
=== Patches ===
0001 & 0002 are just cleanup. I intend to commit them unless someone
has a comment.
0003 implements character classification ("Standard" and "POSIX
Compatible" variants) but doesn't actually use them for anything.
0004 implements "simple" case mapping, and a partial implementation of
"full" case mapping. Again, does not use them yet.
=== Questions ===
* Is a built-in ctype provider a reasonable direction for Postgres as
a project?
* Does it feel like it would be simpler or more complex than what
we're doing now?
* Do we want to just try to improve our ICU support instead?
* Do we want the built-in provider to be one thing, or have a few
options (e.g. "standard" or "posix" character classification;
"simple" or "full" case mapping)?
Regards,
Jeff Davis
[1]: http://www.unicode.org/reports/tr18/#Compatibility_Properties
[2]: https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G33992
[3]: /messages/by-id/9d63548c4d86b0f820e1ff15a83f93ed9ded4543.camel@j-davis.com
[4]: https://www.unicode.org/policies/stability_policy.html#Case_Pair
--
Jeff Davis
PostgreSQL Contributor Team - AWS
Attachments:
* v2-0004-Add-unicode-case-mapping-tables-and-functions.patch
* v2-0003-Add-Unicode-property-tables.patch
* v2-0002-Shrink-unicode-category-table.patch
* v2-0001-Minor-cleanup-for-unicode-update-build-and-test.patch
On 12/5/23 3:46 PM, Jeff Davis wrote:
=== Character Classification ===
Character classification is used for regexes, e.g. whether a character
is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]"
class. Unicode defines what character properties map into these
classes in TR #18 [1], specifying both a "Standard" variant and a
"POSIX Compatible" variant. The main difference with the POSIX variant
is that symbols count as punctuation.
=== LOWER()/INITCAP()/UPPER() ===
The LOWER() and UPPER() functions are defined in the SQL spec with
surprising detail, relying on specific Unicode General Category
assignments. How to map characters seems to be left (implicitly) up to
Unicode. If the input string is normalized, the output string must be
normalized, too. Weirdly, there's no room in the SQL spec to localize
LOWER()/UPPER() at all to handle issues like [1]. Also, the standard
specifies one example, which is that "ß" becomes "SS" when folded to
upper case. INITCAP() is not in the SQL spec.
=== Questions ===
* Is a built-in ctype provider a reasonable direction for Postgres as
a project?
* Does it feel like it would be simpler or more complex than what
we're doing now?
* Do we want to just try to improve our ICU support instead?
* Do we want the built-in provider to be one thing, or have a few
options (e.g. "standard" or "posix" character classification;
"simple" or "full" case mapping)?
Generally, I am in favor of this - I think we need to move in the
direction of having an in-database option around unicode for PG users,
given how easy it is for administrators to mis-manage dependencies.
Especially when OS admins can be different from DB admins, and when
nobody really understands risks of changing libs with in-place moves to
new operating systems - except for like 4 of us on the mailing lists.
My biggest concern is around maintenance. Every year Unicode is
assigning new characters to existing code points, and those existing
code points can of course already be stored in old databases before libs
are updated. When users start to notice that regex [[:digit:]] or
upper/lower functions aren't working correctly with characters in their
DB, they'll probably come asking for fixes. And we may end up with
something like the timezone database where we need to periodically add a
more current ruleset - albeit alongside as a new version in this case.
Here are direct links to charts of newly assigned characters from the
last few Unicode updates:
2022: https://www.unicode.org/charts/PDF/Unicode-15.0/
2021: https://www.unicode.org/charts/PDF/Unicode-14.0/
2020: https://www.unicode.org/charts/PDF/Unicode-13.0/
2019: https://www.unicode.org/charts/PDF/Unicode-12.0/
If I'm reading the Unicode 15 update correctly, PostgreSQL regex
expressions with [[:digit:]] will not correctly identify Kaktovik or Nag
Mundari or Kawi digits without that update to character type specs.
If I'm reading the Unicode 12 update correctly, then upper/lower
functions aren't going to work correctly on Latin Glottal A and I and U
characters without that update to character type specs.
Overall I see a lot fewer Unicode updates involving upper/lower than I
do with digits - especially since new scripts often involve their own
numbering characters which makes new digits more common.
But let's remember that people like to build indexes on character
classification functions like upper/lower, for case insensitive
searching. It's another case where the index will be corrupted if
someone happened to store Latin Glottal vowels in their database and
then we update libs to the latest character type rules.
So even with something as basic as character type, if we're going to do
it right, we still need to either version it or definitively decide that
we're not going to ever support newly added Unicode characters like
Latin Glottals.
-Jeremy
On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote:
My biggest concern is around maintenance. Every year Unicode is
assigning new characters to existing code points, and those existing
code points can of course already be stored in old databases before
libs
are updated.
Is the concern only about unassigned code points?
I already committed a function "unicode_assigned()" to test whether a
string contains only assigned code points, which can be used in a
CHECK() constraint. I also posted [5] an idea about a per-database
option that could reject the storage of any unassigned code point,
which would make it easier for users highly concerned about
compatibility.
And we may end up with
something like the timezone database where we need to periodically
add a
more current ruleset - albeit alongside as a new version in this
case.
There's a build target "update-unicode" which is run to pull in new
Unicode data files and parse them into static C arrays (we already do
this for the Unicode normalization tables). So I agree that the tables
should be updated but I don't understand why that's a problem.
If I'm reading the Unicode 15 update correctly, PostgreSQL regex
expressions with [[:digit:]] will not correctly identify Kaktovik or
Nag
Mundari or Kawi digits without that update to character type specs.
Yeah, if we are behind in the Unicode version, then results won't be
the most up-to-date. But ICU or libc could also be behind in the
Unicode version.
But let's remember that people like to build indexes on character
classification functions like upper/lower, for case insensitive
searching.
UPPER()/LOWER() are based on case mapping, not character
classification.
I intend to introduce a SQL-level CASEFOLD() function that would obey
Unicode casefolding rules, which have very strong compatibility
guarantees [6] (essentially, if you are only using assigned code points,
you are fine).
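Python's str.casefold() implements Unicode case folding and illustrates the intended behavior (the SQL-level CASEFOLD() function itself is only proposed here, not yet implemented):

```python
# Case folding is a one-way normalization for caseless matching; unlike
# lowercasing it is locale-independent and has strong stability guarantees.
assert "Straße".casefold() == "strasse"   # ß folds to "ss"
assert "MASSE".casefold() == "masse"      # so "Straße" vs "MASSE" differ
assert "Straße".lower() == "straße"       # lower() keeps ß as-is
```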
It's another case where the index will be corrupted if
someone happened to store Latin Glottal vowels in their database and
then we update libs to the latest character type rules.
I don't agree with this characterization at all.
(a) It's not "another case". Corruption of an index on LOWER() can
happen today. My proposal makes the situation better, not worse.
(b) These aren't libraries, I am proposing built-in Unicode tables
that only get updated in a new major PG version.
(c) It likely only affects a small number of indexes and it's easier
for an administrator to guess which ones might be affected, making it
easier to just rebuild those indexes.
(d) It's not a problem if you stick to assigned code points.
So even with something as basic as character type, if we're going to
do
it right, we still need to either version it or definitively decide
that
we're not going to ever support newly added Unicode characters like
Latin Glottals.
If, by "version it", you mean "update the data tables in new Postgres
versions", then I agree. If you mean that one PG version would need to
support many versions of Unicode, I don't agree.
Regards,
Jeff Davis
[5]: /messages/by-id/c5e9dac884332824e0797937518da0b8766c1238.camel@j-davis.com
[6]: https://www.unicode.org/policies/stability_policy.html#Case_Folding
Jeff Davis wrote:
While "full" case mapping sounds more complex, there are actually
very few cases to consider and they are covered in another (small)
data file. That data file covers ~100 code points that convert to
multiple code points when the case changes (e.g. "ß" -> "SS"), 7
code points that have context-sensitive mappings, and three locales
which have special conversions ("lt", "tr", and "az") for a few code
points.
But there are CLDR mappings on top of that.
According to the Unicode FAQ
https://unicode.org/faq/casemap_charprop.html#5
Q: Does the default case mapping work for every language? What
about the default case folding?
[...]
To make case mapping language sensitive, the Unicode Standard
specifically allows implementations to tailor the mappings for
each language, but does not provide the necessary data. The file
SpecialCasing.txt is included in the Standard as a guide to a few
of the more important individual character mappings needed for
specific languages, notably the Greek script and the Turkic
languages. However, for most language-specific mappings and
tailoring, users should refer to CLDR and other resources.
In particular "el" (modern greek) has case mapping rules that
ICU seems to implement, but "el" is missing from the list
("lt", "tr", and "az") you identified.
The CLDR case mappings seem to be found in
https://github.com/unicode-org/cldr/tree/main/common/transforms
in *-Lower.xml and *-Upper.xml
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote:
But there are CLDR mappings on top of that.
I see, thank you.
Would it still be called "full" case mapping to only use the mappings
in SpecialCasing.txt? And would that be useful?
Regards,
Jeff Davis
On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote:
In particular "el" (modern greek) has case mapping rules that
ICU seems to implement, but "el" is missing from the list
("lt", "tr", and "az") you identified.
I compared with glibc el_GR.UTF-8 and el_CY.UTF-8 locales, and the
ctype semantics match C.UTF-8 for all code points. glibc is not doing
this additional tailoring for "el".
Therefore I believe the builtin CTYPE would be very useful for case
mapping (both "simple" and "full") even without this additional
tailoring.
You are correct that ICU will still have some features that won't be
supported by the builtin provider. Better word boundary semantics in
INITCAP() are another advantage.
Regards,
Jeff Davis
On 12/13/23 5:28 AM, Jeff Davis wrote:
On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote:
My biggest concern is around maintenance. Every year Unicode is
assigning new characters to existing code points, and those existing
code points can of course already be stored in old databases before
libs
are updated.
Is the concern only about unassigned code points?
I already committed a function "unicode_assigned()" to test whether a
string contains only assigned code points, which can be used in a
CHECK() constraint. I also posted[5] an idea about a per-database
option that could reject the storage of any unassigned code point,
which would make it easier for users highly concerned about
compatibility.
I didn't know about this. Did a few smoke tests against today's head on
git and it's nice to see the function working as expected. :)
test=# select unicode_version();
unicode_version
-----------------
15.1
test=# select chr(3212),unicode_assigned(chr(3212));
chr | unicode_assigned
-----+------------------
ಌ | t
-- unassigned code point inside assigned block
test=# select chr(3213),unicode_assigned(chr(3213));
chr | unicode_assigned
-----+------------------
| f
test=# select chr(3214),unicode_assigned(chr(3214));
chr | unicode_assigned
-----+------------------
ಎ | t
-- unassigned block
test=# select chr(67024),unicode_assigned(chr(67024));
chr | unicode_assigned
-----+------------------
| f
test=# select chr(67072),unicode_assigned(chr(67072));
chr | unicode_assigned
-----+------------------
𐘀 | t
Looking closer, patches 3 and 4 look like an incremental extension of
this earlier idea; the perl scripts download data from unicode.org and
we've specifically defined Unicode version 15.1 and the scripts turn the
data tables inside-out into C data structures optimized for lookup. That
C code is then checked in to the PostgreSQL source code files
unicode_category.h and unicode_case_table.h - right?
Am I reading correctly that these two patches add C functions
pg_u_prop_* and pg_u_is* (patch 3) and unicode_*case (patch 4) but we
don't yet reference these functions anywhere? So this is just getting
some plumbing in place?
And we may end up with
something like the timezone database where we need to periodically
add a
more current ruleset - albeit alongside as a new version in this
case.
There's a build target "update-unicode" which is run to pull in new
Unicode data files and parse them into static C arrays (we already do
this for the Unicode normalization tables). So I agree that the tables
should be updated but I don't understand why that's a problem.
I don't want to get stuck on this. I agree with the general approach of
beginning to add a provider for locale functions inside the database. We
have a while before Unicode 16 comes out. Plenty of time for bikeshedding.
My prediction is that updating this built-in provider eventually won't
be any different from ICU or glibc. It depends a bit on how we
specifically build on this plumbing - but when Unicode 16 comes out,
I'll try to come up with a simple repro on a default DB config where
changing the Unicode version causes corruption (it was pretty easy to
demonstrate for ICU collation, if you knew where to look)... but I don't
think that discussion should derail this commit, because for now we're
just starting the process of getting Unicode 15.1 into the PostgreSQL
code base. We can cross the "update" bridge when we come to it.
Later on down the road, from a user perspective, I think we should be
careful about confusion where providers are used inconsistently. It's
not great if one function follows built-in Unicode 15.1 rules but another
function uses Unicode 13 rules because it happened to call an ICU
function or a glibc function. We could easily end up with multiple
providers processing different parts of a single SQL statement, which
could lead to strange results in some cases.
Ideally a user just specifies a default provider for their database, and the
rules for that version of Unicode are used as consistently as possible -
unless a user explicitly overrides their choice in a table/column
definition, query, etc. But it might take a little time and work to get
to this point.
-Jeremy
On Fri, 2023-12-15 at 16:30 -0800, Jeremy Schneider wrote:
Looking closer, patches 3 and 4 look like an incremental extension of
this earlier idea;
Yes, it's essentially the same thing extended to a few more files. I
don't know if "incremental" is the right word though; this is a
substantial extension of the idea.
the perl scripts download data from unicode.org and
we've specifically defined Unicode version 15.1 and the scripts turn
the
data tables inside-out into C data structures optimized for lookup.
That
C code is then checked in to the PostgreSQL source code files
unicode_category.h and unicode_case_table.h - right?
Yes. The standard build process shouldn't be downloading files, so the
static tables are checked in. Also, seeing the diffs of the static
tables improves the visibility of changes in case there's some mistake
or big surprise.
Am I reading correctly that these two patches add C functions
pg_u_prop_* and pg_u_is* (patch 3) and unicode_*case (patch 4) but we
don't yet reference these functions anywhere? So this is just getting
some plumbing in place?
Correct. Perhaps I should combine these into the builtin provider
thread, but these are independently testable and reviewable.
My prediction is that updating this built-in provider eventually
won't
be any different from ICU or glibc.
The built-in provider will have several advantages because it's tied to
a PG major version:
* A physical replica can't have different semantics than the primary.
* Easier to document and test.
* Changes are more transparent and can be documented in the release
notes, so that administrators can understand the risks and blast radius
at pg_upgrade time.
Later on down the road, from a user perspective, I think we should be
careful about confusion where providers are used inconsistently. It's
not great if one function follows built-in Unicode 15.1 rules but
another
function uses Unicode 13 rules because it happened to call an ICU
function or a glibc function. We could easily end up with multiple
providers processing different parts of a single SQL statement, which
could lead to strange results in some cases.
The whole concept of "providers" is that they aren't consistent with
each other. ICU, libc, and the builtin provider will all be based on
different versions of Unicode. That's by design.
The built-in provider will be a bit better in the sense that it's
consistent with the normalization functions, and the other providers
aren't.
Regards,
Jeff Davis
On Mon, Dec 18, 2023 at 2:46 PM Jeff Davis <pgsql@j-davis.com> wrote:
The whole concept of "providers" is that they aren't consistent with
each other. ICU, libc, and the builtin provider will all be based on
different versions of Unicode. That's by design.
The built-in provider will be a bit better in the sense that it's
consistent with the normalization functions, and the other providers
aren't.
FWIW, the idea that we're going to develop a built-in provider seems
to be solid, for the reasons Jeff mentions: it can be stable, and
under our control. But it seems like we might need built-in providers
for everything rather than just CTYPE to get those advantages, and I
fear we'll get sucked into needing a lot of tailoring rather than just
being able to get by with one "vanilla" implementation.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, 2023-12-19 at 15:59 -0500, Robert Haas wrote:
FWIW, the idea that we're going to develop a built-in provider seems
to be solid, for the reasons Jeff mentions: it can be stable, and
under our control. But it seems like we might need built-in providers
for everything rather than just CTYPE to get those advantages, and I
fear we'll get sucked into needing a lot of tailoring rather than
just
being able to get by with one "vanilla" implementation.
For the database default collation, I suspect a lot of users would jump
at the chance to have "vanilla" semantics. Tailoring is more important
for individual collation objects than for the database-level collation.
There are reasons you might select a tailored database collation, like
if the set of users accessing it are mostly from a single locale, or if
the application connected to the database is expecting it in a certain
form.
But there are a lot of users for whom neither of those things are true,
and it makes zero sense to order all of the text indexes in the
database according to any one particular locale. I think these users
would prioritize stability and performance for the database collation,
and then use COLLATE clauses with ICU collations where necessary.
The question for me is how good the "vanilla" semantics need to be to
be useful as a database-level collation. Most of the performance and
stability problems come from collation, so it makes sense to me to
provide a fast and stable memcmp collation paired with richer ctype
semantics (as proposed here). Users who want something more probably
want the Unicode "root" collation, which can be provided by ICU today.
I am also still concerned that we have the wrong defaults. Almost
nobody thinks libc is a great provider, but that's the default, and
there were problems trying to change that default to ICU in 16. If we
had a builtin provider, that might be a better basis for a default
(safe, fast, always available, and documentable). Then, at least if
someone picks a different locale at initdb time, they would be doing so
intentionally, rather than implicitly accepting index corruption risks
based on an environment variable.
Regards,
Jeff Davis
Jeff Davis wrote:
But there are a lot of users for whom neither of those things are true,
and it makes zero sense to order all of the text indexes in the
database according to any one particular locale. I think these users
would prioritize stability and performance for the database collation,
and then use COLLATE clauses with ICU collations where necessary.
+1
I am also still concerned that we have the wrong defaults. Almost
nobody thinks libc is a great provider, but that's the default, and
there were problems trying to change that default to ICU in 16. If we
had a builtin provider, that might be a better basis for a default
(safe, fast, always available, and documentable). Then, at least if
someone picks a different locale at initdb time, they would be doing so
intentionally, rather than implicitly accepting index corruption risks
based on an environment variable.
Yes. The introduction of the bytewise-sorting, locale-agnostic
C.UTF-8 in glibc is also a step in the direction of providing better
defaults for apps like Postgres, that need both long-term stability
in sorts and Unicode coverage for ctype-dependent functions.
But C.UTF-8 is not available everywhere, and there's still the
problem that Unicode updates through libc are not aligned
with Postgres releases.
ICU has the advantage of cross-OS compatibility,
but it does not provide any collation with bytewise sorting
like C or C.UTF-8, and we don't allow a combination like
"C" for sorting and ICU for ctype operations. When opting
for a locale provider, it has to be for both sorting
and ctype, so an installation that needs cross-OS
compatibility, good Unicode support and long-term stability
of indexes cannot get that with ICU as we expose it
today.
If the Postgres default was bytewise sorting+locale-agnostic
ctype functions directly derived from Unicode data files,
as opposed to libc/$LANG at initdb time, the main
annoyance would be that "ORDER BY textcol" would no
longer be the human-favored sort.
For the presentation layer, we would have to write for instance
ORDER BY textcol COLLATE "unicode" for the root collation
or a specific region-country if needed.
But all the rest seems better, especially cross-OS compatibility,
truly immutable and faster indexes for fields that
don't require linguistic ordering, alignment between Unicode
updates and Postgres updates.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
If the Postgres default was bytewise sorting+locale-agnostic
ctype functions directly derived from Unicode data files,
as opposed to libc/$LANG at initdb time, the main
annoyance would be that "ORDER BY textcol" would no
longer be the human-favored sort.
For the presentation layer, we would have to write for instance
ORDER BY textcol COLLATE "unicode" for the root collation
or a specific region-country if needed.
But all the rest seems better, especially cross-OS compatibility,
truly immutable and faster indexes for fields that
don't require linguistic ordering, alignment between Unicode
updates and Postgres updates.
Thank you, that summarizes exactly the compromise that I'm trying to
reach.
Regards,
Jeff Davis
On Wed, Dec 20, 2023 at 2:13 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
If the Postgres default was bytewise sorting+locale-agnostic
ctype functions directly derived from Unicode data files,
as opposed to libc/$LANG at initdb time, the main
annoyance would be that "ORDER BY textcol" would no
longer be the human-favored sort.
For the presentation layer, we would have to write for instance
ORDER BY textcol COLLATE "unicode" for the root collation
or a specific region-country if needed.
But all the rest seems better, especially cross-OS compatibity,
truly immutable and faster indexes for fields that
don't require linguistic ordering, alignment between Unicode
updates and Postgres updates.
Thank you, that summarizes exactly the compromise that I'm trying to
reach.
This makes sense to me, too, but it feels like it might work out
better for speakers of English than for speakers of other languages.
Right now, I tend to get databases that default to en_US.utf8, and if
the default changed to C.utf8, then the case-comparison behavior might
be different but the letters would still sort in the right order. For
someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8, a
change to C.utf8 would be a much bigger problem, I would think. Their
alphabet isn't in code point order, and so things would be
alphabetized wrongly. That might be OK if they don't care about
ordering for any purpose other than equality lookups, but otherwise
it's going to force them to change the default, where today they don't
have to do that.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 2023-12-20 at 14:24 -0500, Robert Haas wrote:
This makes sense to me, too, but it feels like it might work out
better for speakers of English than for speakers of other languages.
There's very little in the way of locale-specific tailoring for ctype
behaviors in ICU or glibc -- only for the 'az', 'el', 'lt', and 'tr'
locales. While English speakers like us may benefit from being aligned
with the default ctype behaviors, those behaviors are not at all
specific to 'en' locales in ICU or glibc.
Collation varies a lot more between locales. I wouldn't call memcmp
ideal for English ('Zebra' comes before 'apple', which seems wrong to
me). If memcmp sorting does favor any particular group, I would say it
favors programmers more than English speakers. But that could just be
my perspective and I certainly understand the point that memcmp
ordering is more tolerable for some languages than others.
Right now, I tend to get databases that default to en_US.utf8, and if
the default changed to C.utf8, then the case-comparison behavior might
be different
en_US.UTF-8 and C.UTF-8 have the same ctype behavior.
For someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8, a
change to C.utf8 would be a much bigger problem, I would think.
Those locales all have the same ctype behavior.
It turns out that en_US.UTF-8 and fr_FR.UTF-8 also have the same
collation order -- no tailoring beyond root collation according to CLDR
files for 'en' and 'fr' (though note that 'fr_CA' does have tailoring).
That doesn't mean the experience of switching to memcmp order is
exactly the same for a French speaker and an English speaker, but I
think it's interesting.
That might be OK if they don't care about
ordering for any purpose other than equality lookups, but otherwise
it's going to force them to change the default, where today they don't
have to do that.
To be clear, I haven't proposed changing the initdb default. This
thread is about adding a builtin provider with builtin ctype, which I
believe a lot of users would like.
It also might be the best chance we have to get to a reasonable default
behavior at some point in the future. It would be always available,
fast, stable, better semantics than "C" for many locales, and we can
document it. In any case, we don't need to decide that now. If the
builtin provider is useful, we should do it.
Regards,
Jeff Davis
On 12/5/23 3:46 PM, Jeff Davis wrote:
CTYPE, which handles character classification and upper/lowercasing
behavior, may be simpler than it first appears. We may be able to get
a net decrease in complexity by just building in most (or perhaps all)
of the functionality.

=== Character Classification ===
Character classification is used for regexes, e.g. whether a character
is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]"
class. Unicode defines what character properties map into these
classes in TR #18 [1], specifying both a "Standard" variant and a
"POSIX Compatible" variant. The main difference with the POSIX variant
is that symbols count as punctuation.

=== LOWER()/INITCAP()/UPPER() ===
The LOWER() and UPPER() functions are defined in the SQL spec with
surprising detail, relying on specific Unicode General Category
assignments. How to map characters seems to be left (implicitly) up to
Unicode. If the input string is normalized, the output string must be
normalized, too. Weirdly, there's no room in the SQL spec to localize
LOWER()/UPPER() at all to handle issues like [1]. Also, the standard
specifies one example, which is that "ß" becomes "SS" when folded to
upper case. INITCAP() is not in the SQL spec.
I'll be honest, even though this is primarily about CTYPE and not
collation, I still need to keep re-reading the initial email slowly to
let it sink in and better understand it... at least for me, it's complex
to reason through. 🙂
I'm trying to make sure I understand clearly what the user impact/change
is that we're talking about: after a little bit of brainstorming and
looking through the PG docs, I'm actually not seeing much more than
these two things you've mentioned here: the set of regexp_* functions PG
provides, and these three generic functions. That alone doesn't seem
highly concerning.
I haven't checked the source code for the regexp_* functions yet, but
are these just passing through to an external library? Are we actually
able to easily change the CTYPE provider for them? If nobody
knows/replies then I'll find some time to look.
One other thing that comes to mind: how does the parser do case folding
for relation names? Is that using OS-provided libc as of today? Or did
we code it to use ICU if that's the DB default? I'm guessing libc, and
global catalogs probably need to be handled in a consistent manner, even
across different encodings.
(Kind of related... did you ever see the demo where I create a user named
'🏃' and then I try to connect to a database with non-unicode encoding?
💥😜 ...at least it seems to be able to walk the index without decoding
strings to find other users - but the way these global catalogs work
scares me a little bit)
-Jeremy
On 12/20/23 3:47 PM, Jeremy Schneider wrote:
On 12/5/23 3:46 PM, Jeff Davis wrote:
CTYPE, which handles character classification and upper/lowercasing
behavior, may be simpler than it first appears. We may be able to get
a net decrease in complexity by just building in most (or perhaps all)
of the functionality.

I'll be honest, even though this is primarily about CTYPE and not
collation, I still need to keep re-reading the initial email slowly to
let it sink in and better understand it... at least for me, it's complex
to reason through. 🙂

I'm trying to make sure I understand clearly what the user impact/change
is that we're talking about: after a little bit of brainstorming and
looking through the PG docs, I'm actually not seeing much more than
these two things you've mentioned here: the set of regexp_* functions PG
provides, and these three generic functions. That alone doesn't seem
highly concerning.
I missed citext, which extends impact to replace(), split_part(),
strpos() and translate(). There are also the five *_REGEX() functions
from the SQL standard which I assume are just calling the PG functions.
I just saw the krb_caseins_users GUC, which reminds me that PLs also
have their own case functions. And of course extensions. I'm not saying
any of this is in scope for the change here, but I'm just trying to wrap
my brain around all the places we've got CTYPE processing happening, to
better understand the big picture. It might help tease out unexpected
small glitches from changing one thing but not another one.
-Jeremy
On 12/20/23 4:04 PM, Jeremy Schneider wrote:
On 12/20/23 3:47 PM, Jeremy Schneider wrote:
On 12/5/23 3:46 PM, Jeff Davis wrote:
CTYPE, which handles character classification and upper/lowercasing
behavior, may be simpler than it first appears. We may be able to get
a net decrease in complexity by just building in most (or perhaps all)
of the functionality.

I'll be honest, even though this is primarily about CTYPE and not
collation, I still need to keep re-reading the initial email slowly to
let it sink in and better understand it... at least for me, it's complex
to reason through. 🙂

I'm trying to make sure I understand clearly what the user impact/change
is that we're talking about: after a little bit of brainstorming and
looking through the PG docs, I'm actually not seeing much more than
these two things you've mentioned here: the set of regexp_* functions PG
provides, and these three generic functions. That alone doesn't seem
highly concerning.

I missed citext, which extends impact to replace(), split_part(),
strpos() and translate(). There are also the five *_REGEX() functions
from the SQL standard which I assume are just calling the PG functions.
found some more. here's my running list of everything user-facing I see
in core PG code so far that might involve case:
* upper/lower/initcap
* regexp_*() and *_REGEXP()
* ILIKE, operators ~* !~* ~~ !~~ ~~* !~~*
* citext + replace(), split_part(), strpos() and translate()
* full text search - everything is case folded
* unaccent? not clear to me whether CTYPE includes accent folding
* ltree
* pg_trgm
* core PG parser, case folding of relation names
On Wed, Dec 20, 2023 at 5:57 PM Jeff Davis <pgsql@j-davis.com> wrote:
Those locales all have the same ctype behavior.
Sigh. I keep getting confused about how that works...
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 2023-12-20 at 16:29 -0800, Jeremy Schneider wrote:
found some more. here's my running list of everything user-facing I see
in core PG code so far that might involve case:

* upper/lower/initcap
* regexp_*() and *_REGEXP()
* ILIKE, operators ~* !~* ~~ !~~ ~~* !~~*
* citext + replace(), split_part(), strpos() and translate()
* full text search - everything is case folded
* unaccent? not clear to me whether CTYPE includes accent folding
No, ctype has nothing to do with accents as far as I can tell. I don't
know if I'm using the right terminology, but I think "case" is a
variant of a character whereas "accent" is a modifier/mark, and the
mark is a separate concept from the character itself.
* ltree
* pg_trgm
* core PG parser, case folding of relation names
Let's separate it into groups.
(1) Callers that use a collation OID or pg_locale_t:
* collation & hashing
* upper/lower/initcap
* regex, LIKE, formatting
* pg_trgm (which uses regexes)
* maybe postgres_fdw, but might just be a passthrough
* catalog cache (always uses DEFAULT_COLLATION_OID)
* citext (always uses DEFAULT_COLLATION_OID, but probably shouldn't)
(2) A long tail of callers that depend on what LC_CTYPE/LC_COLLATE are
set to, or use ad-hoc ASCII-only semantics:
* core SQL parser downcase_identifier()
* callers of pg_strcasecmp() (DDL, etc.)
* GUC name case folding
* full text search ("mylocale = 0 /* TODO */")
* a ton of stuff uses isspace(), isdigit(), etc.
* various callers of tolower()/toupper()
* some selfuncs.c stuff
* ...
Might have missed some places.
The user impact of a new builtin provider would affect (1), but only
for those actually using the provider. So there's no compatibility risk
there, but it's good to understand what it will affect.
We can, on a case-by-case basis, also consider using the new APIs I'm
proposing for instances of (2). There would be some compatibility risk
there for existing callers, and we'd have to consider whether it's
worth it or not. Ideally, new callers would either use the new APIs or
use the pg_ascii_* APIs.
Regards,
Jeff Davis
On Wed, 2023-12-20 at 15:47 -0800, Jeremy Schneider wrote:
One other thing that comes to mind: how does the parser do case folding
for relation names? Is that using OS-provided libc as of today? Or did
we code it to use ICU if that's the DB default? I'm guessing libc, and
global catalogs probably need to be handled in a consistent manner, even
across different encodings.
The code is in downcase_identifier():
    /*
     * SQL99 specifies Unicode-aware case normalization, which we don't
     * yet have the infrastructure for...
     */
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
        ch = tolower(ch);
    result[i] = (char) ch;
My proposal would add the infrastructure that the comment above says is
missing.
It seems like we should be using the database collation at this point
because you don't want inconsistency between the catalogs and the
parser here. Then again, the SQL spec doesn't seem to support tailoring
of case conversions, so maybe we are avoiding it for that reason? Or
maybe we're avoiding catalog access? Or perhaps the work for ICU just
wasn't done here yet?
(Kind of related... did you ever see the demo where I create a user named
'🏃' and then I try to connect to a database with non-unicode encoding?
💥😜 ...at least it seems to be able to walk the index without decoding
strings to find other users - but the way these global catalogs work
scares me a little bit)
I didn't see that specific demo, but in general we seem to change
between pg_wchar and unicode code points too freely, so I'm not
surprised that something went wrong.
Regards,
Jeff Davis