Built-in CTYPE provider
CTYPE, which handles character classification and upper/lowercasing
behavior, may be simpler than it first appears. We may be able to get
a net decrease in complexity by just building in most (or perhaps all)
of the functionality.
Unicode offers relatively simple rules for CTYPE-like functionality
based on data files. There are a few exceptions and a few options,
which I'll address below.
(In contrast, collation varies a lot from locale to locale, and has a
lot more options and nuance than ctype.)
=== Proposal ===
Parse some Unicode data files into static lookup tables in .h files
(similar to what we already do for normalization) and provide
functions to perform the right lookups according to Unicode
recommendations [1][2]. Then expose the functionality as either a
specially-named locale for the libc provider, or as part of the
built-in collation provider which I previously proposed [3]. (Provided
patches don't expose the functionality yet; I'm looking for feedback
first.)
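To make the proposal concrete, here's a rough Python sketch (not taken from the patches; the struct name pg_unicode_case_entry and the exact table layout are made up) of what an "update-unicode"-style script does: parse UnicodeData.txt records and emit a static C lookup table for checking in to a .h file:

```python
# Simplified sketch of an "update-unicode"-style generator. Real
# UnicodeData.txt lines have 15 semicolon-separated fields; field 2 is
# the general category and field 12 the simple uppercase mapping.
SAMPLE = """\
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
"""

def parse_records(text):
    """Yield (codepoint, general_category, simple_uppercase) tuples.
    A code point with no simple uppercase mapping maps to itself
    (note U+00DF above: "ss" only exists as a *full* mapping)."""
    for line in text.splitlines():
        fields = line.split(";")
        cp = int(fields[0], 16)
        category = fields[2]
        upper = int(fields[12], 16) if fields[12] else cp
        yield cp, category, upper

def emit_c_table(records):
    """Render records as a static C array, like the checked-in .h files."""
    rows = ",\n".join(
        f"\t{{0x{cp:06X}, \"{cat}\", 0x{up:06X}}}" for cp, cat, up in records
    )
    return ("static const pg_unicode_case_entry case_map[] =\n{\n"
            + rows + "\n};\n")

print(emit_c_table(parse_records(SAMPLE)))
```

Checking the generated table in (rather than generating at build time) is what makes diffs visible at update-unicode time.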
Using libc or ICU for a CTYPE provider would still be supported, but
as I explain below, there's not nearly as much reason to do so as you
might expect. As far as I can tell, using an external provider for
CTYPE functionality is mostly unnecessary complexity and magic.
There's still plenty of reason to use the plain "C" semantics, if
desired, but those semantics are already built-in.
=== Benefits ===
* platform-independent ctype semantics based on Unicode, not tied to
any dependency's implementation
* ability to combine fast memcmp() collation with rich ctype
semantics
* user-visible semantics can be documented and tested
* stability within a PG major version
* transparency of changes: tables would be checked in to .h files,
so whoever runs the "update-unicode" build target would see if
there are unexpected or impactful changes that should be addressed
in the release notes
* the built-in tables themselves can be tested exhaustively by
comparing with ICU so we can detect trivial parsing errors and the
like
=== Character Classification ===
Character classification is used for regexes, e.g. whether a character
is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]"
class. Unicode defines what character properties map into these
classes in TR #18 [1], specifying both a "Standard" variant and a
"POSIX Compatible" variant. The main difference with the POSIX variant
is that symbols count as punctuation.
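For illustration, the TR #18 class definitions can be approximated with Python's unicodedata module (the helper names here are mine, not from the patches):

```python
import unicodedata

def is_digit(ch):
    # TR #18 "digit" (\d): general category Nd (decimal number).
    return unicodedata.category(ch) == "Nd"

def is_punct_standard(ch):
    # "Standard" punct: punctuation only (general category P*).
    return unicodedata.category(ch).startswith("P")

def is_punct_posix(ch):
    # "POSIX Compatible" punct: punctuation *and* symbols (P* or S*).
    return unicodedata.category(ch)[0] in ("P", "S")

assert is_digit("7") and is_digit("٧")   # U+0667 ARABIC-INDIC DIGIT SEVEN
assert is_punct_standard("!")
assert not is_punct_standard("+")        # "+" is Sm (math symbol) ...
assert is_punct_posix("+")               # ... so only POSIX counts it
```

Note these classifications are the same regardless of locale, which is what makes building them in tractable.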
Character classification in Unicode does not vary from locale to
locale. The same character is considered to be a member of the same
classes regardless of locale (in other words, there's no
"tailoring"). There is no strong compatibility guarantee around the
classification of characters, but it doesn't seem to change much in
practice (I could collect more data here if it matters).
In glibc, character classification is not affected by the locale as
far as I can tell -- all non-"C" locales behave like "C.UTF-8"
(perhaps other libc implementations or versions or custom locales
behave differently -- corrections welcome). There are some differences
between "C.UTF-8" and what Unicode seems to recommend, and I'm not
entirely sure why those differences exist or whether those differences
are important for anything other than compatibility.
Note: ICU offers character classification based on Unicode standards,
too, but the fact that it's an external dependency makes it a
difficult-to-test black box that is not tied to a PG major
version. Also, we currently don't use the APIs that Unicode
recommends; so in Postgres today, ICU-based character classification
is further from Unicode than glibc character classification.
=== LOWER()/INITCAP()/UPPER() ===
The LOWER() and UPPER() functions are defined in the SQL spec with
surprising detail, relying on specific Unicode General Category
assignments. How to map characters seems to be left (implicitly) up to
Unicode. If the input string is normalized, the output string must be
normalized, too. Weirdly, there's no room in the SQL spec to localize
LOWER()/UPPER() at all to handle issues like [1]. Also, the standard
specifies one example, which is that "ß" becomes "SS" when folded to
upper case. INITCAP() is not in the SQL spec.
In Unicode, lowercasing and uppercasing behavior is a mapping [2], and
also backed by a strong compatibility guarantee that "case pairs" will
always remain case pairs [4]. The mapping may be "simple"
(context-insensitive, locale-insensitive, not adding any code points),
or "full" (may be context-sensitive, locale-sensitive, and one code
point may turn into 1-3 code points).
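Python's str.upper() happens to implement the full mapping, which makes the simple/full difference easy to demonstrate:

```python
# Full mapping can change the number of code points; simple mapping cannot
# (simple-mapped "ß" stays "ß", since U+00DF has no simple uppercase).
assert "ß".upper() == "SS"      # full uppercase: 1 -> 2 code points
assert "ﬁ".upper() == "FI"      # U+FB01 LATIN SMALL LIGATURE FI -> "FI"
assert len("straße".upper()) == len("straße") + 1
```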
Titlecasing (INITCAP() in Postgres) in Unicode is similar to
upper/lowercasing, except that it has the additional complexity of
finding word boundaries, which have a non-trivial definition. To
simplify, we'd either use the Postgres definition (alphanumeric) or
the "word" character class specified in [1]. If someone wants more
sophisticated word segmentation they could use ICU.
While "full" case mapping sounds more complex, there are actually very
few cases to consider and they are covered in another (small) data
file. That data file covers ~100 code points that convert to multiple
code points when the case changes (e.g. "ß" -> "SS"), 7 code points
that have context-sensitive mappings, and three locales which have
special conversions ("lt", "tr", and "az") for a few code points.
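One of the handful of context-sensitive mappings is Greek capital sigma, which lowercases to final sigma at the end of a word; Python's str.lower(), which also implements the full mapping, shows the behavior:

```python
# Context-sensitive full mapping: Σ (U+03A3) lowercases to final sigma
# ς (U+03C2) in word-final position, ordinary σ (U+03C3) elsewhere.
assert "Σ".lower() == "σ"               # isolated: ordinary small sigma
assert "ΟΔΟΣ".lower() == "οδος"         # word-final: final sigma
assert "ΟΔΟΣ".lower()[-1] == "\u03c2"
```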
ICU can do the simple case mapping (u_tolower(), etc.) or full mapping
(u_strToLower(), etc.). I see one difference in ICU that I can't yet
explain for the full titlecase mapping of a singular U+0345.
glibc in UTF8 (at least in my tests) just does the simple upper/lower
case mapping, extended with simple mappings for the locales with
special conversions (which I think are exactly the same 3 locales
mentioned above). libc doesn't do titlecase. If the resulting character
isn't representable in the server encoding, I think libc just maps the
character to itself, though I should test this assumption.
=== Encodings ===
It's easiest to implement these rules in UTF8, but possible for any
encoding where we can decode to a Unicode code point.
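For example (Python sketch): a byte in another server encoding, such as LATIN-1, decodes to the same code point that the lookup tables are keyed on:

```python
# Any server encoding we can decode to Unicode code points can use the
# same lookup tables; e.g. LATIN-1 byte 0xDF decodes to U+00DF ("ß").
cp = ord(b"\xdf".decode("latin-1"))
assert cp == 0x00DF
```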
=== Patches ===
0001 & 0002 are just cleanup. I intend to commit them unless someone
has a comment.
0003 implements character classification ("Standard" and "POSIX
Compatible" variants) but doesn't actually use them for anything.
0004 implements "simple" case mapping, and a partial implementation of
"full" case mapping. Again, does not use them yet.
=== Questions ===
* Is a built-in ctype provider a reasonable direction for Postgres as
a project?
* Does it feel like it would be simpler or more complex than what
we're doing now?
* Do we want to just try to improve our ICU support instead?
* Do we want the built-in provider to be one thing, or have a few
options (e.g. "standard" or "posix" character classification;
"simple" or "full" case mapping)?
Regards,
Jeff Davis
[1]: http://www.unicode.org/reports/tr18/#Compatibility_Properties
[2]: https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G33992
[3]: /messages/by-id/9d63548c4d86b0f820e1ff15a83f93ed9ded4543.camel@j-davis.com
[4]: https://www.unicode.org/policies/stability_policy.html#Case_Pair
--
Jeff Davis
PostgreSQL Contributor Team - AWS
Attachments:
* v2-0004-Add-unicode-case-mapping-tables-and-functions.patch
* v2-0003-Add-Unicode-property-tables.patch
* v2-0002-Shrink-unicode-category-table.patch
* v2-0001-Minor-cleanup-for-unicode-update-build-and-test.patch
On 12/5/23 3:46 PM, Jeff Davis wrote:
=== Character Classification ===
Character classification is used for regexes, e.g. whether a character
is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]"
class. Unicode defines what character properties map into these
classes in TR #18 [1], specifying both a "Standard" variant and a
"POSIX Compatible" variant. The main difference with the POSIX variant
is that symbols count as punctuation.
=== LOWER()/INITCAP()/UPPER() ===
The LOWER() and UPPER() functions are defined in the SQL spec with
surprising detail, relying on specific Unicode General Category
assignments. How to map characters seems to be left (implicitly) up to
Unicode. If the input string is normalized, the output string must be
normalized, too. Weirdly, there's no room in the SQL spec to localize
LOWER()/UPPER() at all to handle issues like [1]. Also, the standard
specifies one example, which is that "ß" becomes "SS" when folded to
upper case. INITCAP() is not in the SQL spec.
=== Questions ===
* Is a built-in ctype provider a reasonable direction for Postgres as
a project?
* Does it feel like it would be simpler or more complex than what
we're doing now?
* Do we want to just try to improve our ICU support instead?
* Do we want the built-in provider to be one thing, or have a few
options (e.g. "standard" or "posix" character classification;
"simple" or "full" case mapping)?
Generally, I am in favor of this - I think we need to move in the
direction of having an in-database option around unicode for PG users,
given how easy it is for administrators to mis-manage dependencies.
Especially when OS admins can be different from DB admins, and when
nobody really understands risks of changing libs with in-place moves to
new operating systems - except for like 4 of us on the mailing lists.
My biggest concern is around maintenance. Every year Unicode is
assigning new characters to existing code points, and those existing
code points can of course already be stored in old databases before libs
are updated. When users start to notice that regex [[:digit:]] or
upper/lower functions aren't working correctly with characters in their
DB, they'll probably come asking for fixes. And we may end up with
something like the timezone database where we need to periodically add a
more current ruleset - albeit alongside as a new version in this case.
Here are direct links to charts of newly assigned characters from the
last few Unicode updates:
2022: https://www.unicode.org/charts/PDF/Unicode-15.0/
2021: https://www.unicode.org/charts/PDF/Unicode-14.0/
2020: https://www.unicode.org/charts/PDF/Unicode-13.0/
2019: https://www.unicode.org/charts/PDF/Unicode-12.0/
If I'm reading the Unicode 15 update correctly, PostgreSQL regex
expressions with [[:digit:]] will not correctly identify Kaktovik or Nag
Mundari or Kawi digits without that update to character type specs.
If I'm reading the Unicode 12 update correctly, then upper/lower
functions aren't going to work correctly on Latin Glottal A and I and U
characters without that update to character type specs.
Overall I see a lot fewer Unicode updates involving upper/lower than I
do with digits - especially since new scripts often involve their own
numbering characters which makes new digits more common.
But let's remember that people like to build indexes on character
classification functions like upper/lower, for case insensitive
searching. It's another case where the index will be corrupted if
someone happened to store Latin Glottal vowels in their database and
then we update libs to the latest character type rules.
So even with something as basic as character type, if we're going to do
it right, we still need to either version it or definitively decide that
we're not going to ever support newly added Unicode characters like
Latin Glottals.
-Jeremy
On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote:
My biggest concern is around maintenance. Every year Unicode is
assigning new characters to existing code points, and those existing
code points can of course already be stored in old databases before
libs
are updated.
Is the concern only about unassigned code points?
I already committed a function "unicode_assigned()" to test whether a
string contains only assigned code points, which can be used in a
CHECK() constraint. I also posted [5] an idea about a per-database
option that could reject the storage of any unassigned code point,
which would make it easier for users highly concerned about
compatibility.
And we may end up with
something like the timezone database where we need to periodically
add a
more current ruleset - albeit alongside as a new version in this
case.
There's a build target "update-unicode" which is run to pull in new
Unicode data files and parse them into static C arrays (we already do
this for the Unicode normalization tables). So I agree that the tables
should be updated but I don't understand why that's a problem.
If I'm reading the Unicode 15 update correctly, PostgreSQL regex
expressions with [[:digit:]] will not correctly identify Kaktovik or
Nag
Mundari or Kawi digits without that update to character type specs.
Yeah, if we are behind in the Unicode version, then results won't be
the most up-to-date. But ICU or libc could also be behind in the
Unicode version.
But let's remember that people like to build indexes on character
classification functions like upper/lower, for case insensitive
searching.
UPPER()/LOWER() are based on case mapping, not character
classification.
I intend to introduce a SQL-level CASEFOLD() function that would obey
Unicode casefolding rules, which have very strong compatibility
guarantees [6] (essentially, if you are only using assigned code points,
you are fine).
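Python's str.casefold() implements Unicode case folding and illustrates the intended behavior (the SQL-level CASEFOLD() function itself is only proposed here, not yet implemented):

```python
# Case folding is a one-way normalization for caseless matching; unlike
# lowercasing it is locale-independent and has strong stability guarantees.
assert "Straße".casefold() == "strasse"   # ß folds to "ss"
assert "MASSE".casefold() == "masse"      # so "Straße" vs "MASSE" differ
assert "Straße".lower() == "straße"       # lower() keeps ß as-is
```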
It's another case where the index will be corrupted if
someone happened to store Latin Glottal vowels in their database and
then we update libs to the latest character type rules.
I don't agree with this characterization at all.
(a) It's not "another case". Corruption of an index on LOWER() can
happen today. My proposal makes the situation better, not worse.
(b) These aren't libraries, I am proposing built-in Unicode tables
that only get updated in a new major PG version.
(c) It likely only affects a small number of indexes and it's easier
for an administrator to guess which ones might be affected, making it
easier to just rebuild those indexes.
(d) It's not a problem if you stick to assigned code points.
So even with something as basic as character type, if we're going to
do
it right, we still need to either version it or definitively decide
that
we're not going to ever support newly added Unicode characters like
Latin Glottals.
If, by "version it", you mean "update the data tables in new Postgres
versions", then I agree. If you mean that one PG version would need to
support many versions of Unicode, I don't agree.
Regards,
Jeff Davis
[5]: /messages/by-id/c5e9dac884332824e0797937518da0b8766c1238.camel@j-davis.com
[6]: https://www.unicode.org/policies/stability_policy.html#Case_Folding
Jeff Davis wrote:
While "full" case mapping sounds more complex, there are actually
very few cases to consider and they are covered in another (small)
data file. That data file covers ~100 code points that convert to
multiple code points when the case changes (e.g. "ß" -> "SS"), 7
code points that have context-sensitive mappings, and three locales
which have special conversions ("lt", "tr", and "az") for a few code
points.
But there are CLDR mappings on top of that.
According to the Unicode FAQ
https://unicode.org/faq/casemap_charprop.html#5
Q: Does the default case mapping work for every language? What
about the default case folding?
[...]
To make case mapping language sensitive, the Unicode Standard
specifically allows implementations to tailor the mappings for
each language, but does not provide the necessary data. The file
SpecialCasing.txt is included in the Standard as a guide to a few
of the more important individual character mappings needed for
specific languages, notably the Greek script and the Turkic
languages. However, for most language-specific mappings and
tailoring, users should refer to CLDR and other resources.
In particular "el" (modern greek) has case mapping rules that
ICU seems to implement, but "el" is missing from the list
("lt", "tr", and "az") you identified.
The CLDR case mappings seem to be found in
https://github.com/unicode-org/cldr/tree/main/common/transforms
in *-Lower.xml and *-Upper.xml
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote:
But there are CLDR mappings on top of that.
I see, thank you.
Would it still be called "full" case mapping to only use the mappings
in SpecialCasing.txt? And would that be useful?
Regards,
Jeff Davis
On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote:
In particular "el" (modern greek) has case mapping rules that
ICU seems to implement, but "el" is missing from the list
("lt", "tr", and "az") you identified.
I compared with glibc el_GR.UTF-8 and el_CY.UTF-8 locales, and the
ctype semantics match C.UTF-8 for all code points. glibc is not doing
this additional tailoring for "el".
Therefore I believe the builtin CTYPE would be very useful for case
mapping (both "simple" and "full") even without this additional
tailoring.
You are correct that ICU will still have some features that won't be
supported by the builtin provider. Better word boundary semantics in
INITCAP() are another advantage.
Regards,
Jeff Davis
On 12/13/23 5:28 AM, Jeff Davis wrote:
On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote:
My biggest concern is around maintenance. Every year Unicode is
assigning new characters to existing code points, and those existing
code points can of course already be stored in old databases before
libs
are updated.
Is the concern only about unassigned code points?
I already committed a function "unicode_assigned()" to test whether a
string contains only assigned code points, which can be used in a
CHECK() constraint. I also posted[5] an idea about a per-database
option that could reject the storage of any unassigned code point,
which would make it easier for users highly concerned about
compatibility.
I didn't know about this. Did a few smoke tests against today's head on
git and it's nice to see the function working as expected. :)
test=# select unicode_version();
unicode_version
-----------------
15.1
test=# select chr(3212),unicode_assigned(chr(3212));
chr | unicode_assigned
-----+------------------
ಌ | t
-- unassigned code point inside assigned block
test=# select chr(3213),unicode_assigned(chr(3213));
chr | unicode_assigned
-----+------------------
| f
test=# select chr(3214),unicode_assigned(chr(3214));
chr | unicode_assigned
-----+------------------
ಎ | t
-- unassigned block
test=# select chr(67024),unicode_assigned(chr(67024));
chr | unicode_assigned
-----+------------------
| f
test=# select chr(67072),unicode_assigned(chr(67072));
chr | unicode_assigned
-----+------------------
𐘀 | t
Looking closer, patches 3 and 4 look like an incremental extension of
this earlier idea; the perl scripts download data from unicode.org and
we've specifically defined Unicode version 15.1 and the scripts turn the
data tables inside-out into C data structures optimized for lookup. That
C code is then checked in to the PostgreSQL source code files
unicode_category.h and unicode_case_table.h - right?
Am I reading correctly that these two patches add C functions
pg_u_prop_* and pg_u_is* (patch 3) and unicode_*case (patch 4) but we
don't yet reference these functions anywhere? So this is just getting
some plumbing in place?
And we may end up with
something like the timezone database where we need to periodically
add a
more current ruleset - albeit alongside as a new version in this
case.
There's a build target "update-unicode" which is run to pull in new
Unicode data files and parse them into static C arrays (we already do
this for the Unicode normalization tables). So I agree that the tables
should be updated but I don't understand why that's a problem.
I don't want to get stuck on this. I agree with the general approach of
beginning to add a provider for locale functions inside the database. We
have a while before Unicode 16 comes out. Plenty of time for bikeshedding.
My prediction is that updating this built-in provider eventually won't
be any different from ICU or glibc. It depends a bit on how we
specifically build on this plumbing - but when Unicode 16 comes out,
I'll try to come up with a simple repro on a default DB config where
changing the Unicode version causes corruption (it was pretty easy to
demonstrate for ICU collation, if you knew where to look)... but I don't
think that discussion should derail this commit, because for now we're
just starting the process of getting Unicode 15.1 into the PostgreSQL
code base. We can cross the "update" bridge when we come to it.
Later on down the road, from a user perspective, I think we should be
careful about confusion where providers are used inconsistently. It's
not great if one function follows built-in Unicode 15.1 rules but another
function uses Unicode 13 rules because it happened to call an ICU
function or a glibc function. We could easily end up with multiple
providers processing different parts of a single SQL statement, which
could lead to strange results in some cases.
Ideally a user just specifies a default provider for their database, and the
rules for that version of Unicode are used as consistently as possible -
unless a user explicitly overrides their choice in a table/column
definition, query, etc. But it might take a little time and work to get
to this point.
-Jeremy
On Fri, 2023-12-15 at 16:30 -0800, Jeremy Schneider wrote:
Looking closer, patches 3 and 4 look like an incremental extension of
this earlier idea;
Yes, it's essentially the same thing extended to a few more files. I
don't know if "incremental" is the right word though; this is a
substantial extension of the idea.
the perl scripts download data from unicode.org and
we've specifically defined Unicode version 15.1 and the scripts turn
the
data tables inside-out into C data structures optimized for lookup.
That
C code is then checked in to the PostgreSQL source code files
unicode_category.h and unicode_case_table.h - right?
Yes. The standard build process shouldn't be downloading files, so the
static tables are checked in. Also, seeing the diffs of the static
tables improves the visibility of changes in case there's some mistake
or big surprise.
Am I reading correctly that these two patches add C functions
pg_u_prop_* and pg_u_is* (patch 3) and unicode_*case (patch 4) but we
don't yet reference these functions anywhere? So this is just getting
some plumbing in place?
Correct. Perhaps I should combine these into the builtin provider
thread, but these are independently testable and reviewable.
My prediction is that updating this built-in provider eventually
won't
be any different from ICU or glibc.
The built-in provider will have several advantages because it's tied to
a PG major version:
* A physical replica can't have different semantics than the primary.
* Easier to document and test.
* Changes are more transparent and can be documented in the release
notes, so that administrators can understand the risks and blast radius
at pg_upgrade time.
Later on down the road, from a user perspective, I think we should be
careful about confusion where providers are used inconsistently. It's
not great if one function follows built-in Unicode 15.1 rules but
another
function uses Unicode 13 rules because it happened to call an ICU
function or a glibc function. We could easily end up with multiple
providers processing different parts of a single SQL statement, which
could lead to strange results in some cases.
The whole concept of "providers" is that they aren't consistent with
each other. ICU, libc, and the builtin provider will all be based on
different versions of Unicode. That's by design.
The built-in provider will be a bit better in the sense that it's
consistent with the normalization functions, and the other providers
aren't.
Regards,
Jeff Davis
On Mon, Dec 18, 2023 at 2:46 PM Jeff Davis <pgsql@j-davis.com> wrote:
The whole concept of "providers" is that they aren't consistent with
each other. ICU, libc, and the builtin provider will all be based on
different versions of Unicode. That's by design.
The built-in provider will be a bit better in the sense that it's
consistent with the normalization functions, and the other providers
aren't.
FWIW, the idea that we're going to develop a built-in provider seems
to be solid, for the reasons Jeff mentions: it can be stable, and
under our control. But it seems like we might need built-in providers
for everything rather than just CTYPE to get those advantages, and I
fear we'll get sucked into needing a lot of tailoring rather than just
being able to get by with one "vanilla" implementation.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, 2023-12-19 at 15:59 -0500, Robert Haas wrote:
FWIW, the idea that we're going to develop a built-in provider seems
to be solid, for the reasons Jeff mentions: it can be stable, and
under our control. But it seems like we might need built-in providers
for everything rather than just CTYPE to get those advantages, and I
fear we'll get sucked into needing a lot of tailoring rather than
just
being able to get by with one "vanilla" implementation.
For the database default collation, I suspect a lot of users would jump
at the chance to have "vanilla" semantics. Tailoring is more important
for individual collation objects than for the database-level collation.
There are reasons you might select a tailored database collation, like
if the set of users accessing it are mostly from a single locale, or if
the application connected to the database is expecting it in a certain
form.
But there are a lot of users for whom neither of those things are true,
and it makes zero sense to order all of the text indexes in the
database according to any one particular locale. I think these users
would prioritize stability and performance for the database collation,
and then use COLLATE clauses with ICU collations where necessary.
The question for me is how good the "vanilla" semantics need to be to
be useful as a database-level collation. Most of the performance and
stability problems come from collation, so it makes sense to me to
provide a fast and stable memcmp collation paired with richer ctype
semantics (as proposed here). Users who want something more probably
want the Unicode "root" collation, which can be provided by ICU today.
I am also still concerned that we have the wrong defaults. Almost
nobody thinks libc is a great provider, but that's the default, and
there were problems trying to change that default to ICU in 16. If we
had a builtin provider, that might be a better basis for a default
(safe, fast, always available, and documentable). Then, at least if
someone picks a different locale at initdb time, they would be doing so
intentionally, rather than implicitly accepting index corruption risks
based on an environment variable.
Regards,
Jeff Davis
Jeff Davis wrote:
But there are a lot of users for whom neither of those things are true,
and it makes zero sense to order all of the text indexes in the
database according to any one particular locale. I think these users
would prioritize stability and performance for the database collation,
and then use COLLATE clauses with ICU collations where necessary.
+1
I am also still concerned that we have the wrong defaults. Almost
nobody thinks libc is a great provider, but that's the default, and
there were problems trying to change that default to ICU in 16. If we
had a builtin provider, that might be a better basis for a default
(safe, fast, always available, and documentable). Then, at least if
someone picks a different locale at initdb time, they would be doing so
intentionally, rather than implicitly accepting index corruption risks
based on an environment variable.
Yes. The introduction of the bytewise-sorting, locale-agnostic
C.UTF-8 in glibc is also a step in the direction of providing better
defaults for apps like Postgres, that need both long-term stability
in sorts and Unicode coverage for ctype-dependent functions.
But C.UTF-8 is not available everywhere, and there's still the
problem that Unicode updates through libc are not aligned
with Postgres releases.
ICU has the advantage of cross-OS compatibility,
but it does not provide any collation with bytewise sorting
like C or C.UTF-8, and we don't allow a combination like
"C" for sorting and ICU for ctype operations. When opting
for a locale provider, it has to be for both sorting
and ctype, so an installation that needs cross-OS
compatibility, good Unicode support and long-term stability
of indexes cannot get that with ICU as we expose it
today.
If the Postgres default was bytewise sorting+locale-agnostic
ctype functions directly derived from Unicode data files,
as opposed to libc/$LANG at initdb time, the main
annoyance would be that "ORDER BY textcol" would no
longer be the human-favored sort.
For the presentation layer, we would have to write for instance
ORDER BY textcol COLLATE "unicode" for the root collation
or a specific region-country if needed.
But all the rest seems better, especially cross-OS compatibility,
truly immutable and faster indexes for fields that
don't require linguistic ordering, alignment between Unicode
updates and Postgres updates.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
If the Postgres default was bytewise sorting+locale-agnostic
ctype functions directly derived from Unicode data files,
as opposed to libc/$LANG at initdb time, the main
annoyance would be that "ORDER BY textcol" would no
longer be the human-favored sort.
For the presentation layer, we would have to write for instance
ORDER BY textcol COLLATE "unicode" for the root collation
or a specific region-country if needed.
But all the rest seems better, especially cross-OS compatibility,
truly immutable and faster indexes for fields that
don't require linguistic ordering, alignment between Unicode
updates and Postgres updates.
Thank you, that summarizes exactly the compromise that I'm trying to
reach.
Regards,
Jeff Davis
On Wed, Dec 20, 2023 at 2:13 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
If the Postgres default was bytewise sorting+locale-agnostic
ctype functions directly derived from Unicode data files,
as opposed to libc/$LANG at initdb time, the main
annoyance would be that "ORDER BY textcol" would no
longer be the human-favored sort.
For the presentation layer, we would have to write for instance
ORDER BY textcol COLLATE "unicode" for the root collation
or a specific region-country if needed.
But all the rest seems better, especially cross-OS compatibity,
truly immutable and faster indexes for fields that
don't require linguistic ordering, alignment between Unicode
updates and Postgres updates.
Thank you, that summarizes exactly the compromise that I'm trying to
reach.
This makes sense to me, too, but it feels like it might work out
better for speakers of English than for speakers of other languages.
Right now, I tend to get databases that default to en_US.utf8, and if
the default changed to C.utf8, then the case-comparison behavior might
be different but the letters would still sort in the right order. For
someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8, a
change to C.utf8 would be a much bigger problem, I would think. Their
alphabet isn't in code point order, and so things would be
alphabetized wrongly. That might be OK if they don't care about
ordering for any purpose other than equality lookups, but otherwise
it's going to force them to change the default, where today they don't
have to do that.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 2023-12-20 at 14:24 -0500, Robert Haas wrote:
This makes sense to me, too, but it feels like it might work out
better for speakers of English than for speakers of other languages.
There's very little in the way of locale-specific tailoring for ctype
behaviors in ICU or glibc -- only for the 'az', 'el', 'lt', and 'tr'
locales. While English speakers like us may benefit from being aligned
with the default ctype behaviors, those behaviors are not at all
specific to 'en' locales in ICU or glibc.
Collation varies a lot more between locales. I wouldn't call memcmp
ideal for English ('Zebra' comes before 'apple', which seems wrong to
me). If memcmp sorting does favor any particular group, I would say it
favors programmers more than English speakers. But that could just be
my perspective and I certainly understand the point that memcmp
ordering is more tolerable for some languages than others.
Right now, I tend to get databases that default to en_US.utf8, and if
the default changed to C.utf8, then the case-comparison behavior might
be different
en_US.UTF-8 and C.UTF-8 have the same ctype behavior.
For someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8, a
change to C.utf8 would be a much bigger problem, I would think.
Those locales all have the same ctype behavior.
It turns out that en_US.UTF-8 and fr_FR.UTF-8 also have the same
collation order -- no tailoring beyond root collation according to CLDR
files for 'en' and 'fr' (though note that 'fr_CA' does have tailoring).
That doesn't mean the experience of switching to memcmp order is
exactly the same for a French speaker and an English speaker, but I
think it's interesting.
That might be OK if they don't care about
ordering for any purpose other than equality lookups, but otherwise
it's going to force them to change the default, where today they don't
have to do that.
To be clear, I haven't proposed changing the initdb default. This
thread is about adding a builtin provider with builtin ctype, which I
believe a lot of users would like.
It also might be the best chance we have to get to a reasonable default
behavior at some point in the future. It would be always available,
fast, stable, better semantics than "C" for many locales, and we can
document it. In any case, we don't need to decide that now. If the
builtin provider is useful, we should do it.
Regards,
Jeff Davis
On 12/5/23 3:46 PM, Jeff Davis wrote:
CTYPE, which handles character classification and upper/lowercasing
behavior, may be simpler than it first appears. We may be able to get
a net decrease in complexity by just building in most (or perhaps all)
of the functionality.

=== Character Classification ===
Character classification is used for regexes, e.g. whether a character
is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]"
class. Unicode defines what character properties map into these
classes in TR #18 [1], specifying both a "Standard" variant and a
"POSIX Compatible" variant. The main difference with the POSIX variant
is that symbols count as punctuation.

=== LOWER()/INITCAP()/UPPER() ===
The LOWER() and UPPER() functions are defined in the SQL spec with
surprising detail, relying on specific Unicode General Category
assignments. How to map characters seems to be left (implicitly) up to
Unicode. If the input string is normalized, the output string must be
normalized, too. Weirdly, there's no room in the SQL spec to localize
LOWER()/UPPER() at all to handle issues like [1]. Also, the standard
specifies one example, which is that "ß" becomes "SS" when folded to
upper case. INITCAP() is not in the SQL spec.
I'll be honest, even though this is primarily about CTYPE and not
collation, I still need to keep re-reading the initial email slowly to
let it sink in and better understand it... at least for me, it's complex
to reason through. 🙂
I'm trying to make sure I understand clearly what the user impact/change
is that we're talking about: after a little bit of brainstorming and
looking through the PG docs, I'm actually not seeing much more than
these two things you've mentioned here: the set of regexp_* functions PG
provides, and these three generic functions. That alone doesn't seem
highly concerning.
I haven't checked the source code for the regexp_* functions yet, but
are these just passing through to an external library? Are we actually
able to easily change the CTYPE provider for them? If nobody
knows/replies then I'll find some time to look.
One other thing that comes to mind: how does the parser do case folding
for relation names? Is that using OS-provided libc as of today? Or did
we code it to use ICU if that's the DB default? I'm guessing libc, and
global catalogs probably need to be handled in a consistent manner, even
across different encodings.
(Kind of related... did you ever see the demo where I create a user named
'🏃' and then I try to connect to a database with non-unicode encoding?
💥😜 ...at least it seems to be able to walk the index without decoding
strings to find other users - but the way these global catalogs work
scares me a little bit)
-Jeremy
On 12/20/23 3:47 PM, Jeremy Schneider wrote:
On 12/5/23 3:46 PM, Jeff Davis wrote:
CTYPE, which handles character classification and upper/lowercasing
behavior, may be simpler than it first appears. We may be able to get
a net decrease in complexity by just building in most (or perhaps all)
of the functionality.

I'll be honest, even though this is primarily about CTYPE and not
collation, I still need to keep re-reading the initial email slowly to
let it sink in and better understand it... at least for me, it's complex
to reason through. 🙂

I'm trying to make sure I understand clearly what the user impact/change
is that we're talking about: after a little bit of brainstorming and
looking through the PG docs, I'm actually not seeing much more than
these two things you've mentioned here: the set of regexp_* functions PG
provides, and these three generic functions. That alone doesn't seem
highly concerning.
I missed citext, which extends impact to replace(), split_part(),
strpos() and translate(). There are also the five *_REGEX() functions
from the SQL standard which I assume are just calling the PG functions.
I just saw the krb_caseins_users GUC, which reminds me that PLs also
have their own case functions. And of course extensions. I'm not saying
any of this is in scope for the change here, but I'm just trying to wrap
my brain around all the places we've got CTYPE processing happening, to
better understand the big picture. It might help tease out unexpected
small glitches from changing one thing but not another one.
-Jeremy
On 12/20/23 4:04 PM, Jeremy Schneider wrote:
On 12/20/23 3:47 PM, Jeremy Schneider wrote:
On 12/5/23 3:46 PM, Jeff Davis wrote:
CTYPE, which handles character classification and upper/lowercasing
behavior, may be simpler than it first appears. We may be able to get
a net decrease in complexity by just building in most (or perhaps all)
of the functionality.

I'll be honest, even though this is primarily about CTYPE and not
collation, I still need to keep re-reading the initial email slowly to
let it sink in and better understand it... at least for me, it's complex
to reason through. 🙂

I'm trying to make sure I understand clearly what the user impact/change
is that we're talking about: after a little bit of brainstorming and
looking through the PG docs, I'm actually not seeing much more than
these two things you've mentioned here: the set of regexp_* functions PG
provides, and these three generic functions. That alone doesn't seem
highly concerning.

I missed citext, which extends impact to replace(), split_part(),
strpos() and translate(). There are also the five *_REGEX() functions
from the SQL standard which I assume are just calling the PG functions.
found some more. here's my running list of everything user-facing I see
in core PG code so far that might involve case:
* upper/lower/initcap
* regexp_*() and *_REGEXP()
* ILIKE, operators ~* !~* ~~ !~~ ~~* !~~*
* citext + replace(), split_part(), strpos() and translate()
* full text search - everything is case folded
* unaccent? not clear to me whether CTYPE includes accent folding
* ltree
* pg_trgm
* core PG parser, case folding of relation names
On Wed, Dec 20, 2023 at 5:57 PM Jeff Davis <pgsql@j-davis.com> wrote:
Those locales all have the same ctype behavior.
Sigh. I keep getting confused about how that works...
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 2023-12-20 at 16:29 -0800, Jeremy Schneider wrote:
found some more. here's my running list of everything user-facing I see
in core PG code so far that might involve case:

* upper/lower/initcap
* regexp_*() and *_REGEXP()
* ILIKE, operators ~* !~* ~~ !~~ ~~* !~~*
* citext + replace(), split_part(), strpos() and translate()
* full text search - everything is case folded
* unaccent? not clear to me whether CTYPE includes accent folding
No, ctype has nothing to do with accents as far as I can tell. I don't
know if I'm using the right terminology, but I think "case" is a
variant of a character whereas "accent" is a modifier/mark, and the
mark is a separate concept from the character itself.
* ltree
* pg_trgm
* core PG parser, case folding of relation names
Let's separate it into groups.
(1) Callers that use a collation OID or pg_locale_t:
* collation & hashing
* upper/lower/initcap
* regex, LIKE, formatting
* pg_trgm (which uses regexes)
* maybe postgres_fdw, but might just be a passthrough
* catalog cache (always uses DEFAULT_COLLATION_OID)
* citext (always uses DEFAULT_COLLATION_OID, but probably shouldn't)
(2) A long tail of callers that depend on what LC_CTYPE/LC_COLLATE are
set to, or use ad-hoc ASCII-only semantics:
* core SQL parser downcase_identifier()
* callers of pg_strcasecmp() (DDL, etc.)
* GUC name case folding
* full text search ("mylocale = 0 /* TODO */")
* a ton of stuff uses isspace(), isdigit(), etc.
* various callers of tolower()/toupper()
* some selfuncs.c stuff
* ...
Might have missed some places.
The user impact of a new builtin provider would affect (1), but only
for those actually using the provider. So there's no compatibility risk
there, but it's good to understand what it will affect.
We can, on a case-by-case basis, also consider using the new APIs I'm
proposing for instances of (2). There would be some compatibility risk
there for existing callers, and we'd have to consider whether it's
worth it or not. Ideally, new callers would either use the new APIs or
use the pg_ascii_* APIs.
Regards,
Jeff Davis
On Wed, 2023-12-20 at 15:47 -0800, Jeremy Schneider wrote:
One other thing that comes to mind: how does the parser do case folding
for relation names? Is that using OS-provided libc as of today? Or did
we code it to use ICU if that's the DB default? I'm guessing libc, and
global catalogs probably need to be handled in a consistent manner, even
across different encodings.
The code is in downcase_identifier():
    /*
     * SQL99 specifies Unicode-aware case normalization, which we don't
     * yet have the infrastructure for...
     */
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
        ch = tolower(ch);
    result[i] = (char) ch;
My proposal would add the infrastructure that the comment above says is
missing.
It seems like we should be using the database collation at this point
because you don't want inconsistency between the catalogs and the
parser here. Then again, the SQL spec doesn't seem to support tailoring
of case conversions, so maybe we are avoiding it for that reason? Or
maybe we're avoiding catalog access? Or perhaps the work for ICU just
wasn't done here yet?
(Kind of related... did you ever see the demo where I create a user named
'🏃' and then I try to connect to a database with non-unicode encoding?
💥😜 ...at least it seems to be able to walk the index without decoding
strings to find other users - but the way these global catalogs work
scares me a little bit)
I didn't see that specific demo, but in general we seem to change
between pg_wchar and unicode code points too freely, so I'm not
surprised that something went wrong.
Regards,
Jeff Davis