Collation version tracking for macOS
During development, I have been using the attached patch to simulate
libc collation versions on macOS. It just uses the internal major OS
version number. I don't know to what the extend the libc locales on
macOS are maintained or updated at all, so I don't know what practical
effect this would have. Again, it's mainly for development. If there
is interest from others, I think we could add this, maybe disabled by
default, or we just keep it in the mailing list archives for interested
parties.
Attachments:
0001-Collation-version-tracking-for-macOS.patchtext/plain; charset=UTF-8; name=0001-Collation-version-tracking-for-macOS.patchDownload+26-1
On Mon, Feb 14, 2022 at 10:00 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
During development, I have been using the attached patch to simulate
libc collation versions on macOS. It just uses the internal major OS
version number. I don't know to what the extend the libc locales on
macOS are maintained or updated at all, so I don't know what practical
effect this would have. Again, it's mainly for development. If there
is interest from others, I think we could add this, maybe disabled by
default, or we just keep it in the mailing list archives for interested
parties.
Last time I looked into this it seemed like macOS's strcoll() gave
sensible answers in the traditional single-byte encodings, but didn't
understand UTF-8 at all so you get C/strcmp() order. In other words
there was effectively nothing to version. I remember that other old
Unixes used to be like that, and I suspect that they might be using
old pre-UTF-8 FreeBSD code for locales based on a quick peek at [1]https://github.com/apple-open-source-mirror/Libc/tree/master/locale
(though FreeBSD itself has since learned to do CLDR-based UTF-8
sorting with a completely new implementation shared with other OSes).
This makes me wonder if Apple is hiding another collation
implementation somewhere up its sleeve -- surely that libc support is
not good enough for the world's shiny globalised macOS/iOS apps?
Maybe UCCompareText() and friends (UnicodeUtilitiesCoreLib) and the
various Obj-C NSString comparison stuff, all of which probably
predates Unixoid macOS (google tells me that UnicodeUtilities.h was
present in macOS 9). It wouldn't be surprising if it shares nothing
with the modern OS's C runtime stuff that came via NeXT. Just
mentioning this as a curiosity, because I was trying to figure out how
that could be left non-working without anyone complaining...
[1]: https://github.com/apple-open-source-mirror/Libc/tree/master/locale
On 07.05.22 02:31, Thomas Munro wrote:
During development, I have been using the attached patch to simulate
libc collation versions on macOS. It just uses the internal major OS
version number. I don't know to what the extend the libc locales on
macOS are maintained or updated at all, so I don't know what practical
effect this would have. Again, it's mainly for development. If there
is interest from others, I think we could add this, maybe disabled by
default, or we just keep it in the mailing list archives for interested
parties.Last time I looked into this it seemed like macOS's strcoll() gave
sensible answers in the traditional single-byte encodings, but didn't
understand UTF-8 at all so you get C/strcmp() order. In other words
there was effectively nothing to version.
Someone recently told me that collations in macOS have actually changed
recently and that this is a live problem. See explanation here:
So I think we should reconsider this patch, even for PG15.
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
On 07.05.22 02:31, Thomas Munro wrote:
Last time I looked into this it seemed like macOS's strcoll() gave
sensible answers in the traditional single-byte encodings, but didn't
understand UTF-8 at all so you get C/strcmp() order. In other words
there was effectively nothing to version.
Someone recently told me that collations in macOS have actually changed
recently and that this is a live problem. See explanation here:
So I think we should reconsider this patch, even for PG15.
According to that document, they changed it in macOS 11, which came out
a year and a half ago. Given the lack of complaints, it doesn't seem
like this is urgent enough to mandate a post-beta change that would
have lots of downside (namely, false-positive warnings for every other
macOS update).
regards, tom lane
On 6/3/22 9:21 AM, Tom Lane wrote:
According to that document, they changed it in macOS 11, which came out
a year and a half ago. Given the lack of complaints, it doesn't seem
like this is urgent enough to mandate a post-beta change that would
have lots of downside (namely, false-positive warnings for every other
macOS update).
Sorry, I'm going to rant for a minute... it is my very strong opinion
that using language like "false positive" here is misguided and dangerous.
If new version of sort order is released, for example when they recently
updated backwards-secondary sorting in french [CLDR-2905] or matching of
v and w in swedish and finnish [CLDR-7088], it is very dangerous to use
language like “false positive” to describe a database where there just
didn't happen to be any rows with accented french characters at the
point in time where PostgreSQL magically changed which version of sort
order it was using from the 2010 french version to the 2020 french version.
No other piece of software that calls itself a database would do what
PostgreSQL is doing: just give users a "warning" after suddenly changing
the sort order algorithm (most users won't even read warnings in their
logs). Oracle, DB2, SQL Server and even MySQL carefully version
collation data, hardcode a pseudo-linguistic collation into the DB (like
PG does for timezones), and if they provide updates to linguistic sort
order (from Unicode CLDR) then they allow the user to explicitly specify
which version of french or german ICU sorting they are want to use.
Different versions are treated as different sort orders; they are not
conflated.
I have personally seen PostgreSQL databases where an update to an old
version of glibc was applied (I'm not even talking 2.28 here) and it
resulted in data loss b/c crash recovery couldn't replay WAL records and
the user had to do a PITR. That's aside from the more common issues of
segfaults or duplicate records that violate unique constraints or wrong
query results like missing data. And it's not just updates - people can
set up a hot standby on a different version and see many of these
problems too.
Collation versioning absolutely must be first class and directly
controlled by users, and it's very dangerous to allow users - at all -
to take an index and then use a different version than what the index
was built with.
Not to mention all the other places in the DB where collation is used...
partitioning, constraints, and any other place where persisted data can
make an assumption about any sort of string comparison.
It feels to me like we're still not really thinking clearly about this
within the PG community, and that the seriousness of this issue is not
fully understood.
-Jeremy Schneider
On Sat, Jun 4, 2022 at 12:17 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
On 07.05.22 02:31, Thomas Munro wrote:
Last time I looked into this it seemed like macOS's strcoll() gave
sensible answers in the traditional single-byte encodings, but didn't
understand UTF-8 at all so you get C/strcmp() order. In other words
there was effectively nothing to version.Someone recently told me that collations in macOS have actually changed
recently and that this is a live problem. See explanation here:
How can I see evidence of this? I'm comparing Debian, FreeBSD and
macOS 12.4 and when I run "LC_COLLATE=en_US.UTF-8 sort
/usr/share/dict/words" I get upper and lower case mixed together on
the other OSes, but on the Mac the upper case comes first, which is my
usual smoke test for "am I looking at binary sort order?"
On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:
No other piece of software that calls itself a database would do what
PostgreSQL is doing: just give users a "warning" after suddenly changing
the sort order algorithm (most users won't even read warnings in their
logs). Oracle, DB2, SQL Server and even MySQL carefully version
collation data, hardcode a pseudo-linguistic collation into the DB (like
PG does for timezones), and if they provide updates to linguistic sort
order (from Unicode CLDR) then they allow the user to explicitly specify
which version of french or german ICU sorting they are want to use.
Different versions are treated as different sort orders; they are not
conflated.
I guess you know this but for the record, there have been discussions
before about supporting multiple versions of ICU concurrently, like
DB2. For example, one idea was that each ICU version could be a
separate "provider" in PostgreSQL, so you can concurrently use
multiple versions.
One of several places this came up:
/messages/by-id/CADE5jYJTnYaTNXMFKOK-0p44+Dm5LMcRcJ5kVi1MVHomb2QTkQ@mail.gmail.com
It feels to me like we're still not really thinking clearly about this
within the PG community, and that the seriousness of this issue is not
fully understood.
FWIW A couple of us tried quite hard to make smarter warnings, and
that thread and others discussed a lot of those topics, like the
relevance to constraints and so forth.
Thomas Munro <thomas.munro@gmail.com> writes:
On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:It feels to me like we're still not really thinking clearly about this
within the PG community, and that the seriousness of this issue is not
fully understood.
FWIW A couple of us tried quite hard to make smarter warnings, and
that thread and others discussed a lot of those topics, like the
relevance to constraints and so forth.
I think the real problem here is that the underlying software mostly
doesn't take this issue seriously. Unfortunately, that leads one to
the conclusion that we need to maintain our own collation code and
data (e.g., our own fork of ICU), and that isn't happening. Unlike
say Oracle, we do not have the manpower; nor do we want to bloat our
code base that much.
Short of maintaining our own fork, ranting about the imperfections
of the situation is a waste of time.
regards, tom lane
On 6/3/22 3:58 PM, Tom Lane wrote
Thomas Munro <thomas.munro@gmail.com> writes:
On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:It feels to me like we're still not really thinking clearly about this
within the PG community, and that the seriousness of this issue is not
fully understood.FWIW A couple of us tried quite hard to make smarter warnings, and
that thread and others discussed a lot of those topics, like the
relevance to constraints and so forth.I think the real problem here is that the underlying software mostly
doesn't take this issue seriously. Unfortunately, that leads one to
the conclusion that we need to maintain our own collation code and
data (e.g., our own fork of ICU), and that isn't happening. Unlike
say Oracle, we do not have the manpower; nor do we want to bloat our
code base that much.Short of maintaining our own fork, ranting about the imperfections
of the situation is a waste of time.
The first step to a solution is admitting that the problem exists.
Ignoring broken backups, segfaults and data corruption as a "rant"
implies that we simply throw in the towel and tell users to suck it up
or switch engines. There are other ways to address this short of the
community doing all the work itself. One simple example would be to
refuse to start if the collation provider has changed since initdb
(which we'd need to allow users to override). A more sophisticated
option would be to provide the machinery for supporting multiple
collation libraries. Both of those at least ensure that users are aware
any time there's a problem, which IMO is *enormously* better than
letting core functionality silently stop working.
Jim Nasby <nasbyj@amazon.com> writes:
I think the real problem here is that the underlying software mostly
doesn't take this issue seriously.
The first step to a solution is admitting that the problem exists.
Ignoring broken backups, segfaults and data corruption as a "rant"
implies that we simply throw in the towel and tell users to suck it up
or switch engines. There are other ways to address this short of the
community doing all the work itself. One simple example would be to
refuse to start if the collation provider has changed since initdb
(which we'd need to allow users to override).
You're conveniently skipping over the hard part, which is to tell
whether the collation provider has changed behavior (which we'd better
do with pretty darn high accuracy, if we're going to refuse to start
on the basis of thinking it has). Unfortunately, giving a reliable
indication of collation behavioral changes is *exactly* the thing
that the providers aren't taking seriously.
regards, tom lane
On Jun 6, 2022, at 17:10, Jim Nasby <nasbyj@amazon.com> wrote:
Ignoring broken backups, segfaults and data corruption as a "rant" implies that we simply throw in the towel and tell users to suck it up or switch engines.
Well now, let’s be clear, I was the one who called my email a “rant”. 🙂
And I do apologize for that - it was grumpy and impulsive and Tom isn’t wrong that rants don’t usually help move things forward.
Thomas - thanks for the link back to one of the threads. I spent some time reading through that and it’s a lot of material; I haven’t read the whole thread yet. If you have some others that would also be particularly good background, let me know. I’m doing a chunk of this in my spare time at the moment, but I do want to keep getting more up to speed. I was pulled into a bunch of various things related to PostgreSQL and ICU and collation and OS’s over the past couple years, so I learned a lot from on-the-ground experience and I am interested in trying to get a little more involved in the conversation here.
Personally, I really do think there should at least be an *option* to tell the DB to fully error rather than just warn on version mismatch. Correctness matters to many users, and being able to *trust* string comparisons are correct is pretty damn fundamental all throughout a database. It really doesn’t get any more basic and the potential for bad things to happen is pretty astronomical, if you can’t trust those. I understand the consternation about dealing with upgrades of large & busy databases, but I’m still surprised that the community consensus arrived at the present behavior, and I have a lot of reading to do, to really understand how that happened and where the dialogue is today.
Multiple versions of ICU sounds nice for users who need real linguistic collation (like what Oracle and DB2 offer), but I still feel like there needs to be a super simple basic “pseudo-linguistic” collation baked in, that’s “good enough” for 99% of users and that is guaranteed to be the same everywhere on every platform and just won’t ever change. I think glibc needs to be phased out somehow. At a minimum, not the default for new users… to stop the bleeding. If MySQL wasn’t GPL then I’d say to just copy their collations. I’d be reluctant to spend too much time on a POC now though, it feels like my idea is the outlier and the general PG hacker consensus would be to reject this idea. (But maybe I’m wrong?)
Anyway, again, apologies for my pants-on-fire email last week. I hope I can enjoy a few beers someday - or coffee for the non-drinkers - with a few other PG collation nerds (which I never set out to be, but it may have befallen me <g>).
-Jeremy
Sent from my TI-83
On Tue, Jun 7, 2022 at 12:10 PM Jim Nasby <nasbyj@amazon.com> wrote:
On 6/3/22 3:58 PM, Tom Lane wrote
Thomas Munro <thomas.munro@gmail.com> writes:
On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:It feels to me like we're still not really thinking clearly about this
within the PG community, and that the seriousness of this issue is not
fully understood.FWIW A couple of us tried quite hard to make smarter warnings, and
that thread and others discussed a lot of those topics, like the
relevance to constraints and so forth.I think the real problem here is that the underlying software mostly
doesn't take this issue seriously. Unfortunately, that leads one to
the conclusion that we need to maintain our own collation code and
data (e.g., our own fork of ICU), and that isn't happening. Unlike
say Oracle, we do not have the manpower; nor do we want to bloat our
code base that much.Short of maintaining our own fork, ranting about the imperfections
of the situation is a waste of time.The first step to a solution is admitting that the problem exists.
We've been discussing this topic for years and I don't think anyone
thinks the case is closed...
Ignoring broken backups, segfaults and data corruption as a "rant"
implies that we simply throw in the towel and tell users to suck it up
or switch engines. There are other ways to address this short of the
community doing all the work itself. One simple example would be to
refuse to start if the collation provider has changed since initdb
(which we'd need to allow users to override).
Yeah, it's been discussed, but never proposed. The problem is that
you need to start up to fix the problem. Another option is not to use
affected indexes, but that doesn't help with other forms of the
problem (partition constraints, etc).
A more sophisticated
option would be to provide the machinery for supporting multiple
collation libraries.
Earlier I mentioned distinct "providers" but I take that back, that's
too complicated. Reprising an old idea that comes up each time we
talk about this, this time with some more straw-man detail: what about
teaching our ICU support to understand "libicu18n.so.71:en" to mean
that it should dlopen() that library and use its functions? Or some
cleverer, shorter notation. Then it's the user's problem to make sure
the right libraries are installed, and it'll fail if they're not. For
example, on Debian bookworm right now you can install libicu63,
libicu67, libicu71, though only the "current" -dev package, but which
I'm sure we can cope with. You're at the mercy of the distro or
add-on package repos to keep a lot of versions around, but that seems
OK. Maintaining our own fork(s) of ICU would seem like massive
overkill and I don't think anyone has suggested that; the question on
my mind is whether we could rely on existing packages. Then you'd be
exposed only to changes that happen within (say) the ICU 63 package's
lifetime... I recall looking into whether that can happen but ... I
don't recall the answer.
On Mon, Jun 6, 2022 at 8:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jim Nasby <nasbyj@amazon.com> writes:
I think the real problem here is that the underlying software mostly
doesn't take this issue seriously.The first step to a solution is admitting that the problem exists.
Ignoring broken backups, segfaults and data corruption as a "rant"
implies that we simply throw in the towel and tell users to suck it up
or switch engines. There are other ways to address this short of the
community doing all the work itself. One simple example would be to
refuse to start if the collation provider has changed since initdb
(which we'd need to allow users to override).You're conveniently skipping over the hard part, which is to tell
whether the collation provider has changed behavior (which we'd better
do with pretty darn high accuracy, if we're going to refuse to start
on the basis of thinking it has). Unfortunately, giving a reliable
indication of collation behavioral changes is *exactly* the thing
that the providers aren't taking seriously.
Is this more involved than creating a list of all valid Unicode characters
(~144 thousand), sorting them, then running crc32 over the sorted order to
create the "version" for the library/collation pair? Far from free but few
databases use more than a couple different collations.
--
Rod Taylor
On Wed, Jun 8, 2022 at 3:58 AM Rod Taylor <rbt@rbt.ca> wrote:
Is this more involved than creating a list of all valid Unicode characters (~144 thousand), sorting them, then running crc32 over the sorted order to create the "version" for the library/collation pair? Far from free but few databases use more than a couple different collations.
Collation rules have multiple levels and all kinds of quirks, so that
won't work.
On Fri, Jun 3, 2022 at 4:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think the real problem here is that the underlying software mostly
doesn't take this issue seriously. Unfortunately, that leads one to
the conclusion that we need to maintain our own collation code and
data (e.g., our own fork of ICU), and that isn't happening. Unlike
say Oracle, we do not have the manpower; nor do we want to bloat our
code base that much.
You don't, but that opinion isn't universally held, or at least not
with the same vigor that you hold it. See e.g.
/messages/by-id/a4019c5e570d4dbb5e3f816c080fb57c76ab604a.camel@cybertec.at
and subsequent discussion, for example.
In fact, I'd go so far as to argue that you're basically sticking your
head in the sand here. You wrote:
"Given the lack of complaints, it doesn't seem
like this is urgent enough to mandate a post-beta change that would
have lots of downside (namely, false-positive warnings for every other
macOS update)."
But you wrote that to Peter, who was essentially complaining that we
hadn't done anything, and linked to another source, which was also
complaining about the problem, and then Jeremy Schneider replied to
your email and complained some more.
Complaining about "false positives" doesn't really make sense to me.
It's true that we don't have any false positives right now, but we
also have no true positives. Even a stopped clock is right twice a
day, but not in a useful way. People want to be notified when a
problem might exist, even if sometimes it doesn't actually. The
alternative is having no idea at all that things might be broken,
which is not better.
--
Robert Haas
EDB: http://www.enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
On Wed, Jun 8, 2022 at 3:58 AM Rod Taylor <rbt@rbt.ca> wrote:
Is this more involved than creating a list of all valid Unicode characters (~144 thousand), sorting them, then running crc32 over the sorted order to create the "version" for the library/collation pair? Far from free but few databases use more than a couple different collations.
Collation rules have multiple levels and all kinds of quirks, so that
won't work.
Yeah, and it's exactly at the level of quirks that things are likely
to change. Nobody's going to suddenly start sorting B before A.
They might, say, change their minds about where the digram "cz"
sorts relative to single letters, in languages where special rules
for that are a thing.
The idea of fingerprinting a collation's behavior is interesting,
but I've got doubts about whether we can make a sufficiently thorough
fingerprint.
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
In fact, I'd go so far as to argue that you're basically sticking your
head in the sand here. You wrote:
No, I quite agree that we have a problem. What I don't agree is that
issuing a lot of false-positive warnings is a solution. That will
just condition people to ignore the warnings, and then when their
platform really does change behavior, they're still screwed. If we
could *accurately* report collation behavioral changes, I'd be all
for that.
Rod's idea upthread is certainly way too simplistic, but could we
build a set of test cases that do detect known changes in collation
behaviors? We'd be shooting at a moving target; but even if we're
late in noticing that platform X changed the behavior of collation Y,
we could help users who run in the problem afterwards.
regards, tom lane
On Tue, Jun 7, 2022 at 12:37 PM Robert Haas <robertmhaas@gmail.com> wrote:
It's true that we don't have any false positives right now, but we
also have no true positives. Even a stopped clock is right twice a
day, but not in a useful way. People want to be notified when a
problem might exist, even if sometimes it doesn't actually.
Collations by their very nature are unlikely to change all that much.
Obviously they can and do change, but the details are presumably
pretty insignificant to a native speaker. Stands to reason that the
issue (which is fundamentally a problem for natural language experts)
would have been resolved far sooner if there really was a significant
controversy about something that tends to come up often.
It's pretty clear that glibc as a project doesn't take the issue very
seriously, because they see it as a problem of the GUI sorting a table
in a way that seems slightly suboptimal to scholars of a natural
language. Clearly that isn't actually a big deal. But the latent
possibility of wrong answers to queries is a very big deal. Both are
true. It's just a matter of priorities in each case.
I agree that "false positive" is not a valid way of describing a
breaking change in a Postgres collation that happens to not affect one
index in particular, due to the current phase of the moon. It's
probably very likely that most individual indexes that we warn about
will be so-called false positives. I bet Postgres that there are many
near-misses that we never get to hear about already. That's rather
beside the point. The index must be assumed to be corrupt.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
I agree that "false positive" is not a valid way of describing a
breaking change in a Postgres collation that happens to not affect one
index in particular, due to the current phase of the moon. It's
probably very likely that most individual indexes that we warn about
will be so-called false positives.
This is not the concern that I have. I agree that if we tell a user
that collation X changed behavior and he'd better reindex his indexes
that use collation X, but none of them actually contain any cases that
changed behavior, that's not a "false positive" --- that's "it's cheaper
to reindex than to try to identify whether there's a problem". What
I mean by "false positive" is telling every macOS user that they'd better
reindex everything every year, when in point of fact Apple changes those
collations almost never. We will soon lose those users' attention ---
see fable about boy crying wolf --- and then when Apple actually does
change something, we've got a problem. So if we give collation-change
warnings, they'd better have some measurable connection to reality.
regards, tom lane
On Mon, Jun 6, 2022 at 5:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Earlier I mentioned distinct "providers" but I take that back, that's
too complicated. Reprising an old idea that comes up each time we
talk about this, this time with some more straw-man detail: what about
teaching our ICU support to understand "libicu18n.so.71:en" to mean
that it should dlopen() that library and use its functions? Or some
cleverer, shorter notation. Then it's the user's problem to make sure
the right libraries are installed, and it'll fail if they're not. For
example, on Debian bookworm right now you can install libicu63,
libicu67, libicu71, though only the "current" -dev package, but which
I'm sure we can cope with. You're at the mercy of the distro or
add-on package repos to keep a lot of versions around, but that seems
OK.
Right. Postgres could link to multiple versions of ICU at the same
time. Right now it doesn't, and right now the ICU C symbol names that
we use are actually versioned (this isn't immediately apparent because
the C preprocessor makes it appear that ICU symbol names are generic).
We could perhaps invent a new indirection that knows about
multiple ICU versions, each of which is an independent collation
provider, or maybe a related collation provider that gets used by
default on REINDEX. ICU is designed for this kind of thing. That
approach more or less puts packagers on the hook for managing
collation stability. But now long term collation stability is at least
feasible -- we at least have a coherent strategy. In the worst case
the community .deb and .rpm repos might continue to support an older
ICU version, or lobby for its continued support by the distro (while
actively discouraging its use in new databases). This isn't the same
thing as forking ICU. It's a compromise between that extreme, and
the current situation.
--
Peter Geoghegan