[18] Policy on IMMUTABLE functions and Unicode updates

mail@joeconway.com

over 1 year ago

In reply to: Jeff Davis (#1)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On 7/16/24 13:42, Jeff Davis wrote:

The IMMUTABLE marker for functions is quite simple on the surface, but
could be interpreted a few different ways, and there's some historical
baggage that makes it complicated.

There are a number of ways in which IMMUTABLE functions can change
behavior:

1. Updating or moving to a different OS affects all collations that use
the libc provider (other than "C" and "POSIX", which don't actually use
libc). LOWER(), INITCAP(), UPPER() and pattern matching are also
affected.

2. Updating ICU affects the collations that use the ICU provider.
ICU_UNICODE_VERSION(), LOWER(), INITCAP(), UPPER() and pattern matching
are also affected.

3. Moving to a different database encoding may affect collations that
use the "C" or "POSIX" locales in the libc provider (NB: those locales
don't actually use libc).

4. A PG Unicode update may change the results of functions that depend
on Unicode. For instance, NORMALIZE(), UNICODE_ASSIGNED(), and
UNICODE_VERSION(). Or, if using the new builtin provider's "C.UTF-8"
locale in version 17, LOWER(), INITCAP(), UPPER(), and pattern matching
(NB: collation itself is not affected -- always code point order).

5. If a well-defined IMMUTABLE function produces the wrong results, we
may fix the bug in the next major release.

6. The GUC extra_float_digits can change the results of floating point
text output.

7. A UDF may be improperly marked IMMUTABLE. A particularly common
variant is a UDF without search_path specified, which is probably not
truly IMMUTABLE.

(more I'm sure, please add to list...)

#1 and #2 have been discussed much more than the rest, but I think it's
worthwhile to enumerate the other problems even if the impact is a lot
lower.

Noah seemed particularly concerned[1] about #4, so I'll start off by
discussing that. Here's a brief history (slightly confusing because the
PG and Unicode versions are similar numbers):

PG13: Unicode 13.0 and NORMALIZE() is first exposed as a SQL function
PG15: Unicode updated to 14.0
PG16: Unicode updated to 15.0
PG17: Unicode updated to 15.1, UNICODE_ASSIGNED(), UNICODE_VERSION()
and builtin "C.UTF-8" locale are introduced

To repeat, these Unicode updates do not affect collation itself, they
affect affect NORMALIZE(), UNICODE_VERSION(), and UNICODE_ASSIGNED().
If using the builtin "C.UTF-8" locale, they also affect LOWER(),
INITCAP(), UPPER(), and pattern matching. (NB: the builtin collation
provider hasn't yet gone through any Unicode update.)

There are two alternative philosophies:

A. By choosing to use a Unicode-based function, the user has opted in
to the Unicode stability guarantees[2], and it's fine to update Unicode
occasionally in new major versions as long as we are transparent with
the user.

B. IMMUTABLE implies some very strict definition of stability, and we
should never again update Unicode because it changes the results of
IMMUTABLE functions.

We've been following (A), and that's the defacto policy today[3][4].
Noah and Laurenz argued[5] that the policy starting in version 18
should be (B). Given that it's a policy decision that affects more than
just the builtin collation provider, I'd like to discuss it more
broadly outside of that subthread.

On the general topic, we have these definitions in the fine manual:

8<-----------------
A VOLATILE function can do anything, ... A query using a volatile
function will re-evaluate the function at every row where its value is
needed.

A STABLE function cannot modify the database and is guaranteed to return
the same results given the same arguments for all rows within a single
statement...

An IMMUTABLE function cannot modify the database and is guaranteed to
return the same results given the same arguments forever.
8<-----------------

As Jeff points out, the IMMUTABLE definition has never really been true.
Even the STABLE is not quite right, as there are at least some STABLE
functions that will return the same value for multiple statements if
they are within a transaction block (e.g. "now()" -- TBH I don't
remember offhand if that is true for all stable functions).

In any case, there is quite a gap between "forever" and "single
statement". Perhaps we need to have more volatility categories, with
guarantees that lie somewhere between the two, and allow those to be
used like we do IMMUTABLE except with appropriate warning labels. E.g.
something ("STABLE_VERSION"?) to mean "forever within a major version
lifetime" and something ("STABLE_SYSTEM?") to mean "as long as you don't
upgrade your OS".

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

David G. Johnston

david.g.johnston@gmail.com

over 1 year ago

In reply to: Joe Conway (#2)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, Jul 16, 2024 at 11:57 AM Joe Conway <mail@joeconway.com> wrote:

There are two alternative philosophies:

A. By choosing to use a Unicode-based function, the user has opted in
to the Unicode stability guarantees[2], and it's fine to update Unicode
occasionally in new major versions as long as we are transparent with
the user.

B. IMMUTABLE implies some very strict definition of stability, and we
should never again update Unicode because it changes the results of
IMMUTABLE functions.

We've been following (A), and that's the defacto policy today[3][4].
Noah and Laurenz argued[5] that the policy starting in version 18
should be (B). Given that it's a policy decision that affects more than
just the builtin collation provider, I'd like to discuss it more
broadly outside of that subthread.

On the general topic, we have these definitions in the fine manual:

8<-----------------
A VOLATILE function can do anything, ... A query using a volatile
function will re-evaluate the function at every row where its value is
needed.

A STABLE function cannot modify the database and is guaranteed to return
the same results given the same arguments for all rows within a single
statement...

An IMMUTABLE function cannot modify the database and is guaranteed to
return the same results given the same arguments forever.
8<-----------------

As Jeff points out, the IMMUTABLE definition has never really been true.

Even the STABLE is not quite right, as there are at least some STABLE
functions that will return the same value for multiple statements if
they are within a transaction block (e.g. "now()" -- TBH I don't
remember offhand if that is true for all stable functions).

Under-specification here doesn't make the meaning of stable incorrect. We
don't have anything that guarantees stability at the transaction scope
because I don't think it can be guaranteed there without considering
whether said transaction is read-committed, repeatable read, or
serializable. The function itself can promise more but the marker seems
correctly scoped for how the system uses it in statement optimization.

and allow those to be

used like we do IMMUTABLE except with appropriate warning labels. E.g.
something ("STABLE_VERSION"?) to mean "forever within a major version
lifetime" and something ("STABLE_SYSTEM?") to mean "as long as you don't
upgrade your OS".

I'd be content cutting "forever" down to "within a given server
configuration". Then just note that immutable functions can depend
implicitly on external server characteristics and so when moving data
between servers re-evaluation of immutable functions may be necessary. Not
so bad for indexes. A bit more problematic for generated values.

I'm not against adding metadata options here but for internal functions
comments and documentation can work. For user-defined functions I have my
doubts on how trustworthy they would end up being.

For the original question, I suggest continuing behaving per "A" and work
on making it more clear to users what that means in terms of server
upgrades.

If we do add metadata to reflect our reality I'd settle on a generic
"STATIC" marker that can be used on those functions the rely on real world
state, whether we are directly calling into the system (e.g., hashing) or
have chosen to provide the state access management ourselves (e.g.,
unicode).

When we do take control we should have a goal of allowing for a given
external dependency version to exist in many PostgreSQL versions and give
the DBA the choice of when to move individual databases from one version to
the next. Possibly dropping the dependency version support alongside the
dropping of support of the major version it first appeared in. Not keeping
up with external dependency versions just punishes new users by forbidding
them a tool permanently, as well as puts us out-of-step with those
dependency development groups, to save existing users some short-term
pain. Being able to deal with that pain at a time different than the
middle of a major version upgrade, one database at a time, gives those
existing users reasonable options.

David J.

mail@joeconway.com

over 1 year ago

In reply to: David G. Johnston (#3)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On 7/16/24 15:33, David G. Johnston wrote:

On Tue, Jul 16, 2024 at 11:57 AM Joe Conway <mail@joeconway.com
<mailto:mail@joeconway.com>> wrote:

There are two alternative philosophies:

A. By choosing to use a Unicode-based function, the user has opted in
to the Unicode stability guarantees[2], and it's fine to update

Unicode

occasionally in new major versions as long as we are transparent with
the user.

B. IMMUTABLE implies some very strict definition of stability, and we
should never again update Unicode because it changes the results of
IMMUTABLE functions.

We've been following (A), and that's the defacto policy today[3][4].
Noah and Laurenz argued[5] that the policy starting in version 18
should be (B). Given that it's a policy decision that affects

more than

just the builtin collation provider, I'd like to discuss it more
broadly outside of that subthread.

On the general topic, we have these definitions in the fine manual:

8<-----------------
A VOLATILE function can do anything, ... A query using a volatile
function will re-evaluate the function at every row where its value is
needed.

A STABLE function cannot modify the database and is guaranteed to
return
the same results given the same arguments for all rows within a single
statement...

An IMMUTABLE function cannot modify the database and is guaranteed to
return the same results given the same arguments forever.
8<-----------------

As Jeff points out, the IMMUTABLE definition has never really been
true.

Even the STABLE is not quite right, as there are at least some STABLE
functions that will return the same value for multiple statements if
they are within a transaction block (e.g. "now()" -- TBH I don't
remember offhand if that is true for all stable functions).

Under-specification here doesn't make the meaning of stable incorrect.
We don't have anything that guarantees stability at the transaction
scope because I don't think it can be guaranteed there without
considering whether said transaction is read-committed, repeatable read,
or serializable. The function itself can promise more but the marker
seems correctly scoped for how the system uses it in statement optimization.

The way it is described is still surprising and can bite you if you are
not familiar with the nuances. In particular I have seen now() used in
transaction blocks surprise more than one person over the years.

and allow those to be
used like we do IMMUTABLE except with appropriate warning labels. E.g.
something ("STABLE_VERSION"?) to mean "forever within a major version
lifetime" and something ("STABLE_SYSTEM?") to mean "as long as you
don't
upgrade your OS".

I'd be content cutting "forever" down to "within a given server
configuration". Then just note that immutable functions can depend
implicitly on external server characteristics and so when moving data
between servers re-evaluation of immutable functions may be necessary.
Not so bad for indexes. A bit more problematic for generated values.

Yeah I forgot about the configuration controlled ones.

I'm not against adding metadata options here but for internal functions
comments and documentation can work. For user-defined functions I have
my doubts on how trustworthy they would end up being.

People lie all the time for user-defined functions, usually specifically
when they need IMMUTABLE semantics and are willing to live with the risk
and/or apply their own controls to ensure no changes in output.

For the original question, I suggest continuing behaving per "A" and
work on making it more clear to users what that means in terms of server
upgrades.

If we do add metadata to reflect our reality I'd settle on a generic
"STATIC" marker that can be used on those functions the rely on real
world state, whether we are directly calling into the system (e.g.,
hashing) or have chosen to provide the state access management ourselves
(e.g., unicode).

So you are proposing we add STATIC to VOLATILE/STABLE/IMMUTABLE (in the
third position before IMMUTABLE), give it IMMUTABLE semantics, mark
builtin functions that deserve it, and document with suitable caution
statements?

I guess can live with just one additional level of granularity.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Joe Conway (#4)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

Joe Conway <mail@joeconway.com> writes:

So you are proposing we add STATIC to VOLATILE/STABLE/IMMUTABLE (in the
third position before IMMUTABLE), give it IMMUTABLE semantics, mark
builtin functions that deserve it, and document with suitable caution
statements?

What is the point of that, exactly?

I'll agree that the user documentation could use some improvement
in how it describes the volatility levels, but I do not see how
it will reduce anybody's confusion to invent multiple aliases for
what's effectively the same volatility level. Nor do I see a
use-case for actually having multiple versions of "immutable".
Once you've decided you can put something into an index, quibbling
over just how immutable it is doesn't really change anything.

To put this another way: the existing volatility levels were
basically reverse-engineered from the ways that the planner could
meaningfully treat a function: it's dangerous, it is safe enough
to use in an index condition (which changes the number of times
the query will evaluate it), or it's safe to constant-fold in
advance of execution. Unless there's a fourth planner behavior that's
worth having, we don't need a fourth level. Possibly you could
argue that "safe to put in an index" is a different level from
"safe to constant-fold", but I don't really agree with that.

regards, tom lane

mail@joeconway.com

over 1 year ago

In reply to: Tom Lane (#5)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On 7/16/24 16:16, Tom Lane wrote:

Joe Conway <mail@joeconway.com> writes:

So you are proposing we add STATIC to VOLATILE/STABLE/IMMUTABLE (in the
third position before IMMUTABLE), give it IMMUTABLE semantics, mark
builtin functions that deserve it, and document with suitable caution
statements?

What is the point of that, exactly?

I'll agree that the user documentation could use some improvement
in how it describes the volatility levels, but I do not see how
it will reduce anybody's confusion to invent multiple aliases for
what's effectively the same volatility level. Nor do I see a
use-case for actually having multiple versions of "immutable".
Once you've decided you can put something into an index, quibbling
over just how immutable it is doesn't really change anything.

To put this another way: the existing volatility levels were
basically reverse-engineered from the ways that the planner could
meaningfully treat a function: it's dangerous, it is safe enough
to use in an index condition (which changes the number of times
the query will evaluate it), or it's safe to constant-fold in
advance of execution. Unless there's a fourth planner behavior that's
worth having, we don't need a fourth level. Possibly you could
argue that "safe to put in an index" is a different level from
"safe to constant-fold", but I don't really agree with that.

Fair enough, but then I think we should change the documentation to not
say "forever".

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Joe Conway (#6)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

Joe Conway <mail@joeconway.com> writes:

Fair enough, but then I think we should change the documentation to not
say "forever".

No objection to that, it's clearly a misleading definition.

regards, tom lane

David G. Johnston

david.g.johnston@gmail.com

over 1 year ago

In reply to: Tom Lane (#5)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, Jul 16, 2024 at 1:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Joe Conway <mail@joeconway.com> writes:

So you are proposing we add STATIC to VOLATILE/STABLE/IMMUTABLE (in the
third position before IMMUTABLE), give it IMMUTABLE semantics, mark
builtin functions that deserve it, and document with suitable caution
statements?

What is the point of that, exactly?

I'll agree that the user documentation could use some improvement
in how it describes the volatility levels, but I do not see how
it will reduce anybody's confusion to invent multiple aliases for
what's effectively the same volatility level. Nor do I see a
use-case for actually having multiple versions of "immutable".
Once you've decided you can put something into an index, quibbling
over just how immutable it is doesn't really change anything.

I'd teach pg_upgrade to inspect the post-upgraded catalog of is-use
dependencies and report on any of these it finds and remind the DBA that
this latent issue may exist in their system.

I agree the core behaviors of the system would remain unchanged and both
modes would be handled identically. Though requiring superuser or a
predefined role membership to actually use a "static" mode function in an
index or generated expression would be an interesting option to consider.

David J.

schneider@ardentperf.com

over 1 year ago

In reply to: David G. Johnston (#8)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, Jul 16, 2024 at 3:28 PM David G. Johnston <
david.g.johnston@gmail.com> wrote:

I'd teach pg_upgrade to inspect the post-upgraded catalog of in-use
dependencies and report on any of these it finds and remind the DBA that
this latent issue may exist in their system.

Would this help? Collation-related dependency changes are a different thing
from major version DB upgrades

Tom’s point about how the levels are directly tied to concrete differences
in behavior (planner/executor) makes a lot of sense to me

-Jeremy

#10

pgsql@j-davis.com

over 1 year ago

In reply to: David G. Johnston (#8)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, 2024-07-16 at 13:27 -0700, David G. Johnston wrote:

I'd teach pg_upgrade to inspect the post-upgraded catalog of is-use
dependencies and report on any of these it finds and remind the DBA
that this latent issue may exist in their system.

That's impossible to do in a complete way, and hard to do with much
accuracy. I don't oppose it though -- if someone finds a way to provide
enough information to be useful, then that's fine with me.

Regards,
Jeff Davis

#11

Laurenz Albe

laurenz.albe@cybertec.at

over 1 year ago

In reply to: Jeff Davis (#1)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, 2024-07-16 at 10:42 -0700, Jeff Davis wrote:

The IMMUTABLE marker for functions is quite simple on the surface, but
could be interpreted a few different ways, and there's some historical
baggage that makes it complicated.

There are a number of ways in which IMMUTABLE functions can change
behavior:

1. Updating or moving to a different OS affects all collations that use
the libc provider (other than "C" and "POSIX", which don't actually use
libc). LOWER(), INITCAP(), UPPER() and pattern matching are also
affected.

2. Updating ICU affects the collations that use the ICU provider.
ICU_UNICODE_VERSION(), LOWER(), INITCAP(), UPPER() and pattern matching
are also affected.

3. Moving to a different database encoding may affect collations that
use the "C" or "POSIX" locales in the libc provider (NB: those locales
don't actually use libc).

4. A PG Unicode update may change the results of functions that depend
on Unicode. For instance, NORMALIZE(), UNICODE_ASSIGNED(), and
UNICODE_VERSION(). Or, if using the new builtin provider's "C.UTF-8"
locale in version 17, LOWER(), INITCAP(), UPPER(), and pattern matching
(NB: collation itself is not affected -- always code point order).

5. If a well-defined IMMUTABLE function produces the wrong results, we
may fix the bug in the next major release.

6. The GUC extra_float_digits can change the results of floating point
text output.

7. A UDF may be improperly marked IMMUTABLE. A particularly common
variant is a UDF without search_path specified, which is probably not
truly IMMUTABLE.

Noah seemed particularly concerned[1] about #4, so I'll start off by
discussing that.

Unicode updates do not affect collation itself, they
affect affect NORMALIZE(), UNICODE_VERSION(), and UNICODE_ASSIGNED().
If using the builtin "C.UTF-8" locale, they also affect LOWER(),
INITCAP(), UPPER(), and pattern matching. (NB: the builtin collation
provider hasn't yet gone through any Unicode update.)

There are two alternative philosophies:

A. By choosing to use a Unicode-based function, the user has opted in
to the Unicode stability guarantees[2], and it's fine to update Unicode
occasionally in new major versions as long as we are transparent with
the user.

B. IMMUTABLE implies some very strict definition of stability, and we
should never again update Unicode because it changes the results of
IMMUTABLE functions.

We've been following (A), and that's the defacto policy today[3][4].
Noah and Laurenz argued[5] that the policy starting in version 18
should be (B). Given that it's a policy decision that affects more than
just the builtin collation provider, I'd like to discuss it more
broadly outside of that subthread.

[1]
/messages/by-id/20240629220857.fb.nmisch@google.com

[2]
https://www.unicode.org/policies/stability_policy.html

[3]
/messages/by-id/1d178eb1bbd61da1bcfe4a11d6545e9cdcede1d1.camel@j-davis.com

[4]
/messages/by-id/564325.1720297161@sss.pgh.pa.us

[5]
/messages/by-id/af82b292f13dd234790bc701933e9992ee07d4fa.camel@cybertec.at

Concerning #4, the new built-in locale, my hope (and, in my opinion, its only
value) is to get out of the problems #1 and #2 that are not under our control.

If changes in major PostgreSQL versions force users of the built-in
locale provider to rebuild indexes, that would invalidate it. I think that
users care more about data corruption than about exact Unicode-compliant
behavior. Anybody who does can use ICU.

People routinely create indexes that involve upper() or lower(), so I'd
say changing their behavior would be a problem.

Perhaps I should moderate my statement: if a change affects only a newly
introduced code point (which is unlikely to be used in a database), and we
think that the change is very important, we could consider applying it.
But that should be carefully considered; I am against blindly following the
changes in Unicode.

Yours,
Laurenz Albe

#12

pgsql@j-davis.com

over 1 year ago

In reply to: Laurenz Albe (#11)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Fri, 2024-07-19 at 21:06 +0200, Laurenz Albe wrote:

Perhaps I should moderate my statement: if a change affects only a
newly
introduced code point (which is unlikely to be used in a database),
and we
think that the change is very important, we could consider applying
it.
But that should be carefully considered; I am against blindly
following the
changes in Unicode.

That sounds reasonable.

I propose that, going forward, we take more care with Unicode updates:
assess the impact, provide time for comments, and consider possible
mitigations. In other words, it would be reviewed like any other
change.

Ideally, some new developments would make it less worrisome, and
Unicode updates could become more routine. I have some ideas, which I
can propose in separate threads. But for now, I don't see a reason to
rush Unicode updates.

Regards,
Jeff Davis

#13

peter@eisentraut.org

over 1 year ago

In reply to: Jeff Davis (#12)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On 19.07.24 21:41, Jeff Davis wrote:

On Fri, 2024-07-19 at 21:06 +0200, Laurenz Albe wrote:

Perhaps I should moderate my statement: if a change affects only a
newly
introduced code point (which is unlikely to be used in a database),
and we
think that the change is very important, we could consider applying
it.
But that should be carefully considered; I am against blindly
following the
changes in Unicode.

That sounds reasonable.

I propose that, going forward, we take more care with Unicode updates:
assess the impact, provide time for comments, and consider possible
mitigations. In other words, it would be reviewed like any other
change.

I disagree with that. We should put ourselves into the position to
adopt new Unicode versions without fear. Similar to updates to time
zones, snowball, etc.

We can't be discussing the merits of the Unicode update every year.
That would be madness. How would we weigh each change against the
others? Some new character is introduced because it's the new currency
of some country; seems important. Some mobile phone platforms jumped
the gun and already use the character for the same purpose before it was
assigned; now the character is in databases but some function results
will change with the upgrade. How do we proceed?

Moreover, if we were to decide to not take a particular Unicode update,
that would then stop that process forever, because whatever the issue
was wouldn't go away with the next Unicode version.

Unless I missed something here, all the problem examples involve
unassigned code points that were later assigned. (Assigned code points
already have compatibility mechanisms, such as collation versions.) So
I would focus on that issue. We already have a mechanism to disallow
unassigned code points. So there is a tradeoff that users can make:
Disallow unassigned code points and avoid upgrade issues resulting from
them. Maybe that just needs to be documented more prominently.

#14

robertmhaas@gmail.com

over 1 year ago

In reply to: Peter Eisentraut (#13)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Mon, Jul 22, 2024 at 10:26 AM Peter Eisentraut <peter@eisentraut.org> wrote:

I disagree with that. We should put ourselves into the position to
adopt new Unicode versions without fear. Similar to updates to time
zones, snowball, etc.

We can't be discussing the merits of the Unicode update every year.
That would be madness.

Yeah, I agree with that 100%. I can't imagine that we want to, in
effect, develop our own version of Unicode that is not quite the same
as upstream.

We've got to figure out a way to fix this problem from the other end -
coping with updates when they happen. I feel like we've already
discussed the obvious approach at some length: have a way to mark
indexes invalid when "immutable" things change. That doesn't fix
everything because you could, for example, manufacture constraint
violations, even if there are no relevant indexes, so maybe index
invalidation wouldn't be the only thing we'd ever need to do, but it
would help a lot. In view of Jeff's list at the start of the thread,
maybe that mechanism needs to be more general than just
collation-related stuff: maybe there should be a general way to say
"oopsie, this index can't be relied upon until it's rebuit" and a user
could manually do that if they change the definition of an immutable
function. Or there could even be some flag to CREATE FUNCTION that
triggers it for all dependent indexes. I'm not really sure.

If I remember correctly, Thomas Munro put a good deal of work into
developing specifically for collation definition changes a few
releases ago and it was judged not good enough, but that means we just
still have nothing, which is unfortunate considering how often things
go wrong in this area.

--
Robert Haas
EDB: http://www.enterprisedb.com

#15

pgsql@j-davis.com

over 1 year ago

In reply to: Peter Eisentraut (#13)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Mon, 2024-07-22 at 16:26 +0200, Peter Eisentraut wrote:

Unless I missed something here, all the problem examples involve
unassigned code points that were later assigned.

For normalization and case mapping that's right.

For regexes, a character property could change. But that's mostly a
theoretical problem because, at least in my experience, I can't recall
ever seeing an index that would be affected.

Regards,
Jeff Davis

#16

/messages/by-id/20240629220857.fb.nmisch@google.com

pgsql@j-davis.com

over 1 year ago

In reply to: Robert Haas (#14)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Mon, 2024-07-22 at 11:14 -0400, Robert Haas wrote:

On Mon, Jul 22, 2024 at 10:26 AM Peter Eisentraut
<peter@eisentraut.org> wrote:

I disagree with that. We should put ourselves into the position to
adopt new Unicode versions without fear. Similar to updates to
time
zones, snowball, etc.

We can't be discussing the merits of the Unicode update every year.
That would be madness.

Yeah, I agree with that 100%.

It's hard for me to argue; that was my reasoning during development.

But Noah seems to have a very strong opinion on this matter:

and I thought this thread would be a better opportunity for him to
express it. Noah?

In view of Jeff's list at the start of the thread,
maybe that mechanism needs to be more general than just
collation-related stuff: maybe there should be a general way to say
"oopsie, this index can't be relied upon until it's rebuit"

...

If I remember correctly, Thomas Munro put a good deal of work into
developing specifically for collation definition changes a few
releases ago and it was judged not good enough,

Yeah, see ec48314708. The revert appears to be for a number of
technical reasons, but even if we solve all of those, it's hard to have
a perfect solution that accounts for plpgsql functions that create
arbitrary query strings and EXECUTE them.

Though perhaps not impossible if we use some kind of runtime detection.
We could have some kind of global context that tracks, at runtime, when
an expression is executing for the purposes of an index. If a function
depends on a versioned collation, then mark the index or add a version
somewhere.

Regards,
Jeff Davis

#17

Laurenz Albe

laurenz.albe@cybertec.at

over 1 year ago

In reply to: Peter Eisentraut (#13)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Mon, 2024-07-22 at 16:26 +0200, Peter Eisentraut wrote:

I propose that, going forward, we take more care with Unicode updates:

assess the impact, provide time for comments, and consider possible
mitigations. In other words, it would be reviewed like any other
change.

I disagree with that. We should put ourselves into the position to
adopt new Unicode versions without fear. Similar to updates to time
zones, snowball, etc.

We can't be discussing the merits of the Unicode update every year.
That would be madness. How would we weigh each change against the
others? Some new character is introduced because it's the new currency
of some country; seems important. Some mobile phone platforms jumped
the gun and already use the character for the same purpose before it was
assigned; now the character is in databases but some function results
will change with the upgrade. How do we proceed?

Moreover, if we were to decide to not take a particular Unicode update,
that would then stop that process forever, because whatever the issue
was wouldn't go away with the next Unicode version.

I understand the difficulty (madness) of discussing every Unicode
change. If that's unworkable, my preference would be to stick with some
Unicode version and never modify it, ever.

The choice that users could make in that case is

a) use the built-in provider, don't get proper support for new code
points, but never again worry about corrupted indexes after an
upgrade

b) use ICU collations, be up to date with Unicode, but reindex whenever
you upgrade to a new ICU version

Unless I missed something here, all the problem examples involve
unassigned code points that were later assigned. (Assigned code points
already have compatibility mechanisms, such as collation versions.) So
I would focus on that issue. We already have a mechanism to disallow
unassigned code points. So there is a tradeoff that users can make:
Disallow unassigned code points and avoid upgrade issues resulting from
them. Maybe that just needs to be documented more prominently.

Are you proposing a switch that would make PostgreSQL error out if
somebody wants to use an unassigned code point? That would be an option.
If what you mean is just add some documentation that tells people not
to use unassigned code points if they want to avoid a reindex, I'd say
that is not enough.

Yours,
Laurenz Albe

#18

/messages/by-id/a0e85aca6e03042881924c4b31a840a915a9d349.camel@j-davis.com

pgsql@j-davis.com

over 1 year ago

In reply to: Laurenz Albe (#17)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Mon, 2024-07-22 at 19:18 +0200, Laurenz Albe wrote:

I understand the difficulty (madness) of discussing every Unicode
change. If that's unworkable, my preference would be to stick with
some
Unicode version and never modify it, ever.

Among all the ways that IMMUTABLE and indexes can go wrong, is there a
reason why you think we should draw such a bright line in this one
case?

Are you proposing a switch that would make PostgreSQL error out if
somebody wants to use an unassigned code point? That would be an
option.

You can use a CHECK(UNICODE_ASSIGNED(t)) in version 17, and in version
18 I have a proposal here to make it a database-level option:

(Note: the proposal might have a few holes in it, I didn't look at it
lately and nobody has commented yet.)

Regards,
Jeff Davis

#19

Isaac Morland

isaac.morland@gmail.com

over 1 year ago

In reply to: Jeff Davis (#18)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Mon, 22 Jul 2024 at 13:51, Jeff Davis <pgsql@j-davis.com> wrote:

Are you proposing a switch that would make PostgreSQL error out if
somebody wants to use an unassigned code point? That would be an
option.

You can use a CHECK(UNICODE_ASSIGNED(t)) in version 17, and in version
18 I have a proposal here to make it a database-level option:

And if you define a domain over text with this check, you would effectively
have a type that works exactly like text except you can only store assigned
code points in it. Then use that instead of text everywhere (easy to audit
with a query over the system tables).

#20

robertmhaas@gmail.com

over 1 year ago

In reply to: Laurenz Albe (#17)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Mon, Jul 22, 2024 at 1:18 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

I understand the difficulty (madness) of discussing every Unicode
change. If that's unworkable, my preference would be to stick with some
Unicode version and never modify it, ever.

I think that's a completely non-viable way forward. Even if everyone
here voted in favor of that, five years from now there will be someone
who shows up to say "I can't use your crappy software because the
Unicode tables haven't been updated in five years, here's a patch!".
And, like, what are we going to do? Still keeping shipping the 2024
version of Unicode four hundred years from now, assuming humanity and
civilization and PostgreSQL are still around then? Holding something
still "forever" is just never going to work.

Every other piece of software in the world has to deal with changes as
a result of the addition of new code points, and probably less
commonly, revisions to existing code points. Presumably, their stuff
breaks too, from time to time. I mean, I find it a bit difficult to
believe that web browsers or messaging applications on phones only
ever display emoji, and never try to do any sort of string sorting.
The idea that PostgreSQL is the only thing that ever sorts strings
cannot be taken seriously. So other people are presumably hacking
around this in some way appropriate to what their software does, and
we're going to have to figure out how to do the same thing. We could
of course sit here and talk about whether it's really a good of the
Unicode folks to add a lime emoji and a bunch of new emojis of people
proceeding in a rightward direction to complement the existing emojis
of people proceeding in a leftward direction, but they are going to do
that whether we like it or not, and people -- including me, I'm afraid
-- are going to use those emojis once they show up, so software that
wants to remain relevant is going to have to support them.

--
Robert Haas
EDB: http://www.enterprisedb.com

#21

Laurenz Albe

laurenz.albe@cybertec.at

over 1 year ago

In reply to: Robert Haas (#20)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Mon, 2024-07-22 at 13:55 -0400, Robert Haas wrote:

On Mon, Jul 22, 2024 at 1:18 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

I understand the difficulty (madness) of discussing every Unicode
change. If that's unworkable, my preference would be to stick with some
Unicode version and never modify it, ever.

I think that's a completely non-viable way forward. Even if everyone
here voted in favor of that, five years from now there will be someone
who shows up to say "I can't use your crappy software because the
Unicode tables haven't been updated in five years, here's a patch!".
And, like, what are we going to do? Still keeping shipping the 2024
version of Unicode four hundred years from now, assuming humanity and
civilization and PostgreSQL are still around then? Holding something
still "forever" is just never going to work.

I hear you. It would be interesting to know what other RDBMS do here.

Yours,
Laurenz Albe

#22

robertmhaas@gmail.com

over 1 year ago

In reply to: Laurenz Albe (#21)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, Jul 23, 2024 at 3:11 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

I hear you. It would be interesting to know what other RDBMS do here.

Yeah, I agree.

--
Robert Haas
EDB: http://www.enterprisedb.com

#23

schneider@ardentperf.com

over 1 year ago

In reply to: Laurenz Albe (#21)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, Jul 23, 2024 at 1:11 AM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:

On Mon, 2024-07-22 at 13:55 -0400, Robert Haas wrote:

On Mon, Jul 22, 2024 at 1:18 PM Laurenz Albe <laurenz.albe@cybertec.at>

wrote:

I understand the difficulty (madness) of discussing every Unicode
change. If that's unworkable, my preference would be to stick with

some

Unicode version and never modify it, ever.

I think that's a completely non-viable way forward. Even if everyone
here voted in favor of that, five years from now there will be someone
who shows up to say "I can't use your crappy software because the
Unicode tables haven't been updated in five years, here's a patch!".
And, like, what are we going to do? Still keeping shipping the 2024
version of Unicode four hundred years from now, assuming humanity and
civilization and PostgreSQL are still around then? Holding something
still "forever" is just never going to work.

I hear you. It would be interesting to know what other RDBMS do here.

Other RDBMS are very careful not to corrupt databases, afaik including
function based indexes, by changing Unicode. I’m not aware of any other
RDBMS that updates Unicode versions in place; instead they support multiple
Unicode versions and do not drop the old ones.

I know Jeff mentioned that Unicode tables copied into Postgres for
normalization have been updated a few times. Did anyone ever actually
discuss the fact that things like function based indexes can be corrupted
by this, and weigh the reasoning? Are there past mailing list threads
touching on the corruption problem and making the argument why updating
anyway is the right thing to do? I always assumed that nobody had really
dug deeply into this before the last few years.

I do agree it isn’t as broad of a problem as linguistic collation itself,
which causes a lot more widespread corruption when it changes (as we’ve
seen from glibc 2.28 and also other older hacker mailing list threads about
smaller changes in older glibc versions corrupting databases). For now,
Postgres only has code-point collation and the other Unicode functions
mentioned in this thread.

-Jeremy

#24

robertmhaas@gmail.com

over 1 year ago

In reply to: Jeremy Schneider (#23)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, Jul 23, 2024 at 8:32 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:

Other RDBMS are very careful not to corrupt databases, afaik including function based indexes, by changing Unicode. I’m not aware of any other RDBMS that updates Unicode versions in place; instead they support multiple Unicode versions and do not drop the old ones.

See also:
/messages/by-id/E8754F74-C65F-4A1A-826F-FD9F37599A2E@ardentperf.com

Hmm. I think we might have some unique problems due to the fact that
we rely partly on the operating system behavior, partly on libicu, and
partly on our own internal tables.

--
Robert Haas
EDB: http://www.enterprisedb.com

#25

Noah Misch

noah@leadboat.com

over 1 year ago

In reply to: Jeff Davis (#16)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Mon, Jul 22, 2024 at 09:34:42AM -0700, Jeff Davis wrote:

On Mon, 2024-07-22 at 11:14 -0400, Robert Haas wrote:

On Mon, Jul 22, 2024 at 10:26 AM Peter Eisentraut <peter@eisentraut.org> wrote:

I disagree with that. We should put ourselves into the position to
adopt new Unicode versions without fear. Similar to updates to
time
zones, snowball, etc.

We can't be discussing the merits of the Unicode update every year.
That would be madness.

Yeah, I agree with that 100%.

It's hard for me to argue; that was my reasoning during development.

But Noah seems to have a very strong opinion on this matter:

/messages/by-id/20240629220857.fb.nmisch@google.com

and I thought this thread would be a better opportunity for him to
express it. Noah?

Long-term, we should handle this like Oracle, SQL Server, and DB2 do:
/messages/by-id/CA+fnDAbmn2d5tzZsj-4wmD0jApHTsg_zGWUpteb=OMSsX5rdAg@mail.gmail.com

Short-term, we should remedy the step backward that pg_c_utf8 has taken:
/messages/by-id/20240718233908.52.nmisch@google.com
/messages/by-id/486d71991a3f80ec1c47e1bd7931e2ef3627b6b3.camel@cybertec.at

$SUBJECT has proposed remedy "take more care with Unicode updates". If one
wanted to pursue that, it should get more specific, by giving one or both of:

(a) principles for deciding whether a Unicode update is okay
(b) examples of past Unicode release changes and whether PostgreSQL should
adopt a future Unicode version making a similar change

That said, I'm not aware of an (a) or (b) likely to create an attractive
compromise between the "index scan agrees with seqscan after pg_upgrade" goal
(/messages/by-id/20240706195129.fd@rfd.leadboat.com) and the "don't freeze
Unicode data" goal
(/messages/by-id/CA+TgmoZRpOFVmQWKEXHdcKj9AFLbXT5ouwtXa58J=3ydLP00ZQ@mail.gmail.com).
The "long-term" above would satisfy both goals. If it were me, I would
abandon the "more care" proposal.

#26

pgsql@j-davis.com

over 1 year ago

In reply to: Robert Haas (#24)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, 2024-07-23 at 08:49 -0400, Robert Haas wrote:

Hmm. I think we might have some unique problems due to the fact that
we rely partly on the operating system behavior, partly on libicu,
and
partly on our own internal tables.

The reliance on the OS is especially problematic for reasons that have
already been discussed extensively.

One of my strongest motivations for PG_C_UTF8 was that there was still
a use case for libc in PG16: the "C.UTF-8" locale, which is not
supported at all in ICU. Daniel Vérité made me aware of the importance
of this locale, which offers code point order collation combined with
Unicode ctype semantics.

With PG17, between ICU and the builtin provider, there's little
remaining reason to use libc (aside from legacy).

Regards,
Jeff Davis

#27

robertmhaas@gmail.com

over 1 year ago

In reply to: Jeff Davis (#26)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, Jul 23, 2024 at 1:03 PM Jeff Davis <pgsql@j-davis.com> wrote:

One of my strongest motivations for PG_C_UTF8 was that there was still
a use case for libc in PG16: the "C.UTF-8" locale, which is not
supported at all in ICU. Daniel Vérité made me aware of the importance
of this locale, which offers code point order collation combined with
Unicode ctype semantics.

With PG17, between ICU and the builtin provider, there's little
remaining reason to use libc (aside from legacy).

I was really interested to read Jeremy Schneider's slide deck, to
which he linked earlier, wherein he explained that other major
databases default to something more like C.UTF-8. Maybe we need to
relitigate the debate about what our default should be in light of
those findings (but, if so, on another thread with a clear subject
line). But even if we were to decide to change the default, there are
lots and lots of existing databases out there that are using libc
collations. I'm not in a good position to guess how many of those
people actually truly care about language-specific collations. I'm
positive it's not zero, but I can't really guess how much more than
zero it is. Even if it were zero, though, the fact that so many
upgrades are done using pg_upgrade means that this problem will still
be around in a decade even if we changed the default tomorrow.

(I do understand that you wrote "aside from legacy" so I'm not
accusing you of ignoring the upgrade issues, just taking the
opportunity to be more explicit about my own view.)

Also, Noah has pointed out that C.UTF-8 introduces some
forward-compatibility hazards of its own, at least with respect to
ctype semantics. I don't have a clear view of what ought to be done
about that, but if we just replace a dependency on an unstable set of
libc definitions with a dependency on an equally unstable set of
PostgreSQL definitions, we're not really winning. Do we need to
version the new ctype provider?

--
Robert Haas
EDB: http://www.enterprisedb.com

#28

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Robert Haas (#27)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

Robert Haas <robertmhaas@gmail.com> writes:

Also, Noah has pointed out that C.UTF-8 introduces some
forward-compatibility hazards of its own, at least with respect to
ctype semantics. I don't have a clear view of what ought to be done
about that, but if we just replace a dependency on an unstable set of
libc definitions with a dependency on an equally unstable set of
PostgreSQL definitions, we're not really winning.

No, I think we *are* winning, because the updates are not "equally
unstable": with pg_c_utf8, we control when changes happen. We can
align them with major releases and release-note the differences.
With libc-based collations, we have zero control and not much
notification.

Do we need to version the new ctype provider?

It would be a version for the underlying Unicode definitions,
not the provider as such, but perhaps yes. I don't know to what
extent doing so would satisfy Noah's concern; but if it would do
so I'd be happy with that answer.

regards, tom lane

#29

mail@joeconway.com

over 1 year ago

In reply to: Tom Lane (#28)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On 7/23/24 15:26, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Also, Noah has pointed out that C.UTF-8 introduces some
forward-compatibility hazards of its own, at least with respect to
ctype semantics. I don't have a clear view of what ought to be done
about that, but if we just replace a dependency on an unstable set of
libc definitions with a dependency on an equally unstable set of
PostgreSQL definitions, we're not really winning.

No, I think we *are* winning, because the updates are not "equally
unstable": with pg_c_utf8, we control when changes happen. We can
align them with major releases and release-note the differences.
With libc-based collations, we have zero control and not much
notification.

Do we need to version the new ctype provider?

It would be a version for the underlying Unicode definitions,
not the provider as such, but perhaps yes. I don't know to what
extent doing so would satisfy Noah's concern; but if it would do
so I'd be happy with that answer.

I came to the same conclusion. I think someone mentioned somewhere on
this thread that other databases support multiple Unicode versions. I
think we need to figure out how to do that too.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#30

pgsql@j-davis.com

over 1 year ago

In reply to: Tom Lane (#28)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, 2024-07-23 at 15:26 -0400, Tom Lane wrote:

No, I think we *are* winning, because the updates are not "equally
unstable": with pg_c_utf8, we control when changes happen. We can
align them with major releases and release-note the differences.
With libc-based collations, we have zero control and not much
notification.

Also, changes to libc collations are much more impactful, at least two
orders of magnitude. All indexes on text are at risk, even primary
keys.

PG_C_UTF8 has stable code point ordering (memcmp()) that is unaffected
by Unicode updates, so primary keys will never be affected. The risks
we are talking about are for expression indexes, e.g. on LOWER(). Even
if you do have such expression indexes, the types of changes Unicode
makes to casing and character properties are typically much more mild.

Regards,
Jeff Davis

#31

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Jeff Davis (#30)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

Jeff Davis <pgsql@j-davis.com> writes:

On Tue, 2024-07-23 at 15:26 -0400, Tom Lane wrote:

No, I think we *are* winning, because the updates are not "equally
unstable": with pg_c_utf8, we control when changes happen. We can
align them with major releases and release-note the differences.
With libc-based collations, we have zero control and not much
notification.

Also, changes to libc collations are much more impactful, at least two
orders of magnitude. All indexes on text are at risk, even primary
keys.

Well, it depends on which libc collation you have in mind. I was
thinking of a libc-supplied C.UTF-8 collation, which I would expect
to behave the same as pg_c_utf8, modulo which Unicode version it's
based on. But even when comparing to that, pg_c_utf8 can win on
stability for the reasons I stated. If you don't have a C.UTF-8
collation available, and are forced to use en_US.UTF-8 or
$locale-of-choice, then the stability picture is far more dire,
as Jeff says.

Noah seems to be comparing the stability of pg_c_utf8 to the stability
of a pure C/POSIX collation, but I do not think that is the relevant
comparison to make. Besides, if someone is using C/POSIX, this
feature doesn't stop them from continuing to do so.

regards, tom lane

#32

pgsql@j-davis.com

over 1 year ago

In reply to: Noah Misch (#25)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, 2024-07-23 at 07:39 -0700, Noah Misch wrote:

we should remedy the step backward that pg_c_utf8 has taken:

Obviously I disagree that we've taken a step backwards.

Can you articulate the principle by which all of the other problems
with IMMUTABLE are just fine, but updates to Unicode are intolerable,
and only for PG_C_UTF8?

Regards,
Jeff Davis

#33

robertmhaas@gmail.com

over 1 year ago

In reply to: Tom Lane (#28)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, Jul 23, 2024 at 3:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

No, I think we *are* winning, because the updates are not "equally
unstable": with pg_c_utf8, we control when changes happen. We can
align them with major releases and release-note the differences.
With libc-based collations, we have zero control and not much
notification.

OK, that's pretty fair.

Do we need to version the new ctype provider?

It would be a version for the underlying Unicode definitions,
not the provider as such, but perhaps yes. I don't know to what
extent doing so would satisfy Noah's concern; but if it would do
so I'd be happy with that answer.

I don't see how we can get by without some kind of versioning here.
It's probably too late to do that for v17, but if we bet either that
(1) we'll never need to change anything for pg_c_utf8 or that (2)
those changes will be so minor that nobody will have a problem, I
think we will lose our bet.

--
Robert Haas
EDB: http://www.enterprisedb.com

#34

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Robert Haas (#33)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Jul 23, 2024 at 3:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Do we need to version the new ctype provider?

It would be a version for the underlying Unicode definitions,
not the provider as such, but perhaps yes. I don't know to what
extent doing so would satisfy Noah's concern; but if it would do
so I'd be happy with that answer.

I don't see how we can get by without some kind of versioning here.
It's probably too late to do that for v17,

Why? If we agree that that's the way forward, we could certainly
stick some collversion other than "1" into pg_c_utf8's pg_collation
entry. There's already been one v17 catversion bump since beta2
(716bd12d2), so another one is basically free.

regards, tom lane

#35

Daniel Verite

daniel@manitou-mail.org

over 1 year ago

In reply to: Tom Lane (#34)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

Tom Lane wrote:

I don't see how we can get by without some kind of versioning here.
It's probably too late to do that for v17,

Why? If we agree that that's the way forward, we could certainly
stick some collversion other than "1" into pg_c_utf8's pg_collation
entry. There's already been one v17 catversion bump since beta2
(716bd12d2), so another one is basically free.

pg_collation.collversion has been used so far for the sort part
of the collations.

For the ctype part:

postgres=# select unicode_version();
unicode_version
-----------------
15.1
(1 row)

postgres=# select icu_unicode_version ();
icu_unicode_version
---------------------
14.0
(1 row)

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#36

peter@eisentraut.org

over 1 year ago

In reply to: Robert Haas (#20)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On 22.07.24 19:55, Robert Haas wrote:

Every other piece of software in the world has to deal with changes as
a result of the addition of new code points, and probably less
commonly, revisions to existing code points. Presumably, their stuff
breaks too, from time to time. I mean, I find it a bit difficult to
believe that web browsers or messaging applications on phones only
ever display emoji, and never try to do any sort of string sorting.

The sorting isn't the problem. We have a versioning mechanism for
collations. What we do with the version information is clearly not
perfect yet, but the mechanism exists and you can hack together queries
that answer the question, did anything change here that would affect my
indexes. And you could build more tooling around that and so on.

The problem being considered here are updates to Unicode itself, as
distinct from the collation tables. A Unicode update can impact at
least two things:

- Code points that were previously unassigned are now assigned. That's
obviously a very common thing with every Unicode update. The new
character will have new properties attached to it, so the result of
various functions that use such properties (upper(), lower(),
normalize(), etc.) could change, because previously the code point had
no properties, and so those functions would not do anything interesting
with the character.

- Certain properties of an existing character can change. Like, a
character used to be a letter and now it's a digit. (This is an
example; I'm not sure if that particular change would be allowed.) In
the extreme case, this could have the same impact as the above, but in
practice the kinds of changes that are allowed wouldn't affect typical
indexes.

I don't think this has anything in particular to do with the new builtin
collation provider. That is just one new consumer of this.

#37

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Daniel Verite (#35)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

"Daniel Verite" <daniel@manitou-mail.org> writes:

Tom Lane wrote:

Why? If we agree that that's the way forward, we could certainly
stick some collversion other than "1" into pg_c_utf8's pg_collation
entry. There's already been one v17 catversion bump since beta2
(716bd12d2), so another one is basically free.

pg_collation.collversion has been used so far for the sort part
of the collations.

Hmm, we haven't particularly drawn a distinction between sort-related
and not-sort-related aspects of collation versions AFAIK. Perhaps
it'd be appropriate to do so, and I agree that there's not time to
design such a thing for v17. But pg_c_utf8 might be the only case
where we could do anything other than advance those versions in
lockstep. I doubt we have enough insight into the behaviors of
other providers to say confidently that an update affects only
one side of their behavior.

regards, tom lane

#38

[1]: /messages/by-id/8a3dc06f-9b9d-4ed7-9a12-2070d8b0165f@manitou-mail.org
/messages/by-id/8a3dc06f-9b9d-4ed7-9a12-2070d8b0165f@manitou-mail.org

pgsql@j-davis.com

over 1 year ago

In reply to: Tom Lane (#31)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, 2024-07-23 at 16:07 -0400, Tom Lane wrote:

Well, it depends on which libc collation you have in mind. I was
thinking of a libc-supplied C.UTF-8 collation, which I would expect
to behave the same as pg_c_utf8, modulo which Unicode version it's
based on.

Daniel Vérité documented[1]/messages/by-id/8a3dc06f-9b9d-4ed7-9a12-2070d8b0165f@manitou-mail.org cases where the libc C.UTF-8 locale changed
the *sort* behavior, thereby affecting primary keys.

Regards,
Jeff Davis

#39

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Jeff Davis (#38)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

Jeff Davis <pgsql@j-davis.com> writes:

On Tue, 2024-07-23 at 16:07 -0400, Tom Lane wrote:

Well, it depends on which libc collation you have in mind. I was
thinking of a libc-supplied C.UTF-8 collation, which I would expect
to behave the same as pg_c_utf8, modulo which Unicode version it's
based on.

Daniel Vérité documented[1] cases where the libc C.UTF-8 locale changed
the *sort* behavior, thereby affecting primary keys.

Ouch. But we didn't establish whether that was an ancient bug,
or something likely to happen again. (In any case, that surely
reinforces the point that we can expect pg_c_utf8 to be more
stable than any previously-available alternative.)

regards, tom lane

#40

robertmhaas@gmail.com

over 1 year ago

In reply to: Peter Eisentraut (#36)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, Jul 23, 2024 at 4:36 PM Peter Eisentraut <peter@eisentraut.org> wrote:

The sorting isn't the problem. We have a versioning mechanism for
collations. What we do with the version information is clearly not
perfect yet, but the mechanism exists and you can hack together queries
that answer the question, did anything change here that would affect my
indexes. And you could build more tooling around that and so on.

In my experience, sorting is, overwhelmingly, the problem. What people
complain about is that they do an upgrade - of PG or some OS package -
and then their indexes are broken. Or their partition bounds are
broken.

That we have versioning information that someone could hypothetically
know how to do something useful with is not really useful, because
nobody actually knows how to do it, and there's nothing to trigger
them to do it in the first place. People don't think "oh, I'm running
dnf update, I better run undocumented queries against the PostgreSQL
system catalogs to see whether my system is going to melt afterwards."

What needs to happen is that when you do something that breaks
something, something notices automatically and tells you and gives you
a way to get it fixed again. Or better yet, when you do something that
would break something as things stand today, some kind of versioning
logic kicks in and you keep the old behavior and nothing actually
breaks.

--
Robert Haas
EDB: http://www.enterprisedb.com

#41

pgsql@j-davis.com

over 1 year ago

In reply to: Robert Haas (#40)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, 2024-07-23 at 21:37 -0400, Robert Haas wrote:

In my experience, sorting is, overwhelmingly, the problem.

I strongly agree.

That we have versioning information that someone could hypothetically
know how to do something useful with is not really useful, because
nobody actually knows how to do it

Including me. I put significant effort into creating some views that
could help users identify potentially-affected indexes based on
collation changes, and I gave up. In theory it's just about impossible
(consider some UDF that constructs queries and EXECUTEs them -- what
collations does that depend on?). In practice, it's not much easier,
and you might as well just reindex everything having to do with text.

In contrast, if the problem is CTYPE-related, users are in a much
better position. It won't affect their primary keys or most indexes.
It's much more tractable to review your expression indexes and look for
problems (not ideal, but better). Also, as Peter points out, CTYPE
changes are typically more narrow, so there's a good chance that
there's no problem at all.

Regards,
Jeff Davis

#42

peter@eisentraut.org

over 1 year ago

In reply to: Robert Haas (#40)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On 24.07.24 03:37, Robert Haas wrote:

On Tue, Jul 23, 2024 at 4:36 PM Peter Eisentraut <peter@eisentraut.org> wrote:

The sorting isn't the problem. We have a versioning mechanism for
collations. What we do with the version information is clearly not
perfect yet, but the mechanism exists and you can hack together queries
that answer the question, did anything change here that would affect my
indexes. And you could build more tooling around that and so on.

In my experience, sorting is, overwhelmingly, the problem. What people
complain about is that they do an upgrade - of PG or some OS package -
and then their indexes are broken. Or their partition bounds are
broken.

Fair enough. My argument was, that topic is distinct from the topic of
this thread.

#43

Noah Misch

noah@leadboat.com

over 1 year ago

In reply to: Jeff Davis (#32)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, Jul 23, 2024 at 01:07:49PM -0700, Jeff Davis wrote:

On Tue, 2024-07-23 at 07:39 -0700, Noah Misch wrote:

Short-term, we should remedy the step backward that pg_c_utf8 has taken:
/messages/by-id/20240718233908.52.nmisch@google.com
/messages/by-id/486d71991a3f80ec1c47e1bd7931e2ef3627b6b3.camel@cybertec.at

Obviously I disagree that we've taken a step backwards.

Yes.

Can you articulate the principle by which all of the other problems
with IMMUTABLE are just fine, but updates to Unicode are intolerable,
and only for PG_C_UTF8?

No, because I don't think all the other problems with IMMUTABLE are just fine.
The two messages linked cover the comparisons I do consider important,
especially the comparison between pg_c_utf8 and packager-frozen ICU.

#44

robertmhaas@gmail.com

over 1 year ago

In reply to: Peter Eisentraut (#42)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Wed, Jul 24, 2024 at 12:42 AM Peter Eisentraut <peter@eisentraut.org> wrote:

Fair enough. My argument was, that topic is distinct from the topic of
this thread.

OK, that's fair. But I think the solutions are the same: we complain
all the time about glibc and ICU shipping collations and not
versioning them. We shouldn't make the same kinds of mistakes. Even if
ctype is less likely to break things than collations, it still can,
and we should move in the direction of letting people keep the v17
behavior for the foreseeable future while at the same time having a
way that they can also get the new behavior if they want it (and the
new behavior should be the default).

I note in passing that the last time I saw a customer query with
UPPER() in the join clause was... yesterday. The problems there had
nothing to do with CTYPE, but there's no reason to suppose that it
couldn't have had such a problem. I suspect the reason we don't hear
about ctype problems now is that the collation problems are worse and
happen in similar situations. But if all the collation problems went
away, a subset of the same users would then be unhappy about ctype.

So I don't want to see us sit on our hands and assert that we don't
need to worry about ctype because it's minor in comparison with
collation. It *is* minor in comparison with collation. But one problem
can be small in comparison with another and still bad. If an aircraft
is on fire whilst experiencing a dual engine failure, it's still in a
lot of trouble even if the fire can be put out.

--
Robert Haas
EDB: http://www.enterprisedb.com

#45

/messages/by-id/BA6132ED-1F6B-4A0B-AC22-81278F5AB81E@tripadvisor.com

schneider@ardentperf.com

over 1 year ago

In reply to: Robert Haas (#44)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Wed, Jul 24, 2024 at 6:20 AM Robert Haas <robertmhaas@gmail.com> wrote:

I note in passing that the last time I saw a customer query with
UPPER() in the join clause was... yesterday. The problems there had
nothing to do with CTYPE, but there's no reason to suppose that it
couldn't have had such a problem. I suspect the reason we don't hear
about ctype problems now is that the collation problems are worse and
happen in similar situations. But if all the collation problems went
away, a subset of the same users would then be unhappy about ctype.

I have seen and created indexes on upper() functions a number of times too,
and I think this is not an uncommon pattern for case insensitive searching

Before glibc 2.28, there was at least one mailing list thread where an
unhappy person complained about collation problems; but for a number of
years before 2.28 I guess the collation changes were uncommon so it didn’t
get enough momentum to be considered a real problem until the problem
became widespread a few years ago?

I myself would prefer an approach here that sets a higher bar for
pg_upgrade not corrupting indexes, rather than saying it’s ok as long as
it’s rare

-Jeremy

#46

pgsql@j-davis.com

over 1 year ago

In reply to: Jeremy Schneider (#23)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Tue, 2024-07-23 at 06:31 -0600, Jeremy Schneider wrote:

Other RDBMS are very careful not to corrupt databases, afaik
including function based indexes, by changing Unicode. I’m not aware
of any other RDBMS that updates Unicode versions in place; instead
they support multiple Unicode versions and do not drop the old ones.

I'm curious about the details of what other RDBMSs do.

Let's simplify and say that there's one database-wide collation at
version 1, and the application doesn't use any COLLATE clause or other
specifications for queries or DDL.

Then, version 2 of that collation becomes available. When a query comes
into the database, which version of the collation does it use, 1 or 2?
If it uses the latest available (version 2), then all the old indexes
are effectively useless.

So I suppose there's some kind of migration process where you
rebuild/fix objects to use the new collation, and when that's done then
you change the default so that queries use version 2. How does all that
work?

Regards,
Jeff Davis

#47

pgsql@j-davis.com

over 1 year ago

In reply to: Robert Haas (#44)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Wed, 2024-07-24 at 08:20 -0400, Robert Haas wrote:

I note in passing that the last time I saw a customer query with
UPPER() in the join clause was... yesterday.

Can you expand on that? This thread is mostly about durable state so I
don't immediately see the connection.

So I don't want to see us sit on our hands and assert that we don't
need to worry about ctype because it's minor in comparison with
collation. It *is* minor in comparison with collation.

...

But one problem
can be small in comparison with another and still bad. If an aircraft
is on fire whilst experiencing a dual engine failure, it's still in a
lot of trouble even if the fire can be put out.

There's a qualitative difference between a collation update which can
break your PKs and FKs, and a ctype update which definitely will not.
Your analogy doesn't quite capture this distinction. I don't mean to
over-emphasize this point, but I do think we need to keep some
perspective here.

But I agree with your general point that we shouldn't dismiss the
problem just because it's minor. We should expect the problem to
surface at some point and be reasonably prepared.

Regards,
Jeff Davis

#48

peter@eisentraut.org

over 1 year ago

In reply to: Robert Haas (#44)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On 24.07.24 14:20, Robert Haas wrote:

On Wed, Jul 24, 2024 at 12:42 AM Peter Eisentraut <peter@eisentraut.org> wrote:

Fair enough. My argument was, that topic is distinct from the topic of
this thread.

OK, that's fair. But I think the solutions are the same: we complain
all the time about glibc and ICU shipping collations and not
versioning them. We shouldn't make the same kinds of mistakes. Even if
ctype is less likely to break things than collations, it still can,
and we should move in the direction of letting people keep the v17
behavior for the foreseeable future while at the same time having a
way that they can also get the new behavior if they want it (and the
new behavior should be the default).

Versioning is possibly part of the answer, but I think it would be
different versioning from the collation version.

The collation versions are in principle designed to change rarely. Some
languages' rules might change once in twenty years, some never. Maybe
you have a database mostly in English and a few tables in, I don't know,
Swedish (unverified examples). Most of the time nothing happens during
upgrades, but one time in many years you need to reindex the Swedish
tables, and the system starts warning you about that as soon as you
access the Swedish tables. (Conversely, if you never actually access
the Swedish tables, then you don't get warned about.)

If we wanted a similar versioning system for the Unicode updates, it
would be separate. We'd write the Unicode version that was current when
the system catalogs were initialized into, say, a pg_database column.
And then at run-time, when someone runs say the normalize() function or
some regular expression character classification, then we check what the
version of the current compiled-in Unicode tables are, and then we'd
issue a warning when they are different.

A possible problem is that the Unicode version changes in practice with
every major PostgreSQL release, so this approach would end up warning
users after every upgrade. To avoid that, we'd probably need to keep
support for multiple Unicode versions around, as has been suggested in
this thread already.

#49

robertmhaas@gmail.com

over 1 year ago

In reply to: Jeff Davis (#47)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Wed, Jul 24, 2024 at 1:45 PM Jeff Davis <pgsql@j-davis.com> wrote:

There's a qualitative difference between a collation update which can
break your PKs and FKs, and a ctype update which definitely will not.

I don't think that's true. All you need is a unique index on UPPER(somecol).

--
Robert Haas
EDB: http://www.enterprisedb.com

#50

pgsql@j-davis.com

over 1 year ago

In reply to: Robert Haas (#49)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Wed, 2024-07-24 at 14:47 -0400, Robert Haas wrote:

On Wed, Jul 24, 2024 at 1:45 PM Jeff Davis <pgsql@j-davis.com> wrote:

There's a qualitative difference between a collation update which
can
break your PKs and FKs, and a ctype update which definitely will
not.

I don't think that's true. All you need is a unique index on
UPPER(somecol).

Primary keys are on plain column references, not expressions; and don't
support WHERE clauses, so I don't see how a ctype update would affect a
PK.

In any case, you are correct that Unicode updates could put some
constraints at risk, including unique indexes, CHECK, and partition
constraints. But someone has to actually use one of the affected
functions somewhere, and that's the main distinction that I'm trying to
draw.

The reason why collation is qualitatively a much bigger problem is
because there's no obvious indication that you are doing anything
related to collation at all. A very plain "CREATE TABLE x(t text
PRIMARY KEY)" is at risk.

Regards,
Jeff Davis

#51

robertmhaas@gmail.com

over 1 year ago

In reply to: Jeff Davis (#50)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Wed, Jul 24, 2024 at 3:12 PM Jeff Davis <pgsql@j-davis.com> wrote:

In any case, you are correct that Unicode updates could put some
constraints at risk, including unique indexes, CHECK, and partition
constraints. But someone has to actually use one of the affected
functions somewhere, and that's the main distinction that I'm trying to
draw.

The reason why collation is qualitatively a much bigger problem is
because there's no obvious indication that you are doing anything
related to collation at all. A very plain "CREATE TABLE x(t text
PRIMARY KEY)" is at risk.

Well, I don't know. I agree that collation is a much bigger problem,
but not for that reason. I think a user who is familiar with the
problems in this area will see the danger either way, and one who
isn't, won't. For me, the only real difference is that a unique index
on a text column is a lot more common than one that involves UPPER.

--
Robert Haas
EDB: http://www.enterprisedb.com

#52

schneider@ardentperf.com

over 1 year ago

In reply to: Robert Haas (#51)

Re: [18] Policy on IMMUTABLE functions and Unicode updates

On Wed, Jul 24, 2024 at 12:47 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jul 24, 2024 at 1:45 PM Jeff Davis <pgsql@j-davis.com> wrote:

There's a qualitative difference between a collation update which can
break your PKs and FKs, and a ctype update which definitely will not.

I don't think that's true. All you need is a unique index on UPPER(somecol).

I doubt it’s common to have unique on upper()

But non-unique indexes for case insensitive searches will be more common.
Historically this is the most common way people did case insensitive on
oracle.

Changing ctype would mean these queries can return wrong results

The impact would be similar to the critical problem TripAdvisor hit in 2014
with their read replicas, in the Postgres email thread I linked above

-Jeremy

#53