Proposal: function markers to indicate collation/ctype sensitivity
=== Motivation ===
Right now, we are missing information about the dependencies between
database objects and collation providers, which makes it very hard to
migrate to a new collation provider version smoothly. Even assuming a
user is willing to REINDEX potentially-affected indexes, they have no
way to know which indexes may be affected, so they have to reindex
everything.
This proposal would add that dependency information, and importantly,
would be careful about which dependency entries are required for
particular expressions and which are not.
=== Proposal ===
(The terminology can be confusing here, so I use "CollOid" to mean the
collation object in pg_collation, and COLLATE as the marker keyword.)
Introduce three new options when creating or altering a function,
operator or index AM: COLLATE, CTYPE, or EQUALITY, representing the
operations that the object is sensitive to.
COLLATE marker: indicates that the function or index AM depends on the
CollOid of the input expression for ordering. Example: text_lt(), btree
AM.
CTYPE marker: indicates that the function or index AM depends on the
CollOid of the input expression for case conversion or character
classification. Examples: LOWER(), ILIKE.
EQUALITY marker: indicates that the function or index AM depends on
CollOid for the equality semantics of the input expression. Examples:
texteq(), btree AM, hash AM. (Note: EQUALITY is only important for non-
deterministic collations and we might want to omit it if too many
functions qualify, but it helps illustrate the rest of the proposal so
I will leave it in for now.)
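To make this concrete, the markers might be attached with DDL along these lines (hypothetical syntax; none of these clauses exist today, and the spelling is only illustrative):

```sql
-- Hypothetical marker syntax; the COLLATE/CTYPE/EQUALITY clauses
-- shown here do not exist yet.
ALTER FUNCTION text_lt(text, text) COLLATE;  -- sensitive to ordering
ALTER FUNCTION lower(text) CTYPE;            -- sensitive to case/classification
ALTER FUNCTION texteq(text, text) EQUALITY;  -- sensitive to equality semantics
-- Index AMs would carry markers too, e.g. btree: COLLATE + EQUALITY,
-- hash: EQUALITY.
```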
When walking an expression to find dependencies (e.g. functions called
from an index expression) and a function call is encountered, look at
the inferred CollOid of the input, and:
* if the COLLATE marker is present, record a dependency on that
inferred CollOid unless collate_is_c
* if the CTYPE marker is present, record a dependency on that
inferred CollOid unless ctype_is_c
* if the EQUALITY marker is present, record a dependency on that
inferred CollOid unless it's deterministic
NB: We must make the dependency entry even if CollOid or the function
itself IsPinnedObject()!
Similarly, when creating indexes, check the AM's markers, and add
dependencies for the column's CollOid as appropriate.
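To illustrate the walk, suppose lower() carried the CTYPE marker (hypothetical; the comments describe the proposed behavior, not what PostgreSQL does today):

```sql
CREATE INDEX i ON t (lower(c) COLLATE "en-x-icu");
-- The walk of the index expression encounters lower(c):
--   CTYPE marker present, inferred CollOid is "en-x-icu",
--   and ctype_is_c is false
--   → record a pg_depend entry from index i to the "en-x-icu" collation.
-- The btree AM's own markers are checked the same way for the
-- index column's CollOid.
```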
=== Benefits ===
* Markers will act as documentation.
* Dependency entries can be followed to find objects dependent on a
given collation, which will be useful during provider upgrades.
* Users of the builtin collation provider or libc where COLLATE=C (and
CTYPE is something else) would have precise dependency entries about
the specific expression indexes that may be affected by a provider
upgrade.
* If necessary, we can add runtime checks that will throw an error if a
function is missing the appropriate marker. There's some work to do
here, but I believe it's possible by checking at the time the collation
is opened.
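Once these dependency entries exist, finding potentially-affected objects becomes an ordinary catalog query. A sketch (PostgreSQL already records some collation dependencies, e.g. for index columns with a non-default collation, so a query of roughly this shape works today):

```sql
-- Find objects recorded as depending on a given collation.
SELECT pg_describe_object(d.classid, d.objid, d.objsubid) AS dependent_object
FROM pg_depend d
JOIN pg_collation c
  ON d.refclassid = 'pg_collation'::regclass
 AND d.refobjid = c.oid
WHERE c.collname = 'en-x-icu';
```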
One drawback: because this will add dependencies against the default
collation, it could bloat pg_depend substantially if !collate_is_c.
However, in the
future when we have catalog entries for the providers themselves, that
would be useful information rather than just bloat: it would point to a
specific provider entry with a version (lots of details to work out in
a separate proposal).
=== Why separate markers? ===
Consider:
CREATE TABLE foo (t TEXT COLLATE PG_C_UTF8);
ALTER TABLE foo ADD CHECK (SUBSTR(t,1,3) = t);
-- no additional dependencies created for SUBSTR, because it
-- has no markers
ALTER TABLE foo ADD CHECK (LOWER(t) = t);
-- dependency created for PG_C_UTF8
ALTER TABLE foo ADD CHECK (t < 'z');
-- no additional dependency created because
-- collate_is_c(PG_C_UTF8) is true
CREATE INDEX foo_idx_1 ON foo USING btree (t);
-- no dependency because collate_is_c
CREATE INDEX foo_idx_2 ON foo USING btree (t COLLATE "en-x-icu");
-- creates a dependency because btree is marked with COLLATE
-- and !collate_is_c
CREATE INDEX foo_idx_3 ON foo USING hash (t COLLATE "en-x-icu");
-- does not create a dependency even though hash is marked
-- with EQUALITY, because "en-x-icu" is deterministic
If we didn't have separate markers, we'd need to assume that all of
those objects could be affected by a provider update.
=== NORMALIZE() ===
NORMALIZE() is a bit of a strange case because it doesn't depend on the
input collation at all, but we still want to record a dependency. We
have a number of potential solutions:
* Simply exclude it from IsPinnedObject(), which would create a
dependency against the normalize function itself rather than a
collation. That would complicate queries a bit and would not work for
UDFs that depend on our normalization tables.
* Record a dependency against a dummy collation that uses the builtin
provider.
* In the future, when we add catalog entries for the providers
themselves, we can record a dependency directly against the provider
without needing the dummy collation.
If we expect users to create their own functions which depend on our
normalization tables, we can add a fourth marker UNICODE. Otherwise, we
can just special case the few builtin functions we have to create those
dependency entries.
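With a fourth marker, a UDF built on our normalization tables might look like this (hypothetical marker; my_nfc is an invented example):

```sql
-- Hypothetical: the UNICODE marker declares a dependency on the
-- builtin Unicode tables rather than on any input collation.
CREATE FUNCTION my_nfc(t text) RETURNS text
    UNICODE   -- proposed marker, does not exist today
    LANGUAGE sql
    RETURN normalize(t, NFC);
```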
Regards,
Jeff Davis
Hi Jeff,
On Jun 3, 2025, at 23:22, Jeff Davis <pgsql@j-davis.com> wrote:
If we didn't have separate markers, we'd need to assume that all of
those objects could be affected by a provider update.
I understand the need to trace these dependencies, but as a function developer with relatively modest understanding of collation nuances, I’m wondering how I’d know I needed these markers. It seems complicated. Which leads me to think that adoption would be low outside of core.
Best,
David
On Wed, Jun 4, 2025 at 8:34 AM David E. Wheeler <david@justatheory.com> wrote:
On Jun 3, 2025, at 23:22, Jeff Davis <pgsql@j-davis.com> wrote:
If we didn't have separate markers, we'd need to assume that all of
those objects could be affected by a provider update.
I understand the need to trace these dependencies, but as a function developer with relatively modest understanding of collation nuances, I’m wondering how I’d know I needed these markers. It seems complicated. Which leads me to think that adoption would be low outside of core.
That was my first thought as well. But my second thought was: does
that matter? There are substantial benefits to having this for just
core functions, no?
Thanks,
Maciek
On Wed, 2025-06-04 at 11:33 -0400, David E. Wheeler wrote:
I understand the need to trace these dependencies, but as a function
developer with relatively modest understanding of collation nuances,
I’m wondering how I’d know I needed these markers. It seems
complicated. Which leads me to think that adoption would be low
outside of core.
In the proposal, I mentioned adding runtime checks as future work.
Perhaps that would be a requirement for this proposal so that users
could get a warning? We could also make the markers negative, so that
the default is to assume that a UDF with collatable inputs depends on
all of the behaviors.
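Negative markers might be spelled like this (hypothetical syntax; byte_len is an invented example):

```sql
-- Hypothetical opt-out markers: by default a UDF with collatable
-- inputs is assumed to depend on all collation behaviors, and the
-- implementer explicitly opts out.
CREATE FUNCTION byte_len(t text) RETURNS int
    NOT COLLATE NOT CTYPE   -- proposed opt-out clauses, do not exist today
    LANGUAGE sql
    RETURN octet_length(t);
```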
Regards,
Jeff Davis
On 04.06.25 05:22, Jeff Davis wrote:
This proposal would add that dependency information, and importantly,
would be careful about which dependency entries are required for
particular expressions and which are not.
Introduce three new options when creating or altering a function,
operator or index AM: COLLATE, CTYPE, or EQUALITY, representing the
operations that the object is sensitive to.
Yes, this has been on my todo list since the day collations were added,
but for a different reason: We should be able to detect a failed
collation derivation at parse time. This is according to the SQL
standard, and also because it's arguably a better user experience. But
we don't do that, so we have to check it at run time, which is what all
these errmsg("could not determine which collation to use for string
comparison") checks are for.
The reason we don't do it at parse time is that we don't have the
information which functions care about collations, which is exactly what
you are proposing here to add.
In my mind, I had this project listed under "procollate", but feel free
to use a different name. But I would consider making this one setting
with multiple values instead of multiple boolean settings.
I don't mean to say that you should implement the parse-time collation
derivation check as well, but we should design the catalog metadata so
that that is possible.
On 04.06.25 17:53, Jeff Davis wrote:
On Wed, 2025-06-04 at 11:33 -0400, David E. Wheeler wrote:
I understand the need to trace these dependencies, but as a function
developer with relatively modest understanding of collation nuances,
I’m wondering how I’d know I needed these markers. It seems
complicated. Which leads me to think that adoption would be low
outside of core.
In the proposal, I mentioned adding runtime checks as future work.
Perhaps that would be a requirement for this proposal so that users
could get a warning? We could also make the markers negative, so that
the default is to assume that a UDF with collatable inputs depends on
all of the behaviors.
Yes, in my past research (see other message) I also wondered about
this. One question is how we get third-party code to update; the other
is how many functions, either in core or third-party, you need to
update, versus having a negative/opt-out approach.
My conclusion was also what you appear to be saying here: We should
assume by default that a function using a collatable data type might use
the collation. If the function doesn't, then the implementer can
explicitly opt out. This is similar to how functions are volatile by
default, but implementers can mark them as stable or immutable as their
own responsibility.
On Thu, 2025-06-05 at 10:12 +0200, Peter Eisentraut wrote:
The reason we don't do it at parse time is that we don't have the
information which functions care about collations, which is exactly
what you are proposing here to add.
Currently, we have:
create table c(x text collate "C", y text collate "en_US");
insert into c values ('x', 'y');
select x < y from c; -- fails (runtime check)
select x || y from c; -- succeeds
Surely, "<" would be marked as ordering-sensitive, and we could move
the error to parse-time.
But what about UDFs? If we assume that all UDFs are ordering-sensitive
unless marked otherwise, then a user-defined version of "||" that
previously worked would now start failing, until they add the ordering-
insensitive mark.
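For example (sketch; my_concat and the ### operator are invented), an operator like this works today because it never triggers the runtime collation check:

```sql
-- A user-defined concatenation that never compares strings:
CREATE FUNCTION my_concat(a text, b text) RETURNS text
    LANGUAGE sql IMMUTABLE
    RETURN a || b;
CREATE OPERATOR ### (FUNCTION = my_concat, LEFTARG = text, RIGHTARG = text);
-- Works today despite x and y having conflicting collations:
--   SELECT x ### y FROM c;
-- Under an "ordering-sensitive unless marked otherwise" default, this
-- would begin failing at parse time until the owner adds the mark.
```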
We'd need some kind of migration path where we could retain the runtime
checks and disable the parse time checks until people have a chance to
add the right marks to their UDFs. Migration paths like that are not
great because they take several releases to work out, and we're never
quite sure when to finally remove the deprecated behavior.
If we make the opposite assumption, that none are ordering-sensitive
unless we mark them so, that would allow properly-marked functions to
fail at parse time, and the rest to fail at runtime. But this
assumption doesn't work as well for recording dependencies, because
we'd miss the dependencies for UDFs that aren't properly marked.
Thoughts?
Regards,
Jeff Davis
On Thu, 2025-06-05 at 10:12 +0200, Peter Eisentraut wrote:
But I would consider making this one setting
with multiple values instead of multiple boolean settings.
While we're at it, CTYPE is not very descriptive for a user-facing
name. And COLLATE has become overloaded (expression clause,
pg_collation object, ordering, or the superset of behaviors that
includes CTYPE). Let's consider more user-friendly naming for the
markers:
CASE: lower/upper/initcap/fold behavior
CLASS: char classifications such as [[:punct:]]
ORDER: comparisons
Internally, at least for the foreseeable future, CASE and CLASS would
be the same. They'd just be different markers to record the user's
intent.
Also, we could use keywords in the DDL syntax, or we could use a new
options syntax, or a comma-separated list as a string literal to
specify the markers. I don't have a strong opinion on which route to
take, but I chose the above names from existing keywords so we wouldn't
have to add any.
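A few possible spellings (all hypothetical; is_upper is an invented example):

```sql
-- 1. Bare keywords in CREATE FUNCTION:
--    CREATE FUNCTION is_upper(text) RETURNS boolean CASE CLASS ...
-- 2. An options list, similar to storage parameters:
--    CREATE FUNCTION is_upper(text) RETURNS boolean
--        WITH (sensitivity = 'case, class') ...
-- 3. A comma-separated string literal:
--    ALTER FUNCTION is_upper(text) SENSITIVITY 'case, class';
```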
Regards,
Jeff Davis
On Tue, 2025-06-03 at 20:22 -0700, Jeff Davis wrote:
EQUALITY marker: indicates that the function or index AM depends on
CollOid for the equality semantics of the input expression. Examples:
texteq(), btree AM, hash AM. (Note: EQUALITY is only important for
non-deterministic collations and we might want to omit it if too many
functions qualify, but it helps illustrate the rest of the proposal so
I will leave it in for now.)
I had trouble contriving a case where tracking EQUALITY as a separate
marker would be practically useful. When we allow non-deterministic
collations at the database level in the future, it might be useful in a
case like:
CREATE TABLE r(t TEXT, CHECK(length(t) <= 10));
because length() doesn't care about equality, so it would have no
dependency on the database collation even if it were, for example,
case-insensitive.
However, even in that case, I'm not sure if it's worth keeping track of
a separate marker.
Regards,
Jeff Davis
On 05.06.25 22:57, Jeff Davis wrote:
On Tue, 2025-06-03 at 20:22 -0700, Jeff Davis wrote:
EQUALITY marker: indicates that the function or index AM depends on
CollOid for the equality semantics of the input expression. Examples:
texteq(), btree AM, hash AM. (Note: EQUALITY is only important for
non-deterministic collations and we might want to omit it if too many
functions qualify, but it helps illustrate the rest of the proposal so
I will leave it in for now.)
I had trouble contriving a case where tracking EQUALITY as a separate
marker would be practically useful. When we allow non-deterministic
collations at the database level in the future, it might be useful in a
case like:
CREATE TABLE r(t TEXT, CHECK(length(t) <= 10));
because length() doesn't care about equality, so it would have no
dependency on the database collation even if it were, for example,
case-insensitive.
But such a length() function would also not care about the "collate" or
"ctype" markers. So it would probably not be marked at all.
On 05.06.25 21:56, Jeff Davis wrote:
On Thu, 2025-06-05 at 10:12 +0200, Peter Eisentraut wrote:
The reason we don't do it at parse time is that we don't have the
information which functions care about collations, which is exactly
what you are proposing here to add.
Currently, we have:
create table c(x text collate "C", y text collate "en_US");
insert into c values ('x', 'y');
select x < y from c; -- fails (runtime check)
select x || y from c; -- succeeds
Surely, "<" would be marked as ordering-sensitive, and we could move
the error to parse-time.
But what about UDFs? If we assume that all UDFs are ordering-sensitive
unless marked otherwise, then a user-defined version of "||" that
previously worked would now start failing, until they add the ordering-
insensitive mark.
I think no matter how we slice it, there is going to be some case that
will be degraded until some update is applied. I would be content to
accept this particular variant, because it doesn't seem very realistic.
Why would a user define their own concatenation function? There already
is one. Unless your concatenation function does something special, in
which case you should probably think about this collations topic. More
generally, there are, I think, only so many operations on character
strings that you can do without considering the
collation/ctype/etc. These are essentially all the operations that you
can do without looking at the characters, like length(), ||, repeat().
Everything beyond that looks at the characters and needs to take
collation/ctype/etc. into account.
We'd need some kind of migration path where we could retain the runtime
checks and disable the parse time checks until people have a chance to
add the right marks to their UDFs. Migration paths like that are not
great because they take several releases to work out, and we're never
quite sure when to finally remove the deprecated behavior.
Perhaps pg_dump can apply some properties during upgrades?
If we make the opposite assumption, that none are ordering-sensitive
unless we mark them so, that would allow properly-marked functions to
fail at parse time, and the rest to fail at runtime. But this
assumption doesn't work as well for recording dependencies, because
we'd miss the dependencies for UDFs that aren't properly marked.
That feels like the worst of both worlds.
On 05.06.25 22:47, Jeff Davis wrote:
While we're at it, CTYPE is not very descriptive for a user-facing
name. And COLLATE has become overloaded (expression clause,
pg_collation object, ordering, or the superset of behaviors that
includes CTYPE). Let's consider more user-friendly naming for the
markers:
CASE: lower/upper/initcap/fold behavior
CLASS: char classifications such as [[:punct:]]
ORDER: comparisons
Internally, at least for the foreseeable future, CASE and CLASS would
be the same. They'd just be different markers to record the user's
intent.
Under what scenario would they become different, and how would that
matter in practice?
I would be worried that this could confuse users and they would apply
these incorrectly, if the differences are too fine.
On Wed, 2025-06-11 at 09:06 +0200, Peter Eisentraut wrote:
CASE: lower/upper/initcap/fold behavior
CLASS: char classifications such as [[:punct:]]
ORDER: comparisons
Internally, at least for the foreseeable future, CASE and CLASS would
be the same. They'd just be different markers to record the user's
intent.
Under what scenario would they become different, and how would that
matter in practice?
I can't think of any reason those behaviors should diverge. If nothing
else, the "uppercase" property should be consistent with the results of
case mapping.
However, I have struggled to come up with a single word that includes
both casing behavior and character classification, but excludes
ordering behavior. Such a word would be useful for documentation, too.
I guess "CTYPE" works, but it's too technical and feels libc-specific.
Regards,
Jeff Davis
On Wed, 2025-06-11 at 09:03 +0200, Peter Eisentraut wrote:
I think no matter how we slice it, there is going to be some case that
will be degraded until some update is applied.
The problem I see is a conflict between two goals:
1. Record appropriate dependencies if a function is sensitive to
ordering or ctype.
2. Raise parse errors if we cannot infer the collation for a function
call site where the function is sensitive to ordering or ctype.
The safest assumption with respect to the first goal is to assume that
UDFs are sensitive to ordering and ctype. Otherwise, we will miss
recording dependencies for, e.g., a validation function that uses a
regex that depends on character classification.
But the safest assumption with respect to the second goal is to assume
that UDFs are not sensitive to ordering or ctype. Otherwise, we'd throw
an error for queries that work today (see below example).
To resolve this conflict I think we need some notion about whether the
markings are explicitly specified or left as the defaults. If CREATE
FUNCTION doesn't specify any markings, then the dependency tracking
code can make one assumption and the parser can make the opposite
assumption. We need to sort out the actual syntax of CREATE FUNCTION,
and I'm starting to think we need some options syntax (similar to
storage parameters for CREATE TABLE).
Why would a user define their own concatenation function?
It's more likely that someone combines a few primitive functions:
CREATE OR REPLACE FUNCTION shorten(t TEXT) RETURNS TEXT
LANGUAGE plpgsql AS $$
BEGIN
IF (length(t) < 4) THEN
RETURN t;
END IF;
RETURN substr(t,1,1) ||
(length(t) - 2)::text ||
substr(t,length(t),1);
END;
$$;
CREATE TABLE c(x TEXT COLLATE "C", y TEXT COLLATE "en_US");
INSERT INTO c VALUES ('kuber','netes');
SELECT x = y FROM c;
ERROR: could not determine which collation to use for string
comparison
-- currently succeeds even though collation cannot be inferred
SELECT shorten(x || y) FROM c;
shorten
---------
k8s
(1 row)
The example is a bit silly, but I think there are realistic cases along
those lines.
Everything beyond that looks at the characters and needs to take
collation/ctype/etc. into account.
I'm not sure. My guess would be that the various kinds of markings you
might want (or no markings at all) are all common enough cases that
they shouldn't be ignored.
Perhaps pg_dump can apply some properties during upgrades?
Interesting idea. We'd still need to account for CREATE FUNCTION
statements that come from other places (e.g. direct from applications,
or migration scripts, or extension scripts).
Regards,
Jeff Davis