Support regular expressions with nondeterministic collations
This patch allows using regular expression functions and operators with
nondeterministic collations.
This complements the patches "Support LIKE with nondeterministic
collations" and "Support POSITION with nondeterministic collations" but
is independent. These three together fix most of the places where
nondeterministic collations are currently not allowed.
I had to decide here what the semantics should be. The SQL standard
doesn't say anything, it just refers to XQuery. XQuery has no knowledge
of SQL collations. I also studied the relevant Unicode standard (UTS
#18) and it makes no mention of collations. So my conclusion is that
regular expressions should pay no attention to collations. That makes
it easy.
To clarify a bit more: They don't pay attention to the collate part of
collations. So if you have an accent-insensitive collation, that
doesn't make the regular expression match accent-insensitive. But it
does and continues to pay attention to the ctype part of collations.
The latter is a PostgreSQL extension.
Note that UTS #18 has "retracted" support for tailoring in regular
expressions, which supports the idea that regular expressions should be
independent of things like language settings.
I think this is sensible. Regular expressions are inherently based on
sequences of characters, and trying to marry that with nondeterministic
collations just doesn't fit.
But: We also convert SIMILAR TO patterns to standard regular
expressions, and SIMILAR TO is covered in the SQL standard. And the
definition there does take the collation into account. But the
definition there is pretty much impossible to implement for
nondeterministic collations: It basically says, the predicate is true
if the string to be matched is equal, using the applicable collation, to
any of the strings in the set of strings described by the regular
expression. Which is a nice computer-sciency way to define it, but it
doesn't work in practice.
So I need a way to remember whether a regular expression was originally
a SIMILAR TO pattern and then error out if the collation is
nondeterministic. I figured out a way to do that: Regular expressions
support prefixes like "***X", where X is some character. I added a new
prefix "***S". This is not externally visible, it just gets used
internally, and it doesn't conflict with real regular expressions.
In summary, this patch doesn't change any functionality that currently
works. It just removes one error message and lets regular expressions
just run, independent of whether the collation is nondeterministic.
Attachments:
v1-0001-Support-regular-expressions-with-nondeterministic.patchtext/plain; charset=UTF-8; name=v1-0001-Support-regular-expressions-with-nondeterministic.patchDownload+54-18
Peter Eisentraut <peter@eisentraut.org> writes:
This patch allows using regular expression functions and operators with
nondeterministic collations.
...
In summary, this patch doesn't change any functionality that currently
works. It just removes one error message and lets regular expressions
just run, independent of whether the collation is nondeterministic.
I kind of wonder if we really want to do this. It adds no
functionality, and it forecloses the possibility of changing
the definition later. I understand and agree with your conclusion
that it's pretty much impossible to do what the SQL standard suggests
should happen --- but maybe we're both missing something that would
make it feasible. (Have you asked your committee colleagues if
anyone's actually implemented what they wrote about SIMILAR TO?
If they've written something unimplementable, it seems like there
is work for them to do in any case.)
On the whole I'm content with our status quo here.
If we do push forward with this, I doubt that it's okay to throw
the error for SIMILAR TO from where you have it --- it will leak
the partially-built compiled regex, and that will be a
session-lifespan leak. The way forward is illustrated by code
just above: it'd have to look more like
if (!collation-is-allowed)
return freev(v, REG_ECOLLATION);
where you'd need to invent a new regex error code REG_ECOLLATION
and plug that into the appropriate places.
regards, tom lane
On 22.10.24 16:40, Tom Lane wrote:
Peter Eisentraut <peter@eisentraut.org> writes:
This patch allows using regular expression functions and operators with
nondeterministic collations.
...
In summary, this patch doesn't change any functionality that currently
works. It just removes one error message and lets regular expressions
just run, independent of whether the collation is nondeterministic.I kind of wonder if we really want to do this. It adds no
functionality, and it forecloses the possibility of changing
the definition later. I understand and agree with your conclusion
that it's pretty much impossible to do what the SQL standard suggests
should happen --- but maybe we're both missing something that would
make it feasible. (Have you asked your committee colleagues if
anyone's actually implemented what they wrote about SIMILAR TO?
If they've written something unimplementable, it seems like there
is work for them to do in any case.)
Good idea; I'll go ask there too.
Btw., one end goal here is to be able to run with a nondeterministic
collation as the global locale. So for example you could make the whole
system insensitive to Unicode normalization forms. But if that
effectively globally disables regular expressions, then people will be
sad, and also most of psql breaks, and so on. So some positive solution
here would be useful.
Peter Eisentraut <peter@eisentraut.org> writes:
On 22.10.24 16:40, Tom Lane wrote:
Peter Eisentraut <peter@eisentraut.org> writes:
In summary, this patch doesn't change any functionality that currently
works. It just removes one error message and lets regular expressions
just run, independent of whether the collation is nondeterministic.
I kind of wonder if we really want to do this. It adds no
functionality, and it forecloses the possibility of changing
the definition later.
Btw., one end goal here is to be able to run with a nondeterministic
collation as the global locale. So for example you could make the whole
system insensitive to Unicode normalization forms. But if that
effectively globally disables regular expressions, then people will be
sad, and also most of psql breaks, and so on. So some positive solution
here would be useful.
Sure, and I'll support this patch once we're sure that no better
functionality is possible. I just want to look into whether the
SQL committee knows something we don't.
regards, tom lane
On Tue, 2024-10-22 at 10:16 +0200, Peter Eisentraut wrote:
I also studied the relevant Unicode standard (UTS
#18) and it makes no mention of collations. So my conclusion is that
regular expressions should pay no attention to collations. That
makes
it easy.
Does normalization play a role here?
Regards,
Jeff Davis
On Tue, 2024-10-22 at 10:40 -0400, Tom Lane wrote:
I understand and agree with your conclusion
that it's pretty much impossible to do what the SQL standard suggests
should happen --- but maybe we're both missing something that would
make it feasible.
It sounds feasible for case-insensitive collations, right? We just
casefold the pattern and the string, and then check for a match.
That's difficult given our current assumption that non-deterministic
collaitons can mean almost anything. But it's not necessarily a problem
with the standard, and perhaps some other systems do something like
that.
Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
On Tue, 2024-10-22 at 10:40 -0400, Tom Lane wrote:
I understand and agree with your conclusion
that it's pretty much impossible to do what the SQL standard suggests
should happen --- but maybe we're both missing something that would
make it feasible.
It sounds feasible for case-insensitive collations, right? We just
casefold the pattern and the string, and then check for a match.
Yeah, there is some set of collations for which that would work.
But I think it requires nontrivial assumptions both about how
comparison works in the collation, and whether the available
case-folding logic matches that. An important point here is
that the results depend on which direction you choose to smash
case, which is at best a bit uncomfortable-making. For instance,
I believe in German "ß" upcases to "SS" and would therefore match
"ss" if you choose to fold to upper, but not so much if you choose
to fold to lower. (Possibly Peter will correct me on that, but the
point is there are some weird rules out there.)
The existing logic in the regex engine for case-insensitive matching
is to convert every letter to a bracket expression containing all
its case variants. For example, "a" becomes "[aA]" and "[xY1]"
becomes "[xXyY1]". This fails on "ß", so a better way would be
nice...
regards, tom lane
On Mon, 2024-12-16 at 17:16 -0500, Tom Lane wrote:
Yeah, there is some set of collations for which that would work.
But I think it requires nontrivial assumptions both about how
comparison works in the collation, and whether the available
case-folding logic matches that. An important point here is
that the results depend on which direction you choose to smash
case, which is at best a bit uncomfortable-making. For instance,
I believe in German "ß" upcases to "SS" and would therefore match
"ss" if you choose to fold to upper, but not so much if you choose
to fold to lower. (Possibly Peter will correct me on that, but the
point is there are some weird rules out there.)
Unicode specifies case folding separately from case conversion
(lower/title/upper) to deal with these kinds of issues: "ß", "Ss",
"SS", and "ss" all fold to "ss".
I have a couple patches that create that infrastructure:
/messages/by-id/a1886ddfcd8f60cb3e905c93009b646b4cfb74c5.camel@j-davis.com
/messages/by-id/ddfd67928818f138f51635712529bc5e1d25e4e7.camel@j-davis.com
after that's in place, we can even discuss adding a builtin case-
insensitive collation that does memcmp() on the case-folded strings.
The existing logic in the regex engine for case-insensitive matching
is to convert every letter to a bracket expression containing all
its case variants. For example, "a" becomes "[aA]" and "[xY1]"
becomes "[xXyY1]". This fails on "ß", so a better way would be
nice...
We have a couple options:
* create more complex regexes like "(ß|[sS][sS])"
* case fold the pattern first, and then lazily case fold the string as
we match against it
The former sounds faster but the latter sounds simpler.
Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
On Mon, 2024-12-16 at 17:16 -0500, Tom Lane wrote:
The existing logic in the regex engine for case-insensitive matching
is to convert every letter to a bracket expression containing all
its case variants. For example, "a" becomes "[aA]" and "[xY1]"
becomes "[xXyY1]". This fails on "ß", so a better way would be
nice...
We have a couple options:
* create more complex regexes like "(ß|[sS][sS])"
* case fold the pattern first, and then lazily case fold the string as
we match against it
The former sounds faster but the latter sounds simpler.
Yeah, the latter sounds really slow. It would not actually be too
hard I think to build the right regex, if we had the information
available as to what all the case-variants are. The problem at the
moment is that the existing code assumes that pg_wc_tolower and
pg_wc_toupper together give us all the case variants, and that
API can't cope with multi-glyph expansions.
regards, tom lane
On Wed, 2024-12-18 at 14:55 -0500, Tom Lane wrote:
It would not actually be too
hard I think to build the right regex, if we had the information
available as to what all the case-variants are. The problem at the
moment is that the existing code assumes that pg_wc_tolower and
pg_wc_toupper together give us all the case variants, and that
API can't cope with multi-glyph expansions.
That's doable. I can do that after refactoring the ctype logic to use a
method table.
I'll have to think about how the API should look though. The maximum
amount of expansion that can occur during case folding is from one
codepoint to 3, and the maximum number of case variants is also ~3, so
it could fill in a caller-supplied 3x3 array of pg_wchar. Somewhat
awkward in C, so I welcome better ideas.
Note: if the string is not normalized consistently with the
pattern, pattern matching in general won't work very well. This has
always been true, but as we make pattern matching smarter we should be
more clear about that point.
Regards,
Jeff Davis
On 29.10.24 09:47, Peter Eisentraut wrote:
I kind of wonder if we really want to do this. It adds no
functionality, and it forecloses the possibility of changing
the definition later. I understand and agree with your conclusion
that it's pretty much impossible to do what the SQL standard suggests
should happen --- but maybe we're both missing something that would
make it feasible. (Have you asked your committee colleagues if
anyone's actually implemented what they wrote about SIMILAR TO?
If they've written something unimplementable, it seems like there
is work for them to do in any case.)Good idea; I'll go ask there too.
So the result from that was that no one there knew what to do either.
There was general interest in the various arguments and options, but
there was no consensus about what the right solution should be.
For everyone's amusement, attached is the discussion paper I submitted,
which contains some of my arguments from this thread as well as other
information and examples.
I think a way forward would be to define more special purpose collations
that are just "normal but case insensitive" or "normal but accent
insensitive", like was discussed later in this thread, and what other
implementations apparently also do (see BINARY_CI in the paper).
For now, I'm withdrawing this patch, but I (and I suspect others) will
keep thinking about this.