Support LIKE with nondeterministic collations

Started by Peter Eisentrautover 1 year ago27 messages
#1Peter Eisentraut
peter@eisentraut.org
1 attachment(s)

This patch adds support for using LIKE with nondeterministic collations.
So you can do things such as

col LIKE 'foo%' COLLATE case_insensitive

This currently results in a "not supported" error. The reason for that
is that when I first developed support for nondeterministic collations,
I didn't know what the semantics of that should be, especially since
with nondeterministic collations, strings of different lengths could be
equal, and then dropped the issue for a while.

After further research, the SQL standard's definition of the LIKE
predicate actually provides a clear definition of the semantics: The
pattern is partitioned into substrings at wildcard characters (so
'foo%bar' is partitioned into 'foo', '%', 'bar') and then then whole
predicate matches if a match can be found for each partition under the
applicable collation (so for 'foo%bar' we look to partition the input
string into s1 || s2 || s3 such that s1 = 'foo', s2 is anything, and s3
= 'bar'.) The only difference to deterministic collations is that for
deterministic collations we can optimize this by matching by character,
but for nondeterministic collations we have to go by substring.

Attachments:

v1-0001-Support-LIKE-with-nondeterministic-collations.patchtext/plain; charset=UTF-8; name=v1-0001-Support-LIKE-with-nondeterministic-collations.patchDownload
From 3f6b584a0f34cabecac69f3cfd663dadfd34f464 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Mon, 29 Apr 2024 07:58:20 +0200
Subject: [PATCH v1] Support LIKE with nondeterministic collations

This allows for example using LIKE with case-insensitive collations.
There was previously no internal implementation of this, so it was met
with a not-supported error.  This adds the internal implementation and
removes the error.

Unlike with deterministic collations, the LIKE matching cannot go
character by character but has to go substring by substring.  For
example, if we are matching against LIKE 'foo%bar', we can't start by
looking for an 'f', then an 'o', but instead with have to find
something that matches 'foo'.  This is because the collation could
consider substrings of different lengths to be equal.  This is all
internal to MatchText() in like_match.c.

The changes in GenericMatchText() in like.c just pass through the
locale information to MatchText(), which was previously not needed.
This matches exactly Generic_Text_IC_like() below.

Note that ILIKE is not affected.  It's unclear whether ILIKE makes
sense under nondeterministic collations.

This also updates match_pattern_prefix() in like_support.c to support
optimizing the case of an exact pattern with nondeterministic
collations.  This was already alluded to in the previous code.
---
 doc/src/sgml/charset.sgml                     |   2 +-
 doc/src/sgml/func.sgml                        |   7 +-
 src/backend/utils/adt/like.c                  |  30 +++--
 src/backend/utils/adt/like_match.c            | 118 ++++++++++++++++++
 src/backend/utils/adt/like_support.c          |  29 ++---
 .../regress/expected/collate.icu.utf8.out     |  81 ++++++++++--
 src/test/regress/sql/collate.icu.utf8.sql     |  23 +++-
 7 files changed, 250 insertions(+), 40 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index daf671e6205..ec7ffb360dd 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1197,7 +1197,7 @@ <title>Nondeterministic Collations</title>
      to a performance penalty.  Note, in particular, that B-tree cannot use
      deduplication with indexes that use a nondeterministic collation.  Also,
      certain operations are not possible with nondeterministic collations,
-     such as pattern matching operations.  Therefore, they should be used
+     such as some pattern matching operations.  Therefore, they should be used
      only in cases where they are specifically wanted.
     </para>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1928de57623..6181cc64712 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5374,9 +5374,10 @@ <title>Pattern Matching</title>
    </caution>
 
    <para>
-    The pattern matching operators of all three kinds do not support
-    nondeterministic collations.  If required, apply a different collation to
-    the expression to work around this limitation.
+    <function>SIMILAR TO</function> and <acronym>POSIX</acronym>-style regular
+    expressions do not support nondeterministic collations.  If required, use
+    <function>LIKE</function> or apply a different collation to the expression
+    to work around this limitation.
    </para>
 
   <sect2 id="functions-like">
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 57ead66b5aa..bbbe6c09d18 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -149,22 +149,32 @@ SB_lower_char(unsigned char c, pg_locale_t locale, bool locale_is_c)
 static inline int
 GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation)
 {
-	if (collation && !lc_ctype_is_c(collation))
-	{
-		pg_locale_t locale = pg_newlocale_from_collation(collation);
+	pg_locale_t locale = 0;
+	bool		locale_is_c = false;
 
-		if (!pg_locale_deterministic(locale))
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("nondeterministic collations are not supported for LIKE")));
+	if (!OidIsValid(collation))
+	{
+		/*
+		 * This typically means that the parser could not resolve a conflict
+		 * of implicit collations, so report it that way.
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_INDETERMINATE_COLLATION),
+				 errmsg("could not determine which collation to use for LIKE"),
+				 errhint("Use the COLLATE clause to set the collation explicitly.")));
 	}
 
+	if (lc_ctype_is_c(collation))
+		locale_is_c = true;
+	else
+		locale = pg_newlocale_from_collation(collation);
+
 	if (pg_database_encoding_max_length() == 1)
-		return SB_MatchText(s, slen, p, plen, 0, true);
+		return SB_MatchText(s, slen, p, plen, locale, locale_is_c);
 	else if (GetDatabaseEncoding() == PG_UTF8)
-		return UTF8_MatchText(s, slen, p, plen, 0, true);
+		return UTF8_MatchText(s, slen, p, plen, locale, locale_is_c);
 	else
-		return MB_MatchText(s, slen, p, plen, 0, true);
+		return MB_MatchText(s, slen, p, plen, locale, locale_is_c);
 }
 
 static inline int
diff --git a/src/backend/utils/adt/like_match.c b/src/backend/utils/adt/like_match.c
index f2990edff7e..e282a50b3ba 100644
--- a/src/backend/utils/adt/like_match.c
+++ b/src/backend/utils/adt/like_match.c
@@ -198,6 +198,124 @@ MatchText(const char *t, int tlen, const char *p, int plen,
 			NextByte(p, plen);
 			continue;
 		}
+		else if (locale && !locale->deterministic)
+		{
+			/*
+			 * For nondeterministic locales, we find the next substring of the
+			 * pattern that does not contain wildcards and try to find a
+			 * matching substring in the text.  Crucially, we cannot do this
+			 * character by character, as in the normal case, but must do it
+			 * substring by substring, partitioned by the wildcard characters.
+			 */
+			const char *p1;
+			size_t		p1len;
+			const char *t1;
+			size_t		t1len;
+			bool		found_escape;
+			const char *subpat;
+			size_t		subpatlen;
+			char	   *buf = NULL;
+
+			/*
+			 * Determine next substring of pattern without wildcards.  p is
+			 * the start of the subpattern, p1 is one past the last byte. Also
+			 * track if we found an escape character.
+			 */
+			p1 = p;
+			p1len = plen;
+			found_escape = false;
+			while (p1len > 0)
+			{
+				if (*p1 == '\\')
+				{
+					found_escape = true;
+					NextByte(p1, p1len);
+				}
+				else if (*p1 == '_' || *p1 == '%')
+					break;
+				NextByte(p1, p1len);
+			}
+
+			/*
+			 * If we found an escape character, then make an unescaped copy of
+			 * the subpattern.
+			 */
+			if (found_escape)
+			{
+				char	   *b;
+
+				b = buf = palloc(p1 - p);
+				for (const char *c = p; c < p1; c++)
+				{
+					if (*c == '\\')
+						;
+					else
+						*(b++) = *c;
+				}
+
+				subpat = buf;
+				subpatlen = b - buf;
+			}
+			else
+			{
+				subpat = p;
+				subpatlen = p1 - p;
+			}
+
+			/*
+			 * Now build a substring of the text and try to match it against
+			 * the subpattern.  t is the start of the text, t1 is one past the
+			 * last byte.  We start with a zero-length string.
+			 */
+			t1 = t;
+			t1len = tlen;
+			for (;;)
+			{
+				int			cmp;
+
+				cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);
+
+				/*
+				 * If we found a match, we have to test if the rest of pattern
+				 * can match against the rest of the string.  Otherwise we
+				 * have to continue here try matching with a longer substring.
+				 * (This is similar to the recursion for the '%' wildcard
+				 * above.)
+				 *
+				 * Note that we can't just wind forward p and t and continue
+				 * with the main loop.  This would fail for example with
+				 *
+				 * U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents
+				 *
+				 * You'd find that t=\0061 matches p=\00E4, but then the rest
+				 * won't match; but t=\0061\0308 also matches p=\00E4, and
+				 * then the rest will match.
+				 */
+				if (cmp == 0)
+				{
+					int			matched = MatchText(t1, t1len, p1, p1len, locale, locale_is_c);
+
+					if (matched != LIKE_FALSE)
+					{
+						if (buf)
+							pfree(buf);
+						return matched; /* TRUE or ABORT */
+					}
+				}
+
+				/*
+				 * Didn't match.  If we used up the whole text, then the match
+				 * fails.  Otherwise, try again with a longer substring.
+				 */
+				if (t1len == 0)
+					return LIKE_FALSE;
+				else
+					NextChar(t1, t1len);
+			}
+			if (buf)
+				pfree(buf);
+			continue;
+		}
 		else if (GETCHAR(*p) != GETCHAR(*t))
 		{
 			/* non-wildcard pattern char fails to match text char */
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 2635050861f..3c691a5cc95 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -272,22 +272,6 @@ match_pattern_prefix(Node *leftop,
 		return NIL;
 	patt = (Const *) rightop;
 
-	/*
-	 * Not supported if the expression collation is nondeterministic.  The
-	 * optimized equality or prefix tests use bytewise comparisons, which is
-	 * not consistent with nondeterministic collations.  The actual
-	 * pattern-matching implementation functions will later error out that
-	 * pattern-matching is not supported with nondeterministic collations. (We
-	 * could also error out here, but by doing it later we get more precise
-	 * error messages.)  (It should be possible to support at least
-	 * Pattern_Prefix_Exact, but no point as long as the actual
-	 * pattern-matching implementations don't support it.)
-	 *
-	 * expr_coll is not set for a non-collation-aware data type such as bytea.
-	 */
-	if (expr_coll && !get_collation_isdeterministic(expr_coll))
-		return NIL;
-
 	/*
 	 * Try to extract a fixed prefix from the pattern.
 	 */
@@ -404,6 +388,8 @@ match_pattern_prefix(Node *leftop,
 	{
 		if (!op_in_opfamily(eqopr, opfamily))
 			return NIL;
+		if (indexcollation != expr_coll)
+			return NIL;
 		expr = make_opclause(eqopr, BOOLOID, false,
 							 (Expr *) leftop, (Expr *) prefix,
 							 InvalidOid, indexcollation);
@@ -411,6 +397,17 @@ match_pattern_prefix(Node *leftop,
 		return result;
 	}
 
+	/*
+	 * Anything other than Pattern_Prefix_Exact is not supported if the
+	 * expression collation is nondeterministic.  The optimized equality or
+	 * prefix tests use bytewise comparisons, which is not consistent with
+	 * nondeterministic collations.
+	 *
+	 * expr_coll is not set for a non-collation-aware data type such as bytea.
+	 */
+	if (expr_coll && !get_collation_isdeterministic(expr_coll))
+		return NIL;
+
 	/*
 	 * Otherwise, we have a nonempty required prefix of the values.  Some
 	 * opclasses support prefix checks directly, otherwise we'll try to
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 4b8c8f143f3..6a4509aeb46 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1272,6 +1272,30 @@ CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 NOTICE:  using standard form "und" for ICU locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
 NOTICE:  using standard form "und" for ICU locale ""
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1296,6 +1320,19 @@ SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
  2 | äbc
 (2 rows)
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+ a |  b  
+---+-----
+ 1 | äbc
+(1 row)
+
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+ a |  b  
+---+-----
+ 1 | äbc
+ 2 | äbc
+(2 rows)
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -1512,7 +1549,12 @@ SELECT x FROM test3ci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3ci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3ci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3ci WHERE x SIMILAR TO 'a%';
@@ -1630,7 +1672,12 @@ SELECT x FROM test3bpci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3bpci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3bpci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3bpci WHERE x SIMILAR TO 'a%';
@@ -1727,7 +1774,7 @@ SELECT string_to_array('ABCDEFGHI'::char(9) COLLATE case_insensitive, NULL, 'b')
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
@@ -1741,10 +1788,18 @@ SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no rows
 ---
 (0 rows)
 
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
 RESET enable_seqscan;
 -- Unicode special case: different variants of Greek lower case sigma.
 -- A naive implementation like citext that just does lower(x) =
@@ -1838,6 +1893,18 @@ SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
  1 | cote
 (1 row)
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- foreign keys (should use collation of primary key)
 -- PK is case-sensitive, FK is case-insensitive
 CREATE TABLE test10pk (x text COLLATE case_sensitive PRIMARY KEY);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 80f28a97d78..481a995c998 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -514,6 +514,12 @@ CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
 
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -522,6 +528,9 @@ CREATE TABLE test6 (a int, b text);
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_det;
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -637,14 +646,14 @@ CREATE UNIQUE INDEX ON test3bpci (x);  -- error
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
 SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_sensitive;  -- ok, no rows
 SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no rows
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
 RESET enable_seqscan;
 
 -- Unicode special case: different variants of Greek lower case sigma.
@@ -687,6 +696,14 @@ CREATE TABLE test4 (a int, b text);
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE ignore_accents;  -- still case-sensitive
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+
 -- foreign keys (should use collation of primary key)
 
 -- PK is case-sensitive, FK is case-insensitive

base-commit: 5c9f35fc48ea99e59300a267e090e3eafd1b3b0e
-- 
2.44.0

#2Daniel Verite
daniel@manitou-mail.org
In reply to: Peter Eisentraut (#1)
Re: Support LIKE with nondeterministic collations

Peter Eisentraut wrote:

This patch adds support for using LIKE with nondeterministic
collations. So you can do things such as

col LIKE 'foo%' COLLATE case_insensitive

Nice!

The pattern is partitioned into substrings at wildcard characters
(so 'foo%bar' is partitioned into 'foo', '%', 'bar') and then then
whole predicate matches if a match can be found for each partition
under the applicable collation

Trying with a collation that ignores punctuation:

postgres=# CREATE COLLATION "ign_punct" (
provider = 'icu',
locale='und-u-ka-shifted',
deterministic = false
);

postgres=# SELECT '.foo.' like 'foo' COLLATE ign_punct;
?column?
----------
t
(1 row)

postgres=# SELECT '.foo.' like 'f_o' COLLATE ign_punct;
?column?
----------
t
(1 row)

postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
?column?
----------
f
(1 row)

The first two results look fine, but the next one is inconsistent.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#3Peter Eisentraut
peter@eisentraut.org
In reply to: Daniel Verite (#2)
Re: Support LIKE with nondeterministic collations

On 30.04.24 14:39, Daniel Verite wrote:

postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
?column?
----------
f
(1 row)

The first two results look fine, but the next one is inconsistent.

This is correct, because '_' means "any single character". This is
independent of the collation.

I think with nondeterministic collations, the single-character wildcard
is often not going to be all that useful.

#4Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#3)
Re: Support LIKE with nondeterministic collations

On Thu, May 2, 2024 at 9:38 AM Peter Eisentraut <peter@eisentraut.org> wrote:

On 30.04.24 14:39, Daniel Verite wrote:

postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
?column?
----------
f
(1 row)

The first two results look fine, but the next one is inconsistent.

This is correct, because '_' means "any single character". This is
independent of the collation.

Seems really counterintuitive. I had to think for a long time to be
able to guess what was happening here. Finally I came up with this
guess:

If the collation-aware matching tries to match up f with the initial
period, the period is skipped and the f matches f. But when the
wildcard is matched to the initial period, that uses up the wildcard
and then we're left trying to match o with f, which doesn't work.

Is that right?

It'd probably be good to use something like this as an example in the
documentation. My intuition is that if foo matches a string, then _oo
f_o and fo_ should also match that string. Apparently that's not the
case, but I doubt I'll be the last one who thinks it should be.

--
Robert Haas
EDB: http://www.enterprisedb.com

#5Peter Eisentraut
peter@eisentraut.org
In reply to: Robert Haas (#4)
Re: Support LIKE with nondeterministic collations

On 03.05.24 02:11, Robert Haas wrote:

On Thu, May 2, 2024 at 9:38 AM Peter Eisentraut <peter@eisentraut.org> wrote:

On 30.04.24 14:39, Daniel Verite wrote:

postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
?column?
----------
f
(1 row)

The first two results look fine, but the next one is inconsistent.

This is correct, because '_' means "any single character". This is
independent of the collation.

Seems really counterintuitive. I had to think for a long time to be
able to guess what was happening here. Finally I came up with this
guess:

If the collation-aware matching tries to match up f with the initial
period, the period is skipped and the f matches f. But when the
wildcard is matched to the initial period, that uses up the wildcard
and then we're left trying to match o with f, which doesn't work.

Formally, what

X like '_oo'

means is, can X be partitioned into substrings such that the first
substring is a single character and the second substring is equal to
'oo' under the applicable collation? This is false in this case, there
is no such partitioning.

What the implementation does is, it walks through the pattern. It sees
'_', so it steps over one character in the input string, which is '.'
here. Then we have 'foo.' left to match in the input string. Then it
takes from the pattern the next substring up to but not including either
a wildcard character or the end of the string, which is 'oo', and then
it checks if a prefix of the remaining input string can be found that is
"equal to" 'oo'. So here it would try in turn

'' = 'oo' collate ign_punct ?
'f' = 'oo' collate ign_punct ?
'fo' = 'oo' collate ign_punct ?
'foo' = 'oo' collate ign_punct ?
'foo.' = 'oo' collate ign_punct ?

and they all fail, so the match fails.

It'd probably be good to use something like this as an example in the
documentation. My intuition is that if foo matches a string, then _oo
f_o and fo_ should also match that string. Apparently that's not the
case, but I doubt I'll be the last one who thinks it should be.

This intuition fails because with nondeterministic collations, strings
of different lengths can be equal, and so the question arises, what does
the pattern '_' mean. It could mean either, (1) a single character, or
perhaps something like, (2) a string that is equal to some other string
of length one.

The second definition would satisfy the expectation here, because then
'.f' matches '_' because '.f' is equal to some string of length one,
such as 'f'. (And then 'oo.' matches 'oo' for the rest of the pattern.)
However, off the top of my head, this definition has three flaws: (1)
It would make the single-character wildcard effectively an
any-number-of-characters wildcard, but only in some circumstances, which
could be confusing, (2) it would be difficult to compute, because you'd
have to check equality against all possible single-character strings,
and (3) it is not what the SQL standard says.

In any case, yes, some explanation and examples should be added.

#6Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#5)
Re: Support LIKE with nondeterministic collations

On Fri, May 3, 2024 at 4:52 AM Peter Eisentraut <peter@eisentraut.org> wrote:

What the implementation does is, it walks through the pattern. It sees
'_', so it steps over one character in the input string, which is '.'
here. Then we have 'foo.' left to match in the input string. Then it
takes from the pattern the next substring up to but not including either
a wildcard character or the end of the string, which is 'oo', and then
it checks if a prefix of the remaining input string can be found that is
"equal to" 'oo'. So here it would try in turn

'' = 'oo' collate ign_punct ?
'f' = 'oo' collate ign_punct ?
'fo' = 'oo' collate ign_punct ?
'foo' = 'oo' collate ign_punct ?
'foo.' = 'oo' collate ign_punct ?

and they all fail, so the match fails.

Interesting. Does that imply that these matches are slower than normal ones?

The second definition would satisfy the expectation here, because then
'.f' matches '_' because '.f' is equal to some string of length one,
such as 'f'. (And then 'oo.' matches 'oo' for the rest of the pattern.)
However, off the top of my head, this definition has three flaws: (1)
It would make the single-character wildcard effectively an
any-number-of-characters wildcard, but only in some circumstances, which
could be confusing, (2) it would be difficult to compute, because you'd
have to check equality against all possible single-character strings,
and (3) it is not what the SQL standard says.

Right, those are good arguments.

In any case, yes, some explanation and examples should be added.

Cool.

--
Robert Haas
EDB: http://www.enterprisedb.com

#7Peter Eisentraut
peter@eisentraut.org
In reply to: Robert Haas (#6)
Re: Support LIKE with nondeterministic collations

On 03.05.24 15:20, Robert Haas wrote:

On Fri, May 3, 2024 at 4:52 AM Peter Eisentraut <peter@eisentraut.org> wrote:

What the implementation does is, it walks through the pattern. It sees
'_', so it steps over one character in the input string, which is '.'
here. Then we have 'foo.' left to match in the input string. Then it
takes from the pattern the next substring up to but not including either
a wildcard character or the end of the string, which is 'oo', and then
it checks if a prefix of the remaining input string can be found that is
"equal to" 'oo'. So here it would try in turn

'' = 'oo' collate ign_punct ?
'f' = 'oo' collate ign_punct ?
'fo' = 'oo' collate ign_punct ?
'foo' = 'oo' collate ign_punct ?
'foo.' = 'oo' collate ign_punct ?

and they all fail, so the match fails.

Interesting. Does that imply that these matches are slower than normal ones?

Yes, certainly, and there is also no indexing support (other than for
exact matches).

#8Daniel Verite
daniel@manitou-mail.org
In reply to: Peter Eisentraut (#7)
Re: Support LIKE with nondeterministic collations

Peter Eisentraut wrote:

Yes, certainly, and there is also no indexing support (other than for
exact matches).

The ICU docs have this note about prefix matching:

https://unicode-org.github.io/icu/userguide/collation/architecture.html#generating-bounds-for-a-sort-key-prefix-matching

* Generating bounds for a sort key (prefix matching)

Having sort keys for strings allows for easy creation of bounds -
sort keys that are guaranteed to be smaller or larger than any sort
key from a give range. For example, if bounds are produced for a
sortkey of string “smith”, strings between upper and lower bounds
with one level would include “Smith”, “SMITH”, “sMiTh”. Two kinds
of upper bounds can be generated - the first one will match only
strings of equal length, while the second one will match all the
strings with the same initial prefix.

CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with
the maximum primary weight, so that for example the string
“smith\uFFFF” can be used as the upper bound rather than modifying
the sort key for “smith”.

In other words it says that

col LIKE 'smith%' collate "nd"

is equivalent to:

col >= 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"

which could be obtained from an index scan, assuming a btree
index on "col" collate "nd".

U+FFFF is a valid code point but a "non-character" [1]https://www.unicode.org/glossary/#noncharacter so it's
not supposed to be present in normal strings.

[1]: https://www.unicode.org/glossary/#noncharacter

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#9Daniel Verite
daniel@manitou-mail.org
In reply to: Peter Eisentraut (#5)
Re: Support LIKE with nondeterministic collations

Peter Eisentraut wrote:

However, off the top of my head, this definition has three flaws: (1)
It would make the single-character wildcard effectively an
any-number-of-characters wildcard, but only in some circumstances, which
could be confusing, (2) it would be difficult to compute, because you'd
have to check equality against all possible single-character strings,
and (3) it is not what the SQL standard says.

For #1 we're currently using the definition of a "character" as
being any single point of code, but this definition fits poorly
with non-deterministic collation rules.

The simplest illustration I can think of is the canonical
equivalence match between the NFD and NFC forms of an
accented character.

postgres=# CREATE COLLATION nd (
provider = 'icu',
locale = 'und',
deterministic = false
);

-- match NFD form with NFC form of eacute

postgres=# select U&'e\0301' like 'é' collate nd;
?column?
----------
t

postgres=# select U&'e\0301' like '_' collate nd;
?column?
----------
f
(1 row)

I understand why the algorithm produces these opposite results.
But at the semantic level, when asked if the left-hand string matches
a specific character, it says yes, and when asked if it matches any
character, it says no.
To me it goes beyond counter-intuitive, it's not reasonable enough to
be called correct.

What could we do about it?
Intuitively I think that our interpretation of "character" here should
be whatever sequence of code points are between character
boundaries [1]https://unicode-org.github.io/icu/userguide/boundaryanalysis/#character-boundary, and that the equality of such characters would be the
equality of their sequences of code points, with the string equality
check of the collation, whatever the length of these sequences.

[1]: https://unicode-org.github.io/icu/userguide/boundaryanalysis/#character-boundary
https://unicode-org.github.io/icu/userguide/boundaryanalysis/#character-boundary

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#10Peter Eisentraut
peter@eisentraut.org
In reply to: Daniel Verite (#8)
Re: Support LIKE with nondeterministic collations

On 03.05.24 16:58, Daniel Verite wrote:

* Generating bounds for a sort key (prefix matching)

Having sort keys for strings allows for easy creation of bounds -
sort keys that are guaranteed to be smaller or larger than any sort
key from a give range. For example, if bounds are produced for a
sortkey of string “smith”, strings between upper and lower bounds
with one level would include “Smith”, “SMITH”, “sMiTh”. Two kinds
of upper bounds can be generated - the first one will match only
strings of equal length, while the second one will match all the
strings with the same initial prefix.

CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with
the maximum primary weight, so that for example the string
“smith\uFFFF” can be used as the upper bound rather than modifying
the sort key for “smith”.

In other words it says that

col LIKE 'smith%' collate "nd"

is equivalent to:

col >= 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"

which could be obtained from an index scan, assuming a btree
index on "col" collate "nd".

U+FFFF is a valid code point but a "non-character" [1] so it's
not supposed to be present in normal strings.

Thanks, this could be very useful!

#11Peter Eisentraut
peter@eisentraut.org
In reply to: Daniel Verite (#9)
Re: Support LIKE with nondeterministic collations

On 03.05.24 17:47, Daniel Verite wrote:

Peter Eisentraut wrote:

However, off the top of my head, this definition has three flaws: (1)
It would make the single-character wildcard effectively an
any-number-of-characters wildcard, but only in some circumstances, which
could be confusing, (2) it would be difficult to compute, because you'd
have to check equality against all possible single-character strings,
and (3) it is not what the SQL standard says.

For #1 we're currently using the definition of a "character" as
being any single point of code,

That is the definition that is used throughout SQL and PostgreSQL. We
can't change that without redefining everything. To pick just one
example, the various trim function also behave in seemingly inconsistent
ways when you apply then to strings in different normalization forms.
The better fix there is to enforce the normalization form somehow.

Intuitively I think that our interpretation of "character" here should
be whatever sequence of code points are between character
boundaries [1], and that the equality of such characters would be the
equality of their sequences of code points, with the string equality
check of the collation, whatever the length of these sequences.

[1]:
https://unicode-org.github.io/icu/userguide/boundaryanalysis/#character-boundary

Even that page says, what we are calling character here is really called
a grapheme cluster.

In a different world, pattern matching, character trimming, etc. would
work by grapheme, but it does not.

#12Peter Eisentraut
peter@eisentraut.org
In reply to: Daniel Verite (#2)
1 attachment(s)
Re: Support LIKE with nondeterministic collations

Here is an updated patch for this.

I have added some more documentation based on the discussions, including
some examples taken directly from the emails here.

One thing I have been struggling with a bit is the correct use of
LIKE_FALSE versus LIKE_ABORT in the MatchText() code. I have made some
small tweaks about this in this version that I think are more correct,
but it could use another look. Maybe also some more tests to verify
this one way or the other.

Show quoted text

On 30.04.24 14:39, Daniel Verite wrote:

Peter Eisentraut wrote:

This patch adds support for using LIKE with nondeterministic
collations. So you can do things such as

col LIKE 'foo%' COLLATE case_insensitive

Nice!

The pattern is partitioned into substrings at wildcard characters
(so 'foo%bar' is partitioned into 'foo', '%', 'bar') and then then
whole predicate matches if a match can be found for each partition
under the applicable collation

Trying with a collation that ignores punctuation:

postgres=# CREATE COLLATION "ign_punct" (
provider = 'icu',
locale='und-u-ka-shifted',
deterministic = false
);

postgres=# SELECT '.foo.' like 'foo' COLLATE ign_punct;
?column?
----------
t
(1 row)

postgres=# SELECT '.foo.' like 'f_o' COLLATE ign_punct;
?column?
----------
t
(1 row)

postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
?column?
----------
f
(1 row)

The first two results look fine, but the next one is inconsistent.

Attachments:

v2-0001-Support-LIKE-with-nondeterministic-collations.patchtext/plain; charset=UTF-8; name=v2-0001-Support-LIKE-with-nondeterministic-collations.patchDownload
From 34f5bb1e8f0ffbb39b1efc9777736f6b4d6c4caa Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Fri, 28 Jun 2024 06:55:45 +0200
Subject: [PATCH v2] Support LIKE with nondeterministic collations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This allows for example using LIKE with case-insensitive collations.
There was previously no internal implementation of this, so it was met
with a not-supported error.  This adds the internal implementation and
removes the error.

Unlike with deterministic collations, the LIKE matching cannot go
character by character but has to go substring by substring.  For
example, if we are matching against LIKE 'foo%bar', we can't start by
looking for an 'f', then an 'o', but instead with have to find
something that matches 'foo'.  This is because the collation could
consider substrings of different lengths to be equal.  This is all
internal to MatchText() in like_match.c.

The changes in GenericMatchText() in like.c just pass through the
locale information to MatchText(), which was previously not needed.
This matches exactly Generic_Text_IC_like() below.

Note that ILIKE is not affected.  It's unclear whether ILIKE makes
sense under nondeterministic collations.

This also updates match_pattern_prefix() in like_support.c to support
optimizing the case of an exact pattern with nondeterministic
collations.  This was already alluded to in the previous code.

(includes documentation examples from Daniel Vérité)

Discussion: https://www.postgresql.org/message-id/flat/700d2e86-bf75-4607-9cf2-f5b7802f6e88@eisentraut.org
---
 doc/src/sgml/charset.sgml                     |   2 +-
 doc/src/sgml/func.sgml                        |  51 +++++++-
 src/backend/utils/adt/like.c                  |  30 +++--
 src/backend/utils/adt/like_match.c            | 118 ++++++++++++++++++
 src/backend/utils/adt/like_support.c          |  29 ++---
 .../regress/expected/collate.icu.utf8.out     |  81 ++++++++++--
 src/test/regress/sql/collate.icu.utf8.sql     |  23 +++-
 7 files changed, 292 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 834cb30c85a..533b3af9045 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1197,7 +1197,7 @@ <title>Nondeterministic Collations</title>
      to a performance penalty.  Note, in particular, that B-tree cannot use
      deduplication with indexes that use a nondeterministic collation.  Also,
      certain operations are not possible with nondeterministic collations,
-     such as pattern matching operations.  Therefore, they should be used
+     such as some pattern matching operations.  Therefore, they should be used
      only in cases where they are specifically wanted.
     </para>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2609269610b..833db120cb3 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5374,9 +5374,10 @@ <title>Pattern Matching</title>
    </caution>
 
    <para>
-    The pattern matching operators of all three kinds do not support
-    nondeterministic collations.  If required, apply a different collation to
-    the expression to work around this limitation.
+    <function>SIMILAR TO</function> and <acronym>POSIX</acronym>-style regular
+    expressions do not support nondeterministic collations.  If required, use
+    <function>LIKE</function> or apply a different collation to the expression
+    to work around this limitation.
    </para>
 
   <sect2 id="functions-like">
@@ -5422,6 +5423,45 @@ <title><function>LIKE</function></title>
 </programlisting>
    </para>
 
+   <para>
+    <function>LIKE</function> pattern matching supports nondeterministic
+    collations (see <xref linkend="collation-nondeterministic"/>), such as
+    case-insensitive collations or collations that, say, ignore punctuation.
+    So with a case-insensitive collation, one could have:
+<programlisting>
+'AbC' LIKE 'abc' COLLATE case_insensitive    <lineannotation>true</lineannotation>
+'AbC' LIKE 'a%' COLLATE case_insensitive     <lineannotation>true</lineannotation>
+</programlisting>
+    With collations that ignore certain characters or in general that consider
+    strings of different lengths equal, the semantics can become a bit more
+    complicated.  Consider these examples:
+<programlisting>
+'.foo.' LIKE 'foo' COLLATE ign_punct    <lineannotation>true</lineannotation>
+'.foo.' LIKE 'f_o' COLLATE ign_punct    <lineannotation>true</lineannotation>
+'.foo.' LIKE '_oo' COLLATE ign_punct    <lineannotation>false</lineannotation>
+</programlisting>
+    The way the matching works is that the pattern is partitioned into
+    sequences of wildcards and non-wildcard strings.  For example, the pattern
+    <literal>f_o</literal> is partitioned into <literal>f, _, o</literal>, the
+    pattern <literal>_oo</literal> is partitioned into <literal>_,
+    oo</literal>.  The input string matches the pattern if it can be
+    partitioned in such a way that the wildcards match one character or any
+    number of characters respectively and the non-wildcard partitions are
+    equal under the applicable collation.  So for example, <literal>'.foo.'
+    LIKE 'f_o' COLLATE ign_punct</literal> is true because one can partition
+    <literal>.foo.</literal> into <literal>.f, o, o.</literal>, and then
+    <literal>'.f' = 'f' COLLATE ign_punct</literal>, <literal>'o'</literal>
+    matches the <literal>_</literal> wildcard, and <literal>'o.' = 'o' COLLATE
+    ign_punct</literal>.  But <literal>'.foo.' LIKE '_oo' COLLATE
+    ign_punct</literal> is false because <literal>.foo.</literal> cannot be
+    partitioned in a way that the first character is any character and the
+    rest of the string compares equal to <literal>oo</literal>.  (Note that
+    the single-character wildcard always matches exactly one character,
+    independent of the collation.  So in this example, the
+    <literal>_</literal> would match <literal>.</literal>, but then the rest
+    of the input string won't match the rest of the pattern.)
+   </para>
+
    <para>
     <function>LIKE</function> pattern matching always covers the entire
     string.  Therefore, if it's desired to match a sequence anywhere within
@@ -5463,8 +5503,9 @@ <title><function>LIKE</function></title>
 
    <para>
     The key word <token>ILIKE</token> can be used instead of
-    <token>LIKE</token> to make the match case-insensitive according
-    to the active locale.  This is not in the <acronym>SQL</acronym> standard but is a
+    <token>LIKE</token> to make the match case-insensitive according to the
+    active locale.  (But this does not support nondeterministic collations.)
+    This is not in the <acronym>SQL</acronym> standard but is a
     <productname>PostgreSQL</productname> extension.
    </para>
 
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 57ead66b5aa..bbbe6c09d18 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -149,22 +149,32 @@ SB_lower_char(unsigned char c, pg_locale_t locale, bool locale_is_c)
 static inline int
 GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation)
 {
-	if (collation && !lc_ctype_is_c(collation))
-	{
-		pg_locale_t locale = pg_newlocale_from_collation(collation);
+	pg_locale_t locale = 0;
+	bool		locale_is_c = false;
 
-		if (!pg_locale_deterministic(locale))
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("nondeterministic collations are not supported for LIKE")));
+	if (!OidIsValid(collation))
+	{
+		/*
+		 * This typically means that the parser could not resolve a conflict
+		 * of implicit collations, so report it that way.
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_INDETERMINATE_COLLATION),
+				 errmsg("could not determine which collation to use for LIKE"),
+				 errhint("Use the COLLATE clause to set the collation explicitly.")));
 	}
 
+	if (lc_ctype_is_c(collation))
+		locale_is_c = true;
+	else
+		locale = pg_newlocale_from_collation(collation);
+
 	if (pg_database_encoding_max_length() == 1)
-		return SB_MatchText(s, slen, p, plen, 0, true);
+		return SB_MatchText(s, slen, p, plen, locale, locale_is_c);
 	else if (GetDatabaseEncoding() == PG_UTF8)
-		return UTF8_MatchText(s, slen, p, plen, 0, true);
+		return UTF8_MatchText(s, slen, p, plen, locale, locale_is_c);
 	else
-		return MB_MatchText(s, slen, p, plen, 0, true);
+		return MB_MatchText(s, slen, p, plen, locale, locale_is_c);
 }
 
 static inline int
diff --git a/src/backend/utils/adt/like_match.c b/src/backend/utils/adt/like_match.c
index f2990edff7e..0751b329b8e 100644
--- a/src/backend/utils/adt/like_match.c
+++ b/src/backend/utils/adt/like_match.c
@@ -198,6 +198,124 @@ MatchText(const char *t, int tlen, const char *p, int plen,
 			NextByte(p, plen);
 			continue;
 		}
+		else if (locale && !locale->deterministic)
+		{
+			/*
+			 * For nondeterministic locales, we find the next substring of the
+			 * pattern that does not contain wildcards and try to find a
+			 * matching substring in the text.  Crucially, we cannot do this
+			 * character by character, as in the normal case, but must do it
+			 * substring by substring, partitioned by the wildcard characters.
+			 */
+			const char *p1;
+			size_t		p1len;
+			const char *t1;
+			size_t		t1len;
+			bool		found_escape;
+			const char *subpat;
+			size_t		subpatlen;
+			char	   *buf = NULL;
+
+			/*
+			 * Determine next substring of pattern without wildcards.  p is
+			 * the start of the subpattern, p1 is one past the last byte. Also
+			 * track if we found an escape character.
+			 */
+			p1 = p;
+			p1len = plen;
+			found_escape = false;
+			while (p1len > 0)
+			{
+				if (*p1 == '\\')
+				{
+					found_escape = true;
+					NextByte(p1, p1len);
+				}
+				else if (*p1 == '_' || *p1 == '%')
+					break;
+				NextByte(p1, p1len);
+			}
+
+			/*
+			 * If we found an escape character, then make an unescaped copy of
+			 * the subpattern.
+			 */
+			if (found_escape)
+			{
+				char	   *b;
+
+				b = buf = palloc(p1 - p);
+				for (const char *c = p; c < p1; c++)
+				{
+					if (*c == '\\')
+						;
+					else
+						*(b++) = *c;
+				}
+
+				subpat = buf;
+				subpatlen = b - buf;
+			}
+			else
+			{
+				subpat = p;
+				subpatlen = p1 - p;
+			}
+
+			/*
+			 * Now build a substring of the text and try to match it against
+			 * the subpattern.  t is the start of the text, t1 is one past the
+			 * last byte.  We start with a zero-length string.
+			 */
+			t1 = t;
+			t1len = tlen;
+			for (;;)
+			{
+				int			cmp;
+
+				cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);
+
+				/*
+				 * If we found a match, we have to test if the rest of pattern
+				 * can match against the rest of the string.  Otherwise we
+				 * have to continue here try matching with a longer substring.
+				 * (This is similar to the recursion for the '%' wildcard
+				 * above.)
+				 *
+				 * Note that we can't just wind forward p and t and continue
+				 * with the main loop.  This would fail for example with
+				 *
+				 * U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents
+				 *
+				 * You'd find that t=\0061 matches p=\00E4, but then the rest
+				 * won't match; but t=\0061\0308 also matches p=\00E4, and
+				 * then the rest will match.
+				 */
+				if (cmp == 0)
+				{
+					int			matched = MatchText(t1, t1len, p1, p1len, locale, locale_is_c);
+
+					if (matched == LIKE_TRUE)
+					{
+						if (buf)
+							pfree(buf);
+						return matched;
+					}
+				}
+
+				/*
+				 * Didn't match.  If we used up the whole text, then the match
+				 * fails.  Otherwise, try again with a longer substring.
+				 */
+				if (t1len == 0)
+					return LIKE_FALSE;
+				else
+					NextChar(t1, t1len);
+			}
+			if (buf)
+				pfree(buf);
+			continue;
+		}
 		else if (GETCHAR(*p) != GETCHAR(*t))
 		{
 			/* non-wildcard pattern char fails to match text char */
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 2635050861f..3c691a5cc95 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -272,22 +272,6 @@ match_pattern_prefix(Node *leftop,
 		return NIL;
 	patt = (Const *) rightop;
 
-	/*
-	 * Not supported if the expression collation is nondeterministic.  The
-	 * optimized equality or prefix tests use bytewise comparisons, which is
-	 * not consistent with nondeterministic collations.  The actual
-	 * pattern-matching implementation functions will later error out that
-	 * pattern-matching is not supported with nondeterministic collations. (We
-	 * could also error out here, but by doing it later we get more precise
-	 * error messages.)  (It should be possible to support at least
-	 * Pattern_Prefix_Exact, but no point as long as the actual
-	 * pattern-matching implementations don't support it.)
-	 *
-	 * expr_coll is not set for a non-collation-aware data type such as bytea.
-	 */
-	if (expr_coll && !get_collation_isdeterministic(expr_coll))
-		return NIL;
-
 	/*
 	 * Try to extract a fixed prefix from the pattern.
 	 */
@@ -404,6 +388,8 @@ match_pattern_prefix(Node *leftop,
 	{
 		if (!op_in_opfamily(eqopr, opfamily))
 			return NIL;
+		if (indexcollation != expr_coll)
+			return NIL;
 		expr = make_opclause(eqopr, BOOLOID, false,
 							 (Expr *) leftop, (Expr *) prefix,
 							 InvalidOid, indexcollation);
@@ -411,6 +397,17 @@ match_pattern_prefix(Node *leftop,
 		return result;
 	}
 
+	/*
+	 * Anything other than Pattern_Prefix_Exact is not supported if the
+	 * expression collation is nondeterministic.  The optimized equality or
+	 * prefix tests use bytewise comparisons, which is not consistent with
+	 * nondeterministic collations.
+	 *
+	 * expr_coll is not set for a non-collation-aware data type such as bytea.
+	 */
+	if (expr_coll && !get_collation_isdeterministic(expr_coll))
+		return NIL;
+
 	/*
 	 * Otherwise, we have a nonempty required prefix of the values.  Some
 	 * opclasses support prefix checks directly, otherwise we'll try to
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 7d59fb44316..8bfdf33c2fa 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1272,6 +1272,30 @@ CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 NOTICE:  using standard form "und" for ICU locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
 NOTICE:  using standard form "und" for ICU locale ""
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1296,6 +1320,19 @@ SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
  2 | äbc
 (2 rows)
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+ a |  b  
+---+-----
+ 1 | äbc
+(1 row)
+
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+ a |  b  
+---+-----
+ 1 | äbc
+ 2 | äbc
+(2 rows)
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -1512,7 +1549,12 @@ SELECT x FROM test3ci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3ci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3ci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3ci WHERE x SIMILAR TO 'a%';
@@ -1630,7 +1672,12 @@ SELECT x FROM test3bpci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3bpci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3bpci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3bpci WHERE x SIMILAR TO 'a%';
@@ -1727,7 +1774,7 @@ SELECT string_to_array('ABCDEFGHI'::char(9) COLLATE case_insensitive, NULL, 'b')
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
@@ -1741,10 +1788,18 @@ SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no rows
 ---
 (0 rows)
 
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
 RESET enable_seqscan;
 -- Unicode special case: different variants of Greek lower case sigma.
 -- A naive implementation like citext that just does lower(x) =
@@ -1838,6 +1893,18 @@ SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
  1 | cote
 (1 row)
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- foreign keys (should use collation of primary key)
 -- PK is case-sensitive, FK is case-insensitive
 CREATE TABLE test10pk (x text COLLATE case_sensitive PRIMARY KEY);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 80f28a97d78..481a995c998 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -514,6 +514,12 @@ CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
 
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -522,6 +528,9 @@ CREATE TABLE test6 (a int, b text);
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_det;
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -637,14 +646,14 @@ CREATE UNIQUE INDEX ON test3bpci (x);  -- error
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
 SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_sensitive;  -- ok, no rows
 SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no rows
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
 RESET enable_seqscan;
 
 -- Unicode special case: different variants of Greek lower case sigma.
@@ -687,6 +696,14 @@ CREATE TABLE test4 (a int, b text);
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE ignore_accents;  -- still case-sensitive
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+
 -- foreign keys (should use collation of primary key)
 
 -- PK is case-sensitive, FK is case-insensitive

base-commit: 3e53492aa7084bceaa92757c90e067d79768797e
-- 
2.45.2

#13Paul A Jungwirth
pj@illuminatedcomputing.com
In reply to: Peter Eisentraut (#12)
1 attachment(s)
Re: Support LIKE with nondeterministic collations

On Thu, Jun 27, 2024 at 11:31 PM Peter Eisentraut <peter@eisentraut.org> wrote:

Here is an updated patch for this.

I took a look at this. I added some tests and found a few that give
the wrong result (I believe). The new tests are included in the
attached patch, along with the results I expect. Here are the
failures:

 -- inner %% matches b then zero:
 SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
  ?column?
 ----------
- t
+ f
 (1 row)
 -- trailing _ matches two codepoints that form one char:
 SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
  ?column?
 ----------
- t
+ f
 (1 row)
-- leading % matches zero:
 SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
  ?column?
 ----------
- t
+ f
 (1 row)
 -- leading % matches zero (with later %):
 SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
  ?column?
 ----------
- t
+ f
 (1 row)

I think the 1st, 3rd, and 4th failures are all from % not backtracking
to match zero chars.

The 2nd failure I'm not sure about. Maybe my expectation is wrong, but
then why does the same test pass with __ leading not trailing? Surely
they should be consistent.

I have added some more documentation based on the discussions, including
some examples taken directly from the emails here.

This looks good to me.

One thing I have been struggling with a bit is the correct use of
LIKE_FALSE versus LIKE_ABORT in the MatchText() code. I have made some
small tweaks about this in this version that I think are more correct,
but it could use another look. Maybe also some more tests to verify
this one way or the other.

I haven't looked at this yet.

Yours,

--
Paul ~{:-)
pj@illuminatedcomputing.com

Attachments:

v3-0001-Support-LIKE-with-nondeterministic-collations.patchapplication/octet-stream; name=v3-0001-Support-LIKE-with-nondeterministic-collations.patchDownload
From b421c65fd5391e9f2774fecd5529358c610116ba Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Fri, 28 Jun 2024 06:55:45 +0200
Subject: [PATCH v3] Support LIKE with nondeterministic collations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This allows for example using LIKE with case-insensitive collations.
There was previously no internal implementation of this, so it was met
with a not-supported error.  This adds the internal implementation and
removes the error.

Unlike with deterministic collations, the LIKE matching cannot go
character by character but has to go substring by substring.  For
example, if we are matching against LIKE 'foo%bar', we can't start by
looking for an 'f', then an 'o', but instead with have to find
something that matches 'foo'.  This is because the collation could
consider substrings of different lengths to be equal.  This is all
internal to MatchText() in like_match.c.

The changes in GenericMatchText() in like.c just pass through the
locale information to MatchText(), which was previously not needed.
This matches exactly Generic_Text_IC_like() below.

Note that ILIKE is not affected.  It's unclear whether ILIKE makes
sense under nondeterministic collations.

This also updates match_pattern_prefix() in like_support.c to support
optimizing the case of an exact pattern with nondeterministic
collations.  This was already alluded to in the previous code.

(includes documentation examples from Daniel Vérité)

Discussion: https://www.postgresql.org/message-id/flat/700d2e86-bf75-4607-9cf2-f5b7802f6e88@eisentraut.org
---
 doc/src/sgml/charset.sgml                     |   2 +-
 doc/src/sgml/func.sgml                        |  51 ++++-
 src/backend/utils/adt/like.c                  |  30 ++-
 src/backend/utils/adt/like_match.c            | 118 ++++++++++++
 src/backend/utils/adt/like_support.c          |  29 ++-
 .../regress/expected/collate.icu.utf8.out     | 179 +++++++++++++++++-
 src/test/regress/sql/collate.icu.utf8.sql     |  51 ++++-
 7 files changed, 418 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 834cb30c85a..533b3af9045 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1197,7 +1197,7 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      to a performance penalty.  Note, in particular, that B-tree cannot use
      deduplication with indexes that use a nondeterministic collation.  Also,
      certain operations are not possible with nondeterministic collations,
-     such as pattern matching operations.  Therefore, they should be used
+     such as some pattern matching operations.  Therefore, they should be used
      only in cases where they are specifically wanted.
     </para>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b669ab7f977..54b7b1c8372 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5380,9 +5380,10 @@ cast(-44 as bit(12))           <lineannotation>111111010100</lineannotation>
    </caution>
 
    <para>
-    The pattern matching operators of all three kinds do not support
-    nondeterministic collations.  If required, apply a different collation to
-    the expression to work around this limitation.
+    <function>SIMILAR TO</function> and <acronym>POSIX</acronym>-style regular
+    expressions do not support nondeterministic collations.  If required, use
+    <function>LIKE</function> or apply a different collation to the expression
+    to work around this limitation.
    </para>
 
   <sect2 id="functions-like">
@@ -5428,6 +5429,45 @@ cast(-44 as bit(12))           <lineannotation>111111010100</lineannotation>
 </programlisting>
    </para>
 
+   <para>
+    <function>LIKE</function> pattern matching supports nondeterministic
+    collations (see <xref linkend="collation-nondeterministic"/>), such as
+    case-insensitive collations or collations that, say, ignore punctuation.
+    So with a case-insensitive collation, one could have:
+<programlisting>
+'AbC' LIKE 'abc' COLLATE case_insensitive    <lineannotation>true</lineannotation>
+'AbC' LIKE 'a%' COLLATE case_insensitive     <lineannotation>true</lineannotation>
+</programlisting>
+    With collations that ignore certain characters or in general that consider
+    strings of different lengths equal, the semantics can become a bit more
+    complicated.  Consider these examples:
+<programlisting>
+'.foo.' LIKE 'foo' COLLATE ign_punct    <lineannotation>true</lineannotation>
+'.foo.' LIKE 'f_o' COLLATE ign_punct    <lineannotation>true</lineannotation>
+'.foo.' LIKE '_oo' COLLATE ign_punct    <lineannotation>false</lineannotation>
+</programlisting>
+    The way the matching works is that the pattern is partitioned into
+    sequences of wildcards and non-wildcard strings.  For example, the pattern
+    <literal>f_o</literal> is partitioned into <literal>f, _, o</literal>, the
+    pattern <literal>_oo</literal> is partitioned into <literal>_,
+    oo</literal>.  The input string matches the pattern if it can be
+    partitioned in such a way that the wildcards match one character or any
+    number of characters respectively and the non-wildcard partitions are
+    equal under the applicable collation.  So for example, <literal>'.foo.'
+    LIKE 'f_o' COLLATE ign_punct</literal> is true because one can partition
+    <literal>.foo.</literal> into <literal>.f, o, o.</literal>, and then
+    <literal>'.f' = 'f' COLLATE ign_punct</literal>, <literal>'o'</literal>
+    matches the <literal>_</literal> wildcard, and <literal>'o.' = 'o' COLLATE
+    ign_punct</literal>.  But <literal>'.foo.' LIKE '_oo' COLLATE
+    ign_punct</literal> is false because <literal>.foo.</literal> cannot be
+    partitioned in a way that the first character is any character and the
+    rest of the string compares equal to <literal>oo</literal>.  (Note that
+    the single-character wildcard always matches exactly one character,
+    independent of the collation.  So in this example, the
+    <literal>_</literal> would match <literal>.</literal>, but then the rest
+    of the input string won't match the rest of the pattern.)
+   </para>
+
    <para>
     <function>LIKE</function> pattern matching always covers the entire
     string.  Therefore, if it's desired to match a sequence anywhere within
@@ -5469,8 +5509,9 @@ cast(-44 as bit(12))           <lineannotation>111111010100</lineannotation>
 
    <para>
     The key word <token>ILIKE</token> can be used instead of
-    <token>LIKE</token> to make the match case-insensitive according
-    to the active locale.  This is not in the <acronym>SQL</acronym> standard but is a
+    <token>LIKE</token> to make the match case-insensitive according to the
+    active locale.  (But this does not support nondeterministic collations.)
+    This is not in the <acronym>SQL</acronym> standard but is a
     <productname>PostgreSQL</productname> extension.
    </para>
 
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 57ead66b5aa..bbbe6c09d18 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -149,22 +149,32 @@ SB_lower_char(unsigned char c, pg_locale_t locale, bool locale_is_c)
 static inline int
 GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation)
 {
-	if (collation && !lc_ctype_is_c(collation))
-	{
-		pg_locale_t locale = pg_newlocale_from_collation(collation);
+	pg_locale_t locale = 0;
+	bool		locale_is_c = false;
 
-		if (!pg_locale_deterministic(locale))
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("nondeterministic collations are not supported for LIKE")));
+	if (!OidIsValid(collation))
+	{
+		/*
+		 * This typically means that the parser could not resolve a conflict
+		 * of implicit collations, so report it that way.
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_INDETERMINATE_COLLATION),
+				 errmsg("could not determine which collation to use for LIKE"),
+				 errhint("Use the COLLATE clause to set the collation explicitly.")));
 	}
 
+	if (lc_ctype_is_c(collation))
+		locale_is_c = true;
+	else
+		locale = pg_newlocale_from_collation(collation);
+
 	if (pg_database_encoding_max_length() == 1)
-		return SB_MatchText(s, slen, p, plen, 0, true);
+		return SB_MatchText(s, slen, p, plen, locale, locale_is_c);
 	else if (GetDatabaseEncoding() == PG_UTF8)
-		return UTF8_MatchText(s, slen, p, plen, 0, true);
+		return UTF8_MatchText(s, slen, p, plen, locale, locale_is_c);
 	else
-		return MB_MatchText(s, slen, p, plen, 0, true);
+		return MB_MatchText(s, slen, p, plen, locale, locale_is_c);
 }
 
 static inline int
diff --git a/src/backend/utils/adt/like_match.c b/src/backend/utils/adt/like_match.c
index f2990edff7e..0751b329b8e 100644
--- a/src/backend/utils/adt/like_match.c
+++ b/src/backend/utils/adt/like_match.c
@@ -198,6 +198,124 @@ MatchText(const char *t, int tlen, const char *p, int plen,
 			NextByte(p, plen);
 			continue;
 		}
+		else if (locale && !locale->deterministic)
+		{
+			/*
+			 * For nondeterministic locales, we find the next substring of the
+			 * pattern that does not contain wildcards and try to find a
+			 * matching substring in the text.  Crucially, we cannot do this
+			 * character by character, as in the normal case, but must do it
+			 * substring by substring, partitioned by the wildcard characters.
+			 */
+			const char *p1;
+			size_t		p1len;
+			const char *t1;
+			size_t		t1len;
+			bool		found_escape;
+			const char *subpat;
+			size_t		subpatlen;
+			char	   *buf = NULL;
+
+			/*
+			 * Determine next substring of pattern without wildcards.  p is
+			 * the start of the subpattern, p1 is one past the last byte. Also
+			 * track if we found an escape character.
+			 */
+			p1 = p;
+			p1len = plen;
+			found_escape = false;
+			while (p1len > 0)
+			{
+				if (*p1 == '\\')
+				{
+					found_escape = true;
+					NextByte(p1, p1len);
+				}
+				else if (*p1 == '_' || *p1 == '%')
+					break;
+				NextByte(p1, p1len);
+			}
+
+			/*
+			 * If we found an escape character, then make an unescaped copy of
+			 * the subpattern.
+			 */
+			if (found_escape)
+			{
+				char	   *b;
+
+				b = buf = palloc(p1 - p);
+				for (const char *c = p; c < p1; c++)
+				{
+					if (*c == '\\')
+						;
+					else
+						*(b++) = *c;
+				}
+
+				subpat = buf;
+				subpatlen = b - buf;
+			}
+			else
+			{
+				subpat = p;
+				subpatlen = p1 - p;
+			}
+
+			/*
+			 * Now build a substring of the text and try to match it against
+			 * the subpattern.  t is the start of the text, t1 is one past the
+			 * last byte.  We start with a zero-length string.
+			 */
+			t1 = t;
+			t1len = tlen;
+			for (;;)
+			{
+				int			cmp;
+
+				cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);
+
+				/*
+				 * If we found a match, we have to test if the rest of pattern
+				 * can match against the rest of the string.  Otherwise we
+				 * have to continue here try matching with a longer substring.
+				 * (This is similar to the recursion for the '%' wildcard
+				 * above.)
+				 *
+				 * Note that we can't just wind forward p and t and continue
+				 * with the main loop.  This would fail for example with
+				 *
+				 * U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents
+				 *
+				 * You'd find that t=\0061 matches p=\00E4, but then the rest
+				 * won't match; but t=\0061\0308 also matches p=\00E4, and
+				 * then the rest will match.
+				 */
+				if (cmp == 0)
+				{
+					int			matched = MatchText(t1, t1len, p1, p1len, locale, locale_is_c);
+
+					if (matched == LIKE_TRUE)
+					{
+						if (buf)
+							pfree(buf);
+						return matched;
+					}
+				}
+
+				/*
+				 * Didn't match.  If we used up the whole text, then the match
+				 * fails.  Otherwise, try again with a longer substring.
+				 */
+				if (t1len == 0)
+					return LIKE_FALSE;
+				else
+					NextChar(t1, t1len);
+			}
+			if (buf)
+				pfree(buf);
+			continue;
+		}
 		else if (GETCHAR(*p) != GETCHAR(*t))
 		{
 			/* non-wildcard pattern char fails to match text char */
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 2635050861f..3c691a5cc95 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -272,22 +272,6 @@ match_pattern_prefix(Node *leftop,
 		return NIL;
 	patt = (Const *) rightop;
 
-	/*
-	 * Not supported if the expression collation is nondeterministic.  The
-	 * optimized equality or prefix tests use bytewise comparisons, which is
-	 * not consistent with nondeterministic collations.  The actual
-	 * pattern-matching implementation functions will later error out that
-	 * pattern-matching is not supported with nondeterministic collations. (We
-	 * could also error out here, but by doing it later we get more precise
-	 * error messages.)  (It should be possible to support at least
-	 * Pattern_Prefix_Exact, but no point as long as the actual
-	 * pattern-matching implementations don't support it.)
-	 *
-	 * expr_coll is not set for a non-collation-aware data type such as bytea.
-	 */
-	if (expr_coll && !get_collation_isdeterministic(expr_coll))
-		return NIL;
-
 	/*
 	 * Try to extract a fixed prefix from the pattern.
 	 */
@@ -404,6 +388,8 @@ match_pattern_prefix(Node *leftop,
 	{
 		if (!op_in_opfamily(eqopr, opfamily))
 			return NIL;
+		if (indexcollation != expr_coll)
+			return NIL;
 		expr = make_opclause(eqopr, BOOLOID, false,
 							 (Expr *) leftop, (Expr *) prefix,
 							 InvalidOid, indexcollation);
@@ -411,6 +397,17 @@ match_pattern_prefix(Node *leftop,
 		return result;
 	}
 
+	/*
+	 * Anything other than Pattern_Prefix_Exact is not supported if the
+	 * expression collation is nondeterministic.  The optimized equality or
+	 * prefix tests use bytewise comparisons, which is not consistent with
+	 * nondeterministic collations.
+	 *
+	 * expr_coll is not set for a non-collation-aware data type such as bytea.
+	 */
+	if (expr_coll && !get_collation_isdeterministic(expr_coll))
+		return NIL;
+
 	/*
 	 * Otherwise, we have a nonempty required prefix of the values.  Some
 	 * opclasses support prefix checks directly, otherwise we'll try to
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 7d59fb44316..7fd63312dde 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1272,6 +1272,30 @@ CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 NOTICE:  using standard form "und" for ICU locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
 NOTICE:  using standard form "und" for ICU locale ""
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1296,6 +1320,19 @@ SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
  2 | äbc
 (2 rows)
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+ a |  b  
+---+-----
+ 1 | äbc
+(1 row)
+
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+ a |  b  
+---+-----
+ 1 | äbc
+ 2 | äbc
+(2 rows)
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -1512,7 +1549,12 @@ SELECT x FROM test3ci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3ci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3ci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3ci WHERE x SIMILAR TO 'a%';
@@ -1630,7 +1672,12 @@ SELECT x FROM test3bpci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3bpci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3bpci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3bpci WHERE x SIMILAR TO 'a%';
@@ -1727,7 +1774,7 @@ SELECT string_to_array('ABCDEFGHI'::char(9) COLLATE case_insensitive, NULL, 'b')
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
@@ -1741,10 +1788,18 @@ SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no rows
 ---
 (0 rows)
 
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
 RESET enable_seqscan;
 -- Unicode special case: different variants of Greek lower case sigma.
 -- A naive implementation like citext that just does lower(x) =
@@ -1838,6 +1893,116 @@ SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
  1 | cote
 (1 row)
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- and in reverse:
+SELECT U&'\00E4bc' LIKE U&'\0061\0308_c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner % matches b:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner %% matches b then zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner %% matches b then zero:
+SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing _ matches two codepoints that form one char:
+SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing __ matches two codepoints that form one char:
+SELECT U&'cb\0061\0308' LIKE U&'cb__' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- leading % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- leading % matches zero (with later %):
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4bc%' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing % matches zero (with previous %):
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c%' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- _ matches two codepoints that form one char:
+SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- __ matches two codepoints that form one char:
+SELECT U&'\0061\0308bc' LIKE U&'__bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- _ matches one codepoint that forms half a char:
+SELECT U&'\0061\0308bc' LIKE U&'_\0308bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- doesn't match because \00e4 doesn't match only \0308
+SELECT U&'\0061\0308bc' LIKE U&'_\00e4bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ f
+(1 row)
+
 -- foreign keys (should use collation of primary key)
 -- PK is case-sensitive, FK is case-insensitive
 CREATE TABLE test10pk (x text COLLATE case_sensitive PRIMARY KEY);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 80f28a97d78..9b966f47073 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -514,6 +514,12 @@ CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
 
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -522,6 +528,9 @@ SELECT * FROM test6;
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_det;
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -637,14 +646,14 @@ SELECT string_to_array('ABCDEFGHI'::char(9) COLLATE case_insensitive, NULL, 'b')
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
 SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_sensitive;  -- ok, no rows
 SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no rows
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
 RESET enable_seqscan;
 
 -- Unicode special case: different variants of Greek lower case sigma.
@@ -687,6 +696,42 @@ SELECT * FROM test4 WHERE b = 'cote' COLLATE ignore_accents;
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE ignore_accents;  -- still case-sensitive
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+-- and in reverse:
+SELECT U&'\00E4bc' LIKE U&'\0061\0308_c' COLLATE ignore_accents;
+-- inner % matches b:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c' COLLATE ignore_accents;
+-- inner %% matches b then zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%%c' COLLATE ignore_accents;
+-- inner %% matches b then zero:
+SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
+-- trailing _ matches two codepoints that form one char:
+SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
+-- trailing __ matches two codepoints that form one char:
+SELECT U&'cb\0061\0308' LIKE U&'cb__' COLLATE ignore_accents;
+-- leading % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
+-- leading % matches zero (with later %):
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
+-- trailing % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4bc%' COLLATE ignore_accents;
+-- trailing % matches zero (with previous %):
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c%' COLLATE ignore_accents;
+-- _ matches two codepoints that form one char:
+SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents;
+-- __ matches two codepoints that form one char:
+SELECT U&'\0061\0308bc' LIKE U&'__bc' COLLATE ignore_accents;
+-- _ matches one codepoint that forms half a char:
+SELECT U&'\0061\0308bc' LIKE U&'_\0308bc' COLLATE ignore_accents;
+-- doesn't match because \00e4 doesn't match only \0308
+SELECT U&'\0061\0308bc' LIKE U&'_\00e4bc' COLLATE ignore_accents;
+
 -- foreign keys (should use collation of primary key)
 
 -- PK is case-sensitive, FK is case-insensitive
-- 
2.45.0

#14Peter Eisentraut
peter@eisentraut.org
In reply to: Paul A Jungwirth (#13)
1 attachment(s)
Re: Support LIKE with nondeterministic collations

On 27.07.24 00:32, Paul A Jungwirth wrote:

On Thu, Jun 27, 2024 at 11:31 PM Peter Eisentraut <peter@eisentraut.org> wrote:

Here is an updated patch for this.

I took a look at this. I added some tests and found a few that give
the wrong result (I believe). The new tests are included in the
attached patch, along with the results I expect. Here are the
failures:

Thanks, these are great test cases.

-- inner %% matches b then zero:
SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
?column?
----------
- t
+ f
(1 row)
-- trailing _ matches two codepoints that form one char:
SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
?column?
----------
- t
+ f
(1 row)
-- leading % matches zero:
SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
?column?
----------
- t
+ f
(1 row)
-- leading % matches zero (with later %):
SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
?column?
----------
- t
+ f
(1 row)

I think the 1st, 3rd, and 4th failures are all from % not backtracking
to match zero chars.

These are all because of this in like_match.c:

* Otherwise, scan for a text position at which we can match the
* rest of the pattern. The first remaining pattern char is known
* to be a regular or escaped literal character, so we can compare
* the first pattern byte to each text byte to avoid recursing
* more than we have to. [...]

This shortcut doesn't work with nondeterministic collations, so we have
to recurse in any case here. I have fixed that in the new patch; these
test cases work now.

The 2nd failure I'm not sure about. Maybe my expectation is wrong, but
then why does the same test pass with __ leading not trailing? Surely
they should be consistent.

The question is why is

SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents; -- false

but

SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents; -- true

The second one matches because

SELECT U&'\0308bc' = 'bc' COLLATE ignore_accents;

So the accent character will be ignored if it's adjacent to another
fixed substring in the pattern.

Attachments:

v4-0001-Support-LIKE-with-nondeterministic-collations.patchtext/plain; charset=UTF-8; name=v4-0001-Support-LIKE-with-nondeterministic-collations.patchDownload
From e9cc011946b714ce3d4f0430070521706936a311 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Tue, 30 Jul 2024 21:33:49 +0200
Subject: [PATCH v4] Support LIKE with nondeterministic collations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This allows for example using LIKE with case-insensitive collations.
There was previously no internal implementation of this, so it was met
with a not-supported error.  This adds the internal implementation and
removes the error.

Unlike with deterministic collations, the LIKE matching cannot go
character by character but has to go substring by substring.  For
example, if we are matching against LIKE 'foo%bar', we can't start by
looking for an 'f', then an 'o', but instead with have to find
something that matches 'foo'.  This is because the collation could
consider substrings of different lengths to be equal.  This is all
internal to MatchText() in like_match.c.

The changes in GenericMatchText() in like.c just pass through the
locale information to MatchText(), which was previously not needed.
This matches exactly Generic_Text_IC_like() below.

Note that ILIKE is not affected.  It's unclear whether ILIKE makes
sense under nondeterministic collations.

This also updates match_pattern_prefix() in like_support.c to support
optimizing the case of an exact pattern with nondeterministic
collations.  This was already alluded to in the previous code.

(includes documentation examples from Daniel Vérité and test cases
from Paul A Jungwirth)

Discussion: https://www.postgresql.org/message-id/flat/700d2e86-bf75-4607-9cf2-f5b7802f6e88@eisentraut.org
---
 doc/src/sgml/charset.sgml                     |   2 +-
 doc/src/sgml/func.sgml                        |  51 ++++-
 src/backend/utils/adt/like.c                  |  30 ++-
 src/backend/utils/adt/like_match.c            | 124 +++++++++++-
 src/backend/utils/adt/like_support.c          |  29 ++-
 .../regress/expected/collate.icu.utf8.out     | 186 +++++++++++++++++-
 src/test/regress/sql/collate.icu.utf8.sql     |  53 ++++-
 7 files changed, 431 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 834cb30c85a..533b3af9045 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1197,7 +1197,7 @@ <title>Nondeterministic Collations</title>
      to a performance penalty.  Note, in particular, that B-tree cannot use
      deduplication with indexes that use a nondeterministic collation.  Also,
      certain operations are not possible with nondeterministic collations,
-     such as pattern matching operations.  Therefore, they should be used
+     such as some pattern matching operations.  Therefore, they should be used
      only in cases where they are specifically wanted.
     </para>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b39f97dc8de..2be327fd117 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5380,9 +5380,10 @@ <title>Pattern Matching</title>
    </caution>
 
    <para>
-    The pattern matching operators of all three kinds do not support
-    nondeterministic collations.  If required, apply a different collation to
-    the expression to work around this limitation.
+    <function>SIMILAR TO</function> and <acronym>POSIX</acronym>-style regular
+    expressions do not support nondeterministic collations.  If required, use
+    <function>LIKE</function> or apply a different collation to the expression
+    to work around this limitation.
    </para>
 
   <sect2 id="functions-like">
@@ -5428,6 +5429,45 @@ <title><function>LIKE</function></title>
 </programlisting>
    </para>
 
+   <para>
+    <function>LIKE</function> pattern matching supports nondeterministic
+    collations (see <xref linkend="collation-nondeterministic"/>), such as
+    case-insensitive collations or collations that, say, ignore punctuation.
+    So with a case-insensitive collation, one could have:
+<programlisting>
+'AbC' LIKE 'abc' COLLATE case_insensitive    <lineannotation>true</lineannotation>
+'AbC' LIKE 'a%' COLLATE case_insensitive     <lineannotation>true</lineannotation>
+</programlisting>
+    With collations that ignore certain characters or in general that consider
+    strings of different lengths equal, the semantics can become a bit more
+    complicated.  Consider these examples:
+<programlisting>
+'.foo.' LIKE 'foo' COLLATE ign_punct    <lineannotation>true</lineannotation>
+'.foo.' LIKE 'f_o' COLLATE ign_punct    <lineannotation>true</lineannotation>
+'.foo.' LIKE '_oo' COLLATE ign_punct    <lineannotation>false</lineannotation>
+</programlisting>
+    The way the matching works is that the pattern is partitioned into
+    sequences of wildcards and non-wildcard strings.  For example, the pattern
+    <literal>f_o</literal> is partitioned into <literal>f, _, o</literal>, the
+    pattern <literal>_oo</literal> is partitioned into <literal>_,
+    oo</literal>.  The input string matches the pattern if it can be
+    partitioned in such a way that the wildcards match one character or any
+    number of characters respectively and the non-wildcard partitions are
+    equal under the applicable collation.  So for example, <literal>'.foo.'
+    LIKE 'f_o' COLLATE ign_punct</literal> is true because one can partition
+    <literal>.foo.</literal> into <literal>.f, o, o.</literal>, and then
+    <literal>'.f' = 'f' COLLATE ign_punct</literal>, <literal>'o'</literal>
+    matches the <literal>_</literal> wildcard, and <literal>'o.' = 'o' COLLATE
+    ign_punct</literal>.  But <literal>'.foo.' LIKE '_oo' COLLATE
+    ign_punct</literal> is false because <literal>.foo.</literal> cannot be
+    partitioned in a way that the first character is any character and the
+    rest of the string compares equal to <literal>oo</literal>.  (Note that
+    the single-character wildcard always matches exactly one character,
+    independent of the collation.  So in this example, the
+    <literal>_</literal> would match <literal>.</literal>, but then the rest
+    of the input string won't match the rest of the pattern.)
+   </para>
+
    <para>
     <function>LIKE</function> pattern matching always covers the entire
     string.  Therefore, if it's desired to match a sequence anywhere within
@@ -5469,8 +5509,9 @@ <title><function>LIKE</function></title>
 
    <para>
     The key word <token>ILIKE</token> can be used instead of
-    <token>LIKE</token> to make the match case-insensitive according
-    to the active locale.  This is not in the <acronym>SQL</acronym> standard but is a
+    <token>LIKE</token> to make the match case-insensitive according to the
+    active locale.  (But this does not support nondeterministic collations.)
+    This is not in the <acronym>SQL</acronym> standard but is a
     <productname>PostgreSQL</productname> extension.
    </para>
 
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 57ead66b5aa..bbbe6c09d18 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -149,22 +149,32 @@ SB_lower_char(unsigned char c, pg_locale_t locale, bool locale_is_c)
 static inline int
 GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation)
 {
-	if (collation && !lc_ctype_is_c(collation))
-	{
-		pg_locale_t locale = pg_newlocale_from_collation(collation);
+	pg_locale_t locale = 0;
+	bool		locale_is_c = false;
 
-		if (!pg_locale_deterministic(locale))
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("nondeterministic collations are not supported for LIKE")));
+	if (!OidIsValid(collation))
+	{
+		/*
+		 * This typically means that the parser could not resolve a conflict
+		 * of implicit collations, so report it that way.
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_INDETERMINATE_COLLATION),
+				 errmsg("could not determine which collation to use for LIKE"),
+				 errhint("Use the COLLATE clause to set the collation explicitly.")));
 	}
 
+	if (lc_ctype_is_c(collation))
+		locale_is_c = true;
+	else
+		locale = pg_newlocale_from_collation(collation);
+
 	if (pg_database_encoding_max_length() == 1)
-		return SB_MatchText(s, slen, p, plen, 0, true);
+		return SB_MatchText(s, slen, p, plen, locale, locale_is_c);
 	else if (GetDatabaseEncoding() == PG_UTF8)
-		return UTF8_MatchText(s, slen, p, plen, 0, true);
+		return UTF8_MatchText(s, slen, p, plen, locale, locale_is_c);
 	else
-		return MB_MatchText(s, slen, p, plen, 0, true);
+		return MB_MatchText(s, slen, p, plen, locale, locale_is_c);
 }
 
 static inline int
diff --git a/src/backend/utils/adt/like_match.c b/src/backend/utils/adt/like_match.c
index f2990edff7e..43e991fd932 100644
--- a/src/backend/utils/adt/like_match.c
+++ b/src/backend/utils/adt/like_match.c
@@ -158,7 +158,9 @@ MatchText(const char *t, int tlen, const char *p, int plen,
 			 * the first pattern byte to each text byte to avoid recursing
 			 * more than we have to.  This fact also guarantees that we don't
 			 * have to consider a match to the zero-length substring at the
-			 * end of the text.
+			 * end of the text.  With a nondeterministic collation, we can't
+			 * rely on the first bytes being equal, so we have to recurse in
+			 * any case.
 			 */
 			if (*p == '\\')
 			{
@@ -173,7 +175,7 @@ MatchText(const char *t, int tlen, const char *p, int plen,
 
 			while (tlen > 0)
 			{
-				if (GETCHAR(*t) == firstpat)
+				if (GETCHAR(*t) == firstpat || (locale && !locale->deterministic))
 				{
 					int			matched = MatchText(t, tlen, p, plen,
 													locale, locale_is_c);
@@ -198,6 +200,124 @@ MatchText(const char *t, int tlen, const char *p, int plen,
 			NextByte(p, plen);
 			continue;
 		}
+		else if (locale && !locale->deterministic)
+		{
+			/*
+			 * For nondeterministic locales, we find the next substring of the
+			 * pattern that does not contain wildcards and try to find a
+			 * matching substring in the text.  Crucially, we cannot do this
+			 * character by character, as in the normal case, but must do it
+			 * substring by substring, partitioned by the wildcard characters.
+			 */
+			const char *p1;
+			size_t		p1len;
+			const char *t1;
+			size_t		t1len;
+			bool		found_escape;
+			const char *subpat;
+			size_t		subpatlen;
+			char	   *buf = NULL;
+
+			/*
+			 * Determine next substring of pattern without wildcards.  p is
+			 * the start of the subpattern, p1 is one past the last byte. Also
+			 * track if we found an escape character.
+			 */
+			p1 = p;
+			p1len = plen;
+			found_escape = false;
+			while (p1len > 0)
+			{
+				if (*p1 == '\\')
+				{
+					found_escape = true;
+					NextByte(p1, p1len);
+				}
+				else if (*p1 == '_' || *p1 == '%')
+					break;
+				NextByte(p1, p1len);
+			}
+
+			/*
+			 * If we found an escape character, then make an unescaped copy of
+			 * the subpattern.
+			 */
+			if (found_escape)
+			{
+				char	   *b;
+
+				b = buf = palloc(p1 - p);
+				for (const char *c = p; c < p1; c++)
+				{
+					if (*c == '\\')
+						;
+					else
+						*(b++) = *c;
+				}
+
+				subpat = buf;
+				subpatlen = b - buf;
+			}
+			else
+			{
+				subpat = p;
+				subpatlen = p1 - p;
+			}
+
+			/*
+			 * Now build a substring of the text and try to match it against
+			 * the subpattern.  t is the start of the text, t1 is one past the
+			 * last byte.  We start with a zero-length string.
+			 */
+			t1 = t;
+			t1len = tlen;
+			for (;;)
+			{
+				int			cmp;
+
+				cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);
+
+				/*
+				 * If we found a match, we have to test if the rest of pattern
+				 * can match against the rest of the string.  Otherwise we
+				 * have to continue here try matching with a longer substring.
+				 * (This is similar to the recursion for the '%' wildcard
+				 * above.)
+				 *
+				 * Note that we can't just wind forward p and t and continue
+				 * with the main loop.  This would fail for example with
+				 *
+				 * U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents
+				 *
+				 * You'd find that t=\0061 matches p=\00E4, but then the rest
+				 * won't match; but t=\0061\0308 also matches p=\00E4, and
+				 * then the rest will match.
+				 */
+				if (cmp == 0)
+				{
+					int			matched = MatchText(t1, t1len, p1, p1len, locale, locale_is_c);
+
+					if (matched == LIKE_TRUE)
+					{
+						if (buf)
+							pfree(buf);
+						return matched;
+					}
+				}
+
+				/*
+				 * Didn't match.  If we used up the whole text, then the match
+				 * fails.  Otherwise, try again with a longer substring.
+				 */
+				if (t1len == 0)
+					return LIKE_FALSE;
+				else
+					NextChar(t1, t1len);
+			}
+			if (buf)
+				pfree(buf);
+			continue;
+		}
 		else if (GETCHAR(*p) != GETCHAR(*t))
 		{
 			/* non-wildcard pattern char fails to match text char */
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 2635050861f..3c691a5cc95 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -272,22 +272,6 @@ match_pattern_prefix(Node *leftop,
 		return NIL;
 	patt = (Const *) rightop;
 
-	/*
-	 * Not supported if the expression collation is nondeterministic.  The
-	 * optimized equality or prefix tests use bytewise comparisons, which is
-	 * not consistent with nondeterministic collations.  The actual
-	 * pattern-matching implementation functions will later error out that
-	 * pattern-matching is not supported with nondeterministic collations. (We
-	 * could also error out here, but by doing it later we get more precise
-	 * error messages.)  (It should be possible to support at least
-	 * Pattern_Prefix_Exact, but no point as long as the actual
-	 * pattern-matching implementations don't support it.)
-	 *
-	 * expr_coll is not set for a non-collation-aware data type such as bytea.
-	 */
-	if (expr_coll && !get_collation_isdeterministic(expr_coll))
-		return NIL;
-
 	/*
 	 * Try to extract a fixed prefix from the pattern.
 	 */
@@ -404,6 +388,8 @@ match_pattern_prefix(Node *leftop,
 	{
 		if (!op_in_opfamily(eqopr, opfamily))
 			return NIL;
+		if (indexcollation != expr_coll)
+			return NIL;
 		expr = make_opclause(eqopr, BOOLOID, false,
 							 (Expr *) leftop, (Expr *) prefix,
 							 InvalidOid, indexcollation);
@@ -411,6 +397,17 @@ match_pattern_prefix(Node *leftop,
 		return result;
 	}
 
+	/*
+	 * Anything other than Pattern_Prefix_Exact is not supported if the
+	 * expression collation is nondeterministic.  The optimized equality or
+	 * prefix tests use bytewise comparisons, which is not consistent with
+	 * nondeterministic collations.
+	 *
+	 * expr_coll is not set for a non-collation-aware data type such as bytea.
+	 */
+	if (expr_coll && !get_collation_isdeterministic(expr_coll))
+		return NIL;
+
 	/*
 	 * Otherwise, we have a nonempty required prefix of the values.  Some
 	 * opclasses support prefix checks directly, otherwise we'll try to
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 7d59fb44316..0cb777452ab 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1272,6 +1272,30 @@ CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 NOTICE:  using standard form "und" for ICU locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
 NOTICE:  using standard form "und" for ICU locale ""
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1296,6 +1320,19 @@ SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
  2 | äbc
 (2 rows)
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+ a |  b  
+---+-----
+ 1 | äbc
+(1 row)
+
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+ a |  b  
+---+-----
+ 1 | äbc
+ 2 | äbc
+(2 rows)
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -1512,7 +1549,12 @@ SELECT x FROM test3ci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3ci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3ci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3ci WHERE x SIMILAR TO 'a%';
@@ -1630,7 +1672,12 @@ SELECT x FROM test3bpci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3bpci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3bpci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3bpci WHERE x SIMILAR TO 'a%';
@@ -1727,7 +1774,7 @@ SELECT string_to_array('ABCDEFGHI'::char(9) COLLATE case_insensitive, NULL, 'b')
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
@@ -1741,10 +1788,18 @@ SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no rows
 ---
 (0 rows)
 
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
 RESET enable_seqscan;
 -- Unicode special case: different variants of Greek lower case sigma.
 -- A naive implementation like citext that just does lower(x) =
@@ -1838,6 +1893,123 @@ SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
  1 | cote
 (1 row)
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- and in reverse:
+SELECT U&'\00E4bc' LIKE U&'\0061\0308_c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner % matches b:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner %% matches b then zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner %% matches b then zero:
+SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing _ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
+ ?column? 
+----------
+ f
+(1 row)
+
+-- trailing __ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb__' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- leading % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- leading % matches zero (with later %):
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4bc%' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing % matches zero (with previous %):
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c%' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- _ versus two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- (actually this matches because)
+SELECT U&'\0308bc' = 'bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- __ matches two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'__bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- _ matches one codepoint that forms half a grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_\0308bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- doesn't match because \00e4 doesn't match only \0308
+SELECT U&'\0061\0308bc' LIKE U&'_\00e4bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ f
+(1 row)
+
 -- foreign keys (should use collation of primary key)
 -- PK is case-sensitive, FK is case-insensitive
 CREATE TABLE test10pk (x text COLLATE case_sensitive PRIMARY KEY);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 80f28a97d78..e08d1432941 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -514,6 +514,12 @@ CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
 
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -522,6 +528,9 @@ CREATE TABLE test6 (a int, b text);
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_det;
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -637,14 +646,14 @@ CREATE UNIQUE INDEX ON test3bpci (x);  -- error
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
 SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_sensitive;  -- ok, no rows
 SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no rows
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
 RESET enable_seqscan;
 
 -- Unicode special case: different variants of Greek lower case sigma.
@@ -687,6 +696,44 @@ CREATE TABLE test4 (a int, b text);
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE ignore_accents;  -- still case-sensitive
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+-- and in reverse:
+SELECT U&'\00E4bc' LIKE U&'\0061\0308_c' COLLATE ignore_accents;
+-- inner % matches b:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c' COLLATE ignore_accents;
+-- inner %% matches b then zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%%c' COLLATE ignore_accents;
+-- inner %% matches b then zero:
+SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
+-- trailing _ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
+-- trailing __ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb__' COLLATE ignore_accents;
+-- leading % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
+-- leading % matches zero (with later %):
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
+-- trailing % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4bc%' COLLATE ignore_accents;
+-- trailing % matches zero (with previous %):
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c%' COLLATE ignore_accents;
+-- _ versus two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents;
+-- (actually this matches because)
+SELECT U&'\0308bc' = 'bc' COLLATE ignore_accents;
+-- __ matches two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'__bc' COLLATE ignore_accents;
+-- _ matches one codepoint that forms half a grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_\0308bc' COLLATE ignore_accents;
+-- doesn't match because \00e4 doesn't match only \0308
+SELECT U&'\0061\0308bc' LIKE U&'_\00e4bc' COLLATE ignore_accents;
+
 -- foreign keys (should use collation of primary key)
 
 -- PK is case-sensitive, FK is case-insensitive

base-commit: f822be39629cd24a8ad1f8f6aa444e0c9ae1eaad
-- 
2.45.2

#15Jeff Davis
pgsql@j-davis.com
In reply to: Daniel Verite (#8)
Re: Support LIKE with nondeterministic collations

On Fri, 2024-05-03 at 16:58 +0200, Daniel Verite wrote:

   * Generating bounds for a sort key (prefix matching)

   Having sort keys for strings allows for easy creation of bounds -
   sort keys that are guaranteed to be smaller or larger than any
sort
   key from a give range. For example, if bounds are produced for a
   sortkey of string “smith”, strings between upper and lower bounds
   with one level would include “Smith”, “SMITH”, “sMiTh”. Two kinds
   of upper bounds can be generated - the first one will match only
   strings of equal length, while the second one will match all the
   strings with the same initial prefix.

   CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with
   the maximum primary weight, so that for example the string
   “smith\uFFFF” can be used as the upper bound rather than modifying
   the sort key for “smith”.

In other words it says that

  col LIKE 'smith%' collate "nd"

is equivalent to:

  col >= 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"

That logic seems to assume something about the collation. If you have a
collation that orders strings by their sha256 hash, that would entirely
break the connection between prefixes and ranges, and it wouldn't work.

Is there something about the way collations are defined that inherently
maintains a connection between a prefix and a range? Does it remain
true even when strange rules are added to a collation?

Regards,
Jeff Davis

#16Daniel Verite
daniel@manitou-mail.org
In reply to: Jeff Davis (#15)
Re: Support LIKE with nondeterministic collations

Jeff Davis wrote:

col LIKE 'smith%' collate "nd"

is equivalent to:

col >= 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"

That logic seems to assume something about the collation. If you have a
collation that orders strings by their sha256 hash, that would entirely
break the connection between prefixes and ranges, and it wouldn't work.

Indeed I would not trust this trick to work outside of ICU collations.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#17Peter Eisentraut
peter@eisentraut.org
In reply to: Peter Eisentraut (#14)
1 attachment(s)
Re: Support LIKE with nondeterministic collations

Here is an updated patch. It is rebased over the various recent changes
in the locale APIs. No other changes.

Show quoted text

On 30.07.24 21:46, Peter Eisentraut wrote:

On 27.07.24 00:32, Paul A Jungwirth wrote:

On Thu, Jun 27, 2024 at 11:31 PM Peter Eisentraut
<peter@eisentraut.org> wrote:

Here is an updated patch for this.

I took a look at this. I added some tests and found a few that give
the wrong result (I believe). The new tests are included in the
attached patch, along with the results I expect. Here are the
failures:

Thanks, these are great test cases.

  -- inner %% matches b then zero:
  SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
   ?column?
  ----------
- t
+ f
  (1 row)

  -- trailing _ matches two codepoints that form one char:
  SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
   ?column?
  ----------
- t
+ f
  (1 row)

-- leading % matches zero:
  SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
   ?column?
  ----------
- t
+ f
  (1 row)

  -- leading % matches zero (with later %):
  SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
   ?column?
  ----------
- t
+ f
  (1 row)

I think the 1st, 3rd, and 4th failures are all from % not backtracking
to match zero chars.

These are all because of this in like_match.c:

       * Otherwise, scan for a text position at which we can match the
       * rest of the pattern.  The first remaining pattern char is known
       * to be a regular or escaped literal character, so we can compare
       * the first pattern byte to each text byte to avoid recursing
       * more than we have to.  [...]

This shortcut doesn't work with nondeterministic collations, so we have
to recurse in any case here.  I have fixed that in the new patch; these
test cases work now.

The 2nd failure I'm not sure about. Maybe my expectation is wrong, but
then why does the same test pass with __ leading not trailing? Surely
they should be consistent.

The question is why is

SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;  -- false

but

SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents;  -- true

The second one matches because

    SELECT U&'\0308bc' = 'bc' COLLATE ignore_accents;

So the accent character will be ignored if it's adjacent to another
fixed substring in the pattern.

Attachments:

v5-0001-Support-LIKE-with-nondeterministic-collations.patchtext/plain; charset=UTF-8; name=v5-0001-Support-LIKE-with-nondeterministic-collations.patchDownload
From d27d247ad824dab5e2bb3638cebef21da4cc925c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Mon, 16 Sep 2024 07:57:56 +0200
Subject: [PATCH v5] Support LIKE with nondeterministic collations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This allows for example using LIKE with case-insensitive collations.
There was previously no internal implementation of this, so it was met
with a not-supported error.  This adds the internal implementation and
removes the error.

Unlike with deterministic collations, the LIKE matching cannot go
character by character but has to go substring by substring.  For
example, if we are matching against LIKE 'foo%bar', we can't start by
looking for an 'f', then an 'o', but instead with have to find
something that matches 'foo'.  This is because the collation could
consider substrings of different lengths to be equal.  This is all
internal to MatchText() in like_match.c.

The changes in GenericMatchText() in like.c just pass through the
locale information to MatchText(), which was previously not needed.
This matches exactly Generic_Text_IC_like() below.

Note that ILIKE is not affected.  It's unclear whether ILIKE makes
sense under nondeterministic collations.

This also updates match_pattern_prefix() in like_support.c to support
optimizing the case of an exact pattern with nondeterministic
collations.  This was already alluded to in the previous code.

(includes documentation examples from Daniel Vérité and test cases
from Paul A Jungwirth)

Discussion: https://www.postgresql.org/message-id/flat/700d2e86-bf75-4607-9cf2-f5b7802f6e88@eisentraut.org
---
 doc/src/sgml/charset.sgml                     |   2 +-
 doc/src/sgml/func.sgml                        |  51 ++++-
 src/backend/utils/adt/like.c                  |  26 ++-
 src/backend/utils/adt/like_match.c            | 124 +++++++++++-
 src/backend/utils/adt/like_support.c          |  29 ++-
 .../regress/expected/collate.icu.utf8.out     | 186 +++++++++++++++++-
 src/test/regress/sql/collate.icu.utf8.sql     |  53 ++++-
 7 files changed, 427 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 834cb30c85a..533b3af9045 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1197,7 +1197,7 @@ <title>Nondeterministic Collations</title>
      to a performance penalty.  Note, in particular, that B-tree cannot use
      deduplication with indexes that use a nondeterministic collation.  Also,
      certain operations are not possible with nondeterministic collations,
-     such as pattern matching operations.  Therefore, they should be used
+     such as some pattern matching operations.  Therefore, they should be used
      only in cases where they are specifically wanted.
     </para>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 84eb3a45eea..b62268d55c1 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5414,9 +5414,10 @@ <title>Pattern Matching</title>
    </caution>
 
    <para>
-    The pattern matching operators of all three kinds do not support
-    nondeterministic collations.  If required, apply a different collation to
-    the expression to work around this limitation.
+    <function>SIMILAR TO</function> and <acronym>POSIX</acronym>-style regular
+    expressions do not support nondeterministic collations.  If required, use
+    <function>LIKE</function> or apply a different collation to the expression
+    to work around this limitation.
    </para>
 
   <sect2 id="functions-like">
@@ -5462,6 +5463,45 @@ <title><function>LIKE</function></title>
 </programlisting>
    </para>
 
+   <para>
+    <function>LIKE</function> pattern matching supports nondeterministic
+    collations (see <xref linkend="collation-nondeterministic"/>), such as
+    case-insensitive collations or collations that, say, ignore punctuation.
+    So with a case-insensitive collation, one could have:
+<programlisting>
+'AbC' LIKE 'abc' COLLATE case_insensitive    <lineannotation>true</lineannotation>
+'AbC' LIKE 'a%' COLLATE case_insensitive     <lineannotation>true</lineannotation>
+</programlisting>
+    With collations that ignore certain characters or in general that consider
+    strings of different lengths equal, the semantics can become a bit more
+    complicated.  Consider these examples:
+<programlisting>
+'.foo.' LIKE 'foo' COLLATE ign_punct    <lineannotation>true</lineannotation>
+'.foo.' LIKE 'f_o' COLLATE ign_punct    <lineannotation>true</lineannotation>
+'.foo.' LIKE '_oo' COLLATE ign_punct    <lineannotation>false</lineannotation>
+</programlisting>
+    The way the matching works is that the pattern is partitioned into
+    sequences of wildcards and non-wildcard strings.  For example, the pattern
+    <literal>f_o</literal> is partitioned into <literal>f, _, o</literal>, the
+    pattern <literal>_oo</literal> is partitioned into <literal>_,
+    oo</literal>.  The input string matches the pattern if it can be
+    partitioned in such a way that the wildcards match one character or any
+    number of characters respectively and the non-wildcard partitions are
+    equal under the applicable collation.  So for example, <literal>'.foo.'
+    LIKE 'f_o' COLLATE ign_punct</literal> is true because one can partition
+    <literal>.foo.</literal> into <literal>.f, o, o.</literal>, and then
+    <literal>'.f' = 'f' COLLATE ign_punct</literal>, <literal>'o'</literal>
+    matches the <literal>_</literal> wildcard, and <literal>'o.' = 'o' COLLATE
+    ign_punct</literal>.  But <literal>'.foo.' LIKE '_oo' COLLATE
+    ign_punct</literal> is false because <literal>.foo.</literal> cannot be
+    partitioned in a way that the first character is any character and the
+    rest of the string compares equal to <literal>oo</literal>.  (Note that
+    the single-character wildcard always matches exactly one character,
+    independent of the collation.  So in this example, the
+    <literal>_</literal> would match <literal>.</literal>, but then the rest
+    of the input string won't match the rest of the pattern.)
+   </para>
+
    <para>
     <function>LIKE</function> pattern matching always covers the entire
     string.  Therefore, if it's desired to match a sequence anywhere within
@@ -5503,8 +5543,9 @@ <title><function>LIKE</function></title>
 
    <para>
     The key word <token>ILIKE</token> can be used instead of
-    <token>LIKE</token> to make the match case-insensitive according
-    to the active locale.  This is not in the <acronym>SQL</acronym> standard but is a
+    <token>LIKE</token> to make the match case-insensitive according to the
+    active locale.  (But this does not support nondeterministic collations.)
+    This is not in the <acronym>SQL</acronym> standard but is a
     <productname>PostgreSQL</productname> extension.
    </para>
 
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 0152723b2a6..7b3d1b5be71 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -147,22 +147,28 @@ SB_lower_char(unsigned char c, pg_locale_t locale)
 static inline int
 GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation)
 {
-	if (collation)
-	{
-		pg_locale_t locale = pg_newlocale_from_collation(collation);
+	pg_locale_t locale;
 
-		if (!locale->deterministic)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("nondeterministic collations are not supported for LIKE")));
+	if (!OidIsValid(collation))
+	{
+		/*
+		 * This typically means that the parser could not resolve a conflict
+		 * of implicit collations, so report it that way.
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_INDETERMINATE_COLLATION),
+				 errmsg("could not determine which collation to use for LIKE"),
+				 errhint("Use the COLLATE clause to set the collation explicitly.")));
 	}
 
+	locale = pg_newlocale_from_collation(collation);
+
 	if (pg_database_encoding_max_length() == 1)
-		return SB_MatchText(s, slen, p, plen, 0);
+		return SB_MatchText(s, slen, p, plen, locale);
 	else if (GetDatabaseEncoding() == PG_UTF8)
-		return UTF8_MatchText(s, slen, p, plen, 0);
+		return UTF8_MatchText(s, slen, p, plen, locale);
 	else
-		return MB_MatchText(s, slen, p, plen, 0);
+		return MB_MatchText(s, slen, p, plen, locale);
 }
 
 static inline int
diff --git a/src/backend/utils/adt/like_match.c b/src/backend/utils/adt/like_match.c
index f561cc15e4c..68f67b496c9 100644
--- a/src/backend/utils/adt/like_match.c
+++ b/src/backend/utils/adt/like_match.c
@@ -157,7 +157,9 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
 			 * the first pattern byte to each text byte to avoid recursing
 			 * more than we have to.  This fact also guarantees that we don't
 			 * have to consider a match to the zero-length substring at the
-			 * end of the text.
+			 * end of the text.  With a nondeterministic collation, we can't
+			 * rely on the first bytes being equal, so we have to recurse in
+			 * any case.
 			 */
 			if (*p == '\\')
 			{
@@ -172,7 +174,7 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
 
 			while (tlen > 0)
 			{
-				if (GETCHAR(*t, locale) == firstpat)
+				if (GETCHAR(*t, locale) == firstpat || (locale && !locale->deterministic))
 				{
 					int			matched = MatchText(t, tlen, p, plen, locale);
 
@@ -196,6 +198,124 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
 			NextByte(p, plen);
 			continue;
 		}
+		else if (locale && !locale->deterministic)
+		{
+			/*
+			 * For nondeterministic locales, we find the next substring of the
+			 * pattern that does not contain wildcards and try to find a
+			 * matching substring in the text.  Crucially, we cannot do this
+			 * character by character, as in the normal case, but must do it
+			 * substring by substring, partitioned by the wildcard characters.
+			 */
+			const char *p1;
+			size_t		p1len;
+			const char *t1;
+			size_t		t1len;
+			bool		found_escape;
+			const char *subpat;
+			size_t		subpatlen;
+			char	   *buf = NULL;
+
+			/*
+			 * Determine next substring of pattern without wildcards.  p is
+			 * the start of the subpattern, p1 is one past the last byte. Also
+			 * track if we found an escape character.
+			 */
+			p1 = p;
+			p1len = plen;
+			found_escape = false;
+			while (p1len > 0)
+			{
+				if (*p1 == '\\')
+				{
+					found_escape = true;
+					NextByte(p1, p1len);
+				}
+				else if (*p1 == '_' || *p1 == '%')
+					break;
+				NextByte(p1, p1len);
+			}
+
+			/*
+			 * If we found an escape character, then make an unescaped copy of
+			 * the subpattern.
+			 */
+			if (found_escape)
+			{
+				char	   *b;
+
+				b = buf = palloc(p1 - p);
+				for (const char *c = p; c < p1; c++)
+				{
+					if (*c == '\\')
+						;
+					else
+						*(b++) = *c;
+				}
+
+				subpat = buf;
+				subpatlen = b - buf;
+			}
+			else
+			{
+				subpat = p;
+				subpatlen = p1 - p;
+			}
+
+			/*
+			 * Now build a substring of the text and try to match it against
+			 * the subpattern.  t is the start of the text, t1 is one past the
+			 * last byte.  We start with a zero-length string.
+			 */
+			t1 = t;
+			t1len = tlen;
+			for (;;)
+			{
+				int			cmp;
+
+				cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);
+
+				/*
+				 * If we found a match, we have to test if the rest of pattern
+				 * can match against the rest of the string.  Otherwise we
+				 * have to continue here try matching with a longer substring.
+				 * (This is similar to the recursion for the '%' wildcard
+				 * above.)
+				 *
+				 * Note that we can't just wind forward p and t and continue
+				 * with the main loop.  This would fail for example with
+				 *
+				 * U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents
+				 *
+				 * You'd find that t=\0061 matches p=\00E4, but then the rest
+				 * won't match; but t=\0061\0308 also matches p=\00E4, and
+				 * then the rest will match.
+				 */
+				if (cmp == 0)
+				{
+					int			matched = MatchText(t1, t1len, p1, p1len, locale);
+
+					if (matched == LIKE_TRUE)
+					{
+						if (buf)
+							pfree(buf);
+						return matched;
+					}
+				}
+
+				/*
+				 * Didn't match.  If we used up the whole text, then the match
+				 * fails.  Otherwise, try again with a longer substring.
+				 */
+				if (t1len == 0)
+					return LIKE_FALSE;
+				else
+					NextChar(t1, t1len);
+			}
+			if (buf)
+				pfree(buf);
+			continue;
+		}
 		else if (GETCHAR(*p, locale) != GETCHAR(*t, locale))
 		{
 			/* non-wildcard pattern char fails to match text char */
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 79c4ddc7573..df9c0481a6d 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -272,22 +272,6 @@ match_pattern_prefix(Node *leftop,
 		return NIL;
 	patt = (Const *) rightop;
 
-	/*
-	 * Not supported if the expression collation is nondeterministic.  The
-	 * optimized equality or prefix tests use bytewise comparisons, which is
-	 * not consistent with nondeterministic collations.  The actual
-	 * pattern-matching implementation functions will later error out that
-	 * pattern-matching is not supported with nondeterministic collations. (We
-	 * could also error out here, but by doing it later we get more precise
-	 * error messages.)  (It should be possible to support at least
-	 * Pattern_Prefix_Exact, but no point as long as the actual
-	 * pattern-matching implementations don't support it.)
-	 *
-	 * expr_coll is not set for a non-collation-aware data type such as bytea.
-	 */
-	if (expr_coll && !get_collation_isdeterministic(expr_coll))
-		return NIL;
-
 	/*
 	 * Try to extract a fixed prefix from the pattern.
 	 */
@@ -404,6 +388,8 @@ match_pattern_prefix(Node *leftop,
 	{
 		if (!op_in_opfamily(eqopr, opfamily))
 			return NIL;
+		if (indexcollation != expr_coll)
+			return NIL;
 		expr = make_opclause(eqopr, BOOLOID, false,
 							 (Expr *) leftop, (Expr *) prefix,
 							 InvalidOid, indexcollation);
@@ -411,6 +397,17 @@ match_pattern_prefix(Node *leftop,
 		return result;
 	}
 
+	/*
+	 * Anything other than Pattern_Prefix_Exact is not supported if the
+	 * expression collation is nondeterministic.  The optimized equality or
+	 * prefix tests use bytewise comparisons, which is not consistent with
+	 * nondeterministic collations.
+	 *
+	 * expr_coll is not set for a non-collation-aware data type such as bytea.
+	 */
+	if (expr_coll && !get_collation_isdeterministic(expr_coll))
+		return NIL;
+
 	/*
 	 * Otherwise, we have a nonempty required prefix of the values.  Some
 	 * opclasses support prefix checks directly, otherwise we'll try to
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 31345295c11..33aeb229dec 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1274,6 +1274,30 @@ CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 NOTICE:  using standard form "und" for ICU locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
 NOTICE:  using standard form "und" for ICU locale ""
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1298,6 +1322,19 @@ SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
  2 | äbc
 (2 rows)
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+ a |  b  
+---+-----
+ 1 | äbc
+(1 row)
+
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+ a |  b  
+---+-----
+ 1 | äbc
+ 2 | äbc
+(2 rows)
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -1514,7 +1551,12 @@ SELECT x FROM test3ci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3ci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3ci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3ci WHERE x SIMILAR TO 'a%';
@@ -1632,7 +1674,12 @@ SELECT x FROM test3bpci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3bpci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3bpci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3bpci WHERE x SIMILAR TO 'a%';
@@ -1729,7 +1776,7 @@ SELECT string_to_array('ABCDEFGHI'::char(9) COLLATE case_insensitive, NULL, 'b')
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
@@ -1743,10 +1790,18 @@ SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no rows
 ---
 (0 rows)
 
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
 RESET enable_seqscan;
 -- Unicode special case: different variants of Greek lower case sigma.
 -- A naive implementation like citext that just does lower(x) =
@@ -1840,6 +1895,123 @@ SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
  1 | cote
 (1 row)
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- and in reverse:
+SELECT U&'\00E4bc' LIKE U&'\0061\0308_c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner % matches b:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner %% matches b then zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner %% matches b then zero:
+SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing _ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
+ ?column? 
+----------
+ f
+(1 row)
+
+-- trailing __ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb__' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- leading % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- leading % matches zero (with later %):
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4bc%' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing % matches zero (with previous %):
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c%' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- _ versus two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- (actually this matches because)
+SELECT U&'\0308bc' = 'bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- __ matches two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'__bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- _ matches one codepoint that forms half a grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_\0308bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- doesn't match because \00e4 doesn't match only \0308
+SELECT U&'\0061\0308bc' LIKE U&'_\00e4bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ f
+(1 row)
+
 -- foreign keys (should use collation of primary key)
 -- PK is case-sensitive, FK is case-insensitive
 CREATE TABLE test10pk (x text COLLATE case_sensitive PRIMARY KEY);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 80f28a97d78..e08d1432941 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -514,6 +514,12 @@ CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
 
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -522,6 +528,9 @@ CREATE TABLE test6 (a int, b text);
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_det;
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -637,14 +646,14 @@ CREATE UNIQUE INDEX ON test3bpci (x);  -- error
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
 SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_sensitive;  -- ok, no rows
 SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no rows
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
 RESET enable_seqscan;
 
 -- Unicode special case: different variants of Greek lower case sigma.
@@ -687,6 +696,44 @@ CREATE TABLE test4 (a int, b text);
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE ignore_accents;  -- still case-sensitive
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+-- and in reverse:
+SELECT U&'\00E4bc' LIKE U&'\0061\0308_c' COLLATE ignore_accents;
+-- inner % matches b:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c' COLLATE ignore_accents;
+-- inner %% matches b then zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%%c' COLLATE ignore_accents;
+-- inner %% matches b then zero:
+SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
+-- trailing _ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
+-- trailing __ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb__' COLLATE ignore_accents;
+-- leading % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
+-- leading % matches zero (with later %):
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
+-- trailing % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4bc%' COLLATE ignore_accents;
+-- trailing % matches zero (with previous %):
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c%' COLLATE ignore_accents;
+-- _ versus two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents;
+-- (actually this matches because)
+SELECT U&'\0308bc' = 'bc' COLLATE ignore_accents;
+-- __ matches two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'__bc' COLLATE ignore_accents;
+-- _ matches one codepoint that forms half a grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_\0308bc' COLLATE ignore_accents;
+-- doesn't match because \00e4 doesn't match only \0308
+SELECT U&'\0061\0308bc' LIKE U&'_\00e4bc' COLLATE ignore_accents;
+
 -- foreign keys (should use collation of primary key)
 
 -- PK is case-sensitive, FK is case-insensitive

base-commit: 4632e5cf4bc5c496f41dfc6a89533e7afa7262dd
-- 
2.46.0

#18Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Peter Eisentraut (#17)
Re: Support LIKE with nondeterministic collations

On Sun, Sep 15, 2024 at 11:26 PM Peter Eisentraut <peter@eisentraut.org> wrote:

Here is an updated patch. It is rebased over the various recent changes
in the locale APIs. No other changes.

libfuzzer is unhappy about the following code in MatchText:

+            while (p1len > 0)
+            {
+                if (*p1 == '\\')
+                {
+                    found_escape = true;
+                    NextByte(p1, p1len);
+                }
+                else if (*p1 == '_' || *p1 == '%')
+                    break;
+                NextByte(p1, p1len);
+            }

If the pattern ends with a backslash, we'll call NextByte() twice,
p1len will wrap around to INT_MAX, and we'll walk off the end of the
buffer. (I fixed it locally by duplicating the ERROR case that's
directly above this.)

So far that's the only thing reported, but fuzzing is slow. The fuzzer
is incentivized to find more and more horrible call stacks, which in
this case means it's finding inefficient patterns with a lot of
backtracking. (Performance drops from 25000+ iterations per second, to
roughly 50 per second, pretty quickly, and that's not fast enough to
make good progress.) I haven't dug in yet to see whether there are
optimizations that would avoid the worst cases.

Thanks,
--Jacob

#19Peter Eisentraut
peter@eisentraut.org
In reply to: Jacob Champion (#18)
1 attachment(s)
Re: Support LIKE with nondeterministic collations

On 29.10.24 18:15, Jacob Champion wrote:

libfuzzer is unhappy about the following code in MatchText:

+            while (p1len > 0)
+            {
+                if (*p1 == '\\')
+                {
+                    found_escape = true;
+                    NextByte(p1, p1len);
+                }
+                else if (*p1 == '_' || *p1 == '%')
+                    break;
+                NextByte(p1, p1len);
+            }

If the pattern ends with a backslash, we'll call NextByte() twice,
p1len will wrap around to INT_MAX, and we'll walk off the end of the
buffer. (I fixed it locally by duplicating the ERROR case that's
directly above this.)

Thanks. Here is an updated patch with that fixed.

Attachments:

v6-0001-Support-LIKE-with-nondeterministic-collations.patchtext/plain; charset=UTF-8; name=v6-0001-Support-LIKE-with-nondeterministic-collations.patchDownload
From 17f793293469bae8d5818edee62f0429fd02df67 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Mon, 4 Nov 2024 09:16:44 +0100
Subject: [PATCH v6] Support LIKE with nondeterministic collations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This allows for example using LIKE with case-insensitive collations.
There was previously no internal implementation of this, so it was met
with a not-supported error.  This adds the internal implementation and
removes the error.  The implementation follows the specification of
the SQL standard for this.

Unlike with deterministic collations, the LIKE matching cannot go
character by character but has to go substring by substring.  For
example, if we are matching against LIKE 'foo%bar', we can't start by
looking for an 'f', then an 'o', but instead with have to find
something that matches 'foo'.  This is because the collation could
consider substrings of different lengths to be equal.  This is all
internal to MatchText() in like_match.c.

The changes in GenericMatchText() in like.c just pass through the
locale information to MatchText(), which was previously not needed.
This matches exactly Generic_Text_IC_like() below.

ILIKE is not affected.  (It's unclear whether ILIKE makes sense under
nondeterministic collations.)

This also updates match_pattern_prefix() in like_support.c to support
optimizing the case of an exact pattern with nondeterministic
collations.  This was already alluded to in the previous code.

(includes documentation examples from Daniel Vérité and test cases
from Paul A Jungwirth)

Discussion: https://www.postgresql.org/message-id/flat/700d2e86-bf75-4607-9cf2-f5b7802f6e88@eisentraut.org
---
 doc/src/sgml/charset.sgml                     |   2 +-
 doc/src/sgml/func.sgml                        |  51 ++++-
 src/backend/utils/adt/like.c                  |  26 ++-
 src/backend/utils/adt/like_match.c            | 128 +++++++++++-
 src/backend/utils/adt/like_support.c          |  29 ++-
 .../regress/expected/collate.icu.utf8.out     | 189 +++++++++++++++++-
 src/test/regress/sql/collate.icu.utf8.sql     |  55 ++++-
 7 files changed, 436 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index f5e115e8d6e..00e1986849a 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1197,7 +1197,7 @@ <title>Nondeterministic Collations</title>
      to a performance penalty.  Note, in particular, that B-tree cannot use
      deduplication with indexes that use a nondeterministic collation.  Also,
      certain operations are not possible with nondeterministic collations,
-     such as pattern matching operations.  Therefore, they should be used
+     such as some pattern matching operations.  Therefore, they should be used
      only in cases where they are specifically wanted.
     </para>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 223d869f8c8..285111fc905 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5414,9 +5414,10 @@ <title>Pattern Matching</title>
    </caution>
 
    <para>
-    The pattern matching operators of all three kinds do not support
-    nondeterministic collations.  If required, apply a different collation to
-    the expression to work around this limitation.
+    <function>SIMILAR TO</function> and <acronym>POSIX</acronym>-style regular
+    expressions do not support nondeterministic collations.  If required, use
+    <function>LIKE</function> or apply a different collation to the expression
+    to work around this limitation.
    </para>
 
   <sect2 id="functions-like">
@@ -5462,6 +5463,45 @@ <title><function>LIKE</function></title>
 </programlisting>
    </para>
 
+   <para>
+    <function>LIKE</function> pattern matching supports nondeterministic
+    collations (see <xref linkend="collation-nondeterministic"/>), such as
+    case-insensitive collations or collations that, say, ignore punctuation.
+    So with a case-insensitive collation, one could have:
+<programlisting>
+'AbC' LIKE 'abc' COLLATE case_insensitive    <lineannotation>true</lineannotation>
+'AbC' LIKE 'a%' COLLATE case_insensitive     <lineannotation>true</lineannotation>
+</programlisting>
+    With collations that ignore certain characters or in general that consider
+    strings of different lengths equal, the semantics can become a bit more
+    complicated.  Consider these examples:
+<programlisting>
+'.foo.' LIKE 'foo' COLLATE ign_punct    <lineannotation>true</lineannotation>
+'.foo.' LIKE 'f_o' COLLATE ign_punct    <lineannotation>true</lineannotation>
+'.foo.' LIKE '_oo' COLLATE ign_punct    <lineannotation>false</lineannotation>
+</programlisting>
+    The way the matching works is that the pattern is partitioned into
+    sequences of wildcards and non-wildcard strings.  For example, the pattern
+    <literal>f_o</literal> is partitioned into <literal>f, _, o</literal>, the
+    pattern <literal>_oo</literal> is partitioned into <literal>_,
+    oo</literal>.  The input string matches the pattern if it can be
+    partitioned in such a way that the wildcards match one character or any
+    number of characters respectively and the non-wildcard partitions are
+    equal under the applicable collation.  So for example, <literal>'.foo.'
+    LIKE 'f_o' COLLATE ign_punct</literal> is true because one can partition
+    <literal>.foo.</literal> into <literal>.f, o, o.</literal>, and then
+    <literal>'.f' = 'f' COLLATE ign_punct</literal>, <literal>'o'</literal>
+    matches the <literal>_</literal> wildcard, and <literal>'o.' = 'o' COLLATE
+    ign_punct</literal>.  But <literal>'.foo.' LIKE '_oo' COLLATE
+    ign_punct</literal> is false because <literal>.foo.</literal> cannot be
+    partitioned in a way that the first character is any character and the
+    rest of the string compares equal to <literal>oo</literal>.  (Note that
+    the single-character wildcard always matches exactly one character,
+    independent of the collation.  So in this example, the
+    <literal>_</literal> would match <literal>.</literal>, but then the rest
+    of the input string won't match the rest of the pattern.)
+   </para>
+
    <para>
     <function>LIKE</function> pattern matching always covers the entire
     string.  Therefore, if it's desired to match a sequence anywhere within
@@ -5503,8 +5543,9 @@ <title><function>LIKE</function></title>
 
    <para>
     The key word <token>ILIKE</token> can be used instead of
-    <token>LIKE</token> to make the match case-insensitive according
-    to the active locale.  This is not in the <acronym>SQL</acronym> standard but is a
+    <token>LIKE</token> to make the match case-insensitive according to the
+    active locale.  (But this does not support nondeterministic collations.)
+    This is not in the <acronym>SQL</acronym> standard but is a
     <productname>PostgreSQL</productname> extension.
    </para>
 
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 0152723b2a6..7b3d1b5be71 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -147,22 +147,28 @@ SB_lower_char(unsigned char c, pg_locale_t locale)
 static inline int
 GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation)
 {
-	if (collation)
-	{
-		pg_locale_t locale = pg_newlocale_from_collation(collation);
+	pg_locale_t locale;
 
-		if (!locale->deterministic)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("nondeterministic collations are not supported for LIKE")));
+	if (!OidIsValid(collation))
+	{
+		/*
+		 * This typically means that the parser could not resolve a conflict
+		 * of implicit collations, so report it that way.
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_INDETERMINATE_COLLATION),
+				 errmsg("could not determine which collation to use for LIKE"),
+				 errhint("Use the COLLATE clause to set the collation explicitly.")));
 	}
 
+	locale = pg_newlocale_from_collation(collation);
+
 	if (pg_database_encoding_max_length() == 1)
-		return SB_MatchText(s, slen, p, plen, 0);
+		return SB_MatchText(s, slen, p, plen, locale);
 	else if (GetDatabaseEncoding() == PG_UTF8)
-		return UTF8_MatchText(s, slen, p, plen, 0);
+		return UTF8_MatchText(s, slen, p, plen, locale);
 	else
-		return MB_MatchText(s, slen, p, plen, 0);
+		return MB_MatchText(s, slen, p, plen, locale);
 }
 
 static inline int
diff --git a/src/backend/utils/adt/like_match.c b/src/backend/utils/adt/like_match.c
index f561cc15e4c..79de4c7e4b4 100644
--- a/src/backend/utils/adt/like_match.c
+++ b/src/backend/utils/adt/like_match.c
@@ -157,7 +157,9 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
 			 * the first pattern byte to each text byte to avoid recursing
 			 * more than we have to.  This fact also guarantees that we don't
 			 * have to consider a match to the zero-length substring at the
-			 * end of the text.
+			 * end of the text.  With a nondeterministic collation, we can't
+			 * rely on the first bytes being equal, so we have to recurse in
+			 * any case.
 			 */
 			if (*p == '\\')
 			{
@@ -172,7 +174,7 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
 
 			while (tlen > 0)
 			{
-				if (GETCHAR(*t, locale) == firstpat)
+				if (GETCHAR(*t, locale) == firstpat || (locale && !locale->deterministic))
 				{
 					int			matched = MatchText(t, tlen, p, plen, locale);
 
@@ -196,6 +198,128 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
 			NextByte(p, plen);
 			continue;
 		}
+		else if (locale && !locale->deterministic)
+		{
+			/*
+			 * For nondeterministic locales, we find the next substring of the
+			 * pattern that does not contain wildcards and try to find a
+			 * matching substring in the text.  Crucially, we cannot do this
+			 * character by character, as in the normal case, but must do it
+			 * substring by substring, partitioned by the wildcard characters.
+			 */
+			const char *p1;
+			size_t		p1len;
+			const char *t1;
+			size_t		t1len;
+			bool		found_escape;
+			const char *subpat;
+			size_t		subpatlen;
+			char	   *buf = NULL;
+
+			/*
+			 * Determine next substring of pattern without wildcards.  p is
+			 * the start of the subpattern, p1 is one past the last byte. Also
+			 * track if we found an escape character.
+			 */
+			p1 = p;
+			p1len = plen;
+			found_escape = false;
+			while (p1len > 0)
+			{
+				if (*p1 == '\\')
+				{
+					found_escape = true;
+					NextByte(p1, p1len);
+					if (p1len == 0)
+						ereport(ERROR,
+								(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
+								 errmsg("LIKE pattern must not end with escape character")));
+				}
+				else if (*p1 == '_' || *p1 == '%')
+					break;
+				NextByte(p1, p1len);
+			}
+
+			/*
+			 * If we found an escape character, then make an unescaped copy of
+			 * the subpattern.
+			 */
+			if (found_escape)
+			{
+				char	   *b;
+
+				b = buf = palloc(p1 - p);
+				for (const char *c = p; c < p1; c++)
+				{
+					if (*c == '\\')
+						;
+					else
+						*(b++) = *c;
+				}
+
+				subpat = buf;
+				subpatlen = b - buf;
+			}
+			else
+			{
+				subpat = p;
+				subpatlen = p1 - p;
+			}
+
+			/*
+			 * Now build a substring of the text and try to match it against
+			 * the subpattern.  t is the start of the text, t1 is one past the
+			 * last byte.  We start with a zero-length string.
+			 */
+			t1 = t;
+			t1len = tlen;
+			for (;;)
+			{
+				int			cmp;
+
+				cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);
+
+				/*
+				 * If we found a match, we have to test if the rest of pattern
+				 * can match against the rest of the string.  Otherwise we
+				 * have to continue here try matching with a longer substring.
+				 * (This is similar to the recursion for the '%' wildcard
+				 * above.)
+				 *
+				 * Note that we can't just wind forward p and t and continue
+				 * with the main loop.  This would fail for example with
+				 *
+				 * U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents
+				 *
+				 * You'd find that t=\0061 matches p=\00E4, but then the rest
+				 * won't match; but t=\0061\0308 also matches p=\00E4, and
+				 * then the rest will match.
+				 */
+				if (cmp == 0)
+				{
+					int			matched = MatchText(t1, t1len, p1, p1len, locale);
+
+					if (matched == LIKE_TRUE)
+					{
+						if (buf)
+							pfree(buf);
+						return matched;
+					}
+				}
+
+				/*
+				 * Didn't match.  If we used up the whole text, then the match
+				 * fails.  Otherwise, try again with a longer substring.
+				 */
+				if (t1len == 0)
+					return LIKE_FALSE;
+				else
+					NextChar(t1, t1len);
+			}
+			if (buf)
+				pfree(buf);
+			continue;
+		}
 		else if (GETCHAR(*p, locale) != GETCHAR(*t, locale))
 		{
 			/* non-wildcard pattern char fails to match text char */
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 8b15509a3bf..ee71ca89ffd 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -272,22 +272,6 @@ match_pattern_prefix(Node *leftop,
 		return NIL;
 	patt = (Const *) rightop;
 
-	/*
-	 * Not supported if the expression collation is nondeterministic.  The
-	 * optimized equality or prefix tests use bytewise comparisons, which is
-	 * not consistent with nondeterministic collations.  The actual
-	 * pattern-matching implementation functions will later error out that
-	 * pattern-matching is not supported with nondeterministic collations. (We
-	 * could also error out here, but by doing it later we get more precise
-	 * error messages.)  (It should be possible to support at least
-	 * Pattern_Prefix_Exact, but no point as long as the actual
-	 * pattern-matching implementations don't support it.)
-	 *
-	 * expr_coll is not set for a non-collation-aware data type such as bytea.
-	 */
-	if (expr_coll && !get_collation_isdeterministic(expr_coll))
-		return NIL;
-
 	/*
 	 * Try to extract a fixed prefix from the pattern.
 	 */
@@ -404,6 +388,8 @@ match_pattern_prefix(Node *leftop,
 	{
 		if (!op_in_opfamily(eqopr, opfamily))
 			return NIL;
+		if (indexcollation != expr_coll)
+			return NIL;
 		expr = make_opclause(eqopr, BOOLOID, false,
 							 (Expr *) leftop, (Expr *) prefix,
 							 InvalidOid, indexcollation);
@@ -411,6 +397,17 @@ match_pattern_prefix(Node *leftop,
 		return result;
 	}
 
+	/*
+	 * Anything other than Pattern_Prefix_Exact is not supported if the
+	 * expression collation is nondeterministic.  The optimized equality or
+	 * prefix tests use bytewise comparisons, which is not consistent with
+	 * nondeterministic collations.
+	 *
+	 * expr_coll is not set for a non-collation-aware data type such as bytea.
+	 */
+	if (expr_coll && !get_collation_isdeterministic(expr_coll))
+		return NIL;
+
 	/*
 	 * Otherwise, we have a nonempty required prefix of the values.  Some
 	 * opclasses support prefix checks directly, otherwise we'll try to
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index faa376e060c..7da68105b8c 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1274,6 +1274,30 @@ CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 NOTICE:  using standard form "und" for ICU locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
 NOTICE:  using standard form "und" for ICU locale ""
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1298,6 +1322,19 @@ SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
  2 | äbc
 (2 rows)
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+ a |  b  
+---+-----
+ 1 | äbc
+(1 row)
+
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+ a |  b  
+---+-----
+ 1 | äbc
+ 2 | äbc
+(2 rows)
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -1514,7 +1551,12 @@ SELECT x FROM test3ci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3ci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3ci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3ci WHERE x SIMILAR TO 'a%';
@@ -1632,7 +1674,12 @@ SELECT x FROM test3bpci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3bpci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3bpci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3bpci WHERE x SIMILAR TO 'a%';
@@ -1729,7 +1776,7 @@ SELECT string_to_array('ABCDEFGHI'::char(9) COLLATE case_insensitive, NULL, 'b')
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
@@ -1743,10 +1790,18 @@ SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no rows
 ---
 (0 rows)
 
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
 RESET enable_seqscan;
 -- Unicode special case: different variants of Greek lower case sigma.
 -- A naive implementation like citext that just does lower(x) =
@@ -1840,6 +1895,126 @@ SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
  1 | cote
 (1 row)
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- and in reverse:
+SELECT U&'\00E4bc' LIKE U&'\0061\0308_c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner % matches b:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner %% matches b then zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner %% matches b then zero:
+SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing _ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
+ ?column? 
+----------
+ f
+(1 row)
+
+-- trailing __ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb__' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- leading % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- leading % matches zero (with later %):
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4bc%' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing % matches zero (with previous %):
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c%' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- _ versus two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- (actually this matches because)
+SELECT U&'\0308bc' = 'bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- __ matches two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'__bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- _ matches one codepoint that forms half a grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_\0308bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- doesn't match because \00e4 doesn't match only \0308
+SELECT U&'\0061\0308bc' LIKE U&'_\00e4bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ f
+(1 row)
+
+-- escape character at end of pattern
+SELECT 'foox' LIKE 'foo\' COLLATE ignore_accents;
+ERROR:  LIKE pattern must not end with escape character
 -- foreign keys (should use collation of primary key)
 -- PK is case-sensitive, FK is case-insensitive
 CREATE TABLE test10pk (x text COLLATE case_sensitive PRIMARY KEY);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 80f28a97d78..f8dbafbb655 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -514,6 +514,12 @@ CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
 
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -522,6 +528,9 @@ CREATE TABLE test6 (a int, b text);
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_det;
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -637,14 +646,14 @@ CREATE UNIQUE INDEX ON test3bpci (x);  -- error
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
 SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_sensitive;  -- ok, no rows
 SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no rows
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
 RESET enable_seqscan;
 
 -- Unicode special case: different variants of Greek lower case sigma.
@@ -687,6 +696,46 @@ CREATE TABLE test4 (a int, b text);
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE ignore_accents;  -- still case-sensitive
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+-- and in reverse:
+SELECT U&'\00E4bc' LIKE U&'\0061\0308_c' COLLATE ignore_accents;
+-- inner % matches b:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c' COLLATE ignore_accents;
+-- inner %% matches b then zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%%c' COLLATE ignore_accents;
+-- inner %% matches b then zero:
+SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
+-- trailing _ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
+-- trailing __ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb__' COLLATE ignore_accents;
+-- leading % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
+-- leading % matches zero (with later %):
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
+-- trailing % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4bc%' COLLATE ignore_accents;
+-- trailing % matches zero (with previous %):
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c%' COLLATE ignore_accents;
+-- _ versus two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents;
+-- (actually this matches because)
+SELECT U&'\0308bc' = 'bc' COLLATE ignore_accents;
+-- __ matches two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'__bc' COLLATE ignore_accents;
+-- _ matches one codepoint that forms half a grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_\0308bc' COLLATE ignore_accents;
+-- doesn't match because \00e4 doesn't match only \0308
+SELECT U&'\0061\0308bc' LIKE U&'_\00e4bc' COLLATE ignore_accents;
+-- escape character at end of pattern
+SELECT 'foox' LIKE 'foo\' COLLATE ignore_accents;
+
 -- foreign keys (should use collation of primary key)
 
 -- PK is case-sensitive, FK is case-insensitive

base-commit: 5b0c46ea0932e3be64081a277b5cc01fa9571689
-- 
2.47.0

#20Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Eisentraut (#19)
Re: Support LIKE with nondeterministic collations

On 04/11/2024 10:26, Peter Eisentraut wrote:

On 29.10.24 18:15, Jacob Champion wrote:

libfuzzer is unhappy about the following code in MatchText:

+            while (p1len > 0)
+            {
+                if (*p1 == '\\')
+                {
+                    found_escape = true;
+                    NextByte(p1, p1len);
+                }
+                else if (*p1 == '_' || *p1 == '%')
+                    break;
+                NextByte(p1, p1len);
+            }

If the pattern ends with a backslash, we'll call NextByte() twice,
p1len will wrap around to INT_MAX, and we'll walk off the end of the
buffer. (I fixed it locally by duplicating the ERROR case that's
directly above this.)

Thanks.  Here is an updated patch with that fixed.

Sadly the algorithm is O(n^2) with non-deterministic collations.Is there
any way this could be optimized? We make no claims on how expensive any
functions or operators are, so I suppose a slow implementation is
nevertheless better than throwing an error.

Let's at least add some CHECK_FOR_INTERRUPTS(). For example, this takes
a very long time and is uninterruptible:

SELECT repeat('x', 100000) LIKE '%xxxy%' COLLATE ignore_accents;

--
Heikki Linnakangas
Neon (https://neon.tech)

#21Peter Eisentraut
peter@eisentraut.org
In reply to: Heikki Linnakangas (#20)
1 attachment(s)
Re: Support LIKE with nondeterministic collations

On 11.11.24 14:25, Heikki Linnakangas wrote:

Sadly the algorithm is O(n^2) with non-deterministic collations.Is there
any way this could be optimized? We make no claims on how expensive any
functions or operators are, so I suppose a slow implementation is
nevertheless better than throwing an error.

Yeah, maybe someone comes up with new ideas in the future.

Let's at least add some CHECK_FOR_INTERRUPTS(). For example, this takes
a very long time and is uninterruptible:

 SELECT repeat('x', 100000) LIKE '%xxxy%' COLLATE ignore_accents;

Good idea. Here is a new patch version with an interrupt check.

Attachments:

v7-0001-Support-LIKE-with-nondeterministic-collations.patchtext/plain; charset=UTF-8; name=v7-0001-Support-LIKE-with-nondeterministic-collations.patchDownload
From bc962a3ad09f9ddaa6e9bc345376f26d16471612 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Tue, 12 Nov 2024 08:43:27 +0100
Subject: [PATCH v7] Support LIKE with nondeterministic collations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This allows for example using LIKE with case-insensitive collations.
There was previously no internal implementation of this, so it was met
with a not-supported error.  This adds the internal implementation and
removes the error.  The implementation follows the specification of
the SQL standard for this.

Unlike with deterministic collations, the LIKE matching cannot go
character by character but has to go substring by substring.  For
example, if we are matching against LIKE 'foo%bar', we can't start by
looking for an 'f', then an 'o', but instead with have to find
something that matches 'foo'.  This is because the collation could
consider substrings of different lengths to be equal.  This is all
internal to MatchText() in like_match.c.

The changes in GenericMatchText() in like.c just pass through the
locale information to MatchText(), which was previously not needed.
This matches exactly Generic_Text_IC_like() below.

ILIKE is not affected.  (It's unclear whether ILIKE makes sense under
nondeterministic collations.)

This also updates match_pattern_prefix() in like_support.c to support
optimizing the case of an exact pattern with nondeterministic
collations.  This was already alluded to in the previous code.

(includes documentation examples from Daniel Vérité and test cases
from Paul A Jungwirth)

Discussion: https://www.postgresql.org/message-id/flat/700d2e86-bf75-4607-9cf2-f5b7802f6e88@eisentraut.org
---
 doc/src/sgml/charset.sgml                     |   2 +-
 doc/src/sgml/func.sgml                        |  51 ++++-
 src/backend/utils/adt/like.c                  |  26 ++-
 src/backend/utils/adt/like_match.c            | 130 +++++++++++-
 src/backend/utils/adt/like_support.c          |  29 ++-
 .../regress/expected/collate.icu.utf8.out     | 189 +++++++++++++++++-
 src/test/regress/sql/collate.icu.utf8.sql     |  55 ++++-
 7 files changed, 438 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index f5e115e8d6e..00e1986849a 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1197,7 +1197,7 @@ <title>Nondeterministic Collations</title>
      to a performance penalty.  Note, in particular, that B-tree cannot use
      deduplication with indexes that use a nondeterministic collation.  Also,
      certain operations are not possible with nondeterministic collations,
-     such as pattern matching operations.  Therefore, they should be used
+     such as some pattern matching operations.  Therefore, they should be used
      only in cases where they are specifically wanted.
     </para>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 73979f20fff..4fc29379de4 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5414,9 +5414,10 @@ <title>Pattern Matching</title>
    </caution>
 
    <para>
-    The pattern matching operators of all three kinds do not support
-    nondeterministic collations.  If required, apply a different collation to
-    the expression to work around this limitation.
+    <function>SIMILAR TO</function> and <acronym>POSIX</acronym>-style regular
+    expressions do not support nondeterministic collations.  If required, use
+    <function>LIKE</function> or apply a different collation to the expression
+    to work around this limitation.
    </para>
 
   <sect2 id="functions-like">
@@ -5462,6 +5463,45 @@ <title><function>LIKE</function></title>
 </programlisting>
    </para>
 
+   <para>
+    <function>LIKE</function> pattern matching supports nondeterministic
+    collations (see <xref linkend="collation-nondeterministic"/>), such as
+    case-insensitive collations or collations that, say, ignore punctuation.
+    So with a case-insensitive collation, one could have:
+<programlisting>
+'AbC' LIKE 'abc' COLLATE case_insensitive    <lineannotation>true</lineannotation>
+'AbC' LIKE 'a%' COLLATE case_insensitive     <lineannotation>true</lineannotation>
+</programlisting>
+    With collations that ignore certain characters or in general that consider
+    strings of different lengths equal, the semantics can become a bit more
+    complicated.  Consider these examples:
+<programlisting>
+'.foo.' LIKE 'foo' COLLATE ign_punct    <lineannotation>true</lineannotation>
+'.foo.' LIKE 'f_o' COLLATE ign_punct    <lineannotation>true</lineannotation>
+'.foo.' LIKE '_oo' COLLATE ign_punct    <lineannotation>false</lineannotation>
+</programlisting>
+    The way the matching works is that the pattern is partitioned into
+    sequences of wildcards and non-wildcard strings.  For example, the pattern
+    <literal>f_o</literal> is partitioned into <literal>f, _, o</literal>, the
+    pattern <literal>_oo</literal> is partitioned into <literal>_,
+    oo</literal>.  The input string matches the pattern if it can be
+    partitioned in such a way that the wildcards match one character or any
+    number of characters respectively and the non-wildcard partitions are
+    equal under the applicable collation.  So for example, <literal>'.foo.'
+    LIKE 'f_o' COLLATE ign_punct</literal> is true because one can partition
+    <literal>.foo.</literal> into <literal>.f, o, o.</literal>, and then
+    <literal>'.f' = 'f' COLLATE ign_punct</literal>, <literal>'o'</literal>
+    matches the <literal>_</literal> wildcard, and <literal>'o.' = 'o' COLLATE
+    ign_punct</literal>.  But <literal>'.foo.' LIKE '_oo' COLLATE
+    ign_punct</literal> is false because <literal>.foo.</literal> cannot be
+    partitioned in a way that the first character is any character and the
+    rest of the string compares equal to <literal>oo</literal>.  (Note that
+    the single-character wildcard always matches exactly one character,
+    independent of the collation.  So in this example, the
+    <literal>_</literal> would match <literal>.</literal>, but then the rest
+    of the input string won't match the rest of the pattern.)
+   </para>
+
    <para>
     <function>LIKE</function> pattern matching always covers the entire
     string.  Therefore, if it's desired to match a sequence anywhere within
@@ -5503,8 +5543,9 @@ <title><function>LIKE</function></title>
 
    <para>
     The key word <token>ILIKE</token> can be used instead of
-    <token>LIKE</token> to make the match case-insensitive according
-    to the active locale.  This is not in the <acronym>SQL</acronym> standard but is a
+    <token>LIKE</token> to make the match case-insensitive according to the
+    active locale.  (But this does not support nondeterministic collations.)
+    This is not in the <acronym>SQL</acronym> standard but is a
     <productname>PostgreSQL</productname> extension.
    </para>
 
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 0152723b2a6..7b3d1b5be71 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -147,22 +147,28 @@ SB_lower_char(unsigned char c, pg_locale_t locale)
 static inline int
 GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation)
 {
-	if (collation)
-	{
-		pg_locale_t locale = pg_newlocale_from_collation(collation);
+	pg_locale_t locale;
 
-		if (!locale->deterministic)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("nondeterministic collations are not supported for LIKE")));
+	if (!OidIsValid(collation))
+	{
+		/*
+		 * This typically means that the parser could not resolve a conflict
+		 * of implicit collations, so report it that way.
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_INDETERMINATE_COLLATION),
+				 errmsg("could not determine which collation to use for LIKE"),
+				 errhint("Use the COLLATE clause to set the collation explicitly.")));
 	}
 
+	locale = pg_newlocale_from_collation(collation);
+
 	if (pg_database_encoding_max_length() == 1)
-		return SB_MatchText(s, slen, p, plen, 0);
+		return SB_MatchText(s, slen, p, plen, locale);
 	else if (GetDatabaseEncoding() == PG_UTF8)
-		return UTF8_MatchText(s, slen, p, plen, 0);
+		return UTF8_MatchText(s, slen, p, plen, locale);
 	else
-		return MB_MatchText(s, slen, p, plen, 0);
+		return MB_MatchText(s, slen, p, plen, locale);
 }
 
 static inline int
diff --git a/src/backend/utils/adt/like_match.c b/src/backend/utils/adt/like_match.c
index f561cc15e4c..f3ca4d30823 100644
--- a/src/backend/utils/adt/like_match.c
+++ b/src/backend/utils/adt/like_match.c
@@ -157,7 +157,9 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
 			 * the first pattern byte to each text byte to avoid recursing
 			 * more than we have to.  This fact also guarantees that we don't
 			 * have to consider a match to the zero-length substring at the
-			 * end of the text.
+			 * end of the text.  With a nondeterministic collation, we can't
+			 * rely on the first bytes being equal, so we have to recurse in
+			 * any case.
 			 */
 			if (*p == '\\')
 			{
@@ -172,7 +174,7 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
 
 			while (tlen > 0)
 			{
-				if (GETCHAR(*t, locale) == firstpat)
+				if (GETCHAR(*t, locale) == firstpat || (locale && !locale->deterministic))
 				{
 					int			matched = MatchText(t, tlen, p, plen, locale);
 
@@ -196,6 +198,130 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
 			NextByte(p, plen);
 			continue;
 		}
+		else if (locale && !locale->deterministic)
+		{
+			/*
+			 * For nondeterministic locales, we find the next substring of the
+			 * pattern that does not contain wildcards and try to find a
+			 * matching substring in the text.  Crucially, we cannot do this
+			 * character by character, as in the normal case, but must do it
+			 * substring by substring, partitioned by the wildcard characters.
+			 */
+			const char *p1;
+			size_t		p1len;
+			const char *t1;
+			size_t		t1len;
+			bool		found_escape;
+			const char *subpat;
+			size_t		subpatlen;
+			char	   *buf = NULL;
+
+			/*
+			 * Determine next substring of pattern without wildcards.  p is
+			 * the start of the subpattern, p1 is one past the last byte. Also
+			 * track if we found an escape character.
+			 */
+			p1 = p;
+			p1len = plen;
+			found_escape = false;
+			while (p1len > 0)
+			{
+				if (*p1 == '\\')
+				{
+					found_escape = true;
+					NextByte(p1, p1len);
+					if (p1len == 0)
+						ereport(ERROR,
+								(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
+								 errmsg("LIKE pattern must not end with escape character")));
+				}
+				else if (*p1 == '_' || *p1 == '%')
+					break;
+				NextByte(p1, p1len);
+			}
+
+			/*
+			 * If we found an escape character, then make an unescaped copy of
+			 * the subpattern.
+			 */
+			if (found_escape)
+			{
+				char	   *b;
+
+				b = buf = palloc(p1 - p);
+				for (const char *c = p; c < p1; c++)
+				{
+					if (*c == '\\')
+						;
+					else
+						*(b++) = *c;
+				}
+
+				subpat = buf;
+				subpatlen = b - buf;
+			}
+			else
+			{
+				subpat = p;
+				subpatlen = p1 - p;
+			}
+
+			/*
+			 * Now build a substring of the text and try to match it against
+			 * the subpattern.  t is the start of the text, t1 is one past the
+			 * last byte.  We start with a zero-length string.
+			 */
+			t1 = t;
+			t1len = tlen;
+			for (;;)
+			{
+				int			cmp;
+
+				CHECK_FOR_INTERRUPTS();
+
+				cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);
+
+				/*
+				 * If we found a match, we have to test if the rest of pattern
+				 * can match against the rest of the string.  Otherwise we
+				 * have to continue here try matching with a longer substring.
+				 * (This is similar to the recursion for the '%' wildcard
+				 * above.)
+				 *
+				 * Note that we can't just wind forward p and t and continue
+				 * with the main loop.  This would fail for example with
+				 *
+				 * U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents
+				 *
+				 * You'd find that t=\0061 matches p=\00E4, but then the rest
+				 * won't match; but t=\0061\0308 also matches p=\00E4, and
+				 * then the rest will match.
+				 */
+				if (cmp == 0)
+				{
+					int			matched = MatchText(t1, t1len, p1, p1len, locale);
+
+					if (matched == LIKE_TRUE)
+					{
+						if (buf)
+							pfree(buf);
+						return matched;
+					}
+				}
+
+				/*
+				 * Didn't match.  If we used up the whole text, then the match
+				 * fails.  Otherwise, try again with a longer substring.
+				 */
+				if (t1len == 0)
+					return LIKE_FALSE;
+				else
+					NextChar(t1, t1len);
+			}
+			if (buf)
+				pfree(buf);
+			continue;
+		}
 		else if (GETCHAR(*p, locale) != GETCHAR(*t, locale))
 		{
 			/* non-wildcard pattern char fails to match text char */
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 8b15509a3bf..ee71ca89ffd 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -272,22 +272,6 @@ match_pattern_prefix(Node *leftop,
 		return NIL;
 	patt = (Const *) rightop;
 
-	/*
-	 * Not supported if the expression collation is nondeterministic.  The
-	 * optimized equality or prefix tests use bytewise comparisons, which is
-	 * not consistent with nondeterministic collations.  The actual
-	 * pattern-matching implementation functions will later error out that
-	 * pattern-matching is not supported with nondeterministic collations. (We
-	 * could also error out here, but by doing it later we get more precise
-	 * error messages.)  (It should be possible to support at least
-	 * Pattern_Prefix_Exact, but no point as long as the actual
-	 * pattern-matching implementations don't support it.)
-	 *
-	 * expr_coll is not set for a non-collation-aware data type such as bytea.
-	 */
-	if (expr_coll && !get_collation_isdeterministic(expr_coll))
-		return NIL;
-
 	/*
 	 * Try to extract a fixed prefix from the pattern.
 	 */
@@ -404,6 +388,8 @@ match_pattern_prefix(Node *leftop,
 	{
 		if (!op_in_opfamily(eqopr, opfamily))
 			return NIL;
+		if (indexcollation != expr_coll)
+			return NIL;
 		expr = make_opclause(eqopr, BOOLOID, false,
 							 (Expr *) leftop, (Expr *) prefix,
 							 InvalidOid, indexcollation);
@@ -411,6 +397,17 @@ match_pattern_prefix(Node *leftop,
 		return result;
 	}
 
+	/*
+	 * Anything other than Pattern_Prefix_Exact is not supported if the
+	 * expression collation is nondeterministic.  The optimized equality or
+	 * prefix tests use bytewise comparisons, which is not consistent with
+	 * nondeterministic collations.
+	 *
+	 * expr_coll is not set for a non-collation-aware data type such as bytea.
+	 */
+	if (expr_coll && !get_collation_isdeterministic(expr_coll))
+		return NIL;
+
 	/*
 	 * Otherwise, we have a nonempty required prefix of the values.  Some
 	 * opclasses support prefix checks directly, otherwise we'll try to
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 6fa32ae3649..2e7928e1b35 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1274,6 +1274,30 @@ CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 NOTICE:  using standard form "und" for ICU locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
 NOTICE:  using standard form "und" for ICU locale ""
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1298,6 +1322,19 @@ SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
  2 | äbc
 (2 rows)
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+ a |  b  
+---+-----
+ 1 | äbc
+(1 row)
+
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+ a |  b  
+---+-----
+ 1 | äbc
+ 2 | äbc
+(2 rows)
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -1514,7 +1551,12 @@ SELECT x FROM test3ci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3ci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3ci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3ci WHERE x SIMILAR TO 'a%';
@@ -1632,7 +1674,12 @@ SELECT x FROM test3bpci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3bpci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3bpci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3bpci WHERE x SIMILAR TO 'a%';
@@ -1729,7 +1776,7 @@ SELECT string_to_array('ABCDEFGHI'::char(9) COLLATE case_insensitive, NULL, 'b')
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
@@ -1743,10 +1790,18 @@ SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no rows
 ---
 (0 rows)
 
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
 RESET enable_seqscan;
 -- Unicode special case: different variants of Greek lower case sigma.
 -- A naive implementation like citext that just does lower(x) =
@@ -1840,6 +1895,126 @@ SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
  1 | cote
 (1 row)
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- and in reverse:
+SELECT U&'\00E4bc' LIKE U&'\0061\0308_c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner % matches b:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner %% matches b then zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner %% matches b then zero:
+SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing _ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
+ ?column? 
+----------
+ f
+(1 row)
+
+-- trailing __ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb__' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- leading % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- leading % matches zero (with later %):
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4bc%' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing % matches zero (with previous %):
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c%' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- _ versus two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- (actually this matches because)
+SELECT U&'\0308bc' = 'bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- __ matches two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'__bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- _ matches one codepoint that forms half a grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_\0308bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- doesn't match because \00e4 doesn't match only \0308
+SELECT U&'\0061\0308bc' LIKE U&'_\00e4bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ f
+(1 row)
+
+-- escape character at end of pattern
+SELECT 'foox' LIKE 'foo\' COLLATE ignore_accents;
+ERROR:  LIKE pattern must not end with escape character
 -- foreign keys (should use collation of primary key)
 -- PK is case-sensitive, FK is case-insensitive
 CREATE TABLE test10pk (x text COLLATE case_sensitive PRIMARY KEY);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 49fa9758b40..74628ec06a2 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -514,6 +514,12 @@ CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
 
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -522,6 +528,9 @@ CREATE TABLE test6 (a int, b text);
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_det;
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -637,14 +646,14 @@ CREATE UNIQUE INDEX ON test3bpci (x);  -- error
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
 SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_sensitive;  -- ok, no rows
 SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no rows
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
 RESET enable_seqscan;
 
 -- Unicode special case: different variants of Greek lower case sigma.
@@ -687,6 +696,46 @@ CREATE TABLE test4 (a int, b text);
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE ignore_accents;  -- still case-sensitive
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+-- and in reverse:
+SELECT U&'\00E4bc' LIKE U&'\0061\0308_c' COLLATE ignore_accents;
+-- inner % matches b:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c' COLLATE ignore_accents;
+-- inner %% matches b then zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%%c' COLLATE ignore_accents;
+-- inner %% matches b then zero:
+SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
+-- trailing _ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
+-- trailing __ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb__' COLLATE ignore_accents;
+-- leading % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
+-- leading % matches zero (with later %):
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
+-- trailing % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4bc%' COLLATE ignore_accents;
+-- trailing % matches zero (with previous %):
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c%' COLLATE ignore_accents;
+-- _ versus two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents;
+-- (actually this matches because)
+SELECT U&'\0308bc' = 'bc' COLLATE ignore_accents;
+-- __ matches two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'__bc' COLLATE ignore_accents;
+-- _ matches one codepoint that forms half a grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_\0308bc' COLLATE ignore_accents;
+-- doesn't match because \00e4 doesn't match only \0308
+SELECT U&'\0061\0308bc' LIKE U&'_\00e4bc' COLLATE ignore_accents;
+-- escape character at end of pattern
+SELECT 'foox' LIKE 'foo\' COLLATE ignore_accents;
+
 -- foreign keys (should use collation of primary key)
 
 -- PK is case-sensitive, FK is case-insensitive

base-commit: db22b900244d3c51918acce44cbe5bb6f6507d32
-- 
2.47.0

#22jian he
jian.universality@gmail.com
In reply to: Peter Eisentraut (#21)
Re: Support LIKE with nondeterministic collations

On Tue, Nov 12, 2024 at 3:45 PM Peter Eisentraut <peter@eisentraut.org> wrote:

On 11.11.24 14:25, Heikki Linnakangas wrote:

Sadly the algorithm is O(n^2) with non-deterministic collations.Is there
any way this could be optimized? We make no claims on how expensive any
functions or operators are, so I suppose a slow implementation is
nevertheless better than throwing an error.

Yeah, maybe someone comes up with new ideas in the future.

/*
* Now build a substring of the text and try to match it against
* the subpattern. t is the start of the text, t1 is one past the
* last byte. We start with a zero-length string.
*/
t1 = t
t1len = tlen;
for (;;)
{
int cmp;
CHECK_FOR_INTERRUPTS();
cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);

select '.foo.' LIKE '_oo' COLLATE ign_punct;
pg_strncoll's iteration of the first 4 argument values.
oo 2 foo. 0
oo 2 foo. 1
oo 2 foo. 2
oo 2 foo. 3
oo 2 foo. 4

seems there is a shortcut/optimization.
if subpat don't have wildcard(percent sign, underscore)
then we can have less pg_strncoll calls?

minimum case to trigger error within GenericMatchText
since no related tests.
create table t1(a text collate case_insensitive, b text collate "C");
insert into t1 values ('a','a');
select a like b from t1;

at 9.7.1. LIKE section, we still don't know what "wildcard" is.
we mentioned it at 9.7.2.
maybe we can add a sentence at the end of:
<para>
If <replaceable>pattern</replaceable> does not contain percent
signs or underscores, then the pattern only represents the string
itself; in that case <function>LIKE</function> acts like the
equals operator. An underscore (<literal>_</literal>) in
<replaceable>pattern</replaceable> stands for (matches) any single
character; a percent sign (<literal>%</literal>) matches any sequence
of zero or more characters.
</para>

saying underscore and percent sign are wildcards in LIKE.
other than that, I can understand the doc.

#23Peter Eisentraut
peter@eisentraut.org
In reply to: jian he (#22)
Re: Support LIKE with nondeterministic collations

On 15.11.24 05:26, jian he wrote:

/*
* Now build a substring of the text and try to match it against
* the subpattern. t is the start of the text, t1 is one past the
* last byte. We start with a zero-length string.
*/
t1 = t
t1len = tlen;
for (;;)
{
int cmp;
CHECK_FOR_INTERRUPTS();
cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);

select '.foo.' LIKE '_oo' COLLATE ign_punct;
pg_strncoll's iteration of the first 4 argument values.
oo 2 foo. 0
oo 2 foo. 1
oo 2 foo. 2
oo 2 foo. 3
oo 2 foo. 4

seems there is a shortcut/optimization.
if subpat don't have wildcard(percent sign, underscore)
then we can have less pg_strncoll calls?

How would you do that? You need to try all combinations to find one
that matches.

minimum case to trigger error within GenericMatchText
since no related tests.
create table t1(a text collate case_insensitive, b text collate "C");
insert into t1 values ('a','a');
select a like b from t1;

This results in

ERROR: 42P22: could not determine which collation to use for LIKE
HINT: Use the COLLATE clause to set the collation explicitly.

which is the expected behavior.

at 9.7.1. LIKE section, we still don't know what "wildcard" is.
we mentioned it at 9.7.2.
maybe we can add a sentence at the end of:
<para>
If <replaceable>pattern</replaceable> does not contain percent
signs or underscores, then the pattern only represents the string
itself; in that case <function>LIKE</function> acts like the
equals operator. An underscore (<literal>_</literal>) in
<replaceable>pattern</replaceable> stands for (matches) any single
character; a percent sign (<literal>%</literal>) matches any sequence
of zero or more characters.
</para>

saying underscore and percent sign are wildcards in LIKE.
other than that, I can understand the doc.

Ok, I agree that could be clarified.

#24jian he
jian.universality@gmail.com
In reply to: Peter Eisentraut (#23)
1 attachment(s)
Re: Support LIKE with nondeterministic collations

On Fri, Nov 15, 2024 at 11:42 PM Peter Eisentraut <peter@eisentraut.org> wrote:

On 15.11.24 05:26, jian he wrote:

/*
* Now build a substring of the text and try to match it against
* the subpattern. t is the start of the text, t1 is one past the
* last byte. We start with a zero-length string.
*/
t1 = t
t1len = tlen;
for (;;)
{
int cmp;
CHECK_FOR_INTERRUPTS();
cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);

select '.foo.' LIKE '_oo' COLLATE ign_punct;
pg_strncoll's iteration of the first 4 argument values.
oo 2 foo. 0
oo 2 foo. 1
oo 2 foo. 2
oo 2 foo. 3
oo 2 foo. 4

seems there is a shortcut/optimization.
if subpat don't have wildcard(percent sign, underscore)
then we can have less pg_strncoll calls?

How would you do that? You need to try all combinations to find one
that matches.

we can optimize when trailing (last character) is not wildcards.

SELECT 'Ha12foo' LIKE '%foo' COLLATE ignore_accents;
within the for loop
for(;;)
{
int cmp;
CHECK_FOR_INTERRUPTS();
....
}

pg_strncoll comparison will become
Ha12foo foo
a12foo foo
12foo foo
2foo foo
foo foo

it's safe because in MatchText we have:
else if (*p == '%')
{
while (tlen > 0)
{
if (GETCHAR(*t, locale) == firstpat || (locale && !locale->deterministic))
{
int matched = MatchText(t, tlen, p, plen, locale);
if (matched != LIKE_FALSE)
return matched; /* TRUE or ABORT */
}
NextChar(t, tlen);
}
}

please check attached.

minimum case to trigger error within GenericMatchText
since no related tests.
create table t1(a text collate case_insensitive, b text collate "C");
insert into t1 values ('a','a');
select a like b from t1;

This results in

ERROR: 42P22: could not determine which collation to use for LIKE
HINT: Use the COLLATE clause to set the collation explicitly.

which is the expected behavior.

sorry, didn't mention it clearly, i mean we can add it to the regress test.

Attachments:

v8-0001-LIKE-with-nondeterministic-collations-no-trail.no-cfbotapplication/octet-stream; name=v8-0001-LIKE-with-nondeterministic-collations-no-trail.no-cfbotDownload
From 3f82ad436deaf81218dc4b4898f6d6edb3005c78 Mon Sep 17 00:00:00 2001
From: jian he <jian.universality@gmail.com>
Date: Mon, 18 Nov 2024 11:06:17 +0800
Subject: [PATCH v8 1/1] LIKE with nondeterministic collations no trailing wild
 cards

optimize where there is no trailing wildcards.
---
 src/backend/utils/adt/like_match.c | 38 ++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/like_match.c b/src/backend/utils/adt/like_match.c
index f3ca4d3082..dd1b136255 100644
--- a/src/backend/utils/adt/like_match.c
+++ b/src/backend/utils/adt/like_match.c
@@ -212,6 +212,7 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
 			const char *t1;
 			size_t		t1len;
 			bool		found_escape;
+			bool		trailing_wildcards = false;
 			const char *subpat;
 			size_t		subpatlen;
 			char	   *buf = NULL;
@@ -279,7 +280,33 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
 
 				CHECK_FOR_INTERRUPTS();
 
-				cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);
+				if (!trailing_wildcards)
+				{
+					/* we check subpatlen and next character is wild-card or
+					 * not. if subpatlen == 1, then we don't have to check the
+					 * next char. if we didn't found trailing_wildcards then we can
+					 * safely compare whole subpat with t. if cmp != 0 we return
+					 * LIKE_FALSE.
+					*/
+					if (subpatlen == 1 && (subpat[0] == '%' || subpat[0] == '_'))
+					{
+						trailing_wildcards = true;
+						break;
+					}
+					for (int i = 0; i <=subpatlen; i++)
+					{
+						if (subpat[i] == '%' || subpat[i] == '_')
+						{
+							trailing_wildcards = true;
+							break;
+						}
+					}
+				}
+
+				if (!trailing_wildcards)
+					cmp = pg_strncoll(subpat, subpatlen, t, tlen, locale);
+				else
+					cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);
 
 				/*
 				 * If we found a match, we have to test if the rest of pattern
@@ -299,7 +326,12 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
 				 */
 				if (cmp == 0)
 				{
-					int			matched = MatchText(t1, t1len, p1, p1len, locale);
+					int			matched;
+
+					if (!trailing_wildcards)
+						return LIKE_TRUE;
+
+					matched = MatchText(t1, t1len, p1, p1len, locale);
 
 					if (matched == LIKE_TRUE)
 					{
@@ -308,6 +340,8 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
 						return matched;
 					}
 				}
+				else if (!trailing_wildcards)
+					return LIKE_FALSE;
 
 				/*
 				 * Didn't match.  If we used up the whole text, then the match
-- 
2.34.1

#25Peter Eisentraut
peter@eisentraut.org
In reply to: jian he (#24)
1 attachment(s)
Re: Support LIKE with nondeterministic collations

On 18.11.24 04:30, jian he wrote:

we can optimize when trailing (last character) is not wildcards.

SELECT 'Ha12foo' LIKE '%foo' COLLATE ignore_accents;
within the for loop
for(;;)
{
int cmp;
CHECK_FOR_INTERRUPTS();
....
}

pg_strncoll comparison will become
Ha12foo foo
a12foo foo
12foo foo
2foo foo
foo foo

it's safe because in MatchText we have:
else if (*p == '%')
{
while (tlen > 0)
{
if (GETCHAR(*t, locale) == firstpat || (locale && !locale->deterministic))
{
int matched = MatchText(t, tlen, p, plen, locale);
if (matched != LIKE_FALSE)
return matched; /* TRUE or ABORT */
}
NextChar(t, tlen);
}
}

please check attached.

I see, good idea. I implemented it a bit differently. See "Shortcut:
If this is the end of the pattern ..." in this patch. Please check if
this is what you had in mind.

Attachments:

v8-0001-Support-LIKE-with-nondeterministic-collations.patchtext/plain; charset=UTF-8; name=v8-0001-Support-LIKE-with-nondeterministic-collations.patchDownload
From e9252c1c8ec60e7c4813490908d6ceb575840420 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Tue, 19 Nov 2024 14:47:39 +0100
Subject: [PATCH v8] Support LIKE with nondeterministic collations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This allows for example using LIKE with case-insensitive collations.
There was previously no internal implementation of this, so it was met
with a not-supported error.  This adds the internal implementation and
removes the error.  The implementation follows the specification of
the SQL standard for this.

Unlike with deterministic collations, the LIKE matching cannot go
character by character but has to go substring by substring.  For
example, if we are matching against LIKE 'foo%bar', we can't start by
looking for an 'f', then an 'o', but instead with have to find
something that matches 'foo'.  This is because the collation could
consider substrings of different lengths to be equal.  This is all
internal to MatchText() in like_match.c.

The changes in GenericMatchText() in like.c just pass through the
locale information to MatchText(), which was previously not needed.
This matches exactly Generic_Text_IC_like() below.

ILIKE is not affected.  (It's unclear whether ILIKE makes sense under
nondeterministic collations.)

This also updates match_pattern_prefix() in like_support.c to support
optimizing the case of an exact pattern with nondeterministic
collations.  This was already alluded to in the previous code.

(includes documentation examples from Daniel Vérité and test cases
from Paul A Jungwirth)

Reviewed-by: Jian He <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/700d2e86-bf75-4607-9cf2-f5b7802f6e88@eisentraut.org
---
 doc/src/sgml/charset.sgml                     |   2 +-
 doc/src/sgml/func.sgml                        |  52 ++++-
 src/backend/utils/adt/like.c                  |  26 ++-
 src/backend/utils/adt/like_match.c            | 148 +++++++++++++-
 src/backend/utils/adt/like_support.c          |  29 ++-
 .../regress/expected/collate.icu.utf8.out     | 189 +++++++++++++++++-
 src/test/regress/sql/collate.icu.utf8.sql     |  55 ++++-
 7 files changed, 457 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index f5e115e8d6e..00e1986849a 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1197,7 +1197,7 @@ <title>Nondeterministic Collations</title>
      to a performance penalty.  Note, in particular, that B-tree cannot use
      deduplication with indexes that use a nondeterministic collation.  Also,
      certain operations are not possible with nondeterministic collations,
-     such as pattern matching operations.  Therefore, they should be used
+     such as some pattern matching operations.  Therefore, they should be used
      only in cases where they are specifically wanted.
     </para>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 73979f20fff..1039984cb49 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5414,9 +5414,10 @@ <title>Pattern Matching</title>
    </caution>
 
    <para>
-    The pattern matching operators of all three kinds do not support
-    nondeterministic collations.  If required, apply a different collation to
-    the expression to work around this limitation.
+    <function>SIMILAR TO</function> and <acronym>POSIX</acronym>-style regular
+    expressions do not support nondeterministic collations.  If required, use
+    <function>LIKE</function> or apply a different collation to the expression
+    to work around this limitation.
    </para>
 
   <sect2 id="functions-like">
@@ -5462,6 +5463,46 @@ <title><function>LIKE</function></title>
 </programlisting>
    </para>
 
+   <para>
+    <function>LIKE</function> pattern matching supports nondeterministic
+    collations (see <xref linkend="collation-nondeterministic"/>), such as
+    case-insensitive collations or collations that, say, ignore punctuation.
+    So with a case-insensitive collation, one could have:
+<programlisting>
+'AbC' LIKE 'abc' COLLATE case_insensitive    <lineannotation>true</lineannotation>
+'AbC' LIKE 'a%' COLLATE case_insensitive     <lineannotation>true</lineannotation>
+</programlisting>
+    With collations that ignore certain characters or in general that consider
+    strings of different lengths equal, the semantics can become a bit more
+    complicated.  Consider these examples:
+<programlisting>
+'.foo.' LIKE 'foo' COLLATE ign_punct    <lineannotation>true</lineannotation>
+'.foo.' LIKE 'f_o' COLLATE ign_punct    <lineannotation>true</lineannotation>
+'.foo.' LIKE '_oo' COLLATE ign_punct    <lineannotation>false</lineannotation>
+</programlisting>
+    The way the matching works is that the pattern is partitioned into
+    sequences of wildcards and non-wildcard strings (wildcards being
+    <literal>_</literal> and <literal>%</literal>).  For example, the pattern
+    <literal>f_o</literal> is partitioned into <literal>f, _, o</literal>, the
+    pattern <literal>_oo</literal> is partitioned into <literal>_,
+    oo</literal>.  The input string matches the pattern if it can be
+    partitioned in such a way that the wildcards match one character or any
+    number of characters respectively and the non-wildcard partitions are
+    equal under the applicable collation.  So for example, <literal>'.foo.'
+    LIKE 'f_o' COLLATE ign_punct</literal> is true because one can partition
+    <literal>.foo.</literal> into <literal>.f, o, o.</literal>, and then
+    <literal>'.f' = 'f' COLLATE ign_punct</literal>, <literal>'o'</literal>
+    matches the <literal>_</literal> wildcard, and <literal>'o.' = 'o' COLLATE
+    ign_punct</literal>.  But <literal>'.foo.' LIKE '_oo' COLLATE
+    ign_punct</literal> is false because <literal>.foo.</literal> cannot be
+    partitioned in a way that the first character is any character and the
+    rest of the string compares equal to <literal>oo</literal>.  (Note that
+    the single-character wildcard always matches exactly one character,
+    independent of the collation.  So in this example, the
+    <literal>_</literal> would match <literal>.</literal>, but then the rest
+    of the input string won't match the rest of the pattern.)
+   </para>
+
    <para>
     <function>LIKE</function> pattern matching always covers the entire
     string.  Therefore, if it's desired to match a sequence anywhere within
@@ -5503,8 +5544,9 @@ <title><function>LIKE</function></title>
 
    <para>
     The key word <token>ILIKE</token> can be used instead of
-    <token>LIKE</token> to make the match case-insensitive according
-    to the active locale.  This is not in the <acronym>SQL</acronym> standard but is a
+    <token>LIKE</token> to make the match case-insensitive according to the
+    active locale.  (But this does not support nondeterministic collations.)
+    This is not in the <acronym>SQL</acronym> standard but is a
     <productname>PostgreSQL</productname> extension.
    </para>
 
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 0152723b2a6..7b3d1b5be71 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -147,22 +147,28 @@ SB_lower_char(unsigned char c, pg_locale_t locale)
 static inline int
 GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation)
 {
-	if (collation)
-	{
-		pg_locale_t locale = pg_newlocale_from_collation(collation);
+	pg_locale_t locale;
 
-		if (!locale->deterministic)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("nondeterministic collations are not supported for LIKE")));
+	if (!OidIsValid(collation))
+	{
+		/*
+		 * This typically means that the parser could not resolve a conflict
+		 * of implicit collations, so report it that way.
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_INDETERMINATE_COLLATION),
+				 errmsg("could not determine which collation to use for LIKE"),
+				 errhint("Use the COLLATE clause to set the collation explicitly.")));
 	}
 
+	locale = pg_newlocale_from_collation(collation);
+
 	if (pg_database_encoding_max_length() == 1)
-		return SB_MatchText(s, slen, p, plen, 0);
+		return SB_MatchText(s, slen, p, plen, locale);
 	else if (GetDatabaseEncoding() == PG_UTF8)
-		return UTF8_MatchText(s, slen, p, plen, 0);
+		return UTF8_MatchText(s, slen, p, plen, locale);
 	else
-		return MB_MatchText(s, slen, p, plen, 0);
+		return MB_MatchText(s, slen, p, plen, locale);
 }
 
 static inline int
diff --git a/src/backend/utils/adt/like_match.c b/src/backend/utils/adt/like_match.c
index f561cc15e4c..87011459c34 100644
--- a/src/backend/utils/adt/like_match.c
+++ b/src/backend/utils/adt/like_match.c
@@ -157,7 +157,9 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
 			 * the first pattern byte to each text byte to avoid recursing
 			 * more than we have to.  This fact also guarantees that we don't
 			 * have to consider a match to the zero-length substring at the
-			 * end of the text.
+			 * end of the text.  With a nondeterministic collation, we can't
+			 * rely on the first bytes being equal, so we have to recurse in
+			 * any case.
 			 */
 			if (*p == '\\')
 			{
@@ -172,7 +174,7 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
 
 			while (tlen > 0)
 			{
-				if (GETCHAR(*t, locale) == firstpat)
+				if (GETCHAR(*t, locale) == firstpat || (locale && !locale->deterministic))
 				{
 					int			matched = MatchText(t, tlen, p, plen, locale);
 
@@ -196,6 +198,148 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
 			NextByte(p, plen);
 			continue;
 		}
+		else if (locale && !locale->deterministic)
+		{
+			/*
+			 * For nondeterministic locales, we find the next substring of the
+			 * pattern that does not contain wildcards and try to find a
+			 * matching substring in the text.  Crucially, we cannot do this
+			 * character by character, as in the normal case, but must do it
+			 * substring by substring, partitioned by the wildcard characters.
+			 */
+			const char *p1;
+			size_t		p1len;
+			const char *t1;
+			size_t		t1len;
+			bool		found_escape;
+			const char *subpat;
+			size_t		subpatlen;
+			char	   *buf = NULL;
+
+			/*
+			 * Determine next substring of pattern without wildcards.  p is
+			 * the start of the subpattern, p1 is one past the last byte. Also
+			 * track if we found an escape character.
+			 */
+			p1 = p;
+			p1len = plen;
+			found_escape = false;
+			while (p1len > 0)
+			{
+				if (*p1 == '\\')
+				{
+					found_escape = true;
+					NextByte(p1, p1len);
+					if (p1len == 0)
+						ereport(ERROR,
+								(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
+								 errmsg("LIKE pattern must not end with escape character")));
+				}
+				else if (*p1 == '_' || *p1 == '%')
+					break;
+				NextByte(p1, p1len);
+			}
+
+			/*
+			 * If we found an escape character, then make an unescaped copy of
+			 * the subpattern.
+			 */
+			if (found_escape)
+			{
+				char	   *b;
+
+				b = buf = palloc(p1 - p);
+				for (const char *c = p; c < p1; c++)
+				{
+					if (*c == '\\')
+						;
+					else
+						*(b++) = *c;
+				}
+
+				subpat = buf;
+				subpatlen = b - buf;
+			}
+			else
+			{
+				subpat = p;
+				subpatlen = p1 - p;
+			}
+
+			/*
+			 * Shortcut: If this is the end of the pattern, then the rest of
+			 * the text has to match the rest of the pattern.
+			 */
+			if (p1len == 0)
+			{
+				int			cmp;
+
+				cmp = pg_strncoll(subpat, subpatlen, t, tlen, locale);
+
+				if (buf)
+					pfree(buf);
+				if (cmp == 0)
+					return LIKE_TRUE;
+				else
+					return LIKE_FALSE;
+			}
+
+			/*
+			 * Now build a substring of the text and try to match it against
+			 * the subpattern.  t is the start of the text, t1 is one past the
+			 * last byte.  We start with a zero-length string.
+			 */
+			t1 = t;
+			t1len = tlen;
+			for (;;)
+			{
+				int			cmp;
+
+				CHECK_FOR_INTERRUPTS();
+
+				cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);
+
+				/*
+				 * If we found a match, we have to test if the rest of pattern
+				 * can match against the rest of the string.  Otherwise we
+				 * have to continue here try matching with a longer substring.
+				 * (This is similar to the recursion for the '%' wildcard
+				 * above.)
+				 *
+				 * Note that we can't just wind forward p and t and continue
+				 * with the main loop.  This would fail for example with
+				 *
+				 * U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents
+				 *
+				 * You'd find that t=\0061 matches p=\00E4, but then the rest
+				 * won't match; but t=\0061\0308 also matches p=\00E4, and
+				 * then the rest will match.
+				 */
+				if (cmp == 0)
+				{
+					int			matched = MatchText(t1, t1len, p1, p1len, locale);
+
+					if (matched == LIKE_TRUE)
+					{
+						if (buf)
+							pfree(buf);
+						return matched;
+					}
+				}
+
+				/*
+				 * Didn't match.  If we used up the whole text, then the match
+				 * fails.  Otherwise, try again with a longer substring.
+				 */
+				if (t1len == 0)
+					return LIKE_FALSE;
+				else
+					NextChar(t1, t1len);
+			}
+			if (buf)
+				pfree(buf);
+			continue;
+		}
 		else if (GETCHAR(*p, locale) != GETCHAR(*t, locale))
 		{
 			/* non-wildcard pattern char fails to match text char */
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 8b15509a3bf..ee71ca89ffd 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -272,22 +272,6 @@ match_pattern_prefix(Node *leftop,
 		return NIL;
 	patt = (Const *) rightop;
 
-	/*
-	 * Not supported if the expression collation is nondeterministic.  The
-	 * optimized equality or prefix tests use bytewise comparisons, which is
-	 * not consistent with nondeterministic collations.  The actual
-	 * pattern-matching implementation functions will later error out that
-	 * pattern-matching is not supported with nondeterministic collations. (We
-	 * could also error out here, but by doing it later we get more precise
-	 * error messages.)  (It should be possible to support at least
-	 * Pattern_Prefix_Exact, but no point as long as the actual
-	 * pattern-matching implementations don't support it.)
-	 *
-	 * expr_coll is not set for a non-collation-aware data type such as bytea.
-	 */
-	if (expr_coll && !get_collation_isdeterministic(expr_coll))
-		return NIL;
-
 	/*
 	 * Try to extract a fixed prefix from the pattern.
 	 */
@@ -404,6 +388,8 @@ match_pattern_prefix(Node *leftop,
 	{
 		if (!op_in_opfamily(eqopr, opfamily))
 			return NIL;
+		if (indexcollation != expr_coll)
+			return NIL;
 		expr = make_opclause(eqopr, BOOLOID, false,
 							 (Expr *) leftop, (Expr *) prefix,
 							 InvalidOid, indexcollation);
@@ -411,6 +397,17 @@ match_pattern_prefix(Node *leftop,
 		return result;
 	}
 
+	/*
+	 * Anything other than Pattern_Prefix_Exact is not supported if the
+	 * expression collation is nondeterministic.  The optimized equality or
+	 * prefix tests use bytewise comparisons, which is not consistent with
+	 * nondeterministic collations.
+	 *
+	 * expr_coll is not set for a non-collation-aware data type such as bytea.
+	 */
+	if (expr_coll && !get_collation_isdeterministic(expr_coll))
+		return NIL;
+
 	/*
 	 * Otherwise, we have a nonempty required prefix of the values.  Some
 	 * opclasses support prefix checks directly, otherwise we'll try to
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index de17f7db6ce..79e76d4b850 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1274,6 +1274,30 @@ CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 NOTICE:  using standard form "und" for ICU locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
 NOTICE:  using standard form "und" for ICU locale ""
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1298,6 +1322,19 @@ SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
  2 | äbc
 (2 rows)
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+ a |  b  
+---+-----
+ 1 | äbc
+(1 row)
+
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+ a |  b  
+---+-----
+ 1 | äbc
+ 2 | äbc
+(2 rows)
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -1514,7 +1551,12 @@ SELECT x FROM test3ci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3ci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3ci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3ci WHERE x SIMILAR TO 'a%';
@@ -1632,7 +1674,12 @@ SELECT x FROM test3bpci WHERE x <> 'abc';
 (2 rows)
 
 SELECT x FROM test3bpci WHERE x LIKE 'a%';
-ERROR:  nondeterministic collations are not supported for LIKE
+  x  
+-----
+ abc
+ ABC
+(2 rows)
+
 SELECT x FROM test3bpci WHERE x ILIKE 'a%';
 ERROR:  nondeterministic collations are not supported for ILIKE
 SELECT x FROM test3bpci WHERE x SIMILAR TO 'a%';
@@ -1729,7 +1776,7 @@ SELECT string_to_array('ABCDEFGHI'::char(9) COLLATE case_insensitive, NULL, 'b')
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
@@ -1743,10 +1790,18 @@ SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no rows
 ---
 (0 rows)
 
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
-ERROR:  nondeterministic collations are not supported for LIKE
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
+  x  
+-----
+ abc
+(1 row)
+
 RESET enable_seqscan;
 -- Unicode special case: different variants of Greek lower case sigma.
 -- A naive implementation like citext that just does lower(x) =
@@ -1840,6 +1895,126 @@ SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
  1 | cote
 (1 row)
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- and in reverse:
+SELECT U&'\00E4bc' LIKE U&'\0061\0308_c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner % matches b:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner %% matches b then zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- inner %% matches b then zero:
+SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing _ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
+ ?column? 
+----------
+ f
+(1 row)
+
+-- trailing __ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb__' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- leading % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- leading % matches zero (with later %):
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4bc%' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- trailing % matches zero (with previous %):
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c%' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- _ versus two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- (actually this matches because)
+SELECT U&'\0308bc' = 'bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- __ matches two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'__bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- _ matches one codepoint that forms half a grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_\0308bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- doesn't match because \00e4 doesn't match only \0308
+SELECT U&'\0061\0308bc' LIKE U&'_\00e4bc' COLLATE ignore_accents;
+ ?column? 
+----------
+ f
+(1 row)
+
+-- escape character at end of pattern
+SELECT 'foox' LIKE 'foo\' COLLATE ignore_accents;
+ERROR:  LIKE pattern must not end with escape character
 -- foreign keys (mixing different nondeterministic collations not allowed)
 CREATE TABLE test10pk (x text COLLATE case_sensitive PRIMARY KEY);
 CREATE TABLE test10fk (x text COLLATE case_insensitive REFERENCES test10pk (x) ON UPDATE CASCADE ON DELETE CASCADE);  -- error
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 0c9491c260e..797e93ac714 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -514,6 +514,12 @@ CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
 
+SELECT 'abc' LIKE 'abc' COLLATE ctest_det;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_det;
+
+SELECT 'abc' LIKE 'abc' COLLATE ctest_nondet;
+SELECT 'abc' LIKE 'a\bc' COLLATE ctest_nondet;
+
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -522,6 +528,9 @@ CREATE TABLE test6 (a int, b text);
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_det;
 SELECT * FROM test6 WHERE b = 'äbc' COLLATE ctest_nondet;
 
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_det;
+SELECT * FROM test6 WHERE b LIKE 'äbc' COLLATE ctest_nondet;
+
 -- same with arrays
 CREATE TABLE test6a (a int, b text[]);
 INSERT INTO test6a VALUES (1, ARRAY[U&'\00E4bc']);
@@ -637,14 +646,14 @@ CREATE UNIQUE INDEX ON test3bpci (x);  -- error
 -- This tests the issue described in match_pattern_prefix().  In the
 -- absence of that check, the case_insensitive tests below would
 -- return no rows where they should logically return one.
-CREATE TABLE test4c (x text COLLATE "C");
+CREATE TABLE test4c (x text COLLATE case_insensitive);
 INSERT INTO test4c VALUES ('abc');
 CREATE INDEX ON test4c (x);
 SET enable_seqscan = off;
 SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_sensitive;  -- ok, no rows
 SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_sensitive;  -- ok, no rows
-SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- error
-SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- error
+SELECT x FROM test4c WHERE x LIKE 'ABC' COLLATE case_insensitive;  -- ok
+SELECT x FROM test4c WHERE x LIKE 'ABC%' COLLATE case_insensitive;  -- ok
 RESET enable_seqscan;
 
 -- Unicode special case: different variants of Greek lower case sigma.
@@ -687,6 +696,46 @@ CREATE TABLE test4 (a int, b text);
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE ignore_accents;  -- still case-sensitive
 SELECT * FROM test4 WHERE b = 'Cote' COLLATE case_insensitive;
 
+-- This is a tricky one.  A naive implementation would first test
+-- \00E4 matches \0061, which is true under ignore_accents, but then
+-- the rest of the string won't match anymore.  Therefore, the
+-- algorithm has to test whether the rest of the string matches, and
+-- if not try matching \00E4 against a longer substring like
+-- \0061\0308, which will then work out.
+SELECT U&'\0061\0308bc' LIKE U&'\00E4_c' COLLATE ignore_accents;
+-- and in reverse:
+SELECT U&'\00E4bc' LIKE U&'\0061\0308_c' COLLATE ignore_accents;
+-- inner % matches b:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c' COLLATE ignore_accents;
+-- inner %% matches b then zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%%c' COLLATE ignore_accents;
+-- inner %% matches b then zero:
+SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
+-- trailing _ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
+-- trailing __ matches two codepoints that form one grapheme:
+SELECT U&'cb\0061\0308' LIKE U&'cb__' COLLATE ignore_accents;
+-- leading % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
+-- leading % matches zero (with later %):
+SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
+-- trailing % matches zero:
+SELECT U&'\0061\0308bc' LIKE U&'\00E4bc%' COLLATE ignore_accents;
+-- trailing % matches zero (with previous %):
+SELECT U&'\0061\0308bc' LIKE U&'\00E4%c%' COLLATE ignore_accents;
+-- _ versus two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents;
+-- (actually this matches because)
+SELECT U&'\0308bc' = 'bc' COLLATE ignore_accents;
+-- __ matches two codepoints that form one grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'__bc' COLLATE ignore_accents;
+-- _ matches one codepoint that forms half a grapheme:
+SELECT U&'\0061\0308bc' LIKE U&'_\0308bc' COLLATE ignore_accents;
+-- doesn't match because \00e4 doesn't match only \0308
+SELECT U&'\0061\0308bc' LIKE U&'_\00e4bc' COLLATE ignore_accents;
+-- escape character at end of pattern
+SELECT 'foox' LIKE 'foo\' COLLATE ignore_accents;
+
 -- foreign keys (mixing different nondeterministic collations not allowed)
 CREATE TABLE test10pk (x text COLLATE case_sensitive PRIMARY KEY);
 CREATE TABLE test10fk (x text COLLATE case_insensitive REFERENCES test10pk (x) ON UPDATE CASCADE ON DELETE CASCADE);  -- error

base-commit: c1c09007e219ae68d1f8428a54baf68ccc1f8683
-- 
2.47.0

#26jian he
jian.universality@gmail.com
In reply to: Peter Eisentraut (#25)
Re: Support LIKE with nondeterministic collations

On Tue, Nov 19, 2024 at 9:51 PM Peter Eisentraut <peter@eisentraut.org> wrote:

On 18.11.24 04:30, jian he wrote:

we can optimize when trailing (last character) is not wildcards.

SELECT 'Ha12foo' LIKE '%foo' COLLATE ignore_accents;
within the for loop
for(;;)
{
int cmp;
CHECK_FOR_INTERRUPTS();
....
}

pg_strncoll comparison will become
Ha12foo foo
a12foo foo
12foo foo
2foo foo
foo foo

it's safe because in MatchText we have:
else if (*p == '%')
{
while (tlen > 0)
{
if (GETCHAR(*t, locale) == firstpat || (locale && !locale->deterministic))
{
int matched = MatchText(t, tlen, p, plen, locale);
if (matched != LIKE_FALSE)
return matched; /* TRUE or ABORT */
}
NextChar(t, tlen);
}
}

please check attached.

I see, good idea. I implemented it a bit differently. See "Shortcut:
If this is the end of the pattern ..." in this patch. Please check if
this is what you had in mind.

your implementation is far more simpler than mine.
I think I understand it.

i am trying to optimize case where pattern is begin_with like `pattern%`
but failed on case like:
SELECT U&'\0061\0308bc' LIKE U&'\00E4bc%' COLLATE ignore_accents;
basically the_string like the_pattern%. the length of the_string and
length of the_pattern
can vary, we can not just do one pg_strncoll.

in match_pattern_prefix maybe change
if (expr_coll && !get_collation_isdeterministic(expr_coll))
return NIL;
to
if (OidIsValid(expr_coll) && !get_collation_isdeterministic(expr_coll))
return NIL;

other than that, I didn't find any issue.

#27Peter Eisentraut
peter@eisentraut.org
In reply to: jian he (#26)
Re: Support LIKE with nondeterministic collations

On 20.11.24 08:29, jian he wrote:

in match_pattern_prefix maybe change
if (expr_coll && !get_collation_isdeterministic(expr_coll))
return NIL;
to
if (OidIsValid(expr_coll) && !get_collation_isdeterministic(expr_coll))
return NIL;

I left it like it was, because this was existing code that I'm just
moving around.

other than that, I didn't find any issue.

I have committed it, thanks.