[18] Unintentional behavior change in commit e9931bfb75

Started by Jeff Davisabout 1 year ago12 messages

Jeff Davis

pgsql@j-davis.com

about 1 year ago

Commit e9931bfb75 (version 18) contained an unexpected behavior change.
LOWER('I') returns:

In the locale tr_TR.iso88599 (single byte encoding):
17 18
default i ı
specified ı ı

In the locale tr_TR.utf8:
17 18
default ı ı
specified ı ı

(Look carefully to see the dotted vs dotless "i".)

The behavior is commented (commit 176d5bae1d) in formatting.c:

* ... When using the default
* collation, we apply the traditional Postgres behavior that
* forces ASCII-style treatment of I/i, but in non-default
* collations you get exactly what the collation says.
*/
for (p = result; *p; p++)
{
if (mylocale)
*p = tolower_l((unsigned char) *p, mylocale->info.lt);
else
*p = pg_tolower((unsigned char) *p);
}

That's a somewhat strange special case (along with similar ones for
INITCAP() and UPPER()) that applies to single-byte encodings with the
libc provider and the database default collation only. I assume it was
done for backwards compatibility?

My commit e9931bfb75 (version 18) unifies the code paths for default
and explicit collations, and in the process it eliminates the special
case, and always uses tolower_l() for single-byte libc (whether default
collation or not).

Should I put the special case back? If not, it could break expression
indexes on LOWER()/UPPER() after an upgrade for users with the database
default collation of tr_TR who use libc and a single-byte encoding. But
preserving the special case seems weirdly inconsistent -- surely the
results should not depend on the encoding, right?

Regards,
Jeff Davis

Tom Lane

tgl@sss.pgh.pa.us

about 1 year ago

In reply to: Jeff Davis (#1)

Re: [18] Unintentional behavior change in commit e9931bfb75

Jeff Davis <pgsql@j-davis.com> writes:

The behavior is commented (commit 176d5bae1d) in formatting.c:

* ... When using the default
* collation, we apply the traditional Postgres behavior that
* forces ASCII-style treatment of I/i, but in non-default
* collations you get exactly what the collation says.

That's a somewhat strange special case (along with similar ones for
INITCAP() and UPPER()) that applies to single-byte encodings with the
libc provider and the database default collation only. I assume it was
done for backwards compatibility?

Well, also for compatibility with our SQL parser's understanding
of identifier lowercasing.

Should I put the special case back?

I think so. It's stood for a lot of years now without field
complaints, and I'm fairly sure there *were* field complaints
before. (I think that behavior long predates 176d5bae1d, which
was just restoring the status quo ante after somebody else's
overenthusiastic application of system locale infrastructure.)

regards, tom lane

Jeff Davis

pgsql@j-davis.com

about 1 year ago

In reply to: Tom Lane (#2)

Re: [18] Unintentional behavior change in commit e9931bfb75

On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote:

Well, also for compatibility with our SQL parser's understanding
of identifier lowercasing.

But why only for single-byte encodings? And why not for ICU?

Should I put the special case back?

I think so.

Done. I put the special case back in (commit e3fa2b037c) because the
earlier commit wasn't intended to be a behavior change.

I'm still not convinced that the special case behavior is great, but a
lot of users are unaffected because they are on UTF8 anyway, so I'm
fine keeping it.

Regards,
Jeff Davis

Tom Lane

tgl@sss.pgh.pa.us

about 1 year ago

In reply to: Jeff Davis (#3)

Re: [18] Unintentional behavior change in commit e9931bfb75

Jeff Davis <pgsql@j-davis.com> writes:

On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote:

Well, also for compatibility with our SQL parser's understanding
of identifier lowercasing.

But why only for single-byte encodings? And why not for ICU?

I think the not-for-utf8 exclusion was mostly because that was
how it was before, which was probably mostly historical accident.
(I do vaguely recall that there was discussion on the point, but
I'm too tired to go look in the archives for it.)

As for ICU, that didn't exist back then, and I'm not going to
defend whether it was a good idea for that code path to fail
to reproduce this behavior.

regards, tom lane

Peter Eisentraut

peter@eisentraut.org

about 1 year ago

In reply to: Tom Lane (#2)

Re: [18] Unintentional behavior change in commit e9931bfb75

On 02.12.24 23:25, Tom Lane wrote:

Jeff Davis <pgsql@j-davis.com> writes:

The behavior is commented (commit 176d5bae1d) in formatting.c:

* ... When using the default
* collation, we apply the traditional Postgres behavior that
* forces ASCII-style treatment of I/i, but in non-default
* collations you get exactly what the collation says.

That's a somewhat strange special case (along with similar ones for
INITCAP() and UPPER()) that applies to single-byte encodings with the
libc provider and the database default collation only. I assume it was
done for backwards compatibility?

Well, also for compatibility with our SQL parser's understanding
of identifier lowercasing.

Maybe that was relevant before the "name" type got its own collation?

Should I put the special case back?

I think so. It's stood for a lot of years now without field
complaints, and I'm fairly sure there *were* field complaints
before. (I think that behavior long predates 176d5bae1d, which
was just restoring the status quo ante after somebody else's
overenthusiastic application of system locale infrastructure.)

I've been tempted several times recently to suggest that we should
remove the separate libc-with-single-byte-encoding-but-not-C-locale code
paths, because they have zero test coverage and probably about zero
users. But if those code paths actually have different semantics in
some cases, then that will be difficult. Just something to keep in mind.

Tom Lane

tgl@sss.pgh.pa.us

about 1 year ago

In reply to: Peter Eisentraut (#5)

Re: [18] Unintentional behavior change in commit e9931bfb75

Peter Eisentraut <peter@eisentraut.org> writes:

On 02.12.24 23:25, Tom Lane wrote:

Well, also for compatibility with our SQL parser's understanding
of identifier lowercasing.

Maybe that was relevant before the "name" type got its own collation?

downcase_identifier doesn't give a fig about name's collation.

regards, tom lane

Jeff Davis

pgsql@j-davis.com

about 1 year ago

In reply to: Tom Lane (#6)

Re: [18] Unintentional behavior change in commit e9931bfb75

On Wed, 2024-12-04 at 12:21 -0500, Tom Lane wrote:

Peter Eisentraut <peter@eisentraut.org> writes:

On 02.12.24 23:25, Tom Lane wrote:

Well, also for compatibility with our SQL parser's understanding
of identifier lowercasing.

Maybe that was relevant before the "name" type got its own
collation?

downcase_identifier doesn't give a fig about name's collation.

I'd like to understand better the relationship between the parser's
casefolding, object names in the catalog, the default collation, and
the default ctype.

The comment in downcase_identifier() says: "SQL99 specifies Unicode-
aware case normalization, which we don't yet have the infrastructure
for". The good news is that, with

https://commitfest.postgresql.org/51/5436/

we hopefully will have the infrastructure in place soon.

(a) Do we want to use that infrastructure?
(b) Do we want to use the default collation to provide the case
folding behavior, or do we want to always use the unicode behavior
that's built in to postgres?

I'm not quite sure where the single-byte encoding special case fits in
or how it helps. It seems like what it's really doing is locking the
database-wide ctype behavior down to either libc "C" or libc with any
non-Turkish-related locale. The reasoning behind this might answer
question (b) above, but I'm not sure which direction we should be
moving.

Thoughts?

Regards,
Jeff Davis

Noah Misch

noah@leadboat.com

9 months ago

In reply to: Jeff Davis (#3)

Re: [18] Unintentional behavior change in commit e9931bfb75

On Mon, Dec 02, 2024 at 10:24:07PM -0800, Jeff Davis wrote:

On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote:

Should I put the special case back?

I think so.

Done. I put the special case back in (commit e3fa2b037c) because the
earlier commit wasn't intended to be a behavior change.

Commit e9931bf had also removed the corresponding regex special case:

@@ -620,20 +545,6 @@ pg_wc_toupper(pg_wchar c)
return c;
case PG_REGEX_BUILTIN:
return unicode_uppercase_simple(c);
-		case PG_REGEX_LOCALE_WIDE:
-			/* force C behavior for ASCII characters, per comments above */
-			if (c <= (pg_wchar) 127)
-				return pg_ascii_toupper((unsigned char) c);
-			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return towupper((wint_t) c);

The "comments above" still exist:

* 2. In the "default" collation (which is supposed to obey LC_CTYPE):
*
* 2a. When working in UTF8 encoding, we use the <wctype.h> functions.
* This assumes that every platform uses Unicode codepoints directly
* as the wchar_t representation of Unicode. On some platforms
* wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF.
*
* 2b. In all other encodings, we use the <ctype.h> functions for pg_wchar
* values up to 255, and punt for values above that. This is 100% correct
* only in single-byte encodings such as LATINn. However, non-Unicode
* multibyte encodings are mostly Far Eastern character sets for which the
* properties being tested here aren't very relevant for higher code values
* anyway. The difficulty with using the <wctype.h> functions with
* non-Unicode multibyte encodings is that we can have no certainty that
* the platform's wchar_t representation matches what we do in pg_wchar
* conversions.
*
* 3. Here, we use the locale_t-extended forms of the <wctype.h> and <ctype.h>
* functions, under exactly the same cases as #2.
*
* There is one notable difference between cases 2 and 3: in the "default"
* collation we force ASCII letters to follow ASCII upcase/downcase rules,
* while in a non-default collation we just let the library functions do what
* they will. The case where this matters is treatment of I/i in Turkish,
* and the behavior is meant to match the upper()/lower() SQL functions.

I think the code for (2) and for "I/i in Turkish" haven't returned. Given
commit e3fa2b0 restored the v17 "I/i in Turkish" treatment for plain lower(),
the regex code likely needs a similar restoration. If not, the regex comments
would need to change to match the code.

Jeff Davis

pgsql@j-davis.com

9 months ago

In reply to: Noah Misch (#8)

1 attachment(s)

Re: [18] Unintentional behavior change in commit e9931bfb75

On Sat, 2025-04-12 at 05:34 -0700, Noah Misch wrote:

I think the code for (2) and for "I/i in Turkish" haven't returned.
Given
commit e3fa2b0 restored the v17 "I/i in Turkish" treatment for plain
lower(),
the regex code likely needs a similar restoration. If not, the regex
comments
would need to change to match the code.

Great find, thank you! I'm curious how you came about this difference,
was it through testing or code inspection?

Patch attached. I also updated the top of the comment so that it's
clear that it's referring to the libc provider specifically, and that
ICU still has an issue with non-UTF8 encodings.

Also, the force-to-ASCII-behavior special case is different for
pg_wc_tolower/uppper vs LOWER()/UPPER: the former depends only on
whether it's the default locale, whereas the latter depends on whether
it's the default locale and the encoding is single-byte. Therefore the
results in the tr_TR.UTF-8 locale for the libc provider are
inconsistent:

=> select 'i' ~* 'I', 'I' ~* 'i', lower('I') = 'i', upper('i') = 'I';
?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
t | t | f | f

That behavior goes back a long way, so I'm not suggesting that we
change it.

Regards,
Jeff Davis

Attachments:

v1-0001-Another-unintentional-behavior-change-in-commit-e.patchtext/x-patch; charset=UTF-8; name=v1-0001-Another-unintentional-behavior-change-in-commit-e.patchDownload

From e8a68f42f5802d138ba04043b25b7d42862be29d Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 14 Apr 2025 11:34:11 -0700
Subject: [PATCH v1] Another unintentional behavior change in commit
 e9931bfb75.

Reported-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250412123430.8c.nmisch@google.com
---
 src/backend/regex/regc_pg_locale.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index ed7411df83d..41b993ad773 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -21,9 +21,10 @@
 #include "utils/pg_locale.h"
 
 /*
- * To provide as much functionality as possible on a variety of platforms,
- * without going so far as to implement everything from scratch, we use
- * several implementation strategies depending on the situation:
+ * For the libc provider, to provide as much functionality as possible on a
+ * variety of platforms without going so far as to implement everything from
+ * scratch, we use several implementation strategies depending on the
+ * situation:
  *
  * 1. In C/POSIX collations, we use hard-wired code.  We can't depend on
  * the <ctype.h> functions since those will obey LC_CTYPE.  Note that these
@@ -33,8 +34,9 @@
  *
  * 2a. When working in UTF8 encoding, we use the <wctype.h> functions.
  * This assumes that every platform uses Unicode codepoints directly
- * as the wchar_t representation of Unicode.  On some platforms
- * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF.
+ * as the wchar_t representation of Unicode.  (XXX: This could be a problem
+ * for ICU in non-UTF8 encodings.)  On some platforms wchar_t is only 16 bits
+ * wide, so we have to punt for codepoints > 0xFFFF.
  *
  * 2b. In all other encodings, we use the <ctype.h> functions for pg_wchar
  * values up to 255, and punt for values above that.  This is 100% correct
@@ -562,10 +564,16 @@ pg_wc_toupper(pg_wchar c)
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return unicode_uppercase_simple(c);
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_toupper((unsigned char) c);
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return towupper_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_toupper((unsigned char) c);
 			if (c <= (pg_wchar) UCHAR_MAX)
 				return toupper_l((unsigned char) c, pg_regex_locale->info.lt);
 			return c;
@@ -590,10 +598,16 @@ pg_wc_tolower(pg_wchar c)
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return unicode_lowercase_simple(c);
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_tolower((unsigned char) c);
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return towlower_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_tolower((unsigned char) c);
 			if (c <= (pg_wchar) UCHAR_MAX)
 				return tolower_l((unsigned char) c, pg_regex_locale->info.lt);
 			return c;
-- 
2.34.1

#10

Noah Misch

noah@leadboat.com

9 months ago

In reply to: Jeff Davis (#9)

Re: [18] Unintentional behavior change in commit e9931bfb75

On Mon, Apr 14, 2025 at 12:59:50PM -0700, Jeff Davis wrote:

On Sat, 2025-04-12 at 05:34 -0700, Noah Misch wrote:

I think the code for (2) and for "I/i in Turkish" haven't returned.ï¿½
Given
commit e3fa2b0 restored the v17 "I/i in Turkish" treatment for plain
lower(),
the regex code likely needs a similar restoration.ï¿½ If not, the regex
comments
would need to change to match the code.

Great find, thank you! I'm curious how you came about this difference,
was it through testing or code inspection?

Code inspection. I saw commit e9931bf remove "per comments above" references
without editing the referenced "comments above".

--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -21,9 +21,10 @@
#include "utils/pg_locale.h"

/*
- * To provide as much functionality as possible on a variety of platforms,
- * without going so far as to implement everything from scratch, we use
- * several implementation strategies depending on the situation:
+ * For the libc provider, to provide as much functionality as possible on a
+ * variety of platforms without going so far as to implement everything from
+ * scratch, we use several implementation strategies depending on the
+ * situation:
*
* 1. In C/POSIX collations, we use hard-wired code.  We can't depend on
* the <ctype.h> functions since those will obey LC_CTYPE.  Note that these
@@ -33,8 +34,9 @@
*
* 2a. When working in UTF8 encoding, we use the <wctype.h> functions.
* This assumes that every platform uses Unicode codepoints directly
- * as the wchar_t representation of Unicode.  On some platforms
- * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF.
+ * as the wchar_t representation of Unicode.  (XXX: This could be a problem
+ * for ICU in non-UTF8 encodings.)  On some platforms wchar_t is only 16 bits
+ * wide, so we have to punt for codepoints > 0xFFFF.
*
* 2b. In all other encodings, we use the <ctype.h> functions for pg_wchar
* values up to 255, and punt for values above that.  This is 100% correct

This leaves outdated material in the big comment edited here. Most prominent:

* 2. In the "default" collation (which is supposed to obey LC_CTYPE):
...
* 3. Here, we use the locale_t-extended forms of the <wctype.h> and <ctype.h>
* functions, under exactly the same cases as #2.

v18 regc_pg_locale.c uses only the locale_t-extended forms, and it aims not to
depend on LC_CTYPE. v17 used both tolower() and tolower_l(), but v18 uses the
latter only.

@@ -562,10 +564,16 @@ pg_wc_toupper(pg_wchar c)
case PG_REGEX_STRATEGY_BUILTIN:
return unicode_uppercase_simple(c);
case PG_REGEX_STRATEGY_LIBC_WIDE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_toupper((unsigned char) c);

v17 has distinct symbols PG_REGEX_LOCALE_WIDE (default-locale case) and
PG_REGEX_LOCALE_WIDE_L (locale_t) case. Consolidating them looked attractive
when this is_default case was going away. Now that is_default will remain a
factor, I don't think overloading PG_REGEX_STRATEGY_LIBC_WIDE is working out
well. The "_L" suffix no longer captures the distinction, since there's no
case that uses tolower() as opposed to tolower_l(). However, I think v17's
concept of separate PG_REGEX_ symbols for the default-locale case is still the
right thing for v18. In other words, this code should change to look more
like v17, with the removal of non-locale_t calls being the main change.

#11

Jeff Davis

pgsql@j-davis.com

9 months ago

In reply to: Noah Misch (#10)

3 attachment(s)

Re: [18] Unintentional behavior change in commit e9931bfb75

On Mon, 2025-04-14 at 13:44 -0700, Noah Misch wrote:

v18 regc_pg_locale.c uses only the locale_t-extended forms, and it
aims not to
depend on LC_CTYPE. v17 used both tolower() and tolower_l(), but v18
uses the
latter only.

Fixed in v2-0001 by rewording the comment slightly.

However, I think v17's
concept of separate PG_REGEX_ symbols for the default-locale case is
still the
right thing for v18. In other words, this code should change to look
more
like v17, with the removal of non-locale_t calls being the main
change.

I tried that in v2-0003, but I think it ended up worse. Most
pg_wc_xyz() functions don't care if it's the default collation or not,
so there are a lot of duplicate cases.

The previous approach is still there as v2-0002.

Regrads,
Jeff Davis

Attachments:

v2-0001-Improve-comment-in-regc_pg_locale.c.patchtext/x-patch; charset=UTF-8; name=v2-0001-Improve-comment-in-regc_pg_locale.c.patchDownload

From 9724181f715ce3468e9342763fadde450db08d26 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 15 Apr 2025 15:04:37 -0700
Subject: [PATCH v2 1/3] Improve comment in regc_pg_locale.c.

Reported-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250412123430.8c.nmisch@google.com
---
 src/backend/regex/regc_pg_locale.c | 29 +++++++++++++----------------
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index ed7411df83d..31d8ba0322a 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -21,22 +21,22 @@
 #include "utils/pg_locale.h"
 
 /*
- * To provide as much functionality as possible on a variety of platforms,
- * without going so far as to implement everything from scratch, we use
- * several implementation strategies depending on the situation:
+ * For the libc provider, to provide as much functionality as possible on a
+ * variety of platforms without going so far as to implement everything from
+ * scratch, we use several implementation strategies depending on the
+ * situation:
  *
  * 1. In C/POSIX collations, we use hard-wired code.  We can't depend on
  * the <ctype.h> functions since those will obey LC_CTYPE.  Note that these
  * collations don't give a fig about multibyte characters.
  *
- * 2. In the "default" collation (which is supposed to obey LC_CTYPE):
- *
- * 2a. When working in UTF8 encoding, we use the <wctype.h> functions.
+ * 2. When working in UTF8 encoding, we use the <wctype.h> functions.
  * This assumes that every platform uses Unicode codepoints directly
- * as the wchar_t representation of Unicode.  On some platforms
+ * as the wchar_t representation of Unicode.  (XXX: ICU makes this assumption
+ * even for non-UTF8 encodings, which may be a problem.)  On some platforms
  * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF.
  *
- * 2b. In all other encodings, we use the <ctype.h> functions for pg_wchar
+ * 3. In all other encodings, we use the <ctype.h> functions for pg_wchar
  * values up to 255, and punt for values above that.  This is 100% correct
  * only in single-byte encodings such as LATINn.  However, non-Unicode
  * multibyte encodings are mostly Far Eastern character sets for which the
@@ -46,14 +46,11 @@
  * the platform's wchar_t representation matches what we do in pg_wchar
  * conversions.
  *
- * 3. Here, we use the locale_t-extended forms of the <wctype.h> and <ctype.h>
- * functions, under exactly the same cases as #2.
- *
- * There is one notable difference between cases 2 and 3: in the "default"
- * collation we force ASCII letters to follow ASCII upcase/downcase rules,
- * while in a non-default collation we just let the library functions do what
- * they will.  The case where this matters is treatment of I/i in Turkish,
- * and the behavior is meant to match the upper()/lower() SQL functions.
+ * As a special case, in the "default" collation we force ASCII letters to
+ * follow ASCII upcase/downcase rules, while in a non-default collation we
+ * just let the library functions do what they will.  The case where this
+ * matters is treatment of I/i in Turkish, and the behavior is meant to match
+ * the upper()/lower() SQL functions.
  *
  * We store the active collation setting in static variables.  In principle
  * it could be passed down to here via the regex library's "struct vars" data
-- 
2.34.1

v2-0002-Another-unintentional-behavior-change-in-commit-e.patchtext/x-patch; charset=UTF-8; name=v2-0002-Another-unintentional-behavior-change-in-commit-e.patchDownload

From d7e18174cb25fc981ba906101c3f4582fc41a826 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 15 Apr 2025 15:05:08 -0700
Subject: [PATCH v2 2/3] Another unintentional behavior change in commit
 e9931bfb75.

Reported-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250412123430.8c.nmisch@google.com
---
 src/backend/regex/regc_pg_locale.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index 31d8ba0322a..7377f459607 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -559,10 +559,16 @@ pg_wc_toupper(pg_wchar c)
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return unicode_uppercase_simple(c);
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_toupper((unsigned char) c);
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return towupper_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_toupper((unsigned char) c);
 			if (c <= (pg_wchar) UCHAR_MAX)
 				return toupper_l((unsigned char) c, pg_regex_locale->info.lt);
 			return c;
@@ -587,10 +593,16 @@ pg_wc_tolower(pg_wchar c)
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return unicode_lowercase_simple(c);
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_tolower((unsigned char) c);
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return towlower_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_tolower((unsigned char) c);
 			if (c <= (pg_wchar) UCHAR_MAX)
 				return tolower_l((unsigned char) c, pg_regex_locale->info.lt);
 			return c;
-- 
2.34.1

v2-0003-alternative-approach.patchtext/x-patch; charset=UTF-8; name=v2-0003-alternative-approach.patchDownload

From 280929ad8f70106be7bc4d59d90957ecf86a421c Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 15 Apr 2025 15:49:44 -0700
Subject: [PATCH v2 3/3] alternative approach

---
 src/backend/regex/regc_pg_locale.c | 60 +++++++++++++++++++++++++-----
 1 file changed, 50 insertions(+), 10 deletions(-)

diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index 7377f459607..ec9bb8e460b 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -64,6 +64,8 @@ typedef enum
 {
 	PG_REGEX_STRATEGY_C,		/* C locale (encoding independent) */
 	PG_REGEX_STRATEGY_BUILTIN,	/* built-in Unicode semantics */
+	PG_REGEX_STRATEGY_DEF_WIDE,	/* libc default, locale_t <wctype.h> functions */
+	PG_REGEX_STRATEGY_DEF_1BYTE,	/* libc default, locale_t <ctype.h> functions */
 	PG_REGEX_STRATEGY_LIBC_WIDE,	/* Use locale_t <wctype.h> functions */
 	PG_REGEX_STRATEGY_LIBC_1BYTE,	/* Use locale_t <ctype.h> functions */
 	PG_REGEX_STRATEGY_ICU,		/* Use ICU uchar.h functions */
@@ -285,9 +287,19 @@ pg_set_regex_collation(Oid collation)
 		{
 			Assert(locale->provider == COLLPROVIDER_LIBC);
 			if (GetDatabaseEncoding() == PG_UTF8)
-				strategy = PG_REGEX_STRATEGY_LIBC_WIDE;
+			{
+				if (locale->is_default)
+					strategy = PG_REGEX_STRATEGY_DEF_WIDE;
+				else
+					strategy = PG_REGEX_STRATEGY_LIBC_WIDE;
+			}
 			else
-				strategy = PG_REGEX_STRATEGY_LIBC_1BYTE;
+			{
+				if (locale->is_default)
+					strategy = PG_REGEX_STRATEGY_DEF_1BYTE;
+				else
+					strategy = PG_REGEX_STRATEGY_LIBC_1BYTE;
+			}
 		}
 	}
 
@@ -305,10 +317,12 @@ pg_wc_isdigit(pg_wchar c)
 					(pg_char_properties[c] & PG_ISDIGIT));
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return pg_u_isdigit(c, !pg_regex_locale->info.builtin.casemap_full);
+		case PG_REGEX_STRATEGY_DEF_WIDE:
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return iswdigit_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
+		case PG_REGEX_STRATEGY_DEF_1BYTE:
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
 			return (c <= (pg_wchar) UCHAR_MAX &&
 					isdigit_l((unsigned char) c, pg_regex_locale->info.lt));
@@ -332,10 +346,12 @@ pg_wc_isalpha(pg_wchar c)
 					(pg_char_properties[c] & PG_ISALPHA));
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return pg_u_isalpha(c);
+		case PG_REGEX_STRATEGY_DEF_WIDE:
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return iswalpha_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
+		case PG_REGEX_STRATEGY_DEF_1BYTE:
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
 			return (c <= (pg_wchar) UCHAR_MAX &&
 					isalpha_l((unsigned char) c, pg_regex_locale->info.lt));
@@ -359,10 +375,12 @@ pg_wc_isalnum(pg_wchar c)
 					(pg_char_properties[c] & PG_ISALNUM));
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return pg_u_isalnum(c, !pg_regex_locale->info.builtin.casemap_full);
+		case PG_REGEX_STRATEGY_DEF_WIDE:
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return iswalnum_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
+		case PG_REGEX_STRATEGY_DEF_1BYTE:
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
 			return (c <= (pg_wchar) UCHAR_MAX &&
 					isalnum_l((unsigned char) c, pg_regex_locale->info.lt));
@@ -395,10 +413,12 @@ pg_wc_isupper(pg_wchar c)
 					(pg_char_properties[c] & PG_ISUPPER));
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return pg_u_isupper(c);
+		case PG_REGEX_STRATEGY_DEF_WIDE:
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return iswupper_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
+		case PG_REGEX_STRATEGY_DEF_1BYTE:
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
 			return (c <= (pg_wchar) UCHAR_MAX &&
 					isupper_l((unsigned char) c, pg_regex_locale->info.lt));
@@ -422,10 +442,12 @@ pg_wc_islower(pg_wchar c)
 					(pg_char_properties[c] & PG_ISLOWER));
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return pg_u_islower(c);
+		case PG_REGEX_STRATEGY_DEF_WIDE:
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return iswlower_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
+		case PG_REGEX_STRATEGY_DEF_1BYTE:
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
 			return (c <= (pg_wchar) UCHAR_MAX &&
 					islower_l((unsigned char) c, pg_regex_locale->info.lt));
@@ -449,10 +471,12 @@ pg_wc_isgraph(pg_wchar c)
 					(pg_char_properties[c] & PG_ISGRAPH));
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return pg_u_isgraph(c);
+		case PG_REGEX_STRATEGY_DEF_WIDE:
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return iswgraph_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
+		case PG_REGEX_STRATEGY_DEF_1BYTE:
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
 			return (c <= (pg_wchar) UCHAR_MAX &&
 					isgraph_l((unsigned char) c, pg_regex_locale->info.lt));
@@ -476,10 +500,12 @@ pg_wc_isprint(pg_wchar c)
 					(pg_char_properties[c] & PG_ISPRINT));
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return pg_u_isprint(c);
+		case PG_REGEX_STRATEGY_DEF_WIDE:
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return iswprint_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
+		case PG_REGEX_STRATEGY_DEF_1BYTE:
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
 			return (c <= (pg_wchar) UCHAR_MAX &&
 					isprint_l((unsigned char) c, pg_regex_locale->info.lt));
@@ -503,10 +529,12 @@ pg_wc_ispunct(pg_wchar c)
 					(pg_char_properties[c] & PG_ISPUNCT));
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return pg_u_ispunct(c, !pg_regex_locale->info.builtin.casemap_full);
+		case PG_REGEX_STRATEGY_DEF_WIDE:
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return iswpunct_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
+		case PG_REGEX_STRATEGY_DEF_1BYTE:
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
 			return (c <= (pg_wchar) UCHAR_MAX &&
 					ispunct_l((unsigned char) c, pg_regex_locale->info.lt));
@@ -530,10 +558,12 @@ pg_wc_isspace(pg_wchar c)
 					(pg_char_properties[c] & PG_ISSPACE));
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return pg_u_isspace(c);
+		case PG_REGEX_STRATEGY_DEF_WIDE:
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return iswspace_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
+		case PG_REGEX_STRATEGY_DEF_1BYTE:
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
 			return (c <= (pg_wchar) UCHAR_MAX &&
 					isspace_l((unsigned char) c, pg_regex_locale->info.lt));
@@ -558,17 +588,21 @@ pg_wc_toupper(pg_wchar c)
 			return c;
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return unicode_uppercase_simple(c);
-		case PG_REGEX_STRATEGY_LIBC_WIDE:
+		case PG_REGEX_STRATEGY_DEF_WIDE:
 			/* force C behavior for ASCII characters, per comments above */
-			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+			if (c <= (pg_wchar) 127)
 				return pg_ascii_toupper((unsigned char) c);
+			/* FALL THRU */
+		case PG_REGEX_STRATEGY_LIBC_WIDE:
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return towupper_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
-		case PG_REGEX_STRATEGY_LIBC_1BYTE:
+		case PG_REGEX_STRATEGY_DEF_1BYTE:
 			/* force C behavior for ASCII characters, per comments above */
-			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+			if (c <= (pg_wchar) 127)
 				return pg_ascii_toupper((unsigned char) c);
+			/* FALL THRU */
+		case PG_REGEX_STRATEGY_LIBC_1BYTE:
 			if (c <= (pg_wchar) UCHAR_MAX)
 				return toupper_l((unsigned char) c, pg_regex_locale->info.lt);
 			return c;
@@ -592,17 +626,21 @@ pg_wc_tolower(pg_wchar c)
 			return c;
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return unicode_lowercase_simple(c);
-		case PG_REGEX_STRATEGY_LIBC_WIDE:
+		case PG_REGEX_STRATEGY_DEF_WIDE:
 			/* force C behavior for ASCII characters, per comments above */
-			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+			if (c <= (pg_wchar) 127)
 				return pg_ascii_tolower((unsigned char) c);
+			/* FALL THRU */
+		case PG_REGEX_STRATEGY_LIBC_WIDE:
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return towlower_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
-		case PG_REGEX_STRATEGY_LIBC_1BYTE:
+		case PG_REGEX_STRATEGY_DEF_1BYTE:
 			/* force C behavior for ASCII characters, per comments above */
-			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+			if (c <= (pg_wchar) 127)
 				return pg_ascii_tolower((unsigned char) c);
+			/* FALL THRU */
+		case PG_REGEX_STRATEGY_LIBC_1BYTE:
 			if (c <= (pg_wchar) UCHAR_MAX)
 				return tolower_l((unsigned char) c, pg_regex_locale->info.lt);
 			return c;
@@ -751,9 +789,11 @@ pg_ctype_get_cache(pg_wc_probefunc probefunc, int cclasscode)
 		case PG_REGEX_STRATEGY_BUILTIN:
 			max_chr = (pg_wchar) MAX_SIMPLE_CHR;
 			break;
+		case PG_REGEX_STRATEGY_DEF_WIDE:
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
 			max_chr = (pg_wchar) MAX_SIMPLE_CHR;
 			break;
+		case PG_REGEX_STRATEGY_DEF_1BYTE:
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
 #if MAX_SIMPLE_CHR >= UCHAR_MAX
 			max_chr = (pg_wchar) UCHAR_MAX;
-- 
2.34.1

#12

Noah Misch

noah@leadboat.com

9 months ago

In reply to: Jeff Davis (#11)

Re: [18] Unintentional behavior change in commit e9931bfb75

On Tue, Apr 15, 2025 at 04:08:55PM -0700, Jeff Davis wrote:

On Mon, 2025-04-14 at 13:44 -0700, Noah Misch wrote:

However, I think v17's
concept of separate PG_REGEX_ symbols for the default-locale case is
still the
right thing for v18.ï¿½ In other words, this code should change to look
more
like v17, with the removal of non-locale_t calls being the main
change.

I tried that in v2-0003, but I think it ended up worse. Most
pg_wc_xyz() functions don't care if it's the default collation or not,
so there are a lot of duplicate cases.

I'd likely adopt 0003, but I'm fine if you stop at 0002.

+ * As a special case, in the "default" collation we force ASCII letters to

I'd change s/we force/(2) and (3) force/ to make explicit that this isn't
specific to (3). That's all I would change in v2.