fixing tsearch locale support

Started by Peter Eisentraut about 1 year ago, 13 messages

peter_e@gmx.net

about 1 year ago

3 attachment(s)

Infamously, the tsearch locale support in
src/backend/tsearch/ts_locale.c still depends on the libc locale settings
taken from environment variables and has not caught up with pg_locale_t,
collations, ICU, and all that newer stuff. This code is used in the tsearch
facilities themselves, but also in other modules such as ltree, pg_trgm,
and unaccent.

Several of the functions are wrappers around <ctype.h> functions, like

int
t_isalpha(const char *ptr)
{
    int         clen = pg_mblen(ptr);
    wchar_t     character[WC_BUF_LEN];
    pg_locale_t mylocale = 0;   /* TODO */

    if (clen == 1 || database_ctype_is_c)
        return isalpha(TOUCHAR(ptr));

    char2wchar(character, WC_BUF_LEN, ptr, clen, mylocale);

    return iswalpha((wint_t) character[0]);
}

So this has multibyte and encoding awareness, but does not observe
locale provider or collation settings.

As an easy start toward fixing this, I think we don't even need several
of these functions.

t_isdigit() and t_isspace() are just used to parse various configuration
and data files, and surely we don't need encoding-dependent multibyte
support for parsing ASCII digits and ASCII spaces. At least, I didn't
find any indication in the documentation of these file formats that they
are supposed to support that kind of thing. So these can be replaced by
the normal isdigit() and isspace().
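
To illustrate the replacement pattern, here is a minimal standalone sketch, not code from the patches. The helper name parse_leading_number() is made up for this example. It shows the `(unsigned char)` cast that the patches use, which keeps bytes of multibyte characters from being passed to <ctype.h> functions as negative values.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: skip leading whitespace and
 * parse a run of ASCII digits.  Every byte is cast to unsigned char before
 * it reaches isspace()/isdigit(), because passing a negative char value to
 * those functions is undefined behavior.  No overflow check is done.
 */
static int
parse_leading_number(const char *s, const char **end)
{
    int         value = 0;

    assert(s != NULL);

    while (*s && isspace((unsigned char) *s))
        s++;

    while (isdigit((unsigned char) *s))
    {
        value = value * 10 + (*s - '0');
        s++;
    }

    /* Report where parsing stopped, if the caller asked for it. */
    if (end != NULL)
        *end = s;

    return value;
}
```

In the default "C" locale, a lead byte of a multibyte character is neither a digit nor a space here, so parsing simply stops at it. Note that isspace() is still locale-dependent for single-byte locales, so this sketch assumes the input format only uses ASCII whitespace.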

There is one call to t_isprint(), which is similarly used only to parse
some flags in a configuration file. From the surrounding code you can
deduce that it's only called on single-byte characters, so it can
similarly be replaced by plain isprint().

Note that pg_trgm has some compile-time options with macros such as
KEEPONLYALNUM and IGNORECASE. AFAICT, these are not documented, and the
non-default variants are not covered by any test cases. So as part of
this undertaking, I'm going to remove the non-default variants if they
are in the way of cleanup.

Attachments:

0001-Remove-t_isdigit.patch (text/plain; charset=UTF-8)

From 7abc0f2333d8045004911040c856f1522d03b050 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Mon, 2 Dec 2024 11:34:17 +0100
Subject: [PATCH 1/3] Remove t_isdigit()

---
 contrib/ltree/ltree_io.c                |  8 ++++----
 src/backend/tsearch/spell.c             |  4 ++--
 src/backend/tsearch/ts_locale.c         | 15 ---------------
 src/backend/utils/adt/tsquery.c         |  2 +-
 src/backend/utils/adt/tsvector_parser.c |  4 ++--
 src/include/tsearch/ts_locale.h         |  1 -
 6 files changed, 9 insertions(+), 25 deletions(-)

diff --git a/contrib/ltree/ltree_io.c b/contrib/ltree/ltree_io.c
index 11eefc809b2..b54a15d6c68 100644
--- a/contrib/ltree/ltree_io.c
+++ b/contrib/ltree/ltree_io.c
@@ -411,7 +411,7 @@ parse_lquery(const char *buf, struct Node *escontext)
 			case LQPRS_WAITFNUM:
 				if (t_iseq(ptr, ','))
 					state = LQPRS_WAITSNUM;
-				else if (t_isdigit(ptr))
+				else if (isdigit((unsigned char) *ptr))
 				{
 					int			low = atoi(ptr);
 
@@ -429,7 +429,7 @@ parse_lquery(const char *buf, struct Node *escontext)
 					UNCHAR;
 				break;
 			case LQPRS_WAITSNUM:
-				if (t_isdigit(ptr))
+				if (isdigit((unsigned char) *ptr))
 				{
 					int			high = atoi(ptr);
 
@@ -460,7 +460,7 @@ parse_lquery(const char *buf, struct Node *escontext)
 			case LQPRS_WAITCLOSE:
 				if (t_iseq(ptr, '}'))
 					state = LQPRS_WAITEND;
-				else if (!t_isdigit(ptr))
+				else if (!isdigit((unsigned char) *ptr))
 					UNCHAR;
 				break;
 			case LQPRS_WAITND:
@@ -471,7 +471,7 @@ parse_lquery(const char *buf, struct Node *escontext)
 				}
 				else if (t_iseq(ptr, ','))
 					state = LQPRS_WAITSNUM;
-				else if (!t_isdigit(ptr))
+				else if (!isdigit((unsigned char) *ptr))
 					UNCHAR;
 				break;
 			case LQPRS_WAITEND:
diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c
index aaedb0aa852..7800f794e84 100644
--- a/src/backend/tsearch/spell.c
+++ b/src/backend/tsearch/spell.c
@@ -390,7 +390,7 @@ getNextFlagFromString(IspellDict *Conf, const char **sflagset, char *sflag)
 				*sflagset = next;
 				while (**sflagset)
 				{
-					if (t_isdigit(*sflagset))
+					if (isdigit((unsigned char) **sflagset))
 					{
 						if (!met_comma)
 							ereport(ERROR,
@@ -1750,7 +1750,7 @@ NISortDictionary(IspellDict *Conf)
 							(errcode(ERRCODE_CONFIG_FILE_ERROR),
 							 errmsg("invalid affix alias \"%s\"",
 									Conf->Spell[i]->p.flag)));
-				if (*end != '\0' && !t_isdigit(end) && !t_isspace(end))
+				if (*end != '\0' && !isdigit((unsigned char) *end) && !t_isspace(end))
 					ereport(ERROR,
 							(errcode(ERRCODE_CONFIG_FILE_ERROR),
 							 errmsg("invalid affix alias \"%s\"",
diff --git a/src/backend/tsearch/ts_locale.c b/src/backend/tsearch/ts_locale.c
index f8367b41312..7247b8cbe8a 100644
--- a/src/backend/tsearch/ts_locale.c
+++ b/src/backend/tsearch/ts_locale.c
@@ -31,21 +31,6 @@ static void tsearch_readline_callback(void *arg);
  */
 #define WC_BUF_LEN  3
 
-int
-t_isdigit(const char *ptr)
-{
-	int			clen = pg_mblen(ptr);
-	wchar_t		character[WC_BUF_LEN];
-	pg_locale_t mylocale = 0;	/* TODO */
-
-	if (clen == 1 || database_ctype_is_c)
-		return isdigit(TOUCHAR(ptr));
-
-	char2wchar(character, WC_BUF_LEN, ptr, clen, mylocale);
-
-	return iswdigit((wint_t) character[0]);
-}
-
 int
 t_isspace(const char *ptr)
 {
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 6f532188392..219ab543f62 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -197,7 +197,7 @@ parse_phrase_operator(TSQueryParserState pstate, int16 *distance)
 					continue;
 				}
 
-				if (!t_isdigit(ptr))
+				if (!isdigit((unsigned char) *ptr))
 					return false;
 
 				errno = 0;
diff --git a/src/backend/utils/adt/tsvector_parser.c b/src/backend/utils/adt/tsvector_parser.c
index ea961bb8a4a..9e33de0bde7 100644
--- a/src/backend/utils/adt/tsvector_parser.c
+++ b/src/backend/utils/adt/tsvector_parser.c
@@ -317,7 +317,7 @@ gettoken_tsvector(TSVectorParseState state,
 		}
 		else if (statecode == INPOSINFO)
 		{
-			if (t_isdigit(state->prsbuf))
+			if (isdigit((unsigned char) *state->prsbuf))
 			{
 				if (posalen == 0)
 				{
@@ -375,7 +375,7 @@ gettoken_tsvector(TSVectorParseState state,
 			else if (t_isspace(state->prsbuf) ||
 					 *(state->prsbuf) == '\0')
 				RETURN_TOKEN;
-			else if (!t_isdigit(state->prsbuf))
+			else if (!isdigit((unsigned char) *state->prsbuf))
 				PRSSYNTAXERROR;
 		}
 		else					/* internal error */
diff --git a/src/include/tsearch/ts_locale.h b/src/include/tsearch/ts_locale.h
index abc21a7ebea..8ef380791fe 100644
--- a/src/include/tsearch/ts_locale.h
+++ b/src/include/tsearch/ts_locale.h
@@ -39,7 +39,6 @@ typedef struct
 
 #define COPYCHAR(d,s)	memcpy(d, s, pg_mblen(s))
 
-extern int	t_isdigit(const char *ptr);
 extern int	t_isspace(const char *ptr);
 extern int	t_isalpha(const char *ptr);
 extern int	t_isalnum(const char *ptr);
-- 
2.47.1

0002-Remove-t_isspace.patch (text/plain; charset=UTF-8)

From 662b08d3a7733e69ff31cb7ceab1530a9412962f Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Mon, 2 Dec 2024 11:34:17 +0100
Subject: [PATCH 2/3] Remove t_isspace()

---
 contrib/dict_xsyn/dict_xsyn.c           |  4 +--
 contrib/ltree/ltxtquery_io.c            |  2 +-
 contrib/pg_trgm/trgm.h                  |  6 ----
 contrib/unaccent/unaccent.c             |  2 +-
 src/backend/tsearch/dict_synonym.c      |  4 +--
 src/backend/tsearch/dict_thesaurus.c    | 10 +++---
 src/backend/tsearch/spell.c             | 42 ++++++++++++-------------
 src/backend/tsearch/ts_locale.c         | 15 ---------
 src/backend/tsearch/ts_utils.c          |  2 +-
 src/backend/utils/adt/tsquery.c         | 10 +++---
 src/backend/utils/adt/tsvector_parser.c |  6 ++--
 src/include/tsearch/ts_locale.h         |  1 -
 12 files changed, 41 insertions(+), 63 deletions(-)

diff --git a/contrib/dict_xsyn/dict_xsyn.c b/contrib/dict_xsyn/dict_xsyn.c
index 3635ed1df84..f8c0a5bf5c5 100644
--- a/contrib/dict_xsyn/dict_xsyn.c
+++ b/contrib/dict_xsyn/dict_xsyn.c
@@ -48,14 +48,14 @@ find_word(char *in, char **end)
 	char	   *start;
 
 	*end = NULL;
-	while (*in && t_isspace(in))
+	while (*in && isspace((unsigned char) *in))
 		in += pg_mblen(in);
 
 	if (!*in || *in == '#')
 		return NULL;
 	start = in;
 
-	while (*in && !t_isspace(in))
+	while (*in && !isspace((unsigned char) *in))
 		in += pg_mblen(in);
 
 	*end = in;
diff --git a/contrib/ltree/ltxtquery_io.c b/contrib/ltree/ltxtquery_io.c
index 121fc55e469..7b8fba17ff2 100644
--- a/contrib/ltree/ltxtquery_io.c
+++ b/contrib/ltree/ltxtquery_io.c
@@ -88,7 +88,7 @@ gettoken_query(QPRS_STATE *state, int32 *val, int32 *lenval, char **strval, uint
 					*lenval = charlen;
 					*flag = 0;
 				}
-				else if (!t_isspace(state->buf))
+				else if (!isspace((unsigned char) *state->buf))
 					ereturn(state->escontext, ERR,
 							(errcode(ERRCODE_SYNTAX_ERROR),
 							 errmsg("operand syntax error")));
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index afb0adb222b..10827563694 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -15,7 +15,6 @@
  */
 #define LPADDING		2
 #define RPADDING		1
-#define KEEPONLYALNUM
 /*
  * Caution: IGNORECASE macro means that trigrams are case-insensitive.
  * If this macro is disabled, the ~* and ~~* operators must be removed from
@@ -51,13 +50,8 @@ typedef char trgm[3];
 	*(((char*)(a))+2) = *(((char*)(b))+2);	\
 } while(0)
 
-#ifdef KEEPONLYALNUM
 #define ISWORDCHR(c)	(t_isalnum(c))
 #define ISPRINTABLECHAR(a)	( isascii( *(unsigned char*)(a) ) && (isalnum( *(unsigned char*)(a) ) || *(unsigned char*)(a)==' ') )
-#else
-#define ISWORDCHR(c)	(!t_isspace(c))
-#define ISPRINTABLECHAR(a)	( isascii( *(unsigned char*)(a) ) && isprint( *(unsigned char*)(a) ) )
-#endif
 #define ISPRINTABLETRGM(t)	( ISPRINTABLECHAR( ((char*)(t)) ) && ISPRINTABLECHAR( ((char*)(t))+1 ) && ISPRINTABLECHAR( ((char*)(t))+2 ) )
 
 #define ISESCAPECHAR(x) (*(x) == '\\')	/* Wildcard escape character */
diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c
index 0217696aac1..fcc25dc7139 100644
--- a/contrib/unaccent/unaccent.c
+++ b/contrib/unaccent/unaccent.c
@@ -155,7 +155,7 @@ initTrie(const char *filename)
 				{
 					ptrlen = pg_mblen(ptr);
 					/* ignore whitespace, but end src or trg */
-					if (t_isspace(ptr))
+					if (isspace((unsigned char) *ptr))
 					{
 						if (state == 1)
 							state = 2;
diff --git a/src/backend/tsearch/dict_synonym.c b/src/backend/tsearch/dict_synonym.c
index 77cd511ee51..77c0d7a3593 100644
--- a/src/backend/tsearch/dict_synonym.c
+++ b/src/backend/tsearch/dict_synonym.c
@@ -47,7 +47,7 @@ findwrd(char *in, char **end, uint16 *flags)
 	char	   *lastchar;
 
 	/* Skip leading spaces */
-	while (*in && t_isspace(in))
+	while (*in && isspace((unsigned char) *in))
 		in += pg_mblen(in);
 
 	/* Return NULL on empty lines */
@@ -60,7 +60,7 @@ findwrd(char *in, char **end, uint16 *flags)
 	lastchar = start = in;
 
 	/* Find end of word */
-	while (*in && !t_isspace(in))
+	while (*in && !isspace((unsigned char) *in))
 	{
 		lastchar = in;
 		in += pg_mblen(in);
diff --git a/src/backend/tsearch/dict_thesaurus.c b/src/backend/tsearch/dict_thesaurus.c
index 6b159f9f569..f1449b5607f 100644
--- a/src/backend/tsearch/dict_thesaurus.c
+++ b/src/backend/tsearch/dict_thesaurus.c
@@ -190,7 +190,7 @@ thesaurusRead(const char *filename, DictThesaurus *d)
 		ptr = line;
 
 		/* is it a comment? */
-		while (*ptr && t_isspace(ptr))
+		while (*ptr && isspace((unsigned char) *ptr))
 			ptr += pg_mblen(ptr);
 
 		if (t_iseq(ptr, '#') || *ptr == '\0' ||
@@ -212,7 +212,7 @@ thesaurusRead(const char *filename, DictThesaurus *d)
 								 errmsg("unexpected delimiter")));
 					state = TR_WAITSUBS;
 				}
-				else if (!t_isspace(ptr))
+				else if (!isspace((unsigned char) *ptr))
 				{
 					beginwrd = ptr;
 					state = TR_INLEX;
@@ -225,7 +225,7 @@ thesaurusRead(const char *filename, DictThesaurus *d)
 					newLexeme(d, beginwrd, ptr, idsubst, posinsubst++);
 					state = TR_WAITSUBS;
 				}
-				else if (t_isspace(ptr))
+				else if (isspace((unsigned char) *ptr))
 				{
 					newLexeme(d, beginwrd, ptr, idsubst, posinsubst++);
 					state = TR_WAITLEX;
@@ -245,7 +245,7 @@ thesaurusRead(const char *filename, DictThesaurus *d)
 					state = TR_INSUBS;
 					beginwrd = ptr + pg_mblen(ptr);
 				}
-				else if (!t_isspace(ptr))
+				else if (!isspace((unsigned char) *ptr))
 				{
 					useasis = false;
 					beginwrd = ptr;
@@ -254,7 +254,7 @@ thesaurusRead(const char *filename, DictThesaurus *d)
 			}
 			else if (state == TR_INSUBS)
 			{
-				if (t_isspace(ptr))
+				if (isspace((unsigned char) *ptr))
 				{
 					if (ptr == beginwrd)
 						ereport(ERROR,
diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c
index 7800f794e84..b41afbd7322 100644
--- a/src/backend/tsearch/spell.c
+++ b/src/backend/tsearch/spell.c
@@ -408,7 +408,7 @@ getNextFlagFromString(IspellDict *Conf, const char **sflagset, char *sflag)
 											*sflagset)));
 						met_comma = true;
 					}
-					else if (!t_isspace(*sflagset))
+					else if (!isspace((unsigned char) **sflagset))
 					{
 						ereport(ERROR,
 								(errcode(ERRCODE_CONFIG_FILE_ERROR),
@@ -542,7 +542,7 @@ NIImportDictionary(IspellDict *Conf, const char *filename)
 			while (*s)
 			{
 				/* we allow only single encoded flags for faster works */
-				if (pg_mblen(s) == 1 && t_isprint(s) && !t_isspace(s))
+				if (pg_mblen(s) == 1 && t_isprint(s) && !isspace((unsigned char) *s))
 					s++;
 				else
 				{
@@ -558,7 +558,7 @@ NIImportDictionary(IspellDict *Conf, const char *filename)
 		s = line;
 		while (*s)
 		{
-			if (t_isspace(s))
+			if (isspace((unsigned char) *s))
 			{
 				*s = '\0';
 				break;
@@ -799,7 +799,7 @@ get_nextfield(char **str, char *next)
 		{
 			if (t_iseq(*str, '#'))
 				return false;
-			else if (!t_isspace(*str))
+			else if (!isspace((unsigned char) **str))
 			{
 				int			clen = pg_mblen(*str);
 
@@ -814,7 +814,7 @@ get_nextfield(char **str, char *next)
 		}
 		else					/* state == PAE_INMASK */
 		{
-			if (t_isspace(*str))
+			if (isspace((unsigned char) **str))
 			{
 				*next = '\0';
 				return true;
@@ -925,7 +925,7 @@ parse_affentry(char *str, char *mask, char *find, char *repl)
 		{
 			if (t_iseq(str, '#'))
 				return false;
-			else if (!t_isspace(str))
+			else if (!isspace((unsigned char) *str))
 			{
 				COPYCHAR(pmask, str);
 				pmask += pg_mblen(str);
@@ -939,7 +939,7 @@ parse_affentry(char *str, char *mask, char *find, char *repl)
 				*pmask = '\0';
 				state = PAE_WAIT_FIND;
 			}
-			else if (!t_isspace(str))
+			else if (!isspace((unsigned char) *str))
 			{
 				COPYCHAR(pmask, str);
 				pmask += pg_mblen(str);
@@ -957,7 +957,7 @@ parse_affentry(char *str, char *mask, char *find, char *repl)
 				prepl += pg_mblen(str);
 				state = PAE_INREPL;
 			}
-			else if (!t_isspace(str))
+			else if (!isspace((unsigned char) *str))
 				ereport(ERROR,
 						(errcode(ERRCODE_CONFIG_FILE_ERROR),
 						 errmsg("syntax error")));
@@ -974,7 +974,7 @@ parse_affentry(char *str, char *mask, char *find, char *repl)
 				COPYCHAR(pfind, str);
 				pfind += pg_mblen(str);
 			}
-			else if (!t_isspace(str))
+			else if (!isspace((unsigned char) *str))
 				ereport(ERROR,
 						(errcode(ERRCODE_CONFIG_FILE_ERROR),
 						 errmsg("syntax error")));
@@ -991,7 +991,7 @@ parse_affentry(char *str, char *mask, char *find, char *repl)
 				prepl += pg_mblen(str);
 				state = PAE_INREPL;
 			}
-			else if (!t_isspace(str))
+			else if (!isspace((unsigned char) *str))
 				ereport(ERROR,
 						(errcode(ERRCODE_CONFIG_FILE_ERROR),
 						 errmsg("syntax error")));
@@ -1008,7 +1008,7 @@ parse_affentry(char *str, char *mask, char *find, char *repl)
 				COPYCHAR(prepl, str);
 				prepl += pg_mblen(str);
 			}
-			else if (!t_isspace(str))
+			else if (!isspace((unsigned char) *str))
 				ereport(ERROR,
 						(errcode(ERRCODE_CONFIG_FILE_ERROR),
 						 errmsg("syntax error")));
@@ -1070,7 +1070,7 @@ addCompoundAffixFlagValue(IspellDict *Conf, char *s, uint32 val)
 	char	   *sflag;
 	int			clen;
 
-	while (*s && t_isspace(s))
+	while (*s && isspace((unsigned char) *s))
 		s += pg_mblen(s);
 
 	if (!*s)
@@ -1080,7 +1080,7 @@ addCompoundAffixFlagValue(IspellDict *Conf, char *s, uint32 val)
 
 	/* Get flag without \n */
 	sflag = sbuf;
-	while (*s && !t_isspace(s) && *s != '\n')
+	while (*s && !isspace((unsigned char) *s) && *s != '\n')
 	{
 		clen = pg_mblen(s);
 		COPYCHAR(sflag, s);
@@ -1225,7 +1225,7 @@ NIImportOOAffixes(IspellDict *Conf, const char *filename)
 
 	while ((recoded = tsearch_readline(&trst)) != NULL)
 	{
-		if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
+		if (*recoded == '\0' || isspace((unsigned char) *recoded) || t_iseq(recoded, '#'))
 		{
 			pfree(recoded);
 			continue;
@@ -1262,7 +1262,7 @@ NIImportOOAffixes(IspellDict *Conf, const char *filename)
 		{
 			char	   *s = recoded + strlen("FLAG");
 
-			while (*s && t_isspace(s))
+			while (*s && isspace((unsigned char) *s))
 				s += pg_mblen(s);
 
 			if (*s)
@@ -1298,7 +1298,7 @@ NIImportOOAffixes(IspellDict *Conf, const char *filename)
 	{
 		int			fields_read;
 
-		if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
+		if (*recoded == '\0' || isspace((unsigned char) *recoded) || t_iseq(recoded, '#'))
 			goto nextline;
 
 		fields_read = parse_ooaffentry(recoded, type, sflag, find, repl, mask);
@@ -1461,9 +1461,9 @@ NIImportAffixes(IspellDict *Conf, const char *filename)
 			s = findchar2(recoded, 'l', 'L');
 			if (s)
 			{
-				while (*s && !t_isspace(s))
+				while (*s && !isspace((unsigned char) *s))
 					s += pg_mblen(s);
-				while (*s && t_isspace(s))
+				while (*s && isspace((unsigned char) *s))
 					s += pg_mblen(s);
 
 				if (*s && pg_mblen(s) == 1)
@@ -1494,7 +1494,7 @@ NIImportAffixes(IspellDict *Conf, const char *filename)
 			s = recoded + 4;	/* we need non-lowercased string */
 			flagflags = 0;
 
-			while (*s && t_isspace(s))
+			while (*s && isspace((unsigned char) *s))
 				s += pg_mblen(s);
 
 			if (*s == '*')
@@ -1523,7 +1523,7 @@ NIImportAffixes(IspellDict *Conf, const char *filename)
 
 				s++;
 				if (*s == '\0' || *s == '#' || *s == '\n' || *s == ':' ||
-					t_isspace(s))
+					isspace((unsigned char) *s))
 				{
 					oldformat = true;
 					goto nextline;
@@ -1750,7 +1750,7 @@ NISortDictionary(IspellDict *Conf)
 							(errcode(ERRCODE_CONFIG_FILE_ERROR),
 							 errmsg("invalid affix alias \"%s\"",
 									Conf->Spell[i]->p.flag)));
-				if (*end != '\0' && !isdigit((unsigned char) *end) && !t_isspace(end))
+				if (*end != '\0' && !isdigit((unsigned char) *end) && !isspace((unsigned char) *end))
 					ereport(ERROR,
 							(errcode(ERRCODE_CONFIG_FILE_ERROR),
 							 errmsg("invalid affix alias \"%s\"",
diff --git a/src/backend/tsearch/ts_locale.c b/src/backend/tsearch/ts_locale.c
index 7247b8cbe8a..70a39f48814 100644
--- a/src/backend/tsearch/ts_locale.c
+++ b/src/backend/tsearch/ts_locale.c
@@ -31,21 +31,6 @@ static void tsearch_readline_callback(void *arg);
  */
 #define WC_BUF_LEN  3
 
-int
-t_isspace(const char *ptr)
-{
-	int			clen = pg_mblen(ptr);
-	wchar_t		character[WC_BUF_LEN];
-	pg_locale_t mylocale = 0;	/* TODO */
-
-	if (clen == 1 || database_ctype_is_c)
-		return isspace(TOUCHAR(ptr));
-
-	char2wchar(character, WC_BUF_LEN, ptr, clen, mylocale);
-
-	return iswspace((wint_t) character[0]);
-}
-
 int
 t_isalpha(const char *ptr)
 {
diff --git a/src/backend/tsearch/ts_utils.c b/src/backend/tsearch/ts_utils.c
index 81967d29e9a..f20e61d4c8c 100644
--- a/src/backend/tsearch/ts_utils.c
+++ b/src/backend/tsearch/ts_utils.c
@@ -88,7 +88,7 @@ readstoplist(const char *fname, StopList *s, char *(*wordop) (const char *))
 			char	   *pbuf = line;
 
 			/* Trim trailing space */
-			while (*pbuf && !t_isspace(pbuf))
+			while (*pbuf && !isspace((unsigned char) *pbuf))
 				pbuf += pg_mblen(pbuf);
 			*pbuf = '\0';
 
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 219ab543f62..0366c2a2acd 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -274,7 +274,7 @@ parse_or_operator(TSQueryParserState pstate)
 		 * So we still treat OR literal as operation with possibly incorrect
 		 * operand and will not search it as lexeme
 		 */
-		if (!t_isspace(ptr))
+		if (!isspace((unsigned char) *ptr))
 			break;
 	}
 
@@ -315,7 +315,7 @@ gettoken_query_standard(TSQueryParserState state, int8 *operator,
 					/* generic syntax error message is fine */
 					return PT_ERR;
 				}
-				else if (!t_isspace(state->buf))
+				else if (!isspace((unsigned char) *state->buf))
 				{
 					/*
 					 * We rely on the tsvector parser to parse the value for
@@ -383,7 +383,7 @@ gettoken_query_standard(TSQueryParserState state, int8 *operator,
 				{
 					return (state->count) ? PT_ERR : PT_END;
 				}
-				else if (!t_isspace(state->buf))
+				else if (!isspace((unsigned char) *state->buf))
 				{
 					return PT_ERR;
 				}
@@ -444,7 +444,7 @@ gettoken_query_websearch(TSQueryParserState state, int8 *operator,
 					state->state = WAITOPERAND;
 					continue;
 				}
-				else if (!t_isspace(state->buf))
+				else if (!isspace((unsigned char) *state->buf))
 				{
 					/*
 					 * We rely on the tsvector parser to parse the value for
@@ -492,7 +492,7 @@ gettoken_query_websearch(TSQueryParserState state, int8 *operator,
 					state->buf++;
 					continue;
 				}
-				else if (!t_isspace(state->buf))
+				else if (!isspace((unsigned char) *state->buf))
 				{
 					/* insert implicit AND between operands */
 					state->state = WAITOPERAND;
diff --git a/src/backend/utils/adt/tsvector_parser.c b/src/backend/utils/adt/tsvector_parser.c
index 9e33de0bde7..750a1e8e8d9 100644
--- a/src/backend/utils/adt/tsvector_parser.c
+++ b/src/backend/utils/adt/tsvector_parser.c
@@ -206,7 +206,7 @@ gettoken_tsvector(TSVectorParseState state,
 			else if ((state->oprisdelim && ISOPERATOR(state->prsbuf)) ||
 					 (state->is_web && t_iseq(state->prsbuf, '"')))
 				PRSSYNTAXERROR;
-			else if (!t_isspace(state->prsbuf))
+			else if (!isspace((unsigned char) *state->prsbuf))
 			{
 				COPYCHAR(curpos, state->prsbuf);
 				curpos += pg_mblen(state->prsbuf);
@@ -236,7 +236,7 @@ gettoken_tsvector(TSVectorParseState state,
 				statecode = WAITNEXTCHAR;
 				oldstate = WAITENDWORD;
 			}
-			else if (t_isspace(state->prsbuf) || *(state->prsbuf) == '\0' ||
+			else if (isspace((unsigned char) *state->prsbuf) || *(state->prsbuf) == '\0' ||
 					 (state->oprisdelim && ISOPERATOR(state->prsbuf)) ||
 					 (state->is_web && t_iseq(state->prsbuf, '"')))
 			{
@@ -372,7 +372,7 @@ gettoken_tsvector(TSVectorParseState state,
 					PRSSYNTAXERROR;
 				WEP_SETWEIGHT(pos[npos - 1], 0);
 			}
-			else if (t_isspace(state->prsbuf) ||
+			else if (isspace((unsigned char) *state->prsbuf) ||
 					 *(state->prsbuf) == '\0')
 				RETURN_TOKEN;
 			else if (!isdigit((unsigned char) *state->prsbuf))
diff --git a/src/include/tsearch/ts_locale.h b/src/include/tsearch/ts_locale.h
index 8ef380791fe..9606bb30983 100644
--- a/src/include/tsearch/ts_locale.h
+++ b/src/include/tsearch/ts_locale.h
@@ -39,7 +39,6 @@ typedef struct
 
 #define COPYCHAR(d,s)	memcpy(d, s, pg_mblen(s))
 
-extern int	t_isspace(const char *ptr);
 extern int	t_isalpha(const char *ptr);
 extern int	t_isalnum(const char *ptr);
 extern int	t_isprint(const char *ptr);
-- 
2.47.1

0003-Remove-t_isprint.patch (text/plain; charset=UTF-8)

From 61ea9c9738659e120700afc18c43ec19813f328b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Mon, 2 Dec 2024 11:34:17 +0100
Subject: [PATCH 3/3] Remove t_isprint()

---
 src/backend/tsearch/spell.c     |  2 +-
 src/backend/tsearch/ts_locale.c | 15 ---------------
 src/include/tsearch/ts_locale.h |  1 -
 3 files changed, 1 insertion(+), 17 deletions(-)

diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c
index b41afbd7322..7eca1714e9b 100644
--- a/src/backend/tsearch/spell.c
+++ b/src/backend/tsearch/spell.c
@@ -542,7 +542,7 @@ NIImportDictionary(IspellDict *Conf, const char *filename)
 			while (*s)
 			{
 				/* we allow only single encoded flags for faster works */
-				if (pg_mblen(s) == 1 && t_isprint(s) && !isspace((unsigned char) *s))
+				if (pg_mblen(s) == 1 && isprint((unsigned char) *s) && !isspace((unsigned char) *s))
 					s++;
 				else
 				{
diff --git a/src/backend/tsearch/ts_locale.c b/src/backend/tsearch/ts_locale.c
index 70a39f48814..a61fd36022e 100644
--- a/src/backend/tsearch/ts_locale.c
+++ b/src/backend/tsearch/ts_locale.c
@@ -61,21 +61,6 @@ t_isalnum(const char *ptr)
 	return iswalnum((wint_t) character[0]);
 }
 
-int
-t_isprint(const char *ptr)
-{
-	int			clen = pg_mblen(ptr);
-	wchar_t		character[WC_BUF_LEN];
-	pg_locale_t mylocale = 0;	/* TODO */
-
-	if (clen == 1 || database_ctype_is_c)
-		return isprint(TOUCHAR(ptr));
-
-	char2wchar(character, WC_BUF_LEN, ptr, clen, mylocale);
-
-	return iswprint((wint_t) character[0]);
-}
-
 
 /*
  * Set up to read a file using tsearch_readline().  This facility is
diff --git a/src/include/tsearch/ts_locale.h b/src/include/tsearch/ts_locale.h
index 9606bb30983..71e1f78fa36 100644
--- a/src/include/tsearch/ts_locale.h
+++ b/src/include/tsearch/ts_locale.h
@@ -41,7 +41,6 @@ typedef struct
 
 extern int	t_isalpha(const char *ptr);
 extern int	t_isalnum(const char *ptr);
-extern int	t_isprint(const char *ptr);
 
 extern char *lowerstr(const char *str);
 extern char *lowerstr_with_len(const char *str, int len);
-- 
2.47.1

Peter Eisentraut

peter_e@gmx.net

about 1 year ago

In reply to: Peter Eisentraut (#1)

4 attachment(s)

Re: fixing tsearch locale support

I have expanded this patch set. The first three patches are the same as
before. I have added a new patch that gets rid of lowerstr() from
ts_locale.c and replaces it with the standard str_tolower() that
everyone else is using.

lowerstr() and lowerstr_with_len() in ts_locale.c do the same thing as
str_tolower(), except that the former don't use the common locale
provider framework but instead use the global libc locale settings.

This patch replaces uses of lowerstr*() with str_tolower(...,
DEFAULT_COLLATION_OID). For instances that use a libc locale globally,
this will result in exactly the same behavior. For instances that use
other locale providers, you now get consistent behavior and are no
longer dependent on the libc locale settings.
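
The difference between the two approaches can be sketched outside PostgreSQL with plain POSIX.1-2008 calls. This is only an analogy, not how str_tolower() is implemented: the function names are made up, the conversion is byte-wise (so it is only meaningful for ASCII or single-byte encodings), and the explicit locale_t plays the role that the collation argument plays in str_tolower().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical illustration: lowercase a buffer in place using an explicit
 * locale object.  The result does not depend on what setlocale() or the
 * environment has installed as the process-wide locale.
 */
static void
lower_with_locale(char *buf, size_t len, locale_t loc)
{
    assert(buf != NULL);

    for (size_t i = 0; i < len; i++)
        buf[i] = (char) tolower_l((unsigned char) buf[i], loc);
}

/*
 * Convenience wrapper: build a "C" locale object, lowercase the
 * NUL-terminated string with it, and free the object again.
 * Returns 0 on success, -1 if the locale could not be created.
 */
static int
lower_in_c_locale(char *str)
{
    locale_t    loc = newlocale(LC_CTYPE_MASK, "C", (locale_t) 0);

    if (loc == (locale_t) 0)
        return -1;

    lower_with_locale(str, strlen(str), loc);
    freelocale(loc);
    return 0;
}
```

A caller that used plain tolower() instead would pick up whatever locale the process happens to run under, which is the dependency the patch removes from the tsearch code.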

Most uses of these functions are for processing dictionary and
configuration files. In those cases, using the default collation seems
appropriate. At least we don't have a more specific collation
available. But the code in contrib/pg_trgm should really depend on the
collation of the columns being processed. This is not done here; it
can be done in a separate patch.

(You can probably construct some edge cases where this change would
create some locale-related upgrade incompatibility, for example if
before you used a combination of ICU and a differently-behaving libc
locale. We can document this in the release notes, but I don't think
there is anything more we can do about this.)

After these patches, the only problematic things left in ts_locale.{c,h} are

extern int t_isalpha(const char *ptr);
extern int t_isalnum(const char *ptr);

My current assessment is that these are best addressed after patch [0],
because we need locale-provider-aware character classification functions.

[0]: /messages/by-id/2830211e1b6e6a2e26d845780b03e125281ea17b.camel@j-davis.com


On 02.12.24 11:57, Peter Eisentraut wrote:

Infamously, the tsearch locale support in src/backend/tsearch/
ts_locale.c still depends on libc environment variable locale settings
and is not caught up with pg_locale_t, collations, ICU, and all that
newer stuff. This is used in the tsearch facilities themselves, but
also in other modules such as ltree, pg_trgm, and unaccent.

Several of the functions are wrappers around <ctype.h> functions, like

int
t_isalpha(const char *ptr)
{
    int         clen = pg_mblen(ptr);
    wchar_t     character[WC_BUF_LEN];
    pg_locale_t mylocale = 0;   /* TODO */

    if (clen == 1 || database_ctype_is_c)
        return isalpha(TOUCHAR(ptr));

    char2wchar(character, WC_BUF_LEN, ptr, clen, mylocale);

    return iswalpha((wint_t) character[0]);
}

So this has multibyte and encoding awareness, but does not observe
locale provider or collation settings.

As an easy start toward fixing this, I think several of these functions
we don't even need.

t_isdigit() and t_isspace() are just used to parse various configuration
and data files, and surely we don't need support for encoding-dependent
multibyte support for parsing ASCII digits and ASCII spaces. At least,
I didn't find any indications in the documentation of these file formats
that they are supposed to support that kind of thing. So these can be
replaced by the normal isdigit() and isspace().

There is one call to t_isprint(), which is similarly used only to parse
some flags in a configuration file. From the surrounding code you can
deduce that it's only called on single-byte characters, so it can
similarly be replaced by plain isprint().

Note, pg_trgm has some compile-time options with macros such as
KEEPONLYALNUM and IGNORECASE. AFAICT, these are not documented, and the
non-default variant is not supported by any test cases. So as part of
this undertaking, I'm going to remove the non-default variants if they
are in the way of cleanup.

Attachments:

v2-0001-Remove-t_isdigit.patch (text/plain; charset=UTF-8)

From 29800e82d24e70bd7e127ea1e6574dbbf89684a6 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Mon, 2 Dec 2024 11:34:17 +0100
Subject: [PATCH v2 1/4] Remove t_isdigit()

---
 contrib/ltree/ltree_io.c                |  8 ++++----
 src/backend/tsearch/spell.c             |  4 ++--
 src/backend/tsearch/ts_locale.c         | 15 ---------------
 src/backend/utils/adt/tsquery.c         |  2 +-
 src/backend/utils/adt/tsvector_parser.c |  4 ++--
 src/include/tsearch/ts_locale.h         |  1 -
 6 files changed, 9 insertions(+), 25 deletions(-)

diff --git a/contrib/ltree/ltree_io.c b/contrib/ltree/ltree_io.c
index 11eefc809b2..b54a15d6c68 100644
--- a/contrib/ltree/ltree_io.c
+++ b/contrib/ltree/ltree_io.c
@@ -411,7 +411,7 @@ parse_lquery(const char *buf, struct Node *escontext)
 			case LQPRS_WAITFNUM:
 				if (t_iseq(ptr, ','))
 					state = LQPRS_WAITSNUM;
-				else if (t_isdigit(ptr))
+				else if (isdigit((unsigned char) *ptr))
 				{
 					int			low = atoi(ptr);
 
@@ -429,7 +429,7 @@ parse_lquery(const char *buf, struct Node *escontext)
 					UNCHAR;
 				break;
 			case LQPRS_WAITSNUM:
-				if (t_isdigit(ptr))
+				if (isdigit((unsigned char) *ptr))
 				{
 					int			high = atoi(ptr);
 
@@ -460,7 +460,7 @@ parse_lquery(const char *buf, struct Node *escontext)
 			case LQPRS_WAITCLOSE:
 				if (t_iseq(ptr, '}'))
 					state = LQPRS_WAITEND;
-				else if (!t_isdigit(ptr))
+				else if (!isdigit((unsigned char) *ptr))
 					UNCHAR;
 				break;
 			case LQPRS_WAITND:
@@ -471,7 +471,7 @@ parse_lquery(const char *buf, struct Node *escontext)
 				}
 				else if (t_iseq(ptr, ','))
 					state = LQPRS_WAITSNUM;
-				else if (!t_isdigit(ptr))
+				else if (!isdigit((unsigned char) *ptr))
 					UNCHAR;
 				break;
 			case LQPRS_WAITEND:
diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c
index aaedb0aa852..7800f794e84 100644
--- a/src/backend/tsearch/spell.c
+++ b/src/backend/tsearch/spell.c
@@ -390,7 +390,7 @@ getNextFlagFromString(IspellDict *Conf, const char **sflagset, char *sflag)
 				*sflagset = next;
 				while (**sflagset)
 				{
-					if (t_isdigit(*sflagset))
+					if (isdigit((unsigned char) **sflagset))
 					{
 						if (!met_comma)
 							ereport(ERROR,
@@ -1750,7 +1750,7 @@ NISortDictionary(IspellDict *Conf)
 							(errcode(ERRCODE_CONFIG_FILE_ERROR),
 							 errmsg("invalid affix alias \"%s\"",
 									Conf->Spell[i]->p.flag)));
-				if (*end != '\0' && !t_isdigit(end) && !t_isspace(end))
+				if (*end != '\0' && !isdigit((unsigned char) *end) && !t_isspace(end))
 					ereport(ERROR,
 							(errcode(ERRCODE_CONFIG_FILE_ERROR),
 							 errmsg("invalid affix alias \"%s\"",
diff --git a/src/backend/tsearch/ts_locale.c b/src/backend/tsearch/ts_locale.c
index f8367b41312..7247b8cbe8a 100644
--- a/src/backend/tsearch/ts_locale.c
+++ b/src/backend/tsearch/ts_locale.c
@@ -31,21 +31,6 @@ static void tsearch_readline_callback(void *arg);
  */
 #define WC_BUF_LEN  3
 
-int
-t_isdigit(const char *ptr)
-{
-	int			clen = pg_mblen(ptr);
-	wchar_t		character[WC_BUF_LEN];
-	pg_locale_t mylocale = 0;	/* TODO */
-
-	if (clen == 1 || database_ctype_is_c)
-		return isdigit(TOUCHAR(ptr));
-
-	char2wchar(character, WC_BUF_LEN, ptr, clen, mylocale);
-
-	return iswdigit((wint_t) character[0]);
-}
-
 int
 t_isspace(const char *ptr)
 {
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 6f532188392..219ab543f62 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -197,7 +197,7 @@ parse_phrase_operator(TSQueryParserState pstate, int16 *distance)
 					continue;
 				}
 
-				if (!t_isdigit(ptr))
+				if (!isdigit((unsigned char) *ptr))
 					return false;
 
 				errno = 0;
diff --git a/src/backend/utils/adt/tsvector_parser.c b/src/backend/utils/adt/tsvector_parser.c
index ea961bb8a4a..9e33de0bde7 100644
--- a/src/backend/utils/adt/tsvector_parser.c
+++ b/src/backend/utils/adt/tsvector_parser.c
@@ -317,7 +317,7 @@ gettoken_tsvector(TSVectorParseState state,
 		}
 		else if (statecode == INPOSINFO)
 		{
-			if (t_isdigit(state->prsbuf))
+			if (isdigit((unsigned char) *state->prsbuf))
 			{
 				if (posalen == 0)
 				{
@@ -375,7 +375,7 @@ gettoken_tsvector(TSVectorParseState state,
 			else if (t_isspace(state->prsbuf) ||
 					 *(state->prsbuf) == '\0')
 				RETURN_TOKEN;
-			else if (!t_isdigit(state->prsbuf))
+			else if (!isdigit((unsigned char) *state->prsbuf))
 				PRSSYNTAXERROR;
 		}
 		else					/* internal error */
diff --git a/src/include/tsearch/ts_locale.h b/src/include/tsearch/ts_locale.h
index abc21a7ebea..8ef380791fe 100644
--- a/src/include/tsearch/ts_locale.h
+++ b/src/include/tsearch/ts_locale.h
@@ -39,7 +39,6 @@ typedef struct
 
 #define COPYCHAR(d,s)	memcpy(d, s, pg_mblen(s))
 
-extern int	t_isdigit(const char *ptr);
 extern int	t_isspace(const char *ptr);
 extern int	t_isalpha(const char *ptr);
 extern int	t_isalnum(const char *ptr);
-- 
2.47.1

Attachment: v2-0002-Remove-t_isspace.patch (text/plain; charset=UTF-8)

From d79f5b85e36eba026368322757864fc773085c25 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Mon, 2 Dec 2024 11:34:17 +0100
Subject: [PATCH v2 2/4] Remove t_isspace()

---
 contrib/dict_xsyn/dict_xsyn.c           |  4 +--
 contrib/ltree/ltxtquery_io.c            |  2 +-
 contrib/pg_trgm/trgm.h                  |  6 ----
 contrib/unaccent/unaccent.c             |  2 +-
 src/backend/tsearch/dict_synonym.c      |  4 +--
 src/backend/tsearch/dict_thesaurus.c    | 10 +++---
 src/backend/tsearch/spell.c             | 42 ++++++++++++-------------
 src/backend/tsearch/ts_locale.c         | 15 ---------
 src/backend/tsearch/ts_utils.c          |  2 +-
 src/backend/utils/adt/tsquery.c         | 10 +++---
 src/backend/utils/adt/tsvector_parser.c |  6 ++--
 src/include/tsearch/ts_locale.h         |  1 -
 12 files changed, 41 insertions(+), 63 deletions(-)

diff --git a/contrib/dict_xsyn/dict_xsyn.c b/contrib/dict_xsyn/dict_xsyn.c
index 3635ed1df84..f8c0a5bf5c5 100644
--- a/contrib/dict_xsyn/dict_xsyn.c
+++ b/contrib/dict_xsyn/dict_xsyn.c
@@ -48,14 +48,14 @@ find_word(char *in, char **end)
 	char	   *start;
 
 	*end = NULL;
-	while (*in && t_isspace(in))
+	while (*in && isspace((unsigned char) *in))
 		in += pg_mblen(in);
 
 	if (!*in || *in == '#')
 		return NULL;
 	start = in;
 
-	while (*in && !t_isspace(in))
+	while (*in && !isspace((unsigned char) *in))
 		in += pg_mblen(in);
 
 	*end = in;
diff --git a/contrib/ltree/ltxtquery_io.c b/contrib/ltree/ltxtquery_io.c
index 121fc55e469..7b8fba17ff2 100644
--- a/contrib/ltree/ltxtquery_io.c
+++ b/contrib/ltree/ltxtquery_io.c
@@ -88,7 +88,7 @@ gettoken_query(QPRS_STATE *state, int32 *val, int32 *lenval, char **strval, uint
 					*lenval = charlen;
 					*flag = 0;
 				}
-				else if (!t_isspace(state->buf))
+				else if (!isspace((unsigned char) *state->buf))
 					ereturn(state->escontext, ERR,
 							(errcode(ERRCODE_SYNTAX_ERROR),
 							 errmsg("operand syntax error")));
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index afb0adb222b..10827563694 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -15,7 +15,6 @@
  */
 #define LPADDING		2
 #define RPADDING		1
-#define KEEPONLYALNUM
 /*
  * Caution: IGNORECASE macro means that trigrams are case-insensitive.
  * If this macro is disabled, the ~* and ~~* operators must be removed from
@@ -51,13 +50,8 @@ typedef char trgm[3];
 	*(((char*)(a))+2) = *(((char*)(b))+2);	\
 } while(0)
 
-#ifdef KEEPONLYALNUM
 #define ISWORDCHR(c)	(t_isalnum(c))
 #define ISPRINTABLECHAR(a)	( isascii( *(unsigned char*)(a) ) && (isalnum( *(unsigned char*)(a) ) || *(unsigned char*)(a)==' ') )
-#else
-#define ISWORDCHR(c)	(!t_isspace(c))
-#define ISPRINTABLECHAR(a)	( isascii( *(unsigned char*)(a) ) && isprint( *(unsigned char*)(a) ) )
-#endif
 #define ISPRINTABLETRGM(t)	( ISPRINTABLECHAR( ((char*)(t)) ) && ISPRINTABLECHAR( ((char*)(t))+1 ) && ISPRINTABLECHAR( ((char*)(t))+2 ) )
 
 #define ISESCAPECHAR(x) (*(x) == '\\')	/* Wildcard escape character */
diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c
index 0217696aac1..fcc25dc7139 100644
--- a/contrib/unaccent/unaccent.c
+++ b/contrib/unaccent/unaccent.c
@@ -155,7 +155,7 @@ initTrie(const char *filename)
 				{
 					ptrlen = pg_mblen(ptr);
 					/* ignore whitespace, but end src or trg */
-					if (t_isspace(ptr))
+					if (isspace((unsigned char) *ptr))
 					{
 						if (state == 1)
 							state = 2;
diff --git a/src/backend/tsearch/dict_synonym.c b/src/backend/tsearch/dict_synonym.c
index 77cd511ee51..77c0d7a3593 100644
--- a/src/backend/tsearch/dict_synonym.c
+++ b/src/backend/tsearch/dict_synonym.c
@@ -47,7 +47,7 @@ findwrd(char *in, char **end, uint16 *flags)
 	char	   *lastchar;
 
 	/* Skip leading spaces */
-	while (*in && t_isspace(in))
+	while (*in && isspace((unsigned char) *in))
 		in += pg_mblen(in);
 
 	/* Return NULL on empty lines */
@@ -60,7 +60,7 @@ findwrd(char *in, char **end, uint16 *flags)
 	lastchar = start = in;
 
 	/* Find end of word */
-	while (*in && !t_isspace(in))
+	while (*in && !isspace((unsigned char) *in))
 	{
 		lastchar = in;
 		in += pg_mblen(in);
diff --git a/src/backend/tsearch/dict_thesaurus.c b/src/backend/tsearch/dict_thesaurus.c
index 6b159f9f569..f1449b5607f 100644
--- a/src/backend/tsearch/dict_thesaurus.c
+++ b/src/backend/tsearch/dict_thesaurus.c
@@ -190,7 +190,7 @@ thesaurusRead(const char *filename, DictThesaurus *d)
 		ptr = line;
 
 		/* is it a comment? */
-		while (*ptr && t_isspace(ptr))
+		while (*ptr && isspace((unsigned char) *ptr))
 			ptr += pg_mblen(ptr);
 
 		if (t_iseq(ptr, '#') || *ptr == '\0' ||
@@ -212,7 +212,7 @@ thesaurusRead(const char *filename, DictThesaurus *d)
 								 errmsg("unexpected delimiter")));
 					state = TR_WAITSUBS;
 				}
-				else if (!t_isspace(ptr))
+				else if (!isspace((unsigned char) *ptr))
 				{
 					beginwrd = ptr;
 					state = TR_INLEX;
@@ -225,7 +225,7 @@ thesaurusRead(const char *filename, DictThesaurus *d)
 					newLexeme(d, beginwrd, ptr, idsubst, posinsubst++);
 					state = TR_WAITSUBS;
 				}
-				else if (t_isspace(ptr))
+				else if (isspace((unsigned char) *ptr))
 				{
 					newLexeme(d, beginwrd, ptr, idsubst, posinsubst++);
 					state = TR_WAITLEX;
@@ -245,7 +245,7 @@ thesaurusRead(const char *filename, DictThesaurus *d)
 					state = TR_INSUBS;
 					beginwrd = ptr + pg_mblen(ptr);
 				}
-				else if (!t_isspace(ptr))
+				else if (!isspace((unsigned char) *ptr))
 				{
 					useasis = false;
 					beginwrd = ptr;
@@ -254,7 +254,7 @@ thesaurusRead(const char *filename, DictThesaurus *d)
 			}
 			else if (state == TR_INSUBS)
 			{
-				if (t_isspace(ptr))
+				if (isspace((unsigned char) *ptr))
 				{
 					if (ptr == beginwrd)
 						ereport(ERROR,
diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c
index 7800f794e84..b41afbd7322 100644
--- a/src/backend/tsearch/spell.c
+++ b/src/backend/tsearch/spell.c
@@ -408,7 +408,7 @@ getNextFlagFromString(IspellDict *Conf, const char **sflagset, char *sflag)
 											*sflagset)));
 						met_comma = true;
 					}
-					else if (!t_isspace(*sflagset))
+					else if (!isspace((unsigned char) **sflagset))
 					{
 						ereport(ERROR,
 								(errcode(ERRCODE_CONFIG_FILE_ERROR),
@@ -542,7 +542,7 @@ NIImportDictionary(IspellDict *Conf, const char *filename)
 			while (*s)
 			{
 				/* we allow only single encoded flags for faster works */
-				if (pg_mblen(s) == 1 && t_isprint(s) && !t_isspace(s))
+				if (pg_mblen(s) == 1 && t_isprint(s) && !isspace((unsigned char) *s))
 					s++;
 				else
 				{
@@ -558,7 +558,7 @@ NIImportDictionary(IspellDict *Conf, const char *filename)
 		s = line;
 		while (*s)
 		{
-			if (t_isspace(s))
+			if (isspace((unsigned char) *s))
 			{
 				*s = '\0';
 				break;
@@ -799,7 +799,7 @@ get_nextfield(char **str, char *next)
 		{
 			if (t_iseq(*str, '#'))
 				return false;
-			else if (!t_isspace(*str))
+			else if (!isspace((unsigned char) **str))
 			{
 				int			clen = pg_mblen(*str);
 
@@ -814,7 +814,7 @@ get_nextfield(char **str, char *next)
 		}
 		else					/* state == PAE_INMASK */
 		{
-			if (t_isspace(*str))
+			if (isspace((unsigned char) **str))
 			{
 				*next = '\0';
 				return true;
@@ -925,7 +925,7 @@ parse_affentry(char *str, char *mask, char *find, char *repl)
 		{
 			if (t_iseq(str, '#'))
 				return false;
-			else if (!t_isspace(str))
+			else if (!isspace((unsigned char) *str))
 			{
 				COPYCHAR(pmask, str);
 				pmask += pg_mblen(str);
@@ -939,7 +939,7 @@ parse_affentry(char *str, char *mask, char *find, char *repl)
 				*pmask = '\0';
 				state = PAE_WAIT_FIND;
 			}
-			else if (!t_isspace(str))
+			else if (!isspace((unsigned char) *str))
 			{
 				COPYCHAR(pmask, str);
 				pmask += pg_mblen(str);
@@ -957,7 +957,7 @@ parse_affentry(char *str, char *mask, char *find, char *repl)
 				prepl += pg_mblen(str);
 				state = PAE_INREPL;
 			}
-			else if (!t_isspace(str))
+			else if (!isspace((unsigned char) *str))
 				ereport(ERROR,
 						(errcode(ERRCODE_CONFIG_FILE_ERROR),
 						 errmsg("syntax error")));
@@ -974,7 +974,7 @@ parse_affentry(char *str, char *mask, char *find, char *repl)
 				COPYCHAR(pfind, str);
 				pfind += pg_mblen(str);
 			}
-			else if (!t_isspace(str))
+			else if (!isspace((unsigned char) *str))
 				ereport(ERROR,
 						(errcode(ERRCODE_CONFIG_FILE_ERROR),
 						 errmsg("syntax error")));
@@ -991,7 +991,7 @@ parse_affentry(char *str, char *mask, char *find, char *repl)
 				prepl += pg_mblen(str);
 				state = PAE_INREPL;
 			}
-			else if (!t_isspace(str))
+			else if (!isspace((unsigned char) *str))
 				ereport(ERROR,
 						(errcode(ERRCODE_CONFIG_FILE_ERROR),
 						 errmsg("syntax error")));
@@ -1008,7 +1008,7 @@ parse_affentry(char *str, char *mask, char *find, char *repl)
 				COPYCHAR(prepl, str);
 				prepl += pg_mblen(str);
 			}
-			else if (!t_isspace(str))
+			else if (!isspace((unsigned char) *str))
 				ereport(ERROR,
 						(errcode(ERRCODE_CONFIG_FILE_ERROR),
 						 errmsg("syntax error")));
@@ -1070,7 +1070,7 @@ addCompoundAffixFlagValue(IspellDict *Conf, char *s, uint32 val)
 	char	   *sflag;
 	int			clen;
 
-	while (*s && t_isspace(s))
+	while (*s && isspace((unsigned char) *s))
 		s += pg_mblen(s);
 
 	if (!*s)
@@ -1080,7 +1080,7 @@ addCompoundAffixFlagValue(IspellDict *Conf, char *s, uint32 val)
 
 	/* Get flag without \n */
 	sflag = sbuf;
-	while (*s && !t_isspace(s) && *s != '\n')
+	while (*s && !isspace((unsigned char) *s) && *s != '\n')
 	{
 		clen = pg_mblen(s);
 		COPYCHAR(sflag, s);
@@ -1225,7 +1225,7 @@ NIImportOOAffixes(IspellDict *Conf, const char *filename)
 
 	while ((recoded = tsearch_readline(&trst)) != NULL)
 	{
-		if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
+		if (*recoded == '\0' || isspace((unsigned char) *recoded) || t_iseq(recoded, '#'))
 		{
 			pfree(recoded);
 			continue;
@@ -1262,7 +1262,7 @@ NIImportOOAffixes(IspellDict *Conf, const char *filename)
 		{
 			char	   *s = recoded + strlen("FLAG");
 
-			while (*s && t_isspace(s))
+			while (*s && isspace((unsigned char) *s))
 				s += pg_mblen(s);
 
 			if (*s)
@@ -1298,7 +1298,7 @@ NIImportOOAffixes(IspellDict *Conf, const char *filename)
 	{
 		int			fields_read;
 
-		if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
+		if (*recoded == '\0' || isspace((unsigned char) *recoded) || t_iseq(recoded, '#'))
 			goto nextline;
 
 		fields_read = parse_ooaffentry(recoded, type, sflag, find, repl, mask);
@@ -1461,9 +1461,9 @@ NIImportAffixes(IspellDict *Conf, const char *filename)
 			s = findchar2(recoded, 'l', 'L');
 			if (s)
 			{
-				while (*s && !t_isspace(s))
+				while (*s && !isspace((unsigned char) *s))
 					s += pg_mblen(s);
-				while (*s && t_isspace(s))
+				while (*s && isspace((unsigned char) *s))
 					s += pg_mblen(s);
 
 				if (*s && pg_mblen(s) == 1)
@@ -1494,7 +1494,7 @@ NIImportAffixes(IspellDict *Conf, const char *filename)
 			s = recoded + 4;	/* we need non-lowercased string */
 			flagflags = 0;
 
-			while (*s && t_isspace(s))
+			while (*s && isspace((unsigned char) *s))
 				s += pg_mblen(s);
 
 			if (*s == '*')
@@ -1523,7 +1523,7 @@ NIImportAffixes(IspellDict *Conf, const char *filename)
 
 				s++;
 				if (*s == '\0' || *s == '#' || *s == '\n' || *s == ':' ||
-					t_isspace(s))
+					isspace((unsigned char) *s))
 				{
 					oldformat = true;
 					goto nextline;
@@ -1750,7 +1750,7 @@ NISortDictionary(IspellDict *Conf)
 							(errcode(ERRCODE_CONFIG_FILE_ERROR),
 							 errmsg("invalid affix alias \"%s\"",
 									Conf->Spell[i]->p.flag)));
-				if (*end != '\0' && !isdigit((unsigned char) *end) && !t_isspace(end))
+				if (*end != '\0' && !isdigit((unsigned char) *end) && !isspace((unsigned char) *end))
 					ereport(ERROR,
 							(errcode(ERRCODE_CONFIG_FILE_ERROR),
 							 errmsg("invalid affix alias \"%s\"",
diff --git a/src/backend/tsearch/ts_locale.c b/src/backend/tsearch/ts_locale.c
index 7247b8cbe8a..70a39f48814 100644
--- a/src/backend/tsearch/ts_locale.c
+++ b/src/backend/tsearch/ts_locale.c
@@ -31,21 +31,6 @@ static void tsearch_readline_callback(void *arg);
  */
 #define WC_BUF_LEN  3
 
-int
-t_isspace(const char *ptr)
-{
-	int			clen = pg_mblen(ptr);
-	wchar_t		character[WC_BUF_LEN];
-	pg_locale_t mylocale = 0;	/* TODO */
-
-	if (clen == 1 || database_ctype_is_c)
-		return isspace(TOUCHAR(ptr));
-
-	char2wchar(character, WC_BUF_LEN, ptr, clen, mylocale);
-
-	return iswspace((wint_t) character[0]);
-}
-
 int
 t_isalpha(const char *ptr)
 {
diff --git a/src/backend/tsearch/ts_utils.c b/src/backend/tsearch/ts_utils.c
index 81967d29e9a..f20e61d4c8c 100644
--- a/src/backend/tsearch/ts_utils.c
+++ b/src/backend/tsearch/ts_utils.c
@@ -88,7 +88,7 @@ readstoplist(const char *fname, StopList *s, char *(*wordop) (const char *))
 			char	   *pbuf = line;
 
 			/* Trim trailing space */
-			while (*pbuf && !t_isspace(pbuf))
+			while (*pbuf && !isspace((unsigned char) *pbuf))
 				pbuf += pg_mblen(pbuf);
 			*pbuf = '\0';
 
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 219ab543f62..0366c2a2acd 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -274,7 +274,7 @@ parse_or_operator(TSQueryParserState pstate)
 		 * So we still treat OR literal as operation with possibly incorrect
 		 * operand and will not search it as lexeme
 		 */
-		if (!t_isspace(ptr))
+		if (!isspace((unsigned char) *ptr))
 			break;
 	}
 
@@ -315,7 +315,7 @@ gettoken_query_standard(TSQueryParserState state, int8 *operator,
 					/* generic syntax error message is fine */
 					return PT_ERR;
 				}
-				else if (!t_isspace(state->buf))
+				else if (!isspace((unsigned char) *state->buf))
 				{
 					/*
 					 * We rely on the tsvector parser to parse the value for
@@ -383,7 +383,7 @@ gettoken_query_standard(TSQueryParserState state, int8 *operator,
 				{
 					return (state->count) ? PT_ERR : PT_END;
 				}
-				else if (!t_isspace(state->buf))
+				else if (!isspace((unsigned char) *state->buf))
 				{
 					return PT_ERR;
 				}
@@ -444,7 +444,7 @@ gettoken_query_websearch(TSQueryParserState state, int8 *operator,
 					state->state = WAITOPERAND;
 					continue;
 				}
-				else if (!t_isspace(state->buf))
+				else if (!isspace((unsigned char) *state->buf))
 				{
 					/*
 					 * We rely on the tsvector parser to parse the value for
@@ -492,7 +492,7 @@ gettoken_query_websearch(TSQueryParserState state, int8 *operator,
 					state->buf++;
 					continue;
 				}
-				else if (!t_isspace(state->buf))
+				else if (!isspace((unsigned char) *state->buf))
 				{
 					/* insert implicit AND between operands */
 					state->state = WAITOPERAND;
diff --git a/src/backend/utils/adt/tsvector_parser.c b/src/backend/utils/adt/tsvector_parser.c
index 9e33de0bde7..750a1e8e8d9 100644
--- a/src/backend/utils/adt/tsvector_parser.c
+++ b/src/backend/utils/adt/tsvector_parser.c
@@ -206,7 +206,7 @@ gettoken_tsvector(TSVectorParseState state,
 			else if ((state->oprisdelim && ISOPERATOR(state->prsbuf)) ||
 					 (state->is_web && t_iseq(state->prsbuf, '"')))
 				PRSSYNTAXERROR;
-			else if (!t_isspace(state->prsbuf))
+			else if (!isspace((unsigned char) *state->prsbuf))
 			{
 				COPYCHAR(curpos, state->prsbuf);
 				curpos += pg_mblen(state->prsbuf);
@@ -236,7 +236,7 @@ gettoken_tsvector(TSVectorParseState state,
 				statecode = WAITNEXTCHAR;
 				oldstate = WAITENDWORD;
 			}
-			else if (t_isspace(state->prsbuf) || *(state->prsbuf) == '\0' ||
+			else if (isspace((unsigned char) *state->prsbuf) || *(state->prsbuf) == '\0' ||
 					 (state->oprisdelim && ISOPERATOR(state->prsbuf)) ||
 					 (state->is_web && t_iseq(state->prsbuf, '"')))
 			{
@@ -372,7 +372,7 @@ gettoken_tsvector(TSVectorParseState state,
 					PRSSYNTAXERROR;
 				WEP_SETWEIGHT(pos[npos - 1], 0);
 			}
-			else if (t_isspace(state->prsbuf) ||
+			else if (isspace((unsigned char) *state->prsbuf) ||
 					 *(state->prsbuf) == '\0')
 				RETURN_TOKEN;
 			else if (!isdigit((unsigned char) *state->prsbuf))
diff --git a/src/include/tsearch/ts_locale.h b/src/include/tsearch/ts_locale.h
index 8ef380791fe..9606bb30983 100644
--- a/src/include/tsearch/ts_locale.h
+++ b/src/include/tsearch/ts_locale.h
@@ -39,7 +39,6 @@ typedef struct
 
 #define COPYCHAR(d,s)	memcpy(d, s, pg_mblen(s))
 
-extern int	t_isspace(const char *ptr);
 extern int	t_isalpha(const char *ptr);
 extern int	t_isalnum(const char *ptr);
 extern int	t_isprint(const char *ptr);
-- 
2.47.1

Attachment: v2-0003-Remove-t_isprint.patch (text/plain; charset=UTF-8)

From 006ff36c6262367888f67afb9b9ddabcc1e203b2 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Mon, 2 Dec 2024 11:34:17 +0100
Subject: [PATCH v2 3/4] Remove t_isprint()

---
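For illustration only, a standalone sketch of why the single caller is
safe with plain isprint(): the check is guarded by pg_mblen(s) == 1, so
only single-byte characters ever reach it.  Both function names below
are hypothetical.  utf8_mblen() is a simplified stand-in for pg_mblen()
that assumes UTF-8 input, and the "C" locale is assumed.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Stand-in for pg_mblen(); an assumption of this sketch, UTF-8 only. */
int
utf8_mblen(const char *s)
{
    unsigned char c = (unsigned char) *s;

    if (c < 0x80)
        return 1;
    if ((c & 0xE0) == 0xC0)
        return 2;
    if ((c & 0xF0) == 0xE0)
        return 3;
    if ((c & 0xF8) == 0xF0)
        return 4;
    return 1;                   /* invalid lead byte: step one byte */
}

/*
 * Length of the leading run of single-byte, printable, non-space
 * characters, mirroring the condition used to scan dictionary flags.
 */
size_t
flag_prefix_len(const char *s)
{
    size_t      n = 0;

    assert(s != NULL);

    while (s[n] != '\0' &&
           utf8_mblen(s + n) == 1 &&
           isprint((unsigned char) s[n]) &&
           !isspace((unsigned char) s[n]))
        n++;

    return n;
}
```

In this sketch the scan of "AB c" stops at the space, and the scan of
"A" followed by a two-byte UTF-8 character stops after the "A".
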
 src/backend/tsearch/spell.c     |  2 +-
 src/backend/tsearch/ts_locale.c | 15 ---------------
 src/include/tsearch/ts_locale.h |  1 -
 3 files changed, 1 insertion(+), 17 deletions(-)

diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c
index b41afbd7322..7eca1714e9b 100644
--- a/src/backend/tsearch/spell.c
+++ b/src/backend/tsearch/spell.c
@@ -542,7 +542,7 @@ NIImportDictionary(IspellDict *Conf, const char *filename)
 			while (*s)
 			{
 				/* we allow only single encoded flags for faster works */
-				if (pg_mblen(s) == 1 && t_isprint(s) && !isspace((unsigned char) *s))
+				if (pg_mblen(s) == 1 && isprint((unsigned char) *s) && !isspace((unsigned char) *s))
 					s++;
 				else
 				{
diff --git a/src/backend/tsearch/ts_locale.c b/src/backend/tsearch/ts_locale.c
index 70a39f48814..a61fd36022e 100644
--- a/src/backend/tsearch/ts_locale.c
+++ b/src/backend/tsearch/ts_locale.c
@@ -61,21 +61,6 @@ t_isalnum(const char *ptr)
 	return iswalnum((wint_t) character[0]);
 }
 
-int
-t_isprint(const char *ptr)
-{
-	int			clen = pg_mblen(ptr);
-	wchar_t		character[WC_BUF_LEN];
-	pg_locale_t mylocale = 0;	/* TODO */
-
-	if (clen == 1 || database_ctype_is_c)
-		return isprint(TOUCHAR(ptr));
-
-	char2wchar(character, WC_BUF_LEN, ptr, clen, mylocale);
-
-	return iswprint((wint_t) character[0]);
-}
-
 
 /*
  * Set up to read a file using tsearch_readline().  This facility is
diff --git a/src/include/tsearch/ts_locale.h b/src/include/tsearch/ts_locale.h
index 9606bb30983..71e1f78fa36 100644
--- a/src/include/tsearch/ts_locale.h
+++ b/src/include/tsearch/ts_locale.h
@@ -41,7 +41,6 @@ typedef struct
 
 extern int	t_isalpha(const char *ptr);
 extern int	t_isalnum(const char *ptr);
-extern int	t_isprint(const char *ptr);
 
 extern char *lowerstr(const char *str);
 extern char *lowerstr_with_len(const char *str, int len);
-- 
2.47.1

Attachment: v2-0004-Remove-ts_locale.c-s-lowerstr.patch (text/plain; charset=UTF-8)

From cc8b59a382bb9283bb61e826917268a4c98b8a57 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Mon, 9 Dec 2024 10:48:34 +0100
Subject: [PATCH v2 4/4] Remove ts_locale.c's lowerstr()

lowerstr() and lowerstr_with_len() in ts_locale.c do the same thing as
str_tolower() that the rest of the system uses, except that the former
don't use the common locale provider framework but instead use the
global libc locale settings.

This patch replaces uses of lowerstr*() with str_tolower(...,
DEFAULT_COLLATION_OID).  For instances that use a libc locale
globally, this will result in exactly the same behavior.  For
instances that use other locale providers, you now get consistent
behavior and are no longer dependent on the libc locale settings (for
this case; there are others).

Most uses of these functions are for processing dictionary and
configuration files.  In those cases, using the default collation
seems appropriate.  At least we don't have a more specific collation
available.  But the code in contrib/pg_trgm should really depend on
the collation of the columns being processed.  That is not done here;
it can be done in a separate patch.
---
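For illustration only, a standalone sketch of the call-shape change:
lowerstr(s) becomes str_tolower(s, strlen(s), collid), and
lowerstr_with_len(s, len) becomes str_tolower(s, len, collid).  The
functions below are hypothetical ASCII-only stand-ins, not the backend
implementation: they take no collation argument, ignore locale
providers entirely, and use malloc() where the backend would use
palloc().

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Stand-in with the (buffer, length) shape of str_tolower().  The input
 * need not be NUL-terminated; the result always is.  Returns NULL if
 * memory allocation fails.  Caller frees the result.
 */
char *
ascii_tolower_n(const char *buf, size_t nbytes)
{
    char       *out;
    size_t      i;

    assert(buf != NULL);

    out = malloc(nbytes + 1);
    if (out == NULL)
        return NULL;

    for (i = 0; i < nbytes; i++)
        out[i] = (char) tolower((unsigned char) buf[i]);
    out[nbytes] = '\0';

    return out;
}

/* Shape of the removed lowerstr(): wrapper for NUL-terminated input. */
char *
ascii_tolower(const char *str)
{
    return ascii_tolower_n(str, strlen(str));
}
```

Passing an explicit length lets a caller fold a prefix of a larger
buffer, for example the first four bytes of "StopWORDS", without
copying it first.
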
 contrib/dict_xsyn/dict_xsyn.c        |  6 +-
 contrib/pg_trgm/trgm_op.c            |  6 +-
 contrib/pg_trgm/trgm_regexp.c        |  8 ++-
 src/backend/snowball/dict_snowball.c |  8 ++-
 src/backend/tsearch/dict_ispell.c    |  7 ++-
 src/backend/tsearch/dict_simple.c    |  7 ++-
 src/backend/tsearch/dict_synonym.c   |  8 ++-
 src/backend/tsearch/spell.c          |  7 ++-
 src/backend/tsearch/ts_locale.c      | 89 ----------------------------
 src/backend/tsearch/ts_utils.c       |  5 +-
 src/include/tsearch/ts_locale.h      |  3 -
 src/include/tsearch/ts_public.h      |  2 +-
 12 files changed, 39 insertions(+), 117 deletions(-)

diff --git a/contrib/dict_xsyn/dict_xsyn.c b/contrib/dict_xsyn/dict_xsyn.c
index f8c0a5bf5c5..2206300f7b5 100644
--- a/contrib/dict_xsyn/dict_xsyn.c
+++ b/contrib/dict_xsyn/dict_xsyn.c
@@ -14,9 +14,11 @@
 
 #include <ctype.h>
 
+#include "catalog/pg_collation_d.h"
 #include "commands/defrem.h"
 #include "tsearch/ts_locale.h"
 #include "tsearch/ts_public.h"
+#include "utils/formatting.h"
 
 PG_MODULE_MAGIC;
 
@@ -93,7 +95,7 @@ read_dictionary(DictSyn *d, const char *filename)
 		if (*line == '\0')
 			continue;
 
-		value = lowerstr(line);
+		value = str_tolower(line, strlen(line), DEFAULT_COLLATION_OID);
 		pfree(line);
 
 		pos = value;
@@ -210,7 +212,7 @@ dxsyn_lexize(PG_FUNCTION_ARGS)
 	{
 		char	   *temp = pnstrdup(in, length);
 
-		word.key = lowerstr(temp);
+		word.key = str_tolower(temp, length, DEFAULT_COLLATION_OID);
 		pfree(temp);
 		word.value = NULL;
 	}
diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index c509d15ee40..d0833b3e4a1 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -5,12 +5,14 @@
 
 #include <ctype.h>
 
+#include "catalog/pg_collation_d.h"
 #include "catalog/pg_type.h"
 #include "common/int.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
 #include "trgm.h"
 #include "tsearch/ts_locale.h"
+#include "utils/formatting.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -303,7 +305,7 @@ generate_trgm_only(trgm *trg, char *str, int slen, TrgmBound *bounds)
 	while ((bword = find_word(eword, slen - (eword - str), &eword, &charlen)) != NULL)
 	{
 #ifdef IGNORECASE
-		bword = lowerstr_with_len(bword, eword - bword);
+		bword = str_tolower(bword, eword - bword, DEFAULT_COLLATION_OID);
 		bytelen = strlen(bword);
 #else
 		bytelen = eword - bword;
@@ -899,7 +901,7 @@ generate_wildcard_trgm(const char *str, int slen)
 									  buf, &bytelen, &charlen)) != NULL)
 	{
 #ifdef IGNORECASE
-		buf2 = lowerstr_with_len(buf, bytelen);
+		buf2 = str_tolower(buf, bytelen, DEFAULT_COLLATION_OID);
 		bytelen = strlen(buf2);
 #else
 		buf2 = buf;
diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index 75d6d1d4a8d..457b21b8302 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -191,9 +191,11 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_collation_d.h"
 #include "regex/regexport.h"
 #include "trgm.h"
 #include "tsearch/ts_locale.h"
+#include "utils/formatting.h"
 #include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "varatt.h"
@@ -847,16 +849,16 @@ convertPgWchar(pg_wchar c, trgm_mb_char *result)
 	 * within each color, since we used the REG_ICASE option; so there's no
 	 * need to process the uppercase version.
 	 *
-	 * XXX this code is dependent on the assumption that lowerstr() works the
+	 * XXX this code is dependent on the assumption that str_tolower() works the
 	 * same as the regex engine's internal case folding machinery.  Might be
 	 * wiser to expose pg_wc_tolower and test whether c == pg_wc_tolower(c).
 	 * On the other hand, the trigrams in the index were created using
-	 * lowerstr(), so we're probably screwed if there's any incompatibility
+	 * str_tolower(), so we're probably screwed if there's any incompatibility
 	 * anyway.
 	 */
 #ifdef IGNORECASE
 	{
-		char	   *lowerCased = lowerstr(s);
+		char	   *lowerCased = str_tolower(s, strlen(s), DEFAULT_COLLATION_OID);
 
 		if (strcmp(lowerCased, s) != 0)
 		{
diff --git a/src/backend/snowball/dict_snowball.c b/src/backend/snowball/dict_snowball.c
index caf86490683..12f7485bcde 100644
--- a/src/backend/snowball/dict_snowball.c
+++ b/src/backend/snowball/dict_snowball.c
@@ -12,9 +12,11 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_collation_d.h"
 #include "commands/defrem.h"
-#include "tsearch/ts_locale.h"
+#include "mb/pg_wchar.h"
 #include "tsearch/ts_public.h"
+#include "utils/formatting.h"
 
 /* Some platforms define MAXINT and/or MININT, causing conflicts */
 #ifdef MAXINT
@@ -236,7 +238,7 @@ dsnowball_init(PG_FUNCTION_ARGS)
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 						 errmsg("multiple StopWords parameters")));
-			readstoplist(defGetString(defel), &d->stoplist, lowerstr);
+			readstoplist(defGetString(defel), &d->stoplist, str_tolower);
 			stoploaded = true;
 		}
 		else if (strcmp(defel->defname, "language") == 0)
@@ -272,7 +274,7 @@ dsnowball_lexize(PG_FUNCTION_ARGS)
 	DictSnowball *d = (DictSnowball *) PG_GETARG_POINTER(0);
 	char	   *in = (char *) PG_GETARG_POINTER(1);
 	int32		len = PG_GETARG_INT32(2);
-	char	   *txt = lowerstr_with_len(in, len);
+	char	   *txt = str_tolower(in, len, DEFAULT_COLLATION_OID);
 	TSLexeme   *res = palloc0(sizeof(TSLexeme) * 2);
 
 	/*
diff --git a/src/backend/tsearch/dict_ispell.c b/src/backend/tsearch/dict_ispell.c
index 07b9ad794de..8772c95038f 100644
--- a/src/backend/tsearch/dict_ispell.c
+++ b/src/backend/tsearch/dict_ispell.c
@@ -13,11 +13,12 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_collation_d.h"
 #include "commands/defrem.h"
 #include "tsearch/dicts/spell.h"
-#include "tsearch/ts_locale.h"
 #include "tsearch/ts_public.h"
 #include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
 
 
 typedef struct
@@ -72,7 +73,7 @@ dispell_init(PG_FUNCTION_ARGS)
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 						 errmsg("multiple StopWords parameters")));
-			readstoplist(defGetString(defel), &(d->stoplist), lowerstr);
+			readstoplist(defGetString(defel), &(d->stoplist), str_tolower);
 			stoploaded = true;
 		}
 		else
@@ -121,7 +122,7 @@ dispell_lexize(PG_FUNCTION_ARGS)
 	if (len <= 0)
 		PG_RETURN_POINTER(NULL);
 
-	txt = lowerstr_with_len(in, len);
+	txt = str_tolower(in, len, DEFAULT_COLLATION_OID);
 	res = NINormalizeWord(&(d->obj), txt);
 
 	if (res == NULL)
diff --git a/src/backend/tsearch/dict_simple.c b/src/backend/tsearch/dict_simple.c
index b0c9fd7946f..b914875dd96 100644
--- a/src/backend/tsearch/dict_simple.c
+++ b/src/backend/tsearch/dict_simple.c
@@ -13,10 +13,11 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_collation_d.h"
 #include "commands/defrem.h"
-#include "tsearch/ts_locale.h"
 #include "tsearch/ts_public.h"
 #include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
 
 
 typedef struct
@@ -47,7 +48,7 @@ dsimple_init(PG_FUNCTION_ARGS)
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 						 errmsg("multiple StopWords parameters")));
-			readstoplist(defGetString(defel), &d->stoplist, lowerstr);
+			readstoplist(defGetString(defel), &d->stoplist, str_tolower);
 			stoploaded = true;
 		}
 		else if (strcmp(defel->defname, "accept") == 0)
@@ -80,7 +81,7 @@ dsimple_lexize(PG_FUNCTION_ARGS)
 	char	   *txt;
 	TSLexeme   *res;
 
-	txt = lowerstr_with_len(in, len);
+	txt = str_tolower(in, len, DEFAULT_COLLATION_OID);
 
 	if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
 	{
diff --git a/src/backend/tsearch/dict_synonym.c b/src/backend/tsearch/dict_synonym.c
index 77c0d7a3593..70adbba546c 100644
--- a/src/backend/tsearch/dict_synonym.c
+++ b/src/backend/tsearch/dict_synonym.c
@@ -13,10 +13,12 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_collation_d.h"
 #include "commands/defrem.h"
 #include "tsearch/ts_locale.h"
 #include "tsearch/ts_public.h"
 #include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
 
 typedef struct
 {
@@ -183,8 +185,8 @@ dsynonym_init(PG_FUNCTION_ARGS)
 		}
 		else
 		{
-			d->syn[cur].in = lowerstr(starti);
-			d->syn[cur].out = lowerstr(starto);
+			d->syn[cur].in = str_tolower(starti, strlen(starti), DEFAULT_COLLATION_OID);
+			d->syn[cur].out = str_tolower(starto, strlen(starto), DEFAULT_COLLATION_OID);
 		}
 
 		d->syn[cur].outlen = strlen(starto);
@@ -223,7 +225,7 @@ dsynonym_lexize(PG_FUNCTION_ARGS)
 	if (d->case_sensitive)
 		key.in = pnstrdup(in, len);
 	else
-		key.in = lowerstr_with_len(in, len);
+		key.in = str_tolower(in, len, DEFAULT_COLLATION_OID);
 
 	key.out = NULL;
 
diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c
index 7eca1714e9b..fcbda395946 100644
--- a/src/backend/tsearch/spell.c
+++ b/src/backend/tsearch/spell.c
@@ -66,6 +66,7 @@
 #include "miscadmin.h"
 #include "tsearch/dicts/spell.h"
 #include "tsearch/ts_locale.h"
+#include "utils/formatting.h"
 #include "utils/memutils.h"
 
 
@@ -169,7 +170,7 @@ cpstrdup(IspellDict *Conf, const char *str)
 
 
 /*
- * Apply lowerstr(), producing a temporary result (in the buildCxt).
+ * Apply str_tolower(), producing a temporary result (in the buildCxt).
  */
 static char *
 lowerstr_ctx(IspellDict *Conf, const char *src)
@@ -178,7 +179,7 @@ lowerstr_ctx(IspellDict *Conf, const char *src)
 	char	   *dst;
 
 	saveCtx = MemoryContextSwitchTo(Conf->buildCxt);
-	dst = lowerstr(src);
+	dst = str_tolower(src, strlen(src), DEFAULT_COLLATION_OID);
 	MemoryContextSwitchTo(saveCtx);
 
 	return dst;
@@ -1449,7 +1450,7 @@ NIImportAffixes(IspellDict *Conf, const char *filename)
 
 	while ((recoded = tsearch_readline(&trst)) != NULL)
 	{
-		pstr = lowerstr(recoded);
+		pstr = str_tolower(recoded, strlen(recoded), DEFAULT_COLLATION_OID);
 
 		/* Skip comments and empty lines */
 		if (*pstr == '#' || *pstr == '\n')
diff --git a/src/backend/tsearch/ts_locale.c b/src/backend/tsearch/ts_locale.c
index a61fd36022e..b2aefa31c26 100644
--- a/src/backend/tsearch/ts_locale.c
+++ b/src/backend/tsearch/ts_locale.c
@@ -197,92 +197,3 @@ tsearch_readline_callback(void *arg)
 				   stp->lineno,
 				   stp->filename);
 }
-
-
-/*
- * lowerstr --- fold null-terminated string to lower case
- *
- * Returned string is palloc'd
- */
-char *
-lowerstr(const char *str)
-{
-	return lowerstr_with_len(str, strlen(str));
-}
-
-/*
- * lowerstr_with_len --- fold string to lower case
- *
- * Input string need not be null-terminated.
- *
- * Returned string is palloc'd
- */
-char *
-lowerstr_with_len(const char *str, int len)
-{
-	char	   *out;
-	pg_locale_t mylocale = 0;	/* TODO */
-
-	if (len == 0)
-		return pstrdup("");
-
-	/*
-	 * Use wide char code only when max encoding length > 1 and ctype != C.
-	 * Some operating systems fail with multi-byte encodings and a C locale.
-	 * Also, for a C locale there is no need to process as multibyte. From
-	 * backend/utils/adt/oracle_compat.c Teodor
-	 */
-	if (pg_database_encoding_max_length() > 1 && !database_ctype_is_c)
-	{
-		wchar_t    *wstr,
-				   *wptr;
-		int			wlen;
-
-		/*
-		 * alloc number of wchar_t for worst case, len contains number of
-		 * bytes >= number of characters and alloc 1 wchar_t for 0, because
-		 * wchar2char wants zero-terminated string
-		 */
-		wptr = wstr = (wchar_t *) palloc(sizeof(wchar_t) * (len + 1));
-
-		wlen = char2wchar(wstr, len + 1, str, len, mylocale);
-		Assert(wlen <= len);
-
-		while (*wptr)
-		{
-			*wptr = towlower((wint_t) *wptr);
-			wptr++;
-		}
-
-		/*
-		 * Alloc result string for worst case + '\0'
-		 */
-		len = pg_database_encoding_max_length() * wlen + 1;
-		out = (char *) palloc(len);
-
-		wlen = wchar2char(out, wstr, len, mylocale);
-
-		pfree(wstr);
-
-		if (wlen < 0)
-			ereport(ERROR,
-					(errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
-					 errmsg("conversion from wchar_t to server encoding failed: %m")));
-		Assert(wlen < len);
-	}
-	else
-	{
-		const char *ptr = str;
-		char	   *outptr;
-
-		outptr = out = (char *) palloc(sizeof(char) * (len + 1));
-		while ((ptr - str) < len && *ptr)
-		{
-			*outptr++ = tolower(TOUCHAR(ptr));
-			ptr++;
-		}
-		*outptr = '\0';
-	}
-
-	return out;
-}
diff --git a/src/backend/tsearch/ts_utils.c b/src/backend/tsearch/ts_utils.c
index f20e61d4c8c..89d5ce4ca85 100644
--- a/src/backend/tsearch/ts_utils.c
+++ b/src/backend/tsearch/ts_utils.c
@@ -16,6 +16,7 @@
 
 #include <ctype.h>
 
+#include "catalog/pg_collation_d.h"
 #include "miscadmin.h"
 #include "tsearch/ts_locale.h"
 #include "tsearch/ts_public.h"
@@ -65,7 +66,7 @@ get_tsearch_config_filename(const char *basename,
  * or palloc a new version.
  */
 void
-readstoplist(const char *fname, StopList *s, char *(*wordop) (const char *))
+readstoplist(const char *fname, StopList *s, char *(*wordop) (const char *, size_t, Oid))
 {
 	char	  **stop = NULL;
 
@@ -115,7 +116,7 @@ readstoplist(const char *fname, StopList *s, char *(*wordop) (const char *))
 
 			if (wordop)
 			{
-				stop[s->len] = wordop(line);
+				stop[s->len] = wordop(line, strlen(line), DEFAULT_COLLATION_OID);
 				if (stop[s->len] != line)
 					pfree(line);
 			}
diff --git a/src/include/tsearch/ts_locale.h b/src/include/tsearch/ts_locale.h
index 71e1f78fa36..38b1a1ba90e 100644
--- a/src/include/tsearch/ts_locale.h
+++ b/src/include/tsearch/ts_locale.h
@@ -42,9 +42,6 @@ typedef struct
 extern int	t_isalpha(const char *ptr);
 extern int	t_isalnum(const char *ptr);
 
-extern char *lowerstr(const char *str);
-extern char *lowerstr_with_len(const char *str, int len);
-
 extern bool tsearch_readline_begin(tsearch_readline_state *stp,
 								   const char *filename);
 extern char *tsearch_readline(tsearch_readline_state *stp);
diff --git a/src/include/tsearch/ts_public.h b/src/include/tsearch/ts_public.h
index e1549863a12..959bbcc00af 100644
--- a/src/include/tsearch/ts_public.h
+++ b/src/include/tsearch/ts_public.h
@@ -104,7 +104,7 @@ typedef struct
 } StopList;
 
 extern void readstoplist(const char *fname, StopList *s,
-						 char *(*wordop) (const char *));
+						 char *(*wordop) (const char *, size_t, Oid));
 extern bool searchstoplist(StopList *s, char *key);
 
 /*
-- 
2.47.1

Jeff Davis

pgsql@j-davis.com

about 1 year ago

In reply to: Peter Eisentraut (#1)

Re: fixing tsearch locale support

On Mon, 2024-12-02 at 11:57 +0100, Peter Eisentraut wrote:

t_isdigit() and t_isspace() are just used to parse various configuration
and data files, and surely we don't need support for encoding-dependent
multibyte support for parsing ASCII digits and ASCII spaces.
... So these can be replaced by the normal isdigit() and isspace().

That would still call libc, and still depend on LC_CTYPE. Should we use
pure ASCII variants?

There was also some discussion about forcing LC_COLLATE and LC_CTYPE to
C, now that the default collation doesn't depend on them any more (cf.
option 1):

/messages/by-id/CA+hUKGL82jG2PdgfQtwWG+_51TQ--6M9XNa3rtt7ub+S3Pmfsw@mail.gmail.com

If we do that, then it would be fine to use isdigit/isspace.

Regards,
Jeff Davis

Jeff Davis

pgsql@j-davis.com

about 1 year ago

In reply to: Peter Eisentraut (#2)

Re: fixing tsearch locale support

On Mon, 2024-12-09 at 11:11 +0100, Peter Eisentraut wrote:

I have expanded this patch set. The first three patches are the same as
before. I have added a new patch that gets rid of lowerstr() from
ts_locale.c and replaces it with the standard str_tolower() that
everyone else is using.

+1 to the patch series.

Note: I posted a patch series to support case folding:

https://commitfest.postgresql.org/51/5436/

and we may want to use case folding for some of these purposes
internally. But this is not a blocker.

There is some kind of compatibility risk with these changes, so we will
need a release note. And we should try to get all the changes in one
release to avoid repeated small breakages.

Regards,
Jeff Davis

Peter Eisentraut

peter_e@gmx.net

about 1 year ago

In reply to: Jeff Davis (#3)

Re: fixing tsearch locale support

On 12.12.24 19:14, Jeff Davis wrote:

On Mon, 2024-12-02 at 11:57 +0100, Peter Eisentraut wrote:

t_isdigit() and t_isspace() are just used to parse various configuration
and data files, and surely we don't need support for encoding-dependent
multibyte support for parsing ASCII digits and ASCII spaces.
... So these can be replaced by the normal isdigit() and isspace().

That would still call libc, and still depend on LC_CTYPE. Should we use
pure ASCII variants?

isdigit() and isspace() in particular are widely used throughout the
backend code without such concerns. I think the assumption is that this
is not a problem in practice: For multibyte encodings, these functions
would only be able to process the ASCII subset, and the character
classification of that should be consistent across all locales. For
single-byte encodings, among the encodings that PostgreSQL supports, I
don't think any of them actually provide non-ASCII digits or space
characters.

Jeff Davis

pgsql@j-davis.com

about 1 year ago

In reply to: Peter Eisentraut (#5)

Re: fixing tsearch locale support

On Fri, 2024-12-13 at 07:16 +0100, Peter Eisentraut wrote:

isdigit() and isspace() in particular are widely used throughout the
backend code without such concerns. I think the assumption is that this
is not a problem in practice: For multibyte encodings, these functions
would only be able to process the ASCII subset, and the character
classification of that should be consistent across all locales. For
single-byte encodings, among the encodings that PostgreSQL supports, I
don't think any of them actually provide non-ASCII digits or space
characters.

OK, that's fine with me for this patch series.

Eventually though, I think we should have built-in versions of these
ASCII functions. Even if there's no actual problem, it would more
clearly indicate that we only care about ASCII at that particular call
site, and eliminate questions about what libc might do on some platform
for some encoding/locale combination. It would also make it easier to
search for locale-sensitive functions in the codebase.

Regards,
Jeff Davis

Andreas Karlsson

andreas.karlsson@percona.com

about 1 year ago

In reply to: Jeff Davis (#6)

Re: fixing tsearch locale support

On 12/13/24 6:07 PM, Jeff Davis wrote:

OK, that's fine with me for this patch series.

Eventually though, I think we should have built-in versions of these
ASCII functions. Even if there's no actual problem, it would more
clearly indicate that we only care about ASCII at that particular call
site, and eliminate questions about what libc might do on some platform
for some encoding/locale combination. It would also make it easier to
search for locale-sensitive functions in the codebase.

+1 I had exactly the same idea.

Andreas

Peter Eisentraut

peter_e@gmx.net

about 1 year ago

In reply to: Andreas Karlsson (#7)

Re: fixing tsearch locale support

On 17.12.24 16:25, Andreas Karlsson wrote:

On 12/13/24 6:07 PM, Jeff Davis wrote:

OK, that's fine with me for this patch series.

Eventually though, I think we should have built-in versions of these
ASCII functions. Even if there's no actual problem, it would more
clearly indicate that we only care about ASCII at that particular call
site, and eliminate questions about what libc might do on some platform
for some encoding/locale combination. It would also make it easier to
search for locale-sensitive functions in the codebase.

+1 I had exactly the same idea.

Yes, I think that could make sense.

Peter Eisentraut

peter_e@gmx.net

about 1 year ago

In reply to: Jeff Davis (#4)

Re: fixing tsearch locale support

On 12.12.24 19:20, Jeff Davis wrote:

On Mon, 2024-12-09 at 11:11 +0100, Peter Eisentraut wrote:

I have expanded this patch set. The first three patches are the same as
before. I have added a new patch that gets rid of lowerstr() from
ts_locale.c and replaces it with the standard str_tolower() that
everyone else is using.

+1 to the patch series.

Note: I posted a patch series to support case folding:

https://commitfest.postgresql.org/51/5436/

and we may want to use case folding for some of these purposes
internally. But this is not a blocker.

There is some kind of compatibility risk with these changes, so we will
need a release note. And we should try to get all the changes in one
release to avoid repeated small breakages.

I have committed this and made a note on the open items wiki page about
making a release note or similar.

I'll close this commitfest entry now and will come back with a new patch
series for t_isalpha/t_isalnum when the locale-provider-aware character
classification functions are available.

#10

Peter Eisentraut

peter_e@gmx.net

5 months ago

In reply to: Peter Eisentraut (#2)

Re: fixing tsearch locale support

On 09.12.24 11:11, Peter Eisentraut wrote:

lowerstr() and lowerstr_with_len() in ts_locale.c do the same thing as
str_tolower(), except that the former don't use the common locale
provider framework but instead use the global libc locale settings.

This patch replaces uses of lowerstr*() with str_tolower(...,
DEFAULT_COLLATION_OID). For instances that use a libc locale globally,
this will result in exactly the same behavior. For instances that use
other locale providers, you now get consistent behavior and are no
longer dependent on the libc locale settings.
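
As a rough illustration of the call-site change, the sketch below mirrors the widened wordop callback that readstoplist() takes after the patch: (string, length, collation OID) instead of a bare string. Everything here is a stand-in so that the example compiles outside the PostgreSQL tree: the Oid typedef, MOCK_DEFAULT_COLLATION_OID, mock_str_tolower() and normalize_word() are hypothetical, and the mock only folds ASCII letters, whereas the real str_tolower() chooses its behavior from the collation it is given.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int Oid;				/* stand-in for PostgreSQL's Oid */
#define MOCK_DEFAULT_COLLATION_OID 100	/* stand-in for DEFAULT_COLLATION_OID */

/* Callback type with the same shape as the new wordop parameter. */
typedef char *(*wordop_fn) (const char *, size_t, Oid);

/*
 * Mock with the same shape as str_tolower().  It folds ASCII letters only
 * and ignores collid; the caller frees the result.
 */
static char *
mock_str_tolower(const char *buf, size_t nbytes, Oid collid)
{
	char	   *out = malloc(nbytes + 1);
	size_t		i;

	(void) collid;				/* the real function selects a locale here */
	assert(out != NULL);
	for (i = 0; i < nbytes; i++)
		out[i] = (buf[i] >= 'A' && buf[i] <= 'Z')
			? (char) (buf[i] - 'A' + 'a') : buf[i];
	out[nbytes] = '\0';
	return out;
}

/*
 * Normalize one stop word: pass the default collation to the callback if
 * there is one.  Unlike readstoplist(), this sketch copies the word when no
 * callback is given, so the caller can always free() the result.
 */
static char *
normalize_word(const char *line, wordop_fn wordop)
{
	size_t		len = strlen(line);
	char	   *copy;

	if (wordop)
		return wordop(line, len, MOCK_DEFAULT_COLLATION_OID);

	copy = malloc(len + 1);
	assert(copy != NULL);
	memcpy(copy, line, len + 1);
	return copy;
}
```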

Most uses of these functions are for processing dictionary and
configuration files. In those cases, using the default collation seems
appropriate. At least we don't have a more specific collation
available. But the code in contrib/pg_trgm should really depend on the
collation of the columns being processed. This is not done here; it
can be done in a separate patch.

(You can probably construct some edge cases where this change would
create some locale-related upgrade incompatibility, for example if
before you used a combination of ICU and a differently-behaving libc
locale. We can document this in the release notes, but I don't think
there is anything more we can do about this.)

There is a PG18 open item to document this possible upgrade incompatibility.

I think the following text could be added to the release notes:

"""
The locale implementation underlying full-text search was improved. It
now observes the locale provider configured for the database. It was
previously hardcoded to use the configured libc LC_CTYPE setting. In
database clusters that use a locale provider other than libc and where
the locale configured through that locale provider behaves differently
from the LC_CTYPE setting configured for the database, this could cause
changes in behavior of some functions related to full-text search as
well as the pg_trgm extension. When upgrading such database clusters
using pg_upgrade, it is recommended to reindex all indexes related to
full-text search and pg_trgm after the upgrade.
"""

The commit reference is fb1a18810f0.

Thoughts?

#11

Daniel Verite

daniel@manitou-mail.org

5 months ago

In reply to: Peter Eisentraut (#10)

Re: fixing tsearch locale support

Peter Eisentraut wrote:

There is a PG18 open item to document this possible upgrade incompatibility.

I think the following text could be added to the release notes:

"""
The locale implementation underlying full-text search was improved. It
now observes the locale provider configured for the database. It was
previously hardcoded to use the configured libc LC_CTYPE setting
[...]

That sounds misleading because LC_CTYPE is still used in 18.

To illustrate in an ICU database, the parser will classify "Em Dash"
as a separator or not depending on LC_CTYPE.

with LC_CTYPE=C

=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
 alias |   token   |   lexemes
-------+-----------+-------------
 word  | ABCD—EFGH | {abcd—efgh}

with LC_CTYPE=en_US.utf8 (glibc 2.35):

=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
   alias   | token | lexemes
-----------+-------+---------
 asciiword | ABCD  | {abcd}
 blank     | —     |
 asciiword | EFGH  | {efgh}

OTOH lower casing uses LC_CTYPE in 17, but not in 18, leading
to better lexemes.

pg17, ICU locale, LC_CTYPE=C

=> select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
 alias | token | lexemes
-------+-------+---------
 word  | ÉTÉ   | {ÉtÉ}

pg18, ICU locale, LC_CTYPE=C

=> select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
 alias | token | lexemes
-------+-------+---------
 word  | ÉTÉ   | {été}
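
The pg17 result can be reproduced outside PostgreSQL. The following is a minimal sketch, assuming UTF-8 input and a process running in the C locale (the state of any C program that has not called setlocale()). It is modelled on the single-byte branch of the removed lowerstr_with_len(), which is the branch taken when the database LC_CTYPE is C; bytewise_lower() is a hypothetical name.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Lower-case a string one byte at a time with tolower(), as the non-wide
 * branch of the removed lowerstr_with_len() did.  The caller frees the
 * result.
 */
static char *
bytewise_lower(const char *str, size_t len)
{
	char	   *out = malloc(len + 1);
	size_t		i;

	assert(out != NULL);
	for (i = 0; i < len; i++)
	{
		/*
		 * In the C locale tolower() only maps 'A'..'Z'.  The two bytes of a
		 * UTF-8 encoded 'É' (0xC3 0x89) are returned unchanged.
		 */
		out[i] = (char) tolower((unsigned char) str[i]);
	}
	out[len] = '\0';
	return out;
}
```

Applied to the five UTF-8 bytes of 'ÉTÉ', only the ASCII 'T' is folded, which gives the {ÉtÉ} lexeme shown for pg17. In pg18 the same input goes through str_tolower() with the database's default collation, so an ICU locale can lower-case the accented letters as well.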

So maybe the release notes should say
"now observes the locale provider configured for the database to
convert strings to lower case".

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

#12

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

5 months ago

In reply to: Daniel Verite (#11)

Re: fixing tsearch locale support

On 18/08/2025 18:56, Daniel Verite wrote:

There is a PG18 open item to document this possible upgrade incompatibility.

I think the following text could be added to the release notes:

"""
The locale implementation underlying full-text search was improved. It
now observes the locale provider configured for the database. It was
previously hardcoded to use the configured libc LC_CTYPE setting
[...]

That sounds misleading because LC_CTYPE is still used in 18.

To illustrate in an ICU database, the parser will classify "Em Dash"
as a separator or not depending on LC_CTYPE.

with LC_CTYPE=C

=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
 alias |   token   |   lexemes
-------+-----------+-------------
 word  | ABCD—EFGH | {abcd—efgh}

with LC_CTYPE=en_US.utf8 (glibc 2.35):

=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
   alias   | token | lexemes
-----------+-------+---------
 asciiword | ABCD  | {abcd}
 blank     | —     |
 asciiword | EFGH  | {efgh}

OTOH lower casing uses LC_CTYPE in 17, but not in 18, leading
to better lexemes.

pg17, ICU locale, LC_CTYPE=C

=> select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
 alias | token | lexemes
-------+-------+---------
 word  | ÉTÉ   | {ÉtÉ}

pg18, ICU locale, LC_CTYPE=C

=> select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
 alias | token | lexemes
-------+-------+---------
 word  | ÉTÉ   | {été}

So maybe the release notes should say
"now observes the locale provider configured for the database to
convert strings to lower case".

Is it only used for converting to lower case, or are there any other
operations that need to be mentioned? Converting to upper case too, I
presume. (I haven't been following this thread.)

We only support two collation providers, libc and ICU, right? That makes
Peter's phrasing "In database clusters that use a locale provider other
than libc ..." an unnecessarily complicated way of saying ICU.

Putting those two changes together:

"""
The locale implementation underlying full-text search was improved. It
now observes the collation provider configured for the database for
converting strings to upper and lower case. It was previously hardcoded
to use libc. In databases that use the ICU collation provider and where
the configured ICU locale behaves differently from the LC_CTYPE setting
configured for the database, this could cause changes in behavior of
some functions related to full-text search as well as the pg_trgm
extension. When upgrading such database clusters using pg_upgrade, it
is recommended to reindex all indexes related to full-text search and
pg_trgm after the upgrade.
"""

I wonder if it's clear enough that this applies to full-text search, not
upper/lower case conversions in general. (Is that true?)

It's pretty urgent to get the release notes in shape, people are testing
upgrades with the betas already...

- Heikki

#13

Peter Eisentraut

peter_e@gmx.net

5 months ago

In reply to: Heikki Linnakangas (#12)

Re: fixing tsearch locale support

On 26.08.25 18:52, Heikki Linnakangas wrote:

So maybe the release notes should say
"now observes the locale provider configured for the database to
convert strings to lower case".

Is it only used for converting to lower case, or are there any other
operations that need to be mentioned? Converting to upper case too, I
presume. (I haven't been following this thread.)

It's actually only lower case. (It should really be casefold, but that
might be a separate project for another day.) But after reading this a
few times, just writing "for converting to lower case" led me to ask
"but what about upper case", so I reworded it to "case conversion".

We only support two collation providers, libc and ICU, right? That makes
Peter's phrasing "In database clusters that use a locale provider other
than libc ..." an unnecessarily complicated way of saying ICU.

There is the "builtin" provider, and it is affected by this as well.

It's pretty urgent to get the release notes in shape, people are testing
upgrades with the betas already...

I have committed this release note item with some adjustment now.