fixing tsearch locale support
Infamously, the tsearch locale support in
src/backend/tsearch/ts_locale.c still depends on libc environment
variable locale settings and is not caught up with pg_locale_t,
collations, ICU, and all that newer stuff. This is used in the tsearch
facilities themselves, but also in other modules such as ltree, pg_trgm,
and unaccent.
Several of the functions are wrappers around <ctype.h> functions, like
int
t_isalpha(const char *ptr)
{
    int         clen = pg_mblen(ptr);
    wchar_t     character[WC_BUF_LEN];
    pg_locale_t mylocale = 0;   /* TODO */

    if (clen == 1 || database_ctype_is_c)
        return isalpha(TOUCHAR(ptr));

    char2wchar(character, WC_BUF_LEN, ptr, clen, mylocale);

    return iswalpha((wint_t) character[0]);
}
So this has multibyte and encoding awareness, but does not observe
locale provider or collation settings.
As an easy start toward fixing this, I think we don't even need several
of these functions.
t_isdigit() and t_isspace() are just used to parse various configuration
and data files, and surely we don't need encoding-dependent multibyte
support for parsing ASCII digits and ASCII spaces. At least, I didn't
find any indications in the documentation of these file formats that
they are supposed to support that kind of thing. So these can be
replaced by the normal isdigit() and isspace().
There is one call to t_isprint(), which is similarly used only to parse
some flags in a configuration file. From the surrounding code you can
deduce that it's only called on single-byte characters, so it can
similarly be replaced by plain isprint().
Note, pg_trgm has some compile-time options with macros such as
KEEPONLYALNUM and IGNORECASE. AFAICT, these are not documented, and the
non-default variant is not supported by any test cases. So as part of
this undertaking, I'm going to remove the non-default variants if they
are in the way of cleanup.
I have expanded this patch set. The first three patches are the same as
before. I have added a new patch that gets rid of lowerstr() from
ts_locale.c and replaces it with the standard str_tolower() that
everyone else is using.
lowerstr() and lowerstr_with_len() in ts_locale.c do the same thing as
str_tolower(), except that the former don't use the common locale
provider framework but instead use the global libc locale settings.
This patch replaces uses of lowerstr*() with str_tolower(...,
DEFAULT_COLLATION_OID). For instances that use a libc locale globally,
this will result in exactly the same behavior. For instances that use
other locale providers, you now get consistent behavior and are no
longer dependent on the libc locale settings.
Most uses of these functions are for processing dictionary and
configuration files. In those cases, using the default collation seems
appropriate. At least we don't have a more specific collation
available. But the code in contrib/pg_trgm should really depend on the
collation of the columns being processed. This is not done here; it can
be done in a separate patch.
(You can probably construct some edge cases where this change would
create some locale-related upgrade incompatibility, for example if
before you used a combination of ICU and a differently-behaving libc
locale. We can document this in the release notes, but I don't think
there is anything more we can do about this.)
After these patches, the only problematic things left in ts_locale.{c,h} are
extern int t_isalpha(const char *ptr);
extern int t_isalnum(const char *ptr);
My current assessment is that these are best addressed after patch [0],
because we need locale-provider-aware character classification functions.
[0]: /messages/by-id/2830211e1b6e6a2e26d845780b03e125281ea17b.camel@j-davis.com
On Mon, 2024-12-02 at 11:57 +0100, Peter Eisentraut wrote:
t_isdigit() and t_isspace() are just used to parse various configuration
and data files, and surely we don't need encoding-dependent multibyte
support for parsing ASCII digits and ASCII spaces.
... So these can be replaced by the normal isdigit() and isspace().
That would still call libc, and still depend on LC_CTYPE. Should we use
pure ASCII variants?
There was also some discussion about forcing LC_COLLATE and LC_CTYPE to
C, now that the default collation doesn't depend on them any more (cf.
option 1):
/messages/by-id/CA+hUKGL82jG2PdgfQtwWG+_51TQ--6M9XNa3rtt7ub+S3Pmfsw@mail.gmail.com
If we do that, then it would be fine to use isdigit/isspace.
Regards,
Jeff Davis
On Mon, 2024-12-09 at 11:11 +0100, Peter Eisentraut wrote:
I have expanded this patch set. The first three patches are the same as
before. I have added a new patch that gets rid of lowerstr() from
ts_locale.c and replaces it with the standard str_tolower() that
everyone else is using.
+1 to the patch series.
Note: I posted a patch series to support case folding:
https://commitfest.postgresql.org/51/5436/
and we may want to use case folding for some of these purposes
internally. But this is not a blocker.
There is some kind of compatibility risk with these changes, so we will
need a release note. And we should try to get all the changes in one
release to avoid repeated small breakages.
Regards,
Jeff Davis
On 12.12.24 19:14, Jeff Davis wrote:
On Mon, 2024-12-02 at 11:57 +0100, Peter Eisentraut wrote:
t_isdigit() and t_isspace() are just used to parse various configuration
and data files, and surely we don't need encoding-dependent multibyte
support for parsing ASCII digits and ASCII spaces.
... So these can be replaced by the normal isdigit() and isspace().

That would still call libc, and still depend on LC_CTYPE. Should we use
pure ASCII variants?
isdigit() and isspace() in particular are widely used throughout the
backend code without such concerns. I think the assumption is that this
is not a problem in practice: For multibyte encodings, these functions
would only be able to process the ASCII subset, and the character
classification of that should be consistent across all locales. For
single-byte encodings, among the encodings that PostgreSQL supports, I
don't think any of them actually provide non-ASCII digits or space
characters.
On Fri, 2024-12-13 at 07:16 +0100, Peter Eisentraut wrote:
isdigit() and isspace() in particular are widely used throughout the
backend code without such concerns. I think the assumption is that this
is not a problem in practice: For multibyte encodings, these functions
would only be able to process the ASCII subset, and the character
classification of that should be consistent across all locales. For
single-byte encodings, among the encodings that PostgreSQL supports, I
don't think any of them actually provide non-ASCII digits or space
characters.
OK, that's fine with me for this patch series.
Eventually though, I think we should have built-in versions of these
ASCII functions. Even if there's no actual problem, it would more
clearly indicate that we only care about ASCII at that particular call
site, and eliminate questions about what libc might do on some platform
for some encoding/locale combination. It would also make it easier to
search for locale-sensitive functions in the codebase.
Regards,
Jeff Davis
On 12/13/24 6:07 PM, Jeff Davis wrote:
OK, that's fine with me for this patch series.
Eventually though, I think we should have built-in versions of these
ASCII functions. Even if there's no actual problem, it would more
clearly indicate that we only care about ASCII at that particular call
site, and eliminate questions about what libc might do on some platform
for some encoding/locale combination. It would also make it easier to
search for locale-sensitive functions in the codebase.
+1 I had exactly the same idea.
Andreas
On 17.12.24 16:25, Andreas Karlsson wrote:
On 12/13/24 6:07 PM, Jeff Davis wrote:
OK, that's fine with me for this patch series.
Eventually though, I think we should have built-in versions of these
ASCII functions. Even if there's no actual problem, it would more
clearly indicate that we only care about ASCII at that particular call
site, and eliminate questions about what libc might do on some platform
for some encoding/locale combination. It would also make it easier to
search for locale-sensitive functions in the codebase.

+1 I had exactly the same idea.
Yes, I think that could make sense.
On 12.12.24 19:20, Jeff Davis wrote:
On Mon, 2024-12-09 at 11:11 +0100, Peter Eisentraut wrote:
I have expanded this patch set. The first three patches are the same as
before. I have added a new patch that gets rid of lowerstr() from
ts_locale.c and replaces it with the standard str_tolower() that
everyone else is using.

+1 to the patch series.
Note: I posted a patch series to support case folding:
https://commitfest.postgresql.org/51/5436/
and we may want to use case folding for some of these purposes
internally. But this is not a blocker.

There is some kind of compatibility risk with these changes, so we will
need a release note. And we should try to get all the changes in one
release to avoid repeated small breakages.
I have committed this and made a note on the open items wiki page about
making a release note or similar.
I'll close this commitfest entry now and will come back with a new patch
series for t_isalpha/t_isalnum when the locale-provider-aware character
classification functions are available.
On 09.12.24 11:11, Peter Eisentraut wrote:
lowerstr() and lowerstr_with_len() in ts_locale.c do the same thing as
str_tolower(), except that the former don't use the common locale
provider framework but instead use the global libc locale settings.

This patch replaces uses of lowerstr*() with str_tolower(...,
DEFAULT_COLLATION_OID). For instances that use a libc locale globally,
this will result in exactly the same behavior. For instances that use
other locale providers, you now get consistent behavior and are no
longer dependent on the libc locale settings.

Most uses of these functions are for processing dictionary and
configuration files. In those cases, using the default collation seems
appropriate. At least we don't have a more specific collation
available. But the code in contrib/pg_trgm should really depend on the
collation of the columns being processed. This is not done here; it can
be done in a separate patch.

(You can probably construct some edge cases where this change would
create some locale-related upgrade incompatibility, for example if
before you used a combination of ICU and a differently-behaving libc
locale. We can document this in the release notes, but I don't think
there is anything more we can do about this.)
There is a PG18 open item to document this possible upgrade incompatibility.
I think the following text could be added to the release notes:
"""
The locale implementation underlying full-text search was improved. It
now observes the locale provider configured for the database. It was
previously hardcoded to use the configured libc LC_CTYPE setting. In
database clusters that use a locale provider other than libc and where
the locale configured through that locale provider behaves differently
from the LC_CTYPE setting configured for the database, this could cause
changes in behavior of some functions related to full-text search as
well as the pg_trgm extension. When upgrading such database clusters
using pg_upgrade, it is recommended to reindex all indexes related to
full-text search and pg_trgm after the upgrade.
"""
The commit reference is fb1a18810f0.
Thoughts?
Peter Eisentraut wrote:
There is a PG18 open item to document this possible upgrade incompatibility.
I think the following text could be added to the release notes:
"""
The locale implementation underlying full-text search was improved. It
now observes the locale provider configured for the database. It was
previously hardcoded to use the configured libc LC_CTYPE setting
[...]
That sounds misleading because LC_CTYPE is still used in 18.
To illustrate in an ICU database, the parser will classify "Em Dash"
as a separator or not depending on LC_CTYPE.
with LC_CTYPE=C
=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
alias | token | lexemes
-------+-----------+-------------
word | ABCD—EFGH | {abcd—efgh}
with LC_CTYPE=en_US.utf8 (glibc 2.35):
=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
alias | token | lexemes
-----------+-------+---------
asciiword | ABCD | {abcd}
blank | — |
asciiword | EFGH | {efgh}
OTOH lower casing uses LC_CTYPE in 17, but not in 18, leading
to better lexemes.
pg17, ICU locale, LC_CTYPE=C
=> select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
alias | token | lexemes
-------+-------+---------
word | ÉTÉ | {ÉtÉ}
pg18, ICU locale, LC_CTYPE=C
select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
alias | token | lexemes
-------+-------+---------
word | ÉTÉ | {été}
So maybe the release notes should say
"now observes the locale provider configured for the database to
convert strings to lower case".
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
On 18/08/2025 18:56, Daniel Verite wrote:
There is a PG18 open item to document this possible upgrade incompatibility.
I think the following text could be added to the release notes:
"""
The locale implementation underlying full-text search was improved. It
now observes the locale provider configured for the database. It was
previously hardcoded to use the configured libc LC_CTYPE setting
[...]

That sounds misleading because LC_CTYPE is still used in 18.
To illustrate in an ICU database, the parser will classify "Em Dash"
as a separator or not depending on LC_CTYPE.

with LC_CTYPE=C
=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
 alias |   token   |   lexemes
-------+-----------+-------------
 word  | ABCD—EFGH | {abcd—efgh}

with LC_CTYPE=en_US.utf8 (glibc 2.35):
=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
   alias   | token | lexemes
-----------+-------+---------
 asciiword | ABCD  | {abcd}
 blank     | —     |
 asciiword | EFGH  | {efgh}

OTOH lower casing uses LC_CTYPE in 17, but not in 18, leading
to better lexemes.

pg17, ICU locale, LC_CTYPE=C
=> select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
 alias | token | lexemes
-------+-------+---------
 word  | ÉTÉ   | {ÉtÉ}

pg18, ICU locale, LC_CTYPE=C
select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
 alias | token | lexemes
-------+-------+---------
 word  | ÉTÉ   | {été}

So maybe the release notes should say
"now observes the locale provider configured for the database to
convert strings to lower case".
Is it only used for converting to lower case, or are there any other
operations that need to be mentioned? Converting to upper case too, I
presume. (I haven't been following this thread)
We only support two collation providers, libc and ICU right? That makes
Peter's phrasing "In database clusters that use a locale provider other
than libc ..." an unnecessarily complicated way of saying ICU.
Putting those two changes together:
"""
The locale implementation underlying full-text search was improved. It
now observes the collation provider configured for the database for
converting strings to upper and lower case. It was previously hardcoded
to use libc. In databases that use the ICU collation provider and where
the configured ICU locale behaves differently from the LC_CTYPE setting
configured for the database, this could cause changes in behavior of
some functions related to full-text search as well as the pg_trgm
extension. When upgrading such database clusters using pg_upgrade, it
is recommended to reindex all indexes related to full-text search and
pg_trgm after the upgrade.
"""
I wonder if it's clear enough that this applies to full-text search, not
upper/lower case conversions in general. (Is that true?)
It's pretty urgent to get the release notes in shape, people are testing
upgrades with the betas already...
- Heikki
On 26.08.25 18:52, Heikki Linnakangas wrote:
So maybe the release notes should say
"now observes the locale provider configured for the database to
convert strings to lower case".

Is it only used for converting to lower case, or are there any other
operations that need to be mentioned? Converting to upper case too, I
presume. (I haven't been following this thread)
It's actually only lower case. (It should really be casefold, but that
might be a separate project for another day.) But after reading this a
few times, just writing "for converting to lower case" led me to ask
"but what about upper case", so I reworded it to "case conversion".
We only support two collation providers, libc and ICU right? That makes
Peter's phrasing "In database clusters that use a locale provider other
than libc ..." an unnecessarily complicated way of saying ICU.
There is the "builtin" provider, and it is affected by this as well.
It's pretty urgent to get the release notes in shape, people are testing
upgrades with the betas already...
I have committed this release note item with some adjustment now.