make tsearch use the database default locale

Started by Jeff Davis, 6 months ago. 5 messages. List: hackers
#1 Jeff Davis
pgsql@j-davis.com

tsvector and tsquery are not collatable types, but they do need locale
information to parse the original text. It would not do any good to
make it a collatable type, because a COLLATE clause would typically be
applied after the parsing is done.

Previously, tsearch used the database CTYPE for parsing, but that's not
good because it creates an unnecessary dependency on libc even when the
user has requested another provider.

This patch series allows tsearch to use the database default locale for
parsing. If the database collation is libc, there's no change.

Motivation:

(a) it reduces the dependence on setlocale(), which is not thread-safe;
(b) if a user is using the builtin or ICU providers, understanding
the effects of LC_CTYPE can be very confusing;
(c) it would allow us to test more of the tsearch parsing behavior.

Notes:

* Should have the exact same behavior as before if the database
locale provider is libc. If the database locale provider is builtin or
ICU, then there will be some differences in tsearch parsing behavior.

* Most of the patches are straightforward, but v1-0005 might need extra
attention. There are quite a few cases there with subtle distinctions,
and I might have missed something. For example, in the "C" locale,
tsearch treats non-ascii characters as alpha, even though the libc
functions do not do so (I preserved this behavior).

* This introduces redundancy between the character isxyz() functions in
regc_pg_locale.c and similar functions in pg_locale.c. It would be easy
enough to refactor to eliminate the redundancy, but that might have
performance implications, so I didn't do it yet.

Regards,
Jeff Davis

Attachments:

v1-0001-Rename-static-functions-pg_wc_xyz-to-regc_wc_xyz.patch (text/x-patch, +58 -59)
v1-0002-Add-pg_wc_xyz-exported-functions.patch (text/x-patch, +301 -145)
v1-0003-Add-pg_wc_isxdigit-useful-for-tsearch.patch (text/x-patch, +51 -1)
v1-0004-Add-pg_database_locale-to-retrieve-database-defau.patch (text/x-patch, +10 -1)
v1-0005-tsearch-use-database-default-collation-for-parsin.patch (text/x-patch, +27 -85)
v1-0006-Remove-obsolete-global-database_ctype_is_c.patch (text/x-patch, +0 -11)

#2 Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#1)
Re: make tsearch use the database default locale

On Tue, 2025-10-07 at 15:49 -0700, Jeff Davis wrote:

This patch series allows tsearch to use the database default locale for
parsing. If the database collation is libc, there's no change.

I committed a couple of the refactoring patches and rebased. v3
attached.

v3-0003 eliminates the "wstr" logic and uses only "pgwstr". I
was a bit confused why both were needed, as the purpose of pg_wchar is
to abstract away the problems with wchar_t. Perhaps it's historical, or
perhaps I missed something.

Regarding the risk of behavior changes: this affects parsing the
values, but not the interpretation of values after parsing, so the risk
of index inconsistencies seems low. There's risk that a document parsed
in the old version would be parsed differently in the new version,
though. Overall, it seems comparable to the risk of fb1a18810f.

Regards,
Jeff Davis

Attachments:

v3-0001-Add-pg_iswxdigit-useful-for-tsearch.patch (text/x-patch, +51 -1)
v3-0002-Add-pg_database_locale-to-retrieve-database-defau.patch (text/x-patch, +10 -1)
v3-0003-tsearch-use-database-default-collation-for-parsin.patch (text/x-patch, +27 -85)
v3-0004-Remove-obsolete-global-database_ctype_is_c.patch (text/x-patch, +0 -11)

#3 Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#1)
Re: make tsearch use the database default locale

On 08.10.25 00:49, Jeff Davis wrote:

Previously, tsearch used the database CTYPE for parsing, but that's not
good because it creates an unnecessary dependency on libc even when the
user has requested another provider.

This patch series allows tsearch to use the database default locale for
parsing. If the database collation is libc, there's no change.

This looks good to me overall.

* Most of the patches are straightforward, but v1-0005 might need extra
attention. There are quite a few cases there with subtle distinctions,
and I might have missed something. For example, in the "C" locale,
tsearch treats non-ascii characters as alpha, even though the libc
functions do not do so (I preserved this behavior).

This is indeed a bit mysterious. AFAICT, the behavior you describe is
conditional on if (prs->usewide), so it apparently depends also on the
encoding? I'm not sure if the new code covers this.

After this patch set, char2wchar() can become a local function in
pg_locale_libc.c. (But we still need wchar2char() externally, so maybe
it's not worth changing this (yes).)

#4 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#3)
Re: make tsearch use the database default locale

On Fri, 2025-10-17 at 18:15 +0200, Peter Eisentraut wrote:

This is indeed a bit mysterious. AFAICT, the behavior you describe is
conditional on if (prs->usewide), so it apparently depends also on the
encoding? I'm not sure if the new code covers this.

I believe the new code does cover this case:

Previously, the code was effectively:

    if (prs->usewide && prs->pgwstr != NULL && c > 0x7f)
        return nonascii;

and the new code is:

    if (prs->charmaxlen > 1 && locale->ctype_is_c && wc > 0x7f)
        return nonascii;

Unless I missed something, those are equivalent.
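
To make the equivalence claim concrete, here is a minimal standalone sketch (not PostgreSQL code) that evaluates both conditions over a few sample code points. The struct and function names are hypothetical stand-ins. It also rests on two assumptions about the old parser setup that are not spelled out in this thread: that usewide was set exactly when the encoding's maximum character length exceeds 1, and that pgwstr was non-NULL only when the database ctype was C.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the pre-patch parser state. */
typedef struct OldParserState
{
    bool                usewide;   /* assumed: true iff charmaxlen > 1 */
    const unsigned int *pgwstr;    /* assumed: non-NULL only for C ctype */
} OldParserState;

/* Hypothetical stand-in for the post-patch parser state. */
typedef struct NewParserState
{
    int  charmaxlen;               /* max bytes per character in the encoding */
    bool ctype_is_c;               /* stands in for locale->ctype_is_c */
} NewParserState;

/* The condition quoted as the old behavior. */
bool
old_is_nonascii(const OldParserState *prs, unsigned int c)
{
    return prs->usewide && prs->pgwstr != NULL && c > 0x7f;
}

/* The condition quoted as the new behavior. */
bool
new_is_nonascii(const NewParserState *prs, unsigned int wc)
{
    return prs->charmaxlen > 1 && prs->ctype_is_c && wc > 0x7f;
}

/*
 * Build matching old/new states for every combination of encoding width
 * and ctype, under the assumptions stated above, and count the inputs on
 * which the two conditions disagree.
 */
int
count_mismatches(void)
{
    static const unsigned int dummy[1] = {0};
    static const unsigned int samples[] = {0x00, 0x41, 0x7f, 0x80, 0xe9, 0x4e2d};
    int         mismatches = 0;

    for (int maxlen = 1; maxlen <= 4; maxlen++)
    {
        for (int is_c = 0; is_c <= 1; is_c++)
        {
            OldParserState oldprs;
            NewParserState newprs;

            oldprs.usewide = (maxlen > 1);
            oldprs.pgwstr = (oldprs.usewide && is_c) ? dummy : NULL;
            newprs.charmaxlen = maxlen;
            newprs.ctype_is_c = (is_c != 0);

            for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
            {
                if (old_is_nonascii(&oldprs, samples[i]) !=
                    new_is_nonascii(&newprs, samples[i]))
                    mismatches++;
            }
        }
    }
    return mismatches;
}
```

Under those assumptions count_mismatches() returns 0. If either assumption about how the old code populated usewide or pgwstr does not hold, the two conditions could diverge, so the sketch only restates the argument and does not prove it.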

After this patch set, char2wchar() can become a local function in
pg_locale_libc.c. (But we still need wchar2char() externally, so maybe
it's not worth changing this (yes).)

Done.

The rest of the patches are rebased with no other changes. I plan to
commit soon.

Regards,
Jeff Davis

Attachments:

v4-0001-tsearch-use-database-default-collation-for-parsin.patch (text/x-patch, +27 -85)
v4-0002-Remove-obsolete-global-database_ctype_is_c.patch (text/x-patch, +0 -11)
v4-0003-Make-char2wchar-static.patch (text/x-patch, +7 -9)

#5 Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#4)
Re: make tsearch use the database default locale

On 19.10.25 02:29, Jeff Davis wrote:

On Fri, 2025-10-17 at 18:15 +0200, Peter Eisentraut wrote:

This is indeed a bit mysterious. AFAICT, the behavior you describe is
conditional on if (prs->usewide), so it apparently depends also on the
encoding? I'm not sure if the new code covers this.

I believe the new code does cover this case:

Previously, the code was effectively:

    if (prs->usewide && prs->pgwstr != NULL && c > 0x7f)
        return nonascii;

and the new code is:

    if (prs->charmaxlen > 1 && locale->ctype_is_c && wc > 0x7f)
        return nonascii;

Unless I missed something, those are equivalent.

Yes, this looks ok.