Refactor: allow pg_strncoll(), etc., to accept -1 length for NUL-terminated cstrings.

Started by Jeff Davisalmost 2 years ago6 messageshackers
Jump to latest
#1Jeff Davis
pgsql@j-davis.com

Like ICU, allow -1 length to mean that the input string is NUL-
terminated for pg_strncoll(), pg_strnxfrm(), and pg_strnxfrm_prefix().

This simplifies the API and code a bit.

Along with some other refactoring in this area, we are getting close to
the point where the collation provider can just be a table of methods,
which means we can add an extension hook to provide a different method
table. That still requires more work, I'm just mentioning it here for
context.

Regards,
Jeff Davis

Attachments:

v1-0001-Allow-length-1-for-NUL-terminated-input-to-pg_str.patchtext/x-patch; charset=UTF-8; name=v1-0001-Allow-length-1-for-NUL-terminated-input-to-pg_str.patchDownload+64-123
#2Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#1)
Re: Refactor: allow pg_strncoll(), etc., to accept -1 length for NUL-terminated cstrings.

On Thu, 2024-08-22 at 11:00 -0700, Jeff Davis wrote:

Like ICU, allow -1 length to mean that the input string is NUL-
terminated for pg_strncoll(), pg_strnxfrm(), and
pg_strnxfrm_prefix().

To better illustrate the direction I'm going, I roughly implemented
some patches that implement collation using a table of methods rather
than lots branching based on the provider.

This more cleanly separates the API for a provider, which will enable
us to use a hook to create a custom provider with arbitrary methods,
that may have nothing to do with ICU or libc. Or, we could go so far as
to implement a "CREATE LOCALE PROVIDER" that would provide the methods
using a handler function, and "datlocprovider" would be an OID rather
than a char.

From a practical perspective, I expect that extensions would use this
to lock down the version of a particular provider rather than implement
a completely arbitrary one. But the API is good for either case, and
offers quite a bit of code cleanup.

There are quite a few loose ends, of course:

* There is still a lot of branching on the provider for DDL and
catalog access. I'm not sure if we will ever eliminate all of this, or
if we would even want to.

* I haven't done anything with get_collation_actual_version().
Perhaps that should be a method, too, but it requires some extra
thought if we want this to be useful for "multilib" (having multiple
versions of a provider library at once).

* I didn't add methods for formatting.c yet.

* initdb -- should it offer a way to preload a library and then use
that for the provider?

* I need to allow an arbitrary per-provider context, rather than the
current union designed for the existing providers.

Again, the patches are rough and there's a lot of code churn. I'd like
some feedback on whether people generally like the direction this is
going. If so I will clean up the patch series into smaller, more
reviewable chunks.

Regards,
Jeff Davis

Attachments:

v4-0001-Tighten-up-make_libc_collator-and-make_icu_collat.patchtext/x-patch; charset=UTF-8; name=v4-0001-Tighten-up-make_libc_collator-and-make_icu_collat.patchDownload+80-51
v4-0007-Use-method-table-for-collation.patchtext/x-patch; charset=UTF-8; name=v4-0007-Use-method-table-for-collation.patchDownload+1727-1513
v4-0006-Allow-length-1-for-NUL-terminated-input-to-pg_str.patchtext/x-patch; charset=UTF-8; name=v4-0006-Allow-length-1-for-NUL-terminated-input-to-pg_str.patchDownload+104-161
v4-0005-invalidation.patchtext/x-patch; charset=UTF-8; name=v4-0005-invalidation.patchDownload+33-9
v4-0004-resource-owners.patchtext/x-patch; charset=UTF-8; name=v4-0004-resource-owners.patchDownload+72-3
v4-0003-CollationCacheContext.patchtext/x-patch; charset=UTF-8; name=v4-0003-CollationCacheContext.patchDownload+8-9
v4-0002-create_pg_locale.patchtext/x-patch; charset=UTF-8; name=v4-0002-create_pg_locale.patchDownload+155-156
#3Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#1)
Re: Refactor: allow pg_strncoll(), etc., to accept -1 length for NUL-terminated cstrings.

Hi,

On 2024-08-22 11:00:54 -0700, Jeff Davis wrote:

Like ICU, allow -1 length to mean that the input string is NUL-
terminated for pg_strncoll(), pg_strnxfrm(), and pg_strnxfrm_prefix().

This simplifies the API and code a bit.

I don't really like this. I was hacking on a patch that uses compiler
annotations to tell the compiler what range of memory a function access. The
compiler then can use that knowledge to give you both compile-time warnings
and, more importantly, it makes ubsan much more accurate. It'll e.g. often be
able to warn you if a function accesses more memory than its annotation would
suggest, even if the memory is part of a larger memory allocation (something
asan, valgrind etc can't warn about, yet are often the most security critical
issues). I found a bunch of issues that way already.

But the annotations can't work if the access size is sometimes is -1.

I also don't find this very convincing code-wise. You end up with lots of
branches for -1. You have to support cases where one of the arguments is
specifies as -1 and the other one with a real length, even though that's
presumably a non-existing case.

It seems reasonable to want the more efficient path for zero terminated
strings with libc, but it seems like if we want that, we should add add a
collate_method->strcoll, rather than have a strncoll that's not actually
strncoll but strcoll.

Greetings,

Andres Freund

#4Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#3)
Re: Refactor: allow pg_strncoll(), etc., to accept -1 length for NUL-terminated cstrings.

On Fri, 2026-05-01 at 12:40 -0400, Andres Freund wrote:

Hi,

On 2024-08-22 11:00:54 -0700, Jeff Davis wrote:

Like ICU, allow -1 length to mean that the input string is NUL-
terminated for pg_strncoll(), pg_strnxfrm(), and
pg_strnxfrm_prefix().

This simplifies the API and code a bit.

I don't really like this.

Agreed. I did this to match up with the ICU API a bit better, but if
it's interfering with useful tools, then the special cases aren't worth
it.

Patch attached. It causes a bit of churn, so one disadvantage is that
it will complicate future backports in this area.

Regards,
Jeff Davis

Attachments:

v1-0001-Don-t-accept-length-of-1-in-pg_locale.h-APIs.patchtext/x-patch; charset=UTF-8; name=v1-0001-Don-t-accept-length-of-1-in-pg_locale.h-APIs.patchDownload+296-254
#5Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#4)
Re: Refactor: allow pg_strncoll(), etc., to accept -1 length for NUL-terminated cstrings.

On Tue, 2026-05-05 at 13:23 -0700, Jeff Davis wrote:

Patch attached. It causes a bit of churn, so one disadvantage is that
it will complicate future backports in this area.

I plan to commit this soon.

Regards,
Jeff Davis

#6Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#4)
Re: Refactor: allow pg_strncoll(), etc., to accept -1 length for NUL-terminated cstrings.

Hi,

On 2026-05-05 13:23:12 -0700, Jeff Davis wrote:

Agreed. I did this to match up with the ICU API a bit better, but if
it's interfering with useful tools, then the special cases aren't worth
it.

Patch attached.

Thanks!

It causes a bit of churn, so one disadvantage is that it will complicate
future backports in this area.

I think it's worth the gain in instrument-ability. I also suspect it's good
for runtime performance, adding all those branches can't be particularly good.

Greetings,

Andres Freund