Rework of collation code, extensibility

Started by Jeff Davisover 3 years ago24 messageshackers

pgsql@j-davis.com

over 3 years ago

Attached is a new patch series. I think there are enough changes that
this has become more of a "rework" of the collation code rather than
just a refactoring. This is a continuation of some prior work[1]/messages/by-id/99aa79cceefd1fe84fda23510494b8fbb7ad1e70.camel@j-davis.com[2]/messages/by-id/c4fda90ec6a7568a896f243a38eb273c3b5c3d93.camel@j-davis.com in
a new thread given its new scope.

Benefits:

1. Clearer division of responsibilities.
2. More consistent between libc and ICU providers.
3. Hooks that allow extensions to replace collation provider libraries.
4. New tests for the collation provider library hooks.

There are a lot of changes, and still some loose ends, but I believe a
few of these patches are close to ready.

This set of changes does not express an opinion on how we might want to
support multiple provider libraries in core; but whatever we choose, it
should be easier to accomplish. Right now, the hooks have limited
information on which to make the choice for a specific version of a
collation provider library, but that's because there's limited
information in the catalog. If the discussion here[3]/messages/by-id/CA+hUKGLEqMhnpZrgAcisoUeYFGz8W6EWdhtK2h-4QN0iOSFRqw@mail.gmail.com concludes in
adding collation provider library or library version information to the
catalog, we can add additional parameters to the hooks.

[1]: /messages/by-id/99aa79cceefd1fe84fda23510494b8fbb7ad1e70.camel@j-davis.com
/messages/by-id/99aa79cceefd1fe84fda23510494b8fbb7ad1e70.camel@j-davis.com
[2]: /messages/by-id/c4fda90ec6a7568a896f243a38eb273c3b5c3d93.camel@j-davis.com
/messages/by-id/c4fda90ec6a7568a896f243a38eb273c3b5c3d93.camel@j-davis.com
[3]: /messages/by-id/CA+hUKGLEqMhnpZrgAcisoUeYFGz8W6EWdhtK2h-4QN0iOSFRqw@mail.gmail.com
/messages/by-id/CA+hUKGLEqMhnpZrgAcisoUeYFGz8W6EWdhtK2h-4QN0iOSFRqw@mail.gmail.com

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Ted Yu

yuzhihong@gmail.com

over 3 years ago

In reply to: Jeff Davis (#1)

Re: Rework of collation code, extensibility

On Sat, Dec 17, 2022 at 7:14 PM Jeff Davis <pgsql@j-davis.com> wrote:

Attached is a new patch series. I think there are enough changes that
this has become more of a "rework" of the collation code rather than
just a refactoring. This is a continuation of some prior work[1][2] in
a new thread given its new scope.

Benefits:

1. Clearer division of responsibilities.
2. More consistent between libc and ICU providers.
3. Hooks that allow extensions to replace collation provider libraries.
4. New tests for the collation provider library hooks.

There are a lot of changes, and still some loose ends, but I believe a
few of these patches are close to ready.

This set of changes does not express an opinion on how we might want to
support multiple provider libraries in core; but whatever we choose, it
should be easier to accomplish. Right now, the hooks have limited
information on which to make the choice for a specific version of a
collation provider library, but that's because there's limited
information in the catalog. If the discussion here[3] concludes in
adding collation provider library or library version information to the
catalog, we can add additional parameters to the hooks.

[1]

/messages/by-id/99aa79cceefd1fe84fda23510494b8fbb7ad1e70.camel@j-davis.com
[2]

/messages/by-id/c4fda90ec6a7568a896f243a38eb273c3b5c3d93.camel@j-davis.com
[3]

/messages/by-id/CA+hUKGLEqMhnpZrgAcisoUeYFGz8W6EWdhtK2h-4QN0iOSFRqw@mail.gmail.com

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Hi,

For pg_strxfrm_libc in v4-0002-Add-pg_strxfrm-and-pg_strxfrm_prefix.patch:

+#ifdef HAVE_LOCALE_T
+       if (locale)
+               return strxfrm_l(dest, src, destsize, locale->info.lt);
+       else
+#endif
+               return strxfrm(dest, src, destsize);

It seems the `else` is not needed (since when the if branch is taken, we
return from the func).

+ /* nul-terminate arguments */

nul-terminate -> null-terminate

For pg_strnxfrm(), I think `result` can be removed - we directly return the
result from pg_strnxfrm_libc or pg_strnxfrm_icu.

Cheers

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: Ted Yu (#2)

Re: Rework of collation code, extensibility

On Sun, Dec 18, 2022 at 10:28 AM Ted Yu <yuzhihong@gmail.com> wrote:

It seems the `else` is not needed (since when the if branch is taken, we

return from the func).

By that same logic, this review comment is not needed, since compiler
vendors don't charge license fees by the number of keywords. ;-)
Joking aside, we don't really have a project style preference for this case.

nul-terminate -> null-terminate

NUL is a common abbreviation for the zero byte (but not for zero pointers).
See the ascii manpage.

--
John Naylor
EDB: http://www.enterprisedb.com

Ted Yu

yuzhihong@gmail.com

over 3 years ago

In reply to: John Naylor (#3)

Re: Rework of collation code, extensibility

On Sat, Dec 17, 2022 at 8:54 PM John Naylor <john.naylor@enterprisedb.com>
wrote:

nul-terminate -> null-terminate

NUL is a common abbreviation for the zero byte (but not for zero
pointers). See the ascii manpage.

--
John Naylor
EDB: http://www.enterprisedb.com

Ah.

`nul-terminated` does appear in the codebase.
Should have checked earlier.

Jeff Davis

pgsql@j-davis.com

over 3 years ago

In reply to: Jeff Davis (#1)

Re: Rework of collation code, extensibility

On Sat, 2022-12-17 at 19:14 -0800, Jeff Davis wrote:

Attached is a new patch series. I think there are enough changes that
this has become more of a "rework" of the collation code rather than
just a refactoring. This is a continuation of some prior work[1][2]
in
a new thread given its new scope.

Here's version 5. There are a number of fixes, and better tests, and
it's passing in CI.

The libc hook support is still experimental, but what's working is
passing in CI, even on windows. The challenges with libc hook support
are:

* It obviously doesn't replace all of libc, so the separation is not
as clean and there are a number of callers throughout the code that
don't necessarily care about specific collations.

* libc relies on setlocale() / uselocale(), which is global state and
not as easy to track.

* More platform issues (obviously) and harder to test.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Peter Eisentraut

peter_e@gmx.net

over 3 years ago

In reply to: Jeff Davis (#5)

Re: Rework of collation code, extensibility

On 22.12.22 06:40, Jeff Davis wrote:

On Sat, 2022-12-17 at 19:14 -0800, Jeff Davis wrote:

Attached is a new patch series. I think there are enough changes that
this has become more of a "rework" of the collation code rather than
just a refactoring. This is a continuation of some prior work[1][2]
in
a new thread given its new scope.

Here's version 5. There are a number of fixes, and better tests, and
it's passing in CI.

The libc hook support is still experimental, but what's working is
passing in CI, even on windows. The challenges with libc hook support
are:

* It obviously doesn't replace all of libc, so the separation is not
as clean and there are a number of callers throughout the code that
don't necessarily care about specific collations.

* libc relies on setlocale() / uselocale(), which is global state and
not as easy to track.

* More platform issues (obviously) and harder to test.

I'm confused by this patch set.

It combines some refactoring that was previously posted with partial
support for multiple ICU libraries with partial support for some new
hooks. Shouldn't those be three separate threads? I think the multiple
ICU libraries already does have a separate thread; how does this relate
to that work? I don't know what the hooks are supposed to be for? What
other locale libraries are you thinking about using this way? How can
we asses whether these interfaces are sufficient for that? The
refactoring patches don't look convincing just by looking at the numbers:

3 files changed, 406 insertions(+), 247 deletions(-)
6 files changed, 481 insertions(+), 150 deletions(-)
12 files changed, 400 insertions(+), 323 deletions(-)

My sense is this is trying to do too many things at once, and those
things are each not fully developed yet.

Jeff Davis

pgsql@j-davis.com

over 3 years ago

In reply to: Peter Eisentraut (#6)

Re: Rework of collation code, extensibility

On Wed, 2023-01-04 at 22:46 +0100, Peter Eisentraut wrote:

It combines some refactoring that was previously posted with partial
support for multiple ICU libraries with partial support for some new
hooks. Shouldn't those be three separate threads?

Originally they felt more separate to me, too; but as I worked on them
it seemed better to consider them as a patch series. Whatever is easier
for reviewers works for me, though.

I think the multiple
ICU libraries already does have a separate thread; how does this
relate
to that work?

Multilib ICU support adds complexity, and my hope is that this patch
set cleans up and organizes things to better prepare for that
complexity.

I don't know what the hooks are supposed to be for?

I found them very useful for testing during development. One of the
patches adds a test module for the ICU hook, and I think that's a
valuable place to test regardless of whether any other extension uses
the hook. Also, if proper multilib support doesn't land in 16, then the
hooks could be a way to build rudimentary multilib support (or at least
some kind of ICU version lockdown) until it does land.

When Thomas's work is in place, I expect the hooks to change slightly.
The hooks are not meant to set any specific API in stone.

What
other locale libraries are you thinking about using this way? How
can
we asses whether these interfaces are sufficient for that?

I'm not considering any other locale libraries, nor did I see much
discussion of that.

The
refactoring patches don't look convincing just by looking at the
numbers:

3 files changed, 406 insertions(+), 247 deletions(-)
6 files changed, 481 insertions(+), 150 deletions(-)
12 files changed, 400 insertions(+), 323 deletions(-)

The existing code is not great, in my opinion: it doesn't have clear
API boundaries, the comments are insufficient, and lots of special
cases need to be handled awkwardly by callers. That style is hard to
beat when it comes to the raw line count; but it's quite difficult to
understand and work on.

I think my changes are an improvement, but obviously that depends on
the opinion of others who are working in this part of the code. What do
you think?

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Peter Eisentraut

peter_e@gmx.net

over 3 years ago

In reply to: Jeff Davis (#7)

Re: Rework of collation code, extensibility

On 06.01.23 08:04, Jeff Davis wrote:

The existing code is not great, in my opinion: it doesn't have clear
API boundaries, the comments are insufficient, and lots of special
cases need to be handled awkwardly by callers. That style is hard to
beat when it comes to the raw line count; but it's quite difficult to
understand and work on.

I think my changes are an improvement, but obviously that depends on
the opinion of others who are working in this part of the code. What do
you think?

I think the refactoring that you proposed in the thread "Refactor to
introduce pg_strcoll()." was on a sensible track. Maybe we should try
to get that done. The multiple-ICU stuff is still experimental and has
its own rather impressive thread, so I don't think it's sensible to try
to sort that out here.

vignesh C

vignesh21@gmail.com

over 3 years ago

In reply to: Jeff Davis (#5)

Re: Rework of collation code, extensibility

On Thu, 22 Dec 2022 at 11:11, Jeff Davis <pgsql@j-davis.com> wrote:

On Sat, 2022-12-17 at 19:14 -0800, Jeff Davis wrote:

Attached is a new patch series. I think there are enough changes that
this has become more of a "rework" of the collation code rather than
just a refactoring. This is a continuation of some prior work[1][2]
in
a new thread given its new scope.

Here's version 5. There are a number of fixes, and better tests, and
it's passing in CI.

The libc hook support is still experimental, but what's working is
passing in CI, even on windows. The challenges with libc hook support
are:

* It obviously doesn't replace all of libc, so the separation is not
as clean and there are a number of callers throughout the code that
don't necessarily care about specific collations.

* libc relies on setlocale() / uselocale(), which is global state and
not as easy to track.

* More platform issues (obviously) and harder to test.

The patch does not apply on top of HEAD as in [1]http://cfbot.cputube.org/patch_41_4058.log, please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
c971a5b27ac946e7c94f7f655d321279512c7ee7 ===
=== applying patch ./v5-0003-Refactor-pg_locale_t-routines.patch
....
Hunk #1 FAILED at 88.
...
1 out of 9 hunks FAILED -- saving rejects to file
src/backend/utils/adt/formatting.c.rej
patching file src/backend/utils/adt/like.c
Hunk #1 FAILED at 24.
Hunk #2 succeeded at 97 (offset 1 line).
1 out of 2 hunks FAILED -- saving rejects to file
src/backend/utils/adt/like.c.rej

[1]: http://cfbot.cputube.org/patch_41_4058.log

Regards,
Vignesh

#10

Jeff Davis

pgsql@j-davis.com

over 3 years ago

In reply to: Jeff Davis (#5)

Re: Rework of collation code, extensibility

On Wed, 2022-12-21 at 21:40 -0800, Jeff Davis wrote:

Here's version 5. There are a number of fixes, and better tests, and
it's passing in CI.

Attached trivial rebase as v6.

The libc hook support is still experimental

Patches 0006 and 0007 should still be considered experimental and don't
require review right now.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#11

Jeff Davis

pgsql@j-davis.com

over 3 years ago

In reply to: Peter Eisentraut (#8)

Re: Rework of collation code, extensibility

On Wed, 2023-01-11 at 15:08 +0100, Peter Eisentraut wrote:

I think the refactoring that you proposed in the thread "Refactor to
introduce pg_strcoll()." was on a sensible track. Maybe we should
try
to get that done.

Those should be patches 0001-0003 in this thread (now at v6), which are
all pure refactoring.

Let's consider those patches the topic of this thread and I'll move
0004-0007 back to the multi-lib ICU thread on the next revision.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#12

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Jeff Davis (#10)

Re: Rework of collation code, extensibility

On Wed, Jan 11, 2023 at 3:44 PM Jeff Davis <pgsql@j-davis.com> wrote:

Attached trivial rebase as v6.

Some review comments for this v6.

Comments on 0001-*:

* I think that 0002-* can be squashed with 0001-*, since there isn't
any functional reason why you'd want to commit the strcoll() and
strxfrm() changes separately.

Sometimes it can be useful to break things up, despite the fact that
it couldn't possibly make sense to commit just one of the resulting
patches on its own. However, I don't think that that's appropriate
here. There is no apparent conceptual boundary that you're
highlighting by splitting things up like this. strxfrm() and strcoll()
are defined in terms of each other -- they're siblings, joined at the
hip -- so this seems kinda jarring.

* Your commit message for 0001 (and other patches) don't set things up
by describing what the point is, and what the work anticipates. I
think that they should do that.

You're adding a layer of indirection that's going to set things up for
later patches that add a layer of indirection for version ICU
libraries (and even libc itself), and some of the details only make
sense in that context. This isn't just refactoring work that could
just as easily have happened in some quite different context.

* I'm not sure that pg_strcoll() should be breaking ties itself. We
break ties using strcmp() for historical reasons, but must not do that
for deterministic ICU collations, which may be obscured.

That means that pg_strcoll()'s relationship to pg_strxfrm()'s isn't
the same as the well known strcoll()/strxfrm() relationship. That kind
of makes pg_strcoll() somewhat more than a strcoll() shim, which is
inconsistent. Another concern is that the deterministic collation
handling isn't handled in any one layer, which would have been nice.

Do we need to do things this way? What's it adding?

* varstrfastcmp_locale() is no longer capable of calling
ucol_strcollUTF8() through the shim interface, meaning that it has to
determine string length based on NUL-termination, when in principle it
could just use the known length of the string.

Presumably this might have performance implications. Have you thought
about that?

Some comments on 0002-*:

* I don't see much point in this new varstr_abbrev_convert() variable:

+ const size_t max_prefix_bytes = sizeof(Datum);

varstr_abbrev_convert() is concerned with packing abbreviated key
bytes into Datums, so it's perfectly reasonable to deal with
Datums/sizeof(Datum) directly.

* Having a separate pg_strxfrm_prefix_libc() function just to throw an
error doesn't really add much IMV.

Comments on 0003-*:

I suggest that some of the very trivial functions you have here (such
as pg_locale_deterministic()) be made inline functions.

Comments on 0006-*:

* get_builtin_libc_library() could be indented in a way that would
make it easier to understand.

--
Peter Geoghegan

#13

Jeff Davis

pgsql@j-davis.com

over 3 years ago

In reply to: Peter Geoghegan (#12)

Re: Rework of collation code, extensibility

On Fri, 2023-01-13 at 11:57 -0800, Peter Geoghegan wrote:

You're adding a layer of indirection that's going to set things up
for
later patches that add a layer of indirection for version ICU
libraries (and even libc itself), and some of the details only make
sense in that context. This isn't just refactoring work that could
just as easily have happened in some quite different context.

Right, well put. I have two goals and felt that they merged into one
patchset, but I think that caused more confusion.

The first goal I had was simply that the code was really hard to
understand and work on, and refactoring was justified to improve the
situation.

The second goal, which is somewhat dependent on the first goal, is that
we really need an ability to support multiple ICU libraries, and I
wanted to do some common groundwork that would be needed for any
approach we choose there, and provide some hooks to get us there. You
are right that this goal influenced the first goal.

I attached new patches:

v7-0001: pg_strcoll and pg_strxfrm patches combined, your comments
addressed
v7-0002: add pg_locale_internal.h (and other refactoring)

I will post the other patches in the other thread.

That means that pg_strcoll()'s relationship to pg_strxfrm()'s isn't
the same as the well known strcoll()/strxfrm() relationship.

That's a really good point. I changed tiebreaking to be the caller's
responsibility.

* varstrfastcmp_locale() is no longer capable of calling
ucol_strcollUTF8() through the shim interface, meaning that it has to
determine string length based on NUL-termination, when in principle
it
could just use the known length of the string.

I think you misread, it still calls ucol_strcollUTF8() when applicable,
which is impoartant because otherwise it would require a conversion to
a UChar string.

ucol_strcollUTF8() accepts -1 to mean "nul-terminated". I did some
basic testing and it doesn't seem like it's slower than using the
length. If passing the length is faster for some reason, it would
complicate the API because we'd need an entry point that's expecting
nul-termination and lengths, which is awkward (as Peter E. pointed
out).

* I don't see much point in this new varstr_abbrev_convert()
variable:

+ const size_t max_prefix_bytes = sizeof(Datum);

varstr_abbrev_convert() is concerned with packing abbreviated key
bytes into Datums, so it's perfectly reasonable to deal with
Datums/sizeof(Datum) directly.

I felt it was a little clearer amongst the other code, to a casual
reader, but I suppose it's a style thing. I will change it if you
insist.

* Having a separate pg_strxfrm_prefix_libc() function just to throw
an
error doesn't really add much IMV.

Removed.

Comments on 0003-*:

I suggest that some of the very trivial functions you have here (such
as pg_locale_deterministic()) be made inline functions.

I'd have to expose the pg_locale_t struct, which didn't seem desirable
to me. Do you think it's enough of a performance concern to be worth
some ugliness there?

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#14

Peter Geoghegan

pg@bowt.ie

about 3 years ago

In reply to: Jeff Davis (#13)

Re: Rework of collation code, extensibility

On Sat, Jan 14, 2023 at 12:03 PM Jeff Davis <pgsql@j-davis.com> wrote:

The first goal I had was simply that the code was really hard to
understand and work on, and refactoring was justified to improve the
situation.

The second goal, which is somewhat dependent on the first goal, is that
we really need an ability to support multiple ICU libraries, and I
wanted to do some common groundwork that would be needed for any
approach we choose there, and provide some hooks to get us there. You
are right that this goal influenced the first goal.

I don't disagree that it was somewhat independent of the first goal. I
just think that it makes sense to "round up to fully dependent".
Basically it's not independent enough to be worth talking about as an
independent thing, just as a practical matter - it's confusing at the
level of things like the commit message. There is a clear direction
that you're going in here from the start, and your intentions in 0001
do matter to somebody that's just looking at 0001 in isolation. That
is my opinion, at least.

The second goal is a perfectly good enough goal on its own, and one
that I am totally supportive of. Making the code clearer is icing on
the cake.

ucol_strcollUTF8() accepts -1 to mean "nul-terminated". I did some
basic testing and it doesn't seem like it's slower than using the
length. If passing the length is faster for some reason, it would
complicate the API because we'd need an entry point that's expecting
nul-termination and lengths, which is awkward (as Peter E. pointed
out).

That's good. I'm happy to leave it at that. I was only enquiring.

I felt it was a little clearer amongst the other code, to a casual
reader, but I suppose it's a style thing. I will change it if you
insist.

I certainly won't insist.

I'd have to expose the pg_locale_t struct, which didn't seem desirable
to me. Do you think it's enough of a performance concern to be worth
some ugliness there?

I don't know. Quite possibly not. It would be nice to have some data
on that, though.

--
Peter Geoghegan

#15

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Peter Geoghegan (#14)

Re: Rework of collation code, extensibility

On Tue, 2023-01-17 at 14:18 -0800, Peter Geoghegan wrote:

The second goal is a perfectly good enough goal on its own, and one
that I am totally supportive of. Making the code clearer is icing on
the cake.

Attached v8, which is just a rebase.

To reiterate: commitfest entry
https://commitfest.postgresql.org/41/3956/ is dependent on these
patches and is a big part of the motivation for refactoring.

I don't know. Quite possibly not. It would be nice to have some data
on that, though.

I tested with hash aggregation, which might be more dependent on
pg_locale_deterministic() than sorting. I didn't see any significant
difference between master and the refactoring branch, so I don't see a
need to make that function "inline".

I also re-tested sorting and found some interesting results for en-US-
x-icu on a UTF-8 database (which is I suspect one of the most common
configurations for ICU):

* the refactoring branch is now more than 5% faster, whether using
abbreviated keys or not
* disabling abbreviated keys makes sorting 8-10% faster on both
master and the refactoring branch

Both of these are surprising, and I haven't investigated deeply yet.
Maybe something about LTO, some intervening patch, or I just made some
mistakes somewhere (I did this fairly quickly). But as of now, it
doesn't look like the refactoring patch hurts anything.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#16

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Jeff Davis (#15)

Re: Rework of collation code, extensibility

On Fri, 2023-01-20 at 12:54 -0800, Jeff Davis wrote:

Both of these are surprising, and I haven't investigated deeply yet.

It's just because autoconf defaults to -O2 and meson to -O3, at least
on my machine. It turns out that, at -O2, master and the refactoring
branch are even; but at -O3, both get faster, and the refactoring pulls
ahead by a few percentage points.

At least that's what's happening for en-US-x-icu on UTF-8 with my test
data set. I didn't see much of a difference in other situations, but I
didn't retest those other situations this time around.

We should still look into why disabling abbreviated keys improves
performance in some cases. Maybe we need a GUC for that?

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#17

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Jeff Davis (#15)

Re: Rework of collation code, extensibility

Attached v9 and added perf numbers below.

I'm hoping to commit 0002 and 0003 soon-ish, maybe a week or two,
please let me know if you want me to hold off. (I won't commit the GUCs
unless others find them generally useful; they are included here to
more easily reproduce my performance tests.)

My primary motivation is still related to
https://commitfest.postgresql.org/41/3956/ but the combination of
cleaner code and a performance boost seems like reasonable
justification for this patch set independently.

There aren't any clear open items on this patch. Peter Eisentraut asked
me to focus this thread on the refactoring, which I've done by reducing
it to 2 patches, and I left multilib ICU up to the other thread. He
also questioned the increased line count, but I think the currently-low
line count is due to bad style. PeterG provided some review comments,
in particular when to do the tiebreaking, which I addressed.

This patch has been around for a while, so I'll take a fresh look and
see if I see risk areas, and re-run a few sanity checks. Of course more
feedback would also be welcome.

PERFORMANCE:

======
Setup:
======

base: master with v9-0001 applied (GUCs only)
refactor: master with v9-0001, v9-0002, v9-0003 applied

Note that I wasn't able to see any performance difference between the
base and master, v9-0001 just adds some GUCs to make testing easier.

glibc 2.35 ICU 70.1
gcc 11.3.0 LLVM 14.0.0

built with meson (uses -O3)

$ perl text_generator.pl 10000000 10 > /tmp/strings.utf8.txt

CREATE TABLE s (t TEXT);
COPY s FROM '/tmp/strings.utf8.txt';
VACUUM FREEZE s;
CHECKPOINT;
SET work_mem='10GB';
SET max_parallel_workers = 0;
SET max_parallel_workers_per_gather = 0;

=============
Test queries:
=============

EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "C";
EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en_US";
EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu";

Timings are measured as the milliseconds to return the first tuple from
the Sort operator (as reported in EXPLAIN ANALYZE). Median of three
runs.

========
Results:
========

base refactor speedup

sort_abbreviated_keys=false:
C 7377 7273 1.4%
en_US 35081 35090 0.0%
en-US-x-ixu 20520 19465 5.4%

sort_abbreviated_keys=true:
C 8105 8008 1.2%
en_US 35067 34850 0.6%
en-US-x-icu 22626 21507 5.2%

===========
Conclusion:
===========

These numbers can move +/-1 percentage point, so I'd interpret anything
less than that as noise. This happens to be the first run where all the
numbers favored the refactoring patch, but it is generally consistent
with what I had seen before.

The important part is that, for ICU, it appears to be a substantial
speedup when using meson (-O3).

Also, when/if the multilib ICU support goes in, that may lose some of
these gains due to an extra indirect function call.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#18

Peter Eisentraut

peter_e@gmx.net

about 3 years ago

In reply to: Jeff Davis (#17)

Re: Rework of collation code, extensibility

On 27.01.23 00:47, Jeff Davis wrote:

I'm hoping to commit 0002 and 0003 soon-ish, maybe a week or two,
please let me know if you want me to hold off. (I won't commit the GUCs
unless others find them generally useful; they are included here to
more easily reproduce my performance tests.)

I have looked a bit at 0002 and 0003. I like the direction. I'll spend
a bit more time reviewing it in detail. It moves a lot of code around.

I don't know to what extent this depends on the abbreviated key GUC
discussion. Does the rest of this patch set depend on this?

#19

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Peter Eisentraut (#18)

Re: Rework of collation code, extensibility

On Tue, 2023-01-31 at 11:40 +0100, Peter Eisentraut wrote:

I don't know to what extent this depends on the abbreviated key GUC
discussion. Does the rest of this patch set depend on this?

The overall refactoring is not dependent logically on the GUC patch. It
may require some trivial fixup if you eliminate the GUC patch.

I left it there because it makes exploring/testing easier (at least for
me), but the GUC patch doesn't need to be committed if there's no
consensus.

Regards,
Jeff Davis

#20

Peter Eisentraut

peter_e@gmx.net

about 3 years ago

In reply to: Jeff Davis (#19)

Re: Rework of collation code, extensibility

On 01.02.23 00:33, Jeff Davis wrote:

On Tue, 2023-01-31 at 11:40 +0100, Peter Eisentraut wrote:

I don't know to what extent this depends on the abbreviated key GUC
discussion. Does the rest of this patch set depend on this?

The overall refactoring is not dependent logically on the GUC patch. It
may require some trivial fixup if you eliminate the GUC patch.

I left it there because it makes exploring/testing easier (at least for
me), but the GUC patch doesn't need to be committed if there's no
consensus.

I took another closer look at the 0002 and 0003 patches.

The commit message for 0002 says "Also remove the TRUST_STRXFRM define",
but I think this is incorrect, as that is done in the 0001 patch.

I don't like that the pg_strnxfrm() function requires these kinds of
repetitive error checks:

+           if (rsize != bsize)
+               elog(ERROR, "pg_strnxfrm() returned unexpected result");

This could be checked inside the function itself, so that the callers
don't have to do this themselves every time.

I don't really understand the 0003 patch. It's a lot of churn but I'm
not sure that it achieves more clarity or something.

The new function pg_locale_deterministic() seems sensible. Maybe this
could be proposed as a separate patch.

I don't understand the new header pg_locale_internal.h. What is
"internal" and what is not internal? What are we hiding from whom?
There are no code comments about this AFAICT.

pg_locale_struct has new fields

+ char *collate;
+ char *ctype;

that are not explained anywhere.

I think this patch would need a bit more explanation and commenting.

#21

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Peter Eisentraut (#20)

#22

Jeff Davis

pgsql@j-davis.com