Improve the performance of Unicode Normalization Forms.

Started by Alexander Borisov · 11 months ago · 39 messages · hackers
#1 Alexander Borisov
lex.borisov@gmail.com

Hi, hackers!

As promised, I continue to improve/speed up Unicode in Postgres.
Last time, we improved the lower(), upper(), and casefold()
functions. [1]
Now it's time for Unicode Normalization Forms, specifically
the normalize() function.

The attachment contains three patches:
1. Move the functions for building a range index into a separate
Perl module.
2-small. In this patch, the mapping tables are smaller, but this
increases the number of branches needed to find the desired value.
2-big. In this patch, the tables are larger, but the number of
branches needed to find the desired value is small.

In other words, we choose between two patches, either small or big.

What did we manage to achieve?
1. The uint32 field that stored the Unicode codepoint was removed
from the main structure/table. It was only needed for perfect hashing.
2. Thanks to the first point, we were able to get rid of duplicates
and shrink the main table from 6843 to 3943 entries.
3. We also removed duplicates from the table of decomposition
codepoints, shrinking it from 5138 to 4304 entries (uint32).
4. These changes increase speed by up to 40% in some cases
(benchmarks below).
5. The server side "lost weight" in the binary, while the frontend
"gained weight" a little.

I read the old commits, which say that frontend size is very important
and that speed is not (speed matters on the server).
I'm not quite sure what to do if that is really the case. Perhaps
we should keep the slow version for the frontend.

Benchmarks.

How was it tested?
Four files were created, one for each normalization form: NFC, NFD,
NFKC, and NFKD.
The files were sent via pgbench. The files contain all code points that
need to be normalized.
Unfortunately, the patches are already quite large, but if necessary,
I can send these files in a separate email or upload them somewhere.

Legend.
Patch (big table) — bloated mapping tables, but with a small number
of search branches.
Patch (small table) — small tables, but with a large number of search
branches.
Without — the original Postgres without patches.

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

NFC:
Patch (big table): tps = 5057.624509
Patch (small table): tps = 4727.158048
Without: tps = 3268.654867

NFD:
Patch (big table): tps = 1145.730247
Patch (small table): tps = 1073.084633
Without: tps = 689.627239

NFKC:
Patch (big table): tps = 4821.700702
Patch (small table): tps = 4662.640629
Without: tps = 3450.254581

NFKD:
Patch (big table): tps = 1142.644341
Patch (small table): tps = 951.405893
Without: tps = 636.161496

How the size of object files and libraries has changed (in bytes):
Patch (big table):
unicode_norm.o: 138896
unicode_norm_srv.o: 191336
libpq.a: 364782
libpq.so.5.18: 480944
libpgcommon.a: 584432
libpgcommon_shlib.a: 568724

Patch (small table):
unicode_norm.o: 86416
unicode_norm_srv.o: 138824
libpq.a: 364782
libpq.so.5.18: 423600
libpgcommon.a: 531952
libpgcommon_shlib.a: 516244

Without:
unicode_norm.o: 79488
unicode_norm_srv.o: 165504
libpq.a: 364782
libpq.so.5.18: 419288
libpgcommon.a: 525024
libpgcommon_shlib.a: 509316

[1]: /messages/by-id/7cac7e66-9a3b-4e3f-a997-42aa0c401f80@gmail.com

--
Best regards,
Alexander Borisov

Attachments:

v1-0001-Moving-Perl-functions-Range-index-to-a-common-mod.patch (text/plain, +255 -153)
v1-0002-big-Improve-the-performance-of-Unicode-Normalization-.patch (text/plain, +54352 -12482)
v1-0002-small-Improve-the-performance-of-Unicode-Normalization-.patch (text/plain, +24879 -12482)
#2 John Naylor
john.naylor@enterprisedb.com
In reply to: Alexander Borisov (#1)
Re: Improve the performance of Unicode Normalization Forms.

On Tue, Jun 3, 2025 at 1:51 PM Alexander Borisov <lex.borisov@gmail.com> wrote:

5. The server part "lost weight" in the binary, but the frontend
"gained weight" a little.

I read the old commits, which say that the size of the frontend is very
important and that speed is not important
(speed is important on the server).
I'm not quite sure what to do if this is really the case. Perhaps
we should leave the slow version for the frontend.

In the "small" patch, the frontend files got a few kB bigger, but the
backend got quite a bit smaller. If we decided to go with this patch,
I'd say it's preferable to do it in a way that keeps both paths the
same.

How was it tested?
Four files were created for each normalization form: NFC, NFD, NFKC,
and NFKD.
The files were sent via pgbench. The files contain all code points that
need to be normalized.
Unfortunately, the patches are already quite large, but if necessary,
I can send these files in a separate email or upload them somewhere.

What kind of workload do they present?
Did you consider running the same tests from the thread that led to
the current implementation?

--
John Naylor
Amazon Web Services

#3 Alexander Borisov
lex.borisov@gmail.com
In reply to: John Naylor (#2)
Re: Improve the performance of Unicode Normalization Forms.

11.06.2025 10:13, John Naylor wrote:

On Tue, Jun 3, 2025 at 1:51 PM Alexander Borisov <lex.borisov@gmail.com> wrote:

5. The server part "lost weight" in the binary, but the frontend
"gained weight" a little.

I read the old commits, which say that the size of the frontend is very
important and that speed is not important
(speed is important on the server).
I'm not quite sure what to do if this is really the case. Perhaps
we should leave the slow version for the frontend.

In the "small" patch, the frontend files got a few kB bigger, but the
backend got quite a bit smaller. If we decided to go with this patch,
I'd say it's preferable to do it in a way that keeps both paths the
same.

Okay, then I'll leave the frontend unchanged so that the size remains
the same. The changes will only affect the backend.

How was it tested?
Four files were created for each normalization form: NFC, NFD, NFKC,
and NFKD.
The files were sent via pgbench. The files contain all code points that
need to be normalized.
Unfortunately, the patches are already quite large, but if necessary,
I can send these files in a separate email or upload them somewhere.

What kind of workload do they present?
Did you consider running the same tests from the thread that lead to
the current implementation?

I found performance tests in this discussion
/messages/by-id/CAFBsxsHUuMFCt6-pU+oG-F1==CmEp8wR+O+bRouXWu6i8kXuqA@mail.gmail.com
Below are performance test results.

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

1.

Normalize, decomp only

select count(normalize(t, NFD)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;

Patch (big table): 279.858 ms
Patch (small table): 282.925 ms
Without: 444.118 ms

2.

select count(normalize(t, NFD)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D',
i % 3 + 1) as t from
generate_series(1,100000) as i
) s;

Patch (big table): 219.858 ms
Patch (small table): 247.893 ms
Without: 376.906 ms

3.

Normalize, decomp+recomp

select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,1000) as i
) s;

Patch (big table): 7.553 ms
Patch (small table): 7.876 ms
Without: 13.177 ms

4.

select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D',
i % 3 + 1) as t from
generate_series(1,1000) as i
) s;

Patch (big table): 5.765 ms
Patch (small table): 6.782 ms
Without: 10.800 ms

5.

Quick check has not changed because these patches do not affect it:

-- all chars are quickcheck YES
select count(*) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;

Patch (big table): 29.477 ms
Patch (small table): 29.436 ms
Without: 29.378 ms

From these tests, we see up to a 2x speedup in some cases.

--
Best regards,
Alexander Borisov

#4 John Naylor
john.naylor@enterprisedb.com
In reply to: Alexander Borisov (#3)
Re: Improve the performance of Unicode Normalization Forms.

On Wed, Jun 11, 2025 at 7:27 PM Alexander Borisov <lex.borisov@gmail.com> wrote:

11.06.2025 10:13, John Naylor wrote:

On Tue, Jun 3, 2025 at 1:51 PM Alexander Borisov <lex.borisov@gmail.com> wrote:

5. The server part "lost weight" in the binary, but the frontend
"gained weight" a little.

I read the old commits, which say that the size of the frontend is very
important and that speed is not important
(speed is important on the server).
I'm not quite sure what to do if this is really the case. Perhaps
we should leave the slow version for the frontend.

In the "small" patch, the frontend files got a few kB bigger, but the
backend got quite a bit smaller. If we decided to go with this patch,
I'd say it's preferable to do it in a way that keeps both paths the
same.

Okay, then I'll leave the frontend unchanged so that the size remains
the same. The changes will only affect the backend.

Sorry, by "both paths" I meant making the frontend and backend the
same, because it's good for testing. In the "small table" patch, libpq
etc. increase by about 1%, which is negligible. unicode_norm.o is only
7kB bigger -- that seems okay to me, especially considering that
unicode_norm_srv.o is 27kB smaller.

From these tests, we see 2x in some tests.

Nice!

--
John Naylor
Amazon Web Services

#5 Jeff Davis
pgsql@j-davis.com
In reply to: Alexander Borisov (#1)
Re: Improve the performance of Unicode Normalization Forms.

On Tue, 2025-06-03 at 00:51 +0300, Alexander Borisov wrote:

As promised, I continue to improve/speed up Unicode in Postgres.
Last time, we improved the lower(), upper(), and casefold()
functions. [1]
Now it's time for Unicode Normalization Forms, specifically
the normalize() function.

Did you compare against other implementations, such as ICU's
normalization functions? There's also a rust crate here:

https://github.com/unicode-rs/unicode-normalization

that might have been optimized.

In addition to the lookups themselves, there are other opportunities
for optimization as well, such as:

* reducing the need for palloc and extra buffers, perhaps by using
buffers on the stack for small strings

* operating more directly on UTF-8 data rather than decoding and
re-encoding the entire string

Regards,
Jeff Davis

#6 Alexander Borisov
lex.borisov@gmail.com
In reply to: Jeff Davis (#5)
Re: Improve the performance of Unicode Normalization Forms.

19.06.2025 20:41, Jeff Davis wrote:

On Tue, 2025-06-03 at 00:51 +0300, Alexander Borisov wrote:

As promised, I continue to improve/speed up Unicode in Postgres.
Last time, we improved the lower(), upper(), and casefold()
functions. [1]
Now it's time for Unicode Normalization Forms, specifically
the normalize() function.

Did you compare against other implementations, such as ICU's
normalization functions? There's also a rust crate here:

https://github.com/unicode-rs/unicode-normalization

that might have been optimized.

I don't quite see how this compares to the Rust implementation. At
the link provided, they use a perfect hash, which is exactly what I
got rid of for a 2x boost.
As for the ICU implementation in C++, I have always found it slow,
at least when used from C code.
I could certainly run benchmarks comparing the Postgres approach with
ICU, but that is beyond the scope of the patches under discussion.

I want to emphasize that the patches I submitted don't change the
normalization code/logic.
They change how the right codepoints are found in the tables, which
is what gives us the performance boost.

In addition to the lookups themselves, there are other opportunities
for optimization as well, such as:

* reducing the need for palloc and extra buffers, perhaps by using
buffers on the stack for small strings

* operate more directly on UTF-8 data rather than decoding and re-
encoding the entire string

I absolutely agree with you: the normalization code is very well
written but far from optimized.
I didn't include changes to the normalization code itself, to avoid
lumping everything together and to make review easier.
Following my plan for optimizing the normalization forms, I intended
to discuss optimizing the C code itself in the next iteration of
"Improve performance...".

--
Regards,
Alexander Borisov

#7 Nico Williams
nico@cryptonector.com
In reply to: Jeff Davis (#5)
Re: Improve the performance of Unicode Normalization Forms.

On Thu, Jun 19, 2025 at 10:41:57AM -0700, Jeff Davis wrote:

In addition to the lookups themselves, there are other opportunities
for optimization as well, such as:

* reducing the need for palloc and extra buffers, perhaps by using
buffers on the stack for small strings

* operate more directly on UTF-8 data rather than decoding and re-
encoding the entire string

Not sure how relevant it is here, but in ZFS in the '00s we implemented
form-insensitive, form-preserving functionality with an interesting
optimization that greatly reduced allocations and had a decent
fast path for ASCII-mostly strings. It works by doing normalization
only in a) string comparison functions and b) string hashing functions
(because in ZFS, directories are hash tables).

For b-trees one has to normalize the whole string, so perhaps this
scheme is just not that relevant to databases that use b-trees, but then
depending on how the b-tree key is smeared over internal nodes it might
be.

The ASCII-mostly fast path was that when comparing strings you want a
"cursor" through each string, and whenever the current and next bytes
are both ASCII you compare the current one the fast way (as ASCII), and
you only take the slow path of having to check if the current
_character_ (as opposed to _byte_) requires normalization when either
the current or next bytes are not ASCII. In the slow path you only
normalize the _current character_, so you only need enough buffer space
for that. In principle a Unicode character could consist of an
unbounded number of codepoints (zalgo), I think, but in practice it will
be no more than a half a dozen or so, so the buffer can be small and
stack allocated. This same concept applies to string hashing: you walk
a cursor over the string and hash each ASCII _character_ and normalize
all non-ASCII characters, all one character at a time.

The really nice thing about form-insensitive/form-preserving
functionality is that it is form-preserving rather than normalizing on
create and lookup, and that makes the fast-path described above
feasible. Whereas if you normalize on create and lookup you have to
heap allocate enough space for each string normalized. The other nice
thing is that f-i/f-p behavior is a lot less surprising to users -- the
input methods they use don't matter.

What motivated this f-i/f-p behavior was that HFS+ used NFD (well,
something very close to NFD) but input methods (even on OS X) invariably
produce NFC (well, something close to it), at least for Latin scripts.
This means that on OS X if you use filenames from directory listings
those will be NFD while user inputs will be NFC, and so you can't just
memcmp()/strcmp() those -- you have to normalize _or_ use form-
insensitive string comparison, but nothing did that 20 years ago. Thus
doing the form-insensitivity in the filesystem seemed best, and if you
do that you can be form-preserving to enable the optimization described
above.

My apologies if the above is not relevant to PG and thus noise,

Nico
--

#8 Jeff Davis
pgsql@j-davis.com
In reply to: Nico Williams (#7)
Re: Improve the performance of Unicode Normalization Forms.

On Fri, 2025-06-20 at 11:31 -0500, Nico Williams wrote:

In the slow path you only normalize the _current character_, so you
only need enough buffer space for that.

That's a clear win for UTF8 data. Also, if there are no changes, then
you can just return the input buffer and not bother allocating an
output buffer.

The really nice thing about form-insensitive/form-preserving
functionality is that it is form-preserving rather than normalizing on
create and lookup, and that makes the fast-path described above
feasible. Whereas if you normalize on create and lookup you have to
heap allocate enough space for each string normalized.

Non-deterministic ICU collations are already insensitive to most
normalization differences. Some differences are not handled when it
involves too many combining marks, but you can use the "full
normalization" option to address that. See:

https://www.postgresql.org/docs/current/collation.html#ICU-COLLATION-SETTINGS-TABLE

The other nice thing is that f-i/f-p behavior is a lot less
surprising to users -- the input methods they use don't matter.

Postgres is already form-preserving; it does not auto-normalize. (I
have suggested that we might want to offer something like that, but
that would be a user choice.)

Currently, the non-deterministic collations (which offer form-
insensitivity) are not available at the database level, so you have to
explicitly specify the COLLATE clause on a column or query. In other
words, Postgres is not form-insensitive by default, though there is
work to make that possible.

What motivated this f-i/f-p behavior was that HFS+ used NFD (well,
something very close to NFD) but input methods (even on OS X)
invariably produce NFC (well, something close to it), at least for
Latin scripts. This means that on OS X if you use filenames from
directory listings those will be NFD while user inputs will be NFC,
and so you can't just memcmp()/strcmp() those -- you have to
normalize _or_ use form-insensitive string comparison, but nothing
did that 20 years ago. Thus doing the form-insensitivity in the
filesystem seemed best, and if you do that you can be form-preserving
to enable the optimization described above.

Databases have similar concerns as a filesystem in this respect.

Regards,
Jeff Davis

#9 Jeff Davis
pgsql@j-davis.com
In reply to: Alexander Borisov (#6)
Re: Improve the performance of Unicode Normalization Forms.

On Fri, 2025-06-20 at 17:51 +0300, Alexander Borisov wrote:

I don't quite see how this compares to the implementation on Rust. In
the link provided, they use perfect hash, which I get rid of and get
a x2 boost.
If you take ICU implementations in C++, I have always considered them
slow, at least when used in C code.
I may well run benchmarks and compare the performance of the approach
in Postgres and ICU. But this is beyond the scope of the patches
under discussion.

Are you saying that, with these patches, Postgres will offer the
fastest open-source Unicode normalization? If so, that would be very
cool.

The reason I'm asking is because, if there are multiple open source
implementations, we should either have the best one, or just borrow
another one as long as it has a suitable license (perhaps translating
to C as necessary).

Regards,
Jeff Davis

#10 Nico Williams
nico@cryptonector.com
In reply to: Jeff Davis (#8)
Re: Improve the performance of Unicode Normalization Forms.

On Fri, Jun 20, 2025 at 10:15:47AM -0700, Jeff Davis wrote:

On Fri, 2025-06-20 at 11:31 -0500, Nico Williams wrote:

In the slow path you only normalize the _current character_, so you
only need enough buffer space for that.

That's a clear win for UTF8 data. Also, if there are no changes, then
you can just return the input buffer and not bother allocating an
output buffer.

The latter is not relevant to string comparison or hashing, but, yeah,
if you have to produce a normalized string you can optimistically assume
it is already normalized and defer allocation until you know it isn't
normalized.

Postgres is already form-preserving; it does not auto-normalize. (I
have suggested that we might want to offer something like that, but
that would be a user choice.)

Excellent, then I would advise looking into adding form-insensitive
string comparison and hashing to get f-i/f-p behavior.

Currently, the non-deterministic collations (which offer form-
insensitivity) are not available at the database level, so you have to
explicitly specify the COLLATE clause on a column or query. In other
words, Postgres is not form-insensitive by default, though there is
work to make that possible.

TIL. Thanks.

Databases have similar concerns as a filesystem in this respect.

I figured :)

#11 Alexander Borisov
lex.borisov@gmail.com
In reply to: Jeff Davis (#9)
Re: Improve the performance of Unicode Normalization Forms.

20.06.2025 20:20, Jeff Davis wrote:

On Fri, 2025-06-20 at 17:51 +0300, Alexander Borisov wrote:

I don't quite see how this compares to the implementation on Rust. In
the link provided, they use perfect hash, which I get rid of and get
a x2 boost.
If you take ICU implementations in C++, I have always considered them
slow, at least when used in C code.
I may well run benchmarks and compare the performance of the approach
in Postgres and ICU. But this is beyond the scope of the patches
under discussion.

Are you saying that, with these patches, Postgres will offer the
fastest open-source Unicode normalization? If so, that would be very
cool.

That's what we're aiming for: to implement the fastest approach.
By applying the two proposed patches we get the fastest codepoint
lookup in the tables. This is supported by the measurements made here
and earlier in the Unicode case-mapping patch.

Once these patches are committed, I will work on the C normalization
code itself and optimize it. That's when we can run benchmarks and say
with confidence that we're the fastest.

The reason I'm asking is because, if there are multiple open source
implementations, we should either have the best one, or just borrow
another one as long as it has a suitable license (perhaps translating
to C as necessary).

Before getting into this "fight" I studied different approaches to
looking up the necessary codepoints in tables (hashing, binary search,
radix trees...) and came to the conclusion that the approach I
proposed (a range index) is the fastest for this purpose.

--
Regards,
Alexander Borisov

#12 Jeff Davis
pgsql@j-davis.com
In reply to: Alexander Borisov (#11)
Re: Improve the performance of Unicode Normalization Forms.

On Tue, 2025-06-24 at 18:20 +0300, Alexander Borisov wrote:

That's what we're aiming for - to implement the fastest approach.

Awesome! Thank you for clarifying this as a goal. Having the fastest
open-source Unicode normalization would be a great thing to highlight
when this is done.

Regards,
Jeff Davis

#13 Alexander Borisov
lex.borisov@gmail.com
In reply to: Jeff Davis (#12)
Re: Improve the performance of Unicode Normalization Forms.

30.06.2025 22:28, Jeff Davis wrote:

On Tue, 2025-06-24 at 18:20 +0300, Alexander Borisov wrote:

That's what we're aiming for - to implement the fastest approach.

Awesome! Thank you for clarifying this as a goal. Having the fastest
open-source Unicode normalization would be a great thing to highlight
when this is done.

After the discussion in this thread, we are settling on the "small"
patch option. The table size is reduced, and the speed is nearly 2x.
Attached are the final two patches. Once they are reviewed/accepted,
I will proceed to optimizing the C code for Unicode Normalization
Forms.

--
Regards,
Alexander Borisov

Attachments:

v2-0001-Moving-Perl-functions-Range-index-to-a-common-mod.patch (text/plain, +255 -153)
v2-0002-Improve-the-performance-of-Unicode-Normalization-.patch (text/plain, +24879 -12482)
#14 Alexander Borisov
lex.borisov@gmail.com
In reply to: Alexander Borisov (#13)
Re: Improve the performance of Unicode Normalization Forms.

07.07.2025 21:44, Alexander Borisov wrote:

30.06.2025 22:28, Jeff Davis wrote:

On Tue, 2025-06-24 at 18:20 +0300, Alexander Borisov wrote:

That's what we're aiming for - to implement the fastest approach.

Awesome! Thank you for clarifying this as a goal. Having the fastest
open-source Unicode normalization would be a great thing to highlight
when this is done.

After the discussion in this correspondence, we are settling on
the "small" patch option. The table size is reduced, the speed is
almost x2.
Attached are the final two patches. After reviewing/accepting them,
I will proceed to optimizing the C code for Unicode Normalization Forms.

Version 3 patches. In version 2 "make -s headerscheck" did not work.

--
Regards,
Alexander Borisov

Attachments:

v3-0001-Moving-Perl-functions-Range-index-to-a-common-mod.patch (text/plain, +255 -153)
v3-0002-Improve-the-performance-of-Unicode-Normalization-.patch (text/plain, +24883 -12482)
#15 Alexander Borisov
lex.borisov@gmail.com
In reply to: Alexander Borisov (#14)
Re: Improve the performance of Unicode Normalization Forms.

Hi,

I'm new here, so please advise me: if a patch wasn't accepted at the
commitfest, does that mean it isn't needed (no one was interested in
it), or was there simply not enough time?
Please tell me how this process works.
Should I move it to the next commitfest? I'm not quite sure what to do.

I've seen that patches are often moved from commitfest to commitfest.
Am I right that this is normal practice?
Please understand, the process is not very transparent to a newcomer,
and the expected approach is not obvious.

What is the best course of action for me?
Thanks!

--
Regards,
Alexander Borisov

#16 David G. Johnston
david.g.johnston@gmail.com
In reply to: Alexander Borisov (#15)
Re: Improve the performance of Unicode Normalization Forms.

On Friday, August 1, 2025, Alexander Borisov <lex.borisov@gmail.com> wrote:

I looked and saw that patches are often transferred from commitfest to
commitfest. I understand that this is normal practice?

What is the best course of action for me?

If you feel the patch is committable it should remain in the non-draft CFs,
being moved every other month or so, until it gets committed.

David J.

#17 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Borisov (#15)
Re: Improve the performance of Unicode Normalization Forms.

Alexander Borisov <lex.borisov@gmail.com> writes:

I'm new here, so please advise me: if a patch wasn't accepted at the
commitfest, does that mean it's not needed (no one was interested in
it), or was there not enough time?

It's kind of hard to tell really --- there are many patches in our
queue and not nearly enough reviewers. So maybe someone will get to
it in the fullness of time, or maybe it's true that no one cares
about the particular topic. (But bug fixes and performance
improvements are almost always interesting to someone.)

I recommend optimism: as long as *you* still believe that the patch
is worthwhile, keep pushing it forward to the next commitfest.
We used to do that automatically, but we have started asking authors
to do that themselves, as a way of weeding out patches for which
the author has lost interest.

regards, tom lane

#18 Alexander Borisov
lex.borisov@gmail.com
In reply to: Tom Lane (#17)
Re: Improve the performance of Unicode Normalization Forms.

01.08.2025 23:37, Tom Lane wrote:

Alexander Borisov <lex.borisov@gmail.com> writes:

I'm new here, so please advise me: if a patch wasn't accepted at the
commitfest, does that mean it's not needed (no one was interested in
it), or was there not enough time?

It's kind of hard to tell really --- there are many patches in our
queue and not nearly enough reviewers. So maybe someone will get to
it in the fullness of time, or maybe it's true that no one cares
about the particular topic. (But bug fixes and performance
improvements are almost always interesting to someone.)

I recommend optimism: as long as *you* still believe that the patch
is worthwhile, keep pushing it forward to the next commitfest.
We used to do that automatically, but we have started asking authors
to do that themselves, as a way of weeding out patches for which
the author has lost interest.

Thanks, Tom! I always choose optimism.

I've been in open source for a while, and this is the first time I've
seen this approach.
I have a plan to further improve Postgres performance for Unicode (and
not only Unicode), which is, after all, the foundation for working
with text.
I don't want to overwhelm the community with patches, so I'm taking a
systematic approach.

Once again, thank you, Tom. The community's approach has become clearer.

--
Regards,
Alexander Borisov

#19 Jeff Davis
pgsql@j-davis.com
In reply to: Alexander Borisov (#14)
Re: Improve the performance of Unicode Normalization Forms.

On Tue, 2025-07-08 at 22:42 +0300, Alexander Borisov wrote:

Version 3 patches. In version 2 "make -s headerscheck" did not work.

I ran my own performance tests. What I did was get some test data from
ICU v76.1 by doing:

cat icu4j/perf-tests/data/collation/Test* \
| uconv -f utf-8 -t utf-8 -x nfc > ~/strings.nfc.txt

cat icu4j/perf-tests/data/collation/Test* \
| uconv -f utf-8 -t utf-8 -x nfd > ~/strings.nfd.txt

export NORM_PERF_NFC_FILE=~/strings.nfc.txt
export NORM_PERF_NFD_FILE=~/strings.nfd.txt

The first is about 8MB, the second 9MB (because NFD is slightly
larger).

Then I added some testing code to norm_test.c. It's not intended for
committing, just to run the test. Note that it requires setting
environment variables to find the input files.

If patch v3j-0001 is applied, it's using perfect hashing. If patches
v3j-0002-4 are applied, it's using your code. In either case it
compares with ICU.

Results with perfect hashing (100 iterations):

Normalization from NFC to NFD with PG: 010.009
Normalization from NFC to NFD with ICU: 001.580
Normalization from NFC to NFKD with PG: 009.376
Normalization from NFC to NFKD with ICU: 000.857
Normalization from NFD to NFC with PG: 016.026
Normalization from NFD to NFC with ICU: 001.205
Normalization from NFD to NFKC with PG: 015.903
Normalization from NFD to NFKC with ICU: 000.654

Results with your code (100 iterations):

Normalization from NFC to NFD with PG: 004.626
Normalization from NFC to NFD with ICU: 001.577
Normalization from NFC to NFKD with PG: 004.024
Normalization from NFC to NFKD with ICU: 000.861
Normalization from NFD to NFC with PG: 006.846
Normalization from NFD to NFC with ICU: 001.209
Normalization from NFD to NFKC with PG: 006.655
Normalization from NFD to NFKC with ICU: 000.651

Your patches are a major improvement, but I'm trying to figure out why
ICU still wins by so much. Thoughts? I didn't investigate much myself
yet, so it's quite possible there's a bug in my test or something.

Regards,
Jeff Davis

Attachments:

v3j-0001-Performance-testing-infrastructure-for-normaliza.patch (text/x-patch, +221 -9)
v3j-0002-Undo-perfect-hash-changes.patch (text/x-patch, +7 -10)
v3j-0003-Moving-Perl-functions-Range-index-to-a-common-mo.patch (text/x-patch, +255 -153)
v3j-0004-Improve-the-performance-of-Unicode-Normalization.patch (text/x-patch, +24883 -12482)
#20 Alexander Borisov
lex.borisov@gmail.com
In reply to: Jeff Davis (#19)
Re: Improve the performance of Unicode Normalization Forms.

09.08.2025 02:17, Jeff Davis wrote:

> On Tue, 2025-07-08 at 22:42 +0300, Alexander Borisov wrote:
>
>> Version 3 patches. In version 2 "make -s headerscheck" did not work.
>
> I ran my own performance tests. What I did was get some test data from
> ICU v76.1 by doing:
>
> [..]
>
> Results with perfect hashing (100 iterations):
>
> Normalization from NFC to NFD with PG: 010.009
> Normalization from NFC to NFD with ICU: 001.580
> Normalization from NFC to NFKD with PG: 009.376
> Normalization from NFC to NFKD with ICU: 000.857
> Normalization from NFD to NFC with PG: 016.026
> Normalization from NFD to NFC with ICU: 001.205
> Normalization from NFD to NFKC with PG: 015.903
> Normalization from NFD to NFKC with ICU: 000.654
>
> Results with your code (100 iterations):
>
> Normalization from NFC to NFD with PG: 004.626
> Normalization from NFC to NFD with ICU: 001.577
> Normalization from NFC to NFKD with PG: 004.024
> Normalization from NFC to NFKD with ICU: 000.861
> Normalization from NFD to NFC with PG: 006.846
> Normalization from NFD to NFC with ICU: 001.209
> Normalization from NFD to NFKC with PG: 006.655
> Normalization from NFD to NFKC with ICU: 000.651
>
> Your patches are a major improvement, but I'm trying to figure out why
> ICU still wins by so much. Thoughts? I didn't investigate much myself
> yet, so it's quite possible there's a bug in my test or something.

I experimented a bit and took a look at the ICU code.

1. The performance test is not entirely accurate.

The thing is that ICU (unorm_normalize()) takes the output buffer
and its size from the caller.
That is, ICU does not calculate how much output the normalization
will produce; it simply checks whether the result fits into the
buffer it was given.

PG (unicode_normalize()), on the other hand, accepts only the input
buffer and first calculates the size of the output buffer itself.
That means PG does double the work: it makes at least one extra pass
over the input data.

A fairer benchmark would first call ICU to calculate the required
buffer size, and only then call the normalization itself.

2. PG performs unnecessary table lookups (get_code_entry()) that can
clearly be avoided.

3. PG's algorithm is not optimized.

We could eliminate the decompose_code() function, which is called
for each code point.
Its recursion can be removed by preparing the data in advance, that
is, by performing the expansion in the Perl script that generates
the tables.
That would leave us with code that works in place and without recursion.
And then there are all sorts of other minor details.

Of course, there are other approaches.

I did this comparison (your test, Jeff):

1. Without patch.

Normalization from NFC to NFD with PG: 009.121
Normalization from NFC to NFD with ICU: 000.954
Normalization from NFC to NFKD with PG: 009.048
Normalization from NFC to NFKD with ICU: 000.965
Normalization from NFD to NFC with PG: 014.525
Normalization from NFD to NFC with ICU: 000.623
Normalization from NFD to NFKC with PG: 014.380
Normalization from NFD to NFKC with ICU: 000.666

2. With patch.

Normalization from NFC to NFD with PG: 005.743
Normalization from NFC to NFD with ICU: 001.005
Normalization from NFC to NFKD with PG: 005.902
Normalization from NFC to NFKD with ICU: 000.963
Normalization from NFD to NFC with PG: 007.837
Normalization from NFD to NFC with ICU: 000.640
Normalization from NFD to NFKC with PG: 008.036
Normalization from NFD to NFKC with ICU: 000.667

3. With the patch, but with direct access to the tables: I created one
large index table (uint16_t) in which there is no search at all, only
direct access to the data.

Normalization from NFC to NFD with PG: 002.792
Normalization from NFC to NFD with ICU: 000.953
Normalization from NFC to NFKD with PG: 002.865
Normalization from NFC to NFKD with ICU: 000.958
Normalization from NFD to NFC with PG: 004.361
Normalization from NFD to NFC with ICU: 000.651
Normalization from NFD to NFKC with PG: 004.474
Normalization from NFD to NFKC with ICU: 000.668

It is evident that direct access gives roughly another 2x on top of
the patch, but adds about 1.5MB to the object file size.
This was only done to measure the difference in access time.

As a result, I would not look into ICU at the moment, especially since
we are taking a different approach.
I am currently working on optimizing unicode_normalize() and am trying
to have an improved version of the C algorithm ready for the next
commitfest (in September).

P.S.: In general, I strive to surpass ICU in terms of speed.
I think we're making good progress. Let's see what happens.

--
Regards,
Alexander Borisov

#21 Jeff Davis
pgsql@j-davis.com
In reply to: Alexander Borisov (#20)
#22 Alexander Borisov
lex.borisov@gmail.com
In reply to: Jeff Davis (#21)
#23 Alexander Borisov
lex.borisov@gmail.com
In reply to: Alexander Borisov (#22)
#24 Victor Yegorov
vyegorov@gmail.com
In reply to: Alexander Borisov (#23)
#25 Alexander Borisov
lex.borisov@gmail.com
In reply to: Victor Yegorov (#24)
#26 Jeff Davis
pgsql@j-davis.com
In reply to: Alexander Borisov (#25)
#27 Alexander Borisov
lex.borisov@gmail.com
In reply to: Jeff Davis (#26)
#28 Alexander Borisov
lex.borisov@gmail.com
In reply to: Alexander Borisov (#27)
#29 Alexander Borisov
lex.borisov@gmail.com
In reply to: Jeff Davis (#26)
#30 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Borisov (#29)
#31 Alexander Borisov
lex.borisov@gmail.com
In reply to: Heikki Linnakangas (#30)
#32 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Borisov (#1)
#33 Alexander Borisov
lex.borisov@gmail.com
In reply to: Heikki Linnakangas (#32)
#34 Alexander Borisov
lex.borisov@gmail.com
In reply to: Alexander Borisov (#33)
#35 Alexander Borisov
lex.borisov@gmail.com
In reply to: Alexander Borisov (#34)
#36 Michael Paquier
michael@paquier.xyz
In reply to: Alexander Borisov (#35)
#37 Alexander Borisov
lex.borisov@gmail.com
In reply to: Michael Paquier (#36)
#38 Michael Paquier
michael@paquier.xyz
In reply to: Alexander Borisov (#37)
#39 Alexander Borisov
lex.borisov@gmail.com
In reply to: Michael Paquier (#38)