Speed up ICU case conversion by using ucasemap_utf8To*()
Hi,
Jeff pointed out to me that the case conversion functions in ICU have
UTF-8 specific versions which means we can call those directly if the
database encoding is UTF-8 and skip having to convert to and from UChar.
Since most people today run their databases in UTF-8 I think this
optimization is worth it and when measuring on short to medium length
strings I got a 15-20% speed up. It is still slower than glibc in my
benchmarks but the gap is smaller now.
SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
"sv-SE-x-icu") FROM generate_series(1, 1000000) i);
master: ~540 ms
Patched: ~460 ms
glibc: ~410 ms
I have also attached a clean up patch for the non-UTF-8 code paths. I
thought about doing the same for the new UTF-8 code paths but it turned
out to be a bit messy due to different function signatures for
ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle().
Andreas
Attachments:
v1-0001-Use-optimized-versions-of-ICU-case-conversion-for.patchtext/x-patch; charset=UTF-8; name=v1-0001-Use-optimized-versions-of-ICU-case-conversion-for.patchDownload+114-48
v1-0002-Reduce-code-duplication-in-ICU-case-mapping-code.patchtext/x-patch; charset=UTF-8; name=v1-0002-Reduce-code-duplication-in-ICU-case-mapping-code.patchDownload+26-49
On Fri, 2024-12-20 at 06:20 +0100, Andreas Karlsson wrote:
SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
"sv-SE-x-icu") FROM generate_series(1, 1000000) i);master: ~540 ms
Patched: ~460 ms
glibc: ~410 ms
It looks like you are opening and closing the UCaseMap object each
time. Why not save it in pg_locale_t? That should speed it up even more
and hopefully beat libc.
Also, to support older ICU versions consistently, we need to fix up the
locale name to support "und"; cf. pg_ucol_open(). Perhaps factor out
that logic?
Regards,
Jeff Davis
On Fri, 20 Dec 2024 at 10:50, Andreas Karlsson <andreas@proxel.se> wrote:
Hi,
Jeff pointed out to me that the case conversion functions in ICU have
UTF-8 specific versions which means we can call those directly if the
database encoding is UTF-8 and skip having to convert to and from UChar.Since most people today run their databases in UTF-8 I think this
optimization is worth it and when measuring on short to medium length
strings I got a 15-20% speed up. It is still slower than glibc in my
benchmarks but the gap is smaller now.SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
"sv-SE-x-icu") FROM generate_series(1, 1000000) i);master: ~540 ms
Patched: ~460 ms
glibc: ~410 msI have also attached a clean up patch for the non-UTF-8 code paths. I
thought about doing the same for the new UTF-8 code paths but it turned
out to be a bit messy due to different function signatures for
ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle().
I noticed that Jeff's comments from [1]/messages/by-id/72c7c2b5848da44caddfe0f20f6c7ebc7c0c6e60.camel@j-davis.com have not yet been addressed, I
have changed the commitfest entry status to "Waiting on Author",
please address them and update it to "Needs Review".
[1]: /messages/by-id/72c7c2b5848da44caddfe0f20f6c7ebc7c0c6e60.camel@j-davis.com
Regards,
Vignesh
On 2025-03-17 12:16:11 +0530, vignesh C wrote:
On Fri, 20 Dec 2024 at 10:50, Andreas Karlsson <andreas@proxel.se> wrote:
Hi,
Jeff pointed out to me that the case conversion functions in ICU have
UTF-8 specific versions which means we can call those directly if the
database encoding is UTF-8 and skip having to convert to and from UChar.Since most people today run their databases in UTF-8 I think this
optimization is worth it and when measuring on short to medium length
strings I got a 15-20% speed up. It is still slower than glibc in my
benchmarks but the gap is smaller now.SELECT count(upper) FROM (SELECT upper(('K�lhuvud ' || i) COLLATE
"sv-SE-x-icu") FROM generate_series(1, 1000000) i);master: ~540 ms
Patched: ~460 ms
glibc: ~410 msI have also attached a clean up patch for the non-UTF-8 code paths. I
thought about doing the same for the new UTF-8 code paths but it turned
out to be a bit messy due to different function signatures for
ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle().I noticed that Jeff's comments from [1] have not yet been addressed, I
have changed the commitfest entry status to "Waiting on Author",
please address them and update it to "Needs Review".
[1] - /messages/by-id/72c7c2b5848da44caddfe0f20f6c7ebc7c0c6e60.camel@j-davis.com
It's also worth noting that this patch hasn't been building for quite a while
(at least not since 2025-01-29):
https://cirrus-ci.com/task/5621435164524544?logs=build#L1228
[17:17:51.214] ld: error: undefined symbol: icu_convert_case
[17:17:51.214] >>> referenced by pg_locale_icu.c:484 (../src/backend/utils/adt/pg_locale_icu.c:484)
[17:17:51.214] >>> src/backend/postgres_lib.a.p/utils_adt_pg_locale_icu.c.o:(strfold_icu)
[17:17:51.214] cc: error: linker command failed with exit code 1 (use -v to see invocation)
I think we can mark this as returned-with-feedback for now?
Greetings,
Andres Freund
On Sun, 30 Mar 2025 at 00:20, Andres Freund <andres@anarazel.de> wrote:
On 2025-03-17 12:16:11 +0530, vignesh C wrote:
On Fri, 20 Dec 2024 at 10:50, Andreas Karlsson <andreas@proxel.se> wrote:
Hi,
Jeff pointed out to me that the case conversion functions in ICU have
UTF-8 specific versions which means we can call those directly if the
database encoding is UTF-8 and skip having to convert to and from UChar.Since most people today run their databases in UTF-8 I think this
optimization is worth it and when measuring on short to medium length
strings I got a 15-20% speed up. It is still slower than glibc in my
benchmarks but the gap is smaller now.SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
"sv-SE-x-icu") FROM generate_series(1, 1000000) i);master: ~540 ms
Patched: ~460 ms
glibc: ~410 msI have also attached a clean up patch for the non-UTF-8 code paths. I
thought about doing the same for the new UTF-8 code paths but it turned
out to be a bit messy due to different function signatures for
ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle().I noticed that Jeff's comments from [1] have not yet been addressed, I
have changed the commitfest entry status to "Waiting on Author",
please address them and update it to "Needs Review".
[1] - /messages/by-id/72c7c2b5848da44caddfe0f20f6c7ebc7c0c6e60.camel@j-davis.comIt's also worth noting that this patch hasn't been building for quite a while
(at least not since 2025-01-29):https://cirrus-ci.com/task/5621435164524544?logs=build#L1228
[17:17:51.214] ld: error: undefined symbol: icu_convert_case
[17:17:51.214] >>> referenced by pg_locale_icu.c:484 (../src/backend/utils/adt/pg_locale_icu.c:484)
[17:17:51.214] >>> src/backend/postgres_lib.a.p/utils_adt_pg_locale_icu.c.o:(strfold_icu)
[17:17:51.214] cc: error: linker command failed with exit code 1 (use -v to see invocation)I think we can mark this as returned-with-feedback for now?
Thanks, the commitfest entry is marked as returned with feedback.
@Andreas Karlsson Feel free to add a new commitfest entry when you
have addressed the feedback.
Regards,
Vignesh
On 12/20/24 8:24 PM, Jeff Davis wrote:
On Fri, 2024-12-20 at 06:20 +0100, Andreas Karlsson wrote:
SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
"sv-SE-x-icu") FROM generate_series(1, 1000000) i);master: ~540 ms
Patched: ~460 ms
glibc: ~410 msIt looks like you are opening and closing the UCaseMap object each
time. Why not save it in pg_locale_t? That should speed it up even more
and hopefully beat libc.
Fixed. New benchmarks are:
SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
"sv-SE-x-icu") FROM generate_series(1, 1000000) i);
master: ~570 ms
Patched: ~340 ms
glibc: ~400 ms
So it does indeed seem like we got a further speedup and now are faster
than glibc.
Also, to support older ICU versions consistently, we need to fix up the
locale name to support "und"; cf. pg_ucol_open(). Perhaps factor out
that logic?
Fixed.
Andreas
Attachments:
v2-0001-Use-optimized-versions-of-ICU-case-conversion-for.patchtext/x-patch; charset=UTF-8; name=v2-0001-Use-optimized-versions-of-ICU-case-conversion-for.patchDownload+164-92
Hi Andreas,
On the mailing list, I've noticed this patch. I tested its functionality and it works really well. I have a few minor, non-critical comments to share.
In the `pg_ucasemap_open` function, the error message `casemap lookup failed:` doesn't seem ideal. This is because we're opening the `UCaseMap` here, rather than performing a "lookup" operation.
In the comment `Additional makes sure we get the right options for case folding.`, the word Additional seems inappropriate — `Additionally` would be a better replacement.
--
Regards,
Man Zeng
www.openhalo.org
On 12/31/25 3:36 AM, zengman wrote:
On the mailing list, I've noticed this patch. I tested its functionality and it works really well. I have a few minor, non-critical comments to share.
Thanks for trying it out!
In the `pg_ucasemap_open` function, the error message `casemap lookup failed:` doesn't seem ideal. This is because we're opening the `UCaseMap` here, rather than performing a "lookup" operation.
Fixed.
In the comment `Additional makes sure we get the right options for case folding.`, the word Additional seems inappropriate — `Additionally` would be a better replacement.
Fixed.
Andreas
Attachments:
v3-0001-Use-optimized-versions-of-ICU-case-conversion-for.patchtext/x-patch; charset=UTF-8; name=v3-0001-Use-optimized-versions-of-ICU-case-conversion-for.patchDownload+167-92
Hi,
Here is a version 4 of the patch which uses the fact that we have method
tables to remove one level of indirection. I am not sure the extra lines
of codes are worth it but on the other hand despite 40 more lines the
code became easier to read to me. What do you think?
Andreas
Attachments:
v4-0001-Use-optimized-versions-of-ICU-case-conversion-for.patchtext/x-patch; charset=UTF-8; name=v4-0001-Use-optimized-versions-of-ICU-case-conversion-for.patchDownload+208-95
Here is a version 4 of the patch which uses the fact that we have method
tables to remove one level of indirection. I am not sure the extra lines
of codes are worth it but on the other hand despite 40 more lines the
code became easier to read to me. What do you think?
I don't have any major objections, but I noticed a few minor details that might need a bit more tweaking.
`signficant` -> `significant`
`realtively` -> `relatively`
`if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))` -> `if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)`
--
Regards,
Man Zeng
www.openhalo.org
On 1/3/26 4:05 AM, zengman wrote:
I don't have any major objections, but I noticed a few minor details that might need a bit more tweaking.
`signficant` -> `significant`
`realtively` -> `relatively`
`if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))` -> `if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)`
Fixed, thanks!
Andreas
Attachments:
v5-0001-Use-optimized-versions-of-ICU-case-conversion-for.patchtext/x-patch; charset=UTF-8; name=v5-0001-Use-optimized-versions-of-ICU-case-conversion-for.patchDownload+208-95
On Sat, 2026-01-03 at 04:12 +0100, Andreas Karlsson wrote:
On 1/3/26 4:05 AM, zengman wrote:
I don't have any major objections, but I noticed a few minor
details that might need a bit more tweaking.`signficant` -> `significant`
`realtively` -> `relatively`
`if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))` ->
`if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)`Fixed, thanks!
Committed, thank you!
Regards,
Jeff Davis
Hello Jeff,
07.01.2026 00:10, Jeff Davis wrote:
Committed, thank you!
I've discovered that starting from c4ff35f10, the following query:
CREATE COLLATION c (provider = icu, locale = 'icu_something');
makes asan detect (maybe dubious, but still..) stack-buffer-overflow:
==21963==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffd386d4e63 at pc 0x650cd7972a76 bp 0x7ffd386d4e00
sp 0x7ffd386d45a8
...
Address 0x7ffd386d4e63 is located in stack of thread T0 at offset 67 in frame
#0 0x650cd86962ef in foldcase_options (.../usr/local/pgsql/bin/postgres+0x12322ef) (BuildId:
e441a9634858193e7358e5901e7948606ff5b1b1)
This frame has 2 object(s):
[48, 52) 'status' (line 993)
[64, 67) 'lang' (line 992) <== Memory access at offset 67 overflows this variable
I use a build made with:
CC=gcc-13 CPPFLAGS="-fsanitize=address" LDFLAGS="-fsanitize=address -static-libasan" ./configure --with-icu ...
Could you please have a look?
Best regards,
Alexander
On 3/12/26 5:00 AM, Alexander Lakhin wrote:
I've discovered that starting from c4ff35f10, the following query:
CREATE COLLATION c (provider = icu, locale = 'icu_something');makes asan detect (maybe dubious, but still..) stack-buffer-overflow:
==21963==ERROR: AddressSanitizer: stack-buffer-overflow on address
0x7ffd386d4e63 at pc 0x650cd7972a76 bp 0x7ffd386d4e00 sp 0x7ffd386d45a8
...
Address 0x7ffd386d4e63 is located in stack of thread T0 at offset 67 in
frame
#0 0x650cd86962ef in foldcase_options (.../usr/local/pgsql/bin/
postgres+0x12322ef) (BuildId: e441a9634858193e7358e5901e7948606ff5b1b1)This frame has 2 object(s):
[48, 52) 'status' (line 993)
[64, 67) 'lang' (line 992) <== Memory access at offset 67 overflows
this variableI use a build made with:
CC=gcc-13 CPPFLAGS="-fsanitize=address" LDFLAGS="-fsanitize=address -
static-libasan" ./configure --with-icu ...Could you please have a look?
Thanks for finding this!
Interestingly this bug seems like it would be there even before my
patch, but maybe something I did made it when moving code around made it
possible or easier to trigger. As far as I can tell the issue is that
uloc_getLanguage(locale, lang, 3, &status);
will populate lang with a string which is not zero terminated if the
language is 3 or more characters, e.g. "und". And for some reason which
I am not entirely strcmp("tr", {'u','n','d'}) can cause an overflow.
Maybe due to some optimization?
My proposed fix is that we allocate a ULOC_LANG_CAPACITY buffer for the
language like we do in fix_icu_locale_str() instead of trying to be
clever. An alternative would be to use strncmp("tr", lang, 3) but that
seems too clever for my taste in something which is not performance
critical. A third option would be to check for
U_STRING_NOT_TERMINATED_WARNING but I think that would just be
unnecessarily convoluted code.
I have attached my proposed fix.
Andreas
Attachments:
v1-0001-Fix-overrun-when-comparing-with-unterminated-ICU-.patchtext/x-patch; charset=UTF-8; name=v1-0001-Fix-overrun-when-comparing-with-unterminated-ICU-.patchDownload+2-3
On 1 April 2026 02:46:23 CEST, Andreas Karlsson <andreas@proxel.se> wrote:
My proposed fix is that we allocate a ULOC_LANG_CAPACITY buffer for the language like we do in fix_icu_locale_str() instead of trying to be clever. An alternative would be to use strncmp("tr", lang, 3) but that seems too clever for my taste in something which is not performance critical. A third option would be to check for U_STRING_NOT_TERMINATED_WARNING but I think that would just be unnecessarily convoluted code.
I have attached my proposed fix.
Since it is likely I introduced or at least exposed this bug somehow I am adding this to the open items for PG 19.
Andreas
On Wed, 2026-04-01 at 02:46 +0200, Andreas Karlsson wrote:
On 3/12/26 5:00 AM, Alexander Lakhin wrote:
I've discovered that starting from c4ff35f10, the following query:
CREATE COLLATION c (provider = icu, locale = 'icu_something');makes asan detect (maybe dubious, but still..) stack-buffer-
overflow:
==21963==ERROR: AddressSanitizer: stack-buffer-overflow on addressMy proposed fix is that we allocate a ULOC_LANG_CAPACITY buffer for
the
language like we do in fix_icu_locale_str() instead of trying to be
clever.
Thank you both!
Committed with minor revisions:
* also check the status code, just to be sure
* backport to 18 where the original code was introduced
Regards,
Jeff Davis
On 4/13/26 20:40, Jeff Davis wrote:
Thank you both!
Thanks!
Committed with minor revisions:
* also check the status code, just to be sure
If we do that shouldn't we also do the same in the other callsites in
initdb.c uloc_getLanguage()? Maybe something like the attached. Also I
wonder if maybe other ICU functions have the same risk.
Andreas
Attachments:
v1-0001-Always-check-for-untermianted-strings-when-callin.patchtext/x-patch; charset=UTF-8; name=v1-0001-Always-check-for-untermianted-strings-when-callin.patchDownload+2-3
On 4/14/26 02:20, Andreas Karlsson wrote:
If we do that shouldn't we also do the same in the other callsites in
initdb.c uloc_getLanguage()? Maybe something like the attached. Also I
wonder if maybe other ICU functions have the same risk.
Now attached without a stupid typo.
Andreas
Attachments:
v2-0001-Always-check-for-untermianted-strings-when-callin.patchtext/x-patch; charset=UTF-8; name=v2-0001-Always-check-for-untermianted-strings-when-callin.patchDownload+2-3
On Tue, 2026-04-14 at 02:28 +0200, Andreas Karlsson wrote:
On 4/14/26 02:20, Andreas Karlsson wrote:
If we do that shouldn't we also do the same in the other callsites
in
initdb.c uloc_getLanguage()? Maybe something like the attached.
Also I
wonder if maybe other ICU functions have the same risk.
Committed, thank you.
Regards,
Jeff Davis