speed up unicode decomposition and recomposition
After I committed the optimization for the unicode normalization quick check,
Michael Paquier suggested I might do the same for decomposition as well. I
wrote:
I'll
do some performance testing soon. Note that a 25kB increase in size could
be present in frontend binaries as well in this case. While looking at
decomposition, I noticed that recomposition does a linear search through
all 6600+ entries, although it seems only about 800 are valid for that.
That could be optimized as well now, since with hashing we have more
flexibility in the ordering and can put the recomp-valid entries in front.
I'm not yet sure if it's worth the additional complexity. I'll take a look
and start a new thread.
The attached patch uses a perfect hash for codepoint decomposition, and for
recomposing reduces the linear search from 6604 entries to 942.
The performance is very nice, and if I'd known better I would have done
this first, since the decomp array is as big as the two quick check arrays
put together:
Normalize, decomp only
select count(normalize(t, NFD)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;
master patch
887ms 231ms
select count(normalize(t, NFD)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3
+ 1) as t from
generate_series(1,100000) as i
) s;
master patch
1110ms 208ms
Normalize, decomp+recomp (note: 100x less data)
select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,1000) as i
) s;
master patch
194ms 50.6ms
select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3
+ 1) as t from
generate_series(1,1000) as i
) s;
master patch
137ms 39.4ms
Quick check is another 2x faster on top of previous gains, since it looks up
the canonical combining class via the decomposition array:
-- all chars are quickcheck YES
select count(*) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s
where t is NFC normalized;
master patch
296ms 131ms
Some other considerations:
- As I alluded above, this adds ~26kB to libpq because of SASLPrep. Since
the decomp array was reordered to optimize linear search, it can no longer
be used for binary search. It's possible to build two arrays, one for
frontend and one for backend, but that's additional complexity. We could
also force frontend to do a linear search all the time, but that seems
foolish. I haven't checked if it's possible to exclude the hash from
backend's libpq.
- I could split out the two approaches into separate patches, but it'd be
rather messy.
I'll add a CF entry for this.
--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
v1-0001-Optimize-unicode-decomposition-and-recomposition.patch
John Naylor <john.naylor@enterprisedb.com> writes:
Some other considerations:
- As I alluded above, this adds ~26kB to libpq because of SASLPrep. Since
the decomp array was reordered to optimize linear search, it can no longer
be used for binary search. It's possible to build two arrays, one for
frontend and one for backend, but that's additional complexity. We could
also force frontend to do a linear search all the time, but that seems
foolish. I haven't checked if it's possible to exclude the hash from
backend's libpq.
IIUC, the only place libpq uses this is to process a password-sized string
or two during connection establishment. It seems quite silly to add
26kB in order to make that faster. Seems like a nice speedup on the
backend side, but I'd vote for keeping the frontend as-is.
regards, tom lane
On Wed, Oct 14, 2020 at 01:06:40PM -0400, Tom Lane wrote:
John Naylor <john.naylor@enterprisedb.com> writes:
Some other considerations:
- As I alluded above, this adds ~26kB to libpq because of SASLPrep. Since
the decomp array was reordered to optimize linear search, it can no longer
be used for binary search. It's possible to build two arrays, one for
frontend and one for backend, but that's additional complexity. We could
also force frontend to do a linear search all the time, but that seems
foolish. I haven't checked if it's possible to exclude the hash from
backend's libpq.IIUC, the only place libpq uses this is to process a password-sized string
or two during connection establishment. It seems quite silly to add
26kB in order to make that faster. Seems like a nice speedup on the
backend side, but I'd vote for keeping the frontend as-is.
Agreed. Let's only use the perfect hash in the backend. It would be
nice to avoid an extra generation of the decomposition table for that,
and a table ordered by codepoints is easier to look at. How much of a
performance impact do you think there would be if we don't use the
most-optimized decomposition table for the linear search?
--
Michael
On Wed, Oct 14, 2020 at 8:25 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Oct 14, 2020 at 01:06:40PM -0400, Tom Lane wrote:
IIUC, the only place libpq uses this is to process a password-sized
string
or two during connection establishment. It seems quite silly to add
26kB in order to make that faster. Seems like a nice speedup on the
backend side, but I'd vote for keeping the frontend as-is.
Agreed. Let's only use the perfect hash in the backend. It would be
nice to avoid an extra generation of the decomposition table for that,
and a table ordered by codepoints is easier to look at. How much of a
performance impact do you think there would be if we don't use the
most-optimized decomposition table for the linear search?
With those points in mind and thinking more broadly, I'd like to try harder
on recomposition. Even several times faster, recomposition is still orders
of magnitude slower than ICU, as measured by Daniel Verite [1]. I only did
it this way because I couldn't think of how to do the inverse lookup with a
hash. But I think if we constructed the hash key like
pg_hton64((code1 << 32) | code2)
and on the Perl side do something like
pack('N',$code1) . pack('N',$code2)
that might work. Or something that looks more like the C side. And make
sure to use the lowest codepoint for the result. That way, we can still
keep the large decomp array ordered, making it easier to keep the current
implementation in the front end, and hopefully getting even better
performance in the backend.
[1]: /messages/by-id/2c5e8df9-43b8-41fa-88e6-286e8634f00a@manitou-mail.org
--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
John Naylor <john.naylor@enterprisedb.com> writes:
With those points in mind and thinking more broadly, I'd like to try harder
on recomposition. Even several times faster, recomposition is still orders
of magnitude slower than ICU, as measured by Daniel Verite [1].
Huh. Has anyone looked into how they do it?
regards, tom lane
At Wed, 14 Oct 2020 23:06:28 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
John Naylor <john.naylor@enterprisedb.com> writes:
With those points in mind and thinking more broadly, I'd like to try harder
on recomposition. Even several times faster, recomposition is still orders
of magnitude slower than ICU, as measured by Daniel Verite [1].
Huh. Has anyone looked into how they do it?
I'm not sure this is the reason, but it could be. It uses separate
tables for decomposition and composition, pointed to from a trie?
That table is used after trying algorithmic decomposition/composition
for, for example, Hangul. I didn't look into it any further, but just for
information, icu4c/source/common/normalizer2impl.cpp seems to be doing that.
For example, icu4c/source/common/norm2_nfc_data.h defines the static data.
icu4c/source/common/normalizer2impl.h:244 points to a design document
on normalization.
http://site.icu-project.org/design/normalization/custom
Old and New Implementation Details
The old normalization data format (unorm.icu, ca. 2001..2009) uses
three data structures for normalization: A trie for looking up 32-bit
values for every code point, a 16-bit-unit array with decompositions
and some other data, and a composition table (16-bit-unit array,
linear search list per starter). The data is combined for all 4
standard normalization forms: NFC, NFD, NFKC and NFKD.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Oct 15, 2020 at 1:30 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
At Wed, 14 Oct 2020 23:06:28 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
John Naylor <john.naylor@enterprisedb.com> writes:
With those points in mind and thinking more broadly, I'd like to try
harder
on recomposition. Even several times faster, recomposition is still
orders
of magnitude slower than ICU, as measured by Daniel Verite [1].
Huh. Has anyone looked into how they do it?
I'm not sure this is the reason, but it could be. It uses separate
tables for decomposition and composition, pointed to from a trie?
I think I've seen a trie recommended somewhere, maybe the official website.
That said, I was able to get the hash working for recomposition (split into
a separate patch, and both of them now leave frontend alone), and I'm
pleased to say it's 50-75x faster than linear search in simple tests. I'd
be curious how it compares to ICU now. Perhaps Daniel Verite would be
interested in testing again? (CC'd)
select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;
master patch
18800ms 257ms
select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3
+ 1) as t from
generate_series(1,100000) as i
) s;
master patch
13000ms 254ms
--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Oct 15, 2020 at 01:59:38PM -0400, John Naylor wrote:
I think I've seen a trie recommended somewhere, maybe the official website.
That said, I was able to get the hash working for recomposition (split into
a separate patch, and both of them now leave frontend alone), and I'm
pleased to say it's 50-75x faster than linear search in simple tests. I'd
be curious how it compares to ICU now. Perhaps Daniel Verite would be
interested in testing again? (CC'd)
Yeah, that would be interesting to compare. Now the gains proposed by
this patch are already a good step forward, so I don't think that it
should be a blocker for a solution we have at hand as the numbers
speak by themselves here. So if something better gets proposed, we
could always change the decomposition and recomposition logic as
needed.
select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;
master patch
18800ms 257ms
My environment was showing HEAD as being a bit faster with 15s, while
the patch gets "only" down to 290~300ms (compiled with -O2, as I guess
you did). Nice.
+ # Then the second
+ return -1 if $a2 < $b2;
+ return 1 if $a2 > $b2;
Should say "second code point" here?
+ hashkey = pg_hton64(((uint64) start << 32) | (uint64) code);
+ h = recompinfo.hash(&hashkey);
This choice should be documented, and most likely we should have
comments on the perl and C sides to keep track of the relationship
between the two.
The binary sizes of libpgcommon_shlib.a and libpgcommon.a change
because Decomp_hash_func() gets included, impacting libpq.
Structurally, wouldn't it be better to move this part into its own,
backend-only, header? It could be possible to paint the difference
with some ifdef FRONTEND of course, or just keep things as they are
because this can be useful for some out-of-core frontend tool? But if
we keep that as a separate header then any C part can decide to
include it or not, so frontend tools could also make this choice.
Note that we don't include unicode_normprops_table.h for frontends in
unicode_norm.c, but that is the case for unicode_norm_table.h.
--
Michael
John Naylor wrote:
I'd be curious how it compares to ICU now
I've made another run of the test in [1] with your v2 patches
from this thread against icu_ext built with ICU-67.1.
The results show the times in milliseconds to process
about 10 million short strings:
operation | unpatched | patched | icu_ext
------------+-----------+---------+---------
nfc check | 7968 | 5989 | 4076
nfc conv | 700894 | 15163 | 6808
nfd check | 16399 | 9852 | 3849
nfd conv | 17391 | 10916 | 6758
nfkc check | 8259 | 6092 | 4301
nfkc conv | 700241 | 15354 | 7034
nfkd check | 16585 | 10049 | 4038
nfkd conv | 17587 | 11109 | 7086
The ICU implementation still wins by a large margin, but
the improvements brought by the patch are gorgeous,
especially for the conversion to NFC/NFKC.
This test ran on a slower machine than what I used for [1], so
that's why all queries take longer.
For the two queries upthread, I get this:
1)
select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;
count
--------
100000
(1 row)
Time: 371.043 ms
VS ICU:
select count(icu_normalize(t, 'NFC')) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;
count
--------
100000
(1 row)
Time: 125.809 ms
2)
select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3
+ 1) as t from
generate_series(1,100000) as i
) s;
count
--------
100000
(1 row)
Time: 428.214 ms
VS ICU:
select count(icu_normalize(t, 'NFC')) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3
+ 1) as t from
generate_series(1,100000) as i
) s;
count
--------
100000
(1 row)
Time: 147.713 ms
[1]: /messages/by-id/2c5e8df9-43b8-41fa-88e6-286e8634f00a@manitou-mail.org
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite
On Thu, Oct 15, 2020 at 11:32 PM Michael Paquier <michael@paquier.xyz>
wrote:
The binary sizes of libpgcommon_shlib.a and libpgcommon.a change
because Decomp_hash_func() gets included, impacting libpq.
I don't see any difference on gcc/Linux in those two files, nor in
unicode_norm_shlib.o -- I do see a difference in unicode_norm_srv.o as
expected. Could it depend on the compiler?
--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Oct 16, 2020 at 2:08 PM Daniel Verite <daniel@manitou-mail.org>
wrote:
John Naylor wrote:
I'd be curious how it compares to ICU now
I've made another run of the test in [1] with your v2 patches
from this thread against icu_ext built with ICU-67.1.
The results show the times in milliseconds to process
about 10 million short strings:
Thanks for testing!
--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Oct 19, 2020 at 10:34:33AM -0400, John Naylor wrote:
I don't see any difference on gcc/Linux in those two files, nor in
unicode_norm_shlib.o -- I do see a difference in unicode_norm_srv.o as
expected. Could it depend on the compiler?
Hmm. My guess is that you don't have --enable-debug in your set of
configure options? It is not unusual to have this one enabled for GCC
even on production systems, and the size of the libs is impacted in
this case with your patch.
--
Michael
On Tue, Oct 20, 2020 at 3:22 AM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Oct 19, 2020 at 10:34:33AM -0400, John Naylor wrote:
I don't see any difference on gcc/Linux in those two files, nor in
unicode_norm_shlib.o -- I do see a difference in unicode_norm_srv.o as
expected. Could it depend on the compiler?
Hmm. My guess is that you don't have --enable-debug in your set of
configure options? It is not unusual to have this one enabled for GCC
even on production systems, and the size of the libs is impacted in
this case with your patch.
I've confirmed that. How about a new header unicode_norm_hashfunc.h which
would include unicode_norm_table.h at the top. In unicode.c, we can include
one of these depending on frontend or backend.
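A minimal sketch of what that selection could look like in unicode_norm.c, assuming the header names proposed above (the include paths and FRONTEND macro usage are illustrative, following the project's usual conventions):

```c
/*
 * Hypothetical sketch: pick the table representation per build target.
 * The backend-only unicode_norm_hashfunc.h would itself include
 * unicode_norm_table.h at the top, so backend code sees both the
 * codepoint-ordered table and the generated hash functions, while
 * frontend code (libpq/SASLprep) keeps only the smaller
 * binary-searchable table.
 */
#ifndef FRONTEND
#include "common/unicode_norm_hashfunc.h"
#else
#include "common/unicode_norm_table.h"
#endif
```

This keeps the size cost of the hash functions out of frontend binaries without maintaining two differently ordered copies of the decomposition table.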
--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Oct 20, 2020 at 08:03:12AM -0400, John Naylor wrote:
I've confirmed that. How about a new header unicode_norm_hashfunc.h which
would include unicode_norm_table.h at the top. In unicode.c, we can include
one of these depending on frontend or backend.
Sounds good to me. Looking at the code, I would just generate the
second file within generate-unicode_norm_table.pl.
--
Michael
Attached v3 addressing review points below:
On Thu, Oct 15, 2020 at 11:32 PM Michael Paquier <michael@paquier.xyz>
wrote:
+ # Then the second
+ return -1 if $a2 < $b2;
+ return 1 if $a2 > $b2;
Should say "second code point" here?
Done. Also changed the tiebreaker to the composed codepoint. Beforehand, it
was the index into DecompMain[], which is only equivalent if the list is in
order (it is but we don't want correctness to depend on that), and not very
clear.
+ hashkey = pg_hton64(((uint64) start << 32) | (uint64) code);
+ h = recompinfo.hash(&hashkey);
This choice should be documented, and most likely we should have
comments on the perl and C sides to keep track of the relationship
between the two.
Done.
<separate headers>
Done.
Other cosmetic changes:
- format recomp array comments like /* U+0045+032D -> U+1E18 */
- make sure to comment #endif's that are far from the #if
- small whitespace fixes
Note: for the new header I simply adapted from unicode_norm_table.h the
choice of "There is deliberately not an #ifndef PG_UNICODE_NORM_TABLE_H
here", although I must confess I'm not sure what the purpose of that is, in
this case.
--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
There was a mistake in v3 with pgindent/exclude_file_patterns, fixed in v4.
--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Oct 21, 2020 at 07:35:12PM -0400, John Naylor wrote:
There was a mistake in v3 with pgindent/exclude_file_patterns, fixed in v4.
Thanks for the updated version, that was fast. I have found a couple
of places that needed to be adjusted: the comment at the top of
generate-unicode_norm_table.pl and some other comments, an incorrect
include in the new headers, and Perl indentation that was off (we use
perltidy v20170521; see the README in src/tools/pgindent).
Except that, this looks good to me. Attached is the updated version
with all my tweaks, that I would like to commit. If there are any
comments, please feel free of course.
--
Michael
Attachments:
unicode-derecomp-v5.patch
On Thu, Oct 22, 2020 at 12:34 AM Michael Paquier <michael@paquier.xyz>
wrote:
Thanks for the updated version, that was fast. I have found a couple
of places that needed to be adjusted, like the comment at the top of
generate-unicode_norm_table.pl or some comments, an incorrect include
in the new headers and the indentation was not right in perl (we use
perltidy v20170521, see the README in src/tools/pgindent).Except that, this looks good to me. Attached is the updated version
with all my tweaks, that I would like to commit. If there are any
comments, please feel free of course.
Looks good to me.
--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Oct 22, 2020 at 05:50:52AM -0400, John Naylor wrote:
Looks good to me.
Thanks. Committed, then. Great work!
--
Michael
On Thu, Oct 22, 2020 at 10:11 PM Michael Paquier <michael@paquier.xyz>
wrote:
On Thu, Oct 22, 2020 at 05:50:52AM -0400, John Naylor wrote:
Looks good to me.
Thanks. Committed, then. Great work!
Thank you!
--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company