Unicode normalization SQL functions
Here are patches to add support for Unicode normalization to SQL, per the
SQL standard:
normalize($string [, form])
$string is [form] normalized
(comment about silly SQL syntax here)
We already have all the infrastructure for Unicode normalization for the
SASLprep functionality. The first patch extends the internal APIs to
support all four normal forms instead of only NFKC used by SASLprep.
The second patch adds the SQL layer on top of it.
This could be used to preprocess or check strings before using them with
deterministic collations or locale implementations that don't deal with
non-NFC data correctly, perhaps using triggers, generated columns, or
domains. The NFKC and NFKD normalizations could also be used for
general data cleaning, similar to what SASLprep does.
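For readers unfamiliar with the normal forms, the intended semantics of
normalize() and IS NORMALIZED can be sketched with Python's stdlib
unicodedata module (an illustration of the concept only, not the patch's
code):

```python
# Illustrative only: Python's unicodedata mirrors the semantics of the
# proposed SQL normalize($string, form) and "$string IS form NORMALIZED".
import unicodedata

s = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT (decomposed form)

# NFC recomposes the pair into the single precomposed codepoint U+00E9.
nfc = unicodedata.normalize("NFC", s)
assert nfc == "\u00e9"

# The "is normalized" test: a string is normalized iff it equals its
# own normalization in that form.
assert not unicodedata.is_normalized("NFC", s)
assert unicodedata.is_normalized("NFD", s)
```

This also shows why non-NFC data trips up deterministic collations: the
two spellings above compare as distinct byte strings despite being
canonically equivalent.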
As a future idea, I think we could also hook Unicode normalization into
the protocol-level encoding conversion.
Also, there is a way to optimize the "is normalized" test for common
cases, described in UTR #15. For that we'll need an additional data
file from Unicode. In order to simplify that, I would like my patch
"Add support for automatically updating Unicode derived files"
integrated first.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v1-0001-Add-support-for-other-normal-forms-to-Unicode-nor.patch (text/plain)
v1-0002-Add-SQL-functions-for-Unicode-normalization.patch (text/plain)
Peter Eisentraut wrote:
Also, there is a way to optimize the "is normalized" test for common
cases, described in UTR #15. For that we'll need an additional data
file from Unicode. In order to simplify that, I would like my patch
"Add support for automatically updating Unicode derived files"
integrated first.
Would that explain that the NFC/NFKC normalization and "is normalized"
check seem abnormally slow with the current patch, or should
it be regarded independently of the other patch?
For instance, testing 10000 short ASCII strings:
postgres=# select count(*) from (select md5(i::text) as t from
generate_series(1,10000) as i) s where t is nfc normalized ;
count
-------
10000
(1 row)
Time: 2573,859 ms (00:02,574)
By comparison, the NFD/NFKD case is faster by two orders of magnitude:
postgres=# select count(*) from (select md5(i::text) as t from
generate_series(1,10000) as i) s where t is nfd normalized ;
count
-------
10000
(1 row)
Time: 29,962 ms
Although NFC/NFKC has a recomposition step that NFD/NFKD
doesn't have, such a difference is surprising.
I've tried an alternative implementation based on ICU's
unorm2_isNormalized() /unorm2_normalize() functions (which I'm
currently adding to the icu_ext extension to be exposed in SQL).
With these, the 4 normal forms are in the 20ms ballpark with the above
test case, without a clear difference between composed and decomposed
forms.
Independently of the performance, I've compared the results
of the ICU implementation vs this patch on large series of strings
with all normal forms and could not find any difference.
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On 2020-01-06 17:00, Daniel Verite wrote:
Peter Eisentraut wrote:
Also, there is a way to optimize the "is normalized" test for common
cases, described in UTR #15. For that we'll need an additional data
file from Unicode. In order to simplify that, I would like my patch
"Add support for automatically updating Unicode derived files"
integrated first.
Would that explain that the NFC/NFKC normalization and "is normalized"
check seem abnormally slow with the current patch, or should
it be regarded independently of the other patch?
That's unrelated.
For instance, testing 10000 short ASCII strings:
postgres=# select count(*) from (select md5(i::text) as t from
generate_series(1,10000) as i) s where t is nfc normalized ;
count
-------
10000
(1 row)
Time: 2573,859 ms (00:02,574)
By comparison, the NFD/NFKD case is faster by two orders of magnitude:
postgres=# select count(*) from (select md5(i::text) as t from
generate_series(1,10000) as i) s where t is nfd normalized ;
count
-------
10000
(1 row)
Time: 29,962 ms
Although NFC/NFKC has a recomposition step that NFD/NFKD
doesn't have, such a difference is surprising.
It's very likely that this is because the recomposition calls
recompose_code() which does a sequential scan of UnicodeDecompMain for
each character. To optimize that, we should probably build a bespoke
reverse mapping table that can be accessed more efficiently.
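The suggested fix can be sketched in miniature (names and the toy table
below are hypothetical, not the actual PostgreSQL data structures): the
decomposition table maps composed codepoints to pairs, so recomposition
currently has to scan it linearly, while a prebuilt reverse mapping makes
each lookup constant-time.

```python
# Toy slice of a decomposition table: composed -> (first, second).
# The real UnicodeDecompMain array has thousands of entries.
decomp_main = {
    0x00E9: (0x0065, 0x0301),  # e-acute = e + COMBINING ACUTE ACCENT
    0x00E8: (0x0065, 0x0300),  # e-grave = e + COMBINING GRAVE ACCENT
}

def recompose_linear(first, second):
    # O(n) scan per character pair, as recompose_code() does today.
    for composed, pair in decomp_main.items():
        if pair == (first, second):
            return composed
    return None

# Build the reverse (composition) table once; lookups become O(1).
recomp_table = {pair: composed for composed, pair in decomp_main.items()}

assert recompose_linear(0x0065, 0x0301) == 0x00E9
assert recomp_table[(0x0065, 0x0301)] == 0x00E9
```

In C the reverse table would naturally be a sorted array over (first,
second) keys searched with bsearch, or a generated hash table.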
I've tried an alternative implementation based on ICU's
unorm2_isNormalized() /unorm2_normalize() functions (which I'm
currently adding to the icu_ext extension to be exposed in SQL).
With these, the 4 normal forms are in the 20ms ballpark with the above
test case, without a clear difference between composed and decomposed
forms.
That's good feedback.
Independently of the performance, I've compared the results
of the ICU implementation vs this patch on large series of strings
with all normal forms and could not find any difference.
And that too.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Here is an updated patch set that now also implements the "quick check"
algorithm from UTR #15 for making IS NORMALIZED very fast in many cases,
which I had mentioned earlier in the thread.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v2-0001-Add-support-for-other-normal-forms-to-Unicode-nor.patch (text/plain)
v2-0002-Add-SQL-functions-for-Unicode-normalization.patch (text/plain)
Peter Eisentraut wrote:
Here is an updated patch set that now also implements the "quick check"
algorithm from UTR #15 for making IS NORMALIZED very fast in many cases,
which I had mentioned earlier in the thread.
I found a bug in unicode_is_normalized_quickcheck() which is
triggered when the last codepoint of the string is beyond
U+10000. On encountering it, it does:
+ if (is_supplementary_codepoint(ch))
+ p++;
When ch is the last codepoint, it makes p point to
the ending zero, but the subsequent p++ done by
the for loop makes it miss the exit and go into over-reading.
But anyway, what's the reason for skipping the codepoint
following a codepoint outside of the BMP?
I've figured that it comes from porting the Java code in UAX#15:
public int quickCheck(String source) {
    short lastCanonicalClass = 0;
    int result = YES;
    for (int i = 0; i < source.length(); ++i) {
        int ch = source.codePointAt(i);
        if (Character.isSupplementaryCodePoint(ch)) ++i;
        short canonicalClass = getCanonicalClass(ch);
        if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
            return NO;
        }
        int check = isAllowed(ch);
        if (check == NO) return NO;
        if (check == MAYBE) result = MAYBE;
        lastCanonicalClass = canonicalClass;
    }
    return result;
}
source.length() is the length in UTF-16 code units, in which a surrogate
pair counts for 2. This would be why it does
if (Character.isSupplementaryCodePoint(ch)) ++i;
it's meant to skip the 2nd UTF-16 code of the pair.
As this does not apply to the 32-bit pg_wchar, I think the two lines above
in the C implementation should just be removed.
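The difference between the two representations can be demonstrated
directly (a Python illustration; Python strings, like pg_wchar arrays,
index by codepoint rather than by UTF-16 code unit):

```python
# Why the Java code increments i after a supplementary codepoint:
# Java strings are UTF-16, so a codepoint >= U+10000 occupies two code
# units (a surrogate pair), and the loop must skip the low surrogate.
# With 32-bit codepoints (like pg_wchar), each codepoint is one array
# element, so the extra skip reads past the element just processed.

ch = "\U0001F600"  # a codepoint outside the BMP

# One element in a codepoint-indexed representation...
assert len(ch) == 1
# ...but two 16-bit code units (4 bytes) when encoded as UTF-16.
assert len(ch.encode("utf-16-le")) == 4
```

This is why the `p++` is correct in the UTF-16 Java original but causes
over-reading when ported to a 0-terminated pg_wchar array.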
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On 2020-01-28 10:48, Daniel Verite wrote:
I found a bug in unicode_is_normalized_quickcheck() which is
triggered when the last codepoint of the string is beyond
U+10000. On encountering it, it does:
+ if (is_supplementary_codepoint(ch))
+ p++;
When ch is the last codepoint, it makes p point to
the ending zero, but the subsequent p++ done by
the for loop makes it miss the exit and go into over-reading.
But anyway, what's the reason for skipping the codepoint
following a codepoint outside of the BMP?
You're right, this didn't make any sense. Here is a new patch set with
that fixed.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v3-0001-Add-support-for-other-normal-forms-to-Unicode-nor.patch (text/plain)
v3-0002-Add-SQL-functions-for-Unicode-normalization.patch (text/plain)
On 1/28/20 9:21 PM, Peter Eisentraut wrote:
You're right, this didn't make any sense. Here is a new patch set with
that fixed.
Thanks for this patch. This is a feature which has been on my personal
todo list for a while and something which I have wished to have a couple
of times.
I took a quick look at the patch and here is some feedback:
A possible concern is increased binary size from the new tables for the
quickcheck but personally I think they are worth it.
A potential optimization would be to merge utf8_to_unicode() and
pg_utf_mblen() into one function in unicode_normalize_func() since
utf8_to_unicode() already knows the length of the character. Probably not
worth it though.
It feels a bit wasteful to measure output_size in
unicode_is_normalized() since unicode_normalize() actually already knows
the length of the buffer; it just does not return it.
A potential optimization for the normalized case would be to abort the
quick check on the first maybe and normalize from that point on only. If
I can find the time I might try this out and benchmark it.
Nitpick: "split/\s*;\s*/, $line" in generate-unicode_normprops_table.pl
should be "split /\s*;\s*/, $line".
What about using else if in the code below for clarity?
+ if (check == UNICODE_NORM_QC_NO)
+ return UNICODE_NORM_QC_NO;
+ if (check == UNICODE_NORM_QC_MAYBE)
+ result = UNICODE_NORM_QC_MAYBE;
Remove extra space in the line below.
+ else if (quickcheck == UNICODE_NORM_QC_NO )
Andreas
On Thu, Feb 13, 2020 at 01:23:41AM +0100, Andreas Karlsson wrote:
On 1/28/20 9:21 PM, Peter Eisentraut wrote:
You're right, this didn't make any sense. Here is a new patch set with
that fixed.
Thanks for this patch. This is a feature which has been on my personal todo
list for a while and something which I have wished to have a couple of
times.
(The size of the patch set may justify compressing it)
--
Michael
On 2020-02-13 01:23, Andreas Karlsson wrote:
A potential optimization would be to merge utf8_to_unicode() and
pg_utf_mblen() into one function in unicode_normalize_func() since
utf8_to_unicode() already knows the length of the character. Probably not
worth it though.
This would also require untangling the entire encoding API.
It feels a bit wasteful to measure output_size in
unicode_is_normalized() since unicode_normalize() actually already knows
the length of the buffer; it just does not return it.
Sure, but really most string APIs work like that. They surely know the
string length internally, but afterwards you often have to call strlen()
again.
A potential optimization for the normalized case would be to abort the
quick check on the first maybe and normalize from that point on only. If
I can find the time I might try this out and benchmark it.
Are you sure this would always be valid? The fact that this wasn't
mentioned in UTR #15 makes me suspicious.
Nitpick: "split/\s*;\s*/, $line" in generate-unicode_normprops_table.pl
should be "split /\s*;\s*/, $line".
done
What about using else if in the code below for clarity?
+ if (check == UNICODE_NORM_QC_NO)
+ return UNICODE_NORM_QC_NO;
+ if (check == UNICODE_NORM_QC_MAYBE)
+ result = UNICODE_NORM_QC_MAYBE;
done
Remove extra space in the line below.
+ else if (quickcheck == UNICODE_NORM_QC_NO )
I didn't find this in my local copy.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
I've checked the v3 patch against the results of the normalization
done by ICU [1] on my test data again, and they're identical
(as they were with v1; v2 had the bug discussed upthread, now fixed).
Concerning execution speed, there's an excessive CPU usage when
normalizing into NFC or NFKC. Looking at pre-existing code, it looks
like recompose_code() in unicode_norm.c looping over the
UnicodeDecompMain array might be very costly.
Another point is that the ICU-based implementation appears
to be significantly faster in all cases, which makes me wonder
why ICU builds should not just use ICU instead of the PG-core
implementation.
To illustrate this, here are the execution times reported by psql for
the queries below exercising the normalization code, both with the
functions provided by the patch and with the equivalent functions
implemented with ICU.
The dataset is ~10 million unique short strings
extracted from real data; each number is the median execution time in
millisecs over 10 successive runs with query parallelism off
(stddev in parentheses).
operation | core | icu
------------+--------------+-----------
nfc check | 4398 (20) | 3088 (27)
nfc conv | 771502 (414) | 5503 (19)
nfd check | 4510 (10) | 2898 (8)
nfd conv | 9102 (1) | 5569 (6)
nfkc check | 4825 (51) | 3273 (4)
nfkc conv | 772240 (340) | 5763 (8)
nfkd check | 4794 (4) | 3170 (39)
nfkd conv | 9229 (4) | 5824 (9)
The queries:
check w/core:
select count(*) from words where w is $NORM normalized;
conversion w/core:
select sum(length(normalize(w, $NORM))) from words;
check w/icu:
select count(*) from words where icu_is_normalized(w, '$NORM');
conversion w/icu:
select sum(length(icu_normalize(w, '$NORM'))) from words;
[1]: https://github.com/dverite/icu_ext/blob/master/icu_normalize.c
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
One nitpick:
Around this hunk:
- * unicode_normalize_kc - Normalize a Unicode string to NFKC form.
+ * unicode_normalize - Normalize a Unicode string to the specified form.
*
* The input is a 0-terminated array of codepoints.
*
@@ -304,8 +306,10 @@ decompose_code(pg_wchar code, pg_wchar **result, int
*current)
* string is palloc'd instead, and OOM is reported with ereport().
*/
The comment in full says:
/*
* unicode_normalize - Normalize a Unicode string to the specified form.
*
* The input is a 0-terminated array of codepoints.
*
* In frontend, returns a 0-terminated array of codepoints, allocated with
* malloc. Or NULL if we run out of memory. In frontend, the returned
* string is palloc'd instead, and OOM is reported with ereport().
*/
It looks like the 2nd occurrence of "frontend" was meant to be "backend".
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On 2020-02-17 20:14, Daniel Verite wrote:
The comment in full says:
/*
* unicode_normalize - Normalize a Unicode string to the specified form.
*
* The input is a 0-terminated array of codepoints.
*
* In frontend, returns a 0-terminated array of codepoints, allocated with
* malloc. Or NULL if we run out of memory. In frontend, the returned
* string is palloc'd instead, and OOM is reported with ereport().
*/
It looks like the 2nd occurrence of "frontend" was meant to be "backend".
This was a pre-existing problem, so I have fixed that separately.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-02-17 20:08, Daniel Verite wrote:
Concerning execution speed, there's an excessive CPU usage when
normalizing into NFC or NFKC. Looking at pre-existing code, it looks
like recompose_code() in unicode_norm.c looping over the
UnicodeDecompMain array might be very costly.
Yes, this is a known issue and I think room for future optimization work.
Another point is that the ICU-based implementation appears
to be significantly faster in all cases, which makes me wonder
why ICU builds should not just use ICU instead of the PG-core
implementation.
That would require linking libpq to ICU (for SASLprep), and in general
would either make ICU required or require maintaining multiple
implementations. I don't think we're there yet.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
What is the status of this patch set? I think we have nailed down the
behavior, but there were some concerns about certain performance
characteristics. Do people feel that those are required to be addressed
in this cycle?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 3/19/20 3:41 PM, Peter Eisentraut wrote:
What is the status of this patch set? I think we have nailed down the
behavior, but there were some concerns about certain performance
characteristics. Do people feel that those are required to be addressed
in this cycle?
Personally I would rather see it merged if the code is correct (which it
seems like it is from what I can tell) as the performance seems to be
good enough for it to be useful.
Unicode normalization is a feature which I have wished for, and at least
for my use cases the current implementation is more than fast enough.
Andreas
Peter Eisentraut wrote:
What is the status of this patch set? I think we have nailed down the
behavior, but there were some concerns about certain performance
characteristics. Do people feel that those are required to be addressed
in this cycle?
Not finding any other issue with v3 or objections in the thread,
I've set the status to Ready For Committer in the CF.
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On 2020-03-23 17:26, Daniel Verite wrote:
Peter Eisentraut wrote:
What is the status of this patch set? I think we have nailed down the
behavior, but there were some concerns about certain performance
characteristics. Do people feel that those are required to be addressed
in this cycle?
Not finding any other issue with v3 or objections in the thread,
I've set the status to Ready For Committer in the CF.
I have committed the 0001 patch.
Now I have some concerns about the size of the new table in
unicode_normprops_table.h, and the resulting binary size. At the very
least, we should probably make that #ifndef FRONTEND or something like
that so libpq isn't bloated by it unnecessarily. Perhaps there is a
better format for that table? Any ideas?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-03-24 10:20, Peter Eisentraut wrote:
Now I have some concerns about the size of the new table in
unicode_normprops_table.h, and the resulting binary size. At the very
least, we should probably make that #ifndef FRONTEND or something like
that so libpq isn't bloated by it unnecessarily. Perhaps there is a
better format for that table? Any ideas?
I have figured this out. New patch is attached.
First, I have added #ifndef FRONTEND, as mentioned above, so libpq isn't
bloated. Second, I have changed the lookup structure to a bitfield, so
each entry is only 32 bits instead of 64. Third, I have dropped the
quickcheck tables for the NFD and NFKD forms. Those are by far the
biggest tables, and you still get okay performance if you do the
normalization check the long way, since we don't need the recomposition
step on those cases, which is by far the slowest part. The main use
case of all of this, I expect, is to check for NFC normalization, so
it's okay if the other variants are not optimized to the same extent.
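The 64-to-32-bit reduction works because a Unicode codepoint needs at
most 21 bits (the maximum is U+10FFFF), leaving spare bits in the same
word for the quick-check property. A hypothetical packing (the actual
field layout in the patch may differ) looks like this:

```python
# Hypothetical 32-bit table entry: low 21 bits hold the codepoint,
# the next bits hold the quick-check value. Illustration only; the
# committed patch's exact layout may differ.

QC_YES, QC_NO, QC_MAYBE = 0, 1, 2

def pack(codepoint, qc):
    assert codepoint <= 0x10FFFF and qc <= 3
    return (qc << 21) | codepoint

def unpack(entry):
    return entry & 0x1FFFFF, entry >> 21

entry = pack(0x0301, QC_MAYBE)   # COMBINING ACUTE ACCENT: NFC_QC=Maybe
assert unpack(entry) == (0x0301, QC_MAYBE)
assert entry < 2**32             # fits in one 32-bit word
```

In C this would be a bitfield struct or explicit masking over a uint32
array, halving the table's contribution to the binary size.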
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v4-0001-Add-SQL-functions-for-Unicode-normalization.patch (text/plain)
On Thu, Mar 26, 2020 at 3:26 PM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
I have figured this out. New patch is attached.
First, I have added #ifndef FRONTEND, as mentioned above, so libpq isn't
bloated. Second, I have changed the lookup structure to a bitfield, so
each entry is only 32 bits instead of 64. Third, I have dropped the
quickcheck tables for the NFD and NFKD forms. Those are by far the
biggest tables, and you still get okay performance if you do the
normalization check the long way, since we don't need the recomposition
step on those cases, which is by far the slowest part. The main use
case of all of this, I expect, is to check for NFC normalization, so
it's okay if the other variants are not optimized to the same extent.
Reading the link cited in the patch
http://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms
"The data for the implementation of the isAllowed() call can be
accessed in memory with a hash table or a trie (see Section 14,
Implementation Notes); the latter will be the fastest."
We don't have a trie implementation in Postgres, but we do have a
perfect hash implementation. Doing that would bring the tables back to
64 bits per entry, but would likely be noticeably faster than binary
search. Since v4 has left out the biggest tables entirely, I think
this might be worth a look for the smaller tables remaining.
In the attached v5, when building the hash tables, we sort the code
points by NO/MAYBE, and store the index of the beginning of the NO
block:
MMMNNNNNNNNN
~~~^
That way we can tell a NO from a MAYBE by testing the result of the hash lookup.
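The partitioning trick can be sketched as follows (illustrative only; the
real patch uses a generated perfect hash over the actual Unicode property
data, and the codepoints below stand in for it):

```python
# Sort the table so all MAYBE codepoints precede all NO codepoints;
# then the lookup only needs the entry's index to classify it.

maybe_cps = [0x0300, 0x0301]   # e.g. combining marks with NFC_QC=Maybe
no_cps = [0x0340, 0x0341]      # e.g. singleton decompositions, NFC_QC=No

table = sorted(maybe_cps) + sorted(no_cps)
no_start = len(maybe_cps)      # index where the NO block begins

# A plain dict stands in for the perfect hash function here.
index = {cp: i for i, cp in enumerate(table)}

def quickcheck_class(cp):
    i = index.get(cp)
    if i is None:
        return "YES"           # not in the table at all: allowed
    return "NO" if i >= no_start else "MAYBE"

assert quickcheck_class(0x0301) == "MAYBE"
assert quickcheck_class(0x0341) == "NO"
assert quickcheck_class(0x0041) == "YES"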
Regression tests pass, but I haven't measured performance yet. I had
to fiddle with the hash seeds a bit to get the larger table to build.
Also, if we go with v4, I noticed the following test is present twice:
+SELECT "normalize"('abc', 'def'); -- run-time error
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v5-Add-SQL-functions-for-Unicode-normalization.patch (application/x-patch)
I wrote:
Regression tests pass, but I haven't measured performance yet.
Using a test similar to one upthread:
select count(*) from (select md5(i::text) as t from
generate_series(1,100000) as i) s where t is nfc normalized ;
I get (median of three)
v4 419ms
v5 310ms
with binary size
v4 HEAD + 33kB
v5 HEAD + 57kB
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services