[PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization

Started by Diego Frias, about 2 months ago, 4 messages, hackers

mail@dzfrias.dev

about 2 months ago

Hi hackers

I was browsing PostgreSQL’s Unicode normalization code and found an issue where the composition algorithm recognizes 0x11A7 as a T syllable and combines it with the preceding LV syllable. Per the Unicode specification:

TBase is set to one less than the beginning of the range of trailing consonants, which starts at U+11A8. TCount is set to one more than the number of trailing consonants relevant to the decomposition algorithm: (11C2₁₆ - 11A8₁₆ + 1) + 1.

In short, TCount actually counts 1 more than the number of T syllables; this is so s % TCount == 0 implies that s has no T syllable (because the 0th place represents the absence of a T syllable), where s is the s-index of a precomposed Hangul character. Anyway, since PostgreSQL recognizes 0x11A7 as a T syllable, the composition algorithm yields a nonsense character when 0x11A7 is put in the T position. See https://github.com/unicode-rs/unicode-normalization/blob/576ae0b1407dd14854876c93f1a348df0c19dffe/src/normalize.rs#L218 for a comment on this bug in Rust’s unicode-rs, and https://github.com/JuliaStrings/utf8proc/commit/0260ba56c81e5ef6f06c0804034a36284bcb8710 for a similar contribution I made to JuliaStrings/utf8proc a few months ago.
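To make the role of the "extra" slot concrete, here is a minimal sketch of the s-index arithmetic described above. The constants come from the Unicode core specification; the function names are hypothetical and are not taken from PostgreSQL's source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hangul constants from the Unicode core specification. */
#define SBASE  0xAC00u
#define SCOUNT 11172u    /* LCount * VCount * TCount = 19 * 21 * 28 */
#define TBASE  0x11A7u   /* one less than the first trailing consonant, U+11A8 */
#define TCOUNT 28u       /* 27 trailing consonants + 1 slot meaning "none" */

/* True if cp is a precomposed Hangul syllable (hypothetical helper). */
static bool
is_hangul_syllable(uint32_t cp)
{
	return cp >= SBASE && cp < SBASE + SCOUNT;
}

/*
 * Return the trailing consonant of a precomposed syllable, or 0 when there
 * is none: either cp is not a Hangul syllable, or its t-index is 0, which
 * is the slot reserved for "no trailing consonant" (hypothetical helper).
 */
static uint32_t
hangul_trailing_consonant(uint32_t cp)
{
	uint32_t	tindex;

	if (!is_hangul_syllable(cp))
		return 0;

	tindex = (cp - SBASE) % TCOUNT;
	return tindex == 0 ? 0 : TBASE + tindex;
}
```

With these definitions, U+AC00 has t-index 0 and therefore no trailing consonant, while U+AC01 maps back to U+11A8. No precomposed syllable ever maps to U+11A7 itself, which is why it must not be accepted in the T position.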

Let me know if this patch needs anything else. I can write a test for this, but it looks like the current testing setup in src/common/norm_test.c only runs the Unicode test suite and isn’t built for writing custom tests. If that is something of interest, though, I’m happy to add that to this patch.

Best,
Diego

Michael Paquier

michael@paquier.xyz

about 2 months ago

In reply to: Diego Frias (#1)

Re: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization

On Mon, Jun 01, 2026 at 11:38:32AM -0700, Diego Frias wrote:

In short, TCount actually counts 1 more than the number of T
syllables; this is so s % TCount == 0 implies that s has no T
syllable (because the 0th place represents the absence of a T
syllable), where s is the s-index of a precomposed Hangul
character. Anyway, since PostgreSQL recognizes 0x11A7 as a T
syllable, the composition algorithm yields a nonsense character when
0x11A7 is put in the T position.

Oops. Yes, including TBASE in the recomposition is incorrect, finding
your quote here (TBase is set to one less..):
https://unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59688

The character gets eaten by the normalization. Not good.
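As an illustration of how the character gets eaten, the following sketch contrasts an inclusive lower bound on the T range with an exclusive one. It is a simplified stand-in written for this thread, not the actual PostgreSQL code or the attached patch; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SBASE  0xAC00u
#define SCOUNT 11172u
#define TBASE  0x11A7u
#define TCOUNT 28u

/* True if cp is a precomposed LV syllable, i.e. its t-index is 0. */
static bool
is_lv_syllable(uint32_t cp)
{
	return cp >= SBASE && cp < SBASE + SCOUNT &&
		(cp - SBASE) % TCOUNT == 0;
}

/* Off-by-one variant: the range check admits TBASE (U+11A7) itself. */
static bool
compose_lv_t_buggy(uint32_t start, uint32_t code, uint32_t *result)
{
	if (is_lv_syllable(start) && code >= TBASE && code < TBASE + TCOUNT)
	{
		/* for code == TBASE the t-index is 0, so *result == start */
		*result = start + (code - TBASE);
		return true;
	}
	return false;
}

/* Corrected variant: the first valid trailing consonant is TBASE + 1. */
static bool
compose_lv_t_fixed(uint32_t start, uint32_t code, uint32_t *result)
{
	if (is_lv_syllable(start) && code > TBASE && code < TBASE + TCOUNT)
	{
		*result = start + (code - TBASE);
		return true;
	}
	return false;
}
```

Under the inclusive check, the pair U+AC00 U+11A7 "composes" to U+AC00 alone, so U+11A7 silently disappears from the output. Under the exclusive check the pair is left untouched, while U+AC00 U+11A8 still composes to U+AC01.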

Let me know if this patch needs anything else. I can write a test
for this, but it looks like the current testing setup in
src/common/norm_test.c only runs the Unicode test suite and isn’t
built for writing custom tests. If that is something of interest,
though, I’m happy to add that to this patch.

We have a set of tests in src/test/regress/sql/unicode.sql that would
fit nicely with what you want to address here. For this specific
problem, this would work:
SELECT normalize(U&'\AC00\11A7', NFC) = U&'\AC00\11A7';

How about adding more normalization check patterns while at it? I
ended up with the attached, all things combined. Diego, what do you
think?
--
Michael

Diego Frias

mail@dzfrias.dev

about 2 months ago

In reply to: Michael Paquier (#2)

Re: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization

Looks great! Thanks for letting me know where the tests live. I’ll
try to get these tests in the official Unicode test suite, too. Should
help future implementors.

Thanks,
Diego


On Jun 3, 2026, at 9:07 PM, Michael Paquier <michael@paquier.xyz> wrote:

[...]
<0001-Fix-off-by-one-with-NFC-recomposition-for-Hangul-U-1.patch>

Michael Paquier

michael@paquier.xyz

about 2 months ago

In reply to: Diego Frias (#3)

Re: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization

On Thu, Jun 04, 2026 at 09:32:53AM -0700, Diego Frias wrote:

Looks great! Thanks for letting me know where the tests live. I’ll
try to get these tests in the official Unicode test suite, too. Should
help future implementors.

Thanks. Applied and backpatched down to v14.
--
Michael
