Wrong results using initcap() with a non-normalized string
Hello,
I have come across a strange situation when using a Unicode string
that has non-normalized characters. The attached script 'initcap.sql'
can reproduce the problem.
The attached patch can fix the issue.
Regards,
Juan José Santamaría Flecha
Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> writes:
I have come across a strange situation when using a Unicode string
that has non-normalized characters. The attached script 'initcap.sql'
can reproduce the problem.
The attached patch can fix the issue.
If we're going to start worrying about non-normalized characters,
I suspect there are far more places than this one that we'd have
to consider buggy :-(.
As for the details of the patch, it seems overly certain that
it's working with UTF8 data.
regards, tom lane
On 2019-Sep-20, Tom Lane wrote:
Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> writes:
I have come across a strange situation when using a Unicode string
that has non-normalized characters. The attached script 'initcap.sql'
can reproduce the problem.
For illustration purposes:
SELECT initcap('ŞUB');
initcap
─────────
Şub
(1 row)
SELECT initcap('ŞUB');
initcap
─────────
ŞUb
(1 row)
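The two inputs above render identically but differ at the code-point level: the first uses the precomposed Ş (U+015E), the second the decomposed sequence S followed by U+0327 (combining cedilla). A minimal Python sketch (an illustration of the per-character word-boundary logic, not the PostgreSQL implementation) reproduces the misbehavior, because a combining mark is not alphanumeric and is therefore treated as a word boundary:

```python
def naive_initcap(s: str) -> str:
    """Uppercase the first alphanumeric of each word, lowercase the
    rest -- mimicking initcap()'s per-character word-boundary logic."""
    out, in_word = [], False
    for ch in s:
        if ch.isalnum():
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        else:
            out.append(ch)  # a combining mark lands here and resets the word
            in_word = False
    return "".join(out)

precomposed = "\u015EUB"   # Ş U B, with Ş as a single code point
decomposed = "S\u0327UB"   # S + combining cedilla, then U B

print(naive_initcap(precomposed))  # Şub  (correct)
print(naive_initcap(decomposed))   # ŞUb  (the U is wrongly uppercased)
```

Because the combining cedilla resets the word state, the following 'U' is seen as the start of a new word, matching the reported output.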
If we're going to start worrying about non-normalized characters,
I suspect there are far more places than this one that we'd have
to consider buggy :-(.
I would think that we have to start somewhere, rather than take the
position that we can never do anything about it.
(ref: /messages/by-id/53E179E1.3060404@2ndquadrant.com )
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Sep 21, 2019 at 2:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Sep-20, Tom Lane wrote:
If we're going to start worrying about non-normalized characters,
I suspect there are far more places than this one that we'd have
to consider buggy :-(.
I would think that we have to start somewhere, rather than take the
position that we can never do anything about it.
This conversation predates the availability of the normalization code
[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=60f11b87a2349985230c08616fa8a34ffde934c8
I agree it would be problematic if this were the only
normalization-aware function, although most functions are likely
unaffected, given that nothing has been reported before.
The attached patch addresses the comment about assuming UTF8.
Regards,
Juan José Santamaría Flecha
Attachments:
0001-initcap-non-normalized-string-v1.patch (application/octet-stream, +19 −2)
On 2019-Sep-22, Juan José Santamaría Flecha wrote:
The attached patch addresses the comment about assuming UTF8.
The UTF8 bits look reasonable to me. I guess the other part of that
question is whether we support any other multibyte encoding that
supports combining characters. Maybe for cases other than UTF8 we can
test for 0-width chars (using pg_encoding_dsplen() perhaps?) and drive
the upper/lower decision off that? (For the UTF8 case, I don't know if
Juanjo's proposal is better than pg_encoding_dsplen. Both seem to boil
down to a bsearch, though unicode_norm.c's table seems much larger than
wchar.c's).
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Sep 29, 2019 at 3:38 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
The UTF8 bits look reasonable to me. I guess the other part of that
question is whether we support any other multibyte encoding that
supports combining characters. Maybe for cases other than UTF8 we can
test for 0-width chars (using pg_encoding_dsplen() perhaps?) and drive
the upper/lower decision off that? (For the UTF8 case, I don't know if
Juanjo's proposal is better than pg_encoding_dsplen. Both seem to boil
down to a bsearch, though unicode_norm.c's table seems much larger than
wchar.c's).
Using pg_encoding_dsplen() looks like the way to go. The normalization
logic included in ucs_wcwidth() already does what is needed to avoid the
issue, so there is no need to use unicode_norm_table.h. UTF8 is the
only multibyte encoding that can return a 0-width dsplen, so this
approach would also work for all the other encodings that do not use
combining characters.
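The width-driven idea can be sketched in Python (an illustration only: unicodedata.combining() stands in for pg_encoding_dsplen() returning 0, and the function name here is made up). A zero-width combining character neither starts nor ends a word, so it leaves the case state untouched:

```python
import unicodedata

def width_aware_initcap(s: str) -> str:
    """initcap()-style logic in which zero-width combining marks do
    not reset the word boundary (cf. pg_encoding_dsplen() == 0)."""
    out, in_word = [], False
    for ch in s:
        if unicodedata.combining(ch):
            out.append(ch)  # zero display width: keep word state as-is
        elif ch.isalnum():
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        else:
            out.append(ch)
            in_word = False
    return "".join(out)

# Decomposed Ş (S + combining cedilla) is now titlecased correctly:
print(width_aware_initcap("S\u0327UB"))  # Şub
```

With this check in place, the decomposed and precomposed spellings of 'ŞUB' produce the same titlecased result, which is exactly what driving the upper/lower decision off display width buys.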
Please find attached a patch with this approach.
Regards,
Juan José Santamaría Flecha