Wrong results using initcap() with non-normalized string
Hello,
I have come across a strange situation when using a Unicode string
that contains non-normalized characters. The attached script 'initcap.sql'
reproduces the problem.
The attached patch fixes the issue.
Regards,
Juan José Santamaría Flecha
Attachments:
0001-initcap-non-normalized-string.patch (application/x-patch)
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 755ca6e..9f8becf 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -96,6 +96,7 @@
#include "utils/memutils.h"
#include "utils/numeric.h"
#include "utils/pg_locale.h"
+#include "common/unicode_norm.h"
/* ----------
* Routines type
@@ -1864,7 +1865,8 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
else
workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
- wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
+ if (!is_pg_wchar_combining(workspace[curr_char]))
+ wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
}
else
#endif
@@ -1873,7 +1875,8 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
workspace[curr_char] = towlower(workspace[curr_char]);
else
workspace[curr_char] = towupper(workspace[curr_char]);
- wasalnum = iswalnum(workspace[curr_char]);
+ if (!is_pg_wchar_combining(workspace[curr_char]))
+ wasalnum = iswalnum(workspace[curr_char]);
}
}
diff --git a/src/common/unicode_norm.c b/src/common/unicode_norm.c
index 89c5533..25b149b 100644
--- a/src/common/unicode_norm.c
+++ b/src/common/unicode_norm.c
@@ -435,3 +435,14 @@ unicode_normalize_kc(const pg_wchar *input)
return recomp_chars;
}
+
+bool
+is_pg_wchar_combining(const pg_wchar current)
+{
+ pg_unicode_decomposition *currEntry = get_code_entry(current);
+ if (currEntry == NULL)
+ return false;
+ if (currEntry->comb_class == 0x0)
+ return false;
+ return true;
+}
diff --git a/src/include/common/unicode_norm.h b/src/include/common/unicode_norm.h
index 99167d2..bdcf02e 100644
--- a/src/include/common/unicode_norm.h
+++ b/src/include/common/unicode_norm.h
@@ -17,5 +17,6 @@
#include "mb/pg_wchar.h"
extern pg_wchar *unicode_normalize_kc(const pg_wchar *input);
+extern bool is_pg_wchar_combining(const pg_wchar current);
#endif /* UNICODE_NORM_H */
Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> writes:
I have come across a strange situation when using a Unicode string
that contains non-normalized characters. The attached script 'initcap.sql'
reproduces the problem.
The attached patch fixes the issue.
If we're going to start worrying about non-normalized characters,
I suspect there are far more places than this one that we'd have
to consider buggy :-(.
As for the details of the patch, it seems overly certain that
it's working with UTF8 data.
regards, tom lane
On 2019-Sep-20, Tom Lane wrote:
Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> writes:
I have come across a strange situation when using a Unicode string
that contains non-normalized characters. The attached script 'initcap.sql'
reproduces the problem.
For illustration purposes (the first literal uses the precomposed character U+015E, the second uses 'S' followed by the combining cedilla U+0327):
SELECT initcap('ŞUB');
initcap
─────────
Şub
(1 row)
SELECT initcap('ŞUB');
initcap
─────────
ŞUb
(1 row)
If we're going to start worrying about non-normalized characters,
I suspect there are far more places than this one that we'd have
to consider buggy :-(.
I would think that we have to start somewhere, rather than take the
position that we can never do anything about it.
(ref: /messages/by-id/53E179E1.3060404@2ndquadrant.com )
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Sep 21, 2019 at 2:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Sep-20, Tom Lane wrote:
If we're going to start worrying about non-normalized characters,
I suspect there are far more places than this one that we'd have
to consider buggy :-(.
I would think that we have to start somewhere, rather than take the
position that we can never do anything about it.
This conversation predates the availability of the normalization code [1].
[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=60f11b87a2349985230c08616fa8a34ffde934c8
I agree it would be problematic if this were the only normalization-aware
function, although most functions are probably trouble-free, given that
nothing has been reported before.
The attached patch addresses the comment about assuming UTF8.
Regards,
Juan José Santamaría Flecha
Attachments:
0001-initcap-non-normalized-string-v1.patch (application/octet-stream)
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 755ca6e..5a57391 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -96,6 +96,7 @@
#include "utils/memutils.h"
#include "utils/numeric.h"
#include "utils/pg_locale.h"
+#include "common/unicode_norm.h"
/* ----------
* Routines type
@@ -1843,6 +1844,7 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
wchar_t *workspace;
size_t curr_char;
size_t result_size;
+ int encoding;
/* Overflow paranoia */
if ((nbytes + 1) > (INT_MAX / sizeof(wchar_t)))
@@ -1850,6 +1852,7 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
+ encoding = GetDatabaseEncoding();
/* Output workspace cannot have more codes than input bytes */
workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t));
@@ -1864,7 +1867,8 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
else
workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
- wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
+ if (encoding == PG_UTF8 && !is_pg_wchar_combining(workspace[curr_char]))
+ wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
}
else
#endif
@@ -1873,7 +1877,8 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
workspace[curr_char] = towlower(workspace[curr_char]);
else
workspace[curr_char] = towupper(workspace[curr_char]);
- wasalnum = iswalnum(workspace[curr_char]);
+ if (encoding == PG_UTF8 && !is_pg_wchar_combining(workspace[curr_char]))
+ wasalnum = iswalnum(workspace[curr_char]);
}
}
diff --git a/src/common/unicode_norm.c b/src/common/unicode_norm.c
index 89c5533..25b149b 100644
--- a/src/common/unicode_norm.c
+++ b/src/common/unicode_norm.c
@@ -435,3 +435,14 @@ unicode_normalize_kc(const pg_wchar *input)
return recomp_chars;
}
+
+bool
+is_pg_wchar_combining(const pg_wchar current)
+{
+ pg_unicode_decomposition *currEntry = get_code_entry(current);
+ if (currEntry == NULL)
+ return false;
+ if (currEntry->comb_class == 0x0)
+ return false;
+ return true;
+}
diff --git a/src/include/common/unicode_norm.h b/src/include/common/unicode_norm.h
index 99167d2..bdcf02e 100644
--- a/src/include/common/unicode_norm.h
+++ b/src/include/common/unicode_norm.h
@@ -17,5 +17,6 @@
#include "mb/pg_wchar.h"
extern pg_wchar *unicode_normalize_kc(const pg_wchar *input);
+extern bool is_pg_wchar_combining(const pg_wchar current);
#endif /* UNICODE_NORM_H */
On 2019-Sep-22, Juan José Santamaría Flecha wrote:
The attached patch addresses the comment about assuming UTF8.
The UTF8 bits look reasonable to me. I guess the other part of that
question is whether we support any other multibyte encoding that
supports combining characters. Maybe for cases other than UTF8 we can
test for 0-width chars (using pg_encoding_dsplen() perhaps?) and drive
the upper/lower decision off that? (For the UTF8 case, I don't know if
Juanjo's proposal is better than pg_encoding_dsplen. Both seem to boil
down to a bsearch, though unicode_norm.c's table seems much larger than
wchar.c's).
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
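A minimal sketch of the zero-width test suggested above, written against str_initcap()'s existing variables; the helper name wchar_is_zero_width and its exact placement are illustrative assumptions, not part of any posted patch. The idea is to convert the wide character back into the server encoding and ask pg_encoding_dsplen() whether it occupies any display columns:

/* Backend context assumed (postgres.h); formatting.c presumably already
 * pulls in the headers needed for these calls. */
#include "mb/pg_wchar.h"        /* pg_encoding_dsplen(), MAX_MULTIBYTE_CHAR_LEN */
#include "utils/pg_locale.h"    /* wchar2char() */

/*
 * Does this wide character occupy zero display columns in the given
 * server encoding (e.g. a combining mark)?  If so, it should not be
 * allowed to update wasalnum in str_initcap().
 */
static bool
wchar_is_zero_width(wchar_t wc, int encoding, pg_locale_t locale)
{
	char		buf[MAX_MULTIBYTE_CHAR_LEN + 1];
	wchar_t		ws[2];
	size_t		len;

	ws[0] = wc;
	ws[1] = L'\0';			/* wchar2char() expects a NUL-terminated string */

	len = wchar2char(buf, ws, sizeof(buf), locale);
	if (len == 0 || len == (size_t) -1)
		return false;		/* empty or failed conversion: treat as normal */

	return pg_encoding_dsplen(encoding, buf) == 0;
}

In str_initcap() such a helper would simply guard the wasalnum update, which is essentially what the v2 patch below does inline.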
On Sun, Sep 29, 2019 at 3:38 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
The UTF8 bits look reasonable to me. I guess the other part of that
question is whether we support any other multibyte encoding that
supports combining characters. Maybe for cases other than UTF8 we can
test for 0-width chars (using pg_encoding_dsplen() perhaps?) and drive
the upper/lower decision off that? (For the UTF8 case, I don't know if
Juanjo's proposal is better than pg_encoding_dsplen. Both seem to boil
down to a bsearch, though unicode_norm.c's table seems much larger than
wchar.c's).
Using pg_encoding_dsplen() looks like the way to go. The normalization
logic included in ucs_wcwidth() already does what is needed to avoid the
issue, so there is no need to use unicode_norm_table.h. UTF8 is the
only multibyte encoding that can return a 0-width dsplen, so this
approach would also work for all the other encodings that do not use
combining characters.
Please find attached a patch with this approach.
Regards,
Juan José Santamaría Flecha
Attachments:
0001-initcap-non-normalized-string-v2.patch (application/x-patch)
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index f7175df..8af62d2 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1947,7 +1947,10 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
wchar_t *workspace;
size_t curr_char;
size_t result_size;
+ int encoding;
+ char wdsplen[MAX_MULTIBYTE_CHAR_LEN];
+ encoding = GetDatabaseEncoding();
/* Overflow paranoia */
if ((nbytes + 1) > (INT_MAX / sizeof(wchar_t)))
ereport(ERROR,
@@ -1968,7 +1971,9 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
else
workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
- wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
+ wchar2char(wdsplen, &workspace[curr_char], MAX_MULTIBYTE_CHAR_LEN, mylocale);
+ if (pg_encoding_dsplen(encoding, wdsplen) != 0)
+ wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
}
else
#endif
@@ -1977,7 +1982,9 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
workspace[curr_char] = towlower(workspace[curr_char]);
else
workspace[curr_char] = towupper(workspace[curr_char]);
- wasalnum = iswalnum(workspace[curr_char]);
+ wchar2char(wdsplen, &workspace[curr_char], MAX_MULTIBYTE_CHAR_LEN, mylocale);
+ if (pg_encoding_dsplen(encoding, wdsplen) != 0)
+ wasalnum = iswalnum(workspace[curr_char]);
}
}
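For what it's worth, the property the v2 patch relies on, namely that a combining mark reports a display length of zero under UTF8 while a base letter does not, can be checked in isolation. The tiny program below is only a hypothetical sketch; it assumes it is built inside the PostgreSQL source tree so that pg_encoding_dsplen() can be linked in:

#include "postgres_fe.h"

#include <stdio.h>

#include "mb/pg_wchar.h"

int
main(void)
{
	/* UTF-8 byte sequences for the two characters of interest */
	const char *cedilla = "\xCC\xA7";	/* U+0327 COMBINING CEDILLA */
	const char *letter = "S";

	/* A combining mark should report width 0, a base letter width 1. */
	printf("dsplen(U+0327) = %d\n", pg_encoding_dsplen(PG_UTF8, cedilla));
	printf("dsplen('S')    = %d\n", pg_encoding_dsplen(PG_UTF8, letter));

	return 0;
}

With the patch applied, both spellings of 'ŞUB' from the earlier illustration should then come out as 'Şub'.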