insensitive collations
With various patches and discussions around collations going on, I
figured I'd send in my in-progress patch for insensitive collations.
This adds a flag "insensitive" to collations. Such a collation disables
various optimizations that assume that strings are equal only if they
are byte-wise equal. That then allows use cases such as
case-insensitive or accent-insensitive comparisons or handling of
strings with different Unicode normal forms.
So this doesn't actually make the collation case-insensitive or
anything, it just allows a library-provided collation that is, say,
case-insensitive to actually work that way. So maybe "insensitive"
isn't the right name for this flag, but we can think about that.
The jobs of this patch, aside from some DDL extensions, are to track
collation assignment in plan types whether they have so far been
ignored, and then make the various collation-aware functions take the
insensitive flag into account. In comparison functions this just means
skipping past the memcmp() optimizations. In hashing functions, this
means converting the string to a sort key (think strxfrm()) before hashing.
Various pieces are incomplete, but the idea should be clear from this.
I have only implemented the ICU implementation in hashtext(); the libc
provider branch needs to be added (or maybe we won't want to). All the
changes around the "name" type haven't been taken into account. Foreign
key support (see ri_GenerateQualCollation()) needs to be addressed.
More tests for all the different plans need to be added. But in
principle it works quite well, as you can see in the tests added so far.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v1-0001-Insensitive-collations.patchtext/plain; charset=UTF-8; name=v1-0001-Insensitive-collations.patch; x-mac-creator=0; x-mac-type=0Download+516-40
Peter Eisentraut wrote:
So this doesn't actually make the collation case-insensitive or
anything, it just allows a library-provided collation that is, say,
case-insensitive to actually work that way.
That's great news!
So maybe "insensitive" isn't the right name for this flag, but we
can think about that.
To me it seems more natural to find a name for the other behavior, the
one that consists of overwriting the locale-sensitive equality with
the result of the byte-wise comparison. AFAIU the flag is meant
to say: "don't do that".
Some ideas that come to mind:
as an enum
CREATE COLLATION ... ( [EQUALITY = 'natural' | 'bytewise' ] )
as a boolean
CREATE COLLATION ... ( [NATURAL_EQUALITY = true | false ] )
defaulting to false
or
CREATE COLLATION ... ( [BYTEWISE_EQUALITY = true | false ] )
defaulting to true
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
"Daniel Verite" <daniel@manitou-mail.org> writes:
To me it seems more natural to find a name for the other behavior, the
one that consists of overwriting the locale-sensitive equality with
the result of the byte-wise comparison. AFAIU the flag is meant
to say: "don't do that".
Some ideas that come to mind:
as an enum
CREATE COLLATION ... ( [EQUALITY = 'natural' | 'bytewise' ] )
as a boolean
CREATE COLLATION ... ( [NATURAL_EQUALITY = true | false ] )
defaulting to false
or
CREATE COLLATION ... ( [BYTEWISE_EQUALITY = true | false ] )
defaulting to true
I don't really find it "natural" for equality to consider obviously
distinct values to be equal. As a counterexample, the fact that IEEE
arithmetic treats 0 and -0 as equal seems to cause about as many problems
as it solves, and newcomers to float math certainly don't find it
"natural". So I object to that particular naming. Perhaps we could
do something around "uniqueness"? That is, collations could have
a boolean property UNIQUE, default true, or something like that.
regards, tom lane
On Wed, Dec 19, 2018 at 6:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I don't really find it "natural" for equality to consider obviously
distinct values to be equal.
The Unicode consortium calls our current behavior within comparisons
"deterministic comparisons" -- it's something they're not so keen on:
https://unicode.org/reports/tr10/#Deterministic_Comparison
I suggest using their terminology for our current behavior. I can see
where Peter was coming from with "natural", but it doesn't quite work.
One problem with that terminology is that already refers to a sort
that sorts numbers as numbers. How about "linguistic", or "lexical"?
There is a significant cultural dimension to this. I suspect that you
don't find it "natural" that obviously distinct values compare as
equal because you're anglophone. That's the exact behavior you'll get
when using an unadorned en_US collation/locale, I think.
As a counterexample, the fact that IEEE
arithmetic treats 0 and -0 as equal seems to cause about as many problems
as it solves, and newcomers to float math certainly don't find it
"natural". So I object to that particular naming.
FWIW, I don't think that your IEEE analogy quite works, because you're
talking about a property of a datatype. A collation is not intrinsic
to any collatable datatype. Fortunately, we're not required to agree
on what feels natural.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
The Unicode consortium calls our current behavior within comparisons
"deterministic comparisons" -- it's something they're not so keen on:
https://unicode.org/reports/tr10/#Deterministic_Comparison
I suggest using their terminology for our current behavior.
Hm, it's not the greatest terminology perhaps, but if they're using it
then that makes it at least semi-standard. I can live with that.
FWIW, I don't think that your IEEE analogy quite works, because you're
talking about a property of a datatype. A collation is not intrinsic
to any collatable datatype. Fortunately, we're not required to agree
on what feels natural.
Right, which is exactly why it'd be a bad idea to use "natural" as the
name for this property ...
regards, tom lane
Tom Lane wrote:
I don't really find it "natural" for equality to consider obviously
distinct values to be equal.
According to https://www.merriam-webster.com/dictionary/natural
"natural" has no less than 15 meanings. The first in the list is
"based on an inherent sense of right and wrong"
which I admit is not what we want to imply in this context.
The meaning that I was thinking about was close to definitions
4: "following from the nature of the one in question "
or 7: "having a specified character by nature "
or 13: "closely resembling an original : true to nature"
When postgres uses the comparison from a collation
with no modification whatsoever, it's true to that collation.
When it changes the result from equal to non-equal, it's not.
If a collation says that "ABC" = "abc" and postgres says, mmh, OK
thanks but I'll go with "ABC" != "abc", then that denatures the
collation, in the sense of:
"to deprive of natural qualities : change the nature of"
(https://www.merriam-webster.com/dictionary/denature)
Aside from that, I'd be +1 for "linguistic" as the opposite of
"bytewise", I think it tends to be easily understood when expressing
that a strcoll()-like function is used as opposed to a strcmp()-like
function.
I'm -1 for "deterministic" as a replacement for "bytewise". Even
if Unicode has choosen that term for exactly the behavior we're talking
about, it's heavily used in the more general sense of:
"given a particular input, will always produce the same output"
(quoted from https://en.wikipedia.org/wiki/Deterministic_algorithm)
which we very much expect from all our string comparisons no matter the
flags we may put on the collations. "bytewise" might be less academic
but it has less potential for wrong interpretations.
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
Here is an updated patch.
I have updated the naming to "deterministic", as discussed.
I have fixed up support for the "name" type, added foreign key support,
psql, pg_dump support, more tests. There are a couple of TODOs in
bpchar support that I need to look into a bit more. But other than that
it's pretty complete.
Perhaps it worth pointing out to new reviewers that the ICU tests can be
run like so:
make check EXTRA_TESTS=collate.icu.utf8
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v2-0001-Collations-with-nondeterministic-comparison.patchtext/plain; charset=UTF-8; name=v2-0001-Collations-with-nondeterministic-comparison.patch; x-mac-creator=0; x-mac-type=0Download+908-110
Peter Eisentraut wrote:
Here is an updated patch.
When using GROUP BY and ORDER BY on a field with a non-deterministic
collation, this pops out:
CREATE COLLATION myfr (locale='fr-u-ks-level1',
provider='icu', deterministic=false);
=# select n from (values ('été' collate "myfr"), ('ete')) x(n)
group by 1 order by 1 ;
n
-----
ete
(1 row)
=# select n from (values ('été' collate "myfr"), ('ete')) x(n)
group by 1 order by 1 desc;
n
-----
été
(1 row)
The single-row output is different whether it's sorted in the ASC or
DESC direction, even though in theory, ORDER BY is done after GROUP
BY, where it shouldn't make that difference.
EXPLAIN shows that the sort is done before grouping, which might
explain why it happens, but isn't that plan incorrect given the context?
postgres=# explain select n from (values ('été' collate "myfr"), ('ete'))
x(n)
group by 1 order by 1 desc;
QUERY PLAN
--------------------------------------------------------------------------
Group (cost=0.04..0.04 rows=2 width=32)
Group Key: "*VALUES*".column1
-> Sort (cost=0.04..0.04 rows=2 width=32)
Sort Key: "*VALUES*".column1 COLLATE myfr DESC
-> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=32)
(5 rows)
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On 04/01/2019 17:05, Daniel Verite wrote:
When using GROUP BY and ORDER BY on a field with a non-deterministic
collation, this pops out:CREATE COLLATION myfr (locale='fr-u-ks-level1',
provider='icu', deterministic=false);=# select n from (values ('ᅵtᅵ' collate "myfr"), ('ete')) x(n)
group by 1 order by 1 ;
n
-----
ete
(1 row)=# select n from (values ('ᅵtᅵ' collate "myfr"), ('ete')) x(n)
group by 1 order by 1 desc;
n
-----
ᅵtᅵ
(1 row)
I don't see anything wrong here. The collation says that both values
are equal, so which one is returned is implementation-dependent.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/28/18 9:55 AM, Peter Eisentraut wrote:
Here is an updated patch.
I have updated the naming to "deterministic", as discussed.
Maybe this is orthogonal and best handled elsewhere but have you when
working with string equality given unicode normalization forms[1] any
thought? I feel there are three sane ways to do unicode string equality:
1) Binary equality
2) Binary equality after normalizing the unicode
3) Collation equality
Would there be any point in adding unicode normalization support into
the collation system or is this best handle for example with a function
run on INSERT or with something else entirely?
Right now PosgreSQL does not have any support for normalization forms as
far as I know.
1. http://unicode.org/reports/tr15/
Andreas
Peter Eisentraut wrote:
=# select n from (values ('été' collate "myfr"), ('ete')) x(n)
group by 1 order by 1 ;
n
-----
ete
(1 row)=# select n from (values ('été' collate "myfr"), ('ete')) x(n)
group by 1 order by 1 desc;
n
-----
été
(1 row)I don't see anything wrong here. The collation says that both values
are equal, so which one is returned is implementation-dependent.
Is it, but it's impractical if the product of seemingly the same GROUP BY
flip-flops between its different valid results. If it can't be avoided, then
okay. If it can be avoided at little cost, then it would be better to do it.
As a different example, the regression tests are somewhat counting on
this already. Consider this part:
+CREATE TABLE test3ci (x text COLLATE case_insensitive);
+INSERT INTO test1ci VALUES ('abc'), ('def'), ('ghi');
+INSERT INTO test2ci VALUES ('ABC'), ('ghi');
+INSERT INTO test3ci VALUES ('abc'), ('ABC'), ('def'), ('ghi');
...
+SELECT x, count(*) FROM test3ci GROUP BY x ORDER BY x;
+ x | count
+-----+-------
+ abc | 2
+ def | 1
+ ghi | 1
+(3 rows)
If ABC was returned here instead of abc for whatever reason,
that would be correct strictly speaking, yet "make check" would fail.
That's impractical.
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On 09/01/2019 19:49, Andreas Karlsson wrote:
On 12/28/18 9:55 AM, Peter Eisentraut wrote:
Here is an updated patch.
I have updated the naming to "deterministic", as discussed.
Maybe this is orthogonal and best handled elsewhere but have you when
working with string equality given unicode normalization forms[1] any
thought?
Nondeterministic collations do address this by allowing canonically
equivalent code point sequences to compare as equal. You still need a
collation implementation that actually does compare them as equal; ICU
does this, glibc does not AFAICT.
Would there be any point in adding unicode normalization support into
the collation system or is this best handle for example with a function
run on INSERT or with something else entirely?
I think there might be value in a feature that normalizes strings as
they enter the database, as a component of the encoding conversion
infrastructure. But that would be a separate feature.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 09/01/2019 22:01, Daniel Verite wrote:
I don't see anything wrong here. The collation says that both values
are equal, so which one is returned is implementation-dependent.Is it, but it's impractical if the product of seemingly the same GROUP BY
flip-flops between its different valid results. If it can't be avoided, then
okay. If it can be avoided at little cost, then it would be better to do it.
But there is no concept of which one of these is the preferred variant,
so I don't see how the system is supposed to pick one and then stick to
it across separate query invocations.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut wrote:
Here is an updated patch.
On a table with pre-existing contents, the creation of a unique index
does not seem to detect the duplicates that are equal per the
collation and different binary-wise.
postgres=# \d test3ci
Table "public.test3ci"
Column | Type | Collation | Nullable | Default
--------+------+------------------+----------+---------
x | text | case_insensitive | |
postgres=# select * from test3ci;
x
-----
abc
ABC
def
ghi
(4 rows)
postgres=# create unique index idx on test3ci(x); -- EXPECTED TO FAIL
CREATE INDEX
postgres=# \d test3ci
Table "public.test3ci"
Column | Type | Collation | Nullable | Default
--------+------+------------------+----------+---------
x | text | case_insensitive | |
Indexes:
"idx" UNIQUE, btree (x)
postgres=# select count(*) from test3ci where x='abc';
count
-------
2
(1 row)
OTOH with an already existing unique index, attempts to insert
such duplicates are rejected as expected.
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On 1/10/19 8:44 AM, Peter Eisentraut wrote:
On 09/01/2019 19:49, Andreas Karlsson wrote:
Maybe this is orthogonal and best handled elsewhere but have you when
working with string equality given unicode normalization forms[1] any
thought?Nondeterministic collations do address this by allowing canonically
equivalent code point sequences to compare as equal. You still need a
collation implementation that actually does compare them as equal; ICU
does this, glibc does not AFAICT.
Ah, right! You could use -ks-identic[1] for this.
Would there be any point in adding unicode normalization support into
the collation system or is this best handle for example with a function
run on INSERT or with something else entirely?I think there might be value in a feature that normalizes strings as
they enter the database, as a component of the encoding conversion
infrastructure. But that would be a separate feature.
Agreed. And if we ever implement this we could theoretically optimize
the equality of -ks-identic to do a strcmp() rather than having to
collate anything.
I think it could also be useful to just add functions which can
normalize strings, which was in a proposal to the SQL standard which was
not accepted.[2]
Notes
1. http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options
2. https://dev.mysql.com/worklog/task/?id=2048
Andreas
Andreas Karlsson wrote:
Nondeterministic collations do address this by allowing canonically
equivalent code point sequences to compare as equal. You still need a
collation implementation that actually does compare them as equal; ICU
does this, glibc does not AFAICT.Ah, right! You could use -ks-identic[1] for this.
Strings that differ like that are considered equal even at this level:
postgres=# create collation identic (locale='und-u-ks-identic',
provider='icu', deterministic=false);
CREATE COLLATION
postgres=# select 'é' = E'e\u0301' collate "identic";
?column?
----------
t
(1 row)
There's a separate setting "colNormalization", or "kk" in BCP 47
From
http://www.unicode.org/reports/tr35/tr35-collation.html#Normalization_Setting
"The UCA always normalizes input strings into NFD form before the
rest of the algorithm. However, this results in poor performance.
With normalization=off, strings that are in [FCD] and do not contain
Tibetan precomposed vowels (U+0F73, U+0F75, U+0F81) should sort
correctly. With normalization=on, an implementation that does not
normalize to NFD must at least perform an incremental FCD check and
normalize substrings as necessary"
But even setting this to false does not mean that NFD and NFC forms
of the same text compare as different:
postgres=# create collation identickk (locale='und-u-ks-identic-kk-false',
provider='icu', deterministic=false);
CREATE COLLATION
postgres=# select 'é' = E'e\u0301' collate "identickk";
?column?
----------
t
(1 row)
AFAIU such strings may only compare as different when they're not
in FCD form (http://unicode.org/notes/tn5/#FCD)
There are also ICU-specific explanations about FCD here:
http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm#Normalization
It looks like setting colNormalization to false might provide a
performance benefit when you know your contents are in FCD
form, which is mostly the case according to ICU:
"Note that all NFD strings are in FCD, and in practice most NFC
strings will also be in FCD; for that matter most strings (of whatever
ilk) will be in FCD.
We guarantee that if any input strings are in FCD, that we will get
the right results in collation without having to normalize".
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On 14/01/2019 13:23, Daniel Verite wrote:
On a table with pre-existing contents, the creation of a unique index
does not seem to detect the duplicates that are equal per the
collation and different binary-wise.
Fixed in the attached updated patch.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v3-0001-Collations-with-nondeterministic-comparison.patchtext/plain; charset=UTF-8; name=v3-0001-Collations-with-nondeterministic-comparison.patch; x-mac-creator=0; x-mac-type=0Download+928-112
Peter Eisentraut wrote:
On a table with pre-existing contents, the creation of a unique index
does not seem to detect the duplicates that are equal per the
collation and different binary-wise.Fixed in the attached updated patch.
Check. I've found another issue with aggregates over distinct:
the deduplication seems to ignore the collation.
postgres=# select distinct x from test3ci; -- OK
x
-----
def
abc
ghi
(3 rows)
postgres=# select count(distinct x) from test3ci; -- not OK
count
-------
4
(1 row)
postgres=# select array_agg(distinct x) from test3ci; -- not OK
array_agg
-------------------
{ABC,abc,def,ghi}
(1 row)
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On 14/01/2019 15:37, Andreas Karlsson wrote:
Nondeterministic collations do address this by allowing canonically
equivalent code point sequences to compare as equal. You still need a
collation implementation that actually does compare them as equal; ICU
does this, glibc does not AFAICT.Ah, right! You could use -ks-identic[1] for this.
That's the default.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 16/01/2019 14:20, Daniel Verite wrote:
I've found another issue with aggregates over distinct:
the deduplication seems to ignore the collation.
I have a fix for that. I'll send it with the next update.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services