pgsql: Collations with nondeterministic comparison

Started by Peter Eisentrautover 7 years ago7 messagescomitters
Jump to latest
#1Peter Eisentraut
peter_e@gmx.net

Collations with nondeterministic comparison

This adds a flag "deterministic" to collations. If that is false,
such a collation disables various optimizations that assume that
strings are equal only if they are byte-wise equal. That then allows
use cases such as case-insensitive or accent-insensitive comparisons
or handling of strings with different Unicode normal forms.

This functionality is only supported with the ICU provider. At least
glibc doesn't appear to have any locales that work in a
nondeterministic way, so it's not worth supporting this for the libc
provider.

The term "deterministic comparison" in this context is from Unicode
Technical Standard #10
(https://unicode.org/reports/tr10/#Deterministic_Comparison).

This patch makes changes in three areas:

- CREATE COLLATION DDL changes and system catalog changes to support
this new flag.

- Many executor nodes and auxiliary code are extended to track
collations. Previously, this code would just throw away collation
information, because the eventually-called user-defined functions
didn't use it since they only cared about equality, which didn't
need collation information.

- String data type functions that do equality comparisons and hashing
are changed to take the (non-)deterministic flag into account. For
comparison, this just means skipping various shortcuts and tie
breakers that use byte-wise comparison. For hashing, we first need
to convert the input string to a canonical "sort key" using the ICU
analogue of strxfrm().

Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: /messages/by-id/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/5e1963fb764e9cc092e0f7b58b28985c311431d9

Modified Files
--------------
contrib/bloom/bloom.h | 1 +
contrib/bloom/blutils.c | 3 +-
doc/src/sgml/catalogs.sgml | 7 +
doc/src/sgml/charset.sgml | 61 +-
doc/src/sgml/citext.sgml | 21 +
doc/src/sgml/func.sgml | 6 +
doc/src/sgml/ref/create_collation.sgml | 22 +
src/backend/access/hash/hashfunc.c | 100 +++-
src/backend/access/spgist/spgtextproc.c | 3 +-
src/backend/catalog/pg_collation.c | 2 +
src/backend/commands/collationcmds.c | 25 +-
src/backend/commands/extension.c | 6 +-
src/backend/executor/execExpr.c | 4 +-
src/backend/executor/execGrouping.c | 14 +-
src/backend/executor/execPartition.c | 1 +
src/backend/executor/execReplication.c | 5 +-
src/backend/executor/nodeAgg.c | 9 +-
src/backend/executor/nodeGroup.c | 1 +
src/backend/executor/nodeHash.c | 14 +-
src/backend/executor/nodeHashjoin.c | 5 +
src/backend/executor/nodeRecursiveunion.c | 1 +
src/backend/executor/nodeSetOp.c | 2 +
src/backend/executor/nodeSubplan.c | 14 +-
src/backend/executor/nodeUnique.c | 1 +
src/backend/executor/nodeWindowAgg.c | 2 +
src/backend/nodes/copyfuncs.c | 7 +
src/backend/nodes/outfuncs.c | 7 +
src/backend/nodes/readfuncs.c | 7 +
src/backend/optimizer/plan/createplan.c | 54 +-
src/backend/optimizer/util/tlist.c | 25 +
src/backend/partitioning/partbounds.c | 4 +-
src/backend/partitioning/partprune.c | 3 +-
src/backend/regex/regc_pg_locale.c | 5 +
src/backend/utils/adt/arrayfuncs.c | 2 +-
src/backend/utils/adt/like.c | 27 +-
src/backend/utils/adt/like_support.c | 14 +
src/backend/utils/adt/name.c | 32 +-
src/backend/utils/adt/orderedsetaggs.c | 3 +-
src/backend/utils/adt/pg_locale.c | 1 +
src/backend/utils/adt/ri_triggers.c | 33 +-
src/backend/utils/adt/varchar.c | 194 ++++++-
src/backend/utils/adt/varlena.c | 333 +++++++----
src/backend/utils/cache/catcache.c | 9 +-
src/backend/utils/cache/lsyscache.c | 16 +
src/bin/initdb/initdb.c | 4 +-
src/bin/pg_dump/pg_dump.c | 39 +-
src/bin/psql/describe.c | 17 +-
src/include/catalog/catversion.h | 2 +-
src/include/catalog/pg_collation.h | 2 +
src/include/executor/executor.h | 4 +
src/include/executor/hashjoin.h | 1 +
src/include/executor/nodeHash.h | 2 +-
src/include/nodes/execnodes.h | 3 +
src/include/nodes/plannodes.h | 7 +
src/include/optimizer/planmain.h | 2 +-
src/include/optimizer/tlist.h | 1 +
src/include/partitioning/partbounds.h | 1 +
src/include/utils/lsyscache.h | 1 +
src/include/utils/pg_locale.h | 1 +
src/test/regress/expected/collate.icu.utf8.out | 710 +++++++++++++++++++++++
src/test/regress/expected/collate.linux.utf8.out | 5 +
src/test/regress/expected/collate.out | 15 +
src/test/regress/expected/subselect.out | 19 +
src/test/regress/sql/collate.icu.utf8.sql | 250 ++++++++
src/test/regress/sql/collate.linux.utf8.sql | 7 +
src/test/regress/sql/collate.sql | 5 +
src/test/regress/sql/subselect.sql | 17 +
src/test/subscription/Makefile | 2 +
src/test/subscription/t/012_collation.pl | 103 ++++
69 files changed, 2087 insertions(+), 239 deletions(-)

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#1)
Re: pgsql: Collations with nondeterministic comparison

Peter Eisentraut <peter@eisentraut.org> writes:

Collations with nondeterministic comparison

Buildfarm member tern doesn't like this patch. I think the issue is
probably that the 012_collation.pl test script isn't being sufficiently
careful to force UTF8 database encoding.

regards, tom lane

#3Christoph Berg
myon@debian.org
In reply to: Tom Lane (#2)
Re: pgsql: Collations with nondeterministic comparison

Re: Tom Lane 2019-03-22 <17125.1553282378@sss.pgh.pa.us>

Peter Eisentraut <peter@eisentraut.org> writes:

Collations with nondeterministic comparison

Buildfarm member tern doesn't like this patch. I think the issue is
probably that the 012_collation.pl test script isn't being sufficiently
careful to force UTF8 database encoding.

Debian unstable is also unhappy as of b5fd4972a:

LC_ALL=POSIX

19:19:23 t/010_truncate.pl ..... ok
19:19:25 # Looks like your test exited with 29 before it could output anything.
19:19:25 t/012_collation.pl ....
19:19:25 Dubious, test returned 29 (wstat 7424, 0x1d00)
19:19:25 Failed 2/2 subtests

https://pgdgbuild.dus.dg-i.net/job/postgresql-12-binaries/architecture=amd64,distribution=sid/639/console

Christoph

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#1)
Re: pgsql: Collations with nondeterministic comparison

Peter Eisentraut <peter@eisentraut.org> writes:

Collations with nondeterministic comparison

Buildfarm member snapper doesn't like this. I think the problem is
that the queries it's failing on do not bother to constrain their
output row order. Adding an "EXPLAIN" right there indicates that
the INTERSECTs are being done via hashing, meaning that platform
specific output ordering is to be expected. I don't have a
big-endian platform with ICU installed, but if I did I bet I could
replicate the failure on it.

regards, tom lane

#5Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#4)
Re: pgsql: Collations with nondeterministic comparison

On 2019-03-25 06:07, Tom Lane wrote:

Buildfarm member snapper doesn't like this. I think the problem is
that the queries it's failing on do not bother to constrain their
output row order. Adding an "EXPLAIN" right there indicates that
the INTERSECTs are being done via hashing, meaning that platform
specific output ordering is to be expected. I don't have a
big-endian platform with ICU installed, but if I did I bet I could
replicate the failure on it.

pushed a fix

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In reply to: Peter Eisentraut (#5)
Re: pgsql: Collations with nondeterministic comparison

On Mon, Mar 25, 2019 at 12:17 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 2019-03-25 06:07, Tom Lane wrote:

Buildfarm member snapper doesn't like this. I think the problem is
that the queries it's failing on do not bother to constrain their
output row order. Adding an "EXPLAIN" right there indicates that
the INTERSECTs are being done via hashing, meaning that platform
specific output ordering is to be expected. I don't have a
big-endian platform with ICU installed, but if I did I bet I could
replicate the failure on it.

pushed a fix

Buildfarm member snapper is still unhappy about this:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&amp;dt=2019-03-26%2013%3A01%3A31

--
Peter Geoghegan

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#6)
Re: pgsql: Collations with nondeterministic comparison

Peter Geoghegan <pg@bowt.ie> writes:

Buildfarm member snapper is still unhappy about this:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&amp;dt=2019-03-26%2013%3A01%3A31

Ah, the reason's not so far to seek: Peter only fixed two of the four
queries with platform-dependent results. (Could probably have figured
that out without going to the trouble of installing ICU on a big-endian
machine, but anyway now I've got that.)

Will push a fix in a moment.

regards, tom lane