pgsql: Collations with nondeterministic comparison
Collations with nondeterministic comparison
This adds a flag "deterministic" to collations. If that is false,
such a collation disables various optimizations that assume that
strings are equal only if they are byte-wise equal. That then allows
use cases such as case-insensitive or accent-insensitive comparisons
or handling of strings with different Unicode normal forms.
This functionality is only supported with the ICU provider. At least
glibc doesn't appear to have any locales that work in a
nondeterministic way, so it's not worth supporting this for the libc
provider.
The term "deterministic comparison" in this context is from Unicode
Technical Standard #10
(https://unicode.org/reports/tr10/#Deterministic_Comparison).
This patch makes changes in three areas:
- CREATE COLLATION DDL changes and system catalog changes to support
this new flag.
- Many executor nodes and auxiliary code are extended to track
collations. Previously, this code would just throw away collation
information, because the eventually-called user-defined functions
didn't use it since they only cared about equality, which didn't
need collation information.
- String data type functions that do equality comparisons and hashing
are changed to take the (non-)deterministic flag into account. For
comparison, this just means skipping various shortcuts and tie
breakers that use byte-wise comparison. For hashing, we first need
to convert the input string to a canonical "sort key" using the ICU
analogue of strxfrm().
Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: /messages/by-id/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com
Branch
------
master
Details
-------
https://git.postgresql.org/pg/commitdiff/5e1963fb764e9cc092e0f7b58b28985c311431d9
Modified Files
--------------
contrib/bloom/bloom.h | 1 +
contrib/bloom/blutils.c | 3 +-
doc/src/sgml/catalogs.sgml | 7 +
doc/src/sgml/charset.sgml | 61 +-
doc/src/sgml/citext.sgml | 21 +
doc/src/sgml/func.sgml | 6 +
doc/src/sgml/ref/create_collation.sgml | 22 +
src/backend/access/hash/hashfunc.c | 100 +++-
src/backend/access/spgist/spgtextproc.c | 3 +-
src/backend/catalog/pg_collation.c | 2 +
src/backend/commands/collationcmds.c | 25 +-
src/backend/commands/extension.c | 6 +-
src/backend/executor/execExpr.c | 4 +-
src/backend/executor/execGrouping.c | 14 +-
src/backend/executor/execPartition.c | 1 +
src/backend/executor/execReplication.c | 5 +-
src/backend/executor/nodeAgg.c | 9 +-
src/backend/executor/nodeGroup.c | 1 +
src/backend/executor/nodeHash.c | 14 +-
src/backend/executor/nodeHashjoin.c | 5 +
src/backend/executor/nodeRecursiveunion.c | 1 +
src/backend/executor/nodeSetOp.c | 2 +
src/backend/executor/nodeSubplan.c | 14 +-
src/backend/executor/nodeUnique.c | 1 +
src/backend/executor/nodeWindowAgg.c | 2 +
src/backend/nodes/copyfuncs.c | 7 +
src/backend/nodes/outfuncs.c | 7 +
src/backend/nodes/readfuncs.c | 7 +
src/backend/optimizer/plan/createplan.c | 54 +-
src/backend/optimizer/util/tlist.c | 25 +
src/backend/partitioning/partbounds.c | 4 +-
src/backend/partitioning/partprune.c | 3 +-
src/backend/regex/regc_pg_locale.c | 5 +
src/backend/utils/adt/arrayfuncs.c | 2 +-
src/backend/utils/adt/like.c | 27 +-
src/backend/utils/adt/like_support.c | 14 +
src/backend/utils/adt/name.c | 32 +-
src/backend/utils/adt/orderedsetaggs.c | 3 +-
src/backend/utils/adt/pg_locale.c | 1 +
src/backend/utils/adt/ri_triggers.c | 33 +-
src/backend/utils/adt/varchar.c | 194 ++++++-
src/backend/utils/adt/varlena.c | 333 +++++++----
src/backend/utils/cache/catcache.c | 9 +-
src/backend/utils/cache/lsyscache.c | 16 +
src/bin/initdb/initdb.c | 4 +-
src/bin/pg_dump/pg_dump.c | 39 +-
src/bin/psql/describe.c | 17 +-
src/include/catalog/catversion.h | 2 +-
src/include/catalog/pg_collation.h | 2 +
src/include/executor/executor.h | 4 +
src/include/executor/hashjoin.h | 1 +
src/include/executor/nodeHash.h | 2 +-
src/include/nodes/execnodes.h | 3 +
src/include/nodes/plannodes.h | 7 +
src/include/optimizer/planmain.h | 2 +-
src/include/optimizer/tlist.h | 1 +
src/include/partitioning/partbounds.h | 1 +
src/include/utils/lsyscache.h | 1 +
src/include/utils/pg_locale.h | 1 +
src/test/regress/expected/collate.icu.utf8.out | 710 +++++++++++++++++++++++
src/test/regress/expected/collate.linux.utf8.out | 5 +
src/test/regress/expected/collate.out | 15 +
src/test/regress/expected/subselect.out | 19 +
src/test/regress/sql/collate.icu.utf8.sql | 250 ++++++++
src/test/regress/sql/collate.linux.utf8.sql | 7 +
src/test/regress/sql/collate.sql | 5 +
src/test/regress/sql/subselect.sql | 17 +
src/test/subscription/Makefile | 2 +
src/test/subscription/t/012_collation.pl | 103 ++++
69 files changed, 2087 insertions(+), 239 deletions(-)
Peter Eisentraut <peter@eisentraut.org> writes:
Collations with nondeterministic comparison
Buildfarm member tern doesn't like this patch. I think the issue is
probably that the 012_collation.pl test script isn't being sufficiently
careful to force UTF8 database encoding.
regards, tom lane
Re: Tom Lane 2019-03-22 <17125.1553282378@sss.pgh.pa.us>
Peter Eisentraut <peter@eisentraut.org> writes:
Collations with nondeterministic comparison
Buildfarm member tern doesn't like this patch. I think the issue is
probably that the 012_collation.pl test script isn't being sufficiently
careful to force UTF8 database encoding.
Debian unstable is also unhappy as of b5fd4972a:
LC_ALL=POSIX
19:19:23 t/010_truncate.pl ..... ok
19:19:25 # Looks like your test exited with 29 before it could output anything.
19:19:25 t/012_collation.pl ....
19:19:25 Dubious, test returned 29 (wstat 7424, 0x1d00)
19:19:25 Failed 2/2 subtests
Christoph
Peter Eisentraut <peter@eisentraut.org> writes:
Collations with nondeterministic comparison
Buildfarm member snapper doesn't like this. I think the problem is
that the queries it's failing on do not bother to constrain their
output row order. Adding an "EXPLAIN" right there indicates that
the INTERSECTs are being done via hashing, meaning that platform
specific output ordering is to be expected. I don't have a
big-endian platform with ICU installed, but if I did I bet I could
replicate the failure on it.
regards, tom lane
On 2019-03-25 06:07, Tom Lane wrote:
Buildfarm member snapper doesn't like this. I think the problem is
that the queries it's failing on do not bother to constrain their
output row order. Adding an "EXPLAIN" right there indicates that
the INTERSECTs are being done via hashing, meaning that platform
specific output ordering is to be expected. I don't have a
big-endian platform with ICU installed, but if I did I bet I could
replicate the failure on it.
pushed a fix
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Mar 25, 2019 at 12:17 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
On 2019-03-25 06:07, Tom Lane wrote:
Buildfarm member snapper doesn't like this. I think the problem is
that the queries it's failing on do not bother to constrain their
output row order. Adding an "EXPLAIN" right there indicates that
the INTERSECTs are being done via hashing, meaning that platform
specific output ordering is to be expected. I don't have a
big-endian platform with ICU installed, but if I did I bet I could
replicate the failure on it.pushed a fix
Buildfarm member snapper is still unhappy about this:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2019-03-26%2013%3A01%3A31
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
Buildfarm member snapper is still unhappy about this:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2019-03-26%2013%3A01%3A31
Ah, the reason's not so far to seek: Peter only fixed two of the four
queries with platform-dependent results. (Could probably have figured
that out without going to the trouble of installing ICU on a big-endian
machine, but anyway now I've got that.)
Will push a fix in a moment.
regards, tom lane