Nondeterministic collations vs. text_pattern_ops
Whilst poking at the leakproofness-of-texteq issue, I realized
that there's an independent problem caused by the nondeterminism
patch. To wit, that the text_pattern_ops btree opclass uses
texteq as its equality operator, even though that operator is
no longer guaranteed to be bitwise equality. That means that
depending on which collation happens to get attached to the
operator, equality might be inconsistent with the other members
of the opclass, leading to who-knows-what bad results.
bpchar_pattern_ops has the same issue with respect to bpchareq.
The obvious fix for this is to invent separate new equality operators,
but that's actually rather disastrous for performance, because
text_pattern_ops indexes would no longer be able to use WHERE clauses
using plain equality. That also feeds into whether equality clauses
deduced from equivalence classes will work for them (nope, not any
more). People using such indexes are just about certain to be
bitterly unhappy.
We may not have any choice but to do that, though --- I sure don't
see any other easy fix. If we could be certain that the collation
attached to the operator is deterministic, then it would still work
with a pattern_ops index, but that's not a concept that the index
infrastructure has got right now.
Whatever we do about this is likely to require a catversion bump,
meaning we've got to fix it *now*.
regards, tom lane
On 2019-09-17 01:13, Tom Lane wrote:
Whilst poking at the leakproofness-of-texteq issue, I realized
that there's an independent problem caused by the nondeterminism
patch. To wit, that the text_pattern_ops btree opclass uses
texteq as its equality operator, even though that operator is
no longer guaranteed to be bitwise equality. That means that
depending on which collation happens to get attached to the
operator, equality might be inconsistent with the other members
of the opclass, leading to who-knows-what bad results.
You can't create a text_pattern_ops index on a column with
nondeterministic collation:
create collation c1 (provider = icu, locale = 'und', deterministic = false);
create table t1 (a int, b text collate c1);
create index on t1 (b text_pattern_ops);
ERROR: nondeterministic collations are not supported for operator class
"text_pattern_ops"
There is some discussion in internal_text_pattern_compare().
Are there other cases we need to consider?
I notice that there is a hash opclass text_pattern_ops, which I'd
actually never heard of until now, and I don't see documented. What
would we need to do about that?
The obvious fix for this is to invent separate new equality operators,
but that's actually rather disastrous for performance, because
text_pattern_ops indexes would no longer be able to use WHERE clauses
using plain equality. That also feeds into whether equality clauses
deduced from equivalence classes will work for them (nope, not any
more). People using such indexes are just about certain to be
bitterly unhappy.
Would it help if one created COLLATE "C" indexes instead of
text_pattern_ops? What are the tradeoffs between the two approaches?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
On 2019-09-17 01:13, Tom Lane wrote:
Whilst poking at the leakproofness-of-texteq issue, I realized
that there's an independent problem caused by the nondeterminism
patch. To wit, that the text_pattern_ops btree opclass uses
texteq as its equality operator, even though that operator is
no longer guaranteed to be bitwise equality. That means that
depending on which collation happens to get attached to the
operator, equality might be inconsistent with the other members
of the opclass, leading to who-knows-what bad results.
You can't create a text_pattern_ops index on a column with
nondeterministic collation:
create collation c1 (provider = icu, locale = 'und', deterministic = false);
create table t1 (a int, b text collate c1);
create index on t1 (b text_pattern_ops);
ERROR: nondeterministic collations are not supported for operator class
"text_pattern_ops"
Oh! I'd seen that error message, but not realized that it'd get
thrown during index creation.
There is some discussion in internal_text_pattern_compare().
I don't much like doing it that way: looking up the determinism property
of the collation over again in every single comparison seems pretty
expensive, plus the operator is way exceeding its actual knowledge
of the call context by throwing an error phrased that way.
Are there other cases we need to consider?
I think that disallowing indexes with this combination of opclass and
collation may actually be sufficient. A query can request equality
using any collation, but it won't get matched to an index with a
different collation, so I think we're safe against index-related
problems if we have that restriction.
AFAIR, the only other place in the system where non-default opclasses
can be invoked is ORDER BY. Somebody could write
ORDER BY textcol COLLATE "nondeterm" USING ~<~
However, I think we're actually okay on that one, because although
the equality opclass member is out of sync with the rest, it won't
get consulted during a sort. internal_text_pattern_compare will
throw an error for this, but I don't believe it actually needs to.
My recommendation is to get rid of the run-time checks and instead
put a hack like this into DefineIndex or some minion thereof:
if ((opclass == TEXT_PATTERN_BTREE_CLASS_OID ||
opclass == VARCHAR_PATTERN_BTREE_CLASS_OID ||
opclass == BPCHAR_PATTERN_BTREE_CLASS_OID) &&
!get_collation_isdeterministic(collid))
ereport(ERROR, ...);
Hard-wiring that is ugly; maybe someday we'd wish to expose "doesn't
allow nondeterminism" as a more generally-available opclass property.
But without some other examples that have a need for it, I think it's
not worth the work to create infrastructure for that. It's not like
there are no other hard-wired legacy behaviors in DefineIndex...
I notice that there is a hash opclass text_pattern_ops, which I'd
actually never heard of until now, and I don't see documented. What
would we need to do about that?
Huh. I wonder why that's there --- I can't see a reason why it'd
behave differently from the regular hash text_ops. Maybe we feared
that someday it would need to be different? Anyway, I think we can
just ignore it for this purpose.
Would it help if one created COLLATE "C" indexes instead of
text_pattern_ops? What are the tradeoffs between the two approaches?
Of course, the pattern_ops opclasses long predate our COLLATE support.
I suspect if we'd had COLLATE we never would have invented them.
I believe the pattern_ops are equivalent to a COLLATE "C" index, both
theoretically and in terms of the optimizations we know about making.
There might be some minor performance difference from not having to
look up the collation, but not much if we've done the collate-is-c
fast paths properly.
regards, tom lane
On 2019-09-17 17:17, Tom Lane wrote:
My recommendation is to get rid of the run-time checks and instead
put a hack like this into DefineIndex or some minion thereof:if ((opclass == TEXT_PATTERN_BTREE_CLASS_OID ||
opclass == VARCHAR_PATTERN_BTREE_CLASS_OID ||
opclass == BPCHAR_PATTERN_BTREE_CLASS_OID) &&
!get_collation_isdeterministic(collid))
ereport(ERROR, ...);
Here is a draft patch.
It will require a catversion change because those operator classes don't
have assigned OIDs so far.
The comment block I just moved over for the time being. It should
probably be rephrased a bit.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
v1-0001-Prohibit-creating-text_pattern_ops-indexes-with-n.patchtext/plain; charset=UTF-8; name=v1-0001-Prohibit-creating-text_pattern_ops-indexes-with-n.patch; x-mac-creator=0; x-mac-type=0Download
From 3b1f52e5292303cba8804ae1f6676973ef903ec0 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Wed, 18 Sep 2019 14:28:31 +0200
Subject: [PATCH v1] Prohibit creating text_pattern_ops indexes with
nondeterministic collations
Discussion: https://www.postgresql.org/message-id/22566.1568675619@sss.pgh.pa.us
---
src/backend/catalog/index.c | 40 +++++++++++++++++++++++++++
src/backend/utils/adt/varchar.c | 32 +++++-----------------
src/backend/utils/adt/varlena.c | 43 +++++-------------------------
src/include/catalog/pg_opclass.dat | 9 ++++---
4 files changed, 58 insertions(+), 66 deletions(-)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1dac2803b0..f2adee3156 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -762,6 +762,46 @@ index_create(Relation heapRelation,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("user-defined indexes on system catalog tables are not supported")));
+ /*
+ * XXX We cannot use a text_pattern_ops index for nondeterministic
+ * collations, because these operators intentionally ignore the collation.
+ * However, the planner has no way to know that, so it might choose such
+ * an index for an "=" clause, which would lead to wrong results. This
+ * check here doesn't prevent choosing the index, but it will at least
+ * error out if the index is chosen. A text_pattern_ops index on a column
+ * with nondeterministic collation is pretty useless anyway, since LIKE
+ * etc. won't work there either. A future possibility would be to
+ * annotate the operator class or its members in the catalog to avoid the
+ * index. Another alternative is to stay away from the *_pattern_ops
+ * operator classes and prefer creating LIKE-supporting indexes with
+ * COLLATE "C".
+ */
+ for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+ {
+ Oid collation = collationObjectId[i];
+ Oid opclass = classObjectId[i];
+
+ if (collation)
+ {
+ if (!get_collation_isdeterministic(collation) &&
+ (opclass == TEXT_BTREE_PATTERN_OPS_OID ||
+ opclass == VARCHAR_BTREE_PATTERN_OPS_OID ||
+ opclass == BPCHAR_BTREE_PATTERN_OPS_OID))
+ {
+ HeapTuple classtup;
+
+ classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclass));
+ if (!HeapTupleIsValid(classtup))
+ elog(ERROR, "cache lookup failed for operator class %u", opclass);
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("nondeterministic collations are not supported for operator class \"%s\"",
+ NameStr(((Form_pg_opclass) GETSTRUCT(classtup))->opcname))));
+ ReleaseSysCache(classtup);
+ }
+ }
+ }
+
/*
* Concurrent index build on a system catalog is unsafe because we tend to
* release locks before committing in catalogs.
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 332dc860c4..b21137c583 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1105,23 +1105,12 @@ hashbpcharextended(PG_FUNCTION_ARGS)
*/
static int
-internal_bpchar_pattern_compare(BpChar *arg1, BpChar *arg2, Oid collid)
+internal_bpchar_pattern_compare(BpChar *arg1, BpChar *arg2)
{
int result;
int len1,
len2;
- check_collation_set(collid);
-
- /*
- * see internal_text_pattern_compare()
- */
- if (!get_collation_isdeterministic(collid))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("nondeterministic collations are not supported for operator class \"%s\"",
- "bpchar_pattern_ops")));
-
len1 = bcTruelen(arg1);
len2 = bcTruelen(arg2);
@@ -1144,7 +1133,7 @@ bpchar_pattern_lt(PG_FUNCTION_ARGS)
BpChar *arg2 = PG_GETARG_BPCHAR_PP(1);
int result;
- result = internal_bpchar_pattern_compare(arg1, arg2, PG_GET_COLLATION());
+ result = internal_bpchar_pattern_compare(arg1, arg2);
PG_FREE_IF_COPY(arg1, 0);
PG_FREE_IF_COPY(arg2, 1);
@@ -1160,7 +1149,7 @@ bpchar_pattern_le(PG_FUNCTION_ARGS)
BpChar *arg2 = PG_GETARG_BPCHAR_PP(1);
int result;
- result = internal_bpchar_pattern_compare(arg1, arg2, PG_GET_COLLATION());
+ result = internal_bpchar_pattern_compare(arg1, arg2);
PG_FREE_IF_COPY(arg1, 0);
PG_FREE_IF_COPY(arg2, 1);
@@ -1176,7 +1165,7 @@ bpchar_pattern_ge(PG_FUNCTION_ARGS)
BpChar *arg2 = PG_GETARG_BPCHAR_PP(1);
int result;
- result = internal_bpchar_pattern_compare(arg1, arg2, PG_GET_COLLATION());
+ result = internal_bpchar_pattern_compare(arg1, arg2);
PG_FREE_IF_COPY(arg1, 0);
PG_FREE_IF_COPY(arg2, 1);
@@ -1192,7 +1181,7 @@ bpchar_pattern_gt(PG_FUNCTION_ARGS)
BpChar *arg2 = PG_GETARG_BPCHAR_PP(1);
int result;
- result = internal_bpchar_pattern_compare(arg1, arg2, PG_GET_COLLATION());
+ result = internal_bpchar_pattern_compare(arg1, arg2);
PG_FREE_IF_COPY(arg1, 0);
PG_FREE_IF_COPY(arg2, 1);
@@ -1208,7 +1197,7 @@ btbpchar_pattern_cmp(PG_FUNCTION_ARGS)
BpChar *arg2 = PG_GETARG_BPCHAR_PP(1);
int result;
- result = internal_bpchar_pattern_compare(arg1, arg2, PG_GET_COLLATION());
+ result = internal_bpchar_pattern_compare(arg1, arg2);
PG_FREE_IF_COPY(arg1, 0);
PG_FREE_IF_COPY(arg2, 1);
@@ -1221,17 +1210,8 @@ Datum
btbpchar_pattern_sortsupport(PG_FUNCTION_ARGS)
{
SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
- Oid collid = ssup->ssup_collation;
MemoryContext oldcontext;
- check_collation_set(collid);
-
- if (!get_collation_isdeterministic(collid))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("nondeterministic collations are not supported for operator class \"%s\"",
- "bpchar_pattern_ops")));
-
oldcontext = MemoryContextSwitchTo(ssup->ssup_cxt);
/* Use generic string SortSupport, forcing "C" collation */
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 0864838867..b44f895d95 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -2996,34 +2996,12 @@ textgename(PG_FUNCTION_ARGS)
*/
static int
-internal_text_pattern_compare(text *arg1, text *arg2, Oid collid)
+internal_text_pattern_compare(text *arg1, text *arg2)
{
int result;
int len1,
len2;
- check_collation_set(collid);
-
- /*
- * XXX We cannot use a text_pattern_ops index for nondeterministic
- * collations, because these operators intentionally ignore the collation.
- * However, the planner has no way to know that, so it might choose such
- * an index for an "=" clause, which would lead to wrong results. This
- * check here doesn't prevent choosing the index, but it will at least
- * error out if the index is chosen. A text_pattern_ops index on a column
- * with nondeterministic collation is pretty useless anyway, since LIKE
- * etc. won't work there either. A future possibility would be to
- * annotate the operator class or its members in the catalog to avoid the
- * index. Another alternative is to stay away from the *_pattern_ops
- * operator classes and prefer creating LIKE-supporting indexes with
- * COLLATE "C".
- */
- if (!get_collation_isdeterministic(collid))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("nondeterministic collations are not supported for operator class \"%s\"",
- "text_pattern_ops")));
-
len1 = VARSIZE_ANY_EXHDR(arg1);
len2 = VARSIZE_ANY_EXHDR(arg2);
@@ -3046,7 +3024,7 @@ text_pattern_lt(PG_FUNCTION_ARGS)
text *arg2 = PG_GETARG_TEXT_PP(1);
int result;
- result = internal_text_pattern_compare(arg1, arg2, PG_GET_COLLATION());
+ result = internal_text_pattern_compare(arg1, arg2);
PG_FREE_IF_COPY(arg1, 0);
PG_FREE_IF_COPY(arg2, 1);
@@ -3062,7 +3040,7 @@ text_pattern_le(PG_FUNCTION_ARGS)
text *arg2 = PG_GETARG_TEXT_PP(1);
int result;
- result = internal_text_pattern_compare(arg1, arg2, PG_GET_COLLATION());
+ result = internal_text_pattern_compare(arg1, arg2);
PG_FREE_IF_COPY(arg1, 0);
PG_FREE_IF_COPY(arg2, 1);
@@ -3078,7 +3056,7 @@ text_pattern_ge(PG_FUNCTION_ARGS)
text *arg2 = PG_GETARG_TEXT_PP(1);
int result;
- result = internal_text_pattern_compare(arg1, arg2, PG_GET_COLLATION());
+ result = internal_text_pattern_compare(arg1, arg2);
PG_FREE_IF_COPY(arg1, 0);
PG_FREE_IF_COPY(arg2, 1);
@@ -3094,7 +3072,7 @@ text_pattern_gt(PG_FUNCTION_ARGS)
text *arg2 = PG_GETARG_TEXT_PP(1);
int result;
- result = internal_text_pattern_compare(arg1, arg2, PG_GET_COLLATION());
+ result = internal_text_pattern_compare(arg1, arg2);
PG_FREE_IF_COPY(arg1, 0);
PG_FREE_IF_COPY(arg2, 1);
@@ -3110,7 +3088,7 @@ bttext_pattern_cmp(PG_FUNCTION_ARGS)
text *arg2 = PG_GETARG_TEXT_PP(1);
int result;
- result = internal_text_pattern_compare(arg1, arg2, PG_GET_COLLATION());
+ result = internal_text_pattern_compare(arg1, arg2);
PG_FREE_IF_COPY(arg1, 0);
PG_FREE_IF_COPY(arg2, 1);
@@ -3123,17 +3101,8 @@ Datum
bttext_pattern_sortsupport(PG_FUNCTION_ARGS)
{
SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
- Oid collid = ssup->ssup_collation;
MemoryContext oldcontext;
- check_collation_set(collid);
-
- if (!get_collation_isdeterministic(collid))
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("nondeterministic collations are not supported for operator class \"%s\"",
- "text_pattern_ops")));
-
oldcontext = MemoryContextSwitchTo(ssup->ssup_cxt);
/* Use generic string SortSupport, forcing "C" collation */
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index fdfea85efe..9bab83e91e 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -146,13 +146,16 @@
opcfamily => 'btree/datetime_ops', opcintype => 'timestamp' },
{ opcmethod => 'hash', opcname => 'timestamp_ops',
opcfamily => 'hash/timestamp_ops', opcintype => 'timestamp' },
-{ opcmethod => 'btree', opcname => 'text_pattern_ops',
+{ oid => '4187', oid_symbol => 'TEXT_BTREE_PATTERN_OPS_OID',
+ opcmethod => 'btree', opcname => 'text_pattern_ops',
opcfamily => 'btree/text_pattern_ops', opcintype => 'text',
opcdefault => 'f' },
-{ opcmethod => 'btree', opcname => 'varchar_pattern_ops',
+{ oid => '4188', oid_symbol => 'VARCHAR_BTREE_PATTERN_OPS_OID',
+ opcmethod => 'btree', opcname => 'varchar_pattern_ops',
opcfamily => 'btree/text_pattern_ops', opcintype => 'text',
opcdefault => 'f' },
-{ opcmethod => 'btree', opcname => 'bpchar_pattern_ops',
+{ oid => '4189', oid_symbol => 'BPCHAR_BTREE_PATTERN_OPS_OID',
+ opcmethod => 'btree', opcname => 'bpchar_pattern_ops',
opcfamily => 'btree/bpchar_pattern_ops', opcintype => 'bpchar',
opcdefault => 'f' },
{ opcmethod => 'btree', opcname => 'money_ops', opcfamily => 'btree/money_ops',
--
2.23.0
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
Here is a draft patch.
It will require a catversion change because those operator classes don't
have assigned OIDs so far.
That's slightly annoying given where we are with v12. We could
avoid it by looking up the opclass's opfamily and seeing if it's
TEXT_BTREE_FAM_OID etc, which do already have hand-assigned OIDs.
But maybe avoiding a catversion bump now is not worth the cost of
an extra syscache lookup. (It'd give me an excuse to shove the
leakproofness-marking changes from the other thread into v12, so
there's that.)
Speaking of extra syscache lookups, I don't like that you rearranged
the if-test to check nondeterminism before the opclass identity checks.
That's usually going to be a wasted lookup.
The comment block I just moved over for the time being. It should
probably be rephrased a bit.
Indeed. Maybe like
* text_pattern_ops uses text_eq as the equality operator, which is
* fine as long as the collation is deterministic; text_eq then
* reduces to bitwise equality and so it is semantically compatible
* with the other operators and functions in the opclass. But with a
* nondeterministic collation, text_eq could yield results that are
* incompatible with the actual behavior of the index (which is
* determined by the opclass's comparison function). We prevent
* such problems by refusing creation of an index with this opclass
* and a nondeterministic collation.
*
* The same applies to varchar_pattern_ops and bpchar_pattern_ops.
* If we find more cases, we might decide to create a real mechanism
* for marking opclasses as incompatible with nondeterminism; but
* for now, this small hack suffices.
*
* Another solution is to use a special operator, not text_eq, as the
* equality opclass member, but that is undesirable because it would
* prevent index usage in many queries that work fine today.
regards, tom lane
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
Here is a draft patch.
Where are we on pushing that? I'm starting to get antsy about the
amount of time remaining before rc1. It's a low-risk fix, but still,
it'd be best to have a complete buildfarm cycle on it before Monday's
wrap.
regards, tom lane
I wrote:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
Here is a draft patch.
Where are we on pushing that? I'm starting to get antsy about the
amount of time remaining before rc1. It's a low-risk fix, but still,
it'd be best to have a complete buildfarm cycle on it before Monday's
wrap.
Since time is now really running short, I went ahead and pushed
this, after doing a closer review and finding one nitpicky bug.
regards, tom lane
On 2019-09-21 22:30, Tom Lane wrote:
I wrote:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
Here is a draft patch.
Where are we on pushing that? I'm starting to get antsy about the
amount of time remaining before rc1. It's a low-risk fix, but still,
it'd be best to have a complete buildfarm cycle on it before Monday's
wrap.Since time is now really running short, I went ahead and pushed
this, after doing a closer review and finding one nitpicky bug.
Great! I think that covers all the open issues then.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services