Explicit deterministic COLLATE fails with pattern matching operations on column with non-deterministic collation
Hi all,
Wanted to call out what seems like a possible bug in non-deterministic
collation handling with pattern matching operators. Per the
documentation, non-deterministic collations are not supported with
pattern matching operators. Section 9.7 of the PG12 manual recommends
"The pattern matching operators of all three kinds do not support
nondeterministic collations. If required, apply a different collation
to the expression to work around this limitation." However, I'm
finding that pattern matching operations fail when a column is
declared with a non-deterministic collation, *even if* a different,
deterministic collation is explicitly applied to the pattern matching
operation. This doesn't seem to be the expected behavior.
Example. This is tested on Postgres 12.3, on Centos 8.1.1911 with libicu 60.3.
Create a non-deterministic collation.
create collation mycollation (provider = icu, locale =
'en-US-ks-level2.utf8', deterministic = false);
Create a couple of sample tables:
create table ctest (id numeric, t text);
create table ctestnd (id numeric, t text collate mycollation);
Populate them with some data:
insert into ctest values (1,'aAa');
insert into ctest select generate_series(2,100000),'bbb';
insert into ctestnd select id, t from ctest;
analyze ctest, ctestnd;
Add a few indexes:
create index ctest_idx01 on ctest (t);
create index ctest_idx02 on ctest (t collate "C");
create index ctestnd_idx01 on ctestnd (t);
create index ctestnd_idx02 on ctestnd (t collate "C");
Test on ctest:
explain select * from ctest where t = 'aAa' collate "C";
QUERY PLAN
--------------------------------------------------------------------------
Index Scan using ctest_idx02 on ctest (cost=0.42..4.44 rows=1 width=10)
Index Cond: (t = 'aAa'::text COLLATE "C")
COMMENT: Works as expected.
explain select * from ctest where t like 'a%';
QUERY PLAN
--------------------------------------------------------------------------
Index Scan using ctest_idx02 on ctest (cost=0.42..8.44 rows=1 width=10)
Index Cond: ((t >= 'a'::text) AND (t < 'b'::text))
Filter: (t ~~ 'a%'::text)
COMMENT: Actually this is very interesting, because even without an
explicit COLLATE clause, LIKE still uses the "C" collation index. Not
sure if that's intended behavior either?
explain select * from ctest where t like 'a%' collate "C";
QUERY PLAN
--------------------------------------------------------------------------
Index Scan using ctest_idx02 on ctest (cost=0.42..8.44 rows=1 width=10)
Index Cond: ((t >= 'a'::text) AND (t < 'b'::text))
Filter: (t ~~ 'a%'::text COLLATE "C")
COMMENT: Uses explicit collation and index as expected.
Test on ctestnd:
explain select * from ctestnd where t = 'aAa' collate "C";
QUERY PLAN
------------------------------------------------------------------------------
Index Scan using ctestnd_idx02 on ctestnd (cost=0.42..4.44 rows=1 width=10)
Index Cond: (t = 'aAa'::text COLLATE "C")
COMMENT: Works as expected.
explain select * from ctestnd where t like 'a%';
ERROR: nondeterministic collations are not supported for LIKE
COMMENT: Fails as expected.
explain select * from ctestnd where t like 'a%' collate "C";
ERROR: nondeterministic collations are not supported for LIKE
COMMENT: Not expected. It seems like the explicit COLLATE clause is
ignored in this case. I've tried different placements for the COLLATE
clause, and none seem to work.
Is this a bug, or have I missed something?
Thanks,
James Lucas
On Wed, May 27, 2020 at 8:23 AM James Lucas <jlucasdba@gmail.com> wrote:
create table ctestnd (id numeric, t text collate mycollation);
create index ctestnd_idx02 on ctestnd (t collate "C");
Test on ctestnd:
explain select * from ctestnd where t = 'aAa' collate "C";
QUERY PLAN------------------------------------------------------------------------------
Index Scan using ctestnd_idx02 on ctestnd (cost=0.42..4.44 rows=1
width=10)
Index Cond: (t = 'aAa'::text COLLATE "C")
COMMENT: Works as expected.
Uses an index scan which is where the deterministic collation exists
explain select * from ctestnd where t like 'a%';
ERROR: nondeterministic collations are not supported for LIKE
COMMENT: Fails as expected.explain select * from ctestnd where t like 'a%' collate "C";
ERROR: nondeterministic collations are not supported for LIKE
Your schema is inherently unstable in this respect because the planner has
to be allowed to choose a sequential scan and as soon as it does it
attempts to perform like comparisons with table data that is stored using a
non-deterministic collation.
I don't know what kinds of promises we make about implicit collation
manipulation here but absent such a transformation the sequential scan plan
with LIKE generates an invalid plan choice. That it doesn't go find the
index that happens to have a workable collation for the query is
unsurprising - whether that is even a possibility is beyond me.
David J.
Hi David,
Thanks for the response. One possibly relevant thing I forgot to
mention. The collation for the database is "en_US.UTF-8", which is
thus also the collation for the t column of ctest.
Per the documentation, it seems putting an implicit collation on the
operation should work. Although the documentation is admittedly a
little vague in this respect. I also found a mail thread in the list
where Peter Eisentraut recommended syntax exactly like this (collate
"C") to work around the inability to use pattern matching on
non-deterministic collation columns. Unfortunately that thread
trailed out without a response if it actually worked.
Noticed something else a bit interesting. Perhaps removing indexes
from the equation would also help:
drop index ctestnd_idx01, ctestnd_idx02, ctest_idx01, ctest_idx02;
explain select * from ctest where t like 'a%' collate "C";
QUERY PLAN
---------------------------------------------------------
Seq Scan on ctest (cost=0.00..1791.00 rows=1 width=10)
Filter: (t ~~ 'a%'::text COLLATE "C")
COMMENT: Okay
explain select * from ctest where t like 'a%' collate mycollation;
QUERY PLAN
---------------------------------------------------------
Seq Scan on ctest (cost=0.00..1791.00 rows=1 width=10)
Filter: (t ~~ 'a%'::text COLLATE mycollation)
COMMENT: Wait, that doesn't seem right.
select * from ctest where t like 'a%' collate mycollation;
ERROR: nondeterministic collations are not supported for LIKE
COMMENT: So in this case, specifying an explicit non-deterministic
collation with EXPLAIN, we get a plan. But when we actually go to
execute, it fails.
explain select * from ctestnd where t like 'a%' collate "C";
ERROR: nondeterministic collations are not supported for LIKE
COMMENT: But in the inverse case, running explain on a column with a
non-deterministic collation, but an explicit deterministic collation,
we don't even get a plan with EXPLAIN. That seems inconsistent. Only
conclusion I can reach is that it's failing a check at an earlier
point in the process than in the other case.
Thanks,
James
On Wed, May 27, 2020 at 10:53 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:
Show quoted text
On Wed, May 27, 2020 at 8:23 AM James Lucas <jlucasdba@gmail.com> wrote:
create table ctestnd (id numeric, t text collate mycollation);
create index ctestnd_idx02 on ctestnd (t collate "C");
Test on ctestnd:
explain select * from ctestnd where t = 'aAa' collate "C";
QUERY PLAN
------------------------------------------------------------------------------
Index Scan using ctestnd_idx02 on ctestnd (cost=0.42..4.44 rows=1 width=10)
Index Cond: (t = 'aAa'::text COLLATE "C")
COMMENT: Works as expected.Uses an index scan which is where the deterministic collation exists
explain select * from ctestnd where t like 'a%';
ERROR: nondeterministic collations are not supported for LIKE
COMMENT: Fails as expected.explain select * from ctestnd where t like 'a%' collate "C";
ERROR: nondeterministic collations are not supported for LIKEYour schema is inherently unstable in this respect because the planner has to be allowed to choose a sequential scan and as soon as it does it attempts to perform like comparisons with table data that is stored using a non-deterministic collation.
I don't know what kinds of promises we make about implicit collation manipulation here but absent such a transformation the sequential scan plan with LIKE generates an invalid plan choice. That it doesn't go find the index that happens to have a workable collation for the query is unsurprising - whether that is even a possibility is beyond me.
David J.
James Lucas <jlucasdba@gmail.com> writes:
explain select * from ctestnd where t like 'a%' collate "C";
ERROR: nondeterministic collations are not supported for LIKE
Yeah. I traced through this, and the place where it's failing is where
the planner tries to apply the LIKE operator to the stored MCV values
(to see how many of them pass the condition, which gives us a big clue
about the selectivity). Unfortunately, per the comments in selfuncs.c,
* For both oprrest and oprjoin functions, the operator's input collation OID
* (if any) is passed using the standard fmgr mechanism, so that the estimator
* function can fetch it with PG_GET_COLLATION(). Note, however, that all
* statistics in pg_statistic are currently built using the relevant column's
* collation. Thus, in most cases where we are looking at statistics, we
* should ignore the operator collation and use the stats entry's collation.
* We expect that the error induced by doing this is usually not large enough
* to justify complicating matters. In any case, doing otherwise would yield
* entirely garbage results for ordered stats data such as histograms.
mcv_selectivity is following this advice and applying LIKE with the
ctestnd.t column's declared collation ... and then the operator throws
an error.
The idea that using the "wrong" collation might actually cause an error
was not factored into this design, obviously. I'm not sure offhand what
to do about it. If we go over to using the query's collation then we
avoid that issue, but instead we have the problem noted in this comment
about the histogram sort order not matching what the operator expects.
(In the case of mcv_selectivity the sort order isn't really an issue,
but it is an issue for sibling functions such as
ineq_histogram_selectivity.)
This issue only dates back to commit 5e0928005; before that, we just
blindly passed DEFAULT_COLLATION_OID to operators being evaluated for
estimation purposes. (I suppose if you made the database's default
collation nondeterministic, you could still get into trouble; but that
case may not be reachable right now.) On the other hand, the actual
breakage is even newer, because nondeterministic collations weren't
added until 5e1963fb7, several months later. Both of those are v12
cycle, so it's academic from a user's standpoint which one we blame;
but the upshot is that this case doesn't work.
Ideally, no operator would ever throw an error about unsupported
collations, but I suppose that day is far away.
I guess the path of least resistance is to change the selectivity
functions to use the query's collation; then, if you get an error
here you would have done so at runtime anyway. The problem of
inconsistency with the histogram collation will be real for
ineq_histogram_selectivity; but we had a variant of that before,
in that always using DEFAULT_COLLATION_OID would give answers
that were wrong for a query using a different collation.
Peter, any other thoughts?
regards, tom lane
Thanks Tom,
This is too much into the guts of the planner for me to contribute
much. This does raise an interesting point that I had not considered
though - it sounds like column statistics depend on the default
collation on the column. That could impact the plans chosen for
future queries, even if those queries are performed using a different
collation. For most deterministic collations I expect that probably
doesn't make much of a difference, but for non-deterministic
collations it seems like the difference in stats could be significant.
N_distinct, in particular, seems like it might be very different for a
non-deterministic collation.
I tried setting up a pathological test case for this, and it seems
like at least currently, even with a non-deterministic collation
statistics still count values as distinct, even if the default
collation would consider them equivalent. Not sure if that's as
intended or not?
create table stest (id numeric, t text);
create table stestnd (id numeric, t text collate mycollation);
insert into stest select generate_series(1,50000),'aaa';
insert into stest select generate_series(50001,100000),'aAa';
insert into stest select generate_series(100001,150000),'bbb';
insert into stest select generate_series(150001,200000),'bBb';
insert into stest select generate_series(200001,250000),'ccc';
insert into stest select generate_series(250001,300000),'cCc';
insert into stestnd select * from stest;
analyze stest, stestnd;
select schemaname, tablename, attname, n_distinct, most_common_vals
from pg_stats where attname='t' and tablename like 'stest%' order by
tablename;
schemaname | tablename | attname | n_distinct | most_common_vals
------------+-----------+---------+------------+---------------------------
public | stest | t | 6 | {aAa,cCc,bbb,aaa,ccc,bBb}
public | stestnd | t | 6 | {bBb,ccc,bbb,aAa,cCc,aaa}
Actually it turns out the DISTINCT clause doesn't either:
select count(*) from (select distinct t from stest) s;
count
-------
6
select count(*) from (select distinct t from stestnd) s;
count
-------
6
Sorry - don't want to derail the question at hand too much. It seems
like it might be relevant if the discussion is around stats and
collation handling.
Thanks,
James
Show quoted text
On Wed, May 27, 2020 at 7:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Lucas <jlucasdba@gmail.com> writes:
explain select * from ctestnd where t like 'a%' collate "C";
ERROR: nondeterministic collations are not supported for LIKEYeah. I traced through this, and the place where it's failing is where
the planner tries to apply the LIKE operator to the stored MCV values
(to see how many of them pass the condition, which gives us a big clue
about the selectivity). Unfortunately, per the comments in selfuncs.c,* For both oprrest and oprjoin functions, the operator's input collation OID
* (if any) is passed using the standard fmgr mechanism, so that the estimator
* function can fetch it with PG_GET_COLLATION(). Note, however, that all
* statistics in pg_statistic are currently built using the relevant column's
* collation. Thus, in most cases where we are looking at statistics, we
* should ignore the operator collation and use the stats entry's collation.
* We expect that the error induced by doing this is usually not large enough
* to justify complicating matters. In any case, doing otherwise would yield
* entirely garbage results for ordered stats data such as histograms.mcv_selectivity is following this advice and applying LIKE with the
ctestnd.t column's declared collation ... and then the operator throws
an error.The idea that using the "wrong" collation might actually cause an error
was not factored into this design, obviously. I'm not sure offhand what
to do about it. If we go over to using the query's collation then we
avoid that issue, but instead we have the problem noted in this comment
about the histogram sort order not matching what the operator expects.
(In the case of mcv_selectivity the sort order isn't really an issue,
but it is an issue for sibling functions such as
ineq_histogram_selectivity.)This issue only dates back to commit 5e0928005; before that, we just
blindly passed DEFAULT_COLLATION_OID to operators being evaluated for
estimation purposes. (I suppose if you made the database's default
collation nondeterministic, you could still get into trouble; but that
case may not be reachable right now.) On the other hand, the actual
breakage is even newer, because nondeterministic collations weren't
added until 5e1963fb7, several months later. Both of those are v12
cycle, so it's academic from a user's standpoint which one we blame;
but the upshot is that this case doesn't work.Ideally, no operator would ever throw an error about unsupported
collations, but I suppose that day is far away.I guess the path of least resistance is to change the selectivity
functions to use the query's collation; then, if you get an error
here you would have done so at runtime anyway. The problem of
inconsistency with the histogram collation will be real for
ineq_histogram_selectivity; but we had a variant of that before,
in that always using DEFAULT_COLLATION_OID would give answers
that were wrong for a query using a different collation.Peter, any other thoughts?
regards, tom lane
James Lucas <jlucasdba@gmail.com> writes:
I tried setting up a pathological test case for this, and it seems
like at least currently, even with a non-deterministic collation
statistics still count values as distinct, even if the default
collation would consider them equivalent. Not sure if that's as
intended or not?
I experimented with this, and what I'm seeing is that ucol_strcollUTF8()
reports that 'aaa' is different from 'aAa'. So the behavior on the
Postgres side is as-expected. I suspect that the 'en-US-ks-level2'
ICU locale doesn't act as you think it does. (That is, just saying
that a collation is nondeterministic doesn't make it so; it only forces
Postgres through slower code paths that allow for the possibility of
bitwise-unequal strings being reported as equal by ICU.) Not knowing
anything about ICU, I can't say more than that.
[ Tested on libicu-60.3-2.el8_1 ]
regards, tom lane
Tom Lane wrote:
I suspect that the 'en-US-ks-level2' ICU locale doesn't act as you
think it does.
Indeed, because the syntax is tricky. The OP wants 'en-US-u-ks-level2'.
With 'en-US-ks-level2', the ks-level2 component is ignored and you
get a tertiary colstrength.
Or use 'en-US@colStrength=secondary' which is possibly more
readable and works with older versions of ICU.
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
Apologies if anyone gets this twice. I got a rejected mail notice
back the first time I sent.
You are correct. I was playing around with collation naming sometime
back and when I started looking at this, I just used one I had left in
the database assuming it was correct. That's my bad.
I dropped the tables and redefined the collation as
create collation mycollation (provider = icu, locale =
'en-US-u-ks-level2', deterministic = false);
Now the results are more what I expected.
select schemaname, tablename, attname, n_distinct, most_common_vals
from pg_stats where attname='t' and tablename like 'stest%' order by
tablename;
schemaname | tablename | attname | n_distinct | most_common_vals
------------+-----------+---------+------------+---------------------------
public | stest | t | 6 | {aaa,cCc,bBb,bbb,ccc,aAa}
public | stestnd | t | 3 | {ccc,bbb,aaa}
So that is something to be aware of - the collation defined on the
column can impact stats values, which could in turn impact plans
chosen for queries that use alternative collations.
Sorry for the distraction. That still leaves us with the original
issue regarding LIKE and COLLATE.
Thanks,
James
Show quoted text
On Thu, May 28, 2020 at 1:48 PM Daniel Verite <daniel@manitou-mail.org> wrote:
Tom Lane wrote:
I suspect that the 'en-US-ks-level2' ICU locale doesn't act as you
think it does.Indeed, because the syntax is tricky. The OP wants 'en-US-u-ks-level2'.
With 'en-US-ks-level2', the ks-level2 component is ignored and you
get a tertiary colstrength.Or use 'en-US@colStrength=secondary' which is possibly more
readable and works with older versions of ICU.Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
James Lucas <jlucasdba@gmail.com> writes:
So that is something to be aware of - the collation defined on the
column can impact stats values, which could in turn impact plans
chosen for queries that use alternative collations.
Yeah. At some point we might try to collect stats with respect to
multiple collations, but that's a long way off probably.
(I have suggested that CREATE STATISTICS could be extended to
control this type of thing, but I don't think anyone's worked
on making it happen.)
regards, tom lane
I wrote:
I guess the path of least resistance is to change the selectivity
functions to use the query's collation; then, if you get an error
here you would have done so at runtime anyway. The problem of
inconsistency with the histogram collation will be real for
ineq_histogram_selectivity; but we had a variant of that before,
in that always using DEFAULT_COLLATION_OID would give answers
that were wrong for a query using a different collation.
I worked on this for awhile and came up with the attached patchset.
0001 does about the minimum required to avoid this failure, by
passing the query's collation not stacoll to operators and selectivity
functions invoked during selectivity estimation. Unfortunately, it
doesn't seem like we could sanely back-patch this, because it requires
adding parameters to several globally-visible functions. The odds
that some external code is calling those functions seem too high to
risk an ABI break. So, while I'd like to squeeze this into v13,
we still need to think about what to do for v12.
0002 addresses the mentioned problem with ineq_histogram_selectivity
by having that function actually verify that the query operator and
collation match what the pg_statistic histogram was generated with.
If they don't match, all is not lost. What we can do is just
sequentially apply the query's operator and comparison constant to
each histogram entry, and take the fraction of matches as our
selectivity estimate. This is more or less the same insight we have
used in generic_restriction_selectivity: the histogram is a pretty
decent sample of the column, even if its ordering is not quite what
you want.
0002 also deletes a hack I had put in get_attstatsslot() to insert a
dummy value into sslot->stacoll. That hack isn't necessary any longer
(because indeed we aren't using sslot->stacoll's value anywhere as of
0001), and it breaks the verification check that 0002 wants to add to
ineq_histogram_selectivity, which depends on stacoll being truthful.
I also adjusted get_variable_range() to deal with collations more
honestly.
When I went to test 0002, I found out that it broke some test cases
in privileges.sql, and the reason was rather interesting. What those
cases are relying on is getting a highly accurate selectivity
estimate for a user-defined operator, for which the only thing the
planner knows for sure is that it uses scalarltsel as the restriction
estimator. Despite this lack of knowledge, the existing code just
blithely uses the histogram as though it is *precisely* applicable
to the user-defined operator. (Which it is, since that operator is
just a wrapper around regular "<" ... but the system has no business
assuming that.) So with the patch, the case exercises the new code
path that just counts matches, and that gives us only
1/default_statistics_target resolution in the selectivity estimate;
which is not enough to get the expected plan to be selected. I worked
around this for the moment by cranking up default_statistics_target
while running the ANALYZE in that test script, but I wonder if we
should instead tweak those test cases to be more robust.
I think the combination of 0001+0002 really moves the goalposts a
long way in terms of having honest stats estimation for non-default
collations, so I'd like to sneak it into v13. As for v12, about
the only alternatives I can think of are:
1. Do nothing, reasoning that if nobody noticed for a year, this
situation is enough of a corner case that we can leave it unfixed.
Obviously that's pretty unsatisfying.
2. Change all the stats functions to pass DEFAULT_COLLATION_OID
when invoking operator functions. This is not too attractive
either because it essentially reverts 5e0928005; in fact, to avoid
breaking things completely we'd likely have to revert the part
of that commit that taught ANALYZE to collect stats using column
collations instead of DEFAULT_COLLATION_OID. Then we get into
questions like what about 6b0faf723 --- it's going to be a mess.
3. Hack things up so that the core code renames all these exposed
functions to, say, ineq_histogram_selectivity_ext() and so on,
allowing the additional arguments to exist, but the old names would
still be there as ABI compatibility wrappers. This might produce
slightly funny results for external code calling the wrappers, since
the wrappers would have to assume DEFAULT_COLLATION_OID, but it'd
avoid an ABI break at least. I don't want to propagate such a thing
into HEAD, so this would leave us with unsightly differences between
v12 and earlier/later branches -- but there aren't *that* many places
involved. (I'd envision this approach as back-porting 0001 but not
0002. For one reason, there's noplace for a wrapper to get the
additional operator OID needed for ineq_histogram_selectivity_ext.
For another, the results for the privilege test suggest that 0002
might have surprising effects on user-defined operators, so back
patching it might draw more complaints.)
Alternatives #2 and #3 would result in (different) changes in the
selectivity estimates v12 produces when considering columns with
non-default collations and/or queries using collations that don't
match the relevant columns. So that might be an argument for
doing nothing in v12; people tend not to like it when minor
releases cause unexpected plan changes. Also, #2 is probably
strictly worse than #3 on this score, since it'd move such
estimates away from reality not towards it.
Thoughts?
regards, tom lane
Attachments:
0001-use-query-collation-in-selectivity-estimation.patchtext/x-diff; charset=us-ascii; name=0001-use-query-collation-in-selectivity-estimation.patchDownload+100-85
0002-handle-other-collations-in-scalarineqsel.patchtext/x-diff; charset=us-ascii; name=0002-handle-other-collations-in-scalarineqsel.patchDownload+194-65
I wrote:
3. Hack things up so that the core code renames all these exposed
functions to, say, ineq_histogram_selectivity_ext() and so on,
allowing the additional arguments to exist, but the old names would
still be there as ABI compatibility wrappers.
Here's a proposed v12 patch along those lines.
regards, tom lane