Hash join not finding which collation to use for string hashing

Started by Mark Dilgeralmost 6 years ago14 messages

mark.dilger@enterprisedb.com

almost 6 years ago

1 attachment(s)

While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than responding to that patch, I’m starting this new thread.

I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash partition vs UPDATE” thread. As such, I’ve included Tom and Amit in the CC list.

Notice the "ERROR: could not determine which collation to use for string hashing”

The following is extracted from the output from the test:

CREATE TABLE raw_data (a text);
INSERT INTO raw_data (a) VALUES ('Türkiye'),
('TÜRKIYE'),
('bıt'),
('BIT'),
('äbç'),
('ÄBÇ'),
('aaá'),
('coté'),
('Götz'),
('ὀδυσσεύς'),
('ὈΔΥΣΣΕΎΣ'),
('を読み取り用'),
('にオープンできませんでした');
-- Create unpartitioned tables for test
CREATE TABLE alpha (a TEXT COLLATE "ja_JP", b TEXT COLLATE "sv_SE");
CREATE TABLE beta (a TEXT COLLATE "tr_TR", b TEXT COLLATE "en_US");
INSERT INTO alpha (SELECT a, a FROM raw_data);
INSERT INTO beta (SELECT a, a FROM raw_data);
ANALYZE alpha;
ANALYZE beta;
EXPLAIN (COSTS OFF)
SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
QUERY PLAN
------------------------------------------------------------
Hash Join
Hash Cond: ((t2.a)::text = (t1.a)::text)
-> Seq Scan on beta t2
-> Hash
-> Seq Scan on alpha t1
Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
(6 rows)

SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
ERROR: could not determine which collation to use for string hashing
HINT: Use the COLLATE clause to set the collation explicitly.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-WIP-patch-to-demonstrate-problem.patchapplication/octet-stream; name=0001-WIP-patch-to-demonstrate-problem.patch; x-unix-mode=0644Download

From 31694084eb47127edb38d854b3d762b6a870985d Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 28 Jan 2020 15:25:58 -0800
Subject: [PATCH] WIP patch to demonstrate problem

Apply and run 'make check' to see the problem.  I've created an
empty expected output file, just to keep the regression test
infrastructure happy.

The problem manifests as:

ERROR:  could not determine which collation to use for string hashing
---
 src/test/regress/expected/collation_join.out |  0
 src/test/regress/parallel_schedule           |  2 +-
 src/test/regress/serial_schedule             |  1 +
 src/test/regress/sql/collation_join.sql      | 58 ++++++++++++++++++++
 4 files changed, 60 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/collation_join.out
 create mode 100644 src/test/regress/sql/collation_join.sql

diff --git a/src/test/regress/expected/collation_join.out b/src/test/regress/expected/collation_join.out
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d2b17dd3ea..3064f27f8a 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -112,7 +112,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
 # ----------
 # Another group of parallel tests
 # ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain
+test: collation_join partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain
 
 # event triggers cannot run concurrently with any test that runs DDL
 test: event_trigger
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index acba391332..0f2901c027 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -185,6 +185,7 @@ test: returning
 test: largeobject
 test: with
 test: xml
+test: collation_join
 test: partition_join
 test: partition_prune
 test: reloptions
diff --git a/src/test/regress/sql/collation_join.sql b/src/test/regress/sql/collation_join.sql
new file mode 100644
index 0000000000..521fb7819b
--- /dev/null
+++ b/src/test/regress/sql/collation_join.sql
@@ -0,0 +1,58 @@
+CREATE TABLE raw_data (a text);
+INSERT INTO raw_data (a) VALUES ('Türkiye'),
+								('TÜRKIYE'),
+								('bıt'),
+								('BIT'),
+								('äbç'),
+								('ÄBÇ'),
+								('aaá'),
+								('coté'),
+								('Götz'),
+								('ὀδυσσεύς'),
+								('ὈΔΥΣΣΕΎΣ'),
+								('を読み取り用'),
+								('にオープンできませんでした');
+
+-- Create unpartitioned tables for test
+CREATE TABLE alpha (a TEXT COLLATE "ja_JP", b TEXT COLLATE "sv_SE");
+CREATE TABLE beta (a TEXT COLLATE "tr_TR", b TEXT COLLATE "en_US");
+
+INSERT INTO alpha (SELECT a, a FROM raw_data);
+INSERT INTO beta (SELECT a, a FROM raw_data);
+
+ANALYZE alpha;
+ANALYZE beta;
+
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
+SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
+
+-- Try again, this time with list partitioning
+DROP TABLE alpha CASCADE;
+DROP TABLE beta CASCADE;
+
+CREATE TABLE alpha (a TEXT COLLATE "ja_JP", b TEXT COLLATE "sv_SE") PARTITION BY LIST(a);
+CREATE TABLE alpha_a PARTITION OF alpha FOR VALUES IN ('Türkiye', 'TÜRKIYE');
+CREATE TABLE alpha_b PARTITION OF alpha FOR VALUES IN ('bıt', 'BIT');
+CREATE TABLE alpha_c PARTITION OF alpha FOR VALUES IN ('äbç', 'ÄBÇ');
+CREATE TABLE alpha_d PARTITION OF alpha FOR VALUES IN ('aaá', 'coté', 'Götz');
+CREATE TABLE alpha_e PARTITION OF alpha FOR VALUES IN ('ὀδυσσεύς', 'ὈΔΥΣΣΕΎΣ');
+CREATE TABLE alpha_f PARTITION OF alpha FOR VALUES IN ('を読み取り用', 'にオープンできませんでした', NULL);
+CREATE TABLE alpha_default PARTITION OF alpha DEFAULT;
+
+CREATE TABLE beta (a TEXT COLLATE "tr_TR", b TEXT COLLATE "en_US") PARTITION BY LIST(a);
+CREATE TABLE beta_a PARTITION OF beta FOR VALUES IN ('Türkiye', 'coté', 'ὈΔΥΣΣΕΎΣ');
+CREATE TABLE beta_b PARTITION OF beta FOR VALUES IN ('bıt', 'TÜRKIYE');
+CREATE TABLE beta_c PARTITION OF beta FOR VALUES IN ('äbç', 'を読み取り用', 'にオープンできませんでした');
+CREATE TABLE beta_d PARTITION OF beta FOR VALUES IN ('aaá', 'Götz', 'BIT', 'ὀδυσσεύς', 'ÄBÇ', NULL);
+CREATE TABLE beta_default PARTITION OF beta DEFAULT;
+
+INSERT INTO alpha (SELECT a, a FROM raw_data);
+INSERT INTO beta (SELECT a, a FROM raw_data);
+
+ANALYZE alpha;
+ANALYZE beta;
+
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
+SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
-- 
2.21.1 (Apple Git-122.3)

Tom Lane

tgl@sss.pgh.pa.us

almost 6 years ago

In reply to: Mark Dilger (#1)

Re: Hash join not finding which collation to use for string hashing

Mark Dilger <mark.dilger@enterprisedb.com> writes:

While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than responding to that patch, I’m starting this new thread.
I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash partition vs UPDATE” thread. As such, I’ve included Tom and Amit in the CC list.

Hm, I don't see any bug here. You're asking it to join

CREATE TABLE alpha (a TEXT COLLATE "ja_JP", b TEXT COLLATE "sv_SE");
CREATE TABLE beta (a TEXT COLLATE "tr_TR", b TEXT COLLATE "en_US");

SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');

so t1.a and t2.a have different collations, and the system can't resolve
which to use for the comparison.

Now, I'd be the first to agree that this error could be reported better.
The parser knows that it couldn't resolve a collation for t1.a = t2.a, but
what it does *not* know is whether the '=' operator cares for collation.
Throwing an error when the operator wouldn't care at runtime isn't going
to make many people happy. On the other hand, when the operator finally
does run and can't get a collation, all it knows is that it didn't get a
collation, not why. So we can't produce an error message as specific as
"ja_JP and tr_TR collations conflict".

Now that the collations feature has settled in, it'd be nice to go back
and see if we can't improve that somehow. Not sure how.

(BTW, before v12 the text '=' operator indeed did not care for collation,
so this example would've worked. But the change in behavior is a
necessary consequence of having invented nondeterministic collations,
not a bug.)

regards, tom lane

Mark Dilger

mark.dilger@enterprisedb.com

almost 6 years ago

In reply to: Tom Lane (#2)

Re: Hash join not finding which collation to use for string hashing

On Jan 28, 2020, at 6:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Mark Dilger <mark.dilger@enterprisedb.com> writes:

While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than responding to that patch, I’m starting this new thread.
I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash partition vs UPDATE” thread. As such, I’ve included Tom and Amit in the CC list.

Hm, I don't see any bug here. You're asking it to join

CREATE TABLE alpha (a TEXT COLLATE "ja_JP", b TEXT COLLATE "sv_SE");
CREATE TABLE beta (a TEXT COLLATE "tr_TR", b TEXT COLLATE "en_US");

SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');

so t1.a and t2.a have different collations, and the system can't resolve
which to use for the comparison.

Now, I'd be the first to agree that this error could be reported better.
The parser knows that it couldn't resolve a collation for t1.a = t2.a, but
what it does *not* know is whether the '=' operator cares for collation.
Throwing an error when the operator wouldn't care at runtime isn't going
to make many people happy. On the other hand, when the operator finally
does run and can't get a collation, all it knows is that it didn't get a
collation, not why. So we can't produce an error message as specific as
"ja_JP and tr_TR collations conflict".

Now that the collations feature has settled in, it'd be nice to go back
and see if we can't improve that somehow. Not sure how.

(BTW, before v12 the text '=' operator indeed did not care for collation,
so this example would've worked. But the change in behavior is a
necessary consequence of having invented nondeterministic collations,
not a bug.)

I contemplated that for a while before submitting the report. I agree that for strings that are not binary equal, some collations might say the two strings are equal, and other collations may say that they are not. But when does any collation say that a string is not equal to itself? All the strings in these columns were loaded from the same source table, and they should always equal themselves, so the only problem I am aware of is if some of them equal others of them under one of the collations in question, where the other collation doesn’t think so. I’m pretty sure that does not exist in this concrete example.

I guess I’m arguing that the system is giving up too soon, saying, “In theory there might be values I don’t know how to compare, so I’m going to give up now and not look”.

I think what is happening here is that the system thinks, “Hey, I can use a hash join for this”, and then later realizes, “Oh, no, I can’t” and instead of falling back to something other than hash join, it gives up.

Is there some more fundamental reason this query couldn’t correctly be completed? I don’t mind being enlightened about the part that I’m missing.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Mark Dilger

mark.dilger@enterprisedb.com

almost 6 years ago

In reply to: Mark Dilger (#3)

Re: Hash join not finding which collation to use for string hashing

On Jan 28, 2020, at 7:38 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

On Jan 28, 2020, at 6:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Mark Dilger <mark.dilger@enterprisedb.com> writes:

While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than responding to that patch, I’m starting this new thread.
I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash partition vs UPDATE” thread. As such, I’ve included Tom and Amit in the CC list.

Hm, I don't see any bug here. You're asking it to join

CREATE TABLE alpha (a TEXT COLLATE "ja_JP", b TEXT COLLATE "sv_SE");
CREATE TABLE beta (a TEXT COLLATE "tr_TR", b TEXT COLLATE "en_US");

SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');

so t1.a and t2.a have different collations, and the system can't resolve
which to use for the comparison.

Now, I'd be the first to agree that this error could be reported better.
The parser knows that it couldn't resolve a collation for t1.a = t2.a, but
what it does *not* know is whether the '=' operator cares for collation.
Throwing an error when the operator wouldn't care at runtime isn't going
to make many people happy. On the other hand, when the operator finally
does run and can't get a collation, all it knows is that it didn't get a
collation, not why. So we can't produce an error message as specific as
"ja_JP and tr_TR collations conflict".

Now that the collations feature has settled in, it'd be nice to go back
and see if we can't improve that somehow. Not sure how.

(BTW, before v12 the text '=' operator indeed did not care for collation,
so this example would've worked. But the change in behavior is a
necessary consequence of having invented nondeterministic collations,
not a bug.)

I contemplated that for a while before submitting the report. I agree that for strings that are not binary equal, some collations might say the two strings are equal, and other collations may say that they are not. But when does any collation say that a string is not equal to itself? All the strings in these columns were loaded from the same source table, and they should always equal themselves, so the only problem I am aware of is if some of them equal others of them under one of the collations in question, where the other collation doesn’t think so. I’m pretty sure that does not exist in this concrete example.

I guess I’m arguing that the system is giving up too soon, saying, “In theory there might be values I don’t know how to compare, so I’m going to give up now and not look”.

I think what is happening here is that the system thinks, “Hey, I can use a hash join for this”, and then later realizes, “Oh, no, I can’t” and instead of falling back to something other than hash join, it gives up.

Is there some more fundamental reason this query couldn’t correctly be completed? I don’t mind being enlightened about the part that I’m missing.

If the answer here is just that you’d rather it always fail at planning time because that’s more deterministic than having it sometimes succeed and sometimes fail at runtime depending on which data has been loaded, ok, I can understand that. If so, then let’s put this error string into the docs, because right now, if you google

site:postgresql.org "could not determine which collation to use for string hashing”

you don’t get anything from the docs telling you that this is an expected outcome.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Amit Langote

amitlangote09@gmail.com

almost 6 years ago

In reply to: Mark Dilger (#4)

Re: Hash join not finding which collation to use for string hashing

Hi Mark,

On Wed, Jan 29, 2020 at 1:03 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jan 28, 2020, at 7:38 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

Mark Dilger <mark.dilger@enterprisedb.com> writes:

While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than responding to that patch, I’m starting this new thread.
I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash partition vs UPDATE” thread. As such, I’ve included Tom and Amit in the CC list.

Just to clarify, we only intended in the quoted thread to plug
relevant holes of the *partitioning* code, which IIRC was more
straightforward to do than appears to be the case here.

If the answer here is just that you’d rather it always fail at planning time because that’s more deterministic than having it sometimes succeed and sometimes fail at runtime depending on which data has been loaded, ok, I can understand that. If so, then let’s put this error string into the docs, because right now, if you google

site:postgresql.org "could not determine which collation to use for string hashing”

you don’t get anything from the docs telling you that this is an expected outcome.

You may have noticed that it's not only hash join that bails out:

EXPLAIN (COSTS OFF) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2
ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
QUERY PLAN
------------------------------------------------------------
Hash Join
Hash Cond: ((t2.a)::text = (t1.a)::text)
-> Seq Scan on beta t2
-> Hash
-> Seq Scan on alpha t1
Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
(6 rows)

SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
ERROR: could not determine which collation to use for string hashing
HINT: Use the COLLATE clause to set the collation explicitly.

SET enable_hashjoin TO off;
-- error occurs partway through ExecInitMergeJoin(), so EXPLAIN can't finish
EXPLAIN (COSTS OFF) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2
ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
ERROR: could not determine which collation to use for string comparison
HINT: Use the COLLATE clause to set the collation explicitly.

SET enable_mergejoin TO off;
EXPLAIN (COSTS OFF) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2
ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
QUERY PLAN
------------------------------------------------------------
Nested Loop
Join Filter: ((t1.a)::text = (t2.a)::text)
-> Seq Scan on beta t2
-> Materialize
-> Seq Scan on alpha t1
Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
(6 rows)

SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
ERROR: could not determine which collation to use for string comparison
HINT: Use the COLLATE clause to set the collation explicitly.

With PG 11, I can see that hash join and nestloop join work. But with
PG 12, this join can't possible work without an explicit COLLATE
clause. So it would be nice if we can report a more specific error
much sooner, possibly with some parser context, given that we now know
for sure that a join qual without a collation assigned will not work
at all. IOW, maybe we should aim for making the run-time collation
errors to be of "won't happen" category as much as possible.

Tom said:

Now, I'd be the first to agree that this error could be reported better.
The parser knows that it couldn't resolve a collation for t1.a = t2.a, but
what it does *not* know is whether the '=' operator cares for collation.
Throwing an error when the operator wouldn't care at runtime isn't going
to make many people happy. On the other hand, when the operator finally
does run and can't get a collation, all it knows is that it didn't get a
collation, not why. So we can't produce an error message as specific as
"ja_JP and tr_TR collations conflict".

Now that the collations feature has settled in, it'd be nice to go back
and see if we can't improve that somehow. Not sure how.

Would it make sense to catch a qual with unassigned collation
somewhere in the planner, where the qual's operator family is
estatblished, by checking if the operator family behavior is sensitive
to collations?

Thanks,
Amit

Mark Dilger

mark.dilger@enterprisedb.com

almost 6 years ago

In reply to: Amit Langote (#5)

Re: Hash join not finding which collation to use for string hashing

On Jan 29, 2020, at 10:14 PM, Amit Langote <amitlangote09@gmail.com> wrote:

SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
ERROR: could not determine which collation to use for string comparison
HINT: Use the COLLATE clause to set the collation explicitly.

With PG 11, I can see that hash join and nestloop join work. But with
PG 12, this join can't possible work without an explicit COLLATE
clause. So it would be nice if we can report a more specific error
much sooner, possibly with some parser context, given that we now know
for sure that a join qual without a collation assigned will not work
at all. IOW, maybe we should aim for making the run-time collation
errors to be of "won't happen" category as much as possible.

Tom said:

Now, I'd be the first to agree that this error could be reported better.
The parser knows that it couldn't resolve a collation for t1.a = t2.a, but
what it does *not* know is whether the '=' operator cares for collation.
Throwing an error when the operator wouldn't care at runtime isn't going
to make many people happy. On the other hand, when the operator finally
does run and can't get a collation, all it knows is that it didn't get a
collation, not why. So we can't produce an error message as specific as
"ja_JP and tr_TR collations conflict".

Now that the collations feature has settled in, it'd be nice to go back
and see if we can't improve that somehow. Not sure how.

Would it make sense to catch a qual with unassigned collation
somewhere in the planner, where the qual's operator family is
estatblished, by checking if the operator family behavior is sensitive
to collations?

Hi Amit, I appreciate your attention to my question, but I’m not ready to delve into possible fixes, as I still don’t entirely understand the problem.

According to Tom:

(BTW, before v12 the text '=' operator indeed did not care for collation,
so this example would've worked. But the change in behavior is a
necessary consequence of having invented nondeterministic collations,
not a bug.)

I’m still struggling with that, because the four collations I used in the example are all deterministic. I totally understand why having more than one collation matters if you ask that your data be in sorted order, as the system needs to know which ordering to use. But for equality, I would think that deterministic collations are all interchangeable, because they all agree on whether A = B, regardless of the collation defined on column A and/or on column B. Maybe I’m wrong about that. But that’s my reading of the definition of “deterministic collation” given in the docs:

A deterministic collation uses deterministic comparisons, which means that it considers strings to be equal only if they consist of the same byte sequence.

I’m reading that as “If and only if”, and maybe I’m wrong to do so. Maybe that’s my error. But assuming that part is ok, it would seem to be sufficient to know that the columns being joined use deterministic collations, and you wouldn’t need them to be the *same* collations, nor even remember which collations they were. You’d just need information passed down that collations can be ignored for this comparison, or that a built-in byte-for-byte equality comparator should be used rather than the collation’s equality comparator, or some such solution.

I’m guessing I’m wrong about at least one of these things, and I’m hoping somebody enlightens me.

Thanks so much in advance,

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Mark Dilger

mark.dilger@enterprisedb.com

almost 6 years ago

In reply to: Mark Dilger (#6)

Re: Hash join not finding which collation to use for string hashing

Would it make sense to catch a qual with unassigned collation
somewhere in the planner, where the qual's operator family is
estatblished, by checking if the operator family behavior is sensitive
to collations?

I’m not sure how to do that. pg_opfamily doesn’t seem to have a field for that. Can you recommend how I would proceed there?

There may be operators other than = and != that are worth thinking about, but for now I’m really only thinking about those two. It is hard for me to see how we could ignore a collation mismatch for any other operator, but maybe I’m just not thinking about it the right way.

So, for = and !=, I’m looking at the definition of texteq, and it calls check_collation_set as one of the very first things it does. That’s where the error that annoys me comes out. But I don’t think it really needs to be doing this. It should first be determining if collation *matters*. It can’t do that right now, because it gets that information from this line:

if (lc_collate_is_c(collid) ||
collid == DEFAULT_COLLATION_OID ||
pg_newlocale_from_collation(collid)->deterministic)

Which obviously won’t work if collid hasn’t been set. So three approaches come to my mind:

1) Somewhere upstream from calling texteq, figure out that we don’t actually care about the collation stuff, because both the left and right side of the comparison use deterministic collations and the comparison we’re calling is equality, and then pass down a dummy collation such as “C” even though that isn’t actually true.

The problem with (1) that I see is that it could lead to the wrong collation being mentioned in error messages, though I haven’t looked, and that it’s enough of a hack that it might make coding in this area harder in the future.

2) Somewhere upstream from calling texteq, pass in a new boolean flag that specifies whether collation matters, and extend texteq to take an additional argument.

This also seems very hacky to me, but for different reasons.

3) Extend the concept of collations to collation sets. Right now, I’m only thinking about a collation set as having two values, the lefthand and the righthand side, but maybe there are other cases like (Left, (Left,Right)) that get built up and need to work. Anyway, at the point in the executor that the collations don’t match, instead of passing NULL down the line, pass in a collation set (Left, Right), and functions like texteq can see that they’re dealing with two different collations and decide if they can deal with that or if they need to throw an error.

I bet if we went with (3), the error being thrown in the example I used to start this thread would go away, without breaking anything else. I’m going to go poke at that a bit, but I’d still appreciate any comments/concerns about my analysis.

According to Tom:

(BTW, before v12 the text '=' operator indeed did not care for collation,
so this example would've worked. But the change in behavior is a
necessary consequence of having invented nondeterministic collations,
not a bug.)

I’m still struggling with that, because the four collations I used in the example are all deterministic. I totally understand why having more than one collation matters if you ask that your data be in sorted order, as the system needs to know which ordering to use. But for equality, I would think that deterministic collations are all interchangeable, because they all agree on whether A = B, regardless of the collation defined on column A and/or on column B. Maybe I’m wrong about that. But that’s my reading of the definition of “deterministic collation” given in the docs:

A deterministic collation uses deterministic comparisons, which means that it considers strings to be equal only if they consist of the same byte sequence.

I’m reading that as “If and only if”, and maybe I’m wrong to do so. Maybe that’s my error. But assuming that part is ok, it would seem to be sufficient to know that the columns being joined use deterministic collations, and you wouldn’t need them to be the *same* collations, nor even remember which collations they were. You’d just need information passed down that collations can be ignored for this comparison, or that a built-in byte-for-byte equality comparator should be used rather than the collation’s equality comparator, or some such solution.

I’m starting to think that “consequence of having invented nondeterministic collations” in Tom’s message really should read “consequence of having invented nondeterministic collations without reworking these other interfaces”, but once again, I’m hoping to be corrected if I’ve gone off in the wrong direction here.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Tom Lane

tgl@sss.pgh.pa.us

almost 6 years ago

In reply to: Mark Dilger (#6)

Re: Hash join not finding which collation to use for string hashing

Mark Dilger <mark.dilger@enterprisedb.com> writes:

According to Tom:

(BTW, before v12 the text '=' operator indeed did not care for collation,
so this example would've worked. But the change in behavior is a
necessary consequence of having invented nondeterministic collations,
not a bug.)

I’m still struggling with that, because the four collations I used in
the example are all deterministic. I totally understand why having more
than one collation matters if you ask that your data be in sorted order,
as the system needs to know which ordering to use. But for equality, I
would think that deterministic collations are all interchangeable,
because they all agree on whether A = B, regardless of the collation
defined on column A and/or on column B. Maybe I’m wrong about that.

Well, you're not wrong, but you're assuming much finer distinctions
than the collation machinery actually makes (or than it'd be sane
to ask it to make, IMO). We don't have a way to tell texteq that
"well, we don't know what collation to assign to this operation,
but it's okay to assume that it's deterministic". Nor does the
parser have any way to know that texteq could be satisfied by
that knowledge --- if it doesn't even know whether texteq cares
about collation, how could it know that?

There are other issues here too. Just because the query could
theoretically be implemented without reference to any specific
collation doesn't mean that that's a practical thing to do.
It'd be unclear for instance whether we can safely use indexes
that *do* have specific collations attached. We'd also lose
the option to consider plans like mergejoins.

If the parser understood that a particular operator behaved
like text equality --- which it does not, and I guarantee you
I will shoot down any proposal to hard-wire a parser test for
that particular operator --- you could imagine assigning "C"
collation when we have an unresolvable combination of
deterministic collations for the inputs. That dodges the
problem of not having any way to represent the situation.
But it's still got implementation issues, in that such a
collation choice probably won't match the collations of any
indexes for the input columns.

Another issue is that collations "bubble up" in the parse tree,
so sneaking in a collation that's not supposed to be there per
spec carries a risk of causing unexpected semantics further up.
I think we could get away with that for the particular case of
equality (which returns collation-less boolean), but this is
another thing that makes the case narrower and less useful.

In the end, TBH, I'm not finding your example compelling enough
to be worth putting in weird hacks for such cases. If you're
joining columns of dissimilar collations, you're going to be
finding it necessary to specify what collation to use in a lot
of places ... so where's the value in making a weird special
case for equality?

regards, tom lane

Robert Haas

robertmhaas@gmail.com

almost 6 years ago

In reply to: Mark Dilger (#7)

Re: Hash join not finding which collation to use for string hashing

On Thu, Jan 30, 2020 at 2:44 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

3) Extend the concept of collations to collation sets. Right now, I’m only thinking about a collation set as having two values, the lefthand and the righthand side, but maybe there are other cases like (Left, (Left,Right)) that get built up and need to work. Anyway, at the point in the executor that the collations don’t match, instead of passing NULL down the line, pass in a collation set (Left, Right), and functions like texteq can see that they’re dealing with two different collations and decide if they can deal with that or if they need to throw an error.

I bet if we went with (3), the error being thrown in the example I used to start this thread would go away, without breaking anything else. I’m going to go poke at that a bit, but I’d still appreciate any comments/concerns about my analysis.

I assume that what would have to happen to implement this is that an
SQL-callable function would be passed more than one collation OID,
perhaps one per argument or something like that. Notice, however, that
this would require changing the way that functions get called. See the
DirectFunctionCall{1,2,3,...}Coll() and
FunctionCall{0,1,2,3,...}Coll() and the definition of
FunctionCallInfoBaseData -- there's only one spot for an OID available
right now. Allowing for more would likely have a noticeable impact on
the cost of calling SQL-callable functions, and that's already
expensive enough that people have been unhappy about it. It seems
unlikely that it would be worth incurring more overhead here for every
query all the time just to make this case work.

It is, perhaps, a little strange that the only two choices for an
operator are "cares about collation" and "doesn't," and I somehow feel
like there ought to be a way to do better. But I don't know what it
is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10

Mark Dilger

mark.dilger@enterprisedb.com

almost 6 years ago

In reply to: Tom Lane (#8)

Re: Hash join not finding which collation to use for string hashing

On Jan 30, 2020, at 12:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Mark Dilger <mark.dilger@enterprisedb.com> writes:

According to Tom:

(BTW, before v12 the text '=' operator indeed did not care for collation,
so this example would've worked. But the change in behavior is a
necessary consequence of having invented nondeterministic collations,
not a bug.)

I’m still struggling with that, because the four collations I used in
the example are all deterministic. I totally understand why having more
than one collation matters if you ask that your data be in sorted order,
as the system needs to know which ordering to use. But for equality, I
would think that deterministic collations are all interchangeable,
because they all agree on whether A = B, regardless of the collation
defined on column A and/or on column B. Maybe I’m wrong about that.

Well, you're not wrong, but you're assuming much finer distinctions
than the collation machinery actually makes (or than it'd be sane
to ask it to make, IMO). We don't have a way to tell texteq that
"well, we don't know what collation to assign to this operation,
but it's okay to assume that it's deterministic". Nor does the
parser have any way to know that texteq could be satisfied by
that knowledge --- if it doesn't even know whether texteq cares
about collation, how could it know that?

I agree. Having this in the parser seems really weird and unwholesome.

There are other issues here too. Just because the query could
theoretically be implemented without reference to any specific
collation doesn't mean that that's a practical thing to do.
It'd be unclear for instance whether we can safely use indexes
that *do* have specific collations attached. We'd also lose
the option to consider plans like mergejoins.

If the parser understood that a particular operator behaved
like text equality --- which it does not, and I guarantee you
I will shoot down any proposal to hard-wire a parser test for
that particular operator --- you could imagine assigning "C"
collation when we have an unresolvable combination of
deterministic collations for the inputs. That dodges the
problem of not having any way to represent the situation.
But it's still got implementation issues, in that such a
collation choice probably won't match the collations of any
indexes for the input columns.

Yeah, I disclaimed that idea in a subsequent email, but if you’re responding to my emails in the order that you receive them (which is totally reasonable), then you aren’t to know that yet.

Another issue is that collations "bubble up" in the parse tree,
so sneaking in a collation that's not supposed to be there per
spec carries a risk of causing unexpected semantics further up.
I think we could get away with that for the particular case of
equality (which returns collation-less boolean), but this is
another thing that makes the case narrower and less useful.

I was wondering if bubbling up (LeftCollation,RightCollation) would be ok. There are likely cases that can’t make use of that, but those places would just throw the same sort of error that they’re currently throwing, except they’d have a more useful error message because it would include which collations were mismatched.

In the end, TBH, I'm not finding your example compelling enough
to be worth putting in weird hacks for such cases. If you're
joining columns of dissimilar collations, you're going to be
finding it necessary to specify what collation to use in a lot
of places ... so where's the value in making a weird special
case for equality?

I agree with your position against weird hacks. If the only way to do this is a weird hack, then forget about it.

If I’m not putting upon your time too much, could you respond to my other email in this thread as to whether it sounds any better?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11

Tom Lane

tgl@sss.pgh.pa.us

almost 6 years ago

In reply to: Mark Dilger (#7)

Re: Hash join not finding which collation to use for string hashing

Mark Dilger <mark.dilger@enterprisedb.com> writes:

Would it make sense to catch a qual with unassigned collation
somewhere in the planner, where the qual's operator family is
estatblished, by checking if the operator family behavior is sensitive
to collations?

I’m not sure how to do that. pg_opfamily doesn’t seem to have a field for that. Can you recommend how I would proceed there?

There's no such information attached to opfamilies, which is more or less
forced by the fact that individual operators don't expose it either.
There's not much hope of making that better without incompatible changes
in the requirements for extensions to define operators and/or operator
families.

So, for = and !=, I’m looking at the definition of texteq, and it calls check_collation_set as one of the very first things it does. That’s where the error that annoys me comes out. But I don’t think it really needs to be doing this. It should first be determining if collation *matters*.

But of course it matters. How do you know whether the operation is
deterministic if you don't know the collation?

3) Extend the concept of collations to collation sets. Right now, I’m only thinking about a collation set as having two values, the lefthand and the righthand side, but maybe there are other cases like (Left, (Left,Right)) that get built up and need to work. Anyway, at the point in the executor that the collations don’t match, instead of passing NULL down the line, pass in a collation set (Left, Right), and functions like texteq can see that they’re dealing with two different collations and decide if they can deal with that or if they need to throw an error.

Maybe this could work. I think it would get messy when bubbling up
collations, but as long as you're talking about "sets" not "pairs"
it might be possible to postpone collation resolution.

To me, though, the main advantage of this is that we could throw a
more explicit error like "collations "ja_JP" and "tr_TR" cannot be
unified", since that information would still be there at runtime.
I'm still pretty dubious that having texteq special-case the situation
where the collations are different but all deterministic is a reasonable
thing to do.

One practical problem is that postponing that work to runtime could be
a huge performance hit, because you'd have to do it over again on each
call of the operator. I suppose some caching might be possible.

Another issue is that you're still putting far too much emphasis on
the fact that a hash-join plan manages to avoid this error, and ignoring
the problem that a lot of other plans for the same query will not avoid
it. What if the planner had chosen a merge-join, for instance? How
useful is it to allow the join if things still break the moment you
add an ORDER BY?

regards, tom lane

#12

Tom Lane

tgl@sss.pgh.pa.us

almost 6 years ago

In reply to: Robert Haas (#9)

Re: Hash join not finding which collation to use for string hashing

Robert Haas <robertmhaas@gmail.com> writes:

I assume that what would have to happen to implement this is that an
SQL-callable function would be passed more than one collation OID,
perhaps one per argument or something like that. Notice, however, that
this would require changing the way that functions get called. See the
DirectFunctionCall{1,2,3,...}Coll() and
FunctionCall{0,1,2,3,...}Coll() and the definition of
FunctionCallInfoBaseData -- there's only one spot for an OID available
right now. Allowing for more would likely have a noticeable impact on
the cost of calling SQL-callable functions, and that's already
expensive enough that people have been unhappy about it. It seems
unlikely that it would be worth incurring more overhead here for every
query all the time just to make this case work.

The implementation I was visualizing was replacing, eg,
FuncExpr.inputcollid with an OID List, and then teaching PG_GET_COLLATION
to throw an error if the list is longer than one element. I agree that
the performance implications of that would be pretty troublesome, though.

In the end, it seems like the only solution that would be remotely
practical from a performance standpoint is to redefine things so that
collation-sensitive functions have to be labeled as such in pg_proc,
and then we can have the parser throw the appropriate error if it
can't resolve an input collation for such a function. Perhaps the
backwards-compatibility hit wouldn't be as bad as it first seems,
since the whole thing can be ignored for functions that haven't got at
least one collatable input, and most of those would likely be all right
with a default assumption that they are collation sensitive. Or maybe
better, we could make the default assumption be that they aren't
sensitive, with the same error still being thrown at runtime if they are,
so that extensions have to take positive action to get the better error
behavior but if they don't then things are no worse than today.

Mark, obviously, would then lobby for the pg_proc marking to
include one state that identifies functions that only care about
collation when it's nondeterministic. But I'm still not very
sure how that would work as soon as you look anyplace except at
what texteq() itself would do. The questions of whether such a
query matches a given index, or could be implemented via mergejoin,
etc, remain.

regards, tom lane

#13

Mark Dilger

mark.dilger@enterprisedb.com

almost 6 years ago

In reply to: Tom Lane (#11)

Re: Hash join not finding which collation to use for string hashing

On Jan 30, 2020, at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Mark Dilger <mark.dilger@enterprisedb.com> writes:

Would it make sense to catch a qual with unassigned collation
somewhere in the planner, where the qual's operator family is
estatblished, by checking if the operator family behavior is sensitive
to collations?

I’m not sure how to do that. pg_opfamily doesn’t seem to have a field for that. Can you recommend how I would proceed there?

There's no such information attached to opfamilies, which is more or less
forced by the fact that individual operators don't expose it either.
There's not much hope of making that better without incompatible changes
in the requirements for extensions to define operators and/or operator
families.

Thanks, Tom, for confirming this.

Given the excellent explanations you and Robert have given, I think I’m retracting this whole idea and accepting your positions that it’s not worth it.

For the archives, I’m still going to respond to the rest of what you say:

So, for = and !=, I’m looking at the definition of texteq, and it calls check_collation_set as one of the very first things it does. That’s where the error that annoys me comes out. But I don’t think it really needs to be doing this. It should first be determining if collation *matters*.

But of course it matters. How do you know whether the operation is
deterministic if you don't know the collation?

3) Extend the concept of collations to collation sets. Right now, I’m only thinking about a collation set as having two values, the lefthand and the righthand side, but maybe there are other cases like (Left, (Left,Right)) that get built up and need to work. Anyway, at the point in the executor that the collations don’t match, instead of passing NULL down the line, pass in a collation set (Left, Right), and functions like texteq can see that they’re dealing with two different collations and decide if they can deal with that or if they need to throw an error.

Maybe this could work. I think it would get messy when bubbling up
collations, but as long as you're talking about "sets" not "pairs"
it might be possible to postpone collation resolution.

To me, though, the main advantage of this is that we could throw a
more explicit error like "collations "ja_JP" and "tr_TR" cannot be
unified", since that information would still be there at runtime.
I'm still pretty dubious that having texteq special-case the situation
where the collations are different but all deterministic is a reasonable
thing to do.

On my mac, when I run “SELECT * FROM pg_collation”, every one of the 271 rows I get back have collisdeterministic true. I know that which collations you get on a system is variable, so I’m not saying that nobody has nondeterministic collations, but it seems common enough that mismatched collations will both be deterministic. That’s the common case, not some weird edge case.

So the issue here seems to be whether equality should get different treatment from other operators, and I obviously am arguing that it should, though you and Robert have both made really good points against that position.

One practical problem is that postponing that work to runtime could be
a huge performance hit, because you'd have to do it over again on each
call of the operator. I suppose some caching might be possible.

Yes, Robert mentioned performance implications, too.

Another issue is that you're still putting far too much emphasis on
the fact that a hash-join plan manages to avoid this error, and ignoring
the problem that a lot of other plans for the same query will not avoid
it. What if the planner had chosen a merge-join, for instance?

You’re looking at the problem from the point of view of how postgres is currently and historically implemented, and seeing that this problem is hard. I was looking at it more from the perspective of a user who gets the error message and thinks, “this is stupid, the query is refusing to run for want of a collation being specified, but I can clearly see that it doesn’t actually need one.” I think that same reaction from the user would happen if the planner chose a merge-join. The user would just say, “gee, what a stupid planner, why did it choose a merge join for this when the lack of a collation clause clearly indicates that it should have limited itself to something that only needs equality comparison.”

I’m not calling the planner stupid, nor the system generally, but I know that people get frustrated with systems that have unintuitive limitations like this when they don’t know the internals of the system that give rise to the limitations.

How
useful is it to allow the join if things still break the moment you
add an ORDER BY?

I think that’s apples-to-oranges. If you ask the system to order the data, and you’ve got ambiguity about which ordering you mean, then of course it can’t continue until you tell it which collation you want.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14

Amit Langote

amitlangote09@gmail.com

almost 6 years ago

In reply to: Mark Dilger (#13)

Re: Hash join not finding which collation to use for string hashing

On Fri, Jan 31, 2020 at 6:15 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Jan 30, 2020, at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Mark Dilger <mark.dilger@enterprisedb.com> writes:

Would it make sense to catch a qual with unassigned collation
somewhere in the planner, where the qual's operator family is
estatblished, by checking if the operator family behavior is sensitive
to collations?

I’m not sure how to do that. pg_opfamily doesn’t seem to have a field for that. Can you recommend how I would proceed there?

There's no such information attached to opfamilies, which is more or less
forced by the fact that individual operators don't expose it either.
There's not much hope of making that better without incompatible changes
in the requirements for extensions to define operators and/or operator
families.

Thanks, Tom, for confirming this.

Just for the record, I will explain why I brought up doing anything
with operator families at all. What I was really imagining is putting
a hard-coded check somewhere in the middle of equivalence processing
to see if a given qual's operator would be sensitive to collation
based *only* on whether it belongs to a text operator family, such as
TEXT_BTREE_FAM_OID, whereas the qual expression's inputcollid is 0
(parser failed to resolve collation conflict among its arguments) and
erroring out if so. If we do that, maybe we won't need
check_collation_set() that's used in various text operators. Also,
erroring out sooner might allow us to produce more specific error
message, which as I understand it, would help with one of the Mark's
complaints that the error message is too ambiguous due to emitted as
such a low layer.

I thought of the idea after running into a recent commit relevant to
non-deterministic collations:

commit 2810396312664bdb941e549df7dfa75218d73a1c
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat Sep 21 16:29:17 2019 -0400

Fix up handling of nondeterministic collations with pattern_ops opclasses.

text_pattern_ops and its siblings can't be used with nondeterministic
collations, because they use the text_eq operator which will not behave
as bitwise equality if applied with a nondeterministic collation. The
initial implementation of that restriction was to insert a run-time test
in the related comparison functions, but that is inefficient, may throw
misleading errors, and will throw errors in some cases that would work.
It seems sufficient to just prevent the combination during CREATE INDEX,
so do that instead.

Lacking any better way to identify the opclasses involved, we need to
hard-wire tests for them, which requires hand-assigned values for their
OIDs, which forces a catversion bump because they previously had OIDs
that would be assigned automatically. That's slightly annoying in the
v12 branch, but fortunately we're not at rc1 yet, so just do it.

Discussion: /messages/by-id/22566.1568675619@sss.pgh.pa.us

IIUC, the above commit removes a check_collation_set() call from a
operator class comparison function in favor of ensuring that an index
using that operator class can only be defined with a deterministic
collation in the first place. But as the above commit is purportedly
only a stop-gap solution due to lacking operator class infrastructure
to consider collation in operator semantics, maybe we shouldn't spread
such a hack in other places.

Thanks,
Amit