SQL/JSON path: collation for comparisons, minor typos in docs

Started by Markus Winandover 6 years ago9 messages

markus.winand@winand.at

over 6 years ago

1 attachment(s)

Hi!

I was playing around with JSON path quite a bit and might have found one case where the current implementation doesn’t follow the standard.

The functionality in question are the comparison operators except ==. They use the database default collation rather then the standard-mandated "Unicode codepoint collation” (SQL-2:2016 9.39 General Rule 12 c iii 2 D, last sentence in first paragraph).

I guess this is the relevant part of the code: src/backend/utils/adt/jsonpath_exec.c (compareItems)

case jbvString:
if (op == jpiEqual)
return jb1->val.string.len != jb2->val.string.len ||
memcmp(jb1->val.string.val,
jb2->val.string.val,
jb1->val.string.len) ? jpbFalse : jpbTrue;

cmp = varstr_cmp(jb1->val.string.val, jb1->val.string.len,
jb2->val.string.val, jb2->val.string.len,
DEFAULT_COLLATION_OID);
break;

Testcase:

postgres 12beta3=# select * from jsonb_path_query('"dummy"', '$ ? ("a" < "A")');
jsonb_path_query
------------------
"dummy"
(1 row)

In code points, lower case ‘a' is not less than upper case ‘A’—the result should be empty.

To convince myself:

postgres 12beta3=# select datcollate, 'a' < 'A', 'a' <'A' COLLATE ucs_basic from pg_database where datname=current_database();
datcollate | ?column? | ?column?
-------------+----------+----------
en_US.UTF-8 | t | f
(1 row)

I also found two minor typos in the docs. Patch attached.

-markus
ps.: I’ve created 230 test cases. Besides the WIP topic .datetime(), the collation issue is the only one I found. Excellent work. Down to the SQLSTATEs. For sure the most complete and correct SQL/JSON path implementation I've seen.

Attachments:

0001-Doc-Fix-typos-in-json-path-documentation.patchapplication/octet-stream; name=0001-Doc-Fix-typos-in-json-path-documentation.patch; x-unix-mode=0644Download

From a1f21d2d9562be86a1011ce2a929ab19c22b241a Mon Sep 17 00:00:00 2001
From: Markus Winand <markus.winand@winand.at>
Date: Wed, 7 Aug 2019 13:09:48 +0200
Subject: [PATCH] Doc: Fix typos in json path documentation

---
 doc/src/sgml/func.sgml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7412df0bae..f1d1005ae2 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -11735,7 +11735,7 @@ table2-mapping
      A path expression can be a Boolean predicate, although the SQL/JSON
      standard allows predicates only in filters.  This is necessary for
      implementation of the <literal>@@</literal> operator. For example,
-     the following<type>jsonpath</type> expression is valid in
+     the following <type>jsonpath</type> expression is valid in
      <productname>PostgreSQL</productname>:
 <programlisting>
 '$.track.segments[*].HR &lt; 70'
@@ -12997,7 +12997,7 @@ table2-mapping
         </entry>
         <entry>
          <para><literal>
-           select * jsonb_path_query('{"a":[1,2,3,4,5]}', '$.a[*] ? (@ >= $min &amp;&amp; @ &lt;= $max)', '{"min":2,"max":4}');
+           select * from jsonb_path_query('{"a":[1,2,3,4,5]}', '$.a[*] ? (@ >= $min &amp;&amp; @ &lt;= $max)', '{"min":2,"max":4}');
          </literal></para>
         </entry>
         <entry>
-- 
2.20.1

Alexander Korotkov

a.korotkov@postgrespro.ru

over 6 years ago

In reply to: Markus Winand (#1)

Re: SQL/JSON path: collation for comparisons, minor typos in docs

Hi!

On Wed, Aug 7, 2019 at 2:25 PM Markus Winand <markus.winand@winand.at> wrote:

I was playing around with JSON path quite a bit and might have found one case where the current implementation doesn’t follow the standard.

The functionality in question are the comparison operators except ==. They use the database default collation rather then the standard-mandated "Unicode codepoint collation” (SQL-2:2016 9.39 General Rule 12 c iii 2 D, last sentence in first paragraph).

Thank you for pointing! Nikita is about to write a patch fixing that.

I also found two minor typos in the docs. Patch attached.

Pushed, thanks.

-markus
ps.: I’ve created 230 test cases. Besides the WIP topic .datetime(), the collation issue is the only one I found. Excellent work. Down to the SQLSTATEs. For sure the most complete and correct SQL/JSON path implementation I've seen.

Thank you!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Alexander Korotkov

a.korotkov@postgrespro.ru

over 6 years ago

In reply to: Alexander Korotkov (#2)

1 attachment(s)

Re: SQL/JSON path: collation for comparisons, minor typos in docs

On Wed, Aug 7, 2019 at 4:11 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Wed, Aug 7, 2019 at 2:25 PM Markus Winand <markus.winand@winand.at> wrote:

I was playing around with JSON path quite a bit and might have found one case where the current implementation doesn’t follow the standard.

The functionality in question are the comparison operators except ==. They use the database default collation rather then the standard-mandated "Unicode codepoint collation” (SQL-2:2016 9.39 General Rule 12 c iii 2 D, last sentence in first paragraph).

Thank you for pointing! Nikita is about to write a patch fixing that.

Please, see the attached patch.

Our idea is to not sacrifice "==" operator performance for standard
conformance. So, "==" remains per-byte comparison. For consistency
in other operators we compare code points first, then do per-byte
comparison. In some edge cases, when same Unicode codepoints have
different binary representations in database encoding, this behavior
diverges standard. In future we can implement strict standard
conformance by normalization of input JSON strings.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

0001-Use-Unicode-codepoint-collation-in-jsonpath-2.patchapplication/octet-stream; name=0001-Use-Unicode-codepoint-collation-in-jsonpath-2.patchDownload

commit e0e14e301c9da9d25dfebfe89039210f46c0efe1
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date:   Wed Aug 7 23:41:58 2019 +0300

    Adjust string comparison in jsonpath
    
    We have implemented jsonpath string comparison using default database locale.
    However, standard requires us to compare Unicode codepoints.  This commit
    implements that, but for performance reasons we still use per-byte comparison
    for "==" operator.  Thus, for consistency other comparison operators do per-byte
    comparison if Unicode codepoints appear to be equal.
    
    In some edge cases, when same Unicode codepoints have different binary
    representations in database encoding, we diverge standard to achieve better
    performance of "==" operator.  In future to implement strict standard
    conformance, we can do normalization of input JSON strings.
    
    Original patch was written by Nikita Glukhov, rewritten by me.
    
    Reported-by: Markus Winand
    Discussion: https://postgr.es/m/8B7FA3B4-328D-43D7-95A8-37B8891B8C78%40winand.at
    Author: Nikita Glukhov, Alexander Korotkov
    Backpatch-through: 12

diff --git a/src/backend/utils/adt/jsonpath_exec.c b/src/backend/utils/adt/jsonpath_exec.c
index 293d6da027c..ced036d112d 100644
--- a/src/backend/utils/adt/jsonpath_exec.c
+++ b/src/backend/utils/adt/jsonpath_exec.c
@@ -1980,6 +1980,103 @@ executeComparison(JsonPathItem *cmp, JsonbValue *lv, JsonbValue *rv, void *p)
 	return compareItems(cmp->type, lv, rv);
 }
 
+/*
+ * Perform per-byte comparison of two strings.
+ */
+static int
+binaryCompareStrings(const char *s1, int len1,
+					 const char *s2, int len2)
+{
+	int			cmp;
+
+	cmp = memcmp(s1, s2, Min(len1, len2));
+
+	if (cmp != 0)
+		return cmp;
+
+	if (len1 == len2)
+		return 0;
+
+	return len1 < len2 ? -1 : 1;
+}
+
+/*
+ * Compare two strings in the current server encoding using Unicode codepoint
+ * collation.
+ */
+static int
+compareStrings(const char *mbstr1, int mblen1,
+			   const char *mbstr2, int mblen2)
+{
+	if (GetDatabaseEncoding() == PG_SQL_ASCII)
+	{
+		/*
+		 * ASCII encoding: simple case, just do binary comparison.
+		 */
+		return binaryCompareStrings(mbstr1, mblen1, mbstr2, mblen2);
+	}
+	else
+	{
+		/*
+		 * In order to get Unicode codepoints, we have to convert to utf8
+		 * first.  If server encoding is utf8, pg_server_to_any() will just
+		 * return original pointers.
+		 */
+		char	   *utf8str1 = pg_server_to_any(mbstr1, mblen1, PG_UTF8),
+				   *utf8str2 = pg_server_to_any(mbstr2, mblen2, PG_UTF8);
+		unsigned char *utf8c1,
+				   *utf8c2;
+		pg_wchar	uchar1,
+					uchar2;
+		int			cmp = 0;
+
+		/*
+		 * Loop over utf8 characters comparing corresponding Unicode
+		 * codepoints.
+		 */
+		utf8c1 = (unsigned char *) utf8str1;
+		utf8c2 = (unsigned char *) utf8str2;
+		while (*utf8c1 && *utf8c2)
+		{
+			uchar1 = utf8_to_unicode(utf8c1);
+			uchar2 = utf8_to_unicode(utf8c2);
+
+			if (uchar1 != uchar2)
+			{
+				cmp = uchar1 < uchar2 ? -1 : 1;
+				break;
+			}
+
+			utf8c1 += pg_utf_mblen(utf8c1);
+			utf8c2 += pg_utf_mblen(utf8c2);
+		}
+
+		/* If one string is prefix on another, shorter is lesser */
+		if (cmp == 0 && *utf8c1 != *utf8c2)
+			cmp = *utf8c1 == 0 ? -1 : 1;
+
+		/* Free allocated strings if any */
+		if (utf8str1 != mbstr1)
+			pfree(utf8str1);
+		if (utf8str2 != mbstr2)
+			pfree(utf8str2);
+
+		/*
+		 * When all Unicode codepoints are equal, return result of binary
+		 * comparison.  In some edge cases, same characters may have different
+		 * representations in encoding.  Then our behavior could diverge from
+		 * standard.  However, that allow us to do simple binary comparison
+		 * for "==" operator, which is performance critical in typical cases.
+		 * In future to implement strict standard conformance, we should do
+		 * normalization of input JSON strings.
+		 */
+		if (cmp != 0)
+			return binaryCompareStrings(mbstr1, mblen1, mbstr2, mblen2);
+		else
+			return cmp;
+	}
+}
+
 /*
  * Compare two SQL/JSON items using comparison operation 'op'.
  */
@@ -2017,14 +2114,13 @@ compareItems(int32 op, JsonbValue *jb1, JsonbValue *jb2)
 			break;
 		case jbvString:
 			if (op == jpiEqual)
-				return jb1->val.string.len != jb2->val.string.len ||
-					memcmp(jb1->val.string.val,
-						   jb2->val.string.val,
-						   jb1->val.string.len) ? jpbFalse : jpbTrue;
-
-			cmp = varstr_cmp(jb1->val.string.val, jb1->val.string.len,
-							 jb2->val.string.val, jb2->val.string.len,
-							 DEFAULT_COLLATION_OID);
+				return binaryCompareStrings(jb1->val.string.val,
+											jb1->val.string.len,
+											jb2->val.string.val,
+											jb2->val.string.len) ? jpbFalse : jpbTrue;
+
+			cmp = compareStrings(jb1->val.string.val, jb1->val.string.len,
+								 jb2->val.string.val, jb2->val.string.len);
 			break;
 
 		case jbvBinary:
diff --git a/src/test/regress/expected/jsonb_jsonpath.out b/src/test/regress/expected/jsonb_jsonpath.out
index 31a871af028..0202667a1f7 100644
--- a/src/test/regress/expected/jsonb_jsonpath.out
+++ b/src/test/regress/expected/jsonb_jsonpath.out
@@ -1833,3 +1833,166 @@ SELECT jsonb_path_match('[{"a": 1}, {"a": 2}]', '$[*].a > 1');
  t
 (1 row)
 
+-- test string comparison (Unicode codepoint collation)
+WITH str(j, num) AS
+(
+	SELECT jsonb_build_object('s', s), num
+	FROM unnest('{"", "a", "ab", "abc", "abcd", "b", "A", "AB", "ABC", "ABc", "ABcD", "B"}'::text[]) WITH ORDINALITY AS a(s, num)
+)
+SELECT
+	s1.j, s2.j,
+	jsonb_path_query_first(s1.j, '$.s < $s', vars => s2.j) lt,
+	jsonb_path_query_first(s1.j, '$.s <= $s', vars => s2.j) le,
+	jsonb_path_query_first(s1.j, '$.s == $s', vars => s2.j) eq,
+	jsonb_path_query_first(s1.j, '$.s >= $s', vars => s2.j) ge,
+	jsonb_path_query_first(s1.j, '$.s > $s', vars => s2.j) gt
+FROM str s1, str s2
+ORDER BY s1.num, s2.num;
+       j       |       j       |  lt   |  le   |  eq   |  ge   |  gt   
+---------------+---------------+-------+-------+-------+-------+-------
+ {"s": ""}     | {"s": ""}     | false | true  | true  | true  | false
+ {"s": ""}     | {"s": "a"}    | true  | true  | false | false | false
+ {"s": ""}     | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": ""}     | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": ""}     | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": ""}     | {"s": "b"}    | true  | true  | false | false | false
+ {"s": ""}     | {"s": "A"}    | true  | true  | false | false | false
+ {"s": ""}     | {"s": "AB"}   | true  | true  | false | false | false
+ {"s": ""}     | {"s": "ABC"}  | true  | true  | false | false | false
+ {"s": ""}     | {"s": "ABc"}  | true  | true  | false | false | false
+ {"s": ""}     | {"s": "ABcD"} | true  | true  | false | false | false
+ {"s": ""}     | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "a"}    | {"s": ""}     | false | false | false | true  | true
+ {"s": "a"}    | {"s": "a"}    | false | true  | true  | true  | false
+ {"s": "a"}    | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "a"}    | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "a"}    | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "a"}    | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "a"}    | {"s": "A"}    | false | false | false | true  | true
+ {"s": "a"}    | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "a"}    | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "a"}    | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "a"}    | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "a"}    | {"s": "B"}    | false | false | false | true  | true
+ {"s": "ab"}   | {"s": ""}     | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "a"}    | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "ab"}   | false | true  | true  | true  | false
+ {"s": "ab"}   | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "ab"}   | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "ab"}   | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "ab"}   | {"s": "A"}    | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "B"}    | false | false | false | true  | true
+ {"s": "abc"}  | {"s": ""}     | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "a"}    | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "ab"}   | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "abc"}  | false | true  | true  | true  | false
+ {"s": "abc"}  | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "abc"}  | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "abc"}  | {"s": "A"}    | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "B"}    | false | false | false | true  | true
+ {"s": "abcd"} | {"s": ""}     | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "a"}    | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "ab"}   | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "abc"}  | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "abcd"} | false | true  | true  | true  | false
+ {"s": "abcd"} | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "abcd"} | {"s": "A"}    | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "B"}    | false | false | false | true  | true
+ {"s": "b"}    | {"s": ""}     | false | false | false | true  | true
+ {"s": "b"}    | {"s": "a"}    | false | false | false | true  | true
+ {"s": "b"}    | {"s": "ab"}   | false | false | false | true  | true
+ {"s": "b"}    | {"s": "abc"}  | false | false | false | true  | true
+ {"s": "b"}    | {"s": "abcd"} | false | false | false | true  | true
+ {"s": "b"}    | {"s": "b"}    | false | true  | true  | true  | false
+ {"s": "b"}    | {"s": "A"}    | false | false | false | true  | true
+ {"s": "b"}    | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "b"}    | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "b"}    | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "b"}    | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "b"}    | {"s": "B"}    | false | false | false | true  | true
+ {"s": "A"}    | {"s": ""}     | false | false | false | true  | true
+ {"s": "A"}    | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "A"}    | false | true  | true  | true  | false
+ {"s": "A"}    | {"s": "AB"}   | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "ABC"}  | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "ABc"}  | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "ABcD"} | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": ""}     | false | false | false | true  | true
+ {"s": "AB"}   | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "A"}    | false | false | false | true  | true
+ {"s": "AB"}   | {"s": "AB"}   | false | true  | true  | true  | false
+ {"s": "AB"}   | {"s": "ABC"}  | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "ABc"}  | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "ABcD"} | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": ""}     | false | false | false | true  | true
+ {"s": "ABC"}  | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "A"}    | false | false | false | true  | true
+ {"s": "ABC"}  | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "ABC"}  | {"s": "ABC"}  | false | true  | true  | true  | false
+ {"s": "ABC"}  | {"s": "ABc"}  | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "ABcD"} | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": ""}     | false | false | false | true  | true
+ {"s": "ABc"}  | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "A"}    | false | false | false | true  | true
+ {"s": "ABc"}  | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "ABc"}  | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "ABc"}  | {"s": "ABc"}  | false | true  | true  | true  | false
+ {"s": "ABc"}  | {"s": "ABcD"} | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": ""}     | false | false | false | true  | true
+ {"s": "ABcD"} | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": "A"}    | false | false | false | true  | true
+ {"s": "ABcD"} | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "ABcD"} | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "ABcD"} | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "ABcD"} | {"s": "ABcD"} | false | true  | true  | true  | false
+ {"s": "ABcD"} | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "B"}    | {"s": ""}     | false | false | false | true  | true
+ {"s": "B"}    | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "B"}    | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "B"}    | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "B"}    | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "B"}    | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "B"}    | {"s": "A"}    | false | false | false | true  | true
+ {"s": "B"}    | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "B"}    | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "B"}    | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "B"}    | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "B"}    | {"s": "B"}    | false | true  | true  | true  | false
+(144 rows)
+
diff --git a/src/test/regress/sql/jsonb_jsonpath.sql b/src/test/regress/sql/jsonb_jsonpath.sql
index 733fbd4e0d0..e7629fb7f9d 100644
--- a/src/test/regress/sql/jsonb_jsonpath.sql
+++ b/src/test/regress/sql/jsonb_jsonpath.sql
@@ -387,3 +387,19 @@ SELECT jsonb_path_match('[true, true]', '$[*]', silent => false);
 SELECT jsonb '[{"a": 1}, {"a": 2}]' @@ '$[*].a > 1';
 SELECT jsonb '[{"a": 1}, {"a": 2}]' @@ '$[*].a > 2';
 SELECT jsonb_path_match('[{"a": 1}, {"a": 2}]', '$[*].a > 1');
+
+-- test string comparison (Unicode codepoint collation)
+WITH str(j, num) AS
+(
+	SELECT jsonb_build_object('s', s), num
+	FROM unnest('{"", "a", "ab", "abc", "abcd", "b", "A", "AB", "ABC", "ABc", "ABcD", "B"}'::text[]) WITH ORDINALITY AS a(s, num)
+)
+SELECT
+	s1.j, s2.j,
+	jsonb_path_query_first(s1.j, '$.s < $s', vars => s2.j) lt,
+	jsonb_path_query_first(s1.j, '$.s <= $s', vars => s2.j) le,
+	jsonb_path_query_first(s1.j, '$.s == $s', vars => s2.j) eq,
+	jsonb_path_query_first(s1.j, '$.s >= $s', vars => s2.j) ge,
+	jsonb_path_query_first(s1.j, '$.s > $s', vars => s2.j) gt
+FROM str s1, str s2
+ORDER BY s1.num, s2.num;

Alexander Korotkov

a.korotkov@postgrespro.ru

over 6 years ago

In reply to: Alexander Korotkov (#3)

1 attachment(s)

Re: SQL/JSON path: collation for comparisons, minor typos in docs

On Thu, Aug 8, 2019 at 12:55 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Wed, Aug 7, 2019 at 4:11 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Wed, Aug 7, 2019 at 2:25 PM Markus Winand <markus.winand@winand.at> wrote:

I was playing around with JSON path quite a bit and might have found one case where the current implementation doesn’t follow the standard.

The functionality in question are the comparison operators except ==. They use the database default collation rather then the standard-mandated "Unicode codepoint collation” (SQL-2:2016 9.39 General Rule 12 c iii 2 D, last sentence in first paragraph).

Thank you for pointing! Nikita is about to write a patch fixing that.

Please, see the attached patch.

Our idea is to not sacrifice "==" operator performance for standard
conformance. So, "==" remains per-byte comparison. For consistency
in other operators we compare code points first, then do per-byte
comparison. In some edge cases, when same Unicode codepoints have
different binary representations in database encoding, this behavior
diverges standard. In future we can implement strict standard
conformance by normalization of input JSON strings.

Previous version of patch has buggy implementation of
compareStrings(). Revised version is attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

0001-Use-Unicode-codepoint-collation-in-jsonpath-3.patchapplication/octet-stream; name=0001-Use-Unicode-codepoint-collation-in-jsonpath-3.patchDownload

commit 014a4480b2863275429cea3b0b5fbc816342b9ce
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date:   Wed Aug 7 23:41:58 2019 +0300

    Adjust string comparison in jsonpath
    
    We have implemented jsonpath string comparison using default database locale.
    However, standard requires us to compare Unicode codepoints.  This commit
    implements that, but for performance reasons we still use per-byte comparison
    for "==" operator.  Thus, for consistency other comparison operators do per-byte
    comparison if Unicode codepoints appear to be equal.
    
    In some edge cases, when same Unicode codepoints have different binary
    representations in database encoding, we diverge standard to achieve better
    performance of "==" operator.  In future to implement strict standard
    conformance, we can do normalization of input JSON strings.
    
    Original patch was written by Nikita Glukhov, rewritten by me.
    
    Reported-by: Markus Winand
    Discussion: https://postgr.es/m/8B7FA3B4-328D-43D7-95A8-37B8891B8C78%40winand.at
    Author: Nikita Glukhov, Alexander Korotkov
    Backpatch-through: 12

diff --git a/src/backend/utils/adt/jsonpath_exec.c b/src/backend/utils/adt/jsonpath_exec.c
index 293d6da027c..814b34c575c 100644
--- a/src/backend/utils/adt/jsonpath_exec.c
+++ b/src/backend/utils/adt/jsonpath_exec.c
@@ -1980,6 +1980,115 @@ executeComparison(JsonPathItem *cmp, JsonbValue *lv, JsonbValue *rv, void *p)
 	return compareItems(cmp->type, lv, rv);
 }
 
+/*
+ * Perform per-byte comparison of two strings.
+ */
+static int
+binaryCompareStrings(const char *s1, int len1,
+					 const char *s2, int len2)
+{
+	int			cmp;
+
+	cmp = memcmp(s1, s2, Min(len1, len2));
+
+	if (cmp != 0)
+		return cmp;
+
+	if (len1 == len2)
+		return 0;
+
+	return len1 < len2 ? -1 : 1;
+}
+
+/*
+ * Compare two strings in the current server encoding using Unicode codepoint
+ * collation.
+ */
+static int
+compareStrings(const char *mbstr1, int mblen1,
+			   const char *mbstr2, int mblen2)
+{
+	if (GetDatabaseEncoding() == PG_SQL_ASCII)
+	{
+		/*
+		 * ASCII encoding: simple case, just do binary comparison.
+		 */
+		return binaryCompareStrings(mbstr1, mblen1, mbstr2, mblen2);
+	}
+	else
+	{
+		/*
+		 * In order to get Unicode codepoints, we have to convert to utf8
+		 * first.  If server encoding is utf8, pg_server_to_any() will just
+		 * return original pointers.
+		 */
+		char	   *utf8str1 = pg_server_to_any(mbstr1, mblen1, PG_UTF8),
+				   *utf8str2 = pg_server_to_any(mbstr2, mblen2, PG_UTF8);
+		unsigned char *utf8c1,
+				   *utf8c2,
+				   *utf8end1,
+				   *utf8end2;
+		pg_wchar	uchar1,
+					uchar2;
+		int			cmp = 0;
+
+		utf8c1 = (unsigned char *) utf8str1;
+		utf8c2 = (unsigned char *) utf8str2;
+		if (utf8str1 == mbstr1)
+			utf8end1 = utf8c1 + mblen1;
+		else
+			utf8end1 = utf8c1 + strlen(utf8str1);
+
+		if (utf8str2 == mbstr2)
+			utf8end2 = utf8c2 + mblen2;
+		else
+			utf8end2 = utf8c2 + strlen(utf8str2);
+
+		/*
+		 * Loop over utf8 characters comparing corresponding Unicode
+		 * codepoints.
+		 */
+		while (utf8c1 < utf8end1 && utf8c2 < utf8end2)
+		{
+			uchar1 = utf8_to_unicode(utf8c1);
+			uchar2 = utf8_to_unicode(utf8c2);
+
+			if (uchar1 != uchar2)
+			{
+				cmp = uchar1 < uchar2 ? -1 : 1;
+				break;
+			}
+
+			utf8c1 += pg_utf_mblen(utf8c1);
+			utf8c2 += pg_utf_mblen(utf8c2);
+		}
+
+		/* If one string is prefix on another, shorter is lesser */
+		if (cmp == 0 && (utf8c1 < utf8end1 || utf8c2 < utf8end2))
+			cmp = utf8c1 == utf8end1 ? -1 : 1;
+
+		/* Free allocated strings if any */
+		if (utf8str1 != mbstr1)
+			pfree(utf8str1);
+		if (utf8str2 != mbstr2)
+			pfree(utf8str2);
+
+		/*
+		 * When all Unicode codepoints are equal, return result of binary
+		 * comparison.  In some edge cases, same characters may have different
+		 * representations in encoding.  Then our behavior could diverge from
+		 * standard.  However, that allow us to do simple binary comparison
+		 * for "==" operator, which is performance critical in typical cases.
+		 * In future to implement strict standard conformance, we can do
+		 * normalization of input JSON strings.
+		 */
+		if (cmp == 0)
+			return binaryCompareStrings(mbstr1, mblen1, mbstr2, mblen2);
+		else
+			return cmp;
+	}
+}
+
 /*
  * Compare two SQL/JSON items using comparison operation 'op'.
  */
@@ -2022,9 +2131,8 @@ compareItems(int32 op, JsonbValue *jb1, JsonbValue *jb2)
 						   jb2->val.string.val,
 						   jb1->val.string.len) ? jpbFalse : jpbTrue;
 
-			cmp = varstr_cmp(jb1->val.string.val, jb1->val.string.len,
-							 jb2->val.string.val, jb2->val.string.len,
-							 DEFAULT_COLLATION_OID);
+			cmp = compareStrings(jb1->val.string.val, jb1->val.string.len,
+								 jb2->val.string.val, jb2->val.string.len);
 			break;
 
 		case jbvBinary:
diff --git a/src/test/regress/expected/jsonb_jsonpath.out b/src/test/regress/expected/jsonb_jsonpath.out
index 31a871af028..0202667a1f7 100644
--- a/src/test/regress/expected/jsonb_jsonpath.out
+++ b/src/test/regress/expected/jsonb_jsonpath.out
@@ -1833,3 +1833,166 @@ SELECT jsonb_path_match('[{"a": 1}, {"a": 2}]', '$[*].a > 1');
  t
 (1 row)
 
+-- test string comparison (Unicode codepoint collation)
+WITH str(j, num) AS
+(
+	SELECT jsonb_build_object('s', s), num
+	FROM unnest('{"", "a", "ab", "abc", "abcd", "b", "A", "AB", "ABC", "ABc", "ABcD", "B"}'::text[]) WITH ORDINALITY AS a(s, num)
+)
+SELECT
+	s1.j, s2.j,
+	jsonb_path_query_first(s1.j, '$.s < $s', vars => s2.j) lt,
+	jsonb_path_query_first(s1.j, '$.s <= $s', vars => s2.j) le,
+	jsonb_path_query_first(s1.j, '$.s == $s', vars => s2.j) eq,
+	jsonb_path_query_first(s1.j, '$.s >= $s', vars => s2.j) ge,
+	jsonb_path_query_first(s1.j, '$.s > $s', vars => s2.j) gt
+FROM str s1, str s2
+ORDER BY s1.num, s2.num;
+       j       |       j       |  lt   |  le   |  eq   |  ge   |  gt   
+---------------+---------------+-------+-------+-------+-------+-------
+ {"s": ""}     | {"s": ""}     | false | true  | true  | true  | false
+ {"s": ""}     | {"s": "a"}    | true  | true  | false | false | false
+ {"s": ""}     | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": ""}     | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": ""}     | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": ""}     | {"s": "b"}    | true  | true  | false | false | false
+ {"s": ""}     | {"s": "A"}    | true  | true  | false | false | false
+ {"s": ""}     | {"s": "AB"}   | true  | true  | false | false | false
+ {"s": ""}     | {"s": "ABC"}  | true  | true  | false | false | false
+ {"s": ""}     | {"s": "ABc"}  | true  | true  | false | false | false
+ {"s": ""}     | {"s": "ABcD"} | true  | true  | false | false | false
+ {"s": ""}     | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "a"}    | {"s": ""}     | false | false | false | true  | true
+ {"s": "a"}    | {"s": "a"}    | false | true  | true  | true  | false
+ {"s": "a"}    | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "a"}    | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "a"}    | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "a"}    | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "a"}    | {"s": "A"}    | false | false | false | true  | true
+ {"s": "a"}    | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "a"}    | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "a"}    | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "a"}    | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "a"}    | {"s": "B"}    | false | false | false | true  | true
+ {"s": "ab"}   | {"s": ""}     | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "a"}    | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "ab"}   | false | true  | true  | true  | false
+ {"s": "ab"}   | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "ab"}   | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "ab"}   | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "ab"}   | {"s": "A"}    | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "B"}    | false | false | false | true  | true
+ {"s": "abc"}  | {"s": ""}     | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "a"}    | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "ab"}   | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "abc"}  | false | true  | true  | true  | false
+ {"s": "abc"}  | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "abc"}  | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "abc"}  | {"s": "A"}    | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "B"}    | false | false | false | true  | true
+ {"s": "abcd"} | {"s": ""}     | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "a"}    | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "ab"}   | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "abc"}  | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "abcd"} | false | true  | true  | true  | false
+ {"s": "abcd"} | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "abcd"} | {"s": "A"}    | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "B"}    | false | false | false | true  | true
+ {"s": "b"}    | {"s": ""}     | false | false | false | true  | true
+ {"s": "b"}    | {"s": "a"}    | false | false | false | true  | true
+ {"s": "b"}    | {"s": "ab"}   | false | false | false | true  | true
+ {"s": "b"}    | {"s": "abc"}  | false | false | false | true  | true
+ {"s": "b"}    | {"s": "abcd"} | false | false | false | true  | true
+ {"s": "b"}    | {"s": "b"}    | false | true  | true  | true  | false
+ {"s": "b"}    | {"s": "A"}    | false | false | false | true  | true
+ {"s": "b"}    | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "b"}    | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "b"}    | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "b"}    | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "b"}    | {"s": "B"}    | false | false | false | true  | true
+ {"s": "A"}    | {"s": ""}     | false | false | false | true  | true
+ {"s": "A"}    | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "A"}    | false | true  | true  | true  | false
+ {"s": "A"}    | {"s": "AB"}   | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "ABC"}  | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "ABc"}  | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "ABcD"} | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": ""}     | false | false | false | true  | true
+ {"s": "AB"}   | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "A"}    | false | false | false | true  | true
+ {"s": "AB"}   | {"s": "AB"}   | false | true  | true  | true  | false
+ {"s": "AB"}   | {"s": "ABC"}  | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "ABc"}  | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "ABcD"} | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": ""}     | false | false | false | true  | true
+ {"s": "ABC"}  | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "A"}    | false | false | false | true  | true
+ {"s": "ABC"}  | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "ABC"}  | {"s": "ABC"}  | false | true  | true  | true  | false
+ {"s": "ABC"}  | {"s": "ABc"}  | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "ABcD"} | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": ""}     | false | false | false | true  | true
+ {"s": "ABc"}  | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "A"}    | false | false | false | true  | true
+ {"s": "ABc"}  | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "ABc"}  | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "ABc"}  | {"s": "ABc"}  | false | true  | true  | true  | false
+ {"s": "ABc"}  | {"s": "ABcD"} | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": ""}     | false | false | false | true  | true
+ {"s": "ABcD"} | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": "A"}    | false | false | false | true  | true
+ {"s": "ABcD"} | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "ABcD"} | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "ABcD"} | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "ABcD"} | {"s": "ABcD"} | false | true  | true  | true  | false
+ {"s": "ABcD"} | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "B"}    | {"s": ""}     | false | false | false | true  | true
+ {"s": "B"}    | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "B"}    | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "B"}    | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "B"}    | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "B"}    | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "B"}    | {"s": "A"}    | false | false | false | true  | true
+ {"s": "B"}    | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "B"}    | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "B"}    | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "B"}    | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "B"}    | {"s": "B"}    | false | true  | true  | true  | false
+(144 rows)
+
diff --git a/src/test/regress/sql/jsonb_jsonpath.sql b/src/test/regress/sql/jsonb_jsonpath.sql
index 733fbd4e0d0..e7629fb7f9d 100644
--- a/src/test/regress/sql/jsonb_jsonpath.sql
+++ b/src/test/regress/sql/jsonb_jsonpath.sql
@@ -387,3 +387,19 @@ SELECT jsonb_path_match('[true, true]', '$[*]', silent => false);
 SELECT jsonb '[{"a": 1}, {"a": 2}]' @@ '$[*].a > 1';
 SELECT jsonb '[{"a": 1}, {"a": 2}]' @@ '$[*].a > 2';
 SELECT jsonb_path_match('[{"a": 1}, {"a": 2}]', '$[*].a > 1');
+
+-- test string comparison (Unicode codepoint collation)
+WITH str(j, num) AS
+(
+	SELECT jsonb_build_object('s', s), num
+	FROM unnest('{"", "a", "ab", "abc", "abcd", "b", "A", "AB", "ABC", "ABc", "ABcD", "B"}'::text[]) WITH ORDINALITY AS a(s, num)
+)
+SELECT
+	s1.j, s2.j,
+	jsonb_path_query_first(s1.j, '$.s < $s', vars => s2.j) lt,
+	jsonb_path_query_first(s1.j, '$.s <= $s', vars => s2.j) le,
+	jsonb_path_query_first(s1.j, '$.s == $s', vars => s2.j) eq,
+	jsonb_path_query_first(s1.j, '$.s >= $s', vars => s2.j) ge,
+	jsonb_path_query_first(s1.j, '$.s > $s', vars => s2.j) gt
+FROM str s1, str s2
+ORDER BY s1.num, s2.num;

Alexander Korotkov

a.korotkov@postgrespro.ru

over 6 years ago

In reply to: Alexander Korotkov (#4)

1 attachment(s)

Re: SQL/JSON path: collation for comparisons, minor typos in docs

On Thu, Aug 8, 2019 at 3:05 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Thu, Aug 8, 2019 at 12:55 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Wed, Aug 7, 2019 at 4:11 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Wed, Aug 7, 2019 at 2:25 PM Markus Winand <markus.winand@winand.at> wrote:

I was playing around with JSON path quite a bit and might have found one case where the current implementation doesn’t follow the standard.

The functionality in question are the comparison operators except ==. They use the database default collation rather then the standard-mandated "Unicode codepoint collation” (SQL-2:2016 9.39 General Rule 12 c iii 2 D, last sentence in first paragraph).

Thank you for pointing! Nikita is about to write a patch fixing that.

Please, see the attached patch.

Our idea is to not sacrifice "==" operator performance for standard
conformance. So, "==" remains per-byte comparison. For consistency
in other operators we compare code points first, then do per-byte
comparison. In some edge cases, when same Unicode codepoints have
different binary representations in database encoding, this behavior
diverges standard. In future we can implement strict standard
conformance by normalization of input JSON strings.

Previous version of patch has buggy implementation of
compareStrings(). Revised version is attached.

Nikita pointed me that for UTF-8 strings per-byte comparison result
matches codepoints comparison result. That allows simplify patch a
lot.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

0001-Use-Unicode-codepoint-collation-in-jsonpath-4.patchapplication/octet-stream; name=0001-Use-Unicode-codepoint-collation-in-jsonpath-4.patchDownload

commit 78d0188fcf4c01fa824e6f09a8026629eaa725ef
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date:   Wed Aug 7 23:41:58 2019 +0300

    Adjust string comparison in jsonpath
    
    We have implemented jsonpath string comparison using default database locale.
    However, standard requires us to compare Unicode codepoints.  This commit
    implements that, but for performance reasons we still use per-byte comparison
    for "==" operator.  Thus, for consistency other comparison operators do per-byte
    comparison if Unicode codepoints appear to be equal.
    
    In some edge cases, when same Unicode codepoints have different binary
    representations in database encoding, we diverge standard to achieve better
    performance of "==" operator.  In future to implement strict standard
    conformance, we can do normalization of input JSON strings.
    
    Original patch was written by Nikita Glukhov, rewritten by me.
    
    Reported-by: Markus Winand
    Discussion: https://postgr.es/m/8B7FA3B4-328D-43D7-95A8-37B8891B8C78%40winand.at
    Author: Nikita Glukhov, Alexander Korotkov
    Backpatch-through: 12

diff --git a/src/backend/utils/adt/jsonpath_exec.c b/src/backend/utils/adt/jsonpath_exec.c
index 293d6da027c..4a0de4fe95a 100644
--- a/src/backend/utils/adt/jsonpath_exec.c
+++ b/src/backend/utils/adt/jsonpath_exec.c
@@ -1980,6 +1980,73 @@ executeComparison(JsonPathItem *cmp, JsonbValue *lv, JsonbValue *rv, void *p)
 	return compareItems(cmp->type, lv, rv);
 }
 
+/*
+ * Perform per-byte comparison of two strings.
+ */
+static int
+binaryCompareStrings(const char *s1, int len1,
+					 const char *s2, int len2)
+{
+	int			cmp;
+
+	cmp = memcmp(s1, s2, Min(len1, len2));
+
+	if (cmp != 0)
+		return cmp;
+
+	if (len1 == len2)
+		return 0;
+
+	return len1 < len2 ? -1 : 1;
+}
+
+/*
+ * Compare two strings in the current server encoding using Unicode codepoint
+ * collation.
+ */
+static int
+compareStrings(const char *mbstr1, int mblen1,
+			   const char *mbstr2, int mblen2)
+{
+	if (GetDatabaseEncoding() == PG_SQL_ASCII ||
+		GetDatabaseEncoding() == PG_UTF8)
+	{
+		/*
+		 * It's known property of UTF-8 strings that their per-byte comparison
+		 * result matches codepoints comparison result.  ASCII can be
+		 * considered as special case of UTF-8.
+		 */
+		return binaryCompareStrings(mbstr1, mblen1, mbstr2, mblen2);
+	}
+	else
+	{
+		/* We have to convert other encodings to UTF-8 first, then compare. */
+		char	   *utf8str1 = pg_server_to_any(mbstr1, mblen1, PG_UTF8),
+				   *utf8str2 = pg_server_to_any(mbstr2, mblen2, PG_UTF8);
+		int			cmp;
+
+		cmp = binaryCompareStrings(utf8str1, strlen(utf8str1),
+								   utf8str2, strlen(utf8str2));
+
+		pfree(utf8str1);
+		pfree(utf8str2);
+
+		/*
+		 * When all Unicode codepoints are equal, return result of binary
+		 * comparison.  In some edge cases, same characters may have different
+		 * representations in encoding.  Then our behavior could diverge from
+		 * standard.  However, that allow us to do simple binary comparison
+		 * for "==" operator, which is performance critical in typical cases.
+		 * In future to implement strict standard conformance, we can do
+		 * normalization of input JSON strings.
+		 */
+		if (cmp == 0)
+			return binaryCompareStrings(mbstr1, mblen1, mbstr2, mblen2);
+		else
+			return cmp;
+	}
+}
+
 /*
  * Compare two SQL/JSON items using comparison operation 'op'.
  */
@@ -2022,9 +2089,8 @@ compareItems(int32 op, JsonbValue *jb1, JsonbValue *jb2)
 						   jb2->val.string.val,
 						   jb1->val.string.len) ? jpbFalse : jpbTrue;
 
-			cmp = varstr_cmp(jb1->val.string.val, jb1->val.string.len,
-							 jb2->val.string.val, jb2->val.string.len,
-							 DEFAULT_COLLATION_OID);
+			cmp = compareStrings(jb1->val.string.val, jb1->val.string.len,
+								 jb2->val.string.val, jb2->val.string.len);
 			break;
 
 		case jbvBinary:
diff --git a/src/test/regress/expected/jsonb_jsonpath.out b/src/test/regress/expected/jsonb_jsonpath.out
index 31a871af028..0202667a1f7 100644
--- a/src/test/regress/expected/jsonb_jsonpath.out
+++ b/src/test/regress/expected/jsonb_jsonpath.out
@@ -1833,3 +1833,166 @@ SELECT jsonb_path_match('[{"a": 1}, {"a": 2}]', '$[*].a > 1');
  t
 (1 row)
 
+-- test string comparison (Unicode codepoint collation)
+WITH str(j, num) AS
+(
+	SELECT jsonb_build_object('s', s), num
+	FROM unnest('{"", "a", "ab", "abc", "abcd", "b", "A", "AB", "ABC", "ABc", "ABcD", "B"}'::text[]) WITH ORDINALITY AS a(s, num)
+)
+SELECT
+	s1.j, s2.j,
+	jsonb_path_query_first(s1.j, '$.s < $s', vars => s2.j) lt,
+	jsonb_path_query_first(s1.j, '$.s <= $s', vars => s2.j) le,
+	jsonb_path_query_first(s1.j, '$.s == $s', vars => s2.j) eq,
+	jsonb_path_query_first(s1.j, '$.s >= $s', vars => s2.j) ge,
+	jsonb_path_query_first(s1.j, '$.s > $s', vars => s2.j) gt
+FROM str s1, str s2
+ORDER BY s1.num, s2.num;
+       j       |       j       |  lt   |  le   |  eq   |  ge   |  gt   
+---------------+---------------+-------+-------+-------+-------+-------
+ {"s": ""}     | {"s": ""}     | false | true  | true  | true  | false
+ {"s": ""}     | {"s": "a"}    | true  | true  | false | false | false
+ {"s": ""}     | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": ""}     | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": ""}     | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": ""}     | {"s": "b"}    | true  | true  | false | false | false
+ {"s": ""}     | {"s": "A"}    | true  | true  | false | false | false
+ {"s": ""}     | {"s": "AB"}   | true  | true  | false | false | false
+ {"s": ""}     | {"s": "ABC"}  | true  | true  | false | false | false
+ {"s": ""}     | {"s": "ABc"}  | true  | true  | false | false | false
+ {"s": ""}     | {"s": "ABcD"} | true  | true  | false | false | false
+ {"s": ""}     | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "a"}    | {"s": ""}     | false | false | false | true  | true
+ {"s": "a"}    | {"s": "a"}    | false | true  | true  | true  | false
+ {"s": "a"}    | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "a"}    | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "a"}    | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "a"}    | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "a"}    | {"s": "A"}    | false | false | false | true  | true
+ {"s": "a"}    | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "a"}    | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "a"}    | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "a"}    | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "a"}    | {"s": "B"}    | false | false | false | true  | true
+ {"s": "ab"}   | {"s": ""}     | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "a"}    | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "ab"}   | false | true  | true  | true  | false
+ {"s": "ab"}   | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "ab"}   | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "ab"}   | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "ab"}   | {"s": "A"}    | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "ab"}   | {"s": "B"}    | false | false | false | true  | true
+ {"s": "abc"}  | {"s": ""}     | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "a"}    | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "ab"}   | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "abc"}  | false | true  | true  | true  | false
+ {"s": "abc"}  | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "abc"}  | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "abc"}  | {"s": "A"}    | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "abc"}  | {"s": "B"}    | false | false | false | true  | true
+ {"s": "abcd"} | {"s": ""}     | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "a"}    | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "ab"}   | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "abc"}  | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "abcd"} | false | true  | true  | true  | false
+ {"s": "abcd"} | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "abcd"} | {"s": "A"}    | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "abcd"} | {"s": "B"}    | false | false | false | true  | true
+ {"s": "b"}    | {"s": ""}     | false | false | false | true  | true
+ {"s": "b"}    | {"s": "a"}    | false | false | false | true  | true
+ {"s": "b"}    | {"s": "ab"}   | false | false | false | true  | true
+ {"s": "b"}    | {"s": "abc"}  | false | false | false | true  | true
+ {"s": "b"}    | {"s": "abcd"} | false | false | false | true  | true
+ {"s": "b"}    | {"s": "b"}    | false | true  | true  | true  | false
+ {"s": "b"}    | {"s": "A"}    | false | false | false | true  | true
+ {"s": "b"}    | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "b"}    | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "b"}    | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "b"}    | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "b"}    | {"s": "B"}    | false | false | false | true  | true
+ {"s": "A"}    | {"s": ""}     | false | false | false | true  | true
+ {"s": "A"}    | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "A"}    | false | true  | true  | true  | false
+ {"s": "A"}    | {"s": "AB"}   | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "ABC"}  | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "ABc"}  | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "ABcD"} | true  | true  | false | false | false
+ {"s": "A"}    | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": ""}     | false | false | false | true  | true
+ {"s": "AB"}   | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "A"}    | false | false | false | true  | true
+ {"s": "AB"}   | {"s": "AB"}   | false | true  | true  | true  | false
+ {"s": "AB"}   | {"s": "ABC"}  | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "ABc"}  | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "ABcD"} | true  | true  | false | false | false
+ {"s": "AB"}   | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": ""}     | false | false | false | true  | true
+ {"s": "ABC"}  | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "A"}    | false | false | false | true  | true
+ {"s": "ABC"}  | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "ABC"}  | {"s": "ABC"}  | false | true  | true  | true  | false
+ {"s": "ABC"}  | {"s": "ABc"}  | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "ABcD"} | true  | true  | false | false | false
+ {"s": "ABC"}  | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": ""}     | false | false | false | true  | true
+ {"s": "ABc"}  | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "A"}    | false | false | false | true  | true
+ {"s": "ABc"}  | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "ABc"}  | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "ABc"}  | {"s": "ABc"}  | false | true  | true  | true  | false
+ {"s": "ABc"}  | {"s": "ABcD"} | true  | true  | false | false | false
+ {"s": "ABc"}  | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": ""}     | false | false | false | true  | true
+ {"s": "ABcD"} | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "ABcD"} | {"s": "A"}    | false | false | false | true  | true
+ {"s": "ABcD"} | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "ABcD"} | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "ABcD"} | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "ABcD"} | {"s": "ABcD"} | false | true  | true  | true  | false
+ {"s": "ABcD"} | {"s": "B"}    | true  | true  | false | false | false
+ {"s": "B"}    | {"s": ""}     | false | false | false | true  | true
+ {"s": "B"}    | {"s": "a"}    | true  | true  | false | false | false
+ {"s": "B"}    | {"s": "ab"}   | true  | true  | false | false | false
+ {"s": "B"}    | {"s": "abc"}  | true  | true  | false | false | false
+ {"s": "B"}    | {"s": "abcd"} | true  | true  | false | false | false
+ {"s": "B"}    | {"s": "b"}    | true  | true  | false | false | false
+ {"s": "B"}    | {"s": "A"}    | false | false | false | true  | true
+ {"s": "B"}    | {"s": "AB"}   | false | false | false | true  | true
+ {"s": "B"}    | {"s": "ABC"}  | false | false | false | true  | true
+ {"s": "B"}    | {"s": "ABc"}  | false | false | false | true  | true
+ {"s": "B"}    | {"s": "ABcD"} | false | false | false | true  | true
+ {"s": "B"}    | {"s": "B"}    | false | true  | true  | true  | false
+(144 rows)
+
diff --git a/src/test/regress/sql/jsonb_jsonpath.sql b/src/test/regress/sql/jsonb_jsonpath.sql
index 733fbd4e0d0..e7629fb7f9d 100644
--- a/src/test/regress/sql/jsonb_jsonpath.sql
+++ b/src/test/regress/sql/jsonb_jsonpath.sql
@@ -387,3 +387,19 @@ SELECT jsonb_path_match('[true, true]', '$[*]', silent => false);
 SELECT jsonb '[{"a": 1}, {"a": 2}]' @@ '$[*].a > 1';
 SELECT jsonb '[{"a": 1}, {"a": 2}]' @@ '$[*].a > 2';
 SELECT jsonb_path_match('[{"a": 1}, {"a": 2}]', '$[*].a > 1');
+
+-- test string comparison (Unicode codepoint collation)
+WITH str(j, num) AS
+(
+	SELECT jsonb_build_object('s', s), num
+	FROM unnest('{"", "a", "ab", "abc", "abcd", "b", "A", "AB", "ABC", "ABc", "ABcD", "B"}'::text[]) WITH ORDINALITY AS a(s, num)
+)
+SELECT
+	s1.j, s2.j,
+	jsonb_path_query_first(s1.j, '$.s < $s', vars => s2.j) lt,
+	jsonb_path_query_first(s1.j, '$.s <= $s', vars => s2.j) le,
+	jsonb_path_query_first(s1.j, '$.s == $s', vars => s2.j) eq,
+	jsonb_path_query_first(s1.j, '$.s >= $s', vars => s2.j) ge,
+	jsonb_path_query_first(s1.j, '$.s > $s', vars => s2.j) gt
+FROM str s1, str s2
+ORDER BY s1.num, s2.num;

Markus Winand

markus.winand@winand.at

over 6 years ago

In reply to: Alexander Korotkov (#5)

Re: SQL/JSON path: collation for comparisons, minor typos in docs

Hi!

The patch makes my tests pass.

I wonder about a few things:

- Isn’t there any code that could be re-used for that (the one triggered by ‘a’ < ‘A’ COLLATE ucs_basic)?

- For object key members, the standard also refers to unicode code point collation (SQL-2:2016 4.46.3, last paragraph).
- I guess it also applies to the “starts with” predicate, but I cannot find this explicitly stated in the standard.

My tests check whether those cases do case-sensitive comparisons. With my default collation "en_US.UTF-8” I cannot discover potential issues there. I haven’t played around with nondeterministic ICU collations yet :(

-markus
ps.: for me, testing the regular expression dialect of like_regex is out of scope

Show quoted text

On 8 Aug 2019, at 02:27, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Thu, Aug 8, 2019 at 3:05 AM Alexander Korotkov
<a.korotkov@postgrespro.ru <mailto:a.korotkov@postgrespro.ru>> wrote:

On Thu, Aug 8, 2019 at 12:55 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Wed, Aug 7, 2019 at 4:11 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Wed, Aug 7, 2019 at 2:25 PM Markus Winand <markus.winand@winand.at> wrote:

I was playing around with JSON path quite a bit and might have found one case where the current implementation doesn’t follow the standard.

The functionality in question are the comparison operators except ==. They use the database default collation rather then the standard-mandated "Unicode codepoint collation” (SQL-2:2016 9.39 General Rule 12 c iii 2 D, last sentence in first paragraph).

Thank you for pointing! Nikita is about to write a patch fixing that.

Please, see the attached patch.

Our idea is to not sacrifice "==" operator performance for standard
conformance. So, "==" remains per-byte comparison. For consistency
in other operators we compare code points first, then do per-byte
comparison. In some edge cases, when same Unicode codepoints have
different binary representations in database encoding, this behavior
diverges standard. In future we can implement strict standard
conformance by normalization of input JSON strings.

Previous version of patch has buggy implementation of
compareStrings(). Revised version is attached.

Nikita pointed me that for UTF-8 strings per-byte comparison result
matches codepoints comparison result. That allows simplify patch a
lot.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com <http://www.postgrespro.com/>
The Russian Postgres Company
<0001-Use-Unicode-codepoint-collation-in-jsonpath-4.patch>

Alexander Korotkov

a.korotkov@postgrespro.ru

over 6 years ago

In reply to: Markus Winand (#6)

Re: SQL/JSON path: collation for comparisons, minor typos in docs

Hi, Markus!

On Thu, Aug 8, 2019 at 11:53 AM Markus Winand <markus.winand@winand.at> wrote:

The patch makes my tests pass.

Cool.

I wonder about a few things:

- Isn’t there any code that could be re-used for that (the one triggered by ‘a’ < ‘A’ COLLATE ucs_basic)?

PostgreSQL supports ucs_basic, but it's alias to C collation and works
only for utf-8. Jsonpath code may work in different encodings. New
string comparison code can work in different encodings.

- For object key members, the standard also refers to unicode code point collation (SQL-2:2016 4.46.3, last paragraph).
- I guess it also applies to the “starts with” predicate, but I cannot find this explicitly stated in the standard.

For object keys we don't actually care about whether strings are less
or greater. We only search for equal keys. So, per-byte comparison
we currently use should be fine. The same states for "starts with"
predicate.

My tests check whether those cases do case-sensitive comparisons. With my default collation "en_US.UTF-8” I cannot discover potential issues there. I haven’t played around with nondeterministic ICU collations yet :(

That's OK. There should be other beta testers around :)

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Alexander Korotkov

a.korotkov@postgrespro.ru

over 6 years ago

In reply to: Alexander Korotkov (#7)

Re: SQL/JSON path: collation for comparisons, minor typos in docs

On Thu, Aug 8, 2019 at 11:30 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Thu, Aug 8, 2019 at 11:53 AM Markus Winand <markus.winand@winand.at> wrote:

The patch makes my tests pass.

Cool.

I wonder about a few things:

- Isn’t there any code that could be re-used for that (the one triggered by ‘a’ < ‘A’ COLLATE ucs_basic)?

PostgreSQL supports ucs_basic, but it's alias to C collation and works
only for utf-8. Jsonpath code may work in different encodings. New
string comparison code can work in different encodings.

- For object key members, the standard also refers to unicode code point collation (SQL-2:2016 4.46.3, last paragraph).
- I guess it also applies to the “starts with” predicate, but I cannot find this explicitly stated in the standard.

For object keys we don't actually care about whether strings are less
or greater. We only search for equal keys. So, per-byte comparison
we currently use should be fine. The same states for "starts with"
predicate.

My tests check whether those cases do case-sensitive comparisons. With my default collation "en_US.UTF-8” I cannot discover potential issues there. I haven’t played around with nondeterministic ICU collations yet :(

That's OK. There should be other beta testers around :)

So, I'm going to push this if no objections.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Alexander Korotkov

a.korotkov@postgrespro.ru

over 6 years ago

In reply to: Alexander Korotkov (#8)

Re: SQL/JSON path: collation for comparisons, minor typos in docs

On Fri, Aug 9, 2019 at 5:27 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Thu, Aug 8, 2019 at 11:30 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Thu, Aug 8, 2019 at 11:53 AM Markus Winand <markus.winand@winand.at> wrote:

The patch makes my tests pass.

Cool.

I wonder about a few things:

- Isn’t there any code that could be re-used for that (the one triggered by ‘a’ < ‘A’ COLLATE ucs_basic)?

PostgreSQL supports ucs_basic, but it's alias to C collation and works
only for utf-8. Jsonpath code may work in different encodings. New
string comparison code can work in different encodings.

- For object key members, the standard also refers to unicode code point collation (SQL-2:2016 4.46.3, last paragraph).
- I guess it also applies to the “starts with” predicate, but I cannot find this explicitly stated in the standard.

For object keys we don't actually care about whether strings are less
or greater. We only search for equal keys. So, per-byte comparison
we currently use should be fine. The same states for "starts with"
predicate.

My tests check whether those cases do case-sensitive comparisons. With my default collation "en_US.UTF-8” I cannot discover potential issues there. I haven’t played around with nondeterministic ICU collations yet :(

That's OK. There should be other beta testers around :)

So, I'm going to push this if no objections.

So, pushed.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company