change regexp_substr first argument make tests more easier to understand.

Started by jian heabout 2 years ago6 messages
#1jian he
jian.universality@gmail.com

hi.
https://git.postgresql.org/cgit/postgresql.git/tree/src/test/regress/expected/strings.out#n928

SELECT regexp_substr('abcabcabc', 'a.c');
SELECT regexp_substr('abcabcabc', 'a.c', 2);
SELECT regexp_substr('abcabcabc', 'a.c', 1, 3);
SELECT regexp_substr('abcabcabc', 'a.c', 1, 4) IS NULL AS t;
SELECT regexp_substr('abcabcabc', 'A.C', 1, 2, 'i');

they all return 'abc', there are 3 'abc ' in string 'abcabcabc'
except IS NULL query.
maybe we can change regexp_substr first argument from "abcabcabc" to
"abcaXcaYc".
so the result would be more easier to understand.

#2jian he
jian.universality@gmail.com
In reply to: jian he (#1)
1 attachment(s)
Re: change regexp_substr first argument make tests more easier to understand.

On Thu, Dec 28, 2023 at 12:13 AM jian he <jian.universality@gmail.com> wrote:

hi.
https://git.postgresql.org/cgit/postgresql.git/tree/src/test/regress/expected/strings.out#n928

SELECT regexp_substr('abcabcabc', 'a.c');
SELECT regexp_substr('abcabcabc', 'a.c', 2);
SELECT regexp_substr('abcabcabc', 'a.c', 1, 3);
SELECT regexp_substr('abcabcabc', 'a.c', 1, 4) IS NULL AS t;
SELECT regexp_substr('abcabcabc', 'A.C', 1, 2, 'i');

they all return 'abc', there are 3 'abc ' in string 'abcabcabc'
except IS NULL query.
maybe we can change regexp_substr first argument from "abcabcabc" to
"abcaXcaYc".
so the result would be more easier to understand.

anyway here is the minor patch to change string from "abcabcabc" to
"abcaXcaYc".

Attachments:

v1-0001-make-regex-result-more-easier-to-understand.patchtext/x-patch; charset=US-ASCII; name=v1-0001-make-regex-result-more-easier-to-understand.patchDownload
From ffe63d9607824e6aa89d590761fcb05219b799d8 Mon Sep 17 00:00:00 2001
From: pgaddict <jian.universality@gmail.com>
Date: Thu, 28 Dec 2023 14:11:31 +0800
Subject: [PATCH v1 1/1] make regexp_substr result more easier to understand.

  regexp_substr using "abcabcabc" is not good for test,
  since in many cases, it will return "abc".
  there are 3  "abc" in "abcabcabc", change the test string to something else
  make the results more clear.

---
 src/test/regress/expected/strings.out | 14 +++++++-------
 src/test/regress/sql/strings.sql      |  8 ++++----
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index b7500d9c..7098934a 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -925,22 +925,22 @@ SELECT regexp_substr('abcdefghi', 'd.q') IS NULL AS t;
  t
 (1 row)
 
-SELECT regexp_substr('abcabcabc', 'a.c');
+SELECT regexp_substr('abcaXcaYc', 'a.c');
  regexp_substr 
 ---------------
  abc
 (1 row)
 
-SELECT regexp_substr('abcabcabc', 'a.c', 2);
+SELECT regexp_substr('abcaXcaYc', 'a.c', 2);
  regexp_substr 
 ---------------
- abc
+ aXc
 (1 row)
 
-SELECT regexp_substr('abcabcabc', 'a.c', 1, 3);
+SELECT regexp_substr('abcaXcaYc', 'a.c', 1, 3);
  regexp_substr 
 ---------------
- abc
+ aYc
 (1 row)
 
 SELECT regexp_substr('abcabcabc', 'a.c', 1, 4) IS NULL AS t;
@@ -949,10 +949,10 @@ SELECT regexp_substr('abcabcabc', 'a.c', 1, 4) IS NULL AS t;
  t
 (1 row)
 
-SELECT regexp_substr('abcabcabc', 'A.C', 1, 2, 'i');
+SELECT regexp_substr('abcaXcaYc', 'A.C', 1, 2, 'i');
  regexp_substr 
 ---------------
- abc
+ aXc
 (1 row)
 
 SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 0);
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index 39596789..4ed86516 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -274,11 +274,11 @@ SELECT regexp_instr('abcabcabc', 'a.c', 1, 1, 0, '', -1);
 -- regexp_substr tests
 SELECT regexp_substr('abcdefghi', 'd.f');
 SELECT regexp_substr('abcdefghi', 'd.q') IS NULL AS t;
-SELECT regexp_substr('abcabcabc', 'a.c');
-SELECT regexp_substr('abcabcabc', 'a.c', 2);
-SELECT regexp_substr('abcabcabc', 'a.c', 1, 3);
+SELECT regexp_substr('abcaXcaYc', 'a.c');
+SELECT regexp_substr('abcaXcaYc', 'a.c', 2);
+SELECT regexp_substr('abcaXcaYc', 'a.c', 1, 3);
 SELECT regexp_substr('abcabcabc', 'a.c', 1, 4) IS NULL AS t;
-SELECT regexp_substr('abcabcabc', 'A.C', 1, 2, 'i');
+SELECT regexp_substr('abcaXcaYc', 'A.C', 1, 2, 'i');
 SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 0);
 SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 1);
 SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 2);
-- 
2.34.1

#3Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: jian he (#2)
Re: change regexp_substr first argument make tests more easier to understand.

Hi,

If I understand correctly, the problem is that it's not clear which of
the 'abc' substrings is matched/returned by the function, right?

I wonder if this is a problem only for understanding the test, or if it
makes the tests a bit weaker. I mean, what if the function returns the
wrong substring? How would we know?

Also, if we tweak this, shouldn't we tweak also the regext_instr() calls
a bit earlier in the test script?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4jian he
jian.universality@gmail.com
In reply to: Tomas Vondra (#3)
1 attachment(s)
Re: change regexp_substr first argument make tests more easier to understand.

On Fri, Jul 19, 2024 at 5:49 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Hi,

If I understand correctly, the problem is that it's not clear which of
the 'abc' substrings is matched/returned by the function, right?

I wonder if this is a problem only for understanding the test, or if it
makes the tests a bit weaker. I mean, what if the function returns the
wrong substring? How would we know?

this is for understanding the test.
personally, sometimes, I feel the documentation is too dry, hard to follow.
so i can based on regress tests better understand the documentation.
that was my intention for the changes.

we have more sophisticated regex test at
https://git.postgresql.org/cgit/postgresql.git/tree/src/test/modules/test_regex

Also, if we tweak this, shouldn't we tweak also the regext_instr() calls
a bit earlier in the test script?

sure.
please check attached.

Attachments:

v2-0001-refactor-regex-related-tests.patchtext/x-patch; charset=US-ASCII; name=v2-0001-refactor-regex-related-tests.patchDownload
From 0b9e7a9c32d4e3163ef0cc955a0916b9d5c7f395 Mon Sep 17 00:00:00 2001
From: jian he <jian.universality@gmail.com>
Date: Thu, 25 Jul 2024 22:46:07 +0800
Subject: [PATCH v2 1/1] refactor regex related tests

some tests for regexp_instr, regexp_substr using "abcabcabc" as the source string,
which is not good for testing.

Because in many tests the matched pattern is "abc", but in the source string we have 3 "abc"
sometimes we don't know which "abc '' refers to.
changing the source string so the results would be easier to understand.
---
 src/test/regress/expected/strings.out | 42 +++++++++++++++++++--------
 src/test/regress/sql/strings.sql      | 21 ++++++++------
 2 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 52b69a10..4471f9a2 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -790,31 +790,43 @@ SELECT regexp_instr('abcdefghi', 'd.q');
             0
 (1 row)
 
-SELECT regexp_instr('abcabcabc', 'a.c');
+SELECT regexp_instr('abcaXcaYc', 'a.c');
  regexp_instr 
 --------------
             1
 (1 row)
 
-SELECT regexp_instr('abcabcabc', 'a.c', 2);
+SELECT regexp_instr('abcaXcaYc', 'a.c', 2);
  regexp_instr 
 --------------
             4
 (1 row)
 
-SELECT regexp_instr('abcabcabc', 'a.c', 1, 3);
+SELECT regexp_instr('abcaXcaYc', 'a.c', 1, 3);
  regexp_instr 
 --------------
             7
 (1 row)
 
-SELECT regexp_instr('abcabcabc', 'a.c', 1, 4);
+SELECT regexp_instr('abcaXcaYc', 'a.c', 1, 3, 1);
+ regexp_instr 
+--------------
+           10
+(1 row)
+
+SELECT regexp_instr('abcaXcaYc', 'a.c', 2, 2, 0);
+ regexp_instr 
+--------------
+            7
+(1 row)
+
+SELECT regexp_instr('abcaXcaYc', 'a.c', 1, 4);
  regexp_instr 
 --------------
             0
 (1 row)
 
-SELECT regexp_instr('abcabcabc', 'A.C', 1, 2, 0, 'i');
+SELECT regexp_instr('abcaXcayc', 'A.C', 1, 2, 0, 'i');
  regexp_instr 
 --------------
             4
@@ -925,22 +937,22 @@ SELECT regexp_substr('abcdefghi', 'd.q') IS NULL AS t;
  t
 (1 row)
 
-SELECT regexp_substr('abcabcabc', 'a.c');
+SELECT regexp_substr('abcaXcaYc', 'a.c');
  regexp_substr 
 ---------------
  abc
 (1 row)
 
-SELECT regexp_substr('abcabcabc', 'a.c', 2);
+SELECT regexp_substr('abcaXcaYc', 'a.c', 2);
  regexp_substr 
 ---------------
- abc
+ aXc
 (1 row)
 
-SELECT regexp_substr('abcabcabc', 'a.c', 1, 3);
+SELECT regexp_substr('abcaXcaYc', 'a.c', 1, 3);
  regexp_substr 
 ---------------
- abc
+ aYc
 (1 row)
 
 SELECT regexp_substr('abcabcabc', 'a.c', 1, 4) IS NULL AS t;
@@ -949,10 +961,16 @@ SELECT regexp_substr('abcabcabc', 'a.c', 1, 4) IS NULL AS t;
  t
 (1 row)
 
-SELECT regexp_substr('abcabcabc', 'A.C', 1, 2, 'i');
+SELECT regexp_substr('abcaXcaYc', 'A.C', 1, 2, 'i');
  regexp_substr 
 ---------------
- abc
+ aXc
+(1 row)
+
+SELECT regexp_substr('abcaXcaYc', 'A.C', 2, 2, 'i');
+ regexp_substr 
+---------------
+ aYc
 (1 row)
 
 SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 0);
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index 39596789..d553a98c 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -244,11 +244,13 @@ SELECT regexp_like('abc', 'a.c', 'g');  -- error
 -- regexp_instr tests
 SELECT regexp_instr('abcdefghi', 'd.f');
 SELECT regexp_instr('abcdefghi', 'd.q');
-SELECT regexp_instr('abcabcabc', 'a.c');
-SELECT regexp_instr('abcabcabc', 'a.c', 2);
-SELECT regexp_instr('abcabcabc', 'a.c', 1, 3);
-SELECT regexp_instr('abcabcabc', 'a.c', 1, 4);
-SELECT regexp_instr('abcabcabc', 'A.C', 1, 2, 0, 'i');
+SELECT regexp_instr('abcaXcaYc', 'a.c');
+SELECT regexp_instr('abcaXcaYc', 'a.c', 2);
+SELECT regexp_instr('abcaXcaYc', 'a.c', 1, 3);
+SELECT regexp_instr('abcaXcaYc', 'a.c', 1, 3, 1);
+SELECT regexp_instr('abcaXcaYc', 'a.c', 2, 2, 0);
+SELECT regexp_instr('abcaXcaYc', 'a.c', 1, 4);
+SELECT regexp_instr('abcaXcayc', 'A.C', 1, 2, 0, 'i');
 SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 0);
 SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 1);
 SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 2);
@@ -274,11 +276,12 @@ SELECT regexp_instr('abcabcabc', 'a.c', 1, 1, 0, '', -1);
 -- regexp_substr tests
 SELECT regexp_substr('abcdefghi', 'd.f');
 SELECT regexp_substr('abcdefghi', 'd.q') IS NULL AS t;
-SELECT regexp_substr('abcabcabc', 'a.c');
-SELECT regexp_substr('abcabcabc', 'a.c', 2);
-SELECT regexp_substr('abcabcabc', 'a.c', 1, 3);
+SELECT regexp_substr('abcaXcaYc', 'a.c');
+SELECT regexp_substr('abcaXcaYc', 'a.c', 2);
+SELECT regexp_substr('abcaXcaYc', 'a.c', 1, 3);
 SELECT regexp_substr('abcabcabc', 'a.c', 1, 4) IS NULL AS t;
-SELECT regexp_substr('abcabcabc', 'A.C', 1, 2, 'i');
+SELECT regexp_substr('abcaXcaYc', 'A.C', 1, 2, 'i');
+SELECT regexp_substr('abcaXcaYc', 'A.C', 2, 2, 'i');
 SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 0);
 SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 1);
 SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 2);
-- 
2.34.1

#5Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: jian he (#4)
Re: change regexp_substr first argument make tests more easier to understand.

Hi everybody

Current tests with regexp_instr() and regexp_substr()  with string
'abcabcabc' are really unreadable and you would spend time to understand
that happens in these tests and if they are really correct. I'd better
change them into "abcdefghi" just like in query

    SELECT regexp_substr('abcdefghi', 'd.q') IS NULL AS t;

Regards

Ilia Evdokimov,
Tantor Labs LLC.

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ilia Evdokimov (#5)
Re: change regexp_substr first argument make tests more easier to understand.

Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> writes:

Current tests with regexp_instr() and regexp_substr()  with string
'abcabcabc' are really unreadable and you would spend time to understand
that happens in these tests and if they are really correct. I'd better
change them into "abcdefghi" just like in query

    SELECT regexp_substr('abcdefghi', 'd.q') IS NULL AS t;

On looking more closely at these test cases, I think the point of them
is exactly to show the behavior of the functions with multiple copies
of the target substring. Thus, what Jian is proposing breaks the
tests: it's no longer perfectly clear whether the result is because
the function did what we expect, or because the pattern failed to
match anywhere else. (Sure, "a.c" *should* match "aXc", but if it
didn't, you wouldn't discover that from this test.) What Ilia
proposes would break them worse.

I think we should just reject this patch, or at least reject the
parts of it that change existing test cases. I have no opinion
about whether the new test cases add anything useful.

regards, tom lane