SIMILAR TO expressions translate wildcards where they shouldn't

Started by Laurenz Albe11 months ago11 messagesbugs
Jump to latest
#1Laurenz Albe
laurenz.albe@cybertec.at

The following surprising result

SELECT 'a_b' SIMILAR TO '[_[:alpha:]]*',
'a_b' SIMILAR TO '[[:alpha:]_]*';

?column? │ ?column?
══════════╪══════════
t │ f
(1 row)

becomes clear when we look how the expressions are translated to
regular expressions:

EXPLAIN (VERBOSE, GENERIC_PLAN, COSTS OFF)
SELECT $1 SIMILAR TO '[_[:alpha:]]*',
$1 SIMILAR TO '[[:alpha:]_]*';

QUERY PLAN
══════════════════════════════════════════════════════════════════════════════════
Result
Output: ($1 ~ '^(?:[_[:alpha:]]*)$'::text), ($1 ~ '^(?:[[:alpha:].]*)$'::text)
(2 rows)

The underscore before the [:alpha:] is left alone, but the one after
it gets translated to a period. Now the underscore is a wildcard
that corresponds to the period in regular expressions, but characters
in square brackets should lose their special meaning. The code in
utils/adt/regexp.c doesn't expect that square brackets can be nested.

The attached patch fixes the bug.

Yours,
Laurenz Albe

Attachments:

v1-0001-Fix-SIMILAR-TO-regex-translation.patchtext/x-patch; charset=UTF-8; name=v1-0001-Fix-SIMILAR-TO-regex-translation.patchDownload+28-7
#2Michael Paquier
michael@paquier.xyz
In reply to: Laurenz Albe (#1)
Re: SIMILAR TO expressions translate wildcards where they shouldn't

On Thu, May 22, 2025 at 11:18:44PM +0200, Laurenz Albe wrote:

The underscore before the [:alpha:] is left alone, but the one after
it gets translated to a period. Now the underscore is a wildcard
that corresponds to the period in regular expressions, but characters
in square brackets should lose their special meaning. The code in
utils/adt/regexp.c doesn't expect that square brackets can be nested.

The attached patch fixes the bug.

Oh, good catch. [_[:alpha:]] and [[:alpha:]_] both that this should
match every string made of a-zA-Z and underscores, but this is failing
to do the job for the latter.

+           if (pchar != '^' && charclass_start)
+               charclass_start = false;

I'm a bit puzzled by this part about '^', though, resetting the fact
that we are in a squared bracket section with '^' treated as an
exception. Perhaps this deserves a comment?
--
Michael

#3Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#2)
Re: SIMILAR TO expressions translate wildcards where they shouldn't

On Fri, May 23, 2025 at 10:10:04AM +0900, Michael Paquier wrote:

Oh, good catch. [_[:alpha:]] and [[:alpha:]_] both that this should
match every string made of a-zA-Z and underscores, but this is failing
to do the job for the latter.

+           if (pchar != '^' && charclass_start)
+               charclass_start = false;

I'm a bit puzzled by this part about '^', though, resetting the fact
that we are in a squared bracket section with '^' treated as an
exception. Perhaps this deserves a comment?

Ah, I see. This is just a way of saying that the caret ^ should still
be able to track the first character in the character class.

Anyway, I don't think that the tests in the patch are complete. For
example, '[[^](]' is transformed into '^(?:[[^](])$' in the SIMILAR TO
conversion with the patch, and before the patch we get
'^(?:[[^](?:])$'. Note the translation of the last parenthesis '(' to
"(?:" when inside the character class, but your patch does not
document that. AFAIU, we should not convert the parenthesis '(' while
in a multi-level character class as the patch does, but we have no
tests for it and the patch does not document this part, either.

Could it be possible to split the single SIMILAR TO expression into
multiple smaller pieces for each character that should not be
converted while inside a character class? This is hard to parse as
written in your proposal of patch.
--
Michael

#4Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Michael Paquier (#3)
Re: SIMILAR TO expressions translate wildcards where they shouldn't

On Fri, 2025-05-23 at 12:22 +0900, Michael Paquier wrote:

Anyway, I don't think that the tests in the patch are complete. For
example, '[[^](]' is transformed into '^(?:[[^](])$' in the SIMILAR TO
conversion with the patch, and before the patch we get
'^(?:[[^](?:])$'. Note the translation of the last parenthesis '(' to
"(?:" when inside the character class, but your patch does not
document that. AFAIU, we should not convert the parenthesis '(' while
in a multi-level character class as the patch does, but we have no
tests for it and the patch does not document this part, either.

Could it be possible to split the single SIMILAR TO expression into
multiple smaller pieces for each character that should not be
converted while inside a character class? This is hard to parse as
written in your proposal of patch.

I have changed the regression test like you suggest.

I also improved the code by adding more comments.
I renamed "incharclass" to "charclass_depth", which is more descriptive.

Also, I had to work some more on handling carets:
While the closing bracket is a regular character in []] and [^]], it
is not in expressions like [^^].

Yours,
Laurenz Albe

Attachments:

v2-0001-Fix-SIMILAR-TO-regex-translation-for-character-cl.patchtext/x-patch; charset=UTF-8; name=v2-0001-Fix-SIMILAR-TO-regex-translation-for-character-cl.patchDownload+45-7
#5Michael Paquier
michael@paquier.xyz
In reply to: Laurenz Albe (#4)
Re: SIMILAR TO expressions translate wildcards where they shouldn't

On Fri, May 23, 2025 at 11:42:10AM +0200, Laurenz Albe wrote:

I also improved the code by adding more comments.
I renamed "incharclass" to "charclass_depth", which is more descriptive.

Also, I had to work some more on handling carets:
While the closing bracket is a regular character in []] and [^]], it
is not in expressions like [^^].

The transformation with the patch:
'^(?:\.[.[:alnum:]_].[]%].*[^]$]\$[^^]\^[(](?:p))$'
And on HEAD:
'^(?:\.[.[:alnum:].].[].*].*[^]\$]\$[^^]\^[(](?:p))$'

If I am not missing something, '%', '_', '$', single and double carets
at the beginning are covered. '.' and '(' are not.

Anyway, that's really hard to parse so I would suggest to split each
check into queries of their own to show individual conversions in
these EXPLAIN outputs (we don't care if tese regexps are correct, just
want to check the output to the POSIX style). I am OK with the point
based on charclass_start to count the number of carets at the
beginning of a character class.

With some tweaks and the tests reworked, I am finishing with the
reviewed version attached. What do you think?
--
Michael

Attachments:

v3-0001-Fix-SIMILAR-TO-regex-translation-for-character-cl.patchtext/x-diff; charset=us-asciiDownload+143-7
#6Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Michael Paquier (#5)
Re: SIMILAR TO expressions translate wildcards where they shouldn't

On Tue, 2025-05-27 at 14:57 +0900, Michael Paquier wrote:

Anyway, that's really hard to parse so I would suggest to split each
check into queries of their own to show individual conversions in
these EXPLAIN outputs (we don't care if tese regexps are correct, just
want to check the output to the POSIX style).  I am OK with the point
based on charclass_start to count the number of carets at the
beginning of a character class.

With some tweaks and the tests reworked, I am finishing with the
reviewed version attached.  What do you think?

Thank you; I think that is good to go.

Yours,
Laurenz Albe

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Laurenz Albe (#6)
Re: SIMILAR TO expressions translate wildcards where they shouldn't

Laurenz Albe <laurenz.albe@cybertec.at> writes:

On Tue, 2025-05-27 at 14:57 +0900, Michael Paquier wrote:

With some tweaks and the tests reworked, I am finishing with the
reviewed version attached. What do you think?

Thank you; I think that is good to go.

Code changes look good, but I think the test cases are too cute:

+EXPLAIN (VERBOSE, COSTS OFF) SELECT (SELECT '') SIMILAR TO '_[_[:alpha:]_]_';
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Result
+   Output: ((InitPlan 1).col1 ~ '^(?:.[_[:alpha:]_].)$'::text)
+   InitPlan 1
+     ->  Result
+           Output: ''::text
+(5 rows)

This will break whenever somebody decides it's worth optimizing
a sub-select that looks like that. I'd suggest following the
pattern

explain (costs off) select * from text_tbl where f1 similar to 'z';
QUERY PLAN
----------------------------------
Seq Scan on text_tbl
Filter: (f1 ~ '^(?:z)$'::text)
(2 rows)

which is both less noisy and less likely to change in future.

regards, tom lane

#8Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tom Lane (#7)
Re: SIMILAR TO expressions translate wildcards where they shouldn't

On Tue, 2025-05-27 at 10:54 -0400, Tom Lane wrote:

Laurenz Albe <laurenz.albe@cybertec.at> writes:

On Tue, 2025-05-27 at 14:57 +0900, Michael Paquier wrote:

With some tweaks and the tests reworked, I am finishing with the
reviewed version attached. What do you think?

Thank you; I think that is good to go.

Code changes look good, but I think the test cases are too cute:

+EXPLAIN (VERBOSE, COSTS OFF) SELECT (SELECT '') SIMILAR TO '_[_[:alpha:]_]_';
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Result
+   Output: ((InitPlan 1).col1 ~ '^(?:.[_[:alpha:]_].)$'::text)
+   InitPlan 1
+     ->  Result
+           Output: ''::text
+(5 rows)

This will break whenever somebody decides it's worth optimizing
a sub-select that looks like that. I'd suggest following the
pattern

explain (costs off) select * from text_tbl where f1 similar to 'z';
QUERY PLAN
----------------------------------
Seq Scan on text_tbl
Filter: (f1 ~ '^(?:z)$'::text)
(2 rows)

which is both less noisy and less likely to change in future.

That's a good point.
I originally considered EXPLAIN (GENERIC_PLAN), but that would only
backpatch so far.

Yours,
Laurenz Albe

#9Michael Paquier
michael@paquier.xyz
In reply to: Laurenz Albe (#8)
Re: SIMILAR TO expressions translate wildcards where they shouldn't

On Tue, May 27, 2025 at 11:39:02PM +0200, Laurenz Albe wrote:

I originally considered EXPLAIN (GENERIC_PLAN), but that would only
backpatch so far.

Nice trick that makes the test output much smaller. I've used this
idea and applied the fix down to v13 after a second lookup this
morning.
--
Michael

#10Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#9)
Re: SIMILAR TO expressions translate wildcards where they shouldn't

On Wed, May 28, 2025 at 09:00:20AM +0900, Michael Paquier wrote:

Nice trick that makes the test output much smaller. I've used this
idea and applied the fix down to v13 after a second lookup this
morning.

And I've managed to miss that the case with an opening parenthesis
throws an error due to an unbalanced set of parentheses because the
regexp is processed when the query is written the way Tom has
mentioned.

Will fix in a bit after an extra round of check-world, sorry about
that..
--
Michael

#11Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Michael Paquier (#9)
Re: SIMILAR TO expressions translate wildcards where they shouldn't

On Wed, 2025-05-28 at 09:00 +0900, Michael Paquier wrote:

On Tue, May 27, 2025 at 11:39:02PM +0200, Laurenz Albe wrote:

I originally considered EXPLAIN (GENERIC_PLAN), but that would only
backpatch so far.

Nice trick that makes the test output much smaller. I've used this
idea and applied the fix down to v13 after a second lookup this
morning.

Thanks for taking care of this!

Yours,
Laurenz Albe