SIMILAR TO expressions translate wildcards where they shouldn't
The following surprising result
SELECT 'a_b' SIMILAR TO '[_[:alpha:]]*',
'a_b' SIMILAR TO '[[:alpha:]_]*';
?column? │ ?column?
══════════╪══════════
t │ f
(1 row)
becomes clear when we look how the expressions are translated to
regular expressions:
EXPLAIN (VERBOSE, GENERIC_PLAN, COSTS OFF)
SELECT $1 SIMILAR TO '[_[:alpha:]]*',
$1 SIMILAR TO '[[:alpha:]_]*';
QUERY PLAN
══════════════════════════════════════════════════════════════════════════════════
Result
Output: ($1 ~ '^(?:[_[:alpha:]]*)$'::text), ($1 ~ '^(?:[[:alpha:].]*)$'::text)
(2 rows)
The underscore before the [:alpha:] is left alone, but the one after
it gets translated to a period. Now the underscore is a wildcard
that corresponds to the period in regular expressions, but characters
in square brackets should lose their special meaning. The code in
utils/adt/regexp.c doesn't expect that square brackets can be nested.
The attached patch fixes the bug.
Yours,
Laurenz Albe
Attachments:
v1-0001-Fix-SIMILAR-TO-regex-translation.patchtext/x-patch; charset=UTF-8; name=v1-0001-Fix-SIMILAR-TO-regex-translation.patchDownload+28-7
On Thu, May 22, 2025 at 11:18:44PM +0200, Laurenz Albe wrote:
The underscore before the [:alpha:] is left alone, but the one after
it gets translated to a period. Now the underscore is a wildcard
that corresponds to the period in regular expressions, but characters
in square brackets should lose their special meaning. The code in
utils/adt/regexp.c doesn't expect that square brackets can be nested.The attached patch fixes the bug.
Oh, good catch. [_[:alpha:]] and [[:alpha:]_] both that this should
match every string made of a-zA-Z and underscores, but this is failing
to do the job for the latter.
+ if (pchar != '^' && charclass_start)
+ charclass_start = false;
I'm a bit puzzled by this part about '^', though, resetting the fact
that we are in a squared bracket section with '^' treated as an
exception. Perhaps this deserves a comment?
--
Michael
On Fri, May 23, 2025 at 10:10:04AM +0900, Michael Paquier wrote:
Oh, good catch. [_[:alpha:]] and [[:alpha:]_] both that this should
match every string made of a-zA-Z and underscores, but this is failing
to do the job for the latter.+ if (pchar != '^' && charclass_start) + charclass_start = false;I'm a bit puzzled by this part about '^', though, resetting the fact
that we are in a squared bracket section with '^' treated as an
exception. Perhaps this deserves a comment?
Ah, I see. This is just a way of saying that the caret ^ should still
be able to track the first character in the character class.
Anyway, I don't think that the tests in the patch are complete. For
example, '[[^](]' is transformed into '^(?:[[^](])$' in the SIMILAR TO
conversion with the patch, and before the patch we get
'^(?:[[^](?:])$'. Note the translation of the last parenthesis '(' to
"(?:" when inside the character class, but your patch does not
document that. AFAIU, we should not convert the parenthesis '(' while
in a multi-level character class as the patch does, but we have no
tests for it and the patch does not document this part, either.
Could it be possible to split the single SIMILAR TO expression into
multiple smaller pieces for each character that should not be
converted while inside a character class? This is hard to parse as
written in your proposal of patch.
--
Michael
On Fri, 2025-05-23 at 12:22 +0900, Michael Paquier wrote:
Anyway, I don't think that the tests in the patch are complete. For
example, '[[^](]' is transformed into '^(?:[[^](])$' in the SIMILAR TO
conversion with the patch, and before the patch we get
'^(?:[[^](?:])$'. Note the translation of the last parenthesis '(' to
"(?:" when inside the character class, but your patch does not
document that. AFAIU, we should not convert the parenthesis '(' while
in a multi-level character class as the patch does, but we have no
tests for it and the patch does not document this part, either.Could it be possible to split the single SIMILAR TO expression into
multiple smaller pieces for each character that should not be
converted while inside a character class? This is hard to parse as
written in your proposal of patch.
I have changed the regression test like you suggest.
I also improved the code by adding more comments.
I renamed "incharclass" to "charclass_depth", which is more descriptive.
Also, I had to work some more on handling carets:
While the closing bracket is a regular character in []] and [^]], it
is not in expressions like [^^].
Yours,
Laurenz Albe
Attachments:
v2-0001-Fix-SIMILAR-TO-regex-translation-for-character-cl.patchtext/x-patch; charset=UTF-8; name=v2-0001-Fix-SIMILAR-TO-regex-translation-for-character-cl.patchDownload+45-7
On Fri, May 23, 2025 at 11:42:10AM +0200, Laurenz Albe wrote:
I also improved the code by adding more comments.
I renamed "incharclass" to "charclass_depth", which is more descriptive.Also, I had to work some more on handling carets:
While the closing bracket is a regular character in []] and [^]], it
is not in expressions like [^^].
The transformation with the patch:
'^(?:\.[.[:alnum:]_].[]%].*[^]$]\$[^^]\^[(](?:p))$'
And on HEAD:
'^(?:\.[.[:alnum:].].[].*].*[^]\$]\$[^^]\^[(](?:p))$'
If I am not missing something, '%', '_', '$', single and double carets
at the beginning are covered. '.' and '(' are not.
Anyway, that's really hard to parse so I would suggest to split each
check into queries of their own to show individual conversions in
these EXPLAIN outputs (we don't care if tese regexps are correct, just
want to check the output to the POSIX style). I am OK with the point
based on charclass_start to count the number of carets at the
beginning of a character class.
With some tweaks and the tests reworked, I am finishing with the
reviewed version attached. What do you think?
--
Michael
Attachments:
v3-0001-Fix-SIMILAR-TO-regex-translation-for-character-cl.patchtext/x-diff; charset=us-asciiDownload+143-7
On Tue, 2025-05-27 at 14:57 +0900, Michael Paquier wrote:
Anyway, that's really hard to parse so I would suggest to split each
check into queries of their own to show individual conversions in
these EXPLAIN outputs (we don't care if tese regexps are correct, just
want to check the output to the POSIX style). I am OK with the point
based on charclass_start to count the number of carets at the
beginning of a character class.With some tweaks and the tests reworked, I am finishing with the
reviewed version attached. What do you think?
Thank you; I think that is good to go.
Yours,
Laurenz Albe
Laurenz Albe <laurenz.albe@cybertec.at> writes:
On Tue, 2025-05-27 at 14:57 +0900, Michael Paquier wrote:
With some tweaks and the tests reworked, I am finishing with the
reviewed version attached. What do you think?
Thank you; I think that is good to go.
Code changes look good, but I think the test cases are too cute:
+EXPLAIN (VERBOSE, COSTS OFF) SELECT (SELECT '') SIMILAR TO '_[_[:alpha:]_]_';
+ QUERY PLAN
+---------------------------------------------------------------
+ Result
+ Output: ((InitPlan 1).col1 ~ '^(?:.[_[:alpha:]_].)$'::text)
+ InitPlan 1
+ -> Result
+ Output: ''::text
+(5 rows)
This will break whenever somebody decides it's worth optimizing
a sub-select that looks like that. I'd suggest following the
pattern
explain (costs off) select * from text_tbl where f1 similar to 'z';
QUERY PLAN
----------------------------------
Seq Scan on text_tbl
Filter: (f1 ~ '^(?:z)$'::text)
(2 rows)
which is both less noisy and less likely to change in future.
regards, tom lane
On Tue, 2025-05-27 at 10:54 -0400, Tom Lane wrote:
Laurenz Albe <laurenz.albe@cybertec.at> writes:
On Tue, 2025-05-27 at 14:57 +0900, Michael Paquier wrote:
With some tweaks and the tests reworked, I am finishing with the
reviewed version attached. What do you think?Thank you; I think that is good to go.
Code changes look good, but I think the test cases are too cute:
+EXPLAIN (VERBOSE, COSTS OFF) SELECT (SELECT '') SIMILAR TO '_[_[:alpha:]_]_'; + QUERY PLAN +--------------------------------------------------------------- + Result + Output: ((InitPlan 1).col1 ~ '^(?:.[_[:alpha:]_].)$'::text) + InitPlan 1 + -> Result + Output: ''::text +(5 rows)This will break whenever somebody decides it's worth optimizing
a sub-select that looks like that. I'd suggest following the
patternexplain (costs off) select * from text_tbl where f1 similar to 'z';
QUERY PLAN
----------------------------------
Seq Scan on text_tbl
Filter: (f1 ~ '^(?:z)$'::text)
(2 rows)which is both less noisy and less likely to change in future.
That's a good point.
I originally considered EXPLAIN (GENERIC_PLAN), but that would only
backpatch so far.
Yours,
Laurenz Albe
On Tue, May 27, 2025 at 11:39:02PM +0200, Laurenz Albe wrote:
I originally considered EXPLAIN (GENERIC_PLAN), but that would only
backpatch so far.
Nice trick that makes the test output much smaller. I've used this
idea and applied the fix down to v13 after a second lookup this
morning.
--
Michael
On Wed, May 28, 2025 at 09:00:20AM +0900, Michael Paquier wrote:
Nice trick that makes the test output much smaller. I've used this
idea and applied the fix down to v13 after a second lookup this
morning.
And I've managed to miss that the case with an opening parenthesis
throws an error due to an unbalanced set of parentheses because the
regexp is processed when the query is written the way Tom has
mentioned.
Will fix in a bit after an extra round of check-world, sorry about
that..
--
Michael
On Wed, 2025-05-28 at 09:00 +0900, Michael Paquier wrote:
On Tue, May 27, 2025 at 11:39:02PM +0200, Laurenz Albe wrote:
I originally considered EXPLAIN (GENERIC_PLAN), but that would only
backpatch so far.Nice trick that makes the test output much smaller. I've used this
idea and applied the fix down to v13 after a second lookup this
morning.
Thanks for taking care of this!
Yours,
Laurenz Albe