Bizarre behavior of \w in a regular expression bracket construct

Started by Tom Lane | about 5 years ago | 12 messages | hackers
#1 Tom Lane
tgl@sss.pgh.pa.us

Our documentation says specifically "A character class cannot be used
as an endpoint of a range." This should apply to the character class
shorthand escapes (\d and so on) too, and for the most part it does:

# select 'x' ~ '[\d-a]';
ERROR: invalid regular expression: invalid character range

However, certain combinations involving \w don't throw any error:

# select 'x' ~ '[\w-a]';
?column?
----------
t
(1 row)

while others do:

# select 'x' ~ '[\w-;]';
ERROR: invalid regular expression: invalid character range

It turns out that what's happening here is that \w is being
macro-expanded into "[:alnum:]_" (see the brbackw[] constant
in regc_lex.c), so then we have

select 'x' ~ '[[:alnum:]_-a]';

and that's valid as long as '_' is less than the trailing
range bound. The fact that we're using REG_ERANGE for both
"range syntax botch" and "range start is greater than range
end" helps to mask the fact that the wrong thing is happening,
i.e. my last example above is giving the right error string
for the wrong reason.

I thought of changing the expansion to "_[:alnum:]" but of
course that just moves the problem around: then some cases
with \w after a dash would be accepted when they shouldn't be.
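
The effect of the textual expansion can be modeled in a few lines. This is a toy sketch in Python, not the engine's code: it assumes only what is described above (that \w is replaced by "[:alnum:]_" before the bracket contents are parsed) and then checks, by code point, the range that ends up adjacent to the '_'. The function name and the two constants are hypothetical.

```python
# Toy model only: textual expansion of \w, then a code-point check on the
# range that ends up adjacent to '_'. All names here are hypothetical.
OLD_EXPANSION = "[:alnum:]_"   # the brbackw[] expansion described above
ALT_EXPANSION = "_[:alnum:]"   # the reordering that was considered


def underscore_range_ok(body, expansion=OLD_EXPANSION):
    """Return True/False if '_' becomes a range endpoint and the range is
    correctly/incorrectly ordered; None if '_' is not a range endpoint."""
    expanded = body.replace(r"\w", expansion)
    i = expanded.index("_")                      # '_' supplied by the expansion
    if expanded[i + 1:i + 2] == "-" and i + 2 < len(expanded):
        return ord("_") <= ord(expanded[i + 2])  # range "_-X"
    if i >= 2 and expanded[i - 1] == "-":
        return ord(expanded[i - 2]) <= ord("_")  # range "X-_"
    return None


print(underscore_range_ok(r"\w-a"))                 # '_' (95) vs 'a' (97)
print(underscore_range_ok(r"\w-;"))                 # '_' (95) vs ';' (59)
print(underscore_range_ok(r";-\w", ALT_EXPANSION))  # ';' (59) vs '_' (95)
```

The first call models the accepted case, the second the rejected one, and the third shows how the reordered expansion would start accepting some patterns with \w after a dash.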

I have a patch in progress that gets rid of the hokey macro
expansion implementation of \w and friends, and I noticed
this issue because it started to reject "[\w-_]", which our
existing code accepts. There's a bunch of examples like that
in Joel's Javascript regex corpus. I suspect that Javascript
is reading such cases as "\w plus the literal characters '-'
and '_'", but I'm not 100% sure of that.

Anyway, I don't see any non-invasive way to fix this in the
back branches, and I'm not sure that anyone would appreciate
our changing it in stable branches anyway. But I wanted to
document the issue for the record.

regards, tom lane

#2 Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#1)
Re: Bizarre behavior of \w in a regular expression bracket construct

On Sat, Feb 20, 2021, at 23:20, Tom Lane wrote:

I have a patch in progress that gets rid of the hokey macro
expansion implementation of \w and friends, and I noticed
this issue because it started to reject "[\w-_]", which our
existing code accepts. There's a bunch of examples like that
in Joel's Javascript regex corpus. I suspect that Javascript
is reading such cases as "\w plus the literal characters '-'
and '_'", but I'm not 100% sure of that.

In an attempt to demystify how \w works in various regex engines,
I created a test to deduce the matching ranges for a given bracket expression.

In the ASCII mode, it just tries all characters between 1...255:

regex | engine | deduced_ranges
------------+--------+-------------------------------
^([a-z])$ | pg | [a-z]
^([a-z])$ | pl | [a-z]
^([a-z])$ | v8 | [a-z]
^([\d-a])$ | pg |
^([\d-a])$ | pl | [-0-9a]
^([\d-a])$ | v8 | [-0-9a]
^([\w-;])$ | pg |
^([\w-;])$ | pl | [-0-9;A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w-;])$ | v8 | [-0-9;A-Z_a-z]
^([\w-_])$ | pg | [0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w-_])$ | pl | [-0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w-_])$ | v8 | [-0-9A-Z_a-z]
^([\w])$ | pg | [0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w])$ | pl | [0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w])$ | v8 | [0-9A-Z_a-z]
^([\W])$ | pg |
^([\W])$ | pl | [\x01-/:-@[-^`{-©«-´¶-¹»-¿×÷]
^([\W])$ | v8 | [\x01-/:-@[-^`{-ÿ]
^([\w-a])$ | pg | [0-9A-Z_-zªµºÀ-ÖØ-öø-ÿ]
^([\w-a])$ | pl | [-0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w-a])$ | v8 | [-0-9A-Z_a-z]
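
The brute-force deduction described above can be sketched as follows. This is not the attached brute_matches.sql; it is a minimal Python approximation that assumes Python's `re` module as the engine under test, tries every single-character string in a code point interval, and collapses consecutive hits into ranges. The function name and its parameters are made up for illustration.

```python
import re


def deduce_ranges(pattern, lo=1, hi=255, flags=0):
    """Try every single-character string in [lo, hi] and summarize the
    matching code points in bracket-expression style."""
    rx = re.compile(pattern, flags)
    hits = [cp for cp in range(lo, hi + 1) if rx.match(chr(cp))]
    parts = []
    i = 0
    while i < len(hits):
        j = i
        while j + 1 < len(hits) and hits[j + 1] == hits[j] + 1:
            j += 1                      # extend the run of consecutive code points
        first, last = chr(hits[i]), chr(hits[j])
        parts.append(first if i == j else first + "-" + last)
        i = j + 1
    return "[" + "".join(parts) + "]"


print(deduce_ranges(r"^([a-z])$"))
print(deduce_ranges(r"^([\w])$", flags=re.ASCII))
```

With `re.ASCII` the second call summarizes \w the same way the v8 rows do; without that flag Python applies its own Unicode rules, which need not agree with any of the three engines in the table.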

In the UTF8 mode, it generates 10000 random valid UTF-8 byte sequences converted to text.
This will of course leave a lot of gaps, but one gets an idea of what ranges there are.

regex | engine | deduced_ranges
------------+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------
^([a-z])$ | pg | [a-z]
^([a-z])$ | pl | [a-z]
^([a-z])$ | v8 | [a-z]
^([\d-a])$ | pg | ERROR
^([\d-a])$ | pl | [-0-9a٤-٦٩۲-۴۶۸-۹߀-߃߅߇०२४९১-২৫-৬੦੩-੪੬੯૧૭ ... 5 chars ... -୯௧௩௫౦౩౯೧೩-೫೯൧൪-൯෧෫໐໖-໗໙༡-༢២᠔᥎᪁᮵258-9𐒨𝟿]
^([\d-a])$ | v8 | [-0-9a]
^([\w-;])$ | pg | ERROR
^([\w-;])$ | pl | [-0-9;A-Z_a-zÀÂÆÌÎ-ÐÔÙ-Úß-áéëîñó-õø-ùûýÿ ... 3901 chars ... 𭈞𭈢𭈴𭑇𭒐𭕵𭖋𭙋𭟞𭢘𭥋𭥬𭧊𭧝𭫘𭯙𭯟𭶾𭷵𭸴𭹊𭻚𭼁𭽁𭾠𮄖𮅮𮉵𮏲𮕙𮛣𮝎𮣂𮥑𮪨忹殺灊鏹]
^([\w-;])$ | v8 | [-0-9;A-Z_a-z]
^([\w-_])$ | pg | [0-9A-Z_a-zªÁÆ-ÇÍ-ÒÔÙ-ÚÜÞáä-æèë-ìî-ïñõùý ... 3704 chars ... 𭍱𭓆𭓡𭕆𭖋𭖮𭘤𭙬𭣯𭦞𭬍𭭈𭲌𭶓𭶶𭷻𭹣𭹩𭼪𭾘𭿡𮄄𮄿𮆟𮆢𮇴𮋬𮍠𮏕𮒹𮜒𮝒𮡺𮦐𮨲𮩣𡛪韠𪊑]
^([\w-_])$ | pl | [-0-9A-Z_a-zªµÀ-ÁÅÈÊÑÓÕ-ÖØÚà-áã-æê-ìîð-ó ... 3884 chars ... 𭙐𭙥𭛏𭜆𭝃𭞗𭟺𭠼𭥮𭧕𭧙𭫢𭯛𭲠𭷱𭸡𭾉𮁣𮃦𮄫𮈔𮉞𮊀𮑳𮕝𮘊𮘚𮛍𮣝𮧕𮩺𮪇𮬊𮬡𡬘㩬茝鄛󠇂]
^([\w-_])$ | v8 | [-0-9A-Z_a-z]
^([\w])$ | pg | [0-9A-Z_a-zÃÇÉ-ÊÍ-ÎÐÒÖÙÛ-Þà-âåêî-ðò-ôöøú ... 3803 chars ... 𭏟-𭏠𭗷𭘱𭚆𭛿𭝵𭡓𭢕𭩪𭬞𭭆𭭾𭮺𭯌𭰅𭱇𭲩𭶧𭷡𭹿𭺟𮀑𮆔𮇩𮇰𮈯𮋷𮌜𮌨𮞄-𮞅𮩧𮫷𮬕𮮿舁]
^([\w])$ | pl | [0-9A-Z_a-zºÁÄ-ÆÉÍ-ÎÐÓ-ÔÖÙÛ-àâ-æéíð-ñø-ù ... 3881 chars ... 𭙗𭙳𭛨𭞌𭣘𭤁𭥖𭥜𭥷𭦋𭧺𭯊𭸘𭹍𭼷𭿰𮁵𮈅𮈇𮊩𮖛𮖹𮘠𮚞𮜞𮝀𮟟𮡖𮣝𮦖𮦘𮧏𮬅𮭁𮮟𮯓𦾱嶲󠇋]
^([\w])$ | v8 | [0-9A-Z_a-z]
^([\W])$ | pg | ERROR
^([\W])$ | pl | [\x01-/:-@[-^`{-\x7F\u0085-\u0089\u008B-\u008C\u008E-\u0092\u0098¥-§©«-¯±-²¸×˄-˅ ... 4264 chars ... 􏞢􏟆􏟐􏟘􏢄􏣢􏥭􏦡􏧎􏧰􏩤􏪃􏪠􏪵􏫎􏫤􏬌􏭇􏭴􏭷􏮩􏮷􏯭􏯴􏯾􏰬􏲡􏲾􏳧􏳵􏵡􏶾􏷤􏷫􏹶􏺷􏼁􏽷􏿵]
^([\W])$ | v8 | [\x01-/:-@[-^`{-\u0080\u0084\u0087\u008C\u008F\u0091\u0096\u009A -¡¥§ª-«®-°²-³µ¹¿ÁÄ ... 4855 chars ... -BGJLQT-Ubgkr-sy}「-」ェャスハホムᄀ-ᄁ좌￐ᅭᅵ￧↑￾]
^([\w-a])$ | pg | [0-9A-Z_-zªºÁ-ÃÇÌ-ÎÐ-ÑÔÖÝâ-ãå-æé-êìî-ñõü ... 3717 chars ... 𭝕𭟞𭡂𭡶𭤇𭥷𭦃𭧝𭮄-𭮅𭳐𭴁𭵦𭷥𭸍𭾙𭿘𮅕𮅳𮆈𮍪𮚝𮛶𮜠𮝁𮠦𮣆𮣼𮥴𮨨𮭘𮮛仌壮望-朡變]
^([\w-a])$ | pl | [-0-9A-Z_a-zºÁÃÇÉ-ÊÏÒ-ÔÖØÚ-ÛÞáäæí-ïõúü-ý ... 3854 chars ... 𭏇𭒧𭔃𭔽𭙟𭞽𭡖𭢮𭢱𭤙𭤶𭧝𭪁𭪻𭯰𭰭𭲟𭳚𭵊𭵽𭸷𭾏𮂗𮃴𮈄𮋝𮌫𮍏𮚅𮞞𮠾𮡊𮡿𮢐𮨍兤潮䏕𩅅]
^([\w-a])$ | v8 | [-0-9A-Z_a-z]

pg=PostgreSQL
pl=Perl
v8=Javascript

I think the use of \w and \W should be considered an anti-pattern when writing regexes, in any language,
due to the apparent variations between popular engines. It will never be obvious to either the reader
or the writer of the regex what was meant or what it means.

/Joel

Attachments:

brute_matches.sql (application/octet-stream)

#3 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Joel Jacobson (#2)
Re: Bizarre behavior of \w in a regular expression bracket construct

On 2021-Feb-21, Joel Jacobson wrote:

regex | engine | deduced_ranges
------------+--------+-------------------------------
^([a-z])$ | pg | [a-z]
^([a-z])$ | pl | [a-z]
^([a-z])$ | v8 | [a-z]
^([\d-a])$ | pg |
^([\d-a])$ | pl | [-0-9a]
^([\d-a])$ | v8 | [-0-9a]
^([\w-;])$ | pg |
^([\w-;])$ | pl | [-0-9;A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w-;])$ | v8 | [-0-9;A-Z_a-z]
^([\w-_])$ | pg | [0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w-_])$ | pl | [-0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w-_])$ | v8 | [-0-9A-Z_a-z]
^([\w])$ | pg | [0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w])$ | pl | [0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w])$ | v8 | [0-9A-Z_a-z]
^([\W])$ | pg |
^([\W])$ | pl | [\x01-/:-@[-^`{-©«-´¶-¹»-¿×÷]
^([\W])$ | v8 | [\x01-/:-@[-^`{-ÿ]
^([\w-a])$ | pg | [0-9A-Z_-zªµºÀ-ÖØ-öø-ÿ]
^([\w-a])$ | pl | [-0-9A-Z_a-zªµºÀ-ÖØ-öø-ÿ]
^([\w-a])$ | v8 | [-0-9A-Z_a-z]

It looks like the interpretation of these other engines is that [\d-a]
is the set of \d, the literal character "-", and the literal character
"a". In other words, the - preceded by \d or \w (or any other character
class, I guess?) loses its special meaning of identifying a character
range.

This one I didn't understand:

^([\W])$ | pg |

--
Álvaro Herrera Valdivia, Chile
"Because frankly, if you had to pass an exam to know how to manage
yourself... who is the tough guy who would have a license?" (Mafalda)

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#3)
Re: Bizarre behavior of \w in a regular expression bracket construct

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

It looks like the interpretation of these other engines is that [\d-a]
is the set of \d, the literal character "-", and the literal character
"a". In other words, the - preceded by \d or \w (or any other character
class, I guess?) loses its special meaning of identifying a character
range.

Yeah. While I can see the attraction of being picky about this,
I can also see the attraction of being more compatible with other
engines. Should we relax this?

A quick experiment with perl shows that its opinion is "if the
atom before or after a potentially range-defining dash is a
character class, then take the dash as an ordinary character".
(This confirms Joel's result, and also I found that e.g. [3-\w]
treats the dash as a literal character.)
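
Since engines disagree on a dash next to a class escape, the unambiguous spelling is to escape the dash. The sketch below uses Python's `re` module purely as a stand-in engine (an assumption of this example, it is not one of the engines tested in this thread). Python happens to take the strict line and rejects a dash between a class escape and a literal, while the escaped form gives the "\w plus '-' plus 'a'" reading explicitly. The helper name is hypothetical.

```python
import re


def compiles(pattern):
    """True if Python's re accepts the pattern, False if it rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


ambiguous = r"^[\w-a]$"    # dash between a class escape and a literal
explicit = r"^[\w\-a]$"    # dash escaped: \w, '-' and 'a', nothing else

print(compiles(ambiguous), compiles(explicit))
print([c for c in "-a;x" if re.match(explicit, c)])
```

Writing the dash escaped (or as the last item in the brackets) avoids depending on how a given engine resolves the unspecified case.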

This one I didn't understand:

^([\W])$ | pg |

I think Joel just forgot to mark that as ERROR. It certainly
doesn't work in our engine today (though I'm nearly done with
a patch to fix that).

regards, tom lane

#5 Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#4)
Re: Bizarre behavior of \w in a regular expression bracket construct

On Sun, Feb 21, 2021, at 18:39, Tom Lane wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

This one I didn't understand:

^([\W])$ | pg |

I think Joel just forgot to mark that as ERROR.

Yes, my mistake, sorry about that,
(I manually edited the query result and replaced empty-field with "ERROR").

(I see I also forgot to mark the ones in the first ASCII part
of the email as ERROR, which should have been the
ones with an empty field for engine "pg".)

/Joel

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#4)
Re: Bizarre behavior of \w in a regular expression bracket construct

I wrote:

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

It looks like the interpretation of these other engines is that [\d-a]
is the set of \d, the literal character "-", and the literal character
"a". In other words, the - preceded by \d or \w (or any other character
class, I guess?) loses its special meaning of identifying a character
range.

Yeah. While I can see the attraction of being picky about this,
I can also see the attraction of being more compatible with other
engines. Should we relax this?

After some more research I'm feeling that this would be a bad idea.
The POSIX spec states that such cases are unspecified, meaning that
implementations can do what they like. Hence Perl and JS are not
out of line to interpret it this way. However, XQuery and therefore
also SQL consider that a character class after a dash means character
set subtraction [1], which is pretty nearly the exact opposite
semantics. Keeping in mind that we are likely to someday want to
provide a closer match for XQuery, I'm thinking we're best off to
keep such cases as an error for now. Otherwise the risk of confusion
will be pretty high.
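
To make "nearly the exact opposite" concrete, the two readings can be compared as plain sets. This is only a set model over ASCII, written in Python with made-up names; it does not run any regex engine, and it glosses over the fact that XQuery spells subtraction with a nested bracket expression after the dash.

```python
import string

WORD = set(string.ascii_letters + string.digits + "_")   # ASCII-only \w
DIGIT = set(string.digits)                                # ASCII-only \d

# Perl/JavaScript-style reading: class, literal '-', class.
union_reading = WORD | {"-"} | DIGIT
# XQuery-style reading: the class after the dash is subtracted.
subtraction_reading = WORD - DIGIT

# Characters accepted under one reading but not the other.
print(sorted(union_reading - subtraction_reading))
```

Under the first reading the dash and every digit match; under the second neither does, so silently picking one would give surprising results to anyone expecting the other.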

Anyway, 0001 attached is the promised patch to enable \D, \S, \W
to work inside bracket expressions. I did some cleanup in the
general area as well:

* Create infrastructure to allow treating \w as a character class
in its own right. (I did not expose [[:word:]] as a class name,
though it would be a little more symmetric to do so; should we?)

* Split cclass() into separate functions to look up a char class
name (producing an enum) and to produce a cvec character vector
from the enum. This allows the char class escapes to use the
enum values directly without an artificial lookup.

* Remove the lexnest() hack, and in consequence clean up wordchrs()
to not interact with the lexer.

* Fix colorcomplement() to not be O(N^2) in the number of colors
involved. I didn't detect any measurable speedup on Joel's corpus,
but it seems like a good idea anyway.

* Get rid of useless-as-far-as-I-can-see calls of element()
on single-character character element names in brackpart().
element() always maps these to the character itself, and things
would be quite broken if it didn't --- should "[a]" match something
different than "a" does? Besides, the shortcut path in brackpart()
wasn't doing this anyway, making it even more inconsistent.

0001 preserves the current behavior of these constructs with
respect to newlines, namely that:

\s matches newline, with or without 'n' flag
\S doesn't match newline, with or without 'n' flag
\w doesn't match newline, with or without 'n' flag
\W matches newline, except with 'n' flag
\d doesn't match newline, with or without 'n' flag
\D matches newline, except with 'n' flag

Perl and Javascript believe that \W and \D should match newlines
regardless of their 's' flag, so there's a case for changing
\W and \D to match newline regardless of our 'n' flag. 0002
attached is the quite trivial patch to do this. I'm not quite
100% convinced whether this is a good change to make, but if we're
going to do it now would be the time.
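
The newline rules listed above, and the one thing 0002 changes, fit in a small lookup. This is a toy restatement in Python of the table in this message, with a hypothetical function name and flag names; it is not code from either patch.

```python
def matches_newline(escape, n_flag, with_0002=False):
    """Restate the table above: does this escape match a newline?"""
    if escape == r"\s":
        return True                      # with or without 'n'
    if escape in (r"\S", r"\w", r"\d"):
        return False                     # with or without 'n'
    if escape in (r"\W", r"\D"):
        # 0001 keeps "except with 'n' flag"; 0002 drops that exception.
        return with_0002 or not n_flag
    raise ValueError("unknown escape: " + escape)


for esc in (r"\W", r"\D"):
    print(esc, matches_newline(esc, n_flag=True),
          matches_newline(esc, n_flag=True, with_0002=True))
```

Only the two complemented escapes under the 'n' flag change between 0001 and 0001+0002; every other row is identical.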

Thoughts?

regards, tom lane

[1]: https://www.regular-expressions.info/charclasssubtract.html

Attachments:

0001-rework-char-class-escapes.patch (text/x-diff, +674 -264)
0002-DW-always-match-newline.patch (text/x-diff, +21 -10)

#7 Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#6)
Re: Bizarre behavior of \w in a regular expression bracket construct

Hi,

On Tue, Feb 23, 2021, at 18:15, Tom Lane wrote:

0001 preserves the current behavior of these constructs with
respect to newlines, namely that:

\s matches newline, with or without 'n' flag
\S doesn't match newline, with or without 'n' flag
\w doesn't match newline, with or without 'n' flag
\W matches newline, except with 'n' flag
\d doesn't match newline, with or without 'n' flag
\D matches newline, except with 'n' flag

Perl and Javascript believe that \W and \D should match newlines
regardless of their 's' flag, so there's a case for changing
\W and \D to match newline regardless of our 'n' flag. 0002
attached is the quite trivial patch to do this. I'm not quite
100% convinced whether this is a good change to make, but if we're
going to do it now would be the time.

Thoughts?

I've tested 4.4M different regex/subject pairs
against 0001 and 0001+0002 trying to find
some interesting examples to analyze:

SELECT COUNT(*) FROM regex_tests;
4468843

Out of these, 64783 (1.4%) contained \W
that could be processed by the regex engine
and that didn't produce an error:

CREATE TABLE "\W" AS SELECT * FROM regex_tests WHERE processed AND error_pg IS NULL AND pattern LIKE '%\\W%';
SELECT 64783

Out of these, 539 gave a different result
when comparing 0001 vs 0001+0002:

CREATE TABLE "\W diff" AS SELECT *, regexp_match(subject, '('||pattern||')', 'n') AS captured_pg_0001 FROM "\W" WHERE captured_pg IS DISTINCT FROM regexp_match(subject, '('||pattern||')', 'n');
SELECT 539

Out of these, 62 didn't contain any \W
when the special [\w\W] construct had been filtered out.

CREATE TABLE "\W diff ignore [\w\W]" AS SELECT * FROM "\W diff" WHERE regexp_replace(pattern,'\[\\w\\W\]','','g') LIKE '%\\W%';
SELECT 62

Out of these, here is a break-down showing number of distinct subjects per pattern:

SELECT COUNT(*), pattern FROM "\W diff ignore [\w\W]" GROUP BY 2 ORDER BY 1 DESC;
count | pattern
-------+--------------------------------------------------
47 | (?:^|\W+)@apply\s*\(?([^);\n]*)\)?
12 | \W
1 | ((?:^|}|,|;)\W*)((?:\w+)?\.(?:mc|mg|row)[\-\w]+)
1 | [\W\d]+
1 | \W*$
(5 rows)

Let's go through each case:

Pattern 1: (?:^|\W+)@apply\s*\(?([^);\n]*)\)?
====================================

This pattern is always used with the flags "gi".

Example subject:

font-family: var(--paper-font-common-base_-_font-family); -webkit-font-smoothing: var(--paper-font-common-base_-_-webkit-font-smoothing);
@apply --paper-font-common-nowrap;

If the author had intended to match only non-word characters without newlines,
then these kinds of subjects would only match by coincidence, since @apply is indented
using blank space, which is included in \W.

The \W+ in this example makes the regex match the ");" on the line before "@apply", which looks very odd.

My conclusion is that the author in this example wrongly thinks \W+ means "at least one white space".

I therefore think it would be an improvement in this case to always include newlines in \W.

Patch 0002 therefore gets +1 due to this example.

Pattern 2: \W
============

Flags used for this pattern (among all examples, not just the ones producing a diff):

SELECT flags, count FROM patterns WHERE pattern = '\W' ORDER BY 2 DESC;
flags | count
-------+-------
g | 2805
| 1476
gi | 39
y | 22
(4 rows)

All subjects for this pattern had some white-space in the beginning,
and all of them even have at least one new-line in the beginning:

SELECT length((regexp_match(subject,'^(\n*)'))[1]), COUNT(*) FROM "\W diff ignore [\w\W]" WHERE pattern = '\W' GROUP BY 1 ORDER BY 1;
length | count
--------+-------
1 | 9
2 | 1
3 | 2
(3 rows)

This, in combination with the popularity of the "g" flag with this pattern,
makes me think \W is used to strip away leading white-space,
including new-lines.

Patch 0002 therefore gets +1 due to this example.

Pattern 3: ((?:^|}|,|;)\W*)((?:\w+)?\.(?:mc|mg|row)[\-\w]+)
==============================================

Flags: g

Subject:

div.mgline:hover a.close-informer {
opacity: 0.7;
-moz-transition: all 0.3s ease-out;
-o-transition: all 0.3s ease-out;
-webkit-transition: all 0.3s ease-out;
-ms-transition: all 0.3s ease-out;
transition: all 0.3s ease-out;
}

To me it looks like the author wrongly thinks \W means "white space".

What makes me believe this is that \W* is in between

(?:^|}|,|;)

which matches end of statements, and,

(?:\w+)?\.

which matches an HTML tag and CSS class name, or just a CSS class name.

The only natural thing I see could exist in between those two constructs is white space.

Normally this regex doesn't produce any difference for cases found,
since most CSS code has been minified where newlines are removed,
but the case above was not minified and produced a diff.

Patch 0002 therefore gets +1 due to this example.

Pattern 4: [\W\d]+
================

No flags for this pattern.

The case that caused a diff was a subject with just a single comma, followed by newline and then blank spaces.

Subject in hex: 2c 0a 09 09 09 09 09 09 09

This caused 0001 to only match the comma,
whereas 0002 (and Javascript/Perl) matches the blank spaces as well.
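
The difference on this subject can be reproduced outside PostgreSQL with an emulation. The sketch below is Python, not the patched engine: it assumes that "\W that does not match newline" can be spelled `[^\w\n]`, and it builds the subject from the hex bytes given above.

```python
import re

# Subject from the hex dump above: a comma, a newline, then seven tabs.
subject = bytes.fromhex("2c 0a 09 09 09 09 09 09 09").decode("ascii")

# Emulation (an assumption, not PostgreSQL itself): a \W that excludes
# newline is written [^\w\n]; Python's own \W includes newline.
without_newline = re.match(r"(?:[^\w\n]|\d)+", subject).group()
with_newline = re.match(r"[\W\d]+", subject).group()

print(repr(without_newline))
print(with_newline == subject)
```

The first pattern stops at the comma because the newline ends the run; the second consumes the whole subject.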

Here are some other subjects that don't necessarily cause a diff,
but that could hopefully help us understand the intent of the regex:

SELECT DISTINCT ON (regexp_match_v8) * FROM (SELECT regexp_match_v8(subject,'[\W\d]+'), shrink_text(subject,40) FROM subjects WHERE pattern_id = 25935) AS x;
regexp_match_v8 | shrink_text
------------------------------------------------------------+-------------------------------------------------------------
{", +| , +
"} |
{", "} | ,
{.} | .col-item
{/} | /content/phonak/se/s ... 106 chars ... e.jpg, (largeretina)
{//} | //images.images4us.c ... 53 chars ... -481919.png, (large)
{://} | https://www.dilling. ... 55 chars ... .webp, (medium-only)
{3} | typo3conf/ext/rlp/Re ... 23 chars ... lp-logo.png, (large)
(7 rows)

We can see the diffing case on the first line, the one with comma and newlines+blank spaces.
No clue on what that one is, but looking at the rest,
to me it looks like they are trying to match the non-word characters in the beginning.
The strange thing is why \d is included in the bracket expression.
This causes a difference in the last example:

{3} | typo3conf/ext/rlp/Re ... 23 chars ... lp-logo.png, (large)

If \d would not have been included, the first "/" would be matched instead of the "3".

I cannot draw any conclusions for this pattern on what would be advisable,
except that in most cases for this pattern, it wouldn't make any difference to include
or not include newlines in \W.

Pattern 5: \W*$
==============

No flags for this pattern.

The subject is redacted due to being a promotional text for some cryptocurrency.
It's just four normal English sentences, where the last one is separated from the first three
with two newlines in between, rewritten:

"Example sentence. Some other sentence.

Yet some other sentence. "

Double-quotes added to show the trailing blank space in the last sentence.
Due to it, the 'n' regex flag causes the dot and newline to match with the 0002 patch,
but only match the dot without the 0002 patch.

In Javascript/Perl, since $ only means end-of-string there (unless using the "m" flag),
they instead match the last blank space. 0002 would give the same behaviour without the "n" flag.
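
The newline-sensitive case can be approximated in Python as a rough check. Two stand-ins are assumed here and neither is PostgreSQL: `re.MULTILINE` plays the line-boundary part of the 'n' flag for `$`, and `[^\w\n]` plays a \W that excludes newline. The subject is the rewritten example text from above.

```python
import re

subject = "Example sentence. Some other sentence.\n\nYet some other sentence. "

# re.MULTILINE stands in for the line-boundary part of the 'n' flag, and
# [^\w\n] stands in for a \W that excludes newline (both are assumptions).
before = re.search(r"[^\w\n]*$", subject, re.MULTILINE).group()
after = re.search(r"\W*$", subject, re.MULTILINE).group()

print(repr(before), repr(after))
```

In this emulation the first search stops at the dot that ends the first line, while the second also takes the newline that follows it, which is the dot-versus-dot-and-newline difference described above.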

My conclusion is \W*$ is typically wrongly used to remove trailing white-space.

Always including newlines in \W would be an improvement here,
since otherwise newlines wouldn't be stripped.

Patch 0002 therefore gets +1 due to this example.

======END OF PATTERNS=====

Final conclusion:

Out of the 5 patterns analyzed,
I found 4 of them would benefit from including newlines in \W.

The risk of changing this seems rather small,
since only 0.01% of the cases found produced
any difference at all (539 out of 4468843),
and out of these cases, most only contained
the obvious [\w\W] which greatly benefits,
and the rest of the 62 cases have now been
manually verified to also benefit from a change.

My opinion is therefore we should change \W to include newlines.

I will hopefully be able to provide a similar analysis of \D soon,
but wanted to send this in the meantime.

/Joel

#8 Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#7)
Re: Bizarre behavior of \w in a regular expression bracket construct

On Wed, Feb 24, 2021, at 16:23, I wrote:

I will hopefully be able to provide a similar analysis of \D soon,
but wanted to send this in the meantime.

CREATE TABLE "\D" AS SELECT * FROM regex_tests WHERE processed AND error_pg IS NULL AND pattern LIKE '%\\D%';
SELECT 67558

CREATE TABLE "\D diff" AS SELECT *, regexp_match(subject, '('||pattern||')', 'n') AS captured_pg_0001 FROM "\D" WHERE captured_pg IS DISTINCT FROM regexp_match(subject, '('||pattern||')', 'n');
SELECT 12

SELECT COUNT(*), pattern FROM "\D diff" GROUP BY 2 ORDER BY 1 DESC;
count | pattern
-------+----------
11 | \D
1 | [\D|\d]*
(2 rows)

Pattern 1: \D
============

This pattern is used to find the first decimal separator, normally dot (.):

SELECT subject FROM regex_tests WHERE pattern = '\D' ORDER BY RANDOM() LIMIT 10;
subject
---------------------------
1.11.00.24975645674952163
1.11.30.6944442955860683
1.12.40.38502468714280424
3.5.10.9407443094500285
1.12.40.34334381021879845
2.0.20.5175496920692813
1.8.30.09144561055484002
3.4.10.6083619758942858
3.5.10.15406771889459425
2.0.00.6309370335082272
(10 rows)

We can see how this works in almost all cases:

SELECT captured_pg, captured_v8, count(*) from regex_tests where pattern = '\D' GROUP BY 1,2 ORDER BY 3 DESC LIMIT 3;
captured_pg | captured_v8 | count
-------------+-------------+-------
{.} | {.} | 66797
| | 103
{-} | {-} | 64
(3 rows)

If we take a look at the diffs found,
all such cases have subjects that start with newlines:

SELECT COUNT(*), subject ~ '^\n' AS starts_with_newline FROM "\D diff" WHERE pattern = '\D' GROUP BY 2;
count | starts_with_newline
-------+---------------------
11 | t
(1 row)

Naturally, if newlines are not included, then something else will match instead.

Now, if in these cases we ignore the newline(s) and instead proceed
to match the first non-digit non-newline, maybe we would find a dot (.)
like in the normal case? No, that is not the case. Instead, we will hit
some arbitrary blank space or tab:

SELECT convert_to(captured_pg[1],'utf8') AS "0001+0002", convert_to(captured_pg_0001[1],'utf8') AS "0001", COUNT(*) FROM "\D diff" WHERE pattern = '\D' GROUP BY 1,2;
0001+0002 | 0001 | count
-----------+------+-------
\x0a | \x09 | 3
\x0a | \x20 | 7
\x0a | | 1
(3 rows)

The last example, where nothing at all matched, was due to the string containing only a single newline,
which couldn't be matched.
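
The three rows above can be mimicked with made-up subjects. This Python sketch assumes `[^\d\n]` as a stand-in for a \D that excludes newline; the subjects are hypothetical strings shaped like the ones in the diff (leading newline followed by a tab or space, or a lone newline), not rows from the corpus.

```python
import re


def first_non_digit(subject, newline_counts):
    """Return the first match of \\D, with or without newline included."""
    pattern = r"\D" if newline_counts else r"[^\d\n]"
    m = re.search(pattern, subject)
    return m.group() if m else None


for subject in ("\n\tabc", "\n abc", "\n"):   # hypothetical subjects
    print(repr(first_non_digit(subject, True)),
          repr(first_non_digit(subject, False)))
```

When newline counts, it is always the first match; when it does not, the match falls through to the tab or space, or to nothing at all for the lone newline.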

None of these outliers contain any decimal-looking digit sequences at all;
it's all just white space, one "€ EUR" text and some text that looks like
it's coming from some web shop's title:

SELECT ROW_NUMBER() OVER (), subject FROM "\D diff" WHERE pattern = '\D';
row_number | subject
------------+----------------------------------------------------------------
1 | +
| +
| +
|
2 | +
|
3 | +
|
4 | +
|
5 | +
| € EUR +
|
6 | +
|
7 | +
|
8 | +
|
9 | +
|
10 | +
| Dunjackor, duntäcken och dunkuddar | Joutsen Dunspecialist+
| +
| +
| +
| – Joutsen Sweden +
| +
|
11 | +
|
(11 rows)

My conclusion is all of these are nonsensical subjects when applied to the \D regex.

Out of the subjects with actual digit-sequences,
none of them starts with newlines,
so including newlines in \D wouldn't cause any effect.

I see no benefit, but also no harm, in including newlines.

Pattern 2: [\D|\d]*
===============

This looks similar to [\w\W]; the author has probably not understood that a pipe ("|") is not needed between bracket expression parts. The author's intention is probably to match everything in the string, like .*, but including newlines.

Patch 0002 therefore gets +1 due to this example.
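
Both points are easy to check in an engine where \D already includes newline. The sketch below uses Python's `re` as that stand-in engine (an assumption of this example); the sample strings are invented.

```python
import re

# Inside brackets the pipe is an ordinary character, not alternation.
pipe_is_literal = bool(re.fullmatch(r"[a|b]", "|"))

# \D together with \d covers every character, newline included ...
everything = bool(re.fullmatch(r"[\D\d]*", "a|1\nb"))
# ... whereas a plain .* stops at the newline by default.
dot_star = bool(re.fullmatch(r".*", "a|1\nb"))

print(pipe_is_literal, everything, dot_star)
```

So the pipe in [\D|\d] adds nothing, and the construct only works as a "match anything" idiom if \D matches newline.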

===END OF PATTERNS===

My final conclusion is we should always include newlines in \D.

/Joel

#9 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#6)
Re: Bizarre behavior of \w in a regular expression bracket construct

On 2021-Feb-23, Tom Lane wrote:

* Create infrastructure to allow treating \w as a character class
in its own right. (I did not expose [[:word:]] as a class name,
though it would be a little more symmetric to do so; should we?)

Apparently [:word:] is a GNU extension (or at least a "bash-specific
character class"[1] but apparently Emacs also supports it?); all the
others are mandated by POSIX[2].

[1]: https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions
[2]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05

I think it'd be fine to expose [:word:] ...

[1] https://www.regular-expressions.info/charclasssubtract.html

I had never heard of this subtraction thing. Nightmarish and confusing
syntax, but useful.

+    Also, the character class shorthands <literal>\D</literal>
+    and <literal>\W</literal> will match a newline regardless of this mode.
+    (Before <productname>PostgreSQL</productname> 14, they did not match
+    newlines in newline-sensitive mode.)

This seems an acceptable change to me, but then I only work here.

--
Álvaro Herrera 39°49'30"S 73°17'W

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#7)
Re: Bizarre behavior of \w in a regular expression bracket construct

"Joel Jacobson" <joel@compiler.org> writes:

On Tue, Feb 23, 2021, at 18:15, Tom Lane wrote:

Perl and Javascript believe that \W and \D should match newlines
regardless of their 's' flag, so there's a case for changing
\W and \D to match newline regardless of our 'n' flag. 0002
attached is the quite trivial patch to do this. I'm not quite
100% convinced whether this is a good change to make, but if we're
going to do it now would be the time.

[ extensive analysis ]
My opinion is therefore we should change \W to include newlines.

Wow, thanks for doing all that work! But OTOH, looking at a
corpus taken from Javascript practice seems like it'd inevitably
lead to that conclusion, since that is what \W does in Javascript.
Whether the regex authors knew the exact rules or not (and I share
your suspicions that some of them didn't), if they'd done any
testing they'd have been led to write their code that way.

Still, I am not convinced that there's much to justify our current
definition either. Looking at the existing code shows that the way
\W and \D work now was forced by Spencer's decision to make 'n' mode
affect complemented character classes in general, since they're just
macros for complemented character classes. With this reimplementation,
that connection isn't there anymore, so we can change it if we like.

Since (AFAICS) the main use of 'n' mode is to make our regexes work
more like these other products, bringing \W and \D into line with
them seems like a reasonable thing to do.

I've also decided after reflection that the patch should indeed
create a named "word" character class. That's allowed per POSIX,
and it simplifies some aspects of the documentation, since we can
rely on referencing the class instead of repeating ourselves.
The attached 0001 v2 does that; it's otherwise the same as before.

Speaking of documentation, I'm wondering more and more why we're
continuing to carry along re_syntax.n. We don't expose that to
users in any way, and it has not been maintained nearly as faithfully
as the SGML docs. (Looking at the git history, I think I included
it in 7bcc6d98f because it replaced re_format.7, which had been there
in that directory since Postgres95. But that history is immaterial
now that we've got proper user-facing documentation.)

regards, tom lane

#text/x-diff; name="0001-rework-char-class-escapes-2.patch" [0001-rework-char-class-escapes-2.patch] /home/tgl/pgsql/0001-rework-char-class-escapes-2.patch
#text/x-diff; name="0002-DW-always-match-newline.patch" [0002-DW-always-match-newline.patch] /home/tgl/pgsql/0002-DW-always-match-newline.patch

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#9)
Re: Bizarre behavior of \w in a regular expression bracket construct

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

On 2021-Feb-23, Tom Lane wrote:

* Create infrastructure to allow treating \w as a character class
in its own right. (I did not expose [[:word:]] as a class name,
though it would be a little more symmetric to do so; should we?)

Apparently [:word:] is a GNU extension (or at least a "bash-specific
character class"[1] but apparently Emacs also supports it?); all the
others are mandated by POSIX[2].
I think it'd be fine to expose [:word:] ...

Yeah, I'd independently come to the same conclusion. This GNU precedent
offers even more basis for that, though.

regards, tom lane

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#10)
Re: Bizarre behavior of \w in a regular expression bracket construct

I wrote:

I've also decided after reflection that the patch should indeed
create a named "word" character class. That's allowed per POSIX,
and it simplifies some aspects of the documentation, since we can
rely on referencing the class instead of repeating ourselves.
The attached 0001 v2 does that; it's otherwise the same as before.

Sigh, this time with the attachments ...

regards, tom lane

Attachments:

0001-rework-char-class-escapes-2.patch (text/x-diff, +680 -271)
0002-DW-always-match-newline.patch (text/x-diff, +21 -10)