BUG #18708: regex problem: (?:[^\d\D]){0} asserts with "lp->nouts == 0 && rp->nins == 0"
The following bug has been logged on the website:
Bug reference: 18708
Logged by: Nikolay Shaplov (PostgresPro)
Email address: dhyan@nataraj.su
PostgreSQL version: 16.4
Operating system: Debian 12
Description:
Hi!
We've found a bug:
If you run
SELECT '' ~ '(?:[^\d\D]){0}';
it will assert with "lp->nouts == 0 && rp->nins == 0"
This behavior have been introduced in 2a0af7fe460 commit.
This bug have been found while fuzzing jsonpath_in function, and then
narrowed down to regex problem. Thanks to Andrey Bille for help with
narrowing sample down and finding commit that caused the problem.
Hi Nikolay,
If you run
SELECT '' ~ '(?:[^\d\D]){0}';
it will assert with "lp->nouts == 0 && rp->nins == 0"
This behavior have been introduced in 2a0af7fe460 commit.
This bug have been found while fuzzing jsonpath_in function, and then
narrowed down to regex problem. Thanks to Andrey Bille for help with
narrowing sample down and finding commit that caused the problem.
Thanks for the report. I can reproduce it with 18devel too:
```
#6 0x00005906af1a1e5b in delsub (nfa=0x5906b179f8b0,
lp=0x5906b179fd88, rp=0x5906b179fdc0) at
../src/backend/regex/regc_nfa.c:1292
1292 assert(lp->nouts == 0 && rp->nins == 0); /* did the job */
(gdb) p *lp
$1 = {no = 4, flag = 0 '\000', nins = 1, nouts = 0, ins =
0x5906b17a3418, outs = 0x0, tmp = 0x0, next = 0x5906b179fdc0, prev =
0x5906b179fd50}
(gdb) p *rp
$2 = {no = 5, flag = 0 '\000', nins = 1, nouts = 1, ins =
0x5906b17a34f0, outs = 0x5906b17a3460, tmp = 0x5906b179fdc0, next =
0x5906b179fe30,
prev = 0x5906b179fd88}
```
I wonder if the Assert is just wrong or if it's more complicated than that.
For the record:
([^\d\D]){0} - OK
(?:[^\d\D]){1} - OK
(?:[^\D]){0} - OK
(?:[^\d]){0} - OK
'(?:[^\d\D]){0}' - FAIL
The value of the left argument of the `~` operator is not important.
Thoughts?
--
Best regards,
Aleksander Alekseev
Hi,
I wonder if the Assert is just wrong or if it's more complicated than that.
For the record:
([^\d\D]){0} - OK
(?:[^\d\D]){1} - OK
(?:[^\D]){0} - OK
(?:[^\d]){0} - OK
'(?:[^\d\D]){0}' - FAIL
Well, removing the assert helps, and the regex seems to work correctly
after that.
This however is almost certainly not a correct fix (the assert is
right about lp->nouts == 0 it's only unhappy about rp->nins != 0) and
a second opinion is certainly needed since I'm looking at
src/backend/regex/regc_nfa.c for the first time in my life :)
--
Best regards,
Aleksander Alekseev
PG Bug reporting form <noreply@postgresql.org> writes:
If you run
SELECT '' ~ '(?:[^\d\D]){0}';
it will assert with "lp->nouts == 0 && rp->nins == 0"
This behavior have been introduced in 2a0af7fe460 commit.
Thanks for the report --- I'll dig into this later.
regards, tom lane
Aleksander Alekseev <aleksander@timescale.com> writes:
If you run
SELECT '' ~ '(?:[^\d\D]){0}';
it will assert with "lp->nouts == 0 && rp->nins == 0"
I wonder if the Assert is just wrong or if it's more complicated than that.
Interesting example. I think the bug's actually in my commit
08c0d6ad6 (Invent "rainbow" arcs), although it is unreachable
before 2a0af7fe4. What's happening is:
* "\d\D" can match any character whatsoever. optimizebracket()
notices this and replaces a sheaf of NFA arcs with a single
RAINBOW arc, which is the same representation as for ".".
* Then we apply the complementation process in cbracket(),
and that calls colorcomplement() which does this:
/* A RAINBOW arc matches all colors, making the complement empty */
if (findarc(of, PLAIN, RAINBOW) != NULL)
return;
We thus end up with no arcs from the bracket expression's start
state to its end state. This is not wrong in itself: "you can't
get there from here" is a perfectly reasonable representation
of [^\d\D], which cannot match any character.
* However, the (?: ... ){0} superstructure around that eventually
leads us to
/* annoying special case: {0} or {0,0} cancels everything */
if (m == 0 && n == 0)
{
...
/* Otherwise, we can clean up any subre infrastructure we made */
if (atom != NULL)
freesubre(v, atom);
delsub(v->nfa, lp, rp);
}
which is trying to throw away the detritus from inside the quantified
subexpression. But delsub fails to remove it all, because there's no
path between the two specified states, and its assert notices that.
This is, I believe, not harmful in itself: the leftover unreachable
states/arcs will be cleaned up later, and we end with a valid NFA.
So, as you noted, removing that assert would suppress all visible
symptoms of the problem. But I really don't want to do that; this
code is complicated and we need all the help we can get to find
bugs in it.
So I think the right thing to do is to fix colorcomplement() so
that it still produces some arc between its "from" and "to"
states, keeping the graph connected until we finish parsing the
regex. It has to be a dummy arc that can't match anything, of
course. We can throw away such an arc once we get to regex
optimization, because we don't need delsub() to work anymore.
In the attached draft patch I represented the dummy arc as a PLAIN
arc with a new fake color BLACK. Another way to do it could be to
introduce a new arc "type" value, but that seemed like it would
involve touching more code.
This is admittedly a lot more code than removing one assert,
but I think we'll regret it if we allow delsub failures to go
undetected.
regards, tom lane
Attachments:
fix-bug-18708.patchtext/x-diff; charset=us-ascii; name=fix-bug-18708.patchDownload+75-5
I wrote:
So I think the right thing to do is to fix colorcomplement() so
that it still produces some arc between its "from" and "to"
states, keeping the graph connected until we finish parsing the
regex. It has to be a dummy arc that can't match anything, of
course. We can throw away such an arc once we get to regex
optimization, because we don't need delsub() to work anymore.
In the attached draft patch I represented the dummy arc as a PLAIN
arc with a new fake color BLACK. Another way to do it could be to
introduce a new arc "type" value, but that seemed like it would
involve touching more code.
After thinking awhile longer, I decided to try it that way (with
a new arc type), and I believe I like the result better after all.
It's just a few lines more code, and I think it's more robust.
Up to now PLAIN arcs always matched exactly one character,
so the first version was putting a major dent in their semantics.
We got rid of the bogus arcs before doing anything that really leans
hard on arc semantics, but still it seems messy. Hence v2 attached.
I'm not especially in love with the CANTMATCH arc type name, but
other possibilities such as DUMMY or NOMATCH felt too generic.
Better ideas anyone?
regards, tom lane
Attachments:
v2-fix-bug-18708.patchtext/x-diff; charset=us-ascii; name=v2-fix-bug-18708.patchDownload+79-1
Hi,
I reviewed, applied and tested patch v2 on Linux x64 and MacOS x64. LGTM.
I'm not especially in love with the CANTMATCH arc type name, but
other possibilities such as DUMMY or NOMATCH felt too generic.
Better ideas anyone?
Well... at least it's consistent with the current naming e.g.
HASCANTMATCH. I suggest keeping it as is.
--
Best regards,
Aleksander Alekseev
Hi,
Well... at least it's consistent with the current naming e.g.
HASCANTMATCH. I suggest keeping it as is.
Oh I wrote something stupid, HASCANTMATCH is part of the patch.
In any case, IMO the name is OK.
--
Best regards,
Aleksander Alekseev
Aleksander Alekseev <aleksander@timescale.com> writes:
In any case, IMO the name is OK.
I've not thought of a better name since yesterday, so pushed.
Thanks for reviewing!
regards, tom lane
Hello Tom,
16.11.2024 03:08, Tom Lane wrote:
I've not thought of a better name since yesterday, so pushed.
Thanks for reviewing!
Please look at the hornet's failure on processing a test query added here:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2024-11-15%2023%3A49%3A38
Best regards,
Alexander
Alexander Lakhin <exclusion@gmail.com> writes:
Please look at the hornet's failure on processing a test query added here:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2024-11-15%2023%3A49%3A38
Hmm... for the archives' sake, that looks like
select * from test_regex('[^\\d\\D]', '0123456789abc*', 'ILPE');
- test_regex
- --------------------------------------------------------
- {0,REG_UBBS,REG_UNONPOSIX,REG_ULOCALE,REG_UIMPOSSIBLE}
- (1 row)
-
+ ERROR: invalid regular expression: out of memory
-- check char classes' handling of newlines
I'm not sure what to make of it. That regex shouldn't consume very
much memory. To confirm that, I stepped through it and found that
newstate() is reached 14 times and newarc() 35 times. That's a
pretty tiny amount of memory, and there are other regexps in the
tests that are far larger.
Moreover, no other animal has shown this, including hornet itself
on the v16 branch. (It's only run this test in v15 and v16 so far,
so that's not a lot of data points.)
I'm inclined to guess this was some weird momentary glitch. If it
reproduces then I'll look closer.
regards, tom lane
16.11.2024 19:02, Tom Lane wrote:
Moreover, no other animal has shown this, including hornet itself
on the v16 branch. (It's only run this test in v15 and v16 so far,
so that's not a lot of data points.)I'm inclined to guess this was some weird momentary glitch. If it
reproduces then I'll look closer.
(Un)fortunately, tern (which is also a ppc animal) has produced the same
failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2024-11-16%2022%3A00%3A12
Best regards,
Alexander
Alexander Lakhin <exclusion@gmail.com> writes:
16.11.2024 19:02, Tom Lane wrote:
I'm inclined to guess this was some weird momentary glitch. If it
reproduces then I'll look closer.
(Un)fortunately, tern (which is also a ppc animal) has produced the same
failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2024-11-16%2022%3A00%3A12
Yeah, I saw that. Even more confused now about what it could be.
regards, tom lane
I wrote:
Alexander Lakhin <exclusion@gmail.com> writes:
(Un)fortunately, tern (which is also a ppc animal) has produced the same
failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2024-11-16%2022%3A00%3A12
Yeah, I saw that. Even more confused now about what it could be.
After testing on hornet's host, it seems that this is a pre-existing
issue that we didn't happen to hit before. Since the regex
'[^\d\D]' is unsatisfiable, it collapses to nothing (start state,
end state, and no arcs) in the first cleanup() call in optimize().
Then fixempties() counts the number of in-arcs and gets zero,
and then it does
arcarray = (struct arc **) MALLOC(totalinarcs * sizeof(struct arc *));
if (arcarray == NULL)
{
NERR(REG_ESPACE);
...
On a machine where malloc(0) returns NULL, this mistakenly
thinks that's an error.
I verified that
- if (arcarray == NULL)
+ if (arcarray == NULL && totalinarcs != 0)
makes the failure go away, but I wonder if any other places in
backend/regex/ are at the same hazard. Maybe the smartest fix
would be to put in a wrapper layer that does what pg_malloc
does:
/* Avoid unportable behavior of malloc(0) */
if (size == 0)
size = 1;
One other point is that this theory fails to explain why
hornet didn't fail in the v16 branch ... oh, wait:
v15 has
#define MALLOC(n) malloc(n)
where later branches have
#define MALLOC(n) palloc_extended((n), MCXT_ALLOC_NO_OOM)
So the right answer seems to be to figure out why we didn't
back-patch that change.
regards, tom lane
On Sun, Nov 17, 2024 at 01:26:38AM -0500, Tom Lane wrote:
arcarray = (struct arc **) MALLOC(totalinarcs * sizeof(struct arc *));
if (arcarray == NULL)
{
NERR(REG_ESPACE);
...On a machine where malloc(0) returns NULL, this mistakenly
thinks that's an error.I verified that
- if (arcarray == NULL) + if (arcarray == NULL && totalinarcs != 0)makes the failure go away, but I wonder if any other places in
backend/regex/ are at the same hazard. Maybe the smartest fix
would be to put in a wrapper layer that does what pg_malloc
does:/* Avoid unportable behavior of malloc(0) */
if (size == 0)
size = 1;
Either of those sound reasonable. The consequence of missing this hazard, a
deterministic ERROR, is modest. This affects just one platform, in the oldest
branches. There's a lack of complaints. To me, all that would make the
one-line diff tempting.
One other point is that this theory fails to explain why
hornet didn't fail in the v16 branch ... oh, wait:
v15 has#define MALLOC(n) malloc(n)
where later branches have
#define MALLOC(n) palloc_extended((n), MCXT_ALLOC_NO_OOM)
So the right answer seems to be to figure out why we didn't
back-patch that change.
I don't recall a specific reason or see one in the discussion of commit
bea3d7e38. It was done mainly to unblock commit db4f21e, which in turn
unblocked commit 0da096d. The last commit is heavy, so I can understand it
skipping the back branches. If I were making a (weak) argument against
back-patching bea3d7e38, I might cite the extra memory use from
RegexpCacheMemoryContext and children.
Noah Misch <noah@leadboat.com> writes:
On Sun, Nov 17, 2024 at 01:26:38AM -0500, Tom Lane wrote:
arcarray = (struct arc **) MALLOC(totalinarcs * sizeof(struct arc *));
if (arcarray == NULL)
On a machine where malloc(0) returns NULL, this mistakenly
thinks that's an error.
Either of those sound reasonable. The consequence of missing this hazard, a
deterministic ERROR, is modest. This affects just one platform, in the oldest
branches. There's a lack of complaints. To me, all that would make the
one-line diff tempting.
I dug through the MALLOC and REALLOC calls in backend/regex/ and
convinced myself that this is the only one that's at risk. (Some
of those conclusions depend on the assumption that a regex NFA
never has nstates == 0, but I think that's okay: we create start
and end states to begin with and never remove them.) So the
one-liner fix is looking attractive. I'd prefer a malloc wrapper
for future-proofing if this code were likely to receive a lot of
churn in the pre-v16 branches, but that seems pretty improbable
at this point.
So the right answer seems to be to figure out why we didn't
back-patch that change.
I don't recall a specific reason or see one in the discussion of commit
bea3d7e38. It was done mainly to unblock commit db4f21e, which in turn
unblocked commit 0da096d. The last commit is heavy, so I can understand it
skipping the back branches. If I were making a (weak) argument against
back-patching bea3d7e38, I might cite the extra memory use from
RegexpCacheMemoryContext and children.
I think just on minimum-risk grounds, I wouldn't consider
back-patching bea3d7e38. I had more in mind a bespoke
three-line malloc wrapper function. But the one-line fix
seems sufficient for the problem at hand.
regards, tom lane