Test harness for regex code (to allow importing Tcl's test suite)

Started by Tom Laneover 5 years ago6 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

Over the holiday break I've been fooling with some regex performance
improvements. I don't have anything ready to show yet in that line,
but I was feeling the need for more-thorough test coverage, so I set
to work on something that's been in the back of my mind for a long
time: we need to absorb the test cases that Henry Spencer wrote way-
back-when for that code, which up to now existed only as a script
in the Tcl test suite. That state of affairs seemed OK to me in
the beginning, when we thought of Tcl as the upstream for that code,
and figured they'd vet any nontrivial changes. They've pretty
thoroughly dropped the ball on that though, and indeed I think that
they now believe *we're* the upstream. So we need to have a test
suite that reflects that status, at least to the extent of running
all the test cases that exist for that code.

Accordingly, here's a new src/test/modules package that creates a
function modeled on regexp_matches(), but using a set of flags that
matches Spencer's design for the Tcl test suite, allowing parts
of src/backend/regex/ to be tested that can't be reached with our
existing SQL-exposed functions. The test scripts in the module
reproduce all the tests in Tcl's "tests/reg.test" script as of
Tcl 8.6.10, plus a few others that I felt advisable, such as tests
for the lookbehind constraints we added a few years ago. (Note:
Tcl also has regexp.test and regexpComp.test, but those seem to be
oriented towards testing their language-specific wrappers not the
regex engine itself.)

According to my testing, this increases our test code coverage for
src/backend/regex/ from 71.1% to 86.7%, which is not too shabby,
especially seeing that a lot of the remainder is not-deterministically-
reachable code for malloc failure handling.

Thoughts? Is anyone interested in reviewing this? Since it's only
test code, I'd be okay with pushing it without review, but I'd be
happy if someone else wants to look at it.

regards, tom lane

Attachments:

add-test_regex-module-1.patchtext/x-diff; charset=utf-8; name=add-test_regex-module-1.patchDownload+7264-0
#2Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#1)
Re: Test harness for regex code (to allow importing Tcl's test suite)

On Jan 3, 2021, at 7:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Over the holiday break I've been fooling with some regex performance
improvements. I don't have anything ready to show yet in that line,
but I was feeling the need for more-thorough test coverage, so I set
to work on something that's been in the back of my mind for a long
time: we need to absorb the test cases that Henry Spencer wrote way-
back-when for that code, which up to now existed only as a script
in the Tcl test suite. That state of affairs seemed OK to me in
the beginning, when we thought of Tcl as the upstream for that code,
and figured they'd vet any nontrivial changes. They've pretty
thoroughly dropped the ball on that though, and indeed I think that
they now believe *we're* the upstream. So we need to have a test
suite that reflects that status, at least to the extent of running
all the test cases that exist for that code.

Accordingly, here's a new src/test/modules package that creates a
function modeled on regexp_matches(), but using a set of flags that
matches Spencer's design for the Tcl test suite, allowing parts
of src/backend/regex/ to be tested that can't be reached with our
existing SQL-exposed functions. The test scripts in the module
reproduce all the tests in Tcl's "tests/reg.test" script as of
Tcl 8.6.10, plus a few others that I felt advisable, such as tests
for the lookbehind constraints we added a few years ago. (Note:
Tcl also has regexp.test and regexpComp.test, but those seem to be
oriented towards testing their language-specific wrappers not the
regex engine itself.)

According to my testing, this increases our test code coverage for
src/backend/regex/ from 71.1% to 86.7%, which is not too shabby,
especially seeing that a lot of the remainder is not-deterministically-
reachable code for malloc failure handling.

Thoughts? Is anyone interested in reviewing this? Since it's only
test code, I'd be okay with pushing it without review, but I'd be
happy if someone else wants to look at it.

I've quickly read this over and generally like it. Thanks for working on this!

Have you thought about whether If it weren't in test/modules, it might be nice to expose test_regex from SQL with a slightly different interface that doesn't throw on regex compilation error? Maybe something in contrib? It might be useful to some users to validate regular expressions. I'm just asking... I don't have any problem with how you have it here.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#2)
Re: Test harness for regex code (to allow importing Tcl's test suite)

Mark Dilger <mark.dilger@enterprisedb.com> writes:

Have you thought about whether If it weren't in test/modules, it might be nice to expose test_regex from SQL with a slightly different interface that doesn't throw on regex compilation error? Maybe something in contrib? It might be useful to some users to validate regular expressions. I'm just asking... I don't have any problem with how you have it here.

Hm. There isn't anything about test_regex()'s API that I would want
to expose to end users ;-) ... it's just designed to replicate the
test API that Henry Spencer designed a couple decades ago, which IMO
was not too clean even by the standards of the time.

However, I can see the argument for a function along the lines of
"validate_regex(text) returns bool". If you or somebody else wants to
write that, independently of this patch, I think it'd be a fine idea.

regards, tom lane

#4Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#1)
Re: Test harness for regex code (to allow importing Tcl's test suite)

On Mon, Jan 4, 2021, at 04:49, Tom Lane wrote:

Over the holiday break I've been fooling with some regex performance
improvements.

Cool! I've also been fooling with regex performance over the years myself, not in the PostgreSQL code, but in general.

More specifically, to first DFA-minimize the regex,
and then to generate LLVMIR directly from the graph.

Perhaps some of the ideas could be interesting to look at.

Here is a live demo: https://compiler.org/reason-re-nfa/src/index.html

One idea that I came up with myself is the "merge_linear" step,
where when possible, multiple characters are read in the same operation.
Not sure if other regex JIT engines does this, but it makes quite a difference
for large regexes where you have long strings.

Note: There is no support for capture groups, back-references, etc, but | + * () [] [^] works.

/Joel

#5Alexander Lakhin
exclusion@gmail.com
In reply to: Tom Lane (#3)
Re: Test harness for regex code (to allow importing Tcl's test suite)

Hello Tom,
04.01.2021 08:47, Tom Lane wrote:

Hm. There isn't anything about test_regex()'s API that I would want
to expose to end users ;-) ... it's just designed to replicate the
test API that Henry Spencer designed a couple decades ago, which IMO
was not too clean even by the standards of the time.

As test_regex() is not meant to be public, maybe this is of little
importance, but I've found another bug when exploiting the new test module:
select * from test_regex(repeat('(x)', 32), 'a', 'c');
leads to
==00:00:00:05.736 2605072== Invalid write of size 4
==00:00:00:05.736 2605072==    at 0x4866D09: setup_test_matches
(test_regex.c:564)
==00:00:00:05.736 2605072==    by 0x4867276: test_regex (test_regex.c:105)
==00:00:00:05.736 2605072==    by 0x37B10C: ExecMakeTableFunctionResult
(execSRF.c:234)
==00:00:00:05.736 2605072==    by 0x38AAA4: FunctionNext
(nodeFunctionscan.c:95)
==00:00:00:05.736 2605072==    by 0x37BA68: ExecScanFetch (execScan.c:133)
==00:00:00:05.736 2605072==    by 0x37BB05: ExecScan (execScan.c:182)
==00:00:00:05.736 2605072==    by 0x38A9CE: ExecFunctionScan
(nodeFunctionscan.c:270)
==00:00:00:05.736 2605072==    by 0x378E5D: ExecProcNodeFirst
(execProcnode.c:450)
==00:00:00:05.736 2605072==    by 0x372CBC: ExecProcNode (executor.h:247)
==00:00:00:05.736 2605072==    by 0x372CBC: ExecutePlan (execMain.c:1542)
==00:00:00:05.736 2605072==    by 0x372E22: standard_ExecutorRun
(execMain.c:364)
==00:00:00:05.736 2605072==    by 0x372EF2: ExecutorRun (execMain.c:308)
==00:00:00:05.736 2605072==    by 0x4E2B77: PortalRunSelect (pquery.c:912)
==00:00:00:05.736 2605072==  Address 0xee4d5ec is 908 bytes inside a
block of size 1,024 alloc'd
==00:00:00:05.736 2605072==    at 0x483B7F3: malloc (in
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==00:00:00:05.736 2605072==    by 0x62239C: AllocSetAlloc (aset.c:919)
==00:00:00:05.736 2605072==    by 0x629656: palloc0 (mcxt.c:995)
==00:00:00:05.736 2605072==    by 0x4866A06: setup_test_matches
(test_regex.c:450)
==00:00:00:05.736 2605072==    by 0x4867276: test_regex (test_regex.c:105)
==00:00:00:05.736 2605072==    by 0x37B10C: ExecMakeTableFunctionResult
(execSRF.c:234)
==00:00:00:05.736 2605072==    by 0x38AAA4: FunctionNext
(nodeFunctionscan.c:95)
==00:00:00:05.736 2605072==    by 0x37BA68: ExecScanFetch (execScan.c:133)
==00:00:00:05.736 2605072==    by 0x37BB05: ExecScan (execScan.c:182)
==00:00:00:05.736 2605072==    by 0x38A9CE: ExecFunctionScan
(nodeFunctionscan.c:270)
==00:00:00:05.736 2605072==    by 0x378E5D: ExecProcNodeFirst
(execProcnode.c:450)
==00:00:00:05.736 2605072==    by 0x372CBC: ExecProcNode (executor.h:247)
==00:00:00:05.736 2605072==    by 0x372CBC: ExecutePlan (execMain.c:1542)
==00:00:00:05.736 2605072==

Best regards,
Alexander

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Lakhin (#5)
Re: Test harness for regex code (to allow importing Tcl's test suite)

Alexander Lakhin <exclusion@gmail.com> writes:

select * from test_regex(repeat('(x)', 32), 'a', 'c');
leads to
==00:00:00:05.736 2605072== Invalid write of size 4

Indeed, silly oversight on my part. Fixed, thanks!

regards, tom lane