regexp_positions()

Started by Joel Jacobsonover 5 years ago4 messageshackers

joel@compiler.org

over 5 years ago

Hi,

Finding all matches in a string is convenient using regexp_matches() with the 'g' flag.

But if instead wanting to know the start and end positions of the occurrences,
one would have to first call regexp_matches(...,'g') to get all matches,
and then iterate through the results and search using strpos() and length()
repeatedly to find all start and end positions.

Assuming regexp_matches() internally already have knowledge of the occurrences,
maybe we could add a regexp_ranges() function that returns a two-dimensional array,
with all the [[start,end], ...] positions?

Some other databases have a singular regexp_position() function,
that just returns the start positions for the first match.
but I don't think such function adds much value,
but if adding the pluralis one then maybe the singularis should be added as well,
for completeness, since we have array_position() and array_positions().

I just wanted to share this idea now since there is currently a lot of other awesome work on the regex engine,
and hear what others who are currently thinking a lot about regexes think of the idea.

/Joel

David Fetter

david@fetter.org

over 5 years ago

In reply to: Joel Jacobson (#1)

Re: regexp_positions()

On Sat, Feb 27, 2021 at 08:51:27PM +0100, Joel Jacobson wrote:

Hi,

Finding all matches in a string is convenient using regexp_matches() with the 'g' flag.

But if instead wanting to know the start and end positions of the occurrences,
one would have to first call regexp_matches(...,'g') to get all matches,
and then iterate through the results and search using strpos() and length()
repeatedly to find all start and end positions.

Assuming regexp_matches() internally already have knowledge of the occurrences,
maybe we could add a regexp_ranges() function that returns a two-dimensional array,
with all the [[start,end], ...] positions?

Maybe an int4multirange, which would fit unless I'm misunderstanding
g's meaning with respect to non-overlapping patterns, but that might
be a little too cute and not easy ever to extend.

Come to that, would a row structure that looked like

(match, start, end)

be useful?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Joel Jacobson

joel@compiler.org

over 5 years ago

In reply to: David Fetter (#2)

Re: regexp_positions()

Hi,

On Sun, Feb 28, 2021, at 03:13, David Fetter wrote:

Maybe an int4multirange, which would fit unless I'm misunderstanding
g's meaning with respect to non-overlapping patterns, but that might
be a little too cute and not easy ever to extend.

Come to that, would a row structure that looked like

(match, start, end)

be useful?

Nice, didn't know about the new multirange.
Its data structure seems like a perfect fit for this.

Hmm, I cannot find any catalog function to extract the ranges from the data structure though?
As a caller, I might need the exact start/end values,
not just wanting to know if a certain values overlaps any of the ranges.
Is there such a function?

Here is a PoC that just returns the start_pos and end_pos for all the matches.
It would be simple to modify it to instead return multirange.

CREATE OR REPLACE FUNCTION regexp_ranges(string text, pattern text, OUT start_pos integer, OUT end_pos integer)
RETURNS SETOF RECORD
LANGUAGE plpgsql
AS
$$
DECLARE
match text;
remainder text := string;
len integer;
BEGIN
end_pos := 0;
--
-- Ignore possible capture groups, instead just wrap the entire regex
-- in an enclosing capture group, which is then extracted as the first array element.
--
FOR match IN SELECT (regexp_matches(string,format('(%s)',pattern),'g'))[1] LOOP
len := length(match);
start_pos := position(match in remainder) + end_pos;
end_pos := start_pos + len - 1;
RETURN NEXT;
remainder := right(remainder, -len);
END LOOP;
RETURN;
END
$$;

This works fine for small strings:

SELECT * FROM regexp_ranges('aaaa aa aaa','a+');
start_pos | end_pos
-----------+---------
1 | 4
6 | 7
10 | 12
(3 rows)

Time: 0.209 ms

But quickly gets slow for longer strings:

SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',10000),'a+');
20001
Time: 98.663 ms

SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',20000),'a+');
40001
Time: 348.027 ms

SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',30000),'a+');
60001
Time: 817.412 ms

SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',40000),'a+');
80001
Time: 1478.438 ms (00:01.478)

Compared to the much nicer observed O-notation for regexp_matches():

SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',10000),'(a+)','g');
20001
Time: 12.602 ms

SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',20000),'(a+)','g');
40001
Time: 25.161 ms

SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',30000),'(a+)','g');
60001
Time: 44.795 ms

SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',40000),'(a+)','g');
80001
Time: 57.292 ms

/Joel

Joel Jacobson

joel@compiler.org

over 5 years ago

In reply to: Joel Jacobson (#3)

Re: regexp_positions()

I had a bug in the function, and I see I also accidentally renamed it to regexp_ranges().

Attached is a fixed version of the PoC.

This function is e.g. useful when we're interested in patterns in meta-data,
where we're not actually finding patterns in the data,
but in a string where each character corresponds to an element
in an array, containing the actual data.

In such case, we need to know the positions of the matches,
since they tell what corresponding array elements that matched.

For instance, let's take the UNIX diff tool we all know as an example.

Let's say you have all the raw diff lines stored in a text[] array,
and we want to produce a unified diff, by finding hunks
with up to 3 unchanged lines before/after each hunk
containing changes.

If we produce a text string containing one character per diff line,
using "=" for unchanged, "+" for addition, "-" for deletion.

Example: =====-=======+=====-+======

We could then find the hunks using this regex:

(={0,3}[-+]+={0,3})+

using regexp_positions() to find the start and end positions for each hunk:

SELECT * FROM regexp_positions('=====-=======+=====-+======','(={0,3}[-+]+={0,3})+');
start_pos | end_pos
-----------+---------
3 | 9
11 | 24
(2 rows)

/Joel

regexp_positions()

Attachments: