[PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Started by Joel Jacobsonalmost 5 years ago55 messages
#1Joel Jacobson
joel@compiler.org
1 attachment(s)

Hi,

I suggest adding a new function, regexp_positions(),
which works exactly like regexp_matches(),
except it returns int4range[] start/end positions for *where* the matches occurs.

I first thought I could live without this function,
and just get the positions using strpos(),
but as Andreas Karlsson kindly helped me understand,
that naive idea doesn't always work.

Andreas provided this pedagogic example
to demonstrate the problem:

SELECT regexp_matches('caaabaaa', '(?<=b)(a+)', 'g');
regexp_matches
----------------
{aaa}
(1 row)

If we would try to use strpos() to find the position,
based on the returned matched substring,
we would get the wrong answer:

SELECT strpos('caaabaaa','aaa');
strpos
--------
2
(1 row)

Sure, there is "aaa" at position 2,
but that's not where the match occurred,
since the (?<=b) means "positive lookbehind of the character b",
so the match actually occurred at position 6,
where there is a "b" before the "aaa".

Using regexp_positions(), we can now get the correct answer:

SELECT regexp_positions('caaabaaa', '(?<=b)(a+)', 'g');
regexp_positions
------------------
{"[6,9)"}
(1 row)

Some more examples from the regress/sql/strings.sql,
showing both regexp_matches() and regexp_positions()
for the same examples, as they both return the same structure,
but with different types:

SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
regexp_matches
----------------
{bar,beque}
{bazil,barf}
(2 rows)

SELECT regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
regexp_positions
-----------------------
{"[4,7)","[7,12)"}
{"[12,17)","[17,21)"}
(2 rows)

I've added documentation and tests.

Forgive me for just sending a patch without much discussion on the list,
but it was so easy to implement, so I thought an implementation can
help the discussion on if this is something we want or not.

A few words on the implementation:
I copied build_regexp_match_result() to a new function build_regexp_positions_result(),
and removed the string parts, replacing it with code to make ranges instead.
Maybe there are common parts that could be put into some helper-function,
but I think not, since the functions are two small for it to be worth it.

Thanks to David Fetter for the idea on using ranges.

Based on HEAD (f5a5773a9dc4185414fe538525e20d8512c2ba35).

/Joel

Attachments:

0001-regexp-positions.patchapplication/octet-stream; name=0001-regexp-positions.patchDownload
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 08f08322ca..34a338fb21 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3140,6 +3140,29 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
       </row>
 
       <row>
+        <entry role="func_table_entry"><para role="func_signature">
+         <indexterm>
+          <primary>regexp_positions</primary>
+         </indexterm>
+         <function>regexp_positions</function> ( <parameter>string</parameter> <type>text</type>, <parameter>pattern</parameter> <type>text</type> [, <parameter>flags</parameter> <type>text</type> ] )
+         <returnvalue>setof int4range[]</returnvalue>
+        </para>
+        <para>
+         Returns start and end positions of captured substring(s) resulting from matching a POSIX regular
+         expression to the <parameter>string</parameter>; see
+         <xref linkend="functions-posix-regexp"/>.
+        </para>
+        <para>
+         <literal>regexp_positions('foobarbequebaz', 'ba.', 'g')</literal>
+         <returnvalue></returnvalue>
+ <programlisting>
+  {"[4,7)"}
+  {"[12,15)"}
+ </programlisting>
+        </para></entry>
+       </row>
+ 
+       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm>
          <primary>regexp_replace</primary>
diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index a32c5c82ab..fde3d1b80f 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -37,6 +37,7 @@
 #include "utils/builtins.h"
 #include "utils/memutils.h"
 #include "utils/varlena.h"
+#include "utils/rangetypes.h"
 
 #define PG_GETARG_TEXT_PP_IF_EXISTS(_n) \
 	(PG_NARGS() > (_n) ? PG_GETARG_TEXT_PP(_n) : NULL)
@@ -118,6 +119,7 @@ static regexp_matches_ctx *setup_regexp_matches(text *orig_str, text *pattern,
 												bool ignore_degenerate,
 												bool fetching_unmatched);
 static ArrayType *build_regexp_match_result(regexp_matches_ctx *matchctx);
+static ArrayType *build_regexp_positions_result(regexp_matches_ctx *matchctx);
 static Datum build_regexp_split_result(regexp_matches_ctx *splitctx);
 
 
@@ -1056,6 +1058,58 @@ regexp_matches(PG_FUNCTION_ARGS)
 	SRF_RETURN_DONE(funcctx);
 }
 
+/*
+ * regexp_positions()
+ *		Return a table of ranges where a pattern matches within a string.
+ */
+Datum
+regexp_positions(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	regexp_matches_ctx *matchctx;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		text	   *pattern = PG_GETARG_TEXT_PP(1);
+		text	   *flags = PG_GETARG_TEXT_PP_IF_EXISTS(2);
+		pg_re_flags re_flags;
+		MemoryContext oldcontext;
+
+		funcctx = SRF_FIRSTCALL_INIT();
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* Determine options */
+		parse_re_flags(&re_flags, flags);
+
+		/* be sure to copy the input string into the multi-call ctx */
+		matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,
+										&re_flags,
+										PG_GET_COLLATION(),
+										true, false, false);
+
+		/* Pre-create workspace that build_regexp_match_result needs */
+		matchctx->elems = (Datum *) palloc(sizeof(Datum) * matchctx->npatterns);
+		matchctx->nulls = (bool *) palloc(sizeof(bool) * matchctx->npatterns);
+
+		MemoryContextSwitchTo(oldcontext);
+		funcctx->user_fctx = (void *) matchctx;
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	matchctx = (regexp_matches_ctx *) funcctx->user_fctx;
+
+	if (matchctx->next_match < matchctx->nmatches)
+	{
+		ArrayType  *result_ary;
+
+		result_ary = build_regexp_positions_result(matchctx);
+		matchctx->next_match++;
+		SRF_RETURN_NEXT(funcctx, PointerGetDatum(result_ary));
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
 /* This is separate to keep the opr_sanity regression test from complaining */
 Datum
 regexp_matches_no_flags(PG_FUNCTION_ARGS)
@@ -1063,6 +1117,13 @@ regexp_matches_no_flags(PG_FUNCTION_ARGS)
 	return regexp_matches(fcinfo);
 }
 
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_positions_no_flags(PG_FUNCTION_ARGS)
+{
+	return regexp_positions(fcinfo);
+}
+
 /*
  * setup_regexp_matches --- do the initial matching for regexp_match
  *		and regexp_split functions
@@ -1332,6 +1393,64 @@ build_regexp_match_result(regexp_matches_ctx *matchctx)
 							  TEXTOID, -1, false, TYPALIGN_INT);
 }
 
+/*
+ * build_regexp_positions_result - build output array for current match
+ */
+static ArrayType *
+build_regexp_positions_result(regexp_matches_ctx *matchctx)
+{
+	Datum	   *elems = matchctx->elems;
+	bool	   *nulls = matchctx->nulls;
+	int			dims[1];
+	int			lbs[1];
+	int			loc;
+	int			i;
+	RangeType  *range;
+	TypeCacheEntry *typcache;
+	RangeBound	lower;
+	RangeBound	upper;
+
+	typcache = lookup_type_cache(INT4RANGEOID, TYPECACHE_RANGE_INFO);
+
+	/* Extract matching substrings from the original string */
+	loc = matchctx->next_match * matchctx->npatterns * 2;
+	for (i = 0; i < matchctx->npatterns; i++)
+	{
+		int			so = matchctx->match_locs[loc++];
+		int			eo = matchctx->match_locs[loc++];
+
+		if (so < 0 || eo < 0)
+		{
+			elems[i] = (Datum) 0;
+			nulls[i] = true;
+		}
+		else
+		{
+			lower.val = Int32GetDatum(so + 1);
+			lower.infinite = false;
+			lower.inclusive = true;
+			lower.lower = true;
+
+			upper.val = Int32GetDatum(eo);
+			upper.infinite = false;
+			upper.inclusive = true;
+			upper.lower = false;
+
+			range = make_range(typcache, &lower, &upper, false);
+
+			elems[i] = RangeTypePGetDatum(range);
+			nulls[i] = false;
+		}
+	}
+
+	/* And form an array */
+	dims[0] = matchctx->npatterns;
+	lbs[0] = 1;
+	/* XXX: this hardcodes assumptions about the text type */
+	return construct_md_array(elems, nulls, 1, dims, lbs,
+							  INT4RANGEOID, -1, false, TYPALIGN_INT);
+}
+
 /*
  * regexp_split_to_table()
  *		Split the string at matches of the pattern, returning the
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 1487710d59..e2e76935a0 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3557,6 +3557,14 @@
   proname => 'regexp_matches', prorows => '10', proretset => 't',
   prorettype => '_text', proargtypes => 'text text text',
   prosrc => 'regexp_matches' },
+{ oid => '8104', descr => 'find matching position(s) for regexp',
+  proname => 'regexp_positions', prorows => '10', proretset => 't',
+  prorettype => '_int4range', proargtypes => 'text text text',
+  prosrc => 'regexp_positions' },
+{ oid => '8105', descr => 'find matching position(s) for regexp',
+  proname => 'regexp_positions', prorows => '10', proretset => 't',
+  prorettype => '_int4range', proargtypes => 'text text',
+  prosrc => 'regexp_positions_no_flags' },
 { oid => '2088', descr => 'split string by field_sep and return field_num',
   proname => 'split_part', prorettype => 'text',
   proargtypes => 'text text int4', prosrc => 'split_part' },
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index fb4573d85f..5071165cd3 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -601,6 +601,12 @@ SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$);
  {bar,beque}
 (1 row)
 
+SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
+  regexp_positions  
+--------------------
+ {"[4,7)","[7,12)"}
+(1 row)
+
 -- test case insensitive
 SELECT regexp_matches('foObARbEqUEbAz', $re$(bar)(beque)$re$, 'i');
  regexp_matches 
@@ -616,6 +622,13 @@ SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g')
  {bazil,barf}
 (2 rows)
 
+SELECT regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
+   regexp_positions    
+-----------------------
+ {"[4,7)","[7,12)"}
+ {"[12,17)","[17,21)"}
+(2 rows)
+
 -- empty capture group (matched empty string)
 SELECT regexp_matches('foobarbequebaz', $re$(bar)(.*)(beque)$re$);
  regexp_matches 
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index 57a48c9d0b..aa8b0553f0 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -198,12 +198,14 @@ SELECT regexp_replace('AAA aaa', 'A+', 'Z', 'z');
 
 -- return all matches from regexp
 SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$);
+SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
 
 -- test case insensitive
 SELECT regexp_matches('foObARbEqUEbAz', $re$(bar)(beque)$re$, 'i');
 
 -- global option - more than one match
 SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
+SELECT regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
 
 -- empty capture group (matched empty string)
 SELECT regexp_matches('foobarbequebaz', $re$(bar)(.*)(beque)$re$);
#2Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Joel Jacobson (#1)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Mar 1, 2021, at 9:38 AM, Joel Jacobson <joel@compiler.org> wrote:

Forgive me for just sending a patch without much discussion on the list,
but it was so easy to implement, so I thought an implementation can
help the discussion on if this is something we want or not.

I like the idea so I did a bit of testing. I think the following should not error, but does:

+SELECT regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
+ERROR:  range lower bound must be less than or equal to range upper bound


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Joel Jacobson
joel@compiler.org
In reply to: Mark Dilger (#2)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Mar 2, 2021, at 01:12, Mark Dilger wrote:

I like the idea so I did a bit of testing. I think the following should not error, but does:

+SELECT regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
+ERROR:  range lower bound must be less than or equal to range upper bound

Thanks for testing!

The bug is due to using an (inclusive,inclusive) range,
so for the example the code tried to construct a (7,6,'[]') range.

When trying to fix, I think I've found a general problem with ranges:

I'll use int4range() to demonstrate the problem:

First the expected error for what the patch tries to do internally using make_range().
This is all good:

# SELECT int4range(7,6,'[]');
ERROR: range lower bound must be less than or equal to range upper bound

I tried to fix this like this:

@ src/backend/utils/adt/regexp.c
-                       lower.val = Int32GetDatum(so + 1);
+                       lower.val = Int32GetDatum(so);
                        lower.infinite = false;
-                       lower.inclusive = true;
+                       lower.inclusive = false;
                        lower.lower = true;

Which would give the same result as doing:

SELECT int4range(6,6,'(]');
int4range
-----------
empty
(1 row)

Hmm. This "empty" value what surprise to me.
I would instead have assumed the canonical form "[7,7)".

If I wanted to know if the range is empty or not,
I would have guessed I should use the isempty() function, like this:

SELECT isempty(int4range(6,6,'(]'));
isempty
---------
t
(1 row)

Since we have this isempty() function, I don't see the point of discarding
the lower/upper vals, since they contain possibly interesting information
on where the empty range exists.

I find it strange two ranges of zero-length with different bounds are considered equal:

SELECT '[7,7)'::int4range = '[8,8)'::int4range;
?column?
----------
t
(1 row)

This seems like a bug to me. What am I missing here?

Unless fixed, then the way I see it, I don't think we can use int4range[] for regexp_positions(),
if we want to allow returning the positions for zero-length matches, which would be nice.

/Joel

#4Isaac Morland
isaac.morland@gmail.com
In reply to: Joel Jacobson (#3)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, 2 Mar 2021 at 00:06, Joel Jacobson <joel@compiler.org> wrote:

I find it strange two ranges of zero-length with different bounds are
considered equal:

SELECT '[7,7)'::int4range = '[8,8)'::int4range;
?column?
----------
t
(1 row)

This seems like a bug to me. What am I missing here?

Unless fixed, then the way I see it, I don't think we can use int4range[]
for regexp_positions(),
if we want to allow returning the positions for zero-length matches, which
would be nice.

Ranges are treated as sets. As such equality is defined by membership.

That being said, I agree that there may be situations in which it would be
convenient to have empty ranges at specific locations. Doing this would
introduce numerous questions which would have to be resolved. For example,
where/when is the empty range resulting from an intersection operation?

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#3)
Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

"Joel Jacobson" <joel@compiler.org> writes:

Unless fixed, then the way I see it, I don't think we can use int4range[] for regexp_positions(),

Yeah. It's a cute idea, but the semantics aren't quite right.

regards, tom lane

#6Joel Jacobson
joel@compiler.org
In reply to: Isaac Morland (#4)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Mar 2, 2021, at 06:22, Isaac Morland wrote:

On Tue, 2 Mar 2021 at 00:06, Joel Jacobson <joel@compiler.org> wrote:

I find it strange two ranges of zero-length with different bounds are considered equal:

SELECT '[7,7)'::int4range = '[8,8)'::int4range;
?column?
----------
t
(1 row)

This seems like a bug to me. What am I missing here?

Unless fixed, then the way I see it, I don't think we can use int4range[] for regexp_positions(),
if we want to allow returning the positions for zero-length matches, which would be nice.

Ranges are treated as sets. As such equality is defined by membership.

That being said, I agree that there may be situations in which it would be convenient to have empty ranges at specific locations. Doing this would introduce numerous questions which would have to be resolved. For example, where/when is the empty range resulting from an intersection operation?

Hmm, I think I would assume the intersection of two non-overlapping ranges to be isempty()=TRUE,
and for lower() and upper() to continue to return NULL.

But I think a zero-length range created with actual bounds should
return the lower() and upper() values during creation, instead of NULL.

I tried to find some other programming environments with range types.

The first one I found was Ada.

The below example is similar to int4range(7,6,'[]') which is invalid in PostgreSQL:

with Ada.Text_IO; use Ada.Text_IO;
procedure Hello is
type Foo is range 7 .. 6;
begin
Put_Line ( Foo'Image(Foo'First) );
Put_Line ( Foo'Image(Foo'Last) );
end Hello;

$ ./gnatmake hello.adb
$ ./hello
7
6

I Ada, the 'Range of the Empty_String is 1 .. 0
https://en.wikibooks.org/wiki/Ada_Programming/Types/array#Array_Attributes

I think there is a case for allowing access to the the lower/upper vals instead of returning NULL,
since we can do so without changing what isempty() would return for the same values,.

/Joel

#7Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#5)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Mar 2, 2021, at 06:31, Tom Lane wrote:

"Joel Jacobson" <joel@compiler.org> writes:

Unless fixed, then the way I see it, I don't think we can use int4range[] for regexp_positions(),

Yeah. It's a cute idea, but the semantics aren't quite right.

I think there is a case to allow creating empty ranges *with* bounds information, e.g. '[6,7)'::int4range,
as well as the current only possibility to create empty ranges *without* bounds information, e.g. 'empty'::int4range

I've had a look at how ranges are implemented,
and I think I've found a way to support this is a simple non-invasive way.

I've outlined the idea in a patch, which I will send separately,
as it's a different feature, possibly useful for purposes other than regexp_positions().

/Joel

#8Isaac Morland
isaac.morland@gmail.com
In reply to: Joel Jacobson (#6)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, 2 Mar 2021 at 00:52, Joel Jacobson <joel@compiler.org> wrote:

Ranges are treated as sets. As such equality is defined by membership.

That being said, I agree that there may be situations in which it would be
convenient to have empty ranges at specific locations. Doing this would
introduce numerous questions which would have to be resolved. For example,
where/when is the empty range resulting from an intersection operation?

Hmm, I think I would assume the intersection of two non-overlapping ranges
to be isempty()=TRUE,
and for lower() and upper() to continue to return NULL.

But I think a zero-length range created with actual bounds should
return the lower() and upper() values during creation, instead of NULL.

I tried to find some other programming environments with range types.

The first one I found was Ada.

Interesting!

Array indices are a bit different than general ranges however.

One question I would have is whether empty ranges are all equal to each
other. If they are, you have an equality that isn’t really equality; if
they aren’t then you would have ranges that are unequal even though they
have exactly the same membership. Although I suppose this is already true
for some types where ends can be specified as open or closed but end up
with the same end element; many range types canonicalize to avoid this but
I don’t think they all do.

Returning to the RE result issue, I wonder how much it actually matters
where any empty matches are. Certainly the actual contents of the match
don’t matter; you don’t need to be able to index into the string to extract
the substring. The only scenario I can see where it could matter is if the
RE is using lookahead or look back to find occurrences before or after
something else. If we stipulate that the result array will be in order,
then you still don’t have the exact location of empty matches but you do at
least have where they are relative to non-empty matches.

#9Joel Jacobson
joel@compiler.org
In reply to: Isaac Morland (#8)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Hi Isaac,

Many thanks for the comments.

On Tue, Mar 2, 2021, at 14:34, Isaac Morland wrote:

One question I would have is whether empty ranges are all equal to each other. If they are, you have an equality that isn’t really equality; if they aren’t then you would have ranges that are unequal even though they have exactly the same membership. Although I suppose this is already true for some types where ends can be specified as open or closed but end up with the same end element; many range types canonicalize to avoid this but I don’t think they all do.

I thought about this problem too. I don't think there is a perfect solution.
Leaving things as they are is problematic too since it makes the range type useless for some use-cases.
I've sent a patch in a separate thread with the least invasive idea I could come up with.

Returning to the RE result issue, I wonder how much it actually matters where any empty matches are. Certainly the actual contents of the match don’t matter; you don’t need to be able to index into the string to extract the substring. The only scenario I can see where it could matter is if the RE is using lookahead or look back to find occurrences before or after something else.

Hmm, I think it would be ugly to have corner-cases handled differently than the rest.

If we stipulate that the result array will be in order, then you still don’t have the exact location of empty matches but you do at least have where they are relative to non-empty matches.

This part I didn't fully understand. Can you please provide some example on this?

/Joel

#10Isaac Morland
isaac.morland@gmail.com
In reply to: Joel Jacobson (#9)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, 2 Mar 2021 at 08:58, Joel Jacobson <joel@compiler.org> wrote:

If we stipulate that the result array will be in order, then you still
don’t have the exact location of empty matches but you do at least have
where they are relative to non-empty matches.

This part I didn't fully understand. Can you please provide some example
on this?

Suppose the match results are:

[4,8)
[10,10)
[13,16)
[20,20)
[24,24)

Then this gets turned into:

[4,8)
empty
[13,16)
empty
empty

So you know that there are non-empty matches from 4-8 and 13-16, plus an
empty match between them and two empty matches at the end. Given that all
empty strings are identical, I think it's only in pretty rare circumstances
where you need to know exactly where the empty matches are; it would have
to be a matter of identifying empty matches immediately before or after a
specific pattern; in which case I suspect it would usually be just as easy
to match the pattern itself directly.

Does this help?

#11Joel Jacobson
joel@compiler.org
In reply to: Isaac Morland (#10)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Mar 2, 2021, at 15:05, Isaac Morland wrote:

Suppose the match results are:

[4,8)
[10,10)
[13,16)
[20,20)
[24,24)

Then this gets turned into:

[4,8)
empty
[13,16)
empty
empty

So you know that there are non-empty matches from 4-8 and 13-16, plus an empty match between them and two empty matches at the end. Given that all empty strings are identical, I think it's only in pretty rare circumstances where you need to know exactly where the empty matches are; it would have to be a matter of identifying empty matches immediately before or after a specific pattern; in which case I suspect it would usually be just as easy to match the pattern itself directly.

Does this help?

Thanks, I see what you mean now.

I agree it's probably a corner-case,
but I think I would still prefer a complete solution by just returning setof two integer[] values,
instead of the cuter-but-only-partial solution of using the existing int4range[].

Even better would be if we could fix the range type so it could actually be used in this and other similar situations.

If so, then we could do:

SELECT r FROM regexp_positions('caaabaaabeee','(?<=b)a+','g') AS r;
r
-----------
{"[6,9)"}
(1 row)

SELECT r FROM regexp_positions('caaabaaabeee','(?<=b)','g') AS r;
r
---------
{empty}
{empty}
(2 rows)

SELECT lower(r[1]), upper(r[1]) FROM regexp_positions('caaabaaabeee','(?<=b)','g') AS r;
lower | upper
-------+-------
5 | 5
9 | 9
(2 rows)

/Joel

#12Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Isaac Morland (#8)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Mar 2, 2021, at 5:34 AM, Isaac Morland <isaac.morland@gmail.com> wrote:

Returning to the RE result issue, I wonder how much it actually matters where any empty matches are. Certainly the actual contents of the match don’t matter; you don’t need to be able to index into the string to extract the substring. The only scenario I can see where it could matter is if the RE is using lookahead or look back to find occurrences before or after something else. If we stipulate that the result array will be in order, then you still don’t have the exact location of empty matches but you do at least have where they are relative to non-empty matches.

I agree the contents of the match don't matter, because they are always empty. But the position matters. You could intend to split a string in multiple places using lookaheads and lookbehinds to determine the split points.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#5)
1 attachment(s)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Mar 2, 2021, at 06:31, Tom Lane wrote:

"Joel Jacobson" <joel@compiler.org> writes:

Unless fixed, then the way I see it, I don't think we can use int4range[] for regexp_positions(),

Yeah. It's a cute idea, but the semantics aren't quite right.

Having abandoned the cute idea that didn't work,
here comes a new patch with a regexp_positions() instead returning
setof record (start_pos integer[], end_pos integer[]).

Example:

SELECT * FROM regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
start_pos | end_pos
-----------+---------
{3,6} | {6,11}
{11,16} | {16,20}
(2 rows)

Based on HEAD (040af779382e8e4797242c49b93a5a8f9b79c370).

I've updated docs and tests.

/Joel

Attachments:

0002-regexp-positions.patchapplication/octet-stream; name=0002-regexp-positions.patchDownload
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index fee0561961..b761a49d4e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3140,6 +3140,31 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
       </row>
 
       <row>
+        <entry role="func_table_entry"><para role="func_signature">
+         <indexterm>
+          <primary>regexp_positions</primary>
+         </indexterm>
+         <function>regexp_positions</function> ( <parameter>string</parameter> <type>text</type>, <parameter>pattern</parameter> <type>text</type> [, <parameter>flags</parameter> <type>text</type> ] )
+        <returnvalue>setof record</returnvalue>
+        ( <parameter>start_pos</parameter> <type>integer[]</type>,
+        <parameter>end_pos</parameter> <type>integer[]</type> )
+        </para>
+        <para>
+         Returns start and end positions of captured substring(s) resulting from matching a POSIX regular
+         expression to the <parameter>string</parameter>; see
+         <xref linkend="functions-posix-regexp"/>.
+        </para>
+        <para>
+         <literal>regexp_positions('foobarbequebaz', 'ba.', 'g')</literal>
+         <returnvalue></returnvalue>
+ <programlisting>
+  ({3},{6})
+  ({11},{14})
+ </programlisting>
+        </para></entry>
+       </row>
+ 
+       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm>
          <primary>regexp_replace</primary>
diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index a32c5c82ab..af3564e3e9 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -118,6 +118,7 @@ static regexp_matches_ctx *setup_regexp_matches(text *orig_str, text *pattern,
 												bool ignore_degenerate,
 												bool fetching_unmatched);
 static ArrayType *build_regexp_match_result(regexp_matches_ctx *matchctx);
+static Datum build_regexp_positions_result(regexp_matches_ctx *matchctx, TupleDesc tupdesc);
 static Datum build_regexp_split_result(regexp_matches_ctx *splitctx);
 
 
@@ -1056,6 +1057,66 @@ regexp_matches(PG_FUNCTION_ARGS)
 	SRF_RETURN_DONE(funcctx);
 }
 
+/*
+ * regexp_positions()
+ *		Return a setof record of positions where a pattern matches within a string.
+ */
+Datum
+regexp_positions(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	regexp_matches_ctx *matchctx;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		text	   *pattern = PG_GETARG_TEXT_PP(1);
+		text	   *flags = PG_GETARG_TEXT_PP_IF_EXISTS(2);
+		pg_re_flags re_flags;
+		MemoryContext oldcontext;
+		TupleDesc	tupdesc;
+
+		funcctx = SRF_FIRSTCALL_INIT();
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		tupdesc = CreateTemplateTupleDesc(2);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "start_pos",
+						   INT4ARRAYOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "end_pos",
+						   INT4ARRAYOID, -1, 0);
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/* Determine options */
+		parse_re_flags(&re_flags, flags);
+
+		/* be sure to copy the input string into the multi-call ctx */
+		matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,
+										&re_flags,
+										PG_GET_COLLATION(),
+										true, false, false);
+
+		/* Pre-create workspace that build_regexp_positions_result needs */
+		matchctx->elems = (Datum *) palloc(sizeof(Datum) * matchctx->npatterns * 2);
+		matchctx->nulls = (bool *) palloc(sizeof(bool) * matchctx->npatterns);
+
+		MemoryContextSwitchTo(oldcontext);
+		funcctx->user_fctx = (void *) matchctx;
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	matchctx = (regexp_matches_ctx *) funcctx->user_fctx;
+
+	if (matchctx->next_match < matchctx->nmatches)
+	{
+		Datum tuple;
+
+		tuple = build_regexp_positions_result(matchctx, funcctx->tuple_desc);
+		matchctx->next_match++;
+		SRF_RETURN_NEXT(funcctx, tuple);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
 /* This is separate to keep the opr_sanity regression test from complaining */
 Datum
 regexp_matches_no_flags(PG_FUNCTION_ARGS)
@@ -1063,6 +1124,13 @@ regexp_matches_no_flags(PG_FUNCTION_ARGS)
 	return regexp_matches(fcinfo);
 }
 
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_positions_no_flags(PG_FUNCTION_ARGS)
+{
+	return regexp_positions(fcinfo);
+}
+
 /*
  * setup_regexp_matches --- do the initial matching for regexp_match
  *		and regexp_split functions
@@ -1332,6 +1400,59 @@ build_regexp_match_result(regexp_matches_ctx *matchctx)
 							  TEXTOID, -1, false, TYPALIGN_INT);
 }
 
+/*
+ * build_regexp_positions_result - build output array for current match
+ */
+static Datum
+build_regexp_positions_result(regexp_matches_ctx *matchctx, TupleDesc	tupdesc)
+{
+	Datum	   *elems = matchctx->elems;
+	bool	   *nulls = matchctx->nulls;
+	int			dims[1];
+	int			lbs[1];
+	int			loc;
+	int			i;
+	ArrayType  *so_ary;
+	ArrayType  *eo_ary;
+	Datum		values[2];
+	bool		tuple_nulls[2] = {false, false};
+	HeapTuple	tuple;
+
+	/* Extract matching positions from the original string */
+	loc = matchctx->next_match * matchctx->npatterns * 2;
+	for (i = 0; i < matchctx->npatterns; i++)
+	{
+		int			so = matchctx->match_locs[loc++];
+		int			eo = matchctx->match_locs[loc++];
+
+		if (so < 0 || eo < 0)
+		{
+			elems[i] = (Datum) 0;
+			elems[i + matchctx->npatterns] = (Datum) 0;
+			nulls[i] = true;
+		}
+		else
+		{
+			elems[i] = Int32GetDatum(so);
+			elems[i + matchctx->npatterns] = Int32GetDatum(eo);
+			nulls[i] = false;
+		}
+	}
+
+	/* And form two arrays */
+	dims[0] = matchctx->npatterns;
+	lbs[0] = 1;
+	so_ary = construct_md_array(elems, nulls, 1, dims, lbs,
+							  INT4OID, 4, true, TYPALIGN_INT);
+	eo_ary = construct_md_array(elems + matchctx->npatterns, nulls, 1, dims, lbs,
+							  INT4OID, 4, true, TYPALIGN_INT);
+	values[0] = PointerGetDatum(so_ary);
+	values[1] = PointerGetDatum(eo_ary);
+	tuple = heap_form_tuple(tupdesc, values, tuple_nulls);
+	return HeapTupleGetDatum(tuple);
+
+}
+
 /*
  * regexp_split_to_table()
  *		Split the string at matches of the pattern, returning the
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 59d2b71ca9..51ac045b70 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3560,6 +3560,23 @@
   proname => 'regexp_matches', prorows => '10', proretset => 't',
   prorettype => '_text', proargtypes => 'text text text',
   prosrc => 'regexp_matches' },
+
+{ oid => '8104', descr => 'find matching position(s) for regexp',
+  proname => 'regexp_positions', prorows => '10',
+  proretset => 't', prorettype => 'record',
+  proargtypes => 'text text', proallargtypes => '{text,text,_int4,_int4}',
+  proargmodes => '{i,i,o,o}',
+  proargnames => '{string,pattern,start_pos,end_pos}',
+  prosrc => 'regexp_positions_no_flags' },
+
+{ oid => '8105', descr => 'find matching position(s) for regexp',
+  proname => 'regexp_positions', prorows => '10',
+  proretset => 't', prorettype => 'record',
+  proargtypes => 'text text text', proallargtypes => '{text,text,text,_int4,_int4}',
+  proargmodes => '{i,i,i,o,o}',
+  proargnames => '{string,pattern,flags,start_pos,end_pos}',
+  prosrc => 'regexp_positions' },
+
 { oid => '2088', descr => 'split string by field_sep and return field_num',
   proname => 'split_part', prorettype => 'text',
   proargtypes => 'text text int4', prosrc => 'split_part' },
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index fb4573d85f..adcc2bea50 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -601,6 +601,12 @@ SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$);
  {bar,beque}
 (1 row)
 
+SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
+  regexp_positions  
+--------------------
+ ("{3,6}","{6,11}")
+(1 row)
+
 -- test case insensitive
 SELECT regexp_matches('foObARbEqUEbAz', $re$(bar)(beque)$re$, 'i');
  regexp_matches 
@@ -616,6 +622,13 @@ SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g')
  {bazil,barf}
 (2 rows)
 
+SELECT regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
+   regexp_positions    
+-----------------------
+ ("{3,6}","{6,11}")
+ ("{11,16}","{16,20}")
+(2 rows)
+
 -- empty capture group (matched empty string)
 SELECT regexp_matches('foobarbequebaz', $re$(bar)(.*)(beque)$re$);
  regexp_matches 
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index 57a48c9d0b..aa8b0553f0 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -198,12 +198,14 @@ SELECT regexp_replace('AAA aaa', 'A+', 'Z', 'z');
 
 -- return all matches from regexp
 SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$);
+SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
 
 -- test case insensitive
 SELECT regexp_matches('foObARbEqUEbAz', $re$(bar)(beque)$re$, 'i');
 
 -- global option - more than one match
 SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
+SELECT regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
 
 -- empty capture group (matched empty string)
 SELECT regexp_matches('foobarbequebaz', $re$(bar)(.*)(beque)$re$);
#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#13)
Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

"Joel Jacobson" <joel@compiler.org> writes:

Having abandoned the cute idea that didn't work,
here comes a new patch with a regexp_positions() instead returning
setof record (start_pos integer[], end_pos integer[]).

I wonder if a 2-D integer array wouldn't be a better idea,
ie {{startpos1,length1},{startpos2,length2},...}. My experience
with working with parallel arrays in SQL has been unpleasant.

Also, did you see

/messages/by-id/fc160ee0-c843-b024-29bb-97b5da61971f@darold.net

Seems like there may be some overlap in these proposals.

regards, tom lane

#15Chapman Flack
chap@anastigmatix.net
In reply to: Tom Lane (#14)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On 03/04/21 10:40, Tom Lane wrote:

Also, did you see

/messages/by-id/fc160ee0-c843-b024-29bb-97b5da61971f@darold.net

Seems like there may be some overlap in these proposals.

Not only that, the functions in that other proposal are very similar
to the standard's own functions that are specified to use XML Query
regular expression syntax (sample implementations in [1]https://tada.github.io/pljava/pljava-examples/apidocs/org/postgresql/pljava/example/saxon/S9.html#method.summary).

These differently-named (which is good) functions seem to be a de facto
standard where the regexp syntax and semantics are those native to the
DBMS, the correspondence being

de facto ISO XQuery-based
-------------- ------------------
regexp_like like_regex
regexp_count occurrences_regex
regexp_instr position_regex
regexp_substr substring_regex
regexp_replace translate_regex

The regexp_positions proposal highlights an interesting apparent gap in
both the de facto and the ISO specs: the provided functions allow you
to specify which occurrence you're talking about, and get the corresponding
positions or the corresponding substring, but neither set of functions
includes one to just give you all the matching positions at once as
a SETOF something.

What the proposed regexp_positions() returns is pretty much exactly
the notional "list of match vectors" that appears internally throughout
the specs of the ISO functions, but is never directly exposed.

In the LOMV as described in the standard, the position/length arrays
are indexed from zero, and the start and length at index 0 are those
for the overall match as a whole.

Right now, if you have a query that involves, say,

substring_regex('(b[^b]+)(b[^b]+)' IN str GROUP 1) and also
substring_regex('(b[^b]+)(b[^b]+)' IN str GROUP 2),

a na�ve implementation like [1]https://tada.github.io/pljava/pljava-examples/apidocs/org/postgresql/pljava/example/saxon/S9.html#method.summary will of course compile and evaluate
the regexp twice and return one group each time. It makes me wonder
whether the standards committee was picturing a clever parse analyzer
and planner that would say "aha! you want group 1 and group 2 from
a single evaluation of this regex!", and that might even explain the
curious rule in the standard that the regex must be an actual literal,
not any other expression. (Still, that strikes me as an awkward way to
have to write it, spelling the regex out as a literal, twice.)

It has also made my idly wonder how close we could get to behaving
that way, perhaps with planner support functions and other available
parse analysis/planning hooks. Would any of those mechanisms get a
sufficiently global view of the query to do that kind of rewriting?

Regards,
-Chap

[1]: https://tada.github.io/pljava/pljava-examples/apidocs/org/postgresql/pljava/example/saxon/S9.html#method.summary
https://tada.github.io/pljava/pljava-examples/apidocs/org/postgresql/pljava/example/saxon/S9.html#method.summary

#16Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#14)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Thu, Mar 4, 2021, at 16:40, Tom Lane wrote:

"Joel Jacobson" <joel@compiler.org> writes:

Having abandoned the cute idea that didn't work,
here comes a new patch with a regexp_positions() instead returning
setof record (start_pos integer[], end_pos integer[]).

I wonder if a 2-D integer array wouldn't be a better idea,
ie {{startpos1,length1},{startpos2,length2},...}. My experience
with working with parallel arrays in SQL has been unpleasant.

I considered it, but I prefer two separate simple arrays for two reasons:

a) more pedagogic, it's at least then obvious what values are start and end positions,
then you only have to understand what the values means.

b) 2-D doesn't arrays don't work well with unnest().
If you would unnest() the 2-D array you couldn't separate the start positions from the end positions,
whereas with two separate, you could do:

SELECT unnest(start_pos) AS start_pos, unnest(end_pos) AS end_pos FROM regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
start_pos | end_pos
-----------+---------
3 | 6
6 | 11
11 | 16
16 | 20
(4 rows)

Can give some details on your unpleasant experiences of parallel arrays?

Also, did you see

/messages/by-id/fc160ee0-c843-b024-29bb-97b5da61971f@darold.net

Seems like there may be some overlap in these proposals.

Yes, I saw it, it was sent shortly after my proposal, so I couldn't take it into account.
Seems useful, except regexp_instr() seems redundant, I would rather have regexp_positions(),
but maybe regexp_instr() should also be added for compatibility reasons.

/Joel

#17Gilles Darold
gilles@darold.net
In reply to: Tom Lane (#14)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Le 04/03/2021 à 16:40, Tom Lane a écrit :

"Joel Jacobson" <joel@compiler.org> writes:

Having abandoned the cute idea that didn't work,
here comes a new patch with a regexp_positions() instead returning
setof record (start_pos integer[], end_pos integer[]).

I wonder if a 2-D integer array wouldn't be a better idea,
ie {{startpos1,length1},{startpos2,length2},...}. My experience
with working with parallel arrays in SQL has been unpleasant.

Also, did you see

/messages/by-id/fc160ee0-c843-b024-29bb-97b5da61971f@darold.net

Seems like there may be some overlap in these proposals.

The object of regexp_position() is to return all start+end of captured
substrings, it overlaps a little with regexp_instr() in the way that
this function returns the start or end position of a specific captured
substring. I think it is a good idea to have a function that returns all
positions instead of a single one like regexp_instr(), this is not the
same usage. Actually regexp_position() is exactly the same as
regexp_matches() except that it return positions instead of substrings.

I also think that it should return a setof 2-D integer array, an other
solution is to return all start/end positions of an occurrence chained
in an integer array {start1,end1,start2,end2,..}.

Regards,

--
Gilles Darold

#18Joel Jacobson
joel@compiler.org
In reply to: Gilles Darold (#17)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Thu, Mar 4, 2021, at 17:55, Gilles Darold wrote:

I also think that it should return a setof 2-D integer array, an other
solution is to return all start/end positions of an occurrence chained
in an integer array {start1,end1,start2,end2,..}.

Hmm. Seems like we've in total managed to come up with three flawed ideas.

Pros/cons I see:

Idea #1: setof 2-D integer array
+ Packs the values into one single value.
- Difficult to work with 2-D arrays, doesn't work well with unnest(), has to inspect the dims and use for loops to extract values.
- Looking at a 2-D value, it's not obvious what the integer values means in it means. Which one is "startpos" and do we have "length" or "endpos" values?
Idea #2: setof (start_pos integer[], end_pos integer[])
+ It's obvious to the user what type of integers "start_pos" and "end_pos" contain.
- Decouples the values into two separate values.
- Tom mentioned some bad experiences with separate array values. (Details on this would be interesting.)

Idea #3: chained integer array {start1,end1,start2,end2,..}
- Mixes different values into the same value
- Requires maths (although simple calculations) to extract values

I think all three ideas (including mine) are ugly. None of them is wart free.

Idea #4: add a new composite built-in type.

A simple composite type with two int8 fields.

The field names seems to vary a lot between languages:

Rust: "start", "end" [1]https://doc.rust-lang.org/std/ops/struct.Range.html
C++: "begin", "end" [2]https://en.cppreference.com/w/cpp/ranges
Python: "start", "stop" [3]https://www.w3schools.com/python/ref_func_range.asp

Such a simple composite type, could then always be used,
when you want to represent simple integer ranges,
between two exact values, arguably a very common need.

Such type could be converted to/from int8range,
but would have easily accessible field names,
which is simpler than using lower() and upper(),
since upper() always returns the canonical
exclusive upper bound for discrete types,
which is not usually what you want when
dealing with "start" and "end" integer ranges.

Since there is no type named just "range", why not just use this name?

Since "end" is a keyword, I suggest the "stop" name:

PoC:

CREATE TYPE range AS (start int8, stop int8);

A real implementation would of course also verify CHECK (start <= stop),
and would add conversions to/from int8range.

I realise this is probably a controversial idea.
But, I think this is a general common problem that deserves a clean general solution.

Thoughts? More ideas?

[1]: https://doc.rust-lang.org/std/ops/struct.Range.html
[2]: https://en.cppreference.com/w/cpp/ranges
[3]: https://www.w3schools.com/python/ref_func_range.asp

#19Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#18)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Idea #5:

Allow disabling canonicalization via optional parameter to range constructor functions.

This would then allow using the range type,
to create inclusive/inclusive integer ranges,
where lower() and upper() would return what you expect.

/Joel

#20Pavel Stehule
pavel.stehule@gmail.com
In reply to: Joel Jacobson (#19)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Hi

pá 5. 3. 2021 v 13:44 odesílatel Joel Jacobson <joel@compiler.org> napsal:

Idea #5:

Allow disabling canonicalization via optional parameter to range
constructor functions.

I think so rules describing ranges and multirages are long enough, so
increasing functionality doesn't look like a practical idea.

I prefere special simple composite type like you described in the previous
email (start, stop) or (start, length). It can be used more times when
using range or multi range is not practical.

The composite types are more natural for this purpose than 2D arrays.

Regards

Pavel

Show quoted text

This would then allow using the range type,
to create inclusive/inclusive integer ranges,
where lower() and upper() would return what you expect.

/Joel

#21Andreas Karlsson
andreas@proxel.se
In reply to: Tom Lane (#14)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On 3/4/21 4:40 PM, Tom Lane wrote:

I wonder if a 2-D integer array wouldn't be a better idea,
ie {{startpos1,length1},{startpos2,length2},...}. My experience
with working with parallel arrays in SQL has been unpleasant.

Hm, I can see your point but on the other hand I can't say my
experiences working with 2-D arrays have been that pleasant either. The
main issue being how there is no simple way to unnest just one dimension
of the array. Maybe it would be worth considering implementing a
function for that.

As far as I know to unnest just one dimension you would need to use
generate_series() or something like the query below. Please correct me
if I am wrong and there is some more ergonomic way to do it.

WITH d (a) AS (SELECT '{{2,3},{4,5}}'::int[])
SELECT array_agg(unnest) FROM d, unnest(a) WITH ORDINALITY GROUP BY
(ordinality - 1) / array_length(a, 2);

Andreas

#22Joel Jacobson
joel@compiler.org
In reply to: Mark Dilger (#2)
2 attachment(s)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Mar 2, 2021, at 01:12, Mark Dilger wrote:

I like the idea so I did a bit of testing. I think the following should not error, but does:

+SELECT regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
+ERROR:  range lower bound must be less than or equal to range upper bound

Doh! How stupid of me. I realize now I had a off-by-one thinko in my 0001 patch using int4range.

I didn't use the raw "so" and "eo" values in regexp.c like I should have,
instead, I incorrectly used (so + 1) as the startpos,
and just eo as the endpos.

This is what caused all the problems.

The fix is simple:
-  lower.val = Int32GetDatum(so + 1);
+ lower.val = Int32GetDatum(so);

The example that gave the error now works properly:

SELECT regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
regexp_positions
------------------
{"[6,7)"}
(1 row)

I've also created a SQL PoC of the composite range type idea,
and convenience wrapper functions for int4range and int8range.

CREATE TYPE range AS (start int8, stop int8);

Helper functions:
range(start int8, stop int8) -> range
range(int8range) -> range
range(int4range) -> range
range(int8range[]) -> range[]
range(int4range[]) -> range[]

Demo:

regexp_positions() returns setof int4range[]:

SELECT r FROM regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g') AS r;
r
-----------------------
{"[3,7)","[6,12)"}
{"[11,17)","[16,21)"}
(2 rows)

Convert int4range[] -> range[]:

SELECT range(r) FROM regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g') AS r;
range
-----------------------
{"(3,6)","(6,11)"}
{"(11,16)","(16,20)"}
(2 rows)

"start" and "stop" fields:

SELECT (range(r[1])).* FROM regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g') AS r;
start | stop
-------+------
3 | 6
11 | 16
(2 rows)

zero-length match at beginning:

SELECT r FROM regexp_positions('','^','g') AS r;
r
-----------
{"[0,1)"}
(1 row)

SELECT (range(r[1])).* FROM regexp_positions('','^','g') AS r;
start | stop
-------+------
0 | 0
(1 row)

My conclusion is that we should use setof int4range[] as the return value for regexp_positions().

New patch attached.

The composite range type and helper functions are of course not at all necessary,
but I think they would be a nice addition, to make it easier to work with ranges
for composite types. I intentionally didn't create anyrange versions of them,
since they can only support composite types,
since they don't require the inclusive/exclusive semantics.

/Joel

Attachments:

range.sqlapplication/octet-stream; name=range.sqlDownload
0003-regexp-positions.patchapplication/octet-stream; name=0003-regexp-positions.patchDownload
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index fee0561961..19a1278747 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3140,6 +3140,29 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
       </row>
 
       <row>
+        <entry role="func_table_entry"><para role="func_signature">
+         <indexterm>
+          <primary>regexp_positions</primary>
+         </indexterm>
+         <function>regexp_positions</function> ( <parameter>string</parameter> <type>text</type>, <parameter>pattern</parameter> <type>text</type> [, <parameter>flags</parameter> <type>text</type> ] )
+         <returnvalue>setof int4range[]</returnvalue>
+        </para>
+        <para>
+         Returns start and end positions of captured substring(s) resulting from matching a POSIX regular
+         expression to the <parameter>string</parameter>; see
+         <xref linkend="functions-posix-regexp"/>.
+        </para>
+        <para>
+         <literal>regexp_positions('foobarbequebaz', 'ba.', 'g')</literal>
+         <returnvalue></returnvalue>
+ <programlisting>
+  {"[3,7)"}
+  {"[11,15)"}
+ </programlisting>
+        </para></entry>
+       </row>
+ 
+       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm>
          <primary>regexp_replace</primary>
diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index a32c5c82ab..aa60151af0 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -37,6 +37,7 @@
 #include "utils/builtins.h"
 #include "utils/memutils.h"
 #include "utils/varlena.h"
+#include "utils/rangetypes.h"
 
 #define PG_GETARG_TEXT_PP_IF_EXISTS(_n) \
 	(PG_NARGS() > (_n) ? PG_GETARG_TEXT_PP(_n) : NULL)
@@ -118,6 +119,7 @@ static regexp_matches_ctx *setup_regexp_matches(text *orig_str, text *pattern,
 												bool ignore_degenerate,
 												bool fetching_unmatched);
 static ArrayType *build_regexp_match_result(regexp_matches_ctx *matchctx);
+static ArrayType *build_regexp_positions_result(regexp_matches_ctx *matchctx);
 static Datum build_regexp_split_result(regexp_matches_ctx *splitctx);
 
 
@@ -1056,6 +1058,58 @@ regexp_matches(PG_FUNCTION_ARGS)
 	SRF_RETURN_DONE(funcctx);
 }
 
+/*
+ * regexp_positions()
+ *		Return a table of ranges where a pattern matches within a string.
+ */
+Datum
+regexp_positions(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	regexp_matches_ctx *matchctx;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		text	   *pattern = PG_GETARG_TEXT_PP(1);
+		text	   *flags = PG_GETARG_TEXT_PP_IF_EXISTS(2);
+		pg_re_flags re_flags;
+		MemoryContext oldcontext;
+
+		funcctx = SRF_FIRSTCALL_INIT();
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* Determine options */
+		parse_re_flags(&re_flags, flags);
+
+		/* be sure to copy the input string into the multi-call ctx */
+		matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,
+										&re_flags,
+										PG_GET_COLLATION(),
+										true, false, false);
+
+		/* Pre-create workspace that build_regexp_match_result needs */
+		matchctx->elems = (Datum *) palloc(sizeof(Datum) * matchctx->npatterns);
+		matchctx->nulls = (bool *) palloc(sizeof(bool) * matchctx->npatterns);
+
+		MemoryContextSwitchTo(oldcontext);
+		funcctx->user_fctx = (void *) matchctx;
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	matchctx = (regexp_matches_ctx *) funcctx->user_fctx;
+
+	if (matchctx->next_match < matchctx->nmatches)
+	{
+		ArrayType  *result_ary;
+
+		result_ary = build_regexp_positions_result(matchctx);
+		matchctx->next_match++;
+		SRF_RETURN_NEXT(funcctx, PointerGetDatum(result_ary));
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
 /* This is separate to keep the opr_sanity regression test from complaining */
 Datum
 regexp_matches_no_flags(PG_FUNCTION_ARGS)
@@ -1063,6 +1117,13 @@ regexp_matches_no_flags(PG_FUNCTION_ARGS)
 	return regexp_matches(fcinfo);
 }
 
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_positions_no_flags(PG_FUNCTION_ARGS)
+{
+	return regexp_positions(fcinfo);
+}
+
 /*
  * setup_regexp_matches --- do the initial matching for regexp_match
  *		and regexp_split functions
@@ -1332,6 +1393,64 @@ build_regexp_match_result(regexp_matches_ctx *matchctx)
 							  TEXTOID, -1, false, TYPALIGN_INT);
 }
 
+/*
+ * build_regexp_positions_result - build output array for current match
+ */
+static ArrayType *
+build_regexp_positions_result(regexp_matches_ctx *matchctx)
+{
+	Datum	   *elems = matchctx->elems;
+	bool	   *nulls = matchctx->nulls;
+	int			dims[1];
+	int			lbs[1];
+	int			loc;
+	int			i;
+	RangeType  *range;
+	TypeCacheEntry *typcache;
+	RangeBound	lower;
+	RangeBound	upper;
+
+	typcache = lookup_type_cache(INT4RANGEOID, TYPECACHE_RANGE_INFO);
+
+	/* Extract matching substrings from the original string */
+	loc = matchctx->next_match * matchctx->npatterns * 2;
+	for (i = 0; i < matchctx->npatterns; i++)
+	{
+		int			so = matchctx->match_locs[loc++];
+		int			eo = matchctx->match_locs[loc++];
+
+		if (so < 0 || eo < 0)
+		{
+			elems[i] = (Datum) 0;
+			nulls[i] = true;
+		}
+		else
+		{
+			lower.val = Int32GetDatum(so);
+			lower.infinite = false;
+			lower.inclusive = true;
+			lower.lower = true;
+
+			upper.val = Int32GetDatum(eo);
+			upper.infinite = false;
+			upper.inclusive = true;
+			upper.lower = false;
+
+			range = make_range(typcache, &lower, &upper, false);
+
+			elems[i] = RangeTypePGetDatum(range);
+			nulls[i] = false;
+		}
+	}
+
+	/* And form an array */
+	dims[0] = matchctx->npatterns;
+	lbs[0] = 1;
+	/* XXX: this hardcodes assumptions about the text type */
+	return construct_md_array(elems, nulls, 1, dims, lbs,
+							  INT4RANGEOID, -1, false, TYPALIGN_INT);
+}
+
 /*
  * regexp_split_to_table()
  *		Split the string at matches of the pattern, returning the
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 59d2b71ca9..fef0ffd6cd 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3560,6 +3560,14 @@
   proname => 'regexp_matches', prorows => '10', proretset => 't',
   prorettype => '_text', proargtypes => 'text text text',
   prosrc => 'regexp_matches' },
+{ oid => '8104', descr => 'find matching position(s) for regexp',
+  proname => 'regexp_positions', prorows => '10', proretset => 't',
+  prorettype => '_int4range', proargtypes => 'text text text',
+  prosrc => 'regexp_positions' },
+{ oid => '8105', descr => 'find matching position(s) for regexp',
+  proname => 'regexp_positions', prorows => '10', proretset => 't',
+  prorettype => '_int4range', proargtypes => 'text text',
+  prosrc => 'regexp_positions_no_flags' },
 { oid => '2088', descr => 'split string by field_sep and return field_num',
   proname => 'split_part', prorettype => 'text',
   proargtypes => 'text text int4', prosrc => 'split_part' },
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index fb4573d85f..6f9e7d5996 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -601,6 +601,12 @@ SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$);
  {bar,beque}
 (1 row)
 
+SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
+  regexp_positions  
+--------------------
+ {"[3,7)","[6,12)"}
+(1 row)
+
 -- test case insensitive
 SELECT regexp_matches('foObARbEqUEbAz', $re$(bar)(beque)$re$, 'i');
  regexp_matches 
@@ -616,6 +622,13 @@ SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g')
  {bazil,barf}
 (2 rows)
 
+SELECT regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
+   regexp_positions    
+-----------------------
+ {"[3,7)","[6,12)"}
+ {"[11,17)","[16,21)"}
+(2 rows)
+
 -- empty capture group (matched empty string)
 SELECT regexp_matches('foobarbequebaz', $re$(bar)(.*)(beque)$re$);
  regexp_matches 
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index 57a48c9d0b..aa8b0553f0 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -198,12 +198,14 @@ SELECT regexp_replace('AAA aaa', 'A+', 'Z', 'z');
 
 -- return all matches from regexp
 SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$);
+SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
 
 -- test case insensitive
 SELECT regexp_matches('foObARbEqUEbAz', $re$(bar)(beque)$re$, 'i');
 
 -- global option - more than one match
 SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
+SELECT regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
 
 -- empty capture group (matched empty string)
 SELECT regexp_matches('foobarbequebaz', $re$(bar)(.*)(beque)$re$);
#23Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#22)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Fri, Mar 5, 2021, at 20:46, Joel Jacobson wrote:

My conclusion is that we should use setof int4range[] as the return value for regexp_positions().

If acceptable by the project, it be even nicer if we could just return the suggested composite type.

I don't see any existing catalog functions returning composite types though?
Is this due to some policy of not wanting composite types as return values for built-ins or just a coincidence?

Example on regexp_positions -> setof range[]
where range is:

CREATE TYPE range AS (start int8, stop int8);

SELECT regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
regexp_positions
------------------
{"(6,6)"}
(1 row)

SELECT r FROM regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g') AS r;
r
-----------------------
{"(3,6)","(6,11)"}
{"(11,16)","(16,20)"}
(2 rows)

SELECT r[1].*, r[2].* FROM regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g') AS r;
start | stop | start | stop
-------+------+-------+------
3 | 6 | 6 | 11
11 | 16 | 16 | 20
(2 rows)

SELECT r[1].* FROM regexp_positions('','^','g') AS r;
start | stop
-------+------
0 | 0
(1 row)

Thoughts?

/Joel

#24Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Joel Jacobson (#22)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Mar 5, 2021, at 11:46 AM, Joel Jacobson <joel@compiler.org> wrote:

/Joel
<range.sql><0003-regexp-positions.patch>

I did a bit more testing:

+SELECT regexp_positions('foobarbequebaz', 'b', 'g');
+ regexp_positions 
+------------------
+ {"[3,5)"}
+ {"[6,8)"}
+ {"[11,13)"}
+(3 rows)
+

I understand that these ranges are intended to be read as one character long matches starting at positions 3, 6, and 11, but they look like they match either two or three characters, depending on how you read them, and users will likely be confused by that.

+SELECT regexp_positions('foobarbequebaz', '(?=beque)', 'g');
+ regexp_positions 
+------------------
+ {"[6,7)"}
+(1 row)
+

This is a zero length match. As above, it might be confusing that a zero length match reads this way.

+SELECT regexp_positions('foobarbequebaz', '(?<=z)', 'g');
+ regexp_positions 
+------------------
+ {"[14,15)"}
+(1 row)
+

Same here, except this time position 15 is referenced, which is beyond the end of the string.

I think a zero length match at the end of this string should read as {"[14,14)"}, and you have been forced to add one to avoid that collapsing down to "empty", but I'd rather you found a different datatype rather than abuse the definition of int4range.

It seems that you may have reached a similar conclusion down-thread?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#25Joel Jacobson
joel@compiler.org
In reply to: Mark Dilger (#24)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Mon, Mar 8, 2021, at 17:20, Mark Dilger wrote:

On Mar 5, 2021, at 11:46 AM, Joel Jacobson <joel@compiler.org> wrote:
<range.sql><0003-regexp-positions.patch>

I did a bit more testing:

+SELECT regexp_positions('foobarbequebaz', 'b', 'g');
+ regexp_positions 
+------------------
+ {"[3,5)"}
+ {"[6,8)"}
+ {"[11,13)"}
+(3 rows)
+

I understand that these ranges are intended to be read as one character long matches starting at positions 3, 6, and 11, but they look like they match either two or three characters, depending on how you read them, and users will likely be confused by that.

+SELECT regexp_positions('foobarbequebaz', '(?=beque)', 'g');
+ regexp_positions 
+------------------
+ {"[6,7)"}
+(1 row)
+

This is a zero length match. As above, it might be confusing that a zero length match reads this way.

+SELECT regexp_positions('foobarbequebaz', '(?<=z)', 'g');
+ regexp_positions 
+------------------
+ {"[14,15)"}
+(1 row)
+

Same here, except this time position 15 is referenced, which is beyond the end of the string.

I think a zero length match at the end of this string should read as {"[14,14)"}, and you have been forced to add one to avoid that collapsing down to "empty", but I'd rather you found a different datatype rather than abuse the definition of int4range.

It seems that you may have reached a similar conclusion down-thread?

This is due to the, in my opinion, unfortunate decision of using inclusive/exclusive as the canonical form for discrete types.
Probably not much we can do about that, but that's what we have, so I think it's fine.

[6,7) is exactly the same thing as [6,6] for discrete types, it simply means the startpos and endpos both are 6.

I prefer to think of a match as two points. If the points are at the same position, it's a zero length match.

In the example, the startpos and endpos are both at 6, so it's a zero length match.,

This was very confusing to me at first. I wrongly thought I needed an empty int4range and had the perverse idea of hacking the range type to allow setting lower and upper even though it was empty. This was a really really bad idea which I feel stupid of even considering. It was before I understood a zero length match should actually *not* be represented as an empty int4range, but as an int4range covering exactly one single integer, since startpos=endpos. This was due to my off-by-one error. With that fixed, the only problem is the (in my opinion) unnatural canonical form for discrete types, since in this context it's just silly to talk about inclusive/exclusive. I think inclusive/inclusive would have been much more SQL idiomatic, since that's the semantics for BETWEEN in SQL, it's inclusive/inclusive. So is most other programming environments I've seen.

However, not much we can do about that for int4range/int8range,
but maybe multirange could change the canonical form going forward.

Even if not changed, I think int4range works just fine. It just requires a bit more mental effort to understand what the values mean. Probably an annoyance for users at first, but I think they easily will understand they should just do "-1" on the upper() value (but only if upper_inc() is FALSE, but you know that for sure for int4ranges, so it is really necessary, one might wonder).

If a N+1 dimension array could easily be unnested to a N dimension array,
I would prefer Tom's idea of a 2-D regexp_positions(), since it simple and not controversial.

Since there are currently zero composite type returning catalog functions, I can see why the idea of returning a "range" with two "start" and "stop" fields is controversial. There are probably good reasons that I fail to see why there are no composite type returning functions in the catalogs. Ideas on why this is the case, anyone?

/Joel

#26Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Joel Jacobson (#25)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Mar 8, 2021, at 9:05 AM, Joel Jacobson <joel@compiler.org> wrote:

If a N+1 dimension array could easily be unnested to a N dimension array,
I would prefer Tom's idea of a 2-D regexp_positions(), since it simple and not controversial.

How about proposing some array functions to go along with the regexp_positions, and then do it that way?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#27Joel Jacobson
joel@compiler.org
In reply to: Mark Dilger (#26)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Mon, Mar 8, 2021, at 18:11, Mark Dilger wrote:

On Mar 8, 2021, at 9:05 AM, Joel Jacobson <joel@compiler.org> wrote:

If a N+1 dimension array could easily be unnested to a N dimension array,
I would prefer Tom's idea of a 2-D regexp_positions(), since it simple and not controversial.

How about proposing some array functions to go along with the regexp_positions, and then do it that way?

Sounds like a nice solution. That would be a huge win when dealing with multidimensional arrays in general.

Do we have strong support on the list for such a function? If so, I can make an attempt implementing it, unless some more experienced hacker wants to do it.

/Joel

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#25)
Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

"Joel Jacobson" <joel@compiler.org> writes:

I prefer to think of a match as two points. If the points are at the same position, it's a zero length match.

FWIW, I personally think that returning a start position and a length
would be the most understandable way to operate. If you report start
position and end position then there is always going to be confusion
over whether the end position is inclusive or exclusive (that is,
some code including our regex library thinks of the "end" as being
"first character after the match"). This is indeed the same
definitional issue you're contending with vis-a-vis range endpoints,
only now you lack any pre-existing definition that people might rely on
to know what you meant.

Since there are currently zero composite type returning catalog functions, I can see why the idea of returning a "range" with two "start" and "stop" fields is controversial. There are probably good reasons that I fail to see why there are no composite type returning functions in the catalogs. Ideas on why this is the case, anyone?

Yeah: it's hard. The amount of catalog infrastructure needed by a
composite type is dauntingly large, and genbki.pl doesn't offer any
support for building composite types that aren't tied to catalogs.
(I suppose if you don't mind hacking Perl, you could try to refactor
it to improve that.) Up to now we've avoided the need for that,
since a function can be declared to return an anonymous record type
by giving it some OUT parameters. However, if I'm understanding
things correctly "regexp_positions(IN ..., OUT match_start integer,
OUT match_length integer) RETURNS SETOF record" wouldn't be enough
for you, because you really need a 2-D tableau of match data to
handle the case of multiple capturing parens plus 'g' mode. It
seems like you need it to return setof array(s), so the choices are
array of composite, 2-D array, or two parallel arrays. I'm not sure
the first of those is so much better than the others that it's worth
the pain involved to set up the initial catalog data that way.

BTW, I don't know if you know the history here, but regexp_matches()
is way older than regexp_match(); we eventually invented the latter
because the former was just too hard to use for easy non-'g' cases.
I'm inclined to think we should learn from that and provide equivalent
variants regexp_position[s] right off the bat.

regards, tom lane

#29Chapman Flack
chap@anastigmatix.net
In reply to: Tom Lane (#28)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On 03/08/21 12:30, Tom Lane wrote:

I'm inclined to think we should learn from that and provide equivalent
variants regexp_position[s] right off the bat.

I think the s-free version is exactly the regexp_instr included in
the other concurrent proposal [1], which closely corresponds to the
ISO position_regex() except for the ISO one using XQuery regex syntax.

I gather from [1] that the name regexp_instr is chosen in solidarity
with other DBMSs that de facto have it. Would it be weirder to have the
singular form be regexp_instr and the plural be regexp_positions?
Or to diverge from the other systems' de facto convention and name
the singular form regexp_position? (Or the plural form regexp_instrs?
That sounds to me like a disassembler for regexps. Or regexps_instr,
like attorneys general? Never mind.)

Regards,
-Chap

#30Chapman Flack
chap@anastigmatix.net
In reply to: Chapman Flack (#29)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On 03/08/21 13:29, Chapman Flack wrote:

I think the s-free version is exactly the regexp_instr included in
the other concurrent proposal [1]

sorry.

[1]: /messages/by-id/fc160ee0-c843-b024-29bb-97b5da61971f@darold.net
/messages/by-id/fc160ee0-c843-b024-29bb-97b5da61971f@darold.net

#31Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Joel Jacobson (#27)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Mar 8, 2021, at 9:20 AM, Joel Jacobson <joel@compiler.org> wrote:

On Mon, Mar 8, 2021, at 18:11, Mark Dilger wrote:

On Mar 8, 2021, at 9:05 AM, Joel Jacobson <joel@compiler.org> wrote:

If a N+1 dimension array could easily be unnested to a N dimension array,
I would prefer Tom's idea of a 2-D regexp_positions(), since it simple and not controversial.

How about proposing some array functions to go along with the regexp_positions, and then do it that way?

Sounds like a nice solution. That would be a huge win when dealing with multidimensional arrays in general.

Do we have strong support on the list for such a function? If so, I can make an attempt implementing it, unless some more experienced hacker wants to do it.

That's a hard question to answer in advance. Typically, you need to propose a solution, and then get feedback. You wouldn't need to post a patch, but perhaps some examples of how you would expect it to work, like

+SELECT slice('{{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}}'::integer[][], '[2,4)'::int4range);
+   slice   
+-----------
+ {{2,3}}
+ {{6,7}}
+ {{10,11}}
+ {{14,15}}
+(4 rows)
+
+SELECT slice('{{{1,2,3},{4,5,6},{7,8,9}},{{10,11,12},{13,14,15},{16,17,18}}}'::integer[][][], '[2,4)'::int4range);
+           slice           
+---------------------------
+ {{{4,5,6},{7,8,9}}}
+ {{{13,14,15},{16,17,18}}}
+(2 rows)
+

and then people can tell you why they hate that choice of interface.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#32Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#28)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Mon, Mar 8, 2021, at 18:30, Tom Lane wrote:

FWIW, I personally think that returning a start position and a length
would be the most understandable way to operate.

Very good point. I agree. (And then ranges cannot be used, regardless of canonical form.)

Yeah: it's hard. The amount of catalog infrastructure needed by a
composite type is dauntingly large, and genbki.pl doesn't offer any
support for building composite types that aren't tied to catalogs.
(I suppose if you don't mind hacking Perl, you could try to refactor
it to improve that.)

I haven't studied genbki.pl in detail, but seen its name on the list many times,
maybe I should go through it to understand it in detail.

On the topic of Perl.
I've written a lot of Perl code over the years.
Trustly was initially a Perl+PostgreSQL microservice project, with different components
written in Perl run as daemons, communicating with each other over TCP/IP,
via JSON-RPC. We had lots of strange problems difficult to debug.
In the end, we moved all the business logics from Perl into database functions in PostgreSQL,
and all problems went away. The biggest win was the nice UTF-8 support,
which was really awkward in Perl. It's kind of UTF-8, but not really and not always.

Most programming languages/compilers are obsessed
with the concept of "bootstrapping"/"dogfooding".

Thinking of PostgreSQL as a language/compiler, that would mean we should be obsessed with the idea
of implementing PostgreSQL in SQL or PL/pgSQL. That would be quite a challenge of course.

However, for certain tasks, when a high-level language is preferred,
and when the raw performance of C isn't necessary, then maybe SQL/PLpgSQL
could be a serious alternative to Perl?

If I understand it correctly, we don't need to run genbki.pl to compile PostgreSQL,
so someone wanting to compile PostgreSQL without having a running PostgreSQL-instance
could do so without problems.

A dependency on having a PostgreSQL instance running,
is perhaps acceptable for hackers developing PostgreSQL?
But of course not for normal users just wanting to compile PostgreSQL.

If we think there is at least a 1% chance this is a feasible idea,
I'm willing to try implementing a SQL/PLpgSQL-version of genbki.pl.
Would be a fun hack, but not if it's guaranteed time-waste.

It seems like you need it to return setof array(s), so the choices are
array of composite, 2-D array, or two parallel arrays. I'm not sure
the first of those is so much better than the others that it's worth
the pain involved to set up the initial catalog data that way.

I agree, I like the 2-D array version, but only if a we could provide a C-function
to allow unnesting N+1 dims to N dims. Is that a fruitful idea, or are there
reasons why it cannot be done easily? I could give it a try, if we think it's a good idea.

BTW, I don't know if you know the history here, but regexp_matches()
is way older than regexp_match(); we eventually invented the latter
because the former was just too hard to use for easy non-'g' cases.
I'm inclined to think we should learn from that and provide equivalent
variants regexp_position[s] right off the bat.

I remember! regexp_match() was a very welcomed addition.
I agree both regexp_position[s] variants would be good for same reasons.

/Joel

#33Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#32)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Mon, Mar 8, 2021, at 19:46, Joel Jacobson wrote:

However, for certain tasks, when a high-level language is preferred,
and when the raw performance of C isn't necessary, then maybe SQL/PLpgSQL
could be a serious alternative to Perl?

Before we had jsonb, this would have been totally unrealistic.

But with jsonb, I think we actually have complete coverage of Perl's data types:

Perl array <=> jsonb array
Perl hash <=> jsonb object
Perl scalar <=> jsonb string/boolean/number

I've been using jsonb with great success for code generation.
ASTs are nicely represented as nested jsonb arrays.

/Joel

#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#32)
Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

"Joel Jacobson" <joel@compiler.org> writes:

If I understand it correctly, we don't need to run genbki.pl to compile PostgreSQL,
so someone wanting to compile PostgreSQL without having a running PostgreSQL-instance
could do so without problems.
A dependency on having a PostgreSQL instance running,
is perhaps acceptable for hackers developing PostgreSQL?
But of course not for normal users just wanting to compile PostgreSQL.
If we think there is at least a 1% chance this is a feasible idea,
I'm willing to try implementing a SQL/PLpgSQL-version of genbki.pl.

No, I think this is a non-starter. Bootstrapping from just the
contents of the git repo is something developers do all the time
(and indeed the buildfarm does it in every run). We do not want to
need a running PG instance in advance of doing that.

Yeah, we could make it work if we started treating all the genbki
output files as things to include in the git repo, but I don't think
anybody wants to go there.

I understand some folks' distaste for Perl, and indeed I don't like it
that much myself. If we were starting over from scratch I'm sure
we'd choose a different language for our build/test infrastructure.
But that's where we are, and I would not be in favor of having more
than one scripting language as build requirements. So Perl is going
to be it unless somebody gets ambitious enough to replace all the Perl
scripts at once, which seems unlikely to happen.

I agree, I like the 2-D array version, but only if a we could provide a C-function
to allow unnesting N+1 dims to N dims. Is that a fruitful idea, or are there
reasons why it cannot be done easily? I could give it a try, if we think it's a good idea.

+1, I think this need has come up before. My guess is that the
hardest part of that will be choosing a function name that will
satisfy everybody ;-).

Could there be any value in allowing unnesting a variable number
of levels? If so, we could dodge the naming issue by inventing
"unnest(anyarray, int) returns anyarray" where the second argument
specifies the number of subscript levels to remove, or perhaps
the number to keep.

regards, tom lane

#35Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tom Lane (#34)
Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

po 8. 3. 2021 v 21:12 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:

"Joel Jacobson" <joel@compiler.org> writes:

If I understand it correctly, we don't need to run genbki.pl to compile

PostgreSQL,

so someone wanting to compile PostgreSQL without having a running

PostgreSQL-instance

could do so without problems.
A dependency on having a PostgreSQL instance running,
is perhaps acceptable for hackers developing PostgreSQL?
But of course not for normal users just wanting to compile PostgreSQL.
If we think there is at least a 1% chance this is a feasible idea,
I'm willing to try implementing a SQL/PLpgSQL-version of genbki.pl.

No, I think this is a non-starter. Bootstrapping from just the
contents of the git repo is something developers do all the time
(and indeed the buildfarm does it in every run). We do not want to
need a running PG instance in advance of doing that.

Yeah, we could make it work if we started treating all the genbki
output files as things to include in the git repo, but I don't think
anybody wants to go there.

I understand some folks' distaste for Perl, and indeed I don't like it
that much myself. If we were starting over from scratch I'm sure
we'd choose a different language for our build/test infrastructure.
But that's where we are, and I would not be in favor of having more
than one scripting language as build requirements. So Perl is going
to be it unless somebody gets ambitious enough to replace all the Perl
scripts at once, which seems unlikely to happen.

I agree, I like the 2-D array version, but only if a we could provide a

C-function

to allow unnesting N+1 dims to N dims. Is that a fruitful idea, or are

there

reasons why it cannot be done easily? I could give it a try, if we think

it's a good idea.

+1, I think this need has come up before. My guess is that the
hardest part of that will be choosing a function name that will
satisfy everybody ;-).

Could there be any value in allowing unnesting a variable number
of levels? If so, we could dodge the naming issue by inventing
"unnest(anyarray, int) returns anyarray" where the second argument
specifies the number of subscript levels to remove, or perhaps
the number to keep.

so what about?

CREATE OR REPLACE FUNCTION unnest_slice(anyarray, int)
RETURNS SETOF anyarray AS $$
DECLARE r $1%type;
BEGIN
FOREACH r SLICE $2 IN ARRAY $1 --- now $2 should be constant
LOOP
RETURN NEXT r;
END LOOP;
END;
$$ LANGUAGE plpgsql;

Regards

Pavel

regards, tom lane

Show quoted text
#36Joel Jacobson
joel@compiler.org
In reply to: Pavel Stehule (#35)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Mon, Mar 8, 2021, at 21:46, Pavel Stehule wrote:

so what about?

CREATE OR REPLACE FUNCTION unnest_slice(anyarray, int)
RETURNS SETOF anyarray AS $$
DECLARE r $1%type;
BEGIN
FOREACH r SLICE $2 IN ARRAY $1 --- now $2 should be constant
LOOP
RETURN NEXT r;
END LOOP;
END;
$$ LANGUAGE plpgsql;

Not sure I understand. Is the suggestion to add "SLICE" as syntactic sugar in PL/pgSQL to invoke the proposed two-argument C-version of unnest()?

/Joel

#37Pavel Stehule
pavel.stehule@gmail.com
In reply to: Joel Jacobson (#36)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

út 9. 3. 2021 v 7:57 odesílatel Joel Jacobson <joel@compiler.org> napsal:

On Mon, Mar 8, 2021, at 21:46, Pavel Stehule wrote:

so what about?

CREATE OR REPLACE FUNCTION unnest_slice(anyarray, int)
RETURNS SETOF anyarray AS $$
DECLARE r $1%type;
BEGIN
FOREACH r SLICE $2 IN ARRAY $1 --- now $2 should be constant
LOOP
RETURN NEXT r;
END LOOP;
END;
$$ LANGUAGE plpgsql;

Not sure I understand. Is the suggestion to add "SLICE" as syntactic sugar
in PL/pgSQL to invoke the proposed two-argument C-version of unnest()?

there are two ideas:

1. the behaviour can be same like SLICE clause of FOREACH statement

2. use unnest_slice as name - the function "unnest" is relatively rich
today and using other overloading doesn't look too practical. But this is
just an idea. I can imagine more forms of slicing or unnesting, so it can
be practical to use different names than just "unnest".

Personally I don't like too much using 2D arrays for this purpose. The
queries over this functionality will be harder to read (it is like fortran
77). I understand so now, there is no other possibility, because pg cannot
build array type from function signature. So it is harder to build an array
of record types.

We can make an easy tuple store of records - like FUNCTION fx(OUT a int,
OUT b int) RETURNS SETOF RECORD. But now, thanks to Tom and Amit's work,
the simple expression evaluation is significantly faster than SQL
evaluation. So using any SRF function has performance impact. What I miss
is the possibility to write functions like FUNCTION fx(OUT a int, OUT b
int) RETURNS ARRAY. With this possibility is easy to write functions that
you need, and is not necessary to use 2d arrays. If the result of regexp
functions will be arrays of records, then a new unnest function is not
necessary. So this is not a good direction. Instead of fixing core issues,
we design workarounds. There can be more wide usage of arrays of composites.

Regards

Pavel

Show quoted text

/Joel

#38Joel Jacobson
joel@compiler.org
In reply to: Pavel Stehule (#37)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Mar 9, 2021, at 08:26, Pavel Stehule wrote:

there are two ideas:

1. the behaviour can be same like SLICE clause of FOREACH statement

Hm, I'm sorry I don't understand, is there an existing SLICE clause?
I get syntax error in HEAD:

ERROR: syntax error at or near "$2"
LINE 5: FOREACH r SLICE $2 IN ARRAY $1 --- now $2 should be consta...

Or do you mean you suggest adding such a clause?

2. use unnest_slice as name - the function "unnest" is relatively rich today and using other overloading doesn't look too practical.

Hm, rich in what way? There is currently only one version for arrays, and a different one for tsvector.

But this is just an idea. I can imagine more forms of slicing or unnesting, so it can be practical to use different names than just "unnest".

Personally I don't like too much using 2D arrays for this purpose. The queries over this functionality will be harder to read (it is like fortran 77). I understand so now, there is no other possibility, because pg cannot build array type from function signature. So it is harder to build an array of record types.

We can make an easy tuple store of records - like FUNCTION fx(OUT a int, OUT b int) RETURNS SETOF RECORD. But now, thanks to Tom and Amit's work, the simple expression evaluation is significantly faster than SQL evaluation. So using any SRF function has performance impact. What I miss is the possibility to write functions like FUNCTION fx(OUT a int, OUT b int) RETURNS ARRAY. With this possibility is easy to write functions that you need, and is not necessary to use 2d arrays. If the result of regexp functions will be arrays of records, then a new unnest function is not necessary. So this is not a good direction. Instead of fixing core issues, we design workarounds. There can be more wide usage of arrays of composites.

Hm, I struggle to understand what your point is.
2D arrays already exist, and when having to deal with them, I think unnest(anyarray,int) would improve the situation.
Now, there might be other situations like you describe where something else than 2D arrays are preferred.
But this doesn't change the fact you sometimes have to deal with 2D arrays, in which case the proposed unnest(anyarray,int) would improve the user-experience a lot, when wanting to unnest just one level (or N levels).

Sounds like you are suggesting some other improvements, in addition to the proposed unnest(anyarray,int)? Correct?

A regexp_positions() returning setof 2-D array[] would not be a workaround, in my opinion,
it would be what I actually want, but only if I also get unnest(anyarray,int), then I'm perfectly happy.

/Joel

#39Pavel Stehule
pavel.stehule@gmail.com
In reply to: Joel Jacobson (#38)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

út 9. 3. 2021 v 9:01 odesílatel Joel Jacobson <joel@compiler.org> napsal:

On Tue, Mar 9, 2021, at 08:26, Pavel Stehule wrote:

there are two ideas:

1. the behaviour can be same like SLICE clause of FOREACH statement

Hm, I'm sorry I don't understand, is there an existing SLICE clause?
I get syntax error in HEAD:

ERROR: syntax error at or near "$2"
LINE 5: FOREACH r SLICE $2 IN ARRAY $1 --- now $2 should be consta...

Or do you mean you suggest adding such a clause?

https://www.postgresql.org/docs/current/plpgsql-control-structures.html#PLPGSQL-FOREACH-ARRAY

but the SLICE argument should be constant. But this limit is artificial,
just for implementation simplicity. Important is behaviour.

2. use unnest_slice as name - the function "unnest" is relatively rich
today and using other overloading doesn't look too practical.

Hm, rich in what way? There is currently only one version for arrays, and
a different one for tsvector.

no, there is possible to unnest more arrays once

But this is just an idea. I can imagine more forms of slicing or
unnesting, so it can be practical to use different names than just "unnest".

Personally I don't like too much using 2D arrays for this purpose. The
queries over this functionality will be harder to read (it is like fortran
77). I understand so now, there is no other possibility, because pg cannot
build array type from function signature. So it is harder to build an array
of record types.

We can make an easy tuple store of records - like FUNCTION fx(OUT a int,
OUT b int) RETURNS SETOF RECORD. But now, thanks to Tom and Amit's work,
the simple expression evaluation is significantly faster than SQL
evaluation. So using any SRF function has performance impact. What I miss
is the possibility to write functions like FUNCTION fx(OUT a int, OUT b
int) RETURNS ARRAY. With this possibility is easy to write functions that
you need, and is not necessary to use 2d arrays. If the result of regexp
functions will be arrays of records, then a new unnest function is not
necessary. So this is not a good direction. Instead of fixing core issues,
we design workarounds. There can be more wide usage of arrays of composites.

Hm, I struggle to understand what your point is.
2D arrays already exist, and when having to deal with them, I think
unnest(anyarray,int) would improve the situation.

I cannot find any function in Postgres that returns a 2D array now.

For me - using 2D arrays is not a win. It is not a bad solution, but I
cannot say, so I like it, because it is not a good solution. For example,
you cannot enhance this functionality about returning searched substring.
So you need to repeat searching. I have bad experience with using arrays in
this style. Sometimes it is necessary, because external interfaces cannot
work with composites, but the result is unreadable. So this is the reason
for my opinion.

about unnest_2d .. probably it can be used for some cases when users cannot
use composites on the client side. But now, because they can use FOREACH
SLICE is not problem to write any custom function like they exactly need.
And in this case there is very low overhead of plpgsql. But it is true, so
this function can be used for some vector unnesting.

Now, there might be other situations like you describe where something else

than 2D arrays are preferred.
But this doesn't change the fact you sometimes have to deal with 2D
arrays, in which case the proposed unnest(anyarray,int) would improve the
user-experience a lot, when wanting to unnest just one level (or N levels).

Sounds like you are suggesting some other improvements, in addition to the
proposed unnest(anyarray,int)? Correct?

A regexp_positions() returning setof 2-D array[] would not be a
workaround, in my opinion,
it would be what I actually want, but only if I also get
unnest(anyarray,int), then I'm perfectly happy.

We are talking about design, not about usage. try to write some examples of
usage, please?

Regards

Pavel

Show quoted text

/Joel

#40Joel Jacobson
joel@compiler.org
In reply to: Pavel Stehule (#39)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Mar 9, 2021, at 09:29, Pavel Stehule wrote:

https://www.postgresql.org/docs/current/plpgsql-control-structures.html#PLPGSQL-FOREACH-ARRAY

but the SLICE argument should be constant. But this limit is artificial, just for implementation simplicity. Important is behaviour.

I see now what you mean. Yes, being able to specify the SLICE argument as a variable instead of a constant would be a good improvement. Maybe the SLICE implementation from PL/pgSQL could be modified and used for both cases? (Both in the C-version unnest() and in PL/pgSQL to allow variables and not just constants to SLICE)

2. use unnest_slice as name - the function "unnest" is relatively rich today and using other overloading doesn't look too practical.

Hm, rich in what way? There is currently only one version for arrays, and a different one for tsvector.

no, there is possible to unnest more arrays once

What do you mean?
More than one unnest() in the same query, e.g. SELECT unnest(..), unnest(..)?

I cannot find any function in Postgres that returns a 2D array now.

For me - using 2D arrays is not a win. It is not a bad solution, but I cannot say, so I like it, because it is not a good solution. For example, you cannot enhance this functionality about returning searched substring. So you need to repeat searching. I have bad experience with using arrays in this style. Sometimes it is necessary, because external interfaces cannot work with composites, but the result is unreadable. So this is the reason for my opinion.

Even if it would return arrays of a range record with "start" and "stop" field, I don't see how we could enhance it to later return searched substring without changing the return type? Doing so would break any code using the function anyway.

Repeating searching if you want something else than positions, seems like the most SQL-idiomatic solution.

/Joel

#41Pavel Stehule
pavel.stehule@gmail.com
In reply to: Joel Jacobson (#40)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

út 9. 3. 2021 v 10:01 odesílatel Joel Jacobson <joel@compiler.org> napsal:

On Tue, Mar 9, 2021, at 09:29, Pavel Stehule wrote:

https://www.postgresql.org/docs/current/plpgsql-control-structures.html#PLPGSQL-FOREACH-ARRAY

but the SLICE argument should be constant. But this limit is artificial,
just for implementation simplicity. Important is behaviour.

I see now what you mean. Yes, being able to specify the SLICE argument as
a variable instead of a constant would be a good improvement. Maybe the
SLICE implementation from PL/pgSQL could be modified and used for both
cases? (Both in the C-version unnest() and in PL/pgSQL to allow variables
and not just constants to SLICE)

probably

2. use unnest_slice as name - the function "unnest" is relatively rich
today and using other overloading doesn't look too practical.

Hm, rich in what way? There is currently only one version for arrays, and
a different one for tsvector.

no, there is possible to unnest more arrays once

What do you mean?
More than one unnest() in the same query, e.g. SELECT unnest(..),
unnest(..)?

you can do unnest(array1, array2, ...)

I cannot find any function in Postgres that returns a 2D array now.

For me - using 2D arrays is not a win. It is not a bad solution, but I
cannot say, so I like it, because it is not a good solution. For example,
you cannot enhance this functionality about returning searched substring.
So you need to repeat searching. I have bad experience with using arrays in
this style. Sometimes it is necessary, because external interfaces cannot
work with composites, but the result is unreadable. So this is the reason
for my opinion.

Even if it would return arrays of a range record with "start" and "stop"
field, I don't see how we could enhance it to later return searched
substring without changing the return type? Doing so would break any code
using the function anyway.

you can have composite (position int, value text) or (position int,
offset_bytes int, size_char int, size_bytes int), ... just there are more
possibilities

Repeating searching if you want something else than positions, seems like
the most SQL-idiomatic solution.

:)

yes, but usually you can use index on bigger data. Substring searching is
slower, and we use mostly UTF8, so if start is not in bytes, then you have
to iterate from start of string to find start of substring.

Show quoted text

/Joel

#42Joel Jacobson
joel@compiler.org
In reply to: Pavel Stehule (#41)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Mar 9, 2021, at 10:18, Pavel Stehule wrote:

What do you mean?

More than one unnest() in the same query, e.g. SELECT unnest(..), unnest(..)?

you can do unnest(array1, array2, ...)

Right, I had forgotten about that variant.

But isn't this a bit surprising then:

\df unnest
List of functions
Schema | Name | Result data type | Argument data types | Type
------------+--------+------------------+----------------------------------------------------------------------------------+------
pg_catalog | unnest | SETOF anyelement | anyarray | func
pg_catalog | unnest | SETOF record | tsvector tsvector, OUT lexeme text, OUT positions smallint[], OUT weights text[] | func
(2 rows)

Should there be an entry there showing the VARIADIC anyelement version as well?

I know it's a documented feature, but \df seems out-of-sync with the docs.

/Joel

#43Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#14)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Thu, Mar 4, 2021, at 16:40, Tom Lane wrote:

My experience with working with parallel arrays in SQL has been unpleasant.

Could you please give an example on such an unpleasant experience?

I can see a problem if the arrays could possibly have difference dimensionality/cardinality,
but regexp_positions() could guarantee they won't, so I don't see a problem here,
but there is probably something I'm missing here?

/Joel

#44Pavel Stehule
pavel.stehule@gmail.com
In reply to: Joel Jacobson (#43)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

út 9. 3. 2021 v 11:32 odesílatel Joel Jacobson <joel@compiler.org> napsal:

On Thu, Mar 4, 2021, at 16:40, Tom Lane wrote:

My experience with working with parallel arrays in SQL has been unpleasant.

Could you please give an example on such an unpleasant experience?

it was more complex application with 3D data of some points in 2D array.
Everywhere was a[d, 0], a[d, 1], a[d, 2], instead a[d] or instead a[d].x,
...

I can see a problem if the arrays could possibly have difference
dimensionality/cardinality,
but regexp_positions() could guarantee they won't, so I don't see a
problem here,
but there is probably something I'm missing here?

I think so the functions based on arrays can work, why not. But the
semantic is lost.

Show quoted text

/Joel

#45Joel Jacobson
joel@compiler.org
In reply to: Pavel Stehule (#44)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Mar 9, 2021, at 13:16, Pavel Stehule wrote:

út 9. 3. 2021 v 11:32 odesílatel Joel Jacobson <joel@compiler.org> napsal:

__On Thu, Mar 4, 2021, at 16:40, Tom Lane wrote:

My experience with working with parallel arrays in SQL has been unpleasant.

Could you please give an example on such an unpleasant experience?

it was more complex application with 3D data of some points in 2D array. Everywhere was a[d, 0], a[d, 1], a[d, 2], instead a[d] or instead a[d].x, ...

Not sure I understand, but my question was directed to Tom, who wrote about his experiences in a previous message up thread.

Tom - can you please give details on your unpleasant experiences with parallel arrays?

/Joel

#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#42)
Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

"Joel Jacobson" <joel@compiler.org> writes:

On Tue, Mar 9, 2021, at 10:18, Pavel Stehule wrote:

you can do unnest(array1, array2, ...)

Right, I had forgotten about that variant.
But isn't this a bit surprising then:
...
Should there be an entry there showing the VARIADIC anyelement version as well?

No, because there's no such pg_proc entry. Multi-argument UNNEST is
special-cased by the parser, cf transformRangeFunction().

(Which is something I'd momentarily forgotten. Forget my suggestion
that we could define unnest(anyarray, int) ... it has to be another
name.)

regards, tom lane

#47Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#45)
Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

"Joel Jacobson" <joel@compiler.org> writes:

Tom - can you please give details on your unpleasant experiences with parallel arrays?

The problems I can recall running into were basically down to not having
an easy way to iterate through parallel arrays. There are ways to do
that in SQL, certainly, but they all constrain how you write the query,
and usually force ugly stuff like splitting it into sub-selects.

As an example, presuming that regexp_positions is defined along the
lines of

regexp_positions(str text, pat text, out starts int[], out lengths int[])
returns setof record

then to actually get the identified substrings you'd have to do something
like

select
substring([input string] from starts[i] for lengths[i])
from
regexp_positions([input string], [pattern]) r,
lateral
generate_series(1, array_length(starts, 1)) i;

I think the last time I confronted this, we didn't have multi-array
UNNEST. Now that we do, we can get rid of the generate_series(),
but it's still not beautiful:

select
substring([input string] from s for l)
from
regexp_positions([input string], [pattern]) r,
lateral
unnest(starts, lengths) u(s,l);

Having said that, the other alternative with a 2-D array:

regexp_positions(str text, pat text) returns setof int[]

seems to still need UNNEST, though now it's not the magic multi-array
UNNEST but this slicing version:

select
substring([input string] from u[1] for u[2])
from
regexp_positions([input string], [pattern]) r,
lateral
unnest_slice(r, 1) u;

Anyway, I'd counsel trying to write out SQL implementations
of regexp_matches() and other useful things based on any
particular regexp_positions() API you might be thinking about.
Can we do anything useful without a LATERAL UNNEST thingie?
Are some of them more legible than others?

regards, tom lane

#48Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#47)
1 attachment(s)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Mar 9, 2021, at 17:42, Tom Lane wrote:

"Joel Jacobson" <joel@compiler.org> writes:

Tom - can you please give details on your unpleasant experiences with parallel arrays?

The problems I can recall running into were basically down to not having
an easy way to iterate through parallel arrays. There are ways to do
that in SQL, certainly, but they all constrain how you write the query,
and usually force ugly stuff like splitting it into sub-selects.

I see now what you mean, many thanks for explaining.

As an example, presuming that regexp_positions is defined along the
lines of

regexp_positions(str text, pat text, out starts int[], out lengths int[])
returns setof record

+1

I think this is the most feasible best option so far.

Attached is a patch implementing it this way.

I changed the start to begin at 1, since this is how position ( substring text IN string text ) → integer works.

SELECT * FROM regexp_positions('foobarbequebaz', '^', 'g');
starts | lengths
--------+---------
{1} | {0}
(1 row)

SELECT * FROM regexp_positions('foobarbequebaz', 'ba.', 'g');
starts | lengths
--------+---------
{4} | {3}
{12} | {3}
(2 rows)

Mark's examples:

SELECT * FROM regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
starts | lengths
--------+---------
{7} | {0}
(1 row)

SELECT * FROM regexp_positions('foobarbequebaz', '(?<=z)', 'g');
starts | lengths
--------+---------
{15} | {0}
(1 row)

I've also tested your template queries:

then to actually get the identified substrings you'd have to do something
like

select
substring([input string] from starts[i] for lengths[i])
from
regexp_positions([input string], [pattern]) r,
lateral
generate_series(1, array_length(starts, 1)) i;

select
substring('foobarbequebaz' from starts[i] for lengths[i])
from
regexp_positions('foobarbequebaz', 'ba.', 'g') r,
lateral
generate_series(1, array_length(starts, 1)) i;

substring
-----------
bar
baz
(2 rows)

I think the last time I confronted this, we didn't have multi-array
UNNEST. Now that we do, we can get rid of the generate_series(),
but it's still not beautiful:

select
substring([input string] from s for l)
from
regexp_positions([input string], [pattern]) r,
lateral
unnest(starts, lengths) u(s,l);

select
substring('foobarbequebaz' from s for l)
from
regexp_positions('foobarbequebaz', 'ba.', 'g') r,
lateral
unnest(starts, lengths) u(s,l);

substring
-----------
bar
baz
(2 rows)

Having said that, the other alternative with a 2-D array:

regexp_positions(str text, pat text) returns setof int[]

seems to still need UNNEST, though now it's not the magic multi-array
UNNEST but this slicing version:

select
substring([input string] from u[1] for u[2])
from
regexp_positions([input string], [pattern]) r,
lateral
unnest_slice(r, 1) u;

Unable to test this one since there is no unnest_slice() (yet)

Anyway, I'd counsel trying to write out SQL implementations
of regexp_matches() and other useful things based on any
particular regexp_positions() API you might be thinking about.
Can we do anything useful without a LATERAL UNNEST thingie?
Are some of them more legible than others?

Hmm, I cannot think of a way.

/Joel

Attachments:

0004-regexp-positions.patchapplication/octet-stream; name=0004-regexp-positions.patchDownload
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index ece09699ef..848037a7d8 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3139,6 +3139,48 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
        </para></entry>
       </row>
 
+      <row>
+        <entry role="func_table_entry"><para role="func_signature">
+         <indexterm>
+          <primary>regexp_positions</primary>
+         </indexterm>
+         <function>regexp_positions</function> ( <parameter>string</parameter> <type>text</type>, <parameter>pattern</parameter> <type>text</type> [, <parameter>flags</parameter> <type>text</type> ] )
+        <returnvalue>setof record</returnvalue>
+        ( <parameter>starts</parameter> <type>integer[]</type>,
+        <parameter>lengths</parameter> <type>integer[]</type> )
+        </para>
+        <para>
+         Returns start positions and lengths of captured substring(s) resulting from matching a POSIX regular
+         expression to the <parameter>string</parameter>; see
+         <xref linkend="functions-posix-regexp"/>.
+        </para>
+        <para>
+         <literal>regexp_positions('foobarbequebaz', 'ba.', 'g')</literal>
+         <returnvalue></returnvalue>
+ <programlisting>
+  ({4},{3})
+  ({12},{3})
+ </programlisting>
+        </para></entry>
+       </row>
+       <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>regexp_replace</primary>
+        </indexterm>
+        <function>regexp_replace</function> ( <parameter>string</parameter> <type>text</type>, <parameter>pattern</parameter> <type>text</type>, <parameter>replacement</parameter> <type>text</type> [, <parameter>flags</parameter> <type>text</type> ] )
+        <returnvalue>text</returnvalue>
+       </para>
+       <para>
+        Replaces substring(s) matching a POSIX regular expression; see
+        <xref linkend="functions-posix-regexp"/>.
+       </para>
+       <para>
+        <literal>regexp_replace('Thomas', '.[mN]a.', 'M')</literal>
+        <returnvalue>ThM</returnvalue>
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm>
diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index a32c5c82ab..e9cf41d09d 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -118,6 +118,7 @@ static regexp_matches_ctx *setup_regexp_matches(text *orig_str, text *pattern,
 												bool ignore_degenerate,
 												bool fetching_unmatched);
 static ArrayType *build_regexp_match_result(regexp_matches_ctx *matchctx);
+static Datum build_regexp_positions_result(regexp_matches_ctx *matchctx, TupleDesc tupdesc);
 static Datum build_regexp_split_result(regexp_matches_ctx *splitctx);
 
 
@@ -1056,6 +1057,66 @@ regexp_matches(PG_FUNCTION_ARGS)
 	SRF_RETURN_DONE(funcctx);
 }
 
+/*
+ * regexp_positions()
+ *		Return a setof record of positions where a pattern matches within a string.
+ */
+Datum
+regexp_positions(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	regexp_matches_ctx *matchctx;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		text	   *pattern = PG_GETARG_TEXT_PP(1);
+		text	   *flags = PG_GETARG_TEXT_PP_IF_EXISTS(2);
+		pg_re_flags re_flags;
+		MemoryContext oldcontext;
+		TupleDesc	tupdesc;
+
+		funcctx = SRF_FIRSTCALL_INIT();
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		tupdesc = CreateTemplateTupleDesc(2);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "starts",
+						   INT4ARRAYOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "lengths",
+						   INT4ARRAYOID, -1, 0);
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/* Determine options */
+		parse_re_flags(&re_flags, flags);
+
+		/* be sure to copy the input string into the multi-call ctx */
+		matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,
+										&re_flags,
+										PG_GET_COLLATION(),
+										true, false, false);
+
+		/* Pre-create workspace that build_regexp_positions_result needs */
+		matchctx->elems = (Datum *) palloc(sizeof(Datum) * matchctx->npatterns * 2);
+		matchctx->nulls = (bool *) palloc(sizeof(bool) * matchctx->npatterns);
+
+		MemoryContextSwitchTo(oldcontext);
+		funcctx->user_fctx = (void *) matchctx;
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	matchctx = (regexp_matches_ctx *) funcctx->user_fctx;
+
+	if (matchctx->next_match < matchctx->nmatches)
+	{
+		Datum tuple;
+
+		tuple = build_regexp_positions_result(matchctx, funcctx->tuple_desc);
+		matchctx->next_match++;
+		SRF_RETURN_NEXT(funcctx, tuple);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
 /* This is separate to keep the opr_sanity regression test from complaining */
 Datum
 regexp_matches_no_flags(PG_FUNCTION_ARGS)
@@ -1063,6 +1124,13 @@ regexp_matches_no_flags(PG_FUNCTION_ARGS)
 	return regexp_matches(fcinfo);
 }
 
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_positions_no_flags(PG_FUNCTION_ARGS)
+{
+	return regexp_positions(fcinfo);
+}
+
 /*
  * setup_regexp_matches --- do the initial matching for regexp_match
  *		and regexp_split functions
@@ -1332,6 +1400,59 @@ build_regexp_match_result(regexp_matches_ctx *matchctx)
 							  TEXTOID, -1, false, TYPALIGN_INT);
 }
 
+/*
+ * build_regexp_positions_result - build output array for current match
+ */
+static Datum
+build_regexp_positions_result(regexp_matches_ctx *matchctx, TupleDesc	tupdesc)
+{
+	Datum	   *elems = matchctx->elems;
+	bool	   *nulls = matchctx->nulls;
+	int			dims[1];
+	int			lbs[1];
+	int			loc;
+	int			i;
+	ArrayType  *so_ary;
+	ArrayType  *eo_ary;
+	Datum		values[2];
+	bool		tuple_nulls[2] = {false, false};
+	HeapTuple	tuple;
+
+	/* Extract matching positions from the original string */
+	loc = matchctx->next_match * matchctx->npatterns * 2;
+	for (i = 0; i < matchctx->npatterns; i++)
+	{
+		int			so = matchctx->match_locs[loc++];
+		int			eo = matchctx->match_locs[loc++];
+
+		if (so < 0 || eo < 0)
+		{
+			elems[i] = (Datum) 0;
+			elems[i + matchctx->npatterns] = (Datum) 0;
+			nulls[i] = true;
+		}
+		else
+		{
+			elems[i] = Int32GetDatum(so + 1);
+			elems[i + matchctx->npatterns] = Int32GetDatum(eo - so);
+			nulls[i] = false;
+		}
+	}
+
+	/* And form two arrays */
+	dims[0] = matchctx->npatterns;
+	lbs[0] = 1;
+	so_ary = construct_md_array(elems, nulls, 1, dims, lbs,
+							  INT4OID, 4, true, TYPALIGN_INT);
+	eo_ary = construct_md_array(elems + matchctx->npatterns, nulls, 1, dims, lbs,
+							  INT4OID, 4, true, TYPALIGN_INT);
+	values[0] = PointerGetDatum(so_ary);
+	values[1] = PointerGetDatum(eo_ary);
+	tuple = heap_form_tuple(tupdesc, values, tuple_nulls);
+	return HeapTupleGetDatum(tuple);
+
+}
+
 /*
  * regexp_split_to_table()
  *		Split the string at matches of the pattern, returning the
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index c7619f8cd3..055dd352f4 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3560,6 +3560,23 @@
   proname => 'regexp_matches', prorows => '10', proretset => 't',
   prorettype => '_text', proargtypes => 'text text text',
   prosrc => 'regexp_matches' },
+
+{ oid => '8104', descr => 'find matching position(s) for regexp',
+  proname => 'regexp_positions', prorows => '10',
+  proretset => 't', prorettype => 'record',
+  proargtypes => 'text text', proallargtypes => '{text,text,_int4,_int4}',
+  proargmodes => '{i,i,o,o}',
+  proargnames => '{string,pattern,starts,lengths}',
+  prosrc => 'regexp_positions_no_flags' },
+
+{ oid => '8105', descr => 'find matching position(s) for regexp',
+  proname => 'regexp_positions', prorows => '10',
+  proretset => 't', prorettype => 'record',
+  proargtypes => 'text text text', proallargtypes => '{text,text,text,_int4,_int4}',
+  proargmodes => '{i,i,i,o,o}',
+  proargnames => '{string,pattern,flags,starts,lengths}',
+  prosrc => 'regexp_positions' },
+
 { oid => '2088', descr => 'split string by field_sep and return field_num',
   proname => 'split_part', prorettype => 'text',
   proargtypes => 'text text int4', prosrc => 'split_part' },
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index fb4573d85f..6349044950 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -601,6 +601,12 @@ SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$);
  {bar,beque}
 (1 row)
 
+SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
+ regexp_positions  
+-------------------
+ ("{4,7}","{3,5}")
+(1 row)
+
 -- test case insensitive
 SELECT regexp_matches('foObARbEqUEbAz', $re$(bar)(beque)$re$, 'i');
  regexp_matches 
@@ -616,6 +622,13 @@ SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g')
  {bazil,barf}
 (2 rows)
 
+SELECT regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
+  regexp_positions   
+---------------------
+ ("{4,7}","{3,5}")
+ ("{12,17}","{5,4}")
+(2 rows)
+
 -- empty capture group (matched empty string)
 SELECT regexp_matches('foobarbequebaz', $re$(bar)(.*)(beque)$re$);
  regexp_matches 
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index 57a48c9d0b..aa8b0553f0 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -198,12 +198,14 @@ SELECT regexp_replace('AAA aaa', 'A+', 'Z', 'z');
 
 -- return all matches from regexp
 SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$);
+SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
 
 -- test case insensitive
 SELECT regexp_matches('foObARbEqUEbAz', $re$(bar)(beque)$re$, 'i');
 
 -- global option - more than one match
 SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
+SELECT regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
 
 -- empty capture group (matched empty string)
 SELECT regexp_matches('foobarbequebaz', $re$(bar)(.*)(beque)$re$);
#49Daniel Gustafsson
daniel@yesql.se
In reply to: Joel Jacobson (#48)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On 9 Mar 2021, at 20:30, Joel Jacobson <joel@compiler.org> wrote:

Attached is a patch implementing it this way.

This patch no longer applies, can you please submit a rebased version?

--
Daniel Gustafsson https://vmware.com/

#50Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Gustafsson (#49)
Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Daniel Gustafsson <daniel@yesql.se> writes:

On 9 Mar 2021, at 20:30, Joel Jacobson <joel@compiler.org> wrote:
Attached is a patch implementing it this way.

This patch no longer applies, can you please submit a rebased version?

Also, since 642433707 ("This patch adds new functions regexp_count(),
regexp_instr(), regexp_like(), and regexp_substr(), and extends
regexp_replace() with some new optional arguments") is already in,
we need to think about how this interacts with that. Do we even
still need any more functionality in this area? Should we try to
align the APIs?

Those new function APIs have some Oracle-isms that I don't especially
care for, like use of int for what should be a boolean. Still, users
aren't going to give us a pass for wildly inconsistent APIs just because
some functions came from Oracle and some didn't.

regards, tom lane

#51Daniel Gustafsson
daniel@yesql.se
In reply to: Tom Lane (#50)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On 1 Sep 2021, at 16:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Daniel Gustafsson <daniel@yesql.se> writes:

On 9 Mar 2021, at 20:30, Joel Jacobson <joel@compiler.org> wrote:
Attached is a patch implementing it this way.

This patch no longer applies, can you please submit a rebased version?

On a brief skim, this patch includes the doc stanza for regexp_replace which I
assume is a copy/pasteo.

+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "starts”,
While “start_positions” is awfully verbose, just “starts” doesn’t really roll
off the tongue. Perhaps “positions” would be more explanatory?

Also, since 642433707 ("This patch adds new functions regexp_count(),
regexp_instr(), regexp_like(), and regexp_substr(), and extends
regexp_replace() with some new optional arguments") is already in,
we need to think about how this interacts with that. Do we even
still need any more functionality in this area? Should we try to
align the APIs?

I can see value in a function like this one, and the API is AFAICT fairly
aligned with what I as a user would expect it to be given what we already have.

--
Daniel Gustafsson https://vmware.com/

#52Joel Jacobson
joel@compiler.org
In reply to: Daniel Gustafsson (#51)
1 attachment(s)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Thu, Sep 2, 2021, at 00:03, Daniel Gustafsson wrote:

I can see value in a function like this one, and the API is AFAICT fairly
aligned with what I as a user would expect it to be given what we already have.

Good to hear and thanks for looking at this patch.

I've fixed the problem due to the recent change in setup_regexp_matches(),
which added a new int parameter "start_search".
I pass 0 as start_search, which I think should give the same behaviour as before.

I also changed the assigned oid values in pg_proc.dat for the two new regexp_positions() catalog functions.

$ make check

=======================
All 209 tests passed.
=======================

/Joel

Attachments:

0005-regexp-positions.patchapplication/octet-stream; name=0005-regexp-positions.patchDownload
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 78812b2dbe..1cee8bdabe 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3130,6 +3130,48 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
        </para></entry>
       </row>
 
+      <row>
+        <entry role="func_table_entry"><para role="func_signature">
+         <indexterm>
+          <primary>regexp_positions</primary>
+         </indexterm>
+         <function>regexp_positions</function> ( <parameter>string</parameter> <type>text</type>, <parameter>pattern</parameter> <type>text</type> [, <parameter>flags</parameter> <type>text</type> ] )
+        <returnvalue>setof record</returnvalue>
+        ( <parameter>starts</parameter> <type>integer[]</type>,
+        <parameter>lengths</parameter> <type>integer[]</type> )
+        </para>
+        <para>
+         Returns start positions and lengths of captured substring(s) resulting from matching a POSIX regular
+         expression to the <parameter>string</parameter>; see
+         <xref linkend="functions-posix-regexp"/>.
+        </para>
+        <para>
+         <literal>regexp_positions('foobarbequebaz', 'ba.', 'g')</literal>
+         <returnvalue></returnvalue>
+ <programlisting>
+  ({4},{3})
+  ({12},{3})
+ </programlisting>
+        </para></entry>
+       </row>
+       <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>regexp_replace</primary>
+        </indexterm>
+        <function>regexp_replace</function> ( <parameter>string</parameter> <type>text</type>, <parameter>pattern</parameter> <type>text</type>, <parameter>replacement</parameter> <type>text</type> [, <parameter>flags</parameter> <type>text</type> ] )
+        <returnvalue>text</returnvalue>
+       </para>
+       <para>
+        Replaces substring(s) matching a POSIX regular expression; see
+        <xref linkend="functions-posix-regexp"/>.
+       </para>
+       <para>
+        <literal>regexp_replace('Thomas', '.[mN]a.', 'M')</literal>
+        <returnvalue>ThM</returnvalue>
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm>
diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index 3b7a607437..6189a815a3 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -119,6 +119,7 @@ static regexp_matches_ctx *setup_regexp_matches(text *orig_str, text *pattern,
 												bool ignore_degenerate,
 												bool fetching_unmatched);
 static ArrayType *build_regexp_match_result(regexp_matches_ctx *matchctx);
+static Datum build_regexp_positions_result(regexp_matches_ctx *matchctx, TupleDesc tupdesc);
 static Datum build_regexp_split_result(regexp_matches_ctx *splitctx);
 
 
@@ -1371,6 +1372,66 @@ regexp_matches(PG_FUNCTION_ARGS)
 	SRF_RETURN_DONE(funcctx);
 }
 
+/*
+ * regexp_positions()
+ *		Return a setof record of positions where a pattern matches within a string.
+ */
+Datum
+regexp_positions(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	regexp_matches_ctx *matchctx;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		text	   *pattern = PG_GETARG_TEXT_PP(1);
+		text	   *flags = PG_GETARG_TEXT_PP_IF_EXISTS(2);
+		pg_re_flags re_flags;
+		MemoryContext oldcontext;
+		TupleDesc	tupdesc;
+
+		funcctx = SRF_FIRSTCALL_INIT();
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		tupdesc = CreateTemplateTupleDesc(2);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "starts",
+						   INT4ARRAYOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "lengths",
+						   INT4ARRAYOID, -1, 0);
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/* Determine options */
+		parse_re_flags(&re_flags, flags);
+
+		/* be sure to copy the input string into the multi-call ctx */
+		matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,
+										&re_flags, 0,
+										PG_GET_COLLATION(),
+										true, false, false);
+
+		/* Pre-create workspace that build_regexp_positions_result needs */
+		matchctx->elems = (Datum *) palloc(sizeof(Datum) * matchctx->npatterns * 2);
+		matchctx->nulls = (bool *) palloc(sizeof(bool) * matchctx->npatterns);
+
+		MemoryContextSwitchTo(oldcontext);
+		funcctx->user_fctx = (void *) matchctx;
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	matchctx = (regexp_matches_ctx *) funcctx->user_fctx;
+
+	if (matchctx->next_match < matchctx->nmatches)
+	{
+		Datum tuple;
+
+		tuple = build_regexp_positions_result(matchctx, funcctx->tuple_desc);
+		matchctx->next_match++;
+		SRF_RETURN_NEXT(funcctx, tuple);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
 /* This is separate to keep the opr_sanity regression test from complaining */
 Datum
 regexp_matches_no_flags(PG_FUNCTION_ARGS)
@@ -1378,6 +1439,13 @@ regexp_matches_no_flags(PG_FUNCTION_ARGS)
 	return regexp_matches(fcinfo);
 }
 
+/* This is separate to keep the opr_sanity regression test from complaining */
+Datum
+regexp_positions_no_flags(PG_FUNCTION_ARGS)
+{
+	return regexp_positions(fcinfo);
+}
+
 /*
  * setup_regexp_matches --- do the initial matching for regexp_match,
  *		regexp_split, and related functions
@@ -1653,6 +1721,59 @@ build_regexp_match_result(regexp_matches_ctx *matchctx)
 							  TEXTOID, -1, false, TYPALIGN_INT);
 }
 
+/*
+ * build_regexp_positions_result - build output array for current match
+ */
+static Datum
+build_regexp_positions_result(regexp_matches_ctx *matchctx, TupleDesc	tupdesc)
+{
+	Datum	   *elems = matchctx->elems;
+	bool	   *nulls = matchctx->nulls;
+	int			dims[1];
+	int			lbs[1];
+	int			loc;
+	int			i;
+	ArrayType  *so_ary;
+	ArrayType  *eo_ary;
+	Datum		values[2];
+	bool		tuple_nulls[2] = {false, false};
+	HeapTuple	tuple;
+
+	/* Extract matching positions from the original string */
+	loc = matchctx->next_match * matchctx->npatterns * 2;
+	for (i = 0; i < matchctx->npatterns; i++)
+	{
+		int			so = matchctx->match_locs[loc++];
+		int			eo = matchctx->match_locs[loc++];
+
+		if (so < 0 || eo < 0)
+		{
+			elems[i] = (Datum) 0;
+			elems[i + matchctx->npatterns] = (Datum) 0;
+			nulls[i] = true;
+		}
+		else
+		{
+			elems[i] = Int32GetDatum(so + 1);
+			elems[i + matchctx->npatterns] = Int32GetDatum(eo - so);
+			nulls[i] = false;
+		}
+	}
+
+	/* And form two arrays */
+	dims[0] = matchctx->npatterns;
+	lbs[0] = 1;
+	so_ary = construct_md_array(elems, nulls, 1, dims, lbs,
+							  INT4OID, 4, true, TYPALIGN_INT);
+	eo_ary = construct_md_array(elems + matchctx->npatterns, nulls, 1, dims, lbs,
+							  INT4OID, 4, true, TYPALIGN_INT);
+	values[0] = PointerGetDatum(so_ary);
+	values[1] = PointerGetDatum(eo_ary);
+	tuple = heap_form_tuple(tupdesc, values, tuple_nulls);
+	return HeapTupleGetDatum(tuple);
+
+}
+
 /*
  * regexp_split_to_table()
  *		Split the string at matches of the pattern, returning the
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d068d6532e..8c2daa20da 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3643,6 +3643,20 @@
 { oid => '9629', descr => 'extract substring that matches regexp',
   proname => 'regexp_substr', prorettype => 'text',
   proargtypes => 'text text int4 int4 text int4', prosrc => 'regexp_substr' },
+{ oid => '9630', descr => 'find matching position(s) for regexp',
+  proname => 'regexp_positions', prorows => '10',
+  proretset => 't', prorettype => 'record',
+  proargtypes => 'text text', proallargtypes => '{text,text,_int4,_int4}',
+  proargmodes => '{i,i,o,o}',
+  proargnames => '{string,pattern,starts,lengths}',
+  prosrc => 'regexp_positions_no_flags' },
+{ oid => '9631', descr => 'find matching position(s) for regexp',
+  proname => 'regexp_positions', prorows => '10',
+  proretset => 't', prorettype => 'record',
+  proargtypes => 'text text text', proallargtypes => '{text,text,text,_int4,_int4}',
+  proargmodes => '{i,i,i,o,o}',
+  proargnames => '{string,pattern,flags,starts,lengths}',
+  prosrc => 'regexp_positions' },
 { oid => '2088', descr => 'split string by field_sep and return field_num',
   proname => 'split_part', prorettype => 'text',
   proargtypes => 'text text int4', prosrc => 'split_part' },
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 0f95b9400b..badcebcdb0 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -991,6 +991,12 @@ SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$);
  {bar,beque}
 (1 row)
 
+SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
+ regexp_positions  
+-------------------
+ ("{4,7}","{3,5}")
+(1 row)
+
 -- test case insensitive
 SELECT regexp_matches('foObARbEqUEbAz', $re$(bar)(beque)$re$, 'i');
  regexp_matches 
@@ -1006,6 +1012,13 @@ SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g')
  {bazil,barf}
 (2 rows)
 
+SELECT regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
+  regexp_positions   
+---------------------
+ ("{4,7}","{3,5}")
+ ("{12,17}","{5,4}")
+(2 rows)
+
 -- empty capture group (matched empty string)
 SELECT regexp_matches('foobarbequebaz', $re$(bar)(.*)(beque)$re$);
  regexp_matches 
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index 8c379182cb..a99bc73ca3 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -292,12 +292,14 @@ SELECT regexp_substr('abcabcabc', 'a.c', 1, 1, '', -1);
 
 -- return all matches from regexp
 SELECT regexp_matches('foobarbequebaz', $re$(bar)(beque)$re$);
+SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
 
 -- test case insensitive
 SELECT regexp_matches('foObARbEqUEbAz', $re$(bar)(beque)$re$, 'i');
 
 -- global option - more than one match
 SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
+SELECT regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
 
 -- empty capture group (matched empty string)
 SELECT regexp_matches('foobarbequebaz', $re$(bar)(.*)(beque)$re$);
#53Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#52)
Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

"Joel Jacobson" <joel@compiler.org> writes:

[ 0005-regexp-positions.patch ]

I took a quick look at this patch. I am on board now with the general
idea of returning match positions, but I think there are still
definitional issues to be discussed.

1. The main idea we might perhaps want to adopt from the Oracle-ish
regexp functions is a search-start-position argument. I'm not sure
that this is exciting --- if we intend this function to be a close
analog of regexp_matches(), it'd be better to leave that off. But
it deserves explicit consideration.

2. It looks like you modeled this on regexp_matches() to the extent
of returning a set and using the 'g' flag to control whether the
set can actually contain more than one row. That's pretty ancient
precedent ... but we have revisited that design more than once
because regexp_matches() is just too much of a pain to use. I think
if we're going to do this, we should learn from history, and provide an
analog to regexp_match() as well as regexp_matches() right off the bat.

3. The API convention you chose (separate start and length arrays)
is perhaps still confusing. When I first looked at the test case

+SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
+ regexp_positions  
+-------------------
+ ("{4,7}","{3,5}")
+(1 row)

I thought it was all wrong because it seemed to be identifying
the substrings 'barbequ' and 'obarb'. If there'd been a different
number of matches than capture groups, maybe I'd not have been
confused, but still... I wonder if we'd be better advised to make
N capture groups produce N two-element arrays, or else mash it all
into one array of N*2 elements. But this probably depends on which
way is the easiest to work with in SQL.

4. Not sure about the handling of sub-matches.
There are various plausible definitions we could use:

* We return the position/length of the overall match, never mind
about any parenthesized subexpressions. This is simple but I think
it loses significant functionality. As an example, you might have
a pattern like 'ab*(c*)d+' where what you actually want to know
is where the 'c's are, but they have to be in the context described
by the rest of the regexp. Without subexpression-match capability
that's painful to do.

* If there's a capturing subexpression, return the position/length
of the first such subexpression, else return the overall match.
This matches the behavior of substring().

* If there are capturing subexpression(s), return the
positions/lengths of those, otherwise return the overall match.
This agrees with the behavior of regexp_match(es), so I'd tend
to lean to this option, but perhaps it's the hardest to use.

* Return the position/length of the overall match *and* those of
each capturing subexpression. This is the most flexible choice,
but I give it low marks since it matches no existing behaviors.

As for comments on the patch itself:

* The documentation includes an extraneous entry for regexp_replace,
and it fails to add the promised paragraph to functions-posix-regexp.

* This bit is evidently copied from regexp_matches:

+        /* be sure to copy the input string into the multi-call ctx */
+        matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,

regexp_matches needs to save the input string so that
build_regexp_match_result can copy parts of that. But
regexp_positions has no such need AFAICS, so I think we
don't need a long-lived copy of the string.

* I wouldn't use these names in build_regexp_positions_result:

+ ArrayType *so_ary;
+ ArrayType *eo_ary;

"so_ary" isn't awful, but it invites confusion with regex's "so"
field, which hasn't got the same semantics (off by one).
"eo_ary" is pretty bad because it isn't an "end offset" at all, but
a length. I'd go for "start_ary" and "length_ary" or some such.

* Test cases seem a bit thin, notably there's no coverage of the
null-subexpression code path.

regards, tom lane

#54Daniel Gustafsson
daniel@yesql.se
In reply to: Tom Lane (#53)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

This review has gone unanswered for two months, so I'm marking this patch
Returned with Feedback. Please feel free to resubmit when a new version of the
patch is available.

--
Daniel Gustafsson https://vmware.com/

#55Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#53)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Sep 21, 2021, at 19:55, Tom Lane wrote:

"Joel Jacobson" <joel@compiler.org> writes:

[ 0005-regexp-positions.patch ]

I took a quick look at this patch. I am on board now with the general
idea of returning match positions, but I think there are still
definitional issues to be discussed.

1. The main idea we might perhaps want to adopt from the Oracle-ish
regexp functions is a search-start-position argument. I'm not sure
that this is exciting --- if we intend this function to be a close
analog of regexp_matches(), it'd be better to leave that off. But
it deserves explicit consideration.

I intentionally made it a close analog of regexp_matches(),
to make it easy for existing users of regexp_matches() to
understand how regexp_positions() works.

2. It looks like you modeled this on regexp_matches() to the extent
of returning a set and using the 'g' flag to control whether the
set can actually contain more than one row. That's pretty ancient
precedent ... but we have revisited that design more than once
because regexp_matches() is just too much of a pain to use. I think
if we're going to do this, we should learn from history, and provide an
analog to regexp_match() as well as regexp_matches() right off the bat.

Yes, I modeled it on regexp_matches().
OK, so what you are suggesting is we should add both a regexp_position() function,
that would work like regexp_match() but return the position instead,
in addition to the already suggested regexp_positions() function.

That sounds like a good idea to me.

3. The API convention you chose (separate start and length arrays)
is perhaps still confusing. When I first looked at the test case

+SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
+ regexp_positions  
+-------------------
+ ("{4,7}","{3,5}")
+(1 row)

I thought it was all wrong because it seemed to be identifying
the substrings 'barbequ' and 'obarb'. If there'd been a different
number of matches than capture groups, maybe I'd not have been
confused, but still... I wonder if we'd be better advised to make
N capture groups produce N two-element arrays, or else mash it all
into one array of N*2 elements. But this probably depends on which
way is the easiest to work with in SQL.

The drawbacks of two-element arrays have already been discussed up thread.
Personally, I prefer the version suggested in the latest patch, and suggest we stick to it.

4. Not sure about the handling of sub-matches.
There are various plausible definitions we could use:

* We return the position/length of the overall match, never mind
about any parenthesized subexpressions. This is simple but I think
it loses significant functionality. As an example, you might have
a pattern like 'ab*(c*)d+' where what you actually want to know
is where the 'c's are, but they have to be in the context described
by the rest of the regexp. Without subexpression-match capability
that's painful to do.

* If there's a capturing subexpression, return the position/length
of the first such subexpression, else return the overall match.
This matches the behavior of substring().

* If there are capturing subexpression(s), return the
positions/lengths of those, otherwise return the overall match.
This agrees with the behavior of regexp_match(es), so I'd tend
to lean to this option, but perhaps it's the hardest to use.

* Return the position/length of the overall match *and* those of
each capturing subexpression. This is the most flexible choice,
but I give it low marks since it matches no existing behaviors.

The existing behaviour of regexp_match(es) is perhaps a bit surprising to new users,
but I think one quickly learn the semantics by trying out a few examples,
and once understood it's at least not something that bothers me personally.

I think it's best to let regexp_position(s) work the same way as regexp_match(es),
since otherwise users would have to learn and remember two different behaviours.

As for comments on the patch itself:

* The documentation includes an extraneous entry for regexp_replace,
and it fails to add the promised paragraph to functions-posix-regexp.

* This bit is evidently copied from regexp_matches:

+        /* be sure to copy the input string into the multi-call ctx */
+        matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,

regexp_matches needs to save the input string so that
build_regexp_match_result can copy parts of that. But
regexp_positions has no such need AFAICS, so I think we
don't need a long-lived copy of the string.

* I wouldn't use these names in build_regexp_positions_result:

+ ArrayType *so_ary;
+ ArrayType *eo_ary;

"so_ary" isn't awful, but it invites confusion with regex's "so"
field, which hasn't got the same semantics (off by one).
"eo_ary" is pretty bad because it isn't an "end offset" at all, but
a length. I'd go for "start_ary" and "length_ary" or some such.

* Test cases seem a bit thin, notably there's no coverage of the
null-subexpression code path.

Thanks for the comments on the patch itself.
I will work on fixing these, but perhaps we can first see if it's possible to reach a consensus on the API convention and behaviour.

/Joel