[PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Started by Joel Jacobsonabout 5 years ago55 messageshackers
Jump to latest
#1Joel Jacobson
joel@compiler.org

Hi,

I suggest adding a new function, regexp_positions(),
which works exactly like regexp_matches(),
except it returns int4range[] start/end positions for *where* the matches occurs.

I first thought I could live without this function,
and just get the positions using strpos(),
but as Andreas Karlsson kindly helped me understand,
that naive idea doesn't always work.

Andreas provided this pedagogic example
to demonstrate the problem:

SELECT regexp_matches('caaabaaa', '(?<=b)(a+)', 'g');
regexp_matches
----------------
{aaa}
(1 row)

If we would try to use strpos() to find the position,
based on the returned matched substring,
we would get the wrong answer:

SELECT strpos('caaabaaa','aaa');
strpos
--------
2
(1 row)

Sure, there is "aaa" at position 2,
but that's not where the match occurred,
since the (?<=b) means "positive lookbehind of the character b",
so the match actually occurred at position 6,
where there is a "b" before the "aaa".

Using regexp_positions(), we can now get the correct answer:

SELECT regexp_positions('caaabaaa', '(?<=b)(a+)', 'g');
regexp_positions
------------------
{"[6,9)"}
(1 row)

Some more examples from the regress/sql/strings.sql,
showing both regexp_matches() and regexp_positions()
for the same examples, as they both return the same structure,
but with different types:

SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
regexp_matches
----------------
{bar,beque}
{bazil,barf}
(2 rows)

SELECT regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
regexp_positions
-----------------------
{"[4,7)","[7,12)"}
{"[12,17)","[17,21)"}
(2 rows)

I've added documentation and tests.

Forgive me for just sending a patch without much discussion on the list,
but it was so easy to implement, so I thought an implementation can
help the discussion on if this is something we want or not.

A few words on the implementation:
I copied build_regexp_match_result() to a new function build_regexp_positions_result(),
and removed the string parts, replacing it with code to make ranges instead.
Maybe there are common parts that could be put into some helper-function,
but I think not, since the functions are two small for it to be worth it.

Thanks to David Fetter for the idea on using ranges.

Based on HEAD (f5a5773a9dc4185414fe538525e20d8512c2ba35).

/Joel

Attachments:

0001-regexp-positions.patchapplication/octet-stream; name=0001-regexp-positions.patchDownload+165-0
#2Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Joel Jacobson (#1)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Mar 1, 2021, at 9:38 AM, Joel Jacobson <joel@compiler.org> wrote:

Forgive me for just sending a patch without much discussion on the list,
but it was so easy to implement, so I thought an implementation can
help the discussion on if this is something we want or not.

I like the idea so I did a bit of testing. I think the following should not error, but does:

+SELECT regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
+ERROR:  range lower bound must be less than or equal to range upper bound


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Joel Jacobson
joel@compiler.org
In reply to: Mark Dilger (#2)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Mar 2, 2021, at 01:12, Mark Dilger wrote:

I like the idea so I did a bit of testing. I think the following should not error, but does:

+SELECT regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
+ERROR:  range lower bound must be less than or equal to range upper bound

Thanks for testing!

The bug is due to using an (inclusive,inclusive) range,
so for the example the code tried to construct a (7,6,'[]') range.

When trying to fix, I think I've found a general problem with ranges:

I'll use int4range() to demonstrate the problem:

First the expected error for what the patch tries to do internally using make_range().
This is all good:

# SELECT int4range(7,6,'[]');
ERROR: range lower bound must be less than or equal to range upper bound

I tried to fix this like this:

@ src/backend/utils/adt/regexp.c
-                       lower.val = Int32GetDatum(so + 1);
+                       lower.val = Int32GetDatum(so);
                        lower.infinite = false;
-                       lower.inclusive = true;
+                       lower.inclusive = false;
                        lower.lower = true;

Which would give the same result as doing:

SELECT int4range(6,6,'(]');
int4range
-----------
empty
(1 row)

Hmm. This "empty" value what surprise to me.
I would instead have assumed the canonical form "[7,7)".

If I wanted to know if the range is empty or not,
I would have guessed I should use the isempty() function, like this:

SELECT isempty(int4range(6,6,'(]'));
isempty
---------
t
(1 row)

Since we have this isempty() function, I don't see the point of discarding
the lower/upper vals, since they contain possibly interesting information
on where the empty range exists.

I find it strange two ranges of zero-length with different bounds are considered equal:

SELECT '[7,7)'::int4range = '[8,8)'::int4range;
?column?
----------
t
(1 row)

This seems like a bug to me. What am I missing here?

Unless fixed, then the way I see it, I don't think we can use int4range[] for regexp_positions(),
if we want to allow returning the positions for zero-length matches, which would be nice.

/Joel

#4Isaac Morland
isaac.morland@gmail.com
In reply to: Joel Jacobson (#3)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, 2 Mar 2021 at 00:06, Joel Jacobson <joel@compiler.org> wrote:

I find it strange two ranges of zero-length with different bounds are
considered equal:

SELECT '[7,7)'::int4range = '[8,8)'::int4range;
?column?
----------
t
(1 row)

This seems like a bug to me. What am I missing here?

Unless fixed, then the way I see it, I don't think we can use int4range[]
for regexp_positions(),
if we want to allow returning the positions for zero-length matches, which
would be nice.

Ranges are treated as sets. As such equality is defined by membership.

That being said, I agree that there may be situations in which it would be
convenient to have empty ranges at specific locations. Doing this would
introduce numerous questions which would have to be resolved. For example,
where/when is the empty range resulting from an intersection operation?

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#3)
Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

"Joel Jacobson" <joel@compiler.org> writes:

Unless fixed, then the way I see it, I don't think we can use int4range[] for regexp_positions(),

Yeah. It's a cute idea, but the semantics aren't quite right.

regards, tom lane

#6Joel Jacobson
joel@compiler.org
In reply to: Isaac Morland (#4)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Mar 2, 2021, at 06:22, Isaac Morland wrote:

On Tue, 2 Mar 2021 at 00:06, Joel Jacobson <joel@compiler.org> wrote:

I find it strange two ranges of zero-length with different bounds are considered equal:

SELECT '[7,7)'::int4range = '[8,8)'::int4range;
?column?
----------
t
(1 row)

This seems like a bug to me. What am I missing here?

Unless fixed, then the way I see it, I don't think we can use int4range[] for regexp_positions(),
if we want to allow returning the positions for zero-length matches, which would be nice.

Ranges are treated as sets. As such equality is defined by membership.

That being said, I agree that there may be situations in which it would be convenient to have empty ranges at specific locations. Doing this would introduce numerous questions which would have to be resolved. For example, where/when is the empty range resulting from an intersection operation?

Hmm, I think I would assume the intersection of two non-overlapping ranges to be isempty()=TRUE,
and for lower() and upper() to continue to return NULL.

But I think a zero-length range created with actual bounds should
return the lower() and upper() values during creation, instead of NULL.

I tried to find some other programming environments with range types.

The first one I found was Ada.

The below example is similar to int4range(7,6,'[]') which is invalid in PostgreSQL:

with Ada.Text_IO; use Ada.Text_IO;
procedure Hello is
type Foo is range 7 .. 6;
begin
Put_Line ( Foo'Image(Foo'First) );
Put_Line ( Foo'Image(Foo'Last) );
end Hello;

$ ./gnatmake hello.adb
$ ./hello
7
6

I Ada, the 'Range of the Empty_String is 1 .. 0
https://en.wikibooks.org/wiki/Ada_Programming/Types/array#Array_Attributes

I think there is a case for allowing access to the the lower/upper vals instead of returning NULL,
since we can do so without changing what isempty() would return for the same values,.

/Joel

#7Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#5)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Mar 2, 2021, at 06:31, Tom Lane wrote:

"Joel Jacobson" <joel@compiler.org> writes:

Unless fixed, then the way I see it, I don't think we can use int4range[] for regexp_positions(),

Yeah. It's a cute idea, but the semantics aren't quite right.

I think there is a case to allow creating empty ranges *with* bounds information, e.g. '[6,7)'::int4range,
as well as the current only possibility to create empty ranges *without* bounds information, e.g. 'empty'::int4range

I've had a look at how ranges are implemented,
and I think I've found a way to support this is a simple non-invasive way.

I've outlined the idea in a patch, which I will send separately,
as it's a different feature, possibly useful for purposes other than regexp_positions().

/Joel

#8Isaac Morland
isaac.morland@gmail.com
In reply to: Joel Jacobson (#6)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, 2 Mar 2021 at 00:52, Joel Jacobson <joel@compiler.org> wrote:

Ranges are treated as sets. As such equality is defined by membership.

That being said, I agree that there may be situations in which it would be
convenient to have empty ranges at specific locations. Doing this would
introduce numerous questions which would have to be resolved. For example,
where/when is the empty range resulting from an intersection operation?

Hmm, I think I would assume the intersection of two non-overlapping ranges
to be isempty()=TRUE,
and for lower() and upper() to continue to return NULL.

But I think a zero-length range created with actual bounds should
return the lower() and upper() values during creation, instead of NULL.

I tried to find some other programming environments with range types.

The first one I found was Ada.

Interesting!

Array indices are a bit different than general ranges however.

One question I would have is whether empty ranges are all equal to each
other. If they are, you have an equality that isn’t really equality; if
they aren’t then you would have ranges that are unequal even though they
have exactly the same membership. Although I suppose this is already true
for some types where ends can be specified as open or closed but end up
with the same end element; many range types canonicalize to avoid this but
I don’t think they all do.

Returning to the RE result issue, I wonder how much it actually matters
where any empty matches are. Certainly the actual contents of the match
don’t matter; you don’t need to be able to index into the string to extract
the substring. The only scenario I can see where it could matter is if the
RE is using lookahead or look back to find occurrences before or after
something else. If we stipulate that the result array will be in order,
then you still don’t have the exact location of empty matches but you do at
least have where they are relative to non-empty matches.

#9Joel Jacobson
joel@compiler.org
In reply to: Isaac Morland (#8)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Hi Isaac,

Many thanks for the comments.

On Tue, Mar 2, 2021, at 14:34, Isaac Morland wrote:

One question I would have is whether empty ranges are all equal to each other. If they are, you have an equality that isn’t really equality; if they aren’t then you would have ranges that are unequal even though they have exactly the same membership. Although I suppose this is already true for some types where ends can be specified as open or closed but end up with the same end element; many range types canonicalize to avoid this but I don’t think they all do.

I thought about this problem too. I don't think there is a perfect solution.
Leaving things as they are is problematic too since it makes the range type useless for some use-cases.
I've sent a patch in a separate thread with the least invasive idea I could come up with.

Returning to the RE result issue, I wonder how much it actually matters where any empty matches are. Certainly the actual contents of the match don’t matter; you don’t need to be able to index into the string to extract the substring. The only scenario I can see where it could matter is if the RE is using lookahead or look back to find occurrences before or after something else.

Hmm, I think it would be ugly to have corner-cases handled differently than the rest.

If we stipulate that the result array will be in order, then you still don’t have the exact location of empty matches but you do at least have where they are relative to non-empty matches.

This part I didn't fully understand. Can you please provide some example on this?

/Joel

#10Isaac Morland
isaac.morland@gmail.com
In reply to: Joel Jacobson (#9)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, 2 Mar 2021 at 08:58, Joel Jacobson <joel@compiler.org> wrote:

If we stipulate that the result array will be in order, then you still
don’t have the exact location of empty matches but you do at least have
where they are relative to non-empty matches.

This part I didn't fully understand. Can you please provide some example
on this?

Suppose the match results are:

[4,8)
[10,10)
[13,16)
[20,20)
[24,24)

Then this gets turned into:

[4,8)
empty
[13,16)
empty
empty

So you know that there are non-empty matches from 4-8 and 13-16, plus an
empty match between them and two empty matches at the end. Given that all
empty strings are identical, I think it's only in pretty rare circumstances
where you need to know exactly where the empty matches are; it would have
to be a matter of identifying empty matches immediately before or after a
specific pattern; in which case I suspect it would usually be just as easy
to match the pattern itself directly.

Does this help?

#11Joel Jacobson
joel@compiler.org
In reply to: Isaac Morland (#10)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Mar 2, 2021, at 15:05, Isaac Morland wrote:

Suppose the match results are:

[4,8)
[10,10)
[13,16)
[20,20)
[24,24)

Then this gets turned into:

[4,8)
empty
[13,16)
empty
empty

So you know that there are non-empty matches from 4-8 and 13-16, plus an empty match between them and two empty matches at the end. Given that all empty strings are identical, I think it's only in pretty rare circumstances where you need to know exactly where the empty matches are; it would have to be a matter of identifying empty matches immediately before or after a specific pattern; in which case I suspect it would usually be just as easy to match the pattern itself directly.

Does this help?

Thanks, I see what you mean now.

I agree it's probably a corner-case,
but I think I would still prefer a complete solution by just returning setof two integer[] values,
instead of the cuter-but-only-partial solution of using the existing int4range[].

Even better would be if we could fix the range type so it could actually be used in this and other similar situations.

If so, then we could do:

SELECT r FROM regexp_positions('caaabaaabeee','(?<=b)a+','g') AS r;
r
-----------
{"[6,9)"}
(1 row)

SELECT r FROM regexp_positions('caaabaaabeee','(?<=b)','g') AS r;
r
---------
{empty}
{empty}
(2 rows)

SELECT lower(r[1]), upper(r[1]) FROM regexp_positions('caaabaaabeee','(?<=b)','g') AS r;
lower | upper
-------+-------
5 | 5
9 | 9
(2 rows)

/Joel

#12Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Isaac Morland (#8)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Mar 2, 2021, at 5:34 AM, Isaac Morland <isaac.morland@gmail.com> wrote:

Returning to the RE result issue, I wonder how much it actually matters where any empty matches are. Certainly the actual contents of the match don’t matter; you don’t need to be able to index into the string to extract the substring. The only scenario I can see where it could matter is if the RE is using lookahead or look back to find occurrences before or after something else. If we stipulate that the result array will be in order, then you still don’t have the exact location of empty matches but you do at least have where they are relative to non-empty matches.

I agree the contents of the match don't matter, because they are always empty. But the position matters. You could intend to split a string in multiple places using lookaheads and lookbehinds to determine the split points.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#5)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Tue, Mar 2, 2021, at 06:31, Tom Lane wrote:

"Joel Jacobson" <joel@compiler.org> writes:

Unless fixed, then the way I see it, I don't think we can use int4range[] for regexp_positions(),

Yeah. It's a cute idea, but the semantics aren't quite right.

Having abandoned the cute idea that didn't work,
here comes a new patch with a regexp_positions() instead returning
setof record (start_pos integer[], end_pos integer[]).

Example:

SELECT * FROM regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
start_pos | end_pos
-----------+---------
{3,6} | {6,11}
{11,16} | {16,20}
(2 rows)

Based on HEAD (040af779382e8e4797242c49b93a5a8f9b79c370).

I've updated docs and tests.

/Joel

Attachments:

0002-regexp-positions.patchapplication/octet-stream; name=0002-regexp-positions.patchDownload+178-0
#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#13)
Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

"Joel Jacobson" <joel@compiler.org> writes:

Having abandoned the cute idea that didn't work,
here comes a new patch with a regexp_positions() instead returning
setof record (start_pos integer[], end_pos integer[]).

I wonder if a 2-D integer array wouldn't be a better idea,
ie {{startpos1,length1},{startpos2,length2},...}. My experience
with working with parallel arrays in SQL has been unpleasant.

Also, did you see

/messages/by-id/fc160ee0-c843-b024-29bb-97b5da61971f@darold.net

Seems like there may be some overlap in these proposals.

regards, tom lane

#15Chapman Flack
chap@anastigmatix.net
In reply to: Tom Lane (#14)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On 03/04/21 10:40, Tom Lane wrote:

Also, did you see

/messages/by-id/fc160ee0-c843-b024-29bb-97b5da61971f@darold.net

Seems like there may be some overlap in these proposals.

Not only that, the functions in that other proposal are very similar
to the standard's own functions that are specified to use XML Query
regular expression syntax (sample implementations in [1]https://tada.github.io/pljava/pljava-examples/apidocs/org/postgresql/pljava/example/saxon/S9.html#method.summary).

These differently-named (which is good) functions seem to be a de facto
standard where the regexp syntax and semantics are those native to the
DBMS, the correspondence being

de facto ISO XQuery-based
-------------- ------------------
regexp_like like_regex
regexp_count occurrences_regex
regexp_instr position_regex
regexp_substr substring_regex
regexp_replace translate_regex

The regexp_positions proposal highlights an interesting apparent gap in
both the de facto and the ISO specs: the provided functions allow you
to specify which occurrence you're talking about, and get the corresponding
positions or the corresponding substring, but neither set of functions
includes one to just give you all the matching positions at once as
a SETOF something.

What the proposed regexp_positions() returns is pretty much exactly
the notional "list of match vectors" that appears internally throughout
the specs of the ISO functions, but is never directly exposed.

In the LOMV as described in the standard, the position/length arrays
are indexed from zero, and the start and length at index 0 are those
for the overall match as a whole.

Right now, if you have a query that involves, say,

substring_regex('(b[^b]+)(b[^b]+)' IN str GROUP 1) and also
substring_regex('(b[^b]+)(b[^b]+)' IN str GROUP 2),

a na�ve implementation like [1]https://tada.github.io/pljava/pljava-examples/apidocs/org/postgresql/pljava/example/saxon/S9.html#method.summary will of course compile and evaluate
the regexp twice and return one group each time. It makes me wonder
whether the standards committee was picturing a clever parse analyzer
and planner that would say "aha! you want group 1 and group 2 from
a single evaluation of this regex!", and that might even explain the
curious rule in the standard that the regex must be an actual literal,
not any other expression. (Still, that strikes me as an awkward way to
have to write it, spelling the regex out as a literal, twice.)

It has also made my idly wonder how close we could get to behaving
that way, perhaps with planner support functions and other available
parse analysis/planning hooks. Would any of those mechanisms get a
sufficiently global view of the query to do that kind of rewriting?

Regards,
-Chap

[1]: https://tada.github.io/pljava/pljava-examples/apidocs/org/postgresql/pljava/example/saxon/S9.html#method.summary
https://tada.github.io/pljava/pljava-examples/apidocs/org/postgresql/pljava/example/saxon/S9.html#method.summary

#16Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#14)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Thu, Mar 4, 2021, at 16:40, Tom Lane wrote:

"Joel Jacobson" <joel@compiler.org> writes:

Having abandoned the cute idea that didn't work,
here comes a new patch with a regexp_positions() instead returning
setof record (start_pos integer[], end_pos integer[]).

I wonder if a 2-D integer array wouldn't be a better idea,
ie {{startpos1,length1},{startpos2,length2},...}. My experience
with working with parallel arrays in SQL has been unpleasant.

I considered it, but I prefer two separate simple arrays for two reasons:

a) more pedagogic, it's at least then obvious what values are start and end positions,
then you only have to understand what the values means.

b) 2-D doesn't arrays don't work well with unnest().
If you would unnest() the 2-D array you couldn't separate the start positions from the end positions,
whereas with two separate, you could do:

SELECT unnest(start_pos) AS start_pos, unnest(end_pos) AS end_pos FROM regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 'g');
start_pos | end_pos
-----------+---------
3 | 6
6 | 11
11 | 16
16 | 20
(4 rows)

Can give some details on your unpleasant experiences of parallel arrays?

Also, did you see

/messages/by-id/fc160ee0-c843-b024-29bb-97b5da61971f@darold.net

Seems like there may be some overlap in these proposals.

Yes, I saw it, it was sent shortly after my proposal, so I couldn't take it into account.
Seems useful, except regexp_instr() seems redundant, I would rather have regexp_positions(),
but maybe regexp_instr() should also be added for compatibility reasons.

/Joel

#17Gilles Darold
gilles@darold.net
In reply to: Tom Lane (#14)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Le 04/03/2021 à 16:40, Tom Lane a écrit :

"Joel Jacobson" <joel@compiler.org> writes:

Having abandoned the cute idea that didn't work,
here comes a new patch with a regexp_positions() instead returning
setof record (start_pos integer[], end_pos integer[]).

I wonder if a 2-D integer array wouldn't be a better idea,
ie {{startpos1,length1},{startpos2,length2},...}. My experience
with working with parallel arrays in SQL has been unpleasant.

Also, did you see

/messages/by-id/fc160ee0-c843-b024-29bb-97b5da61971f@darold.net

Seems like there may be some overlap in these proposals.

The object of regexp_position() is to return all start+end of captured
substrings, it overlaps a little with regexp_instr() in the way that
this function returns the start or end position of a specific captured
substring. I think it is a good idea to have a function that returns all
positions instead of a single one like regexp_instr(), this is not the
same usage. Actually regexp_position() is exactly the same as
regexp_matches() except that it return positions instead of substrings.

I also think that it should return a setof 2-D integer array, an other
solution is to return all start/end positions of an occurrence chained
in an integer array {start1,end1,start2,end2,..}.

Regards,

--
Gilles Darold

#18Joel Jacobson
joel@compiler.org
In reply to: Gilles Darold (#17)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

On Thu, Mar 4, 2021, at 17:55, Gilles Darold wrote:

I also think that it should return a setof 2-D integer array, an other
solution is to return all start/end positions of an occurrence chained
in an integer array {start1,end1,start2,end2,..}.

Hmm. Seems like we've in total managed to come up with three flawed ideas.

Pros/cons I see:

Idea #1: setof 2-D integer array
+ Packs the values into one single value.
- Difficult to work with 2-D arrays, doesn't work well with unnest(), has to inspect the dims and use for loops to extract values.
- Looking at a 2-D value, it's not obvious what the integer values means in it means. Which one is "startpos" and do we have "length" or "endpos" values?
Idea #2: setof (start_pos integer[], end_pos integer[])
+ It's obvious to the user what type of integers "start_pos" and "end_pos" contain.
- Decouples the values into two separate values.
- Tom mentioned some bad experiences with separate array values. (Details on this would be interesting.)

Idea #3: chained integer array {start1,end1,start2,end2,..}
- Mixes different values into the same value
- Requires maths (although simple calculations) to extract values

I think all three ideas (including mine) are ugly. None of them is wart free.

Idea #4: add a new composite built-in type.

A simple composite type with two int8 fields.

The field names seems to vary a lot between languages:

Rust: "start", "end" [1]https://doc.rust-lang.org/std/ops/struct.Range.html
C++: "begin", "end" [2]https://en.cppreference.com/w/cpp/ranges
Python: "start", "stop" [3]https://www.w3schools.com/python/ref_func_range.asp

Such a simple composite type, could then always be used,
when you want to represent simple integer ranges,
between two exact values, arguably a very common need.

Such type could be converted to/from int8range,
but would have easily accessible field names,
which is simpler than using lower() and upper(),
since upper() always returns the canonical
exclusive upper bound for discrete types,
which is not usually what you want when
dealing with "start" and "end" integer ranges.

Since there is no type named just "range", why not just use this name?

Since "end" is a keyword, I suggest the "stop" name:

PoC:

CREATE TYPE range AS (start int8, stop int8);

A real implementation would of course also verify CHECK (start <= stop),
and would add conversions to/from int8range.

I realise this is probably a controversial idea.
But, I think this is a general common problem that deserves a clean general solution.

Thoughts? More ideas?

[1]: https://doc.rust-lang.org/std/ops/struct.Range.html
[2]: https://en.cppreference.com/w/cpp/ranges
[3]: https://www.w3schools.com/python/ref_func_range.asp

#19Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#18)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Idea #5:

Allow disabling canonicalization via optional parameter to range constructor functions.

This would then allow using the range type,
to create inclusive/inclusive integer ranges,
where lower() and upper() would return what you expect.

/Joel

#20Pavel Stehule
pavel.stehule@gmail.com
In reply to: Joel Jacobson (#19)
Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Hi

pá 5. 3. 2021 v 13:44 odesílatel Joel Jacobson <joel@compiler.org> napsal:

Idea #5:

Allow disabling canonicalization via optional parameter to range
constructor functions.

I think so rules describing ranges and multirages are long enough, so
increasing functionality doesn't look like a practical idea.

I prefere special simple composite type like you described in the previous
email (start, stop) or (start, length). It can be used more times when
using range or multi range is not practical.

The composite types are more natural for this purpose than 2D arrays.

Regards

Pavel

Show quoted text

This would then allow using the range type,
to create inclusive/inclusive integer ranges,
where lower() and upper() would return what you expect.

/Joel

#21Andreas Karlsson
andreas.karlsson@percona.com
In reply to: Tom Lane (#14)
#22Joel Jacobson
joel@compiler.org
In reply to: Mark Dilger (#2)
#23Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#22)
#24Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Joel Jacobson (#22)
#25Joel Jacobson
joel@compiler.org
In reply to: Mark Dilger (#24)
#26Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Joel Jacobson (#25)
#27Joel Jacobson
joel@compiler.org
In reply to: Mark Dilger (#26)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#25)
#29Chapman Flack
chap@anastigmatix.net
In reply to: Tom Lane (#28)
#30Chapman Flack
chap@anastigmatix.net
In reply to: Chapman Flack (#29)
#31Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Joel Jacobson (#27)
#32Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#28)
#33Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#32)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#32)
#35Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tom Lane (#34)
#36Joel Jacobson
joel@compiler.org
In reply to: Pavel Stehule (#35)
#37Pavel Stehule
pavel.stehule@gmail.com
In reply to: Joel Jacobson (#36)
#38Joel Jacobson
joel@compiler.org
In reply to: Pavel Stehule (#37)
#39Pavel Stehule
pavel.stehule@gmail.com
In reply to: Joel Jacobson (#38)
#40Joel Jacobson
joel@compiler.org
In reply to: Pavel Stehule (#39)
#41Pavel Stehule
pavel.stehule@gmail.com
In reply to: Joel Jacobson (#40)
#42Joel Jacobson
joel@compiler.org
In reply to: Pavel Stehule (#41)
#43Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#14)
#44Pavel Stehule
pavel.stehule@gmail.com
In reply to: Joel Jacobson (#43)
#45Joel Jacobson
joel@compiler.org
In reply to: Pavel Stehule (#44)
#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#42)
#47Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#45)
#48Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#47)
#49Daniel Gustafsson
daniel@yesql.se
In reply to: Joel Jacobson (#48)
#50Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Gustafsson (#49)
#51Daniel Gustafsson
daniel@yesql.se
In reply to: Tom Lane (#50)
#52Joel Jacobson
joel@compiler.org
In reply to: Daniel Gustafsson (#51)
#53Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#52)
#54Daniel Gustafsson
daniel@yesql.se
In reply to: Tom Lane (#53)
#55Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#53)