[PATCH] Support empty ranges with bounds information
Hi,
As discussed in the separate thread "[PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]" [1]/messages/by-id/5eae8911-241a-4432-accc-80e6ffecedfa@www.fastmail.com
it's currently not possible to create an empty range with bounds information.
This patch tries to improve the situation by keeping the bounds information,
and allow accessing it via lower() and upper().
No other semantics have been changed.
All tests passes without any changes.
All examples below give the same result before/after this patch:
SELECT int4range(6,6,'[)');
empty
SELECT isempty(int4range(6,6,'[)'));
TRUE
SELECT int4range(6,6,'[)') = int4range(7,7,'[)');
TRUE
SELECT 'empty'::int4range;
empty
SELECT lower('empty'::int4range);
NULL
SELECT upper('empty'::int4range);
NULL
SELECT isempty('empty'::int4range);
TRUE
SELECT 'empty'::int4range = 'empty'::int4range;
TRUE
SELECT int4range(6,6,'[)') = 'empty'::int4range;
TRUE
The only semantic change is lower() and upper()
now returns the lower and upper bounds
for empty ranges created with bounds:
SELECT lower(int4range(6,6,'[)'));
6
SELECT upper(int4range(6,6,'[)'));
6
Isaac Morland asked an interesting question in the other thread [1]/messages/by-id/5eae8911-241a-4432-accc-80e6ffecedfa@www.fastmail.com:
Doing this would introduce numerous questions which would have to be resolved.
For example, where/when is the empty range resulting from an intersection operation?
The result of intersection is with this patch unchanged,
the resulting empty range has no bounds information, e.g:
SELECT lower(int4range(10,10,'[)') * int4range(20,20,'[)'));
NULL
Patch explained below:
I've made use of the two previously not used null flags:
-#define RANGE_LB_NULL 0x20 /* lower bound is null (NOT USED) */
-#define RANGE_UB_NULL 0x40 /* upper bound is null (NOT USED) */
+#define RANGE_LB_NULL 0x20 /* lower bound is null */
+#define RANGE_UB_NULL 0x40 /* upper bound is null */
I've changed the RANGE_HAS_LBOUND and RANGE_HAS_UBOUND macros
to not look at RANGE_EMPTY:
-#define RANGE_HAS_LBOUND(flags) (!((flags) & (RANGE_EMPTY | \
- RANGE_LB_NULL | \
- RANGE_LB_INF)))
+#define RANGE_HAS_LBOUND(flags) (!((flags) & (RANGE_LB_NULL | RANGE_LB_INF)))
The definition for RANGE_HAS_UBOUND has been changed in the same way.
These NULL-flags are now set to explicitly indicate there is no bounds information,
when parsing a text string containing the "empty" literal in range_parse(),
or when the caller of make_range() passes empty=true:
- flags |= RANGE_EMPTY;
+ flags |= RANGE_EMPTY | RANGE_LB_NULL | RANGE_UB_NULL;
In the range_lower() and range_upper() functions,
the RANGE_HAS_...BOUND() macros are used,
instead of the old hard-coded expression, e.g.:
- if (empty || lower.infinite)
+ if (!RANGE_HAS_LBOUND(flags))
Finally, in range_recv() we must not mask out the NULL flags,
since they are now used:
flags &= (RANGE_EMPTY |
RANGE_LB_INC |
RANGE_LB_INF |
+ RANGE_LB_NULL |
RANGE_UB_INC |
- RANGE_UB_INF);
+ RANGE_UB_INF |
+ RANGE_UB_NULL);
That's all of it.
I think this little change would make range types more intuitive useful in practise.
/Joel
[1]: /messages/by-id/5eae8911-241a-4432-accc-80e6ffecedfa@www.fastmail.com
Attachments:
empty-ranges-with-bounds-information.patchapplication/octet-stream; name=empty-ranges-with-bounds-information.patchDownload
diff --git a/src/backend/utils/adt/rangetypes.c b/src/backend/utils/adt/rangetypes.c
index 815175a654..9ef9d454d6 100644
--- a/src/backend/utils/adt/rangetypes.c
+++ b/src/backend/utils/adt/rangetypes.c
@@ -183,8 +183,10 @@ range_recv(PG_FUNCTION_ARGS)
flags &= (RANGE_EMPTY |
RANGE_LB_INC |
RANGE_LB_INF |
+ RANGE_LB_NULL |
RANGE_UB_INC |
- RANGE_UB_INF);
+ RANGE_UB_INF |
+ RANGE_UB_NULL);
/* receive the bounds ... */
if (RANGE_HAS_LBOUND(flags))
@@ -432,13 +434,14 @@ range_lower(PG_FUNCTION_ARGS)
RangeBound lower;
RangeBound upper;
bool empty;
+ char flags = range_get_flags(r1);
typcache = range_get_typcache(fcinfo, RangeTypeGetOid(r1));
range_deserialize(typcache, r1, &lower, &upper, &empty);
/* Return NULL if there's no finite lower bound */
- if (empty || lower.infinite)
+ if (!RANGE_HAS_LBOUND(flags))
PG_RETURN_NULL();
PG_RETURN_DATUM(lower.val);
@@ -453,13 +456,14 @@ range_upper(PG_FUNCTION_ARGS)
RangeBound lower;
RangeBound upper;
bool empty;
+ char flags = range_get_flags(r1);
typcache = range_get_typcache(fcinfo, RangeTypeGetOid(r1));
range_deserialize(typcache, r1, &lower, &upper, &empty);
/* Return NULL if there's no finite upper bound */
- if (empty || upper.infinite)
+ if (!RANGE_HAS_UBOUND(flags))
PG_RETURN_NULL();
PG_RETURN_DATUM(upper.val);
@@ -1677,7 +1681,7 @@ range_serialize(TypeCacheEntry *typcache, RangeBound *lower, RangeBound *upper,
Assert(!upper->lower);
if (empty)
- flags |= RANGE_EMPTY;
+ flags |= RANGE_EMPTY | RANGE_LB_NULL | RANGE_UB_NULL;
else
{
cmp = range_cmp_bound_values(typcache, lower, upper);
@@ -2188,7 +2192,7 @@ range_parse(const char *string, char *flags, char **lbound_str,
if (pg_strncasecmp(ptr, RANGE_EMPTY_LITERAL,
strlen(RANGE_EMPTY_LITERAL)) == 0)
{
- *flags = RANGE_EMPTY;
+ *flags = RANGE_EMPTY | RANGE_LB_NULL | RANGE_UB_NULL;
*lbound_str = NULL;
*ubound_str = NULL;
diff --git a/src/include/utils/rangetypes.h b/src/include/utils/rangetypes.h
index 04c302c619..b099409048 100644
--- a/src/include/utils/rangetypes.h
+++ b/src/include/utils/rangetypes.h
@@ -40,18 +40,14 @@ typedef struct
#define RANGE_UB_INC 0x04 /* upper bound is inclusive */
#define RANGE_LB_INF 0x08 /* lower bound is -infinity */
#define RANGE_UB_INF 0x10 /* upper bound is +infinity */
-#define RANGE_LB_NULL 0x20 /* lower bound is null (NOT USED) */
-#define RANGE_UB_NULL 0x40 /* upper bound is null (NOT USED) */
+#define RANGE_LB_NULL 0x20 /* lower bound is null */
+#define RANGE_UB_NULL 0x40 /* upper bound is null */
#define RANGE_CONTAIN_EMPTY 0x80 /* marks a GiST internal-page entry whose
* subtree contains some empty ranges */
-#define RANGE_HAS_LBOUND(flags) (!((flags) & (RANGE_EMPTY | \
- RANGE_LB_NULL | \
- RANGE_LB_INF)))
+#define RANGE_HAS_LBOUND(flags) (!((flags) & (RANGE_LB_NULL | RANGE_LB_INF)))
-#define RANGE_HAS_UBOUND(flags) (!((flags) & (RANGE_EMPTY | \
- RANGE_UB_NULL | \
- RANGE_UB_INF)))
+#define RANGE_HAS_UBOUND(flags) (!((flags) & (RANGE_UB_NULL | RANGE_UB_INF)))
#define RangeIsEmpty(r) ((range_get_flags(r) & RANGE_EMPTY) != 0)
#define RangeIsOrContainsEmpty(r) \
"Joel Jacobson" <joel@compiler.org> writes:
As discussed in the separate thread "[PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]" [1]
it's currently not possible to create an empty range with bounds information.
This patch tries to improve the situation by keeping the bounds information,
and allow accessing it via lower() and upper().
I think this is an actively bad idea. We had a clean set-theoretic
definition of ranges as sets of points, and with this we would not.
We should not be whacking around the fundamental semantics of a
whole class of data types on the basis that it'd be cute to make
regexp_position return its result as int4range rather than int[].
If we did go forward with this, what would the implications be for
multiranges?
regards, tom lane
On Mar 2, 2021, at 5:20 AM, Joel Jacobson <joel@compiler.org> wrote:
it's currently not possible to create an empty range with bounds information.
This patch tries to improve the situation by keeping the bounds information,
and allow accessing it via lower() and upper().No other semantics have been changed.
All tests passes without any changes.
I recall this issue of empty ranges not keeping any bounds information being discussed back when range types were developed, and the design choice was intentional. Searching the archives for that discussion, I don't find anything, probably because I'm not searching for the right keywords. Anybody have a link to that discussion?
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 2, 2021 at 3:28 PM Mark Dilger <mark.dilger@enterprisedb.com>
wrote:
On Mar 2, 2021, at 5:20 AM, Joel Jacobson <joel@compiler.org> wrote:
it's currently not possible to create an empty range with bounds
information.
This patch tries to improve the situation by keeping the bounds
information,
and allow accessing it via lower() and upper().
No other semantics have been changed.
All tests passes without any changes.I recall this issue of empty ranges not keeping any bounds information
being discussed back when range types were developed, and the design choice
was intentional. Searching the archives for that discussion, I don't find
anything, probably because I'm not searching for the right keywords.
Anybody have a link to that discussion?—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Marc, perhaps you were referring to this discussion?
/messages/by-id/4D5534D0020000250003A87E@gw.wicourts.gov
On Mar 2, 2021, at 8:51 AM, Pantelis Theodosiou <ypercube@gmail.com> wrote:
On Tue, Mar 2, 2021 at 3:28 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
On Mar 2, 2021, at 5:20 AM, Joel Jacobson <joel@compiler.org> wrote:
it's currently not possible to create an empty range with bounds information.
This patch tries to improve the situation by keeping the bounds information,
and allow accessing it via lower() and upper().No other semantics have been changed.
All tests passes without any changes.I recall this issue of empty ranges not keeping any bounds information being discussed back when range types were developed, and the design choice was intentional. Searching the archives for that discussion, I don't find anything, probably because I'm not searching for the right keywords. Anybody have a link to that discussion?
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyMarc, perhaps you were referring to this discussion?
/messages/by-id/4D5534D0020000250003A87E@gw.wicourts.gov
Yes, I believe so. Thank you for the link.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 2, 2021 at 4:57 PM Mark Dilger <mark.dilger@enterprisedb.com>
wrote:
On Mar 2, 2021, at 8:51 AM, Pantelis Theodosiou <ypercube@gmail.com>
wrote:
On Tue, Mar 2, 2021 at 3:28 PM Mark Dilger <mark.dilger@enterprisedb.com>
wrote:
On Mar 2, 2021, at 5:20 AM, Joel Jacobson <joel@compiler.org> wrote:
it's currently not possible to create an empty range with bounds
information.
This patch tries to improve the situation by keeping the bounds
information,
and allow accessing it via lower() and upper().
No other semantics have been changed.
All tests passes without any changes.I recall this issue of empty ranges not keeping any bounds information
being discussed back when range types were developed, and the design choice
was intentional. Searching the archives for that discussion, I don't find
anything, probably because I'm not searching for the right keywords.
Anybody have a link to that discussion?—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyMarc, perhaps you were referring to this discussion?
/messages/by-id/4D5534D0020000250003A87E@gw.wicourts.gov
Yes, I believe so. Thank you for the link.
Welcome. Also this message, where I found the link and has an overview of
the different views at the time (and more links):
/messages/by-id/1299865026.3474.58.camel@jdavis
Show quoted text
On Fri, 2011-03-11 at 08:37 -0500, Bruce Momjian wrote:
Where are we on this? The options are: 1. Rip out empty ranges. Several
people have been skeptical of their
usefulness, but I don't recall anyone directly saying that they should
be removed. Robert Haas made the point that range types aren't closed
under UNION:
http://archives.postgresql.org/pgsql-hackers/2011-02/msg01045.php So the
additional nice mathematical properties provided by empty ranges
are not as important (because it wouldn't be perfect anyway). 2. Change
the semantics. Erik Rijkers suggested that we define all
operators for empty ranges, perhaps using NULL semantics:
http://archives.postgresql.org/pgsql-hackers/2011-02/msg00942.php And
Kevin Grittner suggested that there could be discrete ranges of zero
length yet a defined starting point:
http://archives.postgresql.org/pgsql-hackers/2011-02/msg01042.php 3.
Leave empty ranges with the existing "empty set" semantics. Nathan
Boley made a good point here:
http://archives.postgresql.org/pgsql-hackers/2011-02/msg01108.php Right
now it's #3, and I lean pretty strongly toward keeping it. Without
#3, people will get confused when fairly simple operations fail in a
data-dependent way (at runtime). With #3, people will run into problems
only in situations where it is fairly dubious to have an empty range
anyway (and therefore likely a real error), such as finding ranges "left
of" an empty range. Otherwise, I'd prefer #1 to #2. I think #2 is a bad
path to take, and
we'll end up with a lot of unintuitive and error-prone operators. Regards,
Jeff Davis
On Tue, Mar 2, 2021, at 15:42, Tom Lane wrote:
"Joel Jacobson" <joel@compiler.org> writes:
As discussed in the separate thread "[PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]" [1]
it's currently not possible to create an empty range with bounds information.This patch tries to improve the situation by keeping the bounds information,
and allow accessing it via lower() and upper().I think this is an actively bad idea. We had a clean set-theoretic
definition of ranges as sets of points, and with this we would not.
We should not be whacking around the fundamental semantics of a
whole class of data types on the basis that it'd be cute to make
regexp_position return its result as int4range rather than int[].
I think there are *lots* of other use-cases where the current semantics of range types are very problematic.
The regexp_positions() patch just demonstrates one concrete example
on when real-life zero-length ranges can definitively have positions,
regardless of what the mathematicians thinks.
(I use the term "position" here since if lower=upper bound,
then we're talking about a position, since it has zero length.)
I think there is a risk lots of users coming from other programming environments
will misunderstand how ranges work, start implementing something using them,
only to later have to rewrite all their code using ranges due to eventually encountering the
zero-length corner-case for which there is no work-around (except not using ranges).
Imagine e.g. a Rust user, who has learned how ranges work in Rust,
and thinks the program below is valid and and expects it to output
"start 3 end 3 is_empty true".
fn main() {
let r = std::ops::Range { start: 3, end: 3 };
println!(
"start {} end {} is_empty {}",
r.start,
r.end,
r.is_empty()
);
}
I think it would be a challenge to explain how PostgreSQL's range semantics
to this user, why you get NULL when trying to
access the start and end values for this range.
I feel this is a perfect example of when theory and practise has since long parted ways,
and the theory is only cute until you face the ugly reality.
That said, subtle changes are of course possibly dangerous,
and since I'm not a huge range type user myself,
I can't have an opinion on how many rely on the current null semantics for lower() and upper().
Argh! I wish we would have access to a large set of real-life real-time statistics on PostgreSQL SQL queries
and result sets from lots of different users around the world, similar to the regex test corpus.
It's very unfair e.g. Amazon with their Aurora could in theory collect such statistics on all their users,
so their Aurora-hackers could answers questions like
"Do lower() and upper() ever return NULL in real-life for ranges?"
While all we can do is to rely on user reports and our imagination.
Maybe we should allow opting-in to contribute with statistics from production servers,
to help us better understand how PostgreSQL is used in real-life?
I see lots of problems with data privacy, business secrets etc, but perhaps there are something that can be done.
Oh well. At least it was fun to learn about how ranges are implemented behind the scenes.
If we cannot do a subtle change, then I think we should consider an entirely new range class,
just like multi-ranges are added in v14. Maybe "negrange" could be a good name?
If we did go forward with this, what would the implications be for
multiranges?
None. It would only affect lower()/upper() for a single range created with bounds.
Before patch:
SELECT numrange(3,4) + numrange(5,5);
[3,4)
SELECT upper(numrange(3,4) + numrange(5,5));
4
SELECT numrange(5,5);
empty
SELECT upper(numrange(5,5));
NULL
After patch:
SELECT numrange(3,4) + numrange(5,5);
[3,4)
SELECT upper(numrange(3,4) + numrange(5,5));
4
SELECT numrange(5,5);
empty
SELECT upper(numrange(5,5));
5
At the very least, I think we should in any case add test coverage of what lower()/upper() returns for empty ranges.
/Joel
On Mar 2, 2021, at 9:52 AM, Joel Jacobson <joel@compiler.org> wrote:
On Tue, Mar 2, 2021, at 15:42, Tom Lane wrote:
"Joel Jacobson" <joel@compiler.org> writes:
As discussed in the separate thread "[PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]" [1]
it's currently not possible to create an empty range with bounds information.This patch tries to improve the situation by keeping the bounds information,
and allow accessing it via lower() and upper().I think this is an actively bad idea. We had a clean set-theoretic
definition of ranges as sets of points, and with this we would not.
We should not be whacking around the fundamental semantics of a
whole class of data types on the basis that it'd be cute to make
regexp_position return its result as int4range rather than int[].I think there are *lots* of other use-cases where the current semantics of range types are very problematic.
I'm inclined to think that this conversation is ten years too late. Range semantics are already relied upon in our code, but also in the code of others. It might be very hard to debug code that was correct when written but broken by this proposed change. The problem is not just with lower() and upper(), but with equality testing (mentioned upthread), since code may rely on two different "positions" (your word) both being equal, and both sorting the same.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 2, 2021, at 17:51, Pantelis Theodosiou wrote:
Marc, perhaps you were referring to this discussion?
/messages/by-id/4D5534D0020000250003A87E@gw.wicourts.gov
Thanks for the link to the discussion.
I will real with great interest and learn the arguments,
so I can explain them to future versions of myself
when they as the same question ten years from now on this list.
/Joel
On Tue, Mar 2, 2021, at 19:01, Mark Dilger wrote:
The problem is not just with lower() and upper(), but with equality testing (mentioned upthread), since code may rely on two different "positions" (your word) both being equal, and both sorting the same.
That's why the patch doesn't change equality.
/Joel
On 03/02/21 13:01, Mark Dilger wrote:
The problem is not just with lower() and upper(), but with equality testing
(mentioned upthread), since code may rely on two different "positions"
(your word) both being equal, and both sorting the same.
Could those concerns be addressed perhaps, not by adding an entirely new
just-like-a-range-but-remembers-position-when-zero-width type (which would
feel wartlike to me), but by tweaking ranges to /secretly/ remember the
position when zero width?
Secretly, in the sense that upper(), lower(), and the default sort
operator would keep their established behavior, but new functions like
upper_or_pos(), lower_or_pos() would return the non-NULL value even for
an empty range, and another sort operator could be provided for use
when one wants the ordering to reflect it?
Regards,
-Chap
On Mar 2, 2021, at 10:08 AM, Joel Jacobson <joel@compiler.org> wrote:
On Tue, Mar 2, 2021, at 19:01, Mark Dilger wrote:
The problem is not just with lower() and upper(), but with equality testing (mentioned upthread), since code may rely on two different "positions" (your word) both being equal, and both sorting the same.
That's why the patch doesn't change equality.
How does that work if I SELECT DISTINCT ON (nr) ... and then take upper(nr). It's just random which values I get?
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes:
On Mar 2, 2021, at 10:08 AM, Joel Jacobson <joel@compiler.org> wrote:
On Tue, Mar 2, 2021, at 19:01, Mark Dilger wrote:
The problem is not just with lower() and upper(), but with equality testing (mentioned upthread), since code may rely on two different "positions" (your word) both being equal, and both sorting the same.
That's why the patch doesn't change equality.
How does that work if I SELECT DISTINCT ON (nr) ... and then take upper(nr). It's just random which values I get?
More generally, values that are visibly different yet compare equal
are user-unfriendly in the extreme. We do have cases like that
(IEEE-float minus zero, area-based compare of some geometric types
come to mind) but they are all warts, not things to be emulated.
regards, tom lane
On Mar 2, 2021, at 10:16 AM, Chapman Flack <chap@anastigmatix.net> wrote:
On 03/02/21 13:01, Mark Dilger wrote:
The problem is not just with lower() and upper(), but with equality testing
(mentioned upthread), since code may rely on two different "positions"
(your word) both being equal, and both sorting the same.Could those concerns be addressed perhaps, not by adding an entirely new
just-like-a-range-but-remembers-position-when-zero-width type (which would
feel wartlike to me), but by tweaking ranges to /secretly/ remember the
position when zero width?Secretly, in the sense that upper(), lower(), and the default sort
operator would keep their established behavior, but new functions like
upper_or_pos(), lower_or_pos() would return the non-NULL value even for
an empty range, and another sort operator could be provided for use
when one wants the ordering to reflect it?
I vaguely recall that ten years ago I wanted zero-width range types to not collapse into an empty range. I can't recall if I ever expressed that opinion -- I just remember thinking it would be nice, and for reasons similar to what Joel is arguing here. But I can't see having compares-equal-but-not-truly-equal ranges as a good idea. I think Tom is right about that.
I also think the regexp work that inspired this thread could return something other than a range, so the motivation for creating a frankenstein range implementation doesn't really exist.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 2, 2021, at 17:51, Pantelis Theodosiou wrote:
Marc, perhaps you were referring to this discussion?
/messages/by-id/4D5534D0020000250003A87E@gw.wicourts.gov
I've read through the "Re: Range Types: empty ranges" thread from 2011.
My comments:
Jeff Davis wrote:
The cost, of course, is that not all operations are well-defined for
empty ranges. I think those are mostly operators like those mentioned in
the other thread: ">>" (strictly right of), "<<" (strictly left of), and
"-|-" (adjacent); and perhaps "&>" and "&<". These are probably used a
little less frequently, and should probably not be used in a context
where empty ranges are permitted (if they are, it's likely a mistake and
an error should be thrown).
Interesting. I realize all of these operators would actually be well defined for empty ranges *with* bounds information.
"Kevin Grittner" wrote:
Perhaps it was a mistake to get so concrete rather than conceptual
-- basically, it seems like it could be a useful concept for any
planned or scheduled range with an indeterminate end point, which
you want to "reserve" up front and record in progress until
complete. The alternative would be that such "ranges to be" have a
parallel "planned start value" column of the same type as the range,
to be used as the start of the range once it is not empty. Or, as
another way to put it, it seems potentially useful to me to have an
empty range which is pinned to a location, in *addition* to the
unpinned empty ranges such as would be needed to represent the
intersection of two non-overlapping ranges.
I fully agree with this. Such "pinned to a location" empty range
is exactly what I needed for regexp_positions() and is what
the patch implements.
It seems that the consequences of allowing empty ranges with bounds information
wasn't really discussed in detail in this thread. Nobody commented on Kevin's idea, as far as I can see when reading the thread.
Instead, the discussion focused on the consequences of
allowing empty ranges without bounds information,
which apparently was finally accepted, since that's what we have now.
/Joel
On Tue, Mar 2, 2021, at 19:16, Chapman Flack wrote:
On 03/02/21 13:01, Mark Dilger wrote:
The problem is not just with lower() and upper(), but with equality testing
(mentioned upthread), since code may rely on two different "positions"
(your word) both being equal, and both sorting the same.Could those concerns be addressed perhaps, not by adding an entirely new
just-like-a-range-but-remembers-position-when-zero-width type (which would
feel wartlike to me), but by tweaking ranges to /secretly/ remember the
position when zero width?
This is actually how it's implemented. The patch doesn't affect equality.
It just stores the lower/upper bounds, if available, upon creation.
Secretly, in the sense that upper(), lower(), and the default sort
operator would keep their established behavior, but new functions like
upper_or_pos(), lower_or_pos() would return the non-NULL value even for
an empty range, and another sort operator could be provided for use
when one wants the ordering to reflect it?
This is a great idea!
This would solve the potential problems of users relying
on upper()/lower() to always return null when range isempty().
Such new functions could then be used by new users who have read the documentation
and understand how they work.
This would effectively mean there would be absolutely no semantic changes at all.
I will work on a new patch to try out this idea.
/Joel
On Tue, Mar 2, 2021, at 19:17, Mark Dilger wrote:
On Mar 2, 2021, at 10:08 AM, Joel Jacobson <joel@compiler.org> wrote:
That's why the patch doesn't change equality.How does that work if I SELECT DISTINCT ON (nr) ... and then take upper(nr). It's just random which values I get?
Yes. It's random, since equality isn't changed, the sort operation cannot tell the difference, and nor could a user who isn't aware of upper() / lower() could reveal differences.
Demo:
CREATE TABLE t AS SELECT int4range(i,i+FLOOR(random()*2)::integer,'[)') AS nr FROM generate_series(1,10) AS i;
SELECT nr, lower(nr), upper(nr) FROM t ORDER BY 1;
nr | lower | upper
--------+-------+-------
empty | 10 | 10
empty | 4 | 4
empty | 6 | 6
empty | 7 | 7
empty | 1 | 1
empty | 3 | 3
[2,3) | 2 | 3
[5,6) | 5 | 6
[8,9) | 8 | 9
[9,10) | 9 | 10
(10 rows)
SELECT DISTINCT ON (nr) nr, lower(nr), upper(nr) FROM t ORDER BY 1;
nr | lower | upper
--------+-------+-------
empty | 10 | 10
[2,3) | 2 | 3
[5,6) | 5 | 6
[8,9) | 8 | 9
[9,10) | 9 | 10
(5 rows)
/Joel
On Mar 2, 2021, at 11:34 AM, Joel Jacobson <joel@compiler.org> wrote:
Yes. It's random, since equality isn't changed, the sort operation cannot tell the difference, and nor could a user who isn't aware of upper() / lower() could reveal differences.
This sounds unworkable even just in light of the original motivation for this whole thread. If I use your proposed regexp_positions(string text, pattern text, flags text) function to parse a large number of "positions" from a document, store all those positions in a table, and do a join of those positions against something else, it's not going to work. Positions will randomly vanish from the results of that join, which is going to be really surprising. I'm sure there are other examples of Tom's general point about compares-equal-but-not-equal datatypes.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mar 2, 2021, at 11:42 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
On Mar 2, 2021, at 11:34 AM, Joel Jacobson <joel@compiler.org> wrote:
Yes. It's random, since equality isn't changed, the sort operation cannot tell the difference, and nor could a user who isn't aware of upper() / lower() could reveal differences.
This sounds unworkable even just in light of the original motivation for this whole thread. If I use your proposed regexp_positions(string text, pattern text, flags text) function to parse a large number of "positions" from a document, store all those positions in a table, and do a join of those positions against something else, it's not going to work. Positions will randomly vanish from the results of that join, which is going to be really surprising. I'm sure there are other examples of Tom's general point about compares-equal-but-not-equal datatypes.
I didn't phrase that clearly enough. I'm thinking about whether you include the bounds information in the hash function. The current implementation of hash_range(PG_FUNCTION_ARGS) is going to hash the lower and upper bounds, since you didn't change it to do otherwise, so "equal" values won't always hash the same. I haven't tested this out, but it seems you could get a different set of rows depending on whether the planner selects a hash join.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 2, 2021, at 20:20, Joel Jacobson wrote:
On Tue, Mar 2, 2021, at 19:16, Chapman Flack wrote:
Secretly, in the sense that upper(), lower(), and the default sort
operator would keep their established behavior, but new functions like
upper_or_pos(), lower_or_pos() would return the non-NULL value even for
an empty range, and another sort operator could be provided for use
when one wants the ordering to reflect it?I will work on a new patch to try out this idea.
Here is a patch implementing this idea.
lower() and upper() are now restored to their originals.
The new functions range_start() and range_end()
works just like lower() and upper(),
except they also return bounds information for empty ranges,
if available, otherwise they return null.
This means, there is no risk of affecting any current users of ranges.
I think this a good pragmatical solution to many real-world problems that can be solved efficiently with ranges.
I've also added test coverage of lower() and upper() for null range values.
/Joel
Attachments:
empty-ranges-with-bounds-information-v2.patchapplication/octet-stream; name=empty-ranges-with-bounds-information-v2.patchDownload
diff --git a/src/backend/utils/adt/rangetypes.c b/src/backend/utils/adt/rangetypes.c
index 815175a654..099463cd86 100644
--- a/src/backend/utils/adt/rangetypes.c
+++ b/src/backend/utils/adt/rangetypes.c
@@ -183,8 +183,10 @@ range_recv(PG_FUNCTION_ARGS)
flags &= (RANGE_EMPTY |
RANGE_LB_INC |
RANGE_LB_INF |
+ RANGE_LB_NULL |
RANGE_UB_INC |
- RANGE_UB_INF);
+ RANGE_UB_INF |
+ RANGE_UB_NULL);
/* receive the bounds ... */
if (RANGE_HAS_LBOUND(flags))
@@ -444,6 +446,27 @@ range_lower(PG_FUNCTION_ARGS)
PG_RETURN_DATUM(lower.val);
}
+Datum
+range_start(PG_FUNCTION_ARGS)
+{
+ RangeType *r1 = PG_GETARG_RANGE_P(0);
+ TypeCacheEntry *typcache;
+ RangeBound lower;
+ RangeBound upper;
+ bool empty;
+ char flags = range_get_flags(r1);
+
+ typcache = range_get_typcache(fcinfo, RangeTypeGetOid(r1));
+
+ range_deserialize(typcache, r1, &lower, &upper, &empty);
+
+ /* Return NULL if there's no finite lower bound */
+ if (!RANGE_HAS_LBOUND(flags))
+ PG_RETURN_NULL();
+
+ PG_RETURN_DATUM(lower.val);
+}
+
/* extract upper bound value */
Datum
range_upper(PG_FUNCTION_ARGS)
@@ -465,6 +488,27 @@ range_upper(PG_FUNCTION_ARGS)
PG_RETURN_DATUM(upper.val);
}
+Datum
+range_end(PG_FUNCTION_ARGS)
+{
+ RangeType *r1 = PG_GETARG_RANGE_P(0);
+ TypeCacheEntry *typcache;
+ RangeBound lower;
+ RangeBound upper;
+ bool empty;
+ char flags = range_get_flags(r1);
+
+ typcache = range_get_typcache(fcinfo, RangeTypeGetOid(r1));
+
+ range_deserialize(typcache, r1, &lower, &upper, &empty);
+
+ /* Return NULL if there's no finite upper bound */
+ if (!RANGE_HAS_UBOUND(flags))
+ PG_RETURN_NULL();
+
+ PG_RETURN_DATUM(upper.val);
+}
+
/* range -> bool functions */
@@ -1677,7 +1721,7 @@ range_serialize(TypeCacheEntry *typcache, RangeBound *lower, RangeBound *upper,
Assert(!upper->lower);
if (empty)
- flags |= RANGE_EMPTY;
+ flags |= RANGE_EMPTY | RANGE_LB_NULL | RANGE_UB_NULL;
else
{
cmp = range_cmp_bound_values(typcache, lower, upper);
@@ -2188,7 +2232,7 @@ range_parse(const char *string, char *flags, char **lbound_str,
if (pg_strncasecmp(ptr, RANGE_EMPTY_LITERAL,
strlen(RANGE_EMPTY_LITERAL)) == 0)
{
- *flags = RANGE_EMPTY;
+ *flags = RANGE_EMPTY | RANGE_LB_NULL | RANGE_UB_NULL;
*lbound_str = NULL;
*ubound_str = NULL;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 1487710d59..27239c214e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -9868,9 +9868,15 @@
{ oid => '3848', descr => 'lower bound of range',
proname => 'lower', prorettype => 'anyelement', proargtypes => 'anyrange',
prosrc => 'range_lower' },
+{ oid => '8104', descr => 'range start position',
+ proname => 'range_start', prorettype => 'anyelement', proargtypes => 'anyrange',
+ prosrc => 'range_start' },
{ oid => '3849', descr => 'upper bound of range',
proname => 'upper', prorettype => 'anyelement', proargtypes => 'anyrange',
prosrc => 'range_upper' },
+{ oid => '8105', descr => 'range end position',
+ proname => 'range_end', prorettype => 'anyelement', proargtypes => 'anyrange',
+ prosrc => 'range_end' },
{ oid => '3850', descr => 'is the range empty?',
proname => 'isempty', prorettype => 'bool', proargtypes => 'anyrange',
prosrc => 'range_empty' },
diff --git a/src/include/utils/rangetypes.h b/src/include/utils/rangetypes.h
index 04c302c619..b099409048 100644
--- a/src/include/utils/rangetypes.h
+++ b/src/include/utils/rangetypes.h
@@ -40,18 +40,14 @@ typedef struct
#define RANGE_UB_INC 0x04 /* upper bound is inclusive */
#define RANGE_LB_INF 0x08 /* lower bound is -infinity */
#define RANGE_UB_INF 0x10 /* upper bound is +infinity */
-#define RANGE_LB_NULL 0x20 /* lower bound is null (NOT USED) */
-#define RANGE_UB_NULL 0x40 /* upper bound is null (NOT USED) */
+#define RANGE_LB_NULL 0x20 /* lower bound is null */
+#define RANGE_UB_NULL 0x40 /* upper bound is null */
#define RANGE_CONTAIN_EMPTY 0x80 /* marks a GiST internal-page entry whose
* subtree contains some empty ranges */
-#define RANGE_HAS_LBOUND(flags) (!((flags) & (RANGE_EMPTY | \
- RANGE_LB_NULL | \
- RANGE_LB_INF)))
+#define RANGE_HAS_LBOUND(flags) (!((flags) & (RANGE_LB_NULL | RANGE_LB_INF)))
-#define RANGE_HAS_UBOUND(flags) (!((flags) & (RANGE_EMPTY | \
- RANGE_UB_NULL | \
- RANGE_UB_INF)))
+#define RANGE_HAS_UBOUND(flags) (!((flags) & (RANGE_UB_NULL | RANGE_UB_INF)))
#define RangeIsEmpty(r) ((range_get_flags(r) & RANGE_EMPTY) != 0)
#define RangeIsOrContainsEmpty(r) \
diff --git a/src/test/regress/expected/rangetypes.out b/src/test/regress/expected/rangetypes.out
index 05b882fde9..68d00e5a83 100644
--- a/src/test/regress/expected/rangetypes.out
+++ b/src/test/regress/expected/rangetypes.out
@@ -181,6 +181,78 @@ select '(a,a)'::textrange;
empty
(1 row)
+select lower('[a,a)'::textrange);
+ lower
+-------
+
+(1 row)
+
+select lower('(a,a]'::textrange);
+ lower
+-------
+
+(1 row)
+
+select lower('(a,a)'::textrange);
+ lower
+-------
+
+(1 row)
+
+select upper('[a,a)'::textrange);
+ upper
+-------
+
+(1 row)
+
+select upper('(a,a]'::textrange);
+ upper
+-------
+
+(1 row)
+
+select upper('(a,a)'::textrange);
+ upper
+-------
+
+(1 row)
+
+select range_start('[a,a)'::textrange);
+ range_start
+-------------
+ a
+(1 row)
+
+select range_start('(a,a]'::textrange);
+ range_start
+-------------
+ a
+(1 row)
+
+select range_start('(a,a)'::textrange);
+ range_start
+-------------
+ a
+(1 row)
+
+select range_end('[a,a)'::textrange);
+ range_end
+-----------
+ a
+(1 row)
+
+select range_end('(a,a]'::textrange);
+ range_end
+-----------
+ a
+(1 row)
+
+select range_end('(a,a)'::textrange);
+ range_end
+-----------
+ a
+(1 row)
+
--
-- create some test data and test the operators
--
diff --git a/src/test/regress/sql/rangetypes.sql b/src/test/regress/sql/rangetypes.sql
index e1b8917c0c..fed62d082f 100644
--- a/src/test/regress/sql/rangetypes.sql
+++ b/src/test/regress/sql/rangetypes.sql
@@ -41,6 +41,18 @@ select '[a,a]'::textrange;
select '[a,a)'::textrange;
select '(a,a]'::textrange;
select '(a,a)'::textrange;
+select lower('[a,a)'::textrange);
+select lower('(a,a]'::textrange);
+select lower('(a,a)'::textrange);
+select upper('[a,a)'::textrange);
+select upper('(a,a]'::textrange);
+select upper('(a,a)'::textrange);
+select range_start('[a,a)'::textrange);
+select range_start('(a,a]'::textrange);
+select range_start('(a,a)'::textrange);
+select range_end('[a,a)'::textrange);
+select range_end('(a,a]'::textrange);
+select range_end('(a,a)'::textrange);
--
-- create some test data and test the operators
On Tue, Mar 2, 2021, at 20:57, Mark Dilger wrote:
I didn't phrase that clearly enough. I'm thinking about whether you include the bounds information in the hash function. The current implementation of hash_range(PG_FUNCTION_ARGS) is going to hash the lower and upper bounds, since you didn't change it to do otherwise, so "equal" values won't always hash the same. I haven't tested this out, but it seems you could get a different set of rows depending on whether the planner selects a hash join.
I think this issue is solved by the empty-ranges-with-bounds-information-v2.patch I just sent,
since with it, there are no semantic changes at all, lower() and upper() works like before.
/Joel
On 03/02/21 14:20, Joel Jacobson wrote:
This would effectively mean there would be absolutely no semantic changes at all.
I will work on a new patch to try out this idea.
I may have been assuming a degree of orthogonality in SQL that isn't
really there ... only in a few situations (like creating an index) do
you easily get to specify a non-default operator class to use when
comparing or hashing a value.
So perhaps a solution could be built on the same range machinery, but
requiring some new syntax in CREATE TYPE ... AS RANGE, something like
WITH POSITIONS. A new concrete range type created that way would not
be a whole different kind of a thing, and would share most machinery
with other range types, but would have the position-remembering
behavior. Given a range type created over int4 in that way, maybe
named int4prange, the regexp_positions function could return one of those.
Regards,
-Chap
On Tue, Mar 2, 2021, at 19:28, Tom Lane wrote:
More generally, values that are visibly different yet compare equal
are user-unfriendly in the extreme. We do have cases like that
(IEEE-float minus zero, area-based compare of some geometric types
come to mind) but they are all warts, not things to be emulated.
I almost agree with you, but if forced to choose,
I would rather have two wart feet I can use, rather than limping around on only one wart free foot.
/Joel
On Mar 2, 2021, at 12:04 PM, Joel Jacobson <joel@compiler.org> wrote:
On Tue, Mar 2, 2021, at 20:57, Mark Dilger wrote:
I didn't phrase that clearly enough. I'm thinking about whether you include the bounds information in the hash function. The current implementation of hash_range(PG_FUNCTION_ARGS) is going to hash the lower and upper bounds, since you didn't change it to do otherwise, so "equal" values won't always hash the same. I haven't tested this out, but it seems you could get a different set of rows depending on whether the planner selects a hash join.
I think this issue is solved by the empty-ranges-with-bounds-information-v2.patch I just sent,
since with it, there are no semantic changes at all, lower() and upper() works like before.
There are semantic differences, because hash_range() doesn't call lower() and upper(), it uses RANGE_HAS_LBOUND and RANGE_HAS_UBOUND, the behavior of which you have changed. I created a regression test and expected results and checked after applying your patch, and your patch breaks the hash function behavior. Notice that before your patch, all three ranges hashed to the same value, but not so after:
@@ -1,18 +1,18 @@
select hash_range('[a,a)'::textrange);
hash_range
------------
- 484847245
+ -590102690
(1 row)
select hash_range('[b,b)'::textrange);
hash_range
------------
- 484847245
+ 281562732
(1 row)
select hash_range('[c,c)'::textrange);
- hash_range
-------------
- 484847245
+ hash_range
+-------------
+ -1887445565
(1 row)
You might change how hash_range() works to get all "equal" values to hash the same value, but that just gets back to the problem that non-equal things appear to be equal. I guess that's your two-warty-feet preference, but not everyone is going to be in agreement on that.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 2, 2021, at 21:40, Mark Dilger wrote:
On Mar 2, 2021, at 12:04 PM, Joel Jacobson <joel@compiler.org> wrote:
On Tue, Mar 2, 2021, at 20:57, Mark Dilger wrote:
I didn't phrase that clearly enough. I'm thinking about whether you include the bounds information in the hash function. The current implementation of hash_range(PG_FUNCTION_ARGS) is going to hash the lower and upper bounds, since you didn't change it to do otherwise, so "equal" values won't always hash the same. I haven't tested this out, but it seems you could get a different set of rows depending on whether the planner selects a hash join.
I think this issue is solved by the empty-ranges-with-bounds-information-v2.patch I just sent,
since with it, there are no semantic changes at all, lower() and upper() works like before.There are semantic differences, because hash_range() doesn't call lower() and upper(), it uses RANGE_HAS_LBOUND and RANGE_HAS_UBOUND, the behavior of which you have changed. I created a regression test and expected results and checked after applying your patch, and your patch breaks the hash function behavior. Notice that before your patch, all three ranges hashed to the same value, but not so after:
@@ -1,18 +1,18 @@ select hash_range('[a,a)'::textrange); hash_range ------------ - 484847245 + -590102690 (1 row)select hash_range('[b,b)'::textrange); hash_range ------------ - 484847245 + 281562732 (1 row)select hash_range('[c,c)'::textrange); - hash_range ------------- - 484847245 + hash_range +------------- + -1887445565 (1 row)You might change how hash_range() works to get all "equal" values to hash the same value, but that just gets back to the problem that non-equal things appear to be equal. I guess that's your two-warty-feet preference, but not everyone is going to be in agreement on that.
Yikes. Here be dragons. I think I want my wart free foot back please.
Many thanks for explaining. I think I’ll abandon this patch. I guess implementing an entirely new range type could be an acceptable solution, but that’s too big of a project for me to manage on my own. If any more experienced hackers are interested in such a project, I would love to help if I can.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kind regards,
Joel
On Mar 2, 2021, at 12:51 PM, Joel Jacobson <joel@compiler.org> wrote:
Yikes. Here be dragons. I think I want my wart free foot back please.
Many thanks for explaining. I think I’ll abandon this patch. I guess implementing an entirely new range type could be an acceptable solution, but that’s too big of a project for me to manage on my own. If any more experienced hackers are interested in such a project, I would love to help if I can.
Part of what was strange about arguing against your patch is that I kind of wanted the feature to work that way back when it originally got written. (Not to say that it *should* have worked that way, just that part of me wanted it.)
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 2, 2021, at 21:58, Mark Dilger wrote:
Part of what was strange about arguing against your patch is that I kind of wanted the feature to work that way back when it originally got written. (Not to say that it *should* have worked that way, just that part of me wanted it.)
That's encouraging to hear. I've marked the patch as Rejected, since it was a dead end.
I would accept things as they are, if there was nothing that could be done,
but since Multiranges have not been released yet,
I think it's worth thinking intensively about possible problems until it's too late.
For discrete types, Multiranges <=> Sets should be true,
i.e. they should be equivalent, since there cannot be any values
in between two discrete adjacent values.
Due to the internals of ranges, it's not possible to create a range
that covers all possible valid values for a discrete type,
since the very last value cannot be included.
Example for int4:
SELECT int4range(-2147483647,2147483647,'[]');
ERROR: integer out of range
SELECT int4range(0,2147483647,'[]');
ERROR: integer out of range
SELECT int4range(-2147483647,0,'[]');
int4range
-----------------
[-2147483647,1)
(1 row)
However, 2147483647 is a valid int4 value.
This is due to the unfortunate decision to use [) as the canonical form,
since it must then always be able to calculate the next adjacent value.
If instead [] would have been used as the canonical form,
we would not have this problem.
Not a biggie for int4 maybe, but imagine a very small discrete type,
where it's actually necessary to create a range including its very last value.
Suggestion #1: Use [] as the canonical form for discrete types.
This would allow creating ranges for all values for discrete types.
/Joel
On Tue, 2021-03-02 at 18:52 +0100, Joel Jacobson wrote:
and the theory is only cute until you face the ugly reality.
There are a lot of range functions and operators, as well as different
kinds of ranges (continuous and discrete), and multiranges too. That
means a lot of ways to combine operations in novel ways, and (if we
aren't careful) a lot of ways to produce very unexpected results.
The benefit of falling back on theory is that there's an answer ready
when we need to define or explain the semantics. It might not match
everyone's intuition, but usually avoids the worst kinds of surprises.
Regards,
Jeff Davis
On Thu, 4 Mar 2021 at 01:25, Joel Jacobson <joel@compiler.org> wrote:
Suggestion #1: Use [] as the canonical form for discrete types.
This would allow creating ranges for all values for discrete types.
I won't reiterate here, but there are fundamental reasons why [) is
definitely the right default and canonical form.
In any case, you can create a range containing the last value:
odyssey=> select 2147483647 <@ int4range (0, null);
?column?
----------
t
(1 row)
odyssey=>
It does seem reasonable to me to change it so that specifying the last
value as the right end with ] would use a null endpoint instead of
producing an error when it tries to increment the bound.
On Thu, Mar 4, 2021, at 16:21, Isaac Morland wrote:
On Thu, 4 Mar 2021 at 01:25, Joel Jacobson <joel@compiler.org> wrote:
__
Suggestion #1: Use [] as the canonical form for discrete types.
This would allow creating ranges for all values for discrete types.I won't reiterate here, but there are fundamental reasons why [) is definitely the right default and canonical form.
It would be interesting to hear the reasons.
For discrete types, there are only exact values, there is nothing in between two adjacent discrete values.
So if we mean a range covering only the integer 5, why can't we just say [5,5] which simply means "5"?
Why is it necessary to express it as [5,6) which I interpret as the much more complex "all integers from 5 up until just before the integer 6".
We know for sure nothing can exist after 5 before 6, it's void, so why is it necessary to be explicit about including this space which we know can't contain any values?
For discrete types, we wouldn't even need the inclusive/exclusive features at all.
In any case, you can create a range containing the last value:
odyssey=> select 2147483647 <@ int4range (0, null);
?column?
----------
t
(1 row)odyssey=>
It does seem reasonable to me to change it so that specifying the last value as the right end with ] would use a null endpoint instead of producing an error when it tries to increment the bound.
Neat hack, thanks.
/Joel