GiST for range types (was Re: Range Types - typo + NULL string constructor)
On Fri, Oct 7, 2011 at 7:41 AM, Jeff Davis <pgsql@j-davis.com> wrote:
I'd prefer to include it in the initial patch. If the current GiST code
is going to be replaced, then there's not much sense reviewing/testing
it.You may need to consider unbounded and empty ranges specially. I made an
attempt to do so in the current GiST code, and you might want to take a
look at that first. I'm not particularly attached to my approach, but we
should do something reasonable with unbounded and empty ranges.
The first thing caught my eye in existing GiST code is idea of
subtype_float. float8 has limited precision and can't respresent, for
example, varlena values good enough. Even if we have large int8 value we can
loose lower bits, but data distribution can be so that these bits are
valuable. Wouldn't it better to have function like subtype_diff_float which
returns difference between two values of subtype as an float? Using of such
function could make penalty more sensible to even small difference between
values, and accordingly more relevant.
------
With best regards,
Alexander Korotkov.
On Fri, 2011-10-07 at 12:54 +0400, Alexander Korotkov wrote:
The first thing caught my eye in existing GiST code is idea of
subtype_float. float8 has limited precision and can't respresent, for
example, varlena values good enough. Even if we have large int8 value
we can loose lower bits, but data distribution can be so that these
bits are valuable. Wouldn't it better to have function like
subtype_diff_float which returns difference between two values of
subtype as an float? Using of such function could make penalty more
sensible to even small difference between values, and accordingly more
relevant.
The reason I did it that way is for unbounded ranges. With
subtype_diff_float, it's difficult for the GiST code to differentiate
between [10,) and [100000,), because infinity minus anything is
infinity. But when inserting the range [100,200), the penalty for the
first one should be zero and the second one should have some positive
penalty, right?
Maybe we can still use subtype_diff_float by calling it on various pairs
of bounds to come up with a reasonable cost?
I'm open to suggestion. I'd just like to make sure that unbounded ranges
are a part of the consideration.
Regards,
Jeff Davis
On Sat, Oct 8, 2011 at 1:01 PM, Jeff Davis <pgsql@j-davis.com> wrote:
On Fri, 2011-10-07 at 12:54 +0400, Alexander Korotkov wrote:
The first thing caught my eye in existing GiST code is idea of
subtype_float. float8 has limited precision and can't respresent, for
example, varlena values good enough. Even if we have large int8 value
we can loose lower bits, but data distribution can be so that these
bits are valuable. Wouldn't it better to have function like
subtype_diff_float which returns difference between two values of
subtype as an float? Using of such function could make penalty more
sensible to even small difference between values, and accordingly more
relevant.The reason I did it that way is for unbounded ranges. With
subtype_diff_float, it's difficult for the GiST code to differentiate
between [10,) and [100000,), because infinity minus anything is
infinity. But when inserting the range [100,200), the penalty for the
first one should be zero and the second one should have some positive
penalty, right?
I meant that penalty can be determined as sum of difference of old and new
bounds of range, i.e. penalty = subtype_diff_float(new_lower, old_lower)
+ subtype_diff_float(old_upper, new_upper).
When we insert [100,200) into [10,+inf), union([100,200), [10,+inf))
= [10,+inf), so penalty = subtype_diff_float(10,10)
+ subtype_diff_float(+inf, +inf) = 0 + 0 = 0.
When we insert [100,200) into [100000,), union([100,200), [100000,+inf))
= [100,+inf), so penalty = subtype_diff_float(100,100000)
+ subtype_diff_float(+inf, +inf) = 99900 + 0 = 99900.
But, there are still the problem, when we'are inserting open interval when
there is no such open intervals yet. For example, we're going to insert
[0,+inf), while root page contains [0,10), [10,20), [20,30). Each penalty
will be infinity, while it seems to be better to insert it into [0,10). But,
it seems to me to be general limitation of current GiST interface, when we
have to express penalty in a single float.
------
With best regards,
Alexander Korotkov.
On Sat, 2011-10-08 at 18:43 +0400, Alexander Korotkov wrote:
I meant that penalty can be determined as sum of difference of old and new bounds of range, i.e. penalty = subtype_diff_float(new_lower, old_lower) + subtype_diff_float(old_upper, new_upper). When we insert [100,200) into [10,+inf), union([100,200), [10,+inf)) = [10,+inf), so penalty = subtype_diff_float(10,10) + subtype_diff_float(+inf, +inf) = 0 + 0 = 0. When we insert [100,200) into [100000,), union([100,200), [100000, +inf)) = [100,+inf), so penalty = subtype_diff_float(100,100000) + subtype_diff_float(+inf, +inf) = 99900 + 0 = 99900.
OK, I like that. I will make the change.
But, there are still the problem, when we'are inserting open interval
when there is no such open intervals yet. For example, we're going to
insert [0,+inf), while root page contains [0,10), [10,20), [20,30).
Each penalty will be infinity, while it seems to be better to insert
it into [0,10). But, it seems to me to be general limitation of
current GiST interface, when we have to express penalty in a single
float.
That seems like an acceptable limitation. I don't think my solution
handles it any better.
Regards,
Jeff Davis
Show quoted text
On Fri, 2011-10-07 at 12:54 +0400, Alexander Korotkov wrote:
The first thing caught my eye in existing GiST code is idea of
subtype_float. float8 has limited precision and can't respresent, for
example, varlena values good enough. Even if we have large int8 value
we can loose lower bits, but data distribution can be so that these
bits are valuable. Wouldn't it better to have function like
subtype_diff_float which returns difference between two values of
subtype as an float? Using of such function could make penalty more
sensible to even small difference between values, and accordingly more
relevant.
I started implementing subtype_diff, and I noticed that it requires
defining an extra function for each range type. Previously, the numeric
types could just use a cast, which was convenient for user-defined range
types.
If you have any other ideas to make that cleaner, please let me know.
Otherwise I'll just finish implementing subtype_diff.
Regards,
Jeff Davis
On Sun, 2011-10-16 at 14:43 -0700, Jeff Davis wrote:
On Fri, 2011-10-07 at 12:54 +0400, Alexander Korotkov wrote:
The first thing caught my eye in existing GiST code is idea of
subtype_float. float8 has limited precision and can't respresent, for
example, varlena values good enough. Even if we have large int8 value
we can loose lower bits, but data distribution can be so that these
bits are valuable. Wouldn't it better to have function like
subtype_diff_float which returns difference between two values of
subtype as an float? Using of such function could make penalty more
sensible to even small difference between values, and accordingly more
relevant.I started implementing subtype_diff, and I noticed that it requires
defining an extra function for each range type. Previously, the numeric
types could just use a cast, which was convenient for user-defined range
types.If you have any other ideas to make that cleaner, please let me know.
Otherwise I'll just finish implementing subtype_diff.
I'm beginning to think that we should just allow the user to specify
their own gist_penalty function. Specifying just the subtype_diff
doesn't save much time, and it can only be limiting. Additionally, it's
harder for users to understand the purpose of the function.
Regards,
Jeff Davis
Hi!
On Mon, Oct 17, 2011 at 12:38 PM, Jeff Davis <pgsql@j-davis.com> wrote:
I started implementing subtype_diff, and I noticed that it requires
defining an extra function for each range type. Previously, the numeric
types could just use a cast, which was convenient for user-defined range
types.If you have any other ideas to make that cleaner, please let me know.
Otherwise I'll just finish implementing subtype_diff.
I think implementing subtype_diff for each datatype is ok. We could
implement some universal function based on minus operator and casting to
double precision. But such solution might be unacceptable in both
*predictability
(operator and casting function might do not the things we expect) and
performance.*
I'm beginning to think that we should just allow the user to specify
their own gist_penalty function. Specifying just the subtype_diff
doesn't save much time, and it can only be limiting. Additionally, it's
harder for users to understand the purpose of the function.
If we allow user to specify own gist_penalty function, then such function
should deal with:
1) GiST-specific data structures such as GISTENTRY.
2) Decomposing ranges using range_deserialize.
3) Inifinities, which we could handle in general penalty functions.
Thats why I prefere to implement subtype_diff.
------
With best regards,
Alexander Korotkov.
On Mon, Oct 24, 2011 at 3:05 PM, Alexander Korotkov <aekorotkov@gmail.com>wrote:
If we allow user to specify own gist_penalty function, then such function
should deal with:
1) GiST-specific data structures such as GISTENTRY.
2) Decomposing ranges using range_deserialize.
3) Inifinities, which we could handle in general penalty functions.
Thats why I prefere to implement subtype_diff.
I forgot another agument for having subtype_diff:
4) In my picksplit algorithm it would be more natural to use subtype_diff
for measuring overlap than use penalty function.
------
With best regards,
Alexander Korotkov.
On Mon, 2011-10-24 at 15:05 +0400, Alexander Korotkov wrote:
I think implementing subtype_diff for each datatype is ok. We could
implement some universal function based on minus operator and casting
to double precision. But such solution might be unacceptable in
both predictability (operator and casting function might do not the
things we expect) and performance.
Done.
Everything is complete in this patch with the exception of two optional
things, which I still intend to do but might best be done in a separate
commit:
* support typmod for ranges
* support casts between different range types
Both of these things, I believe, require the introduction of an
RangeCoerceExpr, similar to ArrayCoerceExpr. That's fine, but it creates
a rather large diff, so it might be best left for a later commit.
Regards,
Jeff Davis
Attachments:
On 01.11.2011 06:33, Jeff Davis wrote:
On Mon, 2011-10-24 at 15:05 +0400, Alexander Korotkov wrote:
I think implementing subtype_diff for each datatype is ok. We could
implement some universal function based on minus operator and casting
to double precision. But such solution might be unacceptable in
both predictability (operator and casting function might do not the
things we expect) and performance.Done.
Thanks, I'm looking into this now.
+ else if (lower1.infinite || upper1.infinite) + length1 = 1.0/0.0;
That seems wrong. I take it that the point is to set length1 to infinity?
PS. I note the docs still refer to subtype_float. I'll fix that before
committing.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 01.11.2011 06:33, Jeff Davis wrote:
+ else if (lower1.infinite || upper1.infinite) + length1 = 1.0/0.0;
That seems wrong. I take it that the point is to set length1 to infinity?
Please use get_float[48]_infinity() or get_float[48]_nan(), as
appropriate (I think the latter may be intended here), rather than
making up your own way of getting those values.
regards, tom lane
On 01.11.2011 06:33, Jeff Davis wrote:
On Mon, 2011-10-24 at 15:05 +0400, Alexander Korotkov wrote:
I think implementing subtype_diff for each datatype is ok. We could
implement some universal function based on minus operator and casting
to double precision. But such solution might be unacceptable in
both predictability (operator and casting function might do not the
things we expect) and performance.Done.
Everything is complete in this patch with the exception of two optional
things, which I still intend to do but might best be done in a separate
commit:* support typmod for ranges
* support casts between different range typesBoth of these things, I believe, require the introduction of an
RangeCoerceExpr, similar to ArrayCoerceExpr. That's fine, but it creates
a rather large diff, so it might be best left for a later commit.
Using the test table from the rangetypes test case:
postgres=# select * from test_range_gist where 10 <@ ir;
ERROR: unsupported type: 3904
This seems to be coming from the selectivity estimation function. The
selectivity function for <@ is scalargtsel, which is usually used for
scalar > and >=. That doesn't seem right. But what do we store in the
statistics for range types in the first place, and what would be the
right thing to do for selectivity estimation?
I'll dig deeper into this tomorrow...
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 02.11.2011 22:59, Heikki Linnakangas wrote:
I'll dig deeper into this tomorrow...
Forgot to mention: I have pushed what I have done this far to my git
repository at git://git.postgresql.org/git/users/heikki/postgres.git, if
you want to take a look. Nothing major, just garden-variety cleanup.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, 2011-11-02 at 21:29 +0200, Heikki Linnakangas wrote:
+ else if (lower1.infinite || upper1.infinite) + length1 = 1.0/0.0;That seems wrong. I take it that the point is to set length1 to infinity?
I reworked this in commit (on my private repo, of course):
6197fbffb00f729feba8082136801cdef5ac850e
For the archives, it's essentially taking the difference on the left
side of the range, and the difference on the right side of the range,
and adding them together. There are just a lot of special cases for
infinite boundaries, empty ranges, and the lack of a subtype_diff
function.
I think it's a little closer to what Alexander intended, which I think
is an improvement. It should now be able to recognize that expanding
[10,) into [0,) has a penalty of 10.
PS. I note the docs still refer to subtype_float. I'll fix that before
committing.
Thank you. The only change I found strange was the test that used \c to
reconnect; but I can't say that my solution was any better.
Regards,
Jeff Davis
On Wed, 2011-11-02 at 22:59 +0200, Heikki Linnakangas wrote:
This seems to be coming from the selectivity estimation function. The
selectivity function for <@ is scalargtsel, which is usually used for
scalar > and >=. That doesn't seem right. But what do we store in the
statistics for range types in the first place, and what would be the
right thing to do for selectivity estimation?
I'll have to think more about that, and it depends on the operator. It
seems like an easier problem for "contains a point" than "contains
another range" or "overlaps with another range".
Right now I don't have a very good answer, and even for the "contains a
point" case I'll have to think about the representation in pg_statistic.
Regards,
Jeff Davis
On 03.11.2011 10:42, Jeff Davis wrote:
On Wed, 2011-11-02 at 22:59 +0200, Heikki Linnakangas wrote:
This seems to be coming from the selectivity estimation function. The
selectivity function for<@ is scalargtsel, which is usually used for
scalar> and>=. That doesn't seem right. But what do we store in the
statistics for range types in the first place, and what would be the
right thing to do for selectivity estimation?I'll have to think more about that, and it depends on the operator. It
seems like an easier problem for "contains a point" than "contains
another range" or "overlaps with another range".Right now I don't have a very good answer, and even for the "contains a
point" case I'll have to think about the representation in pg_statistic.
I've committed this now, after some more cleanup. I removed the
selectivity estimation functions from operators where they were bogus,
so writing those is a clear TODO. But that can well be done as a
separate patch.
Thanks!
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Nov 3, 2011, at 4:59 AM, Heikki Linnakangas wrote:
I've committed this now, after some more cleanup. I removed the selectivity estimation functions from operators where they were bogus, so writing those is a clear TODO. But that can well be done as a separate patch.
Thanks!
Woo! Congrats Jeff. Awesome news. Very excited about this feature. Thanks for getting this in, Heikki.
Best,
David
On Nov3, 2011, at 18:54 , David E. Wheeler wrote:
On Nov 3, 2011, at 4:59 AM, Heikki Linnakangas wrote:
I've committed this now, after some more cleanup. I removed the selectivity estimation functions from operators where they were bogus, so writing those is a clear TODO. But that can well be done as a separate patch.
Thanks!
Woo! Congrats Jeff. Awesome news. Very excited about this feature. Thanks for getting this in, Heikki.
+1. Great work, guys!
best regards,
Florian Pflug
On Thu, Nov 3, 2011 at 3:59 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:
I've committed this now, after some more cleanup. I removed the
selectivity estimation functions from operators where they were bogus, so
writing those is a clear TODO. But that can well be done as a separate
patch.
Cool! Patch with GiST on range types improvements from me will be soon.
------
With best regards,
Alexander Korotkov.
During work on gist for range types I've faced with following problem:
test=# select 'empty'::int4range !?;
ERROR: operator does not exist: int4range !?
LINE 1: select 'empty'::int4range !?;
^
HINT: No operator matches the given name and argument type(s). You might
need to add explicit type casts.
test=# select 'empty'::int4range ?;
ERROR: operator does not exist: int4range ?
LINE 1: select 'empty'::int4range ?;
^
HINT: No operator matches the given name and argument type(s). You might
need to add explicit type casts.
So, !? and ? operators are mentioned in documentation, but don't present in
catalog. Are them just missed in the catalog or there is some more serious
problem?
------
With best regards,
Alexander Korotkov.