Review: GiST support for UUIDs
/messages/by-id/CA+renyVepHxTO1c7dFbVjP1GYMUc0-3qDNWPN30-noo5MPyaVQ@mail.gmail.com
Patch looks perfect but it's still needed some work.
0) rebase to current HEAD (done, in attach)
1) UUIDSIZE -> UUID_LEN (it's defined in utils/uuid.h, done)
2)
static double
uuid2num(const pg_uuid_t *i)
{
return *((uint64 *)i);
}
It isn't looked as correct transformation for me. May be, it's better
to transform to numeric type (UUID looks like a 16-digit hexademical number)
and follow gbt_numeric_penalty() logic (or even call directly).
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachments:
btree_gist_uuid_2.patchtext/plain; charset=UTF-8; name=btree_gist_uuid_2.patchDownload+2730-1424
Please post reviews to the original thread unless that's already humongously large. Otherwise somebody needs to manually link up the threads in the CF application.
It also makes it much harder to follow the development because there'll likely be a new version of the patch after a review. Which then cab either pushed in a thread titled review, or in an old one, without context.
Andres
---
Please excuse brevity and formatting - I am writing this on my mobile phone.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andres Freund wrote:
Please post reviews to the original thread unless that's already humongously large.
I've lost original mail. Sorry.
Otherwise somebody needs to manually link up the threads in the CF application.
Of course, I did it
It also makes it much harder to follow the development because there'll likely be a new version of the patch
after a review. Which then cab either pushed in a thread titled review, or in an old one, without context.
Sorry, I tried to fix that as possible.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
2)
static double
uuid2num(const pg_uuid_t *i)
{
return *((uint64 *)i);
}
It isn't looked as correct transformation for me. May be, it's better
to transform to numeric type (UUID looks like a 16-digit hexademical
number)
and follow gbt_numeric_penalty() logic (or even call directly).
Thanks for the review! A UUID is actually not stored as a string of
hexadecimal digits. (It is normally displayed that way, but with 32
digits, not 16.) Rather it is stored as an unstructured 128-bit value
(which in C is 16 unsigned chars). Here is the easy-to-misread
declaration from src/backend/utils/adt/uuid.c:
#define UUID_LEN 16
struct pg_uuid_t
{
unsigned char data[UUID_LEN];
};
I would love to just cast this to a 128-bit unsigned int. But it looks
like Postgres doesn't always have 128-bit ints available, so my code
takes the top half and uses that for penalty calculations. It seemed to
me that was "good enough" for this purpose.
The only other 128-bit type I found in btree_gist was Interval. For that
type we convert to a double using INTERVAL_TO_SEC, then call
penalty_num. By my read that accepts a similar loss of precision.
If I'm mistaken about 128-bit integer support, let me know, and maybe we
can do the penalty computation on the whole UUID. Or maybe I should just
convert the uint64 to a double before calling penalty_num? I don't
completely understand what the penalty calculation is all about, so I
welcome suggestions here.
Thanks again,
Paul
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Paul Jungwirth wrote:
2)
static double
uuid2num(const pg_uuid_t *i)
{
return *((uint64 *)i);
}
It isn't looked as correct transformation for me. May be, it's better
to transform to numeric type (UUID looks like a 16-digit hexademical
number)
and follow gbt_numeric_penalty() logic (or even call directly).Thanks for the review! A UUID is actually not stored as a string of
hexadecimal digits. (It is normally displayed that way, but with 32
digits, not 16.) Rather it is stored as an unstructured 128-bit value
(which in C is 16 unsigned chars). Here is the easy-to-misread
declaration from src/backend/utils/adt/uuid.c:
Missed number of digit, but nevertheless it doesn't matter for idea.
Original coding uses only 8 bytes from 16 to compute penalty which could
cause a problem with index performance. Simple way is just printing each
4bits with %02d modifier into string and then make a numeric value with
a help of numeric_in.
Or something like this in pseudocode:
numeric = int8_numeric(*(uint64 *)(&i->data[0])) *
int8_numeric(MAX_INT64) + int8_numeric(*(uint64 *)(&i->data[8]))
The only other 128-bit type I found in btree_gist was Interval. For that
type we convert to a double using INTERVAL_TO_SEC, then call
penalty_num. By my read that accepts a similar loss of precision.
Right, but precision of double is enough to represent 1 century
interval with 0.00001 seconds accuracy which is enough for practical
usage. In UUID case you will take into account only half of value. Of
course, GiST will work even with penalty function returning constant but
each scan could become full-index-scan.
If I'm mistaken about 128-bit integer support, let me know, and maybe we
can do the penalty computation on the whole UUID. Or maybe I should just
convert the uint64 to a double before calling penalty_num? I don't
completely understand what the penalty calculation is all about, so I
welcome suggestions here.
Penalty method calculates how union key will be enlarged if insert will
be produced in current subtree. It directly affects selectivity of subtree.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Or something like this in pseudocode:
numeric = int8_numeric(*(uint64 *)(&i->data[0])) *
int8_numeric(MAX_INT64) + int8_numeric(*(uint64 *)(&i->data[8]))
This is more like what I was hoping for, rather than converting to a
string and back. Would you mind confirming for me: int8_numeric turns an
int64 into an arbitrary-precision numeric Datum? So there is no overflow
risk here?
But it looks like int8_numeric takes a *signed* integer. Isn't that a
problem? I suppose I could get it working though by jumping through some
hoops.
In UUID case you will take into account only half of value. Of
course, GiST will work even with penalty function returning constant but
each scan could become full-index-scan.
Yes, but that seems like an unrealistic concern. Even "only" 2^64 is
18446744073709551616 records. Or another way of thinking about it, if 64
bits (or 32) is a good enough penalty input for all the other data
types, why not for UUIDs? Keep in mind type 3-5 UUIDs should be evenly
distributed. Perhaps we could use the bottom half (instead of the top)
to ensure even distribution for type 1 and 2 too.
It seems to me that using only the top half should be okay, but if you
believe it's not I'll go with the int8_numeric approach (in three chunks
instead of two to work around signed-vs-unsigned).
Thanks,
Paul
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Paul Jungwirth wrote:
Or something like this in pseudocode:
numeric = int8_numeric(*(uint64 *)(&i->data[0])) *
int8_numeric(MAX_INT64) + int8_numeric(*(uint64 *)(&i->data[8]))This is more like what I was hoping for, rather than converting to a
string and back. Would you mind confirming for me: int8_numeric turns an
int64 into an arbitrary-precision numeric Datum? So there is no overflow
risk here?
Sure, no risk. Numeric precision is limited 1000 digits with magnitude 1000
But it looks like int8_numeric takes a *signed* integer. Isn't that a
problem? I suppose I could get it working though by jumping through some
hoops.
signed vs unsigned problem does not exist actually, because of precision
of numeric is much better than we need and presence of numeric_abs.
Yes, but that seems like an unrealistic concern. Even "only" 2^64 is
18446744073709551616 records. Or another way of thinking about it, if 64
:) "only"
bits (or 32) is a good enough penalty input for all the other data
types, why not for UUIDs? Keep in mind type 3-5 UUIDs should be evenly
distributed. Perhaps we could use the bottom half (instead of the top)
to ensure even distribution for type 1 and 2 too.
it must be. But UUID could be taken for unknown source and we can't
predict distribution. I believe pg generates them correctly, but other
generators could be not so good.
It seems to me that using only the top half should be okay, but if you
believe it's not I'll go with the int8_numeric approach (in three chunks
instead of two to work around signed-vs-unsigned).
Yes, I believe. It is not good case when we can ruin index performance
with special set of value.
Some difficulty which I see is how to transform numeric penalty to
double as it requires by GiST. May be, penalty/(INT_MAX64*INT_MAX64 +
INT_MAX64)? Keep in mind, that penalty is how range will be enlarged
after range union, so minimal penalty is 0 and maximum is
0xffffffffffffffffffffffffffffffff
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 15, 2015 at 3:03 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
Paul Jungwirth wrote:
Or something like this in pseudocode:
numeric = int8_numeric(*(uint64 *)(&i->data[0])) *
int8_numeric(MAX_INT64) + int8_numeric(*(uint64 *)(&i->data[8]))This is more like what I was hoping for, rather than converting to a
string and back. Would you mind confirming for me: int8_numeric turns an
int64 into an arbitrary-precision numeric Datum? So there is no overflow
risk here?Sure, no risk. Numeric precision is limited 1000 digits with magnitude 1000
But it looks like int8_numeric takes a *signed* integer. Isn't that a
problem? I suppose I could get it working though by jumping through some
hoops.signed vs unsigned problem does not exist actually, because of precision of
numeric is much better than we need and presence of numeric_abs.Yes, but that seems like an unrealistic concern. Even "only" 2^64 is
18446744073709551616 records. Or another way of thinking about it, if 64:) "only"
bits (or 32) is a good enough penalty input for all the other data
types, why not for UUIDs? Keep in mind type 3-5 UUIDs should be evenly
distributed. Perhaps we could use the bottom half (instead of the top)
to ensure even distribution for type 1 and 2 too.it must be. But UUID could be taken for unknown source and we can't predict
distribution. I believe pg generates them correctly, but other generators
could be not so good.It seems to me that using only the top half should be okay, but if you
believe it's not I'll go with the int8_numeric approach (in three chunks
instead of two to work around signed-vs-unsigned).Yes, I believe. It is not good case when we can ruin index performance with
special set of value.Some difficulty which I see is how to transform numeric penalty to double
as it requires by GiST. May be, penalty/(INT_MAX64*INT_MAX64 + INT_MAX64)?
Keep in mind, that penalty is how range will be enlarged after range union,
so minimal penalty is 0 and maximum is 0xffffffffffffffffffffffffffffffff
This patch has been marked as "returned with feedback" for CF 2015-11
because of the presence of a review and a certain lack of activity..
Please free to move it to next CF if you think that's more adapted.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, 15 Sep 2015 09:03:00 +0300
Teodor Sigaev <teodor@sigaev.ru> wrote:
Paul Jungwirth wrote:
Or something like this in pseudocode:
numeric = int8_numeric(*(uint64 *)(&i->data[0])) *
int8_numeric(MAX_INT64) + int8_numeric(*(uint64 *)(&i->data[8]))This is more like what I was hoping for, rather than converting to a
string and back. Would you mind confirming for me: int8_numeric
turns an int64 into an arbitrary-precision numeric Datum? So there
is no overflow risk here?Sure, no risk. Numeric precision is limited 1000 digits with
magnitude 1000But it looks like int8_numeric takes a *signed* integer. Isn't that
a problem? I suppose I could get it working though by jumping
through some hoops.signed vs unsigned problem does not exist actually, because of
precision of numeric is much better than we need and presence of
numeric_abs.Yes, but that seems like an unrealistic concern. Even "only" 2^64 is
18446744073709551616 records. Or another way of thinking about it,
if 64:) "only"
bits (or 32) is a good enough penalty input for all the other data
types, why not for UUIDs? Keep in mind type 3-5 UUIDs should be
evenly distributed. Perhaps we could use the bottom half (instead
of the top) to ensure even distribution for type 1 and 2 too.it must be. But UUID could be taken for unknown source and we can't
predict distribution. I believe pg generates them correctly, but
other generators could be not so good.It seems to me that using only the top half should be okay, but if
you believe it's not I'll go with the int8_numeric approach (in
three chunks instead of two to work around signed-vs-unsigned).Yes, I believe. It is not good case when we can ruin index
performance with special set of value.Some difficulty which I see is how to transform numeric penalty to
double as it requires by GiST. May be, penalty/(INT_MAX64*INT_MAX64 +
INT_MAX64)? Keep in mind, that penalty is how range will be enlarged
after range union, so minimal penalty is 0 and maximum is
0xffffffffffffffffffffffffffffffff
There is a more improved version of the patch. Main idea is to present
uuid as two uint64 values, and make comparisons and penalty calculation
based on these values. This approach is much faster than using memcmp
for uuid comparisons.
--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
btree_gist_uuid_3.patchtext/x-patchDownload+2800-1427
On 12/23/2015 08:10 AM, Ildus Kurbangaliev wrote:
There is a more improved version of the patch. Main idea is to present
uuid as two uint64 values, and make comparisons and penalty calculation
based on these values. This approach is much faster than using memcmp
for uuid comparisons.
Thank you for picking this up! I'm sorry I was not able to work on it
the last few months. I'm very glad to see someone wrapping it up. I'm
not a reviewer, but personally it looks like a good change to me.
Happy holidays,
Paul
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, 23 Dec 2015 16:36:23 -0800
Paul Jungwirth <pj@illuminatedcomputing.com> wrote:
On 12/23/2015 08:10 AM, Ildus Kurbangaliev wrote:
There is a more improved version of the patch. Main idea is to
present uuid as two uint64 values, and make comparisons and penalty
calculation based on these values. This approach is much faster
than using memcmp for uuid comparisons.Thank you for picking this up! I'm sorry I was not able to work on it
the last few months. I'm very glad to see someone wrapping it up. I'm
not a reviewer, but personally it looks like a good change to me.Happy holidays,
Paul
Thanks! The patch was almost done and ready. I attached new version of
the patch with compability changes.
--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
btree_gist_uuid_4.patchtext/x-patchDownload+2796-1427
Thank you, but I have some notices about
static float
uuid_parts_distance(pg_uuid_t *a, pg_uuid_t *b)
{
pg_uuid_t ua, ub;
const double mp = pow(2, -64);
uuid_cnv(a, &ua);
uuid_cnv(b, &ub);
Assert(ua.v64[0] > ub.v64[0]);
uint64 high = ua.v64[0] - ub.v64[0];
uint64 low = ua.v64[1] - ub.v64[1];
if (low > ua.v64[1])
high--;
return (float) (ldexp(high, 64) + (double) low * mp);
}
First, variables (high and low) should not be declared in the middle of code.
Second, assert will fail if ua.v64[0] == ub.v64[0] and
ua.v64[1] > ub.v64[1] although it's a possible and allowed case. Third, actually
you multiply high value by 2^64 anf low by 2^-64. Seems, it's needed to do only
one multiplication.
Ildus Kurbangaliev wrote:
On Wed, 23 Dec 2015 16:36:23 -0800
Paul Jungwirth <pj@illuminatedcomputing.com> wrote:On 12/23/2015 08:10 AM, Ildus Kurbangaliev wrote:
There is a more improved version of the patch. Main idea is to
present uuid as two uint64 values, and make comparisons and penalty
calculation based on these values. This approach is much faster
than using memcmp for uuid comparisons.Thank you for picking this up! I'm sorry I was not able to work on it
the last few months. I'm very glad to see someone wrapping it up. I'm
not a reviewer, but personally it looks like a good change to me.Happy holidays,
Paul
Thanks! The patch was almost done and ready. I attached new version of
the patch with compability changes.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, 25 Dec 2015 13:34:25 +0300
Teodor Sigaev <teodor@sigaev.ru> wrote:
Thank you, but I have some notices about
static float
uuid_parts_distance(pg_uuid_t *a, pg_uuid_t *b)
{
pg_uuid_t ua, ub;
const double mp = pow(2, -64);uuid_cnv(a, &ua);
uuid_cnv(b, &ub);Assert(ua.v64[0] > ub.v64[0]);
uint64 high = ua.v64[0] - ub.v64[0];
uint64 low = ua.v64[1] - ub.v64[1];
if (low > ua.v64[1])
high--;return (float) (ldexp(high, 64) + (double) low * mp);
}First, variables (high and low) should not be declared in the middle
of code. Second, assert will fail if ua.v64[0] == ub.v64[0] and
ua.v64[1] > ub.v64[1] although it's a possible and allowed case.
Third, actually you multiply high value by 2^64 anf low by 2^-64.
Seems, it's needed to do only one multiplication.
Thank you for review. Fixed.
--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
btree_gist_uuid_5.patchtext/x-patchDownload+2800-1427
On 12/25/2015 03:23 AM, Ildus Kurbangaliev wrote:
On Fri, 25 Dec 2015 13:34:25 +0300
Teodor Sigaev <teodor@sigaev.ru> wrote:First, variables (high and low) should not be declared in the middle
of code. Second, assert will fail if ua.v64[0] == ub.v64[0] and
ua.v64[1] > ub.v64[1] although it's a possible and allowed case.
Third, actually you multiply high value by 2^64 anf low by 2^-64.
Seems, it's needed to do only one multiplication.Thank you for review. Fixed.
Just wanted to follow up on this and see about getting it added to the
next commitfest. It looks like Iluds's latest patch
(btree_gist_uuid_5.patch) doesn't appear on the commitfest page here:
https://commitfest.postgresql.org/7/332/
Also the last few emails of the thread don't show up, although you can
read them here:
/messages/by-id/20151225142340.46e577dd@lp
I'm not sure the next step here. Do I click "Move to next CF" in the
"Status" dropdown? Apologies for not being familiar with the protocol.
I'd be sad to see this work just get ignored.
Thanks,
Paul
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Feb 5, 2016 at 4:45 PM, Paul Jungwirth
<pj@illuminatedcomputing.com> wrote:
On 12/25/2015 03:23 AM, Ildus Kurbangaliev wrote:
Thank you for review. Fixed.
Just wanted to follow up on this and see about getting it added to the next
commitfest. It looks like Iluds's latest patch (btree_gist_uuid_5.patch)
doesn't appear on the commitfest page here:
https://commitfest.postgresql.org/7/332/
I took the btree_gist_uuid_5.patch file from Ildus and updated it so
it would apply on the latest commit. My version is attached as
btree_gist_uuid_6.patch. Since Tom recently changed the SQL signatures
of a few functions (gbt_*_union and gbt_*_same), I did the same with
the new uuid functions. I'd love to see this get accepted next time
around (in September it looks like)!
Thanks,
Paul