CUBE_MAX_DIM

Started by Devrim Gündüz · over 5 years ago · 6 messages
#1 Devrim Gündüz
devrim@gunduz.org

Hi,

Someone contacted me about increasing CUBE_MAX_DIM
in contrib/cube/cubedata.h (in the community RPMs). The current value
is 100 with the following comment:

* This limit is pretty arbitrary, but don't make it so large that you
* risk overflow in sizing calculations.
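
For reference, the definition in contrib/cube/cubedata.h is essentially just a compile-time constant along these lines (quoted from memory, so the exact formatting may differ):

/*
 * This limit is pretty arbitrary, but don't make it so large that you
 * risk overflow in sizing calculations.
 */
#define CUBE_MAX_DIM (100)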

They said they use 500, and never had a problem. I never added such patches to the RPMs, and will not -- but wanted to ask whether we can safely increase it upstream.

Regards,

--
Devrim Gündüz
Open Source Solution Architect, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR

#2 Darafei Praliaskouski
In reply to: Devrim Gündüz (#1)
Re: CUBE_MAX_DIM

Hello,

The problem with higher-dimensional cubes is that starting at a dimensionality
of about 52, the "distance" metrics computed in 64-bit floats have less than a
single mantissa bit per dimension, making cubes indistinguishable. Developers
of facial-recognition software discussed this on the Russian Postgres Telegram
group (https://t.me/pgsql). They had 128-dimensional points and recompiled
Postgres with a higher limit, but the distances weren't helpful and GiST KNN
degraded to almost full scans. They had to reduce the number of facial
features to make KNN search work.
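
As a standalone illustration (not from the original message; all names here are mine): a double has 53 mantissa bits, so past ~52 dimensions there is less than one bit per dimension, and on top of that the relative gap between the nearest and the farthest neighbour of a random point collapses as dimensionality grows, which is consistent with the KNN degradation described above:

/*
 * Standalone sketch: relative contrast between the nearest and farthest
 * neighbour of a random point, for a few dimensionalities.
 * Compile with: cc contrast.c -lm
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NPOINTS 1000

static double dist(const double *a, const double *b, int dim)
{
    double sum = 0.0;

    for (int i = 0; i < dim; i++)
    {
        double d = a[i] - b[i];

        sum += d * d;
    }
    return sqrt(sum);
}

int main(void)
{
    int dims[] = {2, 10, 100, 500};

    srand(1);
    for (size_t k = 0; k < sizeof(dims) / sizeof(dims[0]); k++)
    {
        int     dim = dims[k];
        double *pts = malloc(sizeof(double) * NPOINTS * dim);

        for (int i = 0; i < NPOINTS * dim; i++)
            pts[i] = (double) rand() / RAND_MAX;

        /* distances from the first point to every other point */
        double dmin = INFINITY, dmax = 0.0;

        for (int i = 1; i < NPOINTS; i++)
        {
            double d = dist(pts, pts + (size_t) i * dim, dim);

            if (d < dmin) dmin = d;
            if (d > dmax) dmax = d;
        }
        printf("dim=%3d nearest=%8.3f farthest=%8.3f relative gap=%.3f\n",
               dim, dmin, dmax, (dmax - dmin) / dmin);
        free(pts);
    }
    return 0;
}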

Floating-point overflow isn't much of a risk per se; in the worst case the
result becomes Infinity or 0, which is usually acceptable in those contexts.
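
A trivial standalone check of that worst case (not from the original message):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double huge = 1e300;

    /* squaring overflows to +inf, and sqrt(+inf) stays +inf, so such a
     * point simply sorts last in a distance ordering instead of failing */
    printf("%g %g\n", huge * huge, sqrt(huge * huge));
    return 0;
}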

While higher-dimensional cubes are mathematically possible, there are
practical implementation issues. I'm OK with raising the limit if such
nuances get a mention in the docs.

On Thu, Jun 25, 2020 at 1:01 PM Devrim Gündüz <devrim@gunduz.org> wrote:

Hi,

Someone contacted me about increasing CUBE_MAX_DIM
in contrib/cube/cubedata.h (in the community RPMs). The current value
is 100 with the following comment:

* This limit is pretty arbitrary, but don't make it so large that you
* risk overflow in sizing calculations.

They said they use 500, and never had a problem. I never added such
patches to the RPMs, and will not -- but wanted to ask whether we can
safely increase it upstream.

Regards,

--
Devrim Gündüz
Open Source Solution Architect, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR

--
Darafei Praliaskouski
Support me: http://patreon.com/komzpa

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Devrim Gündüz (#1)
Re: CUBE_MAX_DIM

Devrim Gündüz <devrim@gunduz.org> writes:

Someone contacted me about increasing CUBE_MAX_DIM
in contrib/cube/cubedata.h (in the community RPMs). The current value
is 100 with the following comment:

* This limit is pretty arbitrary, but don't make it so large that you
* risk overflow in sizing calculations.

They said they use 500, and never had a problem.

I guess I'm wondering what's the use-case. 100 already seems an order of
magnitude more than anyone could want. Or, if it's not enough, why does
raising the limit just 5x enable any large set of new applications?

The practical issue here is that, since the data requires 16 bytes per
dimension (plus a little bit of overhead), we'd be talking about
increasing the maximum size of a cube field from ~ 1600 bytes to ~ 8000
bytes. And cube is not toastable, so that couldn't be compressed or
shoved out-of-line. Maybe your OP never had a problem with it, but
plenty of use-cases would have "tuple too large" failures due to not
having room on a heap page for whatever other data they want in the row.
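
To make that arithmetic concrete, here is a back-of-the-envelope sketch (not from the original message), assuming the usual cube layout of a small varlena header plus a dimension/flags word plus two float8 corner coordinates per dimension; the exact overhead may differ by a few bytes:

#include <stdio.h>

int main(void)
{
    int dims[] = {100, 500};

    for (int i = 0; i < 2; i++)
    {
        int  dim = dims[i];
        long bytes = 4 + 4 + 2L * dim * 8;   /* header + flags + 2 corners */

        printf("dim=%d -> ~%ld bytes per cube value (heap page: 8192 bytes)\n",
               dim, bytes);
    }
    return 0;
}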

Even a non-toastable 2KB field is going to give the tuple toaster
algorithm problems, as it'll end up shoving every other toastable field
out-of-line in an ultimately vain attempt to bring the tuple size below
2KB. So I'm really quite hesitant to raise CUBE_MAX_DIM much past where
it is now without any other changes.

A more credible proposal would be to make cube toast-aware and then
raise the limit to ~1GB ... but that would take a significant amount
of work, and we still haven't got a use-case justifying it.

I think I'd counsel storing such data as plain float8 arrays, which
do have the necessary storage infrastructure. Is there something
about the cube operators that's particularly missing?

regards, tom lane

#4 Alastair McKinley
a.mckinley@analyticsengines.com
In reply to: Tom Lane (#3)
Re: CUBE_MAX_DIM

Devrim Gündüz <devrim@gunduz.org> writes:

Someone contacted me about increasing CUBE_MAX_DIM
in contrib/cube/cubedata.h (in the community RPMs). The current value
is 100 with the following comment:

* This limit is pretty arbitrary, but don't make it so large that you
* risk overflow in sizing calculations.

They said they use 500, and never had a problem.

I guess I'm wondering what's the use-case. 100 already seems an order of
magnitude more than anyone could want. Or, if it's not enough, why does
raising the limit just 5x enable any large set of new applications?

The dimensionality of embeddings generated by deep neural networks can be high;
Google's BERT produces 768-dimensional embeddings, for example.

I know that cube in its current form isn't suitable for nearest-neighbour searching these vectors in their raw form (I have tried recompiling with a higher CUBE_MAX_DIM myself), but conceptually kNN GiST searches using cubes can be useful for these applications. There are other pre-processing techniques that can be used to improve the speed of the search, but it still ends up as a kNN search in a fairly high-dimensional space.

The practical issue here is that, since the data requires 16 bytes per
dimension (plus a little bit of overhead), we'd be talking about
increasing the maximum size of a cube field from ~ 1600 bytes to ~ 8000
bytes. And cube is not toastable, so that couldn't be compressed or
shoved out-of-line. Maybe your OP never had a problem with it, but
plenty of use-cases would have "tuple too large" failures due to not
having room on a heap page for whatever other data they want in the row.

Even a non-toastable 2KB field is going to give the tuple toaster
algorithm problems, as it'll end up shoving every other toastable field
out-of-line in an ultimately vain attempt to bring the tuple size below
2KB. So I'm really quite hesitant to raise CUBE_MAX_DIM much past where
it is now without any other changes.

A more credible proposal would be to make cube toast-aware and then
raise the limit to ~1GB ... but that would take a significant amount
of work, and we still haven't got a use-case justifying it.

I think I'd counsel storing such data as plain float8 arrays, which
do have the necessary storage infrastructure. Is there something
about the cube operators that's particularly missing?

Indexable nearest-neighbour search is one of the great cube features that isn't available with float8 arrays.


Best regards,
Alastair

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alastair McKinley (#4)
Re: CUBE_MAX_DIM

Alastair McKinley <a.mckinley@analyticsengines.com> writes:

I know that cube in its current form isn't suitable for nearest-neighbour searching these vectors in their raw form (I have tried recompiling with a higher CUBE_MAX_DIM myself), but conceptually kNN GiST searches using cubes can be useful for these applications. There are other pre-processing techniques that can be used to improve the speed of the search, but it still ends up as a kNN search in a fairly high-dimensional space.

Is there a way to fix the numerical instability involved? If we could do
that, then we'd definitely have a use-case justifying the work to make
cube toastable.

regards, tom lane

#6 Alastair McKinley
a.mckinley@analyticsengines.com
In reply to: Tom Lane (#5)
Re: CUBE_MAX_DIM

From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: 25 June 2020 17:43

Alastair McKinley <a.mckinley@analyticsengines.com> writes:

I know that cube in its current form isn't suitable for nearest-neighbour searching these vectors in their raw form (I have tried recompiling with a higher CUBE_MAX_DIM myself), but conceptually kNN GiST searches using cubes can be useful for these applications. There are other pre-processing techniques that can be used to improve the speed of the search, but it still ends up as a kNN search in a fairly high-dimensional space.

Is there a way to fix the numerical instability involved? If we could do
that, then we'd definitely have a use-case justifying the work to make
cube toastable.

I am not that familiar with the nature of the numerical instability, but it might be worth noting, for additional context, that for the NN use case:

- The value of each dimension is likely to be between 0 and 1.
- The L1 (taxicab) distance is meaningful for high numbers of dimensions, and it *possibly* suffers less from the numeric issues than Euclidean distance (see the sketch below).

Numerical stability isn't the only issue for high-dimensional kNN, though: GiST search performance currently degrades towards sequential-scan performance with increasing N, although maybe the two problems are related?
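
One concrete sense in which that might be true, as a standalone sketch (not from the original message; all names here are mine): squaring pushes a tiny per-coordinate difference far below the rounding granularity of a large running sum, whereas the same difference still registers when absolute values are summed:

/*
 * Standalone sketch: a tiny per-coordinate difference survives taxicab (L1)
 * accumulation but is rounded away in Euclidean (L2) accumulation once the
 * running sum is large. Compile with: cc l1_vs_l2.c -lm
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define DIM 500

int main(void)
{
    double a[DIM], b[DIM], c[DIM];

    srand(7);
    for (int i = 0; i < DIM; i++)
    {
        a[i] = (double) rand() / RAND_MAX;
        b[i] = (double) rand() / RAND_MAX;
        c[i] = b[i];
    }
    b[0] = a[0];            /* b matches a exactly in coordinate 0 */
    c[0] = a[0] + 1e-8;     /* c is nudged slightly farther away   */

    double shared2 = 0.0, shared1 = 0.0;

    for (int i = 1; i < DIM; i++)
    {
        double d = a[i] - b[i];     /* identical for b and c when i >= 1 */

        shared2 += d * d;
        shared1 += fabs(d);
    }

    double d0b = a[0] - b[0];       /* exactly 0     */
    double d0c = a[0] - c[0];       /* about -1e-8   */

    double l2b = sqrt(shared2 + d0b * d0b);
    double l2c = sqrt(shared2 + d0c * d0c);   /* 1e-16 is lost against ~80 */
    double l1b = shared1 + fabs(d0b);
    double l1c = shared1 + fabs(d0c);         /* 1e-8 is still visible     */

    printf("euclidean equal: %d   taxicab equal: %d\n",
           l2b == l2c, l1b == l1c);
    return 0;
}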


Best regards,
Alastair