Learned Index

Started by Deepak Balasubramanyamabout 8 years ago3 messages

deepak.balu@gmail.com

about 8 years ago

I came across this paper making a case for indices that use machine
learning to optimise search.

https://arxiv.org/pdf/1712.01208.pdf

The gist seems to be to use a linear regression model or feed a tensor flow
model when a more complicated distribution is needed for the data and allow
SIMD instructions working on top of GPUs / TPUs to speed up lookups. The
speedup observed is anywhere from 40-60%.

That result looks impressive but I don't have enough context on say
rebuilding a neural net on every DML operation. The equivalent operation
that I can relate to on PG would be to rebalance the B-tree for DML
operations.

In your opinion, would a ML model work for a table whose operations are
both write and read heavy? I'd love to hear your thoughts on the paper.

Thanks for reading
- Deepak

Laurenz Albe

laurenz.albe@cybertec.at

about 8 years ago

In reply to: Deepak Balasubramanyam (#1)

Re: Learned Index

Deepak Balasubramanyam wrote:

I came across this paper making a case for indices that use machine learning to optimise search.

https://arxiv.org/pdf/1712.01208.pdf

The gist seems to be to use a linear regression model or feed a tensor flow model when a more complicated distribution is needed for the data and allow SIMD instructions working on top of GPUs / TPUs to speed up lookups. The speedup observed is anywhere from 40-60%.

That result looks impressive but I don't have enough context on say rebuilding a neural net on every DML operation. The equivalent operation that I can relate to on PG would be to rebalance the B-tree for DML operations.

In your opinion, would a ML model work for a table whose operations are both write and read heavy? I'd love to hear your thoughts on the paper.

I have read into the paper.

This may be interesting or not, but the paper is very vague about
its concepts and algorithms, so it's hard to tell.

I'd say that the paper does not meet publication standards.

For example, they say that their results were generated by comparing
a B-tree implementation with "learned indexes using a 2-stage
RMI model and different second-stage sizes (i.e., 10k, 50k, 100k, and 200k)",
but they don't say exactly what the neural network in these stages is
(at least it is not obvious to me). Their "Learning Index Framework" (LIF)
is described with a few vague sentences and a reference to the literature
saying that is where they got some ideas from.

There is also no clear concept of how these indexes should handle
data modifications, so I think that there are some loose ends to be
tied up before it is ready for implementation.

Finally, I don't see any clear statement as to the error guarantees
that the neural network prediction can give, and if it is possible that
it may degrade to scanning relevant parts of the table in some cases.

Yours,
Laurenz Albe

Oleg Ivanov

o.ivanov@postgrespro.ru

about 8 years ago

In reply to: Laurenz Albe (#2)

Re: Learned Index

On 12/12/2017 12:16 PM, Laurenz Albe wrote:

I have read into the paper.

This may be interesting or not, but the paper is very vague about
its concepts and algorithms, so it's hard to tell.

I'd say that the paper does not meet publication standards.

For example, they say that their results were generated by comparing
a B-tree implementation with "learned indexes using a 2-stage
RMI model and different second-stage sizes (i.e., 10k, 50k, 100k, and
200k)",
but they don't say exactly what the neural network in these stages is
(at least it is not obvious to me). Their "Learning Index Framework"
(LIF)
is described with a few vague sentences and a reference to the literature
saying that is where they got some ideas from.

That is not the answer, but gives us the idea of which kind of neural
networks was used: "For this paper, we only focused on 2 types of
models, simple neural nets with zero to two fully-connected hidden
layers and ReLU activation functions and a layer width of up to 32
neurons and B-Trees (a.k.a. decision trees)".

There is also no clear concept of how these indexes should handle
data modifications, so I think that there are some loose ends to be
tied up before it is ready for implementation.

Finally, I don't see any clear statement as to the error guarantees
that the neural network prediction can give, and if it is possible that
it may degrade to scanning relevant parts of the table in some cases.

No guarantees are provided (I don't think it is even possible), besides
the guarantee that if the error of the neural network prediction is more
than the error of B-tree prediction, B-tree will be used: "Note, that
hybrid indexes allow us to bound the worst case performance of learned
indexes to the performance of B-Trees".

Oleg Ivanov
Postgres Professional
The Russian PostgreSQL Company