ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

Started by Nikolay Samokhvalov about 8 years ago · 18 messages
#1Nikolay Samokhvalov
samokhvalov@gmail.com

Very interesting read: https://arxiv.org/abs/1712.01208

HN discussion: https://news.ycombinator.com/item?id=15894896

Some of the comments (from Twitter
https://twitter.com/schrockn/status/940037656494317568): "Jeff Dean and co
at GOOG just released a paper showing how machine-learned indexes can
replace B-Trees, Hash Indexes, and Bloom Filters. Execute 3x faster than
B-Trees, 10-100x less space. Executes on GPU, which are getting faster
unlike CPU. Amazing."

Can those ideas be applied to Postgres in its current state? Or are they not
really down-to-earth?

#2Oleg Bartunov
obartunov@gmail.com
In reply to: Nikolay Samokhvalov (#1)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

On Mon, Dec 11, 2017 at 11:11 PM, Nikolay Samokhvalov
<samokhvalov@gmail.com> wrote:

Very interesting read: https://arxiv.org/abs/1712.01208

HN discussion: https://news.ycombinator.com/item?id=15894896

Some of the comments (from Twitter
https://twitter.com/schrockn/status/940037656494317568): "Jeff Dean and co
at GOOG just released a paper showing how machine-learned indexes can
replace B-Trees, Hash Indexes, and Bloom Filters. Execute 3x faster than
B-Trees, 10-100x less space. Executes on GPU, which are getting faster
unlike CPU. Amazing."

Can those ideas be applied to Postgres in its current state? Or it's not
really down-to-earth?

Oleg made some analysis of the paper.

#3Oleg Ivanov
o.ivanov@postgrespro.ru
In reply to: Oleg Bartunov (#2)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

On 12/12/2017 04:33 PM, Oleg Bartunov wrote:

On Mon, Dec 11, 2017 at 11:11 PM, Nikolay Samokhvalov
<samokhvalov@gmail.com> wrote:

Very interesting read: https://arxiv.org/abs/1712.01208

HN discussion: https://news.ycombinator.com/item?id=15894896

Some of the comments (from Twitter
https://twitter.com/schrockn/status/940037656494317568): "Jeff Dean and co
at GOOG just released a paper showing how machine-learned indexes can
replace B-Trees, Hash Indexes, and Bloom Filters. Execute 3x faster than
B-Trees, 10-100x less space. Executes on GPU, which are getting faster
unlike CPU. Amazing."

Can those ideas be applied to Postgres in its current state? Or it's not
really down-to-earth?

Oleg made some analysis of the paper.

If the keys are comparable, the data distribution is simple enough
(which seems to happen rather often), and the data does not change, we
can learn the distribution and thus perform the search faster while
also saving memory. That is what the authors wrote, and in my opinion
it must work.

The limitations of the approach, which the authors mention in their work:
+ The data does not change. The proposed method can be extended more or
less straightforwardly to nearly append-only workloads and to workloads
in which the data distribution does not change or changes very slowly
(the data itself or its amount may still change, nevertheless).
Otherwise it is challenging, because the key positions can no longer be
treated as independent values, but it still looks possible. The other
option proposed by the authors is to store new objects in a buffer and
retrain the model in the background, but that does not look like a nice
solution.
+ GPUs are fast, but the latency of dispatching operations to the GPU
almost certainly cancels out all speed benefits for such small models.
The only case in which using a GPU is reasonable is training the models
(i.e., building such indexes).

The following was left unclear to me from the paper:
+ How effectively the approach can be implemented, taking into account
the limitations above.
+ For hashing, the authors propose to replace the hash function with
the learned CDF, which can reduce the number of collisions and
therefore the memory consumption. That is reasonable, but this hash
function implicitly uses information about the ordering of the keys. I
suppose that, using the ordering of the data and with access to the
data histogram, one could achieve a similar reduction in memory
consumption.
+ The proposed hierarchical learning looks natural for the problem, but
it is not clear that it is the best option. For example, one might use
an end-to-end learning procedure with weight-sparsity regularizers
instead. I wonder whether the latter approach can obtain results
competitive with the former, and if not, why not. Anyway, I have a
feeling that the proposed method has room for improvement.

In general, the approach still needs some additional research (mostly
about changing data) and has some points that need thought about proper
implementation (paging, for example), but it looks promising enough
nevertheless.
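To make the core mechanism concrete, here is a minimal toy sketch (my own illustration, not code from the paper): fit a linear model to the CDF of a static, sorted key set, record the worst observed prediction error, and correct each prediction with a binary search bounded by that error.

```python
import bisect


class LearnedIndex:
    """Toy learned index over a static, sorted key set: a linear model
    approximates the CDF (key -> position), and a binary search bounded
    by the model's worst observed error corrects each prediction."""

    def __init__(self, keys):
        self.keys = keys
        n = len(keys)
        mean_k = sum(keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in keys)
        # Least-squares fit of position ~ a*key + b (the "learned CDF").
        self.a = (sum((k - mean_k) * (i - mean_p)
                      for i, k in enumerate(keys)) / var) if var else 0.0
        self.b = mean_p - self.a * mean_k
        # Worst-case prediction error bounds the local search window.
        self.err = max(abs(self._predict(k) - i)
                       for i, k in enumerate(keys))

    def _predict(self, key):
        return int(self.a * key + self.b)

    def lookup(self, key):
        # Predict a position, then search only within +/- err of it.
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < hi and self.keys[i] == key:
            return i
        return -1
```

Note that `err` is exactly where the "data does not change" limitation bites: inserting keys invalidates the recorded error bound, so the model has to be retrained or buffered around, as discussed above.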

#4Stefan Keller
sfkeller@gmail.com
In reply to: Oleg Ivanov (#3)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

Dear Olegs, dear Nikolay, dear all

Allow me to revive this thread:

Are there any advances in a learned index for PostgreSQL?

Background: I'm trying to benchmark those experimental indices. For
this I did some bibliography work (see below). Fun fact: Not only
Postgres people love high-proof drinks, sorry, high-proof index names:
see "From wisckey to bourbon" :-).

My main discovery is a discussion report - which included Stonebraker
- entitled "ML-In-Databases: Assessment and Prognosis" [1]. The first
two takeaways are "#1: The initial comparisons of learned indices with
optimized traditional indices should be further expanded to include
concurrency control and multi-user settings. (...) #2: A key benefit
of learned indices may come from the learned structures requiring
lower space utilization, rather than a reduction in search time."

Yours, Stefan

[1]: Kraska, T., Minhas, U. F., Neumann, T., Papaemmanouil, O., Patel,
J. M., Ré, C., & Stonebraker, M. (2021). ML-In-Databases: Assessment
and Prognosis. Data Engineering, 3. Web access:
https://www.cs.brandeis.edu/~olga/publications/ieee2021.pdf

Bibliography (without any claim for completeness):

Kraska, T., Beutel, A., Chi, E. H., Dean, J., & Polyzotis, N. (2018,
May). The case for learned index structures. In Proceedings of the
2018 International Conference on Management of Data (pp. 489-504). Web
access https://dl.acm.org/doi/pdf/10.1145/3183713.3196909

"Recursive model index, a learned index structure" (based on Kraska et
al. 2017? 2018). Source code: https://github.com/BenJoyenConseil/rmi
(Go language), https://github.com/learnedsystems/RMI (Rust)

Kaur, T. (2018). Learned Index Structures: Practical Implementations
and Future Directions. Master Thesis. Web access:
https://wwwiti.cs.uni-magdeburg.de/iti_db/publikationen/ps/auto/kaur2018learned.pdf
(claims to be open-source C++, but no source was found)

Kipf, A., Marcus, R., van Renen, A., Stoian, M., Kemper, A., Kraska,
T., & Neumann, T. (2020, June). RadixSpline: a single-pass learned
index. In Proceedings of the Third International Workshop on
Exploiting Artificial Intelligence Techniques for Data Management (pp.
1-5). Web access: https://dl.acm.org/doi/10.1145/3401071.3401659.
Source code: https://github.com/learnedsystems/RadixSpline (C++).

Dai, Y., Xu, Y., Ganesan, A., Alagappan, R., Kroth, B.,
Arpaci-Dusseau, A., & Arpaci-Dusseau, R. (2020). From wisckey to
bourbon: A learned index for log-structured merge trees. In 14th
{USENIX} Symposium on Operating Systems Design and Implementation
({OSDI} 20) (pp. 155-171). Web access:
https://www.usenix.org/system/files/osdi20-dai_0.pdf

Ding, J., Minhas, U. F., Yu, J., Wang, C., Do, J., Li, Y., ... &
Kraska, T. (2020, June). ALEX: an updatable adaptive learned index. In
Proceedings of the 2020 ACM SIGMOD International Conference on
Management of Data (pp. 969-984). Web access:
https://doi.org/10.1145/3318464.3389711 /
https://dblp.org/rec/conf/sigmod/DingMYWDLZCGKLK20 . Source code:
https://github.com/microsoft/ALEX (C++, MIT license)


#5Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Stefan Keller (#4)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

On Apr 20, 2021, at 22:56, Stefan Keller <sfkeller@gmail.com> wrote:

Are there any advances in a learned index for PostgreSQL?

BTW take a look at PGM [0]. I'm slowly working on implementing it.
I think it is fairly straightforward to implement it as an extension.
I've started by forking the B-tree [1]. I've removed support for anything that is not int4.
Then I plan to remove the opclass/comparator abstraction layer,
and then add interpolation search over the page before the binary search.
And, maybe, add delta compression. Or, perhaps, fix WAL, which is completely broken. Or add some vectorisation.
The concurrency is from a regular B-tree, of course.

This is my pet project. So, unsurprisingly, I've done only parts of big plan :) I would appreciate any help.
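For reference, the "interpolation search over the page before binsearch" step can be sketched like this (a plain-Python illustration under the assumption of roughly uniformly distributed integer keys; the function name is mine, not from the fork):

```python
def interpolation_search(keys, target):
    """Search a sorted list, guessing the position of target by linear
    interpolation between the current endpoints (the guess is exact when
    keys are uniformly distributed, as a learned index assumes)."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            break  # all remaining keys are equal; no slope to interpolate
        # Estimate the position, assuming keys grow linearly on [lo, hi].
        pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return lo if 0 <= lo < len(keys) and keys[lo] == target else -1
```

On uniformly distributed keys this needs O(log log n) probes on average, which is why it can pay off as a pre-step before the plain binary search on a page.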

Thanks!

Best regards, Andrey Borodin.

[0]: https://pgm.di.unipi.it/
[1]: https://github.com/x4m/pg2m

#6Peter Geoghegan
pg@bowt.ie
In reply to: Andrey Borodin (#5)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

On Tue, Apr 20, 2021 at 11:18 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

BTW take a look into PGM [0]. I'm slowly working on implementing it.
I think it is kind of straightforward to implement it as extension.
I've started from forking B-tree[1]. I've removed support of anything that is not int4.
Then I plan to remove opclass\comparator abstraction layer.
And then add interpolation search over the page before binsearch.

FWIW I'm a skeptic of learned indexes. There are lots of reasons why I
don't think that they're practical, but it boils down to this: in
general data structures that work well don't need anybody to make a
case for them. They simply work well for the task they were designed
for.

Anything is possible, but I strongly doubt that learned indexes are
the exception to that general rule. Why hasn't anybody been able to
apply the techniques in any real-world database system to date? I'm
sure that it's possible to make things much faster if you're allowed
to make certain wide-reaching assumptions. They talk about big
improvements in space utilization compared to B-Trees as if there were
some very fixed idea of how that works in B-Trees. Why haven't they
compared their model to grotty heuristics that practical experience
has shown to work, such as the rightmost leaf page split heuristic?
No paper about learned indexes that I've ever read even acknowledges
the existence of these heuristics.

--
Peter Geoghegan

#7Chapman Flack
chap@anastigmatix.net
In reply to: Peter Geoghegan (#6)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

On 04/20/21 15:24, Peter Geoghegan wrote:

data structures that work well don't need anybody to make a case for them.
They simply work well for the task they were designed for.

How would showing that to be true for data structure X be different from
making a case for data structure X?

Regards,
-Chap

#8Peter Geoghegan
pg@bowt.ie
In reply to: Chapman Flack (#7)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

On Tue, Apr 20, 2021 at 12:35 PM Chapman Flack <chap@anastigmatix.net> wrote:

How would showing that to be true for data structure X be different from
making a case for data structure X?

You don't have to understand the theoretical basis of B-Tree indexes
to see that they work well. In fact, it took at least a decade for
somebody to formalize how the space utilization works with B-Trees
containing random data. Of course theory matters, but the fact is that
B-Trees had been widely used for commercial and scientific
applications that whole time.

Maybe I'll be wrong about learned indexes - who knows? But the burden
of proof is not mine. I prefer to spend my time on things that I am
reasonably confident will work out well ahead of time.

--
Peter Geoghegan

#9Jonah H. Harris
jonah.harris@gmail.com
In reply to: Peter Geoghegan (#8)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

On Tue, Apr 20, 2021 at 3:45 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Apr 20, 2021 at 12:35 PM Chapman Flack <chap@anastigmatix.net>
wrote:

How would showing that to be true for data structure X be different from
making a case for data structure X?

You don't have to understand the theoretical basis of B-Tree indexes
to see that they work well. In fact, it took at least a decade for
somebody to formalize how the space utilization works with B-Trees
containing random data. Of course theory matters, but the fact is that
B-Trees had been widely used for commercial and scientific
applications that whole time.

Maybe I'll be wrong about learned indexes - who knows? But the burden
of proof is not mine. I prefer to spend my time on things that I am
reasonably confident will work out well ahead of time.

Agreed on all of your takes, Peter. In time, they will probably be more
realistic. But, at present, I tend to see the research papers make
comparisons between learned vs. traditional pitching the benefits of the
former without any of the well-known optimizations of the latter - as if
time stood still since the original B-Tree. Similarly, where most academic
research starts to fall apart in practicality is the lack of addressing
realistic write volumes and related concurrency issues. I'm happy to be
disproven on this, though.

--
Jonah H. Harris

#10Peter Geoghegan
pg@bowt.ie
In reply to: Jonah H. Harris (#9)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

On Tue, Apr 20, 2021 at 12:51 PM Jonah H. Harris <jonah.harris@gmail.com> wrote:

Maybe I'll be wrong about learned indexes - who knows? But the burden
of proof is not mine. I prefer to spend my time on things that I am
reasonably confident will work out well ahead of time.

Agreed on all of your takes, Peter. In time, they will probably be more realistic.

A big problem when critically evaluating any complicated top-down
model in the abstract is that it's too easy for the designer to hide
*risk* (perhaps inadvertently). If you are allowed to make what
amounts to an assumption that you have perfect foreknowledge of the
dataset, then sure, you can do a lot with that certainty. You can
easily find a way to make things faster or more space efficient by
some ridiculous multiple that way (like 10x, 100x, whatever).

None of these papers ever get around to explaining why what they've
come up with is not simply fool's gold. The assumption that you can
have robust foreknowledge of the dataset seems incredibly fragile,
even if your model is *almost* miraculously good. I have no idea how
fair that is. But my job is to make Postgres better, not to judge
papers. My mindset is very matter of fact and practical.

--
Peter Geoghegan

#11Stefan Keller
sfkeller@gmail.com
In reply to: Peter Geoghegan (#10)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

Just for the record: a learned index has no more foreknowledge about
the dataset than other indices.

I'd give learned indexes at least a chance to provide a
proof-of-concept. And I want to learn more about the requirements for
being accepted as a new index (before undergoing months of code
sprints).

As you may have seen, the "Stonebraker paper" I cited [1] is also
skeptical, requiring full parity on features (like "concurrency control,
recovery, non main memory, and multi-user settings")! Non-main-memory
code I understand.
=> But index read/write operations and multi-user settings are part of
a separate software component (the transaction manager), aren't they?

~Stefan

[1]: https://www.cs.brandeis.edu/~olga/publications/ieee2021.pdf


#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stefan Keller (#11)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

Stefan Keller <sfkeller@gmail.com> writes:

I'd give learned indexes at least a chance to provide a
proof-of-concept. And I want to learn more about the requirements for
being accepted as a new index (before undergoing months of code
sprints).

There's enough support these days that you can build a new index
type as an extension, without touching the core code at all.
I'd advise proceeding that way, at least until you have a pretty
convincing prototype.

regards, tom lane

#13Peter Geoghegan
pg@bowt.ie
In reply to: Stefan Keller (#11)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

On Tue, Apr 20, 2021 at 2:29 PM Stefan Keller <sfkeller@gmail.com> wrote:

Just for the record: a learned index has no more foreknowledge about
the dataset than other indices.

Maybe. ML models are famously prone to over-interpreting training
data. In any case I am simply not competent to assess how true this
is.

I'd give learned indexes at least a chance to provide a
proof-of-concept. And I want to learn more about the requirements for
being accepted as a new index (before undergoing months of code
sprints).

I have everything to gain and nothing to lose by giving them a chance
-- I'm not required to do anything to give them a chance, after all. I
just want to be clear that I'm a skeptic now rather than later. I'm
not the one making a big investment of my time here.

As you may have seen, the "Stonebraker paper" I cited [1] is also
skeptical, requiring full parity on features (like "concurrency control,
recovery, non main memory, and multi-user settings")! Non-main-memory
code I understand.
=> But index read/write operations and multi-user settings are part of
a separate software component (the transaction manager), aren't they?

It's easy for me to be a skeptic -- again, what do I have to lose by
freely expressing my opinion? Mostly I'm just saying that I wouldn't
work on this because ISTM that there is significant uncertainty about
the outcome, but much less uncertainty about the outcome of
alternative projects of comparable difficulty. That's fundamentally
how I assess what to work on. There is plenty of uncertainty on my end
-- but that's beside the point.

--
Peter Geoghegan

#14Stefan Keller
sfkeller@gmail.com
In reply to: Peter Geoghegan (#13)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

On Tue, Apr 20, 2021 at 23:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:

There's enough support these days that you can build a new index
type as an extension, without touching the core code at all.

Thanks. I'm ramping up my knowledge about extending PG with C++.

I'm still interested in understanding, in principle, what an index has
to do with concurrency control, in order to divide the
concerns/responsibilities of the code.

On Tue, Apr 20, 2021 at 23:51, Peter Geoghegan <pg@bowt.ie> wrote:

It's easy for me to be a skeptic

Isn't being a skeptic a requirement for all of us db engineers :-)

but much less uncertainty about the outcome of alternative projects of comparable difficulty

Oh. As mentioned above, I'm trying to get an overview of indices. So,
if you have hints about other new indexes (like PGM, VODKA for
text/ts, or Hippo), I'm interested.

~Stefan


#15Oleg Bartunov
obartunov@postgrespro.ru
In reply to: Stefan Keller (#4)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

On Tue, Apr 20, 2021 at 8:56 PM Stefan Keller <sfkeller@gmail.com> wrote:

Dear Olegs, dear Nikolay, dear all

Allow me to revive this thread:

Are there any advances in a learned index for PostgreSQL?

Background: I'm trying to benchmark those experimental indices. For
this I did some bibliography work (see below). Fun fact: Not only
Postgres people love high-proof drinks, sorry, high-proof index names:
see "From wisckey to bourbon" :-).

Have you seen the recent paper "Benchmarking Learned Indexes"?
https://vldb.org/pvldb/vol14/p1-marcus.pdf
I haven't looked into the code: https://github.com/learnedsystems/sosd

My main discovery is a discussion report - which included Stonebraker
- entitled "ML-In-Databases: Assessment and Prognosis" [1]. The first
two takeaways are "#1: The initial comparisons of learned indices with
optimized traditional indices should be further expanded to include
concurrency control and multi-user settings. (...) #2: A key benefit
of learned indices may come from the learned structures requiring
lower space utilization, rather than a reduction in search time."

Yours, Stefan

[1] Kraska, T., Minhas, U. F., Neumann, T., Papaemmanouil, O., Patel,
J. M., Ré, C., & Stonebraker, M. (2021). ML-In-Databases: Assessment
and Prognosis. Data Engineering, 3. Web access:
https://www.cs.brandeis.edu/~olga/publications/ieee2021.pdf

Bibliography (without any claim for completeness):

Kraska, T., Beutel, A., Chi, E. H., Dean, J., & Polyzotis, N. (2018,
May). The case for learned index structures. In Proceedings of the
2018 International Conference on Management of Data (pp. 489-504). Web
access https://dl.acm.org/doi/pdf/10.1145/3183713.3196909

"Recursive model index, a learned index structure" (based on Kraska et
al. 2017? 2018). Source code: https://github.com/BenJoyenConseil/rmi
(Go language), https://github.com/learnedsystems/RMI (Rust)

Kaur, T. (2018). Learned Index Structures: Practical Implementations
and Future Directions. Master Thesis. Web access:

https://wwwiti.cs.uni-magdeburg.de/iti_db/publikationen/ps/auto/kaur2018learned.pdf
(pretends to be open source C++ but none found)

Kipf, A., Marcus, R., van Renen, A., Stoian, M., Kemper, A., Kraska,
T., & Neumann, T. (2020, June). RadixSpline: a single-pass learned
index. In Proceedings of the Third International Workshop on
Exploiting Artificial Intelligence Techniques for Data Management (pp.
1-5). Web access: https://dl.acm.org/doi/10.1145/3401071.3401659),
Source code: https://github.com/learnedsystems/RadixSpline (C++).

Dai, Y., Xu, Y., Ganesan, A., Alagappan, R., Kroth, B.,
Arpaci-Dusseau, A., & Arpaci-Dusseau, R. (2020). From wisckey to
bourbon: A learned index for log-structured merge trees. In 14th
{USENIX} Symposium on Operating Systems Design and Implementation
({OSDI} 20) (pp. 155-171). Web access:
https://www.usenix.org/system/files/osdi20-dai_0.pdf

Ding, J., Minhas, U. F., Yu, J., Wang, C., Do, J., Li, Y., ... &
Kraska, T. (2020, June). ALEX: an updatable adaptive learned index. In
Proceedings of the 2020 ACM SIGMOD International Conference on
Management of Data (pp. 969-984). Web access:
https://doi.org/10.1145/3318464.3389711 /
https://dblp.org/rec/conf/sigmod/DingMYWDLZCGKLK20 . Source code:
https://github.com/microsoft/ALEX (C++, MIT license)

Am Di., 12. Dez. 2017 um 18:02 Uhr schrieb Oleg Ivanov
<o.ivanov@postgrespro.ru>:

On 12/12/2017 04:33 PM, Oleg Bartunov wrote:

On Mon, Dec 11, 2017 at 11:11 PM, Nikolay Samokhvalov
<samokhvalov@gmail.com> wrote:

Very interesting read: https://arxiv.org/abs/1712.01208

HN discussion: https://news.ycombinator.com/item?id=15894896

Some of the comments (from Twitter
https://twitter.com/schrockn/status/940037656494317568): "Jeff Dean

and co

at GOOG just released a paper showing how machine-learned indexes can
replace B-Trees, Hash Indexes, and Bloom Filters. Execute 3x faster

than

B-Trees, 10-100x less space. Executes on GPU, which are getting faster
unlike CPU. Amazing."

Can those ideas be applied to Postgres in its current state? Or it's

not

really down-to-earth?

Oleg made some analysis of the paper.

If the keys are comparable and the data distribution is simple enough
(which seems to happen rather often) and the data does not change, we
can learn the distribution, and so perform the search faster, and save
the memory also. That is what authors wrote and that is what must work
in my opinion.

The limitations of the approach, which authors mention in their work:
+ The data does not change. The proposed method can be extended more or
less straightforward to nearly append-only workloads and to workloads in
which the data distribution does not change or changes very slowly (the
data itself or its amount may change, nevertheless). Otherwise it is
challenging, because the key positions cannot be considered as
independent values anymore, but it looks still possible. The other
proposed by the authors option is to store new objects in buffer and
retrain model in the background, but that does not look as a nice

solution.

+ GPU are fast, but the latency of doing operations on the GPU almost
certainly avoids all speed benefits for such small models. The only case
in which it is reasonable to use GPU is training models (i. e. building
such indexes).

The following remained unclear to me from the paper:
+ How effectively the approach can be implemented given the limitations
above.
+ For hash indexes the authors propose to replace the hash function with
the learned CDF, which can decrease the number of collisions and hence
the memory consumption. That is reasonable, but this hash function
implicitly uses information about the ordering of the keys. I suppose
that, using that ordering and with access to a histogram of the data,
one could achieve a similar decrease in memory consumption.
+ The proposed hierarchical learning looks natural for the problem, but
it is not clear that it is the best option. For example, one might
instead use an end-to-end learning procedure with weight-sparsity
regularizers. I wonder whether the latter approach can obtain results
competitive with the former, and if not, why. In any case, I have a
feeling that the proposed method has room for improvement.
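A rough illustration of the CDF-as-hash-function point (my own toy
example; the exact CDF is assumed known here rather than learned): on a
skewed key set, a distribution-aware hash spreads keys over far more
buckets than a distribution-oblivious one.

```python
# CDF-based hashing sketch: h(k) = floor(CDF(k) * m) spreads keys evenly
# over m buckets when the key distribution is known.
import math

def cdf_hash(key, cdf, m):
    # cdf maps a key into [0, 1); scaling by the table size gives a bucket.
    return min(m - 1, int(cdf(key) * m))

# Skewed key set: squares clump toward the low end of their range.
keys = [i * i for i in range(1000)]
m = 1000

# Exact CDF of this distribution: P(K <= x) = sqrt(x) / 1000.
cdf = lambda x: math.sqrt(x) / 1000.0

naive = len({k % m for k in keys})                  # distribution-oblivious
learned = len({cdf_hash(k, cdf, m) for k in keys})  # CDF-based
print(naive, learned)  # the CDF hash fills far more distinct buckets
```

Note that `cdf` here relies on the key ordering, which is exactly the
objection raised above: an order-aware structure already has access to
much of this information.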

In general, the approach still needs additional research (mostly
regarding changing data) and has some implementation details to think
through (paging, for example), but it nevertheless looks promising.
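For reference, the hierarchical scheme discussed above can be sketched
with two stages of plain linear models (a toy version of the paper's
recursive model index; the code and names are mine, not the authors'):

```python
# Two-stage RMI sketch: a root model routes each key to one of `fanout`
# leaf models; each leaf refits on its share of the keys and predicts a
# position in the full sorted array.
import bisect

def fit(keys, positions):
    # least-squares fit position ~ a*key + b over the given subset
    n = len(keys)
    if n == 0:
        return 0.0, 0.0
    mx, my = sum(keys) / n, sum(positions) / n
    var = sum((k - mx) ** 2 for k in keys)
    a = (sum((k - mx) * (p - my) for k, p in zip(keys, positions)) / var
         if var else 0.0)
    return a, my - a * mx

def route(root, key, n, fanout):
    # stage 1: use the root model's position estimate to pick a leaf model
    a, b = root
    return max(0, min(fanout - 1, int((a * key + b) * fanout / n)))

def build_rmi(keys, fanout=4):
    n = len(keys)
    root = fit(keys, list(range(n)))
    buckets = [([], []) for _ in range(fanout)]
    for i, k in enumerate(keys):
        ks, ps = buckets[route(root, k, n, fanout)]
        ks.append(k)
        ps.append(i)
    # stage 2: each leaf refits on just the keys routed to it
    return root, [fit(ks, ps) for ks, ps in buckets]

def predict(index, key, n, fanout=4):
    root, leaves = index
    a, b = leaves[route(root, key, n, fanout)]
    return max(0, min(n - 1, round(a * key + b)))

keys = sorted(set(int(x ** 1.5) for x in range(500)))
n = len(keys)
rmi = build_rmi(keys)
# the measured worst-case error bounds the final local search window
err = max(abs(predict(rmi, k, n) - i) for i, k in enumerate(keys))
print(n, err)
```

A lookup then binary-searches only `keys[guess - err : guess + err + 1]`,
so the two small models stand in for most of a B-Tree traversal.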

--
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#16Bruce Momjian
bruce@momjian.us
In reply to: Stefan Keller (#14)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

On Wed, Apr 21, 2021 at 10:52:19AM +0200, Stefan Keller wrote:

On Tue, 20 Apr 2021 at 23:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:

There's enough support these days that you can build a new index
type as an extension, without touching the core code at all.

Thanks. I'm ramping up knowledge about extending PG with C++.

I'm still interested in understanding, in principle, what an index has
to do with concurrency control, in order to divide the
concerns/responsibilities of the code.

The issue is that some index structures, like bitmap indexes, have very
poor concurrent performance. This means that some indexes perform very
well for a single user but poorly for multiple users.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#17Stefan Keller
sfkeller@gmail.com
In reply to: Bruce Momjian (#16)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

On Wed, 21 Apr 2021 at 11:16, Oleg Bartunov <obartunov@postgrespro.ru> wrote:

Have you seen recent paper "Benchmarking Learned Indexes" ?

Yes. I skipped it, since that benchmark "just" compares the algorithm
implementations.

What's needed (and what many here, as well as the "ML-In-Databases"
paper from Kraska et al. (2021), are saying) is that a new index type
(like a learned index) should be implemented as a PostgreSQL extension.

On Wed, 21 Apr 2021 at 15:46, Bruce Momjian <bruce@momjian.us> wrote:

The issue is that some index structures, like bitmap indexes, have very
poor concurrent performance. This means that some indexes perform very
well for a single user but poorly for multiple users.

I see now. That looks to me like the second step of an experiment to
implement a possible new index.

~Stefan


#18Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Stefan Keller (#17)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

On 21 Apr 2021, at 21:01, Stefan Keller <sfkeller@gmail.com> wrote:

What's needed - and what many here as well as the "ML-In-Databases"
paper from Kraska et al. (2021) are saying - is, that a new index
(like a learned index) should be implemented as a PostgreSQL
extension.

BTW, you don't have to implement an index from scratch. You can take one of the many existing ones and modify it towards a learned index.
An index consists of many parts: scans, builds, vacuum, WAL, opclasses, etc., and simply adhering to the API is a lot of work.
I'll be happy to help you with formulating the key differences and, probably, with the code of an extension.

Best regards, Andrey Borodin.