[GSoC] application ideas

Started by pantilimonov mishaalmost 7 years ago5 messages

pantlimon@yandex.ru

almost 7 years ago

<div xmlns="http://www.w3.org/1999/xhtml"><div>Greetings,</div><div> </div><div>i am interested in databases and would like to make a contribution to the</div><div>PostgreSQL by participating in GSoC 2019. Currently i am studying in HSE[1],</div><div>doing last year of master's program that mostly build on top of collaboration</div><div>with ISP RAS[2].</div><div> </div><div>In the previous year i have been working on llvm_jit extension for</div><div>PostgreSQL 9.6, that was developed in ISP RAS and presented at PGCON[3].</div><div>Specifically, my work consisted of adding support for several missing</div><div>nodes(bitmapscan, mergejoin, subqueryscan, etc)</div><div>by rewriting them with LLVM API, as well as other functionality(e.g. distinct in group by)</div><div>that is required to fully support TPC-H of SCALE 100.</div><div> </div><div>Originally i wanted to pursue "TOAST" tasks from ideas list, but noticed</div><div>that couple of students have already mentioned them in mailing list. So, instead</div><div>of increasing the queue for single possible idea, i would like to offer other</div><div>ones, that sound interesting to me and can potentially be useful for PostgreSQL</div><div>and community:</div><div> </div><div>1) The so-called Adaptive join, that exists in modern Oracle[4] and MSSQL[5] </div><div> versions. This type of node is designed to mitigate cardinality estimation</div><div> errors in queries that are somewhere inbetween NL(nested loop with indexscan)</div><div> and HJ(hashjoin).</div><div> </div><div> One possible implementation of that is to start execution in HJ fasion, by accumulating</div><div> rows in hashtable with certain threshold. If threshold is not exceeded, then</div><div> continue with indexscan, otherwise switch to usual HJ.</div><div> </div><div>2) Changing buffer manager strategy.</div><div> Somewhere in 2016 Andres Freund made a presention[6] of possible improvements</div><div> that can be done in buffer manager. I find the idea of changing hashtable to</div><div> trees of radix trees[7] promising. Most likely, taking into account program's</div><div> time constraints, this task won't be done as "ready to deploy" solution.<br /> Instead, some kind of prototype can be implemented and benchmarked. </div><div> </div><div>3) Improvements in jit component.</div><div> Great progress has been made in this direction in 10 and 11 versions, but</div><div> still there's a lot to be done. Possible subareas: compiled code caching/sharing,</div><div> cost-based optimizer improvements, push-based execution with bytecode</div><div> transformation, compiling plpgsql, etc.</div><div> </div><div>At this stage i would like to receive some feedback from the community,</div><div>which of those ideas are more useful for the near future of PostgreSQL and</div><div>more suitable for GSoC itself. With that information i can dive into particular</div><div>topic, extract additional information and prepare required proposal.</div><div> </div><div>p.s. my preferred order: 2,1,3</div><div> </div><div>--------------------------------------------------------------------------------</div><div>[1] https://www.hse.ru/en/ma/sp</div><div>[2] http://www.ispras.ru/en/</div><div>[3] http://www.pgcon.org/2017/schedule/events/1092.en.html</div><div>[4] https://www.oracle.com/technetwork/database/bi-datawarehousing/twp-optimizer-with-oracledb-12c-1963236.pdf</div><div>[5] https://blogs.msdn.microsoft.com/sqlserverstorageengine/2017/04/19/introducing-batch-mode-adaptive-joins/</div><div>[6] https://pgconf.ru/media/2016/05/13/1io.pdf</div><div>[7] http://events17.linuxfoundation.org/sites/events/files/slides/LinuxConNA2016%20-%20Radix%20Tree.pdf</div><div> </div><div>Best regards,</div><div> </div><div>Michael.</div></div>

pantilimonov misha

pantlimon@yandex.ru

almost 7 years ago

In reply to: pantilimonov misha (#1)

Re: [GSoC] application ideas

Excuse me for the previous letter, should be fixed now by using simple html.

---

Greetings,

i am interested in databases and would like to make a contribution to the
PostgreSQL by participating in GSoC 2019. Currently i am studying in HSE[1]https://www.hse.ru/en/ma/sp,
doing last year of master's program that mostly build on top of collaboration
with ISP RAS[2]http://www.ispras.ru/en/.

In the previous year i have been working on llvm_jit extension for
PostgreSQL 9.6, that was developed in ISP RAS and presented at PGCON[3]http://www.pgcon.org/2017/schedule/events/1092.en.html.
Specifically, my work consisted of adding support for several missing
nodes(bitmapscan, mergejoin, subqueryscan, etc)
by rewriting them with LLVM API, as well as other functionality(e.g. distinct in group by)
that is required to fully support TPC-H of SCALE 100.

Originally i wanted to pursue "TOAST" tasks from ideas list, but noticed
that couple of students have already mentioned them in mailing list. So, instead
of increasing the queue for single possible idea, i would like to offer other
ones, that sound interesting to me and can potentially be useful for PostgreSQL
and community:

1) The so-called Adaptive join, that exists in modern Oracle[4]https://www.oracle.com/technetwork/database/bi-datawarehousing/twp-optimizer-with-oracledb-12c-1963236.pdf and MSSQL[5]https://blogs.msdn.microsoft.com/sqlserverstorageengine/2017/04/19/introducing-batch-mode-adaptive-joins/
versions. This type of node is designed to mitigate cardinality estimation
errors in queries that are somewhere inbetween NL(nested loop with indexscan)
and HJ(hashjoin).

One possible implementation of that is to start execution in HJ fasion, by accumulating
rows in hashtable with certain threshold. If threshold is not exceeded, then
continue with indexscan, otherwise switch to usual HJ.

2) Changing buffer manager strategy.
Somewhere in 2016 Andres Freund made a presention[6]https://pgconf.ru/media/2016/05/13/1io.pdf of possible improvements
that can be done in buffer manager. I find the idea of changing hashtable to
trees of radix trees[7]http://events17.linuxfoundation.org/sites/events/files/slides/LinuxConNA2016%20-%20Radix%20Tree.pdf promising. Most likely, taking into account program's
time constraints, this task won't be done as "ready to deploy" solution.
Instead, some kind of prototype can be implemented and benchmarked.

3) Improvements in jit component.
Great progress has been made in this direction in 10 and 11 versions, but
still there's a lot to be done. Possible subareas: compiled code caching/sharing,
cost-based optimizer improvements, push-based execution with bytecode
transformation, compiling plpgsql, etc.

At this stage i would like to receive some feedback from the community,
which of those ideas are more useful for the near future of PostgreSQL and
more suitable for GSoC itself. With that information i can dive into particular
topic, extract additional information and prepare required proposal.

p.s. my preferred order: 2,1,3

--------------------------------------------------------------------------------
[1]: https://www.hse.ru/en/ma/sp
[2]: http://www.ispras.ru/en/
[3]: http://www.pgcon.org/2017/schedule/events/1092.en.html
[4]: https://www.oracle.com/technetwork/database/bi-datawarehousing/twp-optimizer-with-oracledb-12c-1963236.pdf
[5]: https://blogs.msdn.microsoft.com/sqlserverstorageengine/2017/04/19/introducing-batch-mode-adaptive-joins/
[6]: https://pgconf.ru/media/2016/05/13/1io.pdf
[7]: http://events17.linuxfoundation.org/sites/events/files/slides/LinuxConNA2016%20-%20Radix%20Tree.pdf

Best regards,

Michael.

Andrey Borodin

x4mmm@yandex-team.ru

almost 7 years ago

In reply to: pantilimonov misha (#1)

Re: [GSoC] application ideas

Hi, Michael!

19 марта 2019 г., в 14:53, pantilimonov misha <pantlimon@yandex.ru> написал(а):

2) Changing buffer manager strategy.
Somewhere in 2016 Andres Freund made a presention[6] of possible improvements
that can be done in buffer manager. I find the idea of changing hashtable to
trees of radix trees[7] promising. Most likely, taking into account program's
time constraints, this task won't be done as "ready to deploy" solution.
Instead, some kind of prototype can be implemented and benchmarked.

I like the idea of more efficient BufferTag->Page data structure. I'm not sure cache locality is a real problem there, but I believe this idea deserves giving it a shot.
I'd happily review your proposal and co-mentor project, if it will be chosen for GSoC.

Also, plz check some work of my students in related area [0]/messages/by-id/89A121E3-B593-4D65-98D9-BBC210B87268@yandex-team.ru.

Best regards, Andrey Borodin.

[0]: /messages/by-id/89A121E3-B593-4D65-98D9-BBC210B87268@yandex-team.ru

pantilimonov misha

pantlimon@yandex.ru

almost 7 years ago

In reply to: Andrey Borodin (#3)

Re: [GSoC] application ideas

Andrey, thank you for your reply.

24.03.2019, 12:12, "Andrey Borodin" <x4mmm@yandex-team.ru>:

I like the idea of more efficient BufferTag->Page data structure. I'm not sure cache locality is a real problem there, but I believe this idea deserves giving it a shot.
I'd happily review your proposal and co-mentor project, if it will be chosen for GSoC.

Here it is:

https://docs.google.com/document/d/1HmhOs07zE8Q1TX1pOdtjxHSjjAUjaO2tp9NSmry8muY/edit?usp=sharing

Also, plz check some work of my students in related area [0].

This is definitely helpful! Also found couple of relevant discussions, putting information together...

--
Best regards,

Michael.

Andrey Borodin

x4mmm@yandex-team.ru

almost 7 years ago

In reply to: pantilimonov misha (#4)

Re: [GSoC] application ideas

Hi!

We are discussing GSoC details offlist, but I'll put some recommendations on your proposal to the list.

3 апр. 2019 г., в 2:53, pantilimonov misha <pantlimon@yandex.ru> написал(а):

Andrey, thank you for your reply.

24.03.2019, 12:12, "Andrey Borodin" <x4mmm@yandex-team.ru>:

I like the idea of more efficient BufferTag->Page data structure. I'm not sure cache locality is a real problem there, but I believe this idea deserves giving it a shot.
I'd happily review your proposal and co-mentor project, if it will be chosen for GSoC.

Here it is:

https://docs.google.com/document/d/1HmhOs07zE8Q1TX1pOdtjxHSjjAUjaO2tp9NSmry8muY/edit?usp=sharing

While your project is clearly research-oriented, it is planned within PostgreSQL development process. Can you please add some information about which patches are your going to put on commitfest?
Also, please plan to review one or more patches. This is important for integrating into community.

BTW, there is somewhat related IntegerSet data structure added recently [0]https://github.com/postgres/postgres/commit/df816f6ad532ad685a3897869a2e64d3a53fe312. In my version it was implemented as radix tree. I think it is a good example how generic data structure can be presented and then reused by BufferManager.

Best regards, Andrey Borodin.

[0]: https://github.com/postgres/postgres/commit/df816f6ad532ad685a3897869a2e64d3a53fe312