Cube extension split algorithm fix

Started by Stas Kelvich over 12 years ago · 3 messages · pgsql-hackers
#1Stas Kelvich
stas.kelvich@gmail.com

Hello, hackers.

I've fixed the split algorithm implemented in the cube extension. I've changed it to follow the original Guttman paper (the old version was a simplified algorithm) and also ported Alexander Korotkov's algorithm from box datatype indexing, which works faster and better in low dimensions.

On my test dataset (1M records, 7 dimensions, a real-world database of goods) the numbers were as follows:

Building index over table (on expression):
old: 67.296058 seconds
new: 48.842391 seconds

Cube point search, mean, 100 queries
old: 0.001025 seconds
new: 0.000427 seconds

Index on field size:
old: 562 MB
new: 283 MB

Stas.

Attachments:

split.diff (application/octet-stream, +928 −225)
#2Peter Eisentraut
peter_e@gmx.net
In reply to: Stas Kelvich (#1)
Re: Cube extension split algorithm fix

On 9/25/13 7:14 AM, Stas Kelvich wrote:

I've fixed split algorithm that was implemented in cube extension.

This patch creates a bunch of new compiler warnings. Please fix those.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Alexander Korotkov
aekorotkov@gmail.com
In reply to: Stas Kelvich (#1)
Re: Cube extension split algorithm fix

Hi!

On Wed, Sep 25, 2013 at 3:14 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:

I've fixed the split algorithm implemented in the cube extension. I've
changed it to follow the original Guttman paper (the old version was a
simplified algorithm) and also ported Alexander Korotkov's algorithm from
box datatype indexing, which works faster and better in low dimensions.

On my test dataset (1M records, 7 dimensions, a real-world database of
goods) the numbers were as follows:

Building index over table (on expression):
old: 67.296058 seconds
new: 48.842391 seconds

Cube point search, mean, 100 queries
old: 0.001025 seconds
new: 0.000427 seconds

Index on field size:
old: 562 MB
new: 283 MB

Thanks for patch! Here is my review.

1) After points for cubes were committed, this patch no longer applies to
head.

patching file contrib/cube/cube.c
Hunk #1 FAILED at 131.
Hunk #2 succeeded at 482 (offset 17 lines).
Hunk #3 succeeded at 494 (offset 17 lines).
Hunk #4 succeeded at 501 (offset 17 lines).
Hunk #5 succeeded at 611 (offset 17 lines).
Hunk #6 FAILED at 681.
Hunk #7 succeeded at 790 with fuzz 1 (offset 30 lines).
Hunk #8 succeeded at 808 (offset 29 lines).
Hunk #9 succeeded at 888 (offset 29 lines).
Hunk #10 succeeded at 1090 (offset 29 lines).
Hunk #11 succeeded at 1160 (offset 29 lines).
Hunk #12 succeeded at 1272 (offset 27 lines).
Hunk #13 succeeded at 1560 with fuzz 1 (offset 95 lines).
2 out of 13 hunks FAILED -- saving rejects to file contrib/cube/cube.c.rej
(Stripping trailing CRs from patch.)
patching file contrib/cube/cubedata.h
Hunk #1 succeeded at 47 (offset 35 lines).

2) The split algorithm is chosen based on the number of dimensions of the
first cube. This is generally a weak idea :) At the very least, cubes in
the same index can have different numbers of dimensions, so this rule
would give different answers for different orderings of the cubes. Also,
as a corner case, (n+1)-dimensional cubes with a confluent (n+1)th
dimension are essentially the same as n-dimensional cubes, and the choice
of split algorithm shouldn't depend on that. We should try to extrapolate
this principle a little further and detect dimensions that are useless
for splitting, in order to make a better choice. Also, a part as critical
as choosing the split algorithm needs to be documented (with references
to the papers if needed).

3) We need more comprehensive testing. One dataset is not enough: we need
many more to validate the rule for choosing the split algorithm. The
description of the 'Cube point search' query set is not detailed. Can you
share the dataset you used? We need tests that anyone can reproduce.

------
With best regards,
Alexander Korotkov.