WIP: Fast GiST index build

Started by Alexander Korotkovalmost 15 years ago143 messageshackers
Jump to latest
#1Alexander Korotkov
aekorotkov@gmail.com

Hackers,

WIP patch of fast GiST index build is attached. Code is dirty and comments
are lacking, but it works. Now it is ready for first benchmarks, which
should prove efficiency of selected technique. It's time to compare fast
GiST index build with repeat insert build on large enough datasets (datasets
which don't fit to cache). There are following aims of testing:
1) Measure acceleration of index build.
2) Measure change in index quality.
I'm going to do first testing using synthetic datasets. Everybody who have
interesting real-life datasets for testing are welcome.

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.0.3.patch.gzapplication/x-gzip; name=gist_fast_build-0.0.3.patch.gzDownload+1-0
#2Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#1)
Re: WIP: Fast GiST index build

On 03.06.2011 14:02, Alexander Korotkov wrote:

Hackers,

WIP patch of fast GiST index build is attached. Code is dirty and comments
are lacking, but it works. Now it is ready for first benchmarks, which
should prove efficiency of selected technique. It's time to compare fast
GiST index build with repeat insert build on large enough datasets (datasets
which don't fit to cache). There are following aims of testing:
1) Measure acceleration of index build.
2) Measure change in index quality.
I'm going to do first testing using synthetic datasets. Everybody who have
interesting real-life datasets for testing are welcome.

I did some quick performance testing of this. I installed postgis 1.5,
and loaded an extract of the OpenStreetMap data covering Finland. The
biggest gist index in that data set is the idx_nodes_geom index on nodes
table. I have maintenance_work_mem and shared_buffers both set to 512
MB, and this laptop has 4GB of RAM.

Without the patch, reindexing the index takes about 170 seconds and the
index size is 321 MB. And with the patch, it takes about 150 seconds,
and the resulting index size is 319 MB.

The nodes table is 618MB in size, so it fits in RAM. I presume the gain
would be bigger if it doesn't, as the random I/O to update the index
starts to hurt more. But this shows that even when it does, this patch
helps a little bit, and the resulting index size is comparable.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#1)
Re: WIP: Fast GiST index build

On 03.06.2011 14:02, Alexander Korotkov wrote:

Hackers,

WIP patch of fast GiST index build is attached. Code is dirty and comments
are lacking, but it works. Now it is ready for first benchmarks, which
should prove efficiency of selected technique. It's time to compare fast
GiST index build with repeat insert build on large enough datasets (datasets
which don't fit to cache). There are following aims of testing:
1) Measure acceleration of index build.
2) Measure change in index quality.
I'm going to do first testing using synthetic datasets. Everybody who have
interesting real-life datasets for testing are welcome.

I ran another test with a simple table generated with:

CREATE TABLE pointtest (p point);
INSERT INTO pointtest SELECT point(random(), random()) FROM
generate_series(1,50000000);

Generating a gist index with:

CREATE INDEX i_pointtest ON pointtest USING gist (p);

took about 15 hours without the patch, and 2 hours with it. That's quite
dramatic.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#4Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#3)
Re: WIP: Fast GiST index build

On 06.06.2011 10:42, Heikki Linnakangas wrote:

On 03.06.2011 14:02, Alexander Korotkov wrote:

Hackers,

WIP patch of fast GiST index build is attached. Code is dirty and
comments
are lacking, but it works. Now it is ready for first benchmarks, which
should prove efficiency of selected technique. It's time to compare fast
GiST index build with repeat insert build on large enough datasets
(datasets
which don't fit to cache). There are following aims of testing:
1) Measure acceleration of index build.
2) Measure change in index quality.
I'm going to do first testing using synthetic datasets. Everybody who
have
interesting real-life datasets for testing are welcome.

I ran another test with a simple table generated with:

CREATE TABLE pointtest (p point);
INSERT INTO pointtest SELECT point(random(), random()) FROM
generate_series(1,50000000);

Generating a gist index with:

CREATE INDEX i_pointtest ON pointtest USING gist (p);

took about 15 hours without the patch, and 2 hours with it. That's quite
dramatic.

Oops, that was a rounding error, sorry. The run took about 2.7 hours
with the patch, which of course should be rounded to 3 hours, not 2.
Anyway, it is still a very impressive improvement.

I'm glad you could get the patch ready for benchmarking this quickly.
Now you just need to get the patch into shape so that it can be
committed. That is always the more time-consuming part, so I'm glad you
have plenty of time left for it.

Could you please create a TODO list on the wiki page, listing all the
missing features, known bugs etc. that will need to be fixed? That'll
make it easier to see how much work there is left. It'll also help
anyone looking at the patch to know which issues are known issues.

Meanwhile, it would still be very valuable if others could test this
with different workloads. And Alexander, it would be good if at some
point you could write some benchmark scripts too, and put them on the
wiki page, just to see what kind of workloads have been taken into
consideration and tested already. Do you think there's some worst-case
data distributions where this algorithm would perform particularly badly?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#4)
Re: WIP: Fast GiST index build

Hi!

On Mon, Jun 6, 2011 at 2:51 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 06.06.2011 10:42, Heikki Linnakangas wrote:

I ran another test with a simple table generated with:

CREATE TABLE pointtest (p point);
INSERT INTO pointtest SELECT point(random(), random()) FROM
generate_series(1,50000000);

Generating a gist index with:

CREATE INDEX i_pointtest ON pointtest USING gist (p);

took about 15 hours without the patch, and 2 hours with it. That's quite
dramatic.

Oops, that was a rounding error, sorry. The run took about 2.7 hours with
the patch, which of course should be rounded to 3 hours, not 2. Anyway, it
is still a very impressive improvement.

I have similar results on 100 millions of rows: 21.6 hours without patch and
2 hours with patch. But I found a problem: index quality is worse. See
following query plans. There test is relation where index was created in
ordinal way, and test2 is relation where patch was used.

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=4391.01..270397.31 rows=100000 width=20)
(actual time=1.257..2.147 rows=838 loops=1)
Recheck Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
Buffers: shared hit=968
-> Bitmap Index Scan on test_idx (cost=0.00..4366.01 rows=100000
width=0) (actual time=1.162..1.162 rows=838 loops=1)
Index Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
Buffers: shared hit=131
Total runtime: 2.214 ms
(7 rows)

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test2 (cost=4370.84..270377.13 rows=100000 width=20)
(actual time=5.252..6.056 rows=838 loops=1)
Recheck Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
Buffers: shared hit=1458
-> Bitmap Index Scan on test2_idx (cost=0.00..4345.84 rows=100000
width=0) (actual time=5.155..5.155 rows=838 loops=1)
Index Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
Buffers: shared hit=621
Total runtime: 6.121 ms
(7 rows)

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=4391.01..270397.31 rows=100000 width=20)
(actual time=2.148..2.977 rows=850 loops=1)
Recheck Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
Buffers: shared hit=1099
-> Bitmap Index Scan on test_idx (cost=0.00..4366.01 rows=100000
width=0) (actual time=2.052..2.052 rows=850 loops=1)
Index Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
Buffers: shared hit=249
Total runtime: 3.033 ms
(7 rows)

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test2 (cost=4370.84..270377.13 rows=100000 width=20)
(actual time=6.806..7.602 rows=850 loops=1)
Recheck Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
Buffers: shared hit=1615
-> Bitmap Index Scan on test2_idx (cost=0.00..4345.84 rows=100000
width=0) (actual time=6.709..6.709 rows=850 loops=1)
Index Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
Buffers: shared hit=773
Total runtime: 7.667 ms
(7 rows)

We can see that index scan requires read of several times more
pages. Original paper denotes such effect. It explains it by the routing
rectangles in less optimal ways. But this effect wasn't so dramatic in tests
provided in the paper. So, I have following thoughts about this problem:

1) Number of pages, which was readed from index is too large even with
ordinal index build. Querying of small area requires read of hundred of
pages. It probbably caused by picksplit implementation. I've version of
picksplit algorithm which seems to be much more efficient. I'll do some
benchmarks with my picksplit algorithm. I hope difference in index quality
will be not so dramatic.

2) I can try to do some enchancements in fast build alogrithms which could
improve tree quality. In original paper Hilbert heuristic was used to achive
even better tree quality than tree which was created in ordinal way. But
since we use GiST we are restricted by it's interface (or we have to create
new interface functions(s), but I like to avoid it). I would like to try to
do some ordering by penalty value in buffer emptying process and buffers
relocation on split.

3) Probably, there is some bug which affects tree quality.

Could you please create a TODO list on the wiki page, listing all the
missing features, known bugs etc. that will need to be fixed? That'll make
it easier to see how much work there is left. It'll also help anyone looking
at the patch to know which issues are known issues.

Sure. I'll create such list on wiki page. I believe that currenlty most
important issue is index quality.

Meanwhile, it would still be very valuable if others could test this with
different workloads. And Alexander, it would be good if at some point you
could write some benchmark scripts too, and put them on the wiki page, just
to see what kind of workloads have been taken into consideration and tested
already. Do you think there's some worst-case data distributions where this
algorithm would perform particularly badly?

I don't expect some bad cases in terms in IO. My most worrying is about
index quality which is strongly related to data distribution.

------
With best regards,
Alexander Korotkov.

#6Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#4)
Re: WIP: Fast GiST index build

On Mon, Jun 6, 2011 at 2:51 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

Do you think there's some worst-case data distributions where this
algorithm would perform particularly badly?

I think there could be some worst-case GiST applications. Just now gist fast
build algorithm invokes more penalty calls than repeatable insert algorithm.
If I succeed then it will invoke even more such calls. So, if penalty
function is very slow then gist fast build will be slover then
repeatable insert.

------
With best regards,
Alexander Korotkov.

#7Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#6)
Re: WIP: Fast GiST index build

On Mon, Jun 6, 2011 at 4:14 PM, Alexander Korotkov <aekorotkov@gmail.com>wrote:

If I succeed then it will invoke even more such calls.

I meant here that if I succeed in enhancements which improve index quality
then fast build algorithm will invoke even more such calls.

------
With best regards,
Alexander Korotkov.

#8Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#7)
Re: WIP: Fast GiST index build

I've tried index tuples sorting on penalty function before buffer relocation
on split. But it was without any success. Index quality becomes even worse
than without sorting.
The next thing I've tried is buffer relocation between all neighbor buffers.
Results of first tests is much more promising. Number of page accesses
during index scan is similar to those without fast index build. I'm going to
hold on this approach.

test=# create index test_idx on test using gist(v);
NOTICE: Level step = 1, pagesPerBuffer = 406
CREATE INDEX
Time: 10002590,469 ms

test=# select pg_size_pretty(pg_relation_size('test_idx'));
pg_size_pretty
----------------
6939 MB
(1 row)

test=# explain (analyze, buffers) select * from test where v <@
'(0.903,0.203),(0.9,0.2)'::box;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=4366.78..258752.22 rows=100000 width=16)
(actual time=1.412..2.295 rows=897 loops=1)
Recheck Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
Buffers: shared hit=1038
-> Bitmap Index Scan on test_idx (cost=0.00..4341.78 rows=100000
width=0) (actual time=1.311..1.311 rows=897 loops=1)
Index Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
Buffers: shared hit=141
Total runtime: 2.375 ms
(7 rows)

test=# explain (analyze, buffers) select * from test where v <@
'(0.503,0.503),(0.5,0.5)'::box;

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=4366.78..258752.22 rows=100000 width=16)
(actual time=2.113..2.972 rows=855 loops=1)
Recheck Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
Buffers: shared hit=1095
-> Bitmap Index Scan on test_idx (cost=0.00..4341.78 rows=100000
width=0) (actual time=2.016..2.016 rows=855 loops=1)
Index Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
Buffers: shared hit=240
Total runtime: 3.043 ms
(7 rows)

------
With best regards,
Alexander Korotkov.

Attachments:

gist_fast_build-0.1.0.patch.gzapplication/x-gzip; name=gist_fast_build-0.1.0.patch.gzDownload
#9Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#8)
Re: WIP: Fast GiST index build

On Wed, Jun 15, 2011 at 11:21 AM, Alexander Korotkov
<aekorotkov@gmail.com>wrote:

I've tried index tuples sorting on penalty function before buffer
relocation on split. But it was without any success. Index quality becomes
even worse than without sorting.
The next thing I've tried is buffer relocation between all neighbor
buffers. Results of first tests is much more promising. Number of page
accesses during index scan is similar to those without fast index build. I'm
going to hold on this approach.

test=# create index test_idx on test using gist(v);
NOTICE: Level step = 1, pagesPerBuffer = 406
CREATE INDEX
Time: 10002590,469 ms

I forget to say that build time increases in about 40%, but it is still
faster than ordinal build in about 10 times.

------
With best regards,
Alexander Korotkov.

#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#9)
Re: WIP: Fast GiST index build

On 15.06.2011 10:24, Alexander Korotkov wrote:

On Wed, Jun 15, 2011 at 11:21 AM, Alexander Korotkov
<aekorotkov@gmail.com>wrote:

I've tried index tuples sorting on penalty function before buffer
relocation on split. But it was without any success. Index quality becomes
even worse than without sorting.
The next thing I've tried is buffer relocation between all neighbor
buffers. Results of first tests is much more promising. Number of page
accesses during index scan is similar to those without fast index build. I'm
going to hold on this approach.

test=# create index test_idx on test using gist(v);
NOTICE: Level step = 1, pagesPerBuffer = 406
CREATE INDEX
Time: 10002590,469 ms

I forget to say that build time increases in about 40%, but it is still
faster than ordinal build in about 10 times.

Is this relocation mechanism something that can be tuned, for different
tradeoffs between index quality and build time? In any case, it seems
that we're going to need a lot of testing with different data sets to
get a better picture of how this performs. But at least for now, it
looks like this approach is going to be acceptable.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#10)
Re: WIP: Fast GiST index build

On Wed, Jun 15, 2011 at 12:03 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

Is this relocation mechanism something that can be tuned, for different
tradeoffs between index quality and build time?

Yes, it can. I believe that it can be index parameter.

In any case, it seems that we're going to need a lot of testing with
different data sets to get a better picture of how this performs.

Sure. My problem is that I haven't large enough reallife datasets. Picture
of syntetic datasets can be unrepresentative on reallife cases. On smaller
datasets that I have I actually can compare only index quality. Also, tests
with large datasets takes long time especially without fast build. Probably
solution is to limit cache size during testing. It should allow to measure
I/O benefit even on relatively small datasets. But while I don't know now to
do that on Linux.

------
With best regards,
Alexander Korotkov.

#12Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#11)
Re: WIP: Fast GiST index build

Actually, I would like to measure CPU and IO load independently for more
comprehensive benchmarks. Can you advice me some appropriate tools for it?

------
With best regards,
Alexander Korotkov.

#13Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#12)
Re: WIP: Fast GiST index build

My current idea is to measure number of IO accesses by pg_stat_statements
and measure CPU usage by /proc/PID/stat. Any thoughts?

------
With best regards,
Alexander Korotkov.

On Thu, Jun 16, 2011 at 1:33 PM, Alexander Korotkov <aekorotkov@gmail.com>wrote:

Show quoted text

Actually, I would like to measure CPU and IO load independently for more
comprehensive benchmarks. Can you advice me some appropriate tools for it?

------
With best regards,
Alexander Korotkov.

#14Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#13)
Re: WIP: Fast GiST index build

On 16.06.2011 21:13, Alexander Korotkov wrote:

My current idea is to measure number of IO accesses by pg_stat_statements
and measure CPU usage by /proc/PID/stat. Any thoughts?

Actually, you get both of those very easily with:

set log_statement_stats=on

LOG: QUERY STATISTICS
DETAIL: ! system usage stats:
! 0.000990 elapsed 0.000000 user 0.000000 system sec
! [0.000000 user 0.008000 sys total]
! 0/0 [32/0] filesystem blocks in/out
! 0/0 [0/959] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 0/0 [10/1] voluntary/involuntary context switches
STATEMENT: SELECT generate_series(1,100);

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#15Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#14)
Re: WIP: Fast GiST index build

Oh, actually it's so easy. Thanks.

------
With best regards,
Alexander Korotkov.

On Thu, Jun 16, 2011 at 10:26 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

Show quoted text

On 16.06.2011 21:13, Alexander Korotkov wrote:

My current idea is to measure number of IO accesses by pg_stat_statements
and measure CPU usage by /proc/PID/stat. Any thoughts?

Actually, you get both of those very easily with:

set log_statement_stats=on

LOG: QUERY STATISTICS
DETAIL: ! system usage stats:
! 0.000990 elapsed 0.000000 user 0.000000 system sec
! [0.000000 user 0.008000 sys total]
! 0/0 [32/0] filesystem blocks in/out
! 0/0 [0/959] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 0/0 [10/1] voluntary/involuntary context switches
STATEMENT: SELECT generate_series(1,100);

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#16Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#15)
Re: WIP: Fast GiST index build

New version of patch. There are some bugfixes, minor refactoring, comments
(unfortunatelly, not all the code is covered by comments yet). Also
"fastbuild" parameter was added to the GiST index. It allows to test index
building with and without fast build without postgres recompile.

------
With best regards,
Alexander Korotkov.

On Thu, Jun 16, 2011 at 10:35 PM, Alexander Korotkov
<aekorotkov@gmail.com>wrote:

Show quoted text

Oh, actually it's so easy. Thanks.

------
With best regards,
Alexander Korotkov.

On Thu, Jun 16, 2011 at 10:26 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 16.06.2011 21:13, Alexander Korotkov wrote:

My current idea is to measure number of IO accesses by pg_stat_statements
and measure CPU usage by /proc/PID/stat. Any thoughts?

Actually, you get both of those very easily with:

set log_statement_stats=on

LOG: QUERY STATISTICS
DETAIL: ! system usage stats:
! 0.000990 elapsed 0.000000 user 0.000000 system sec
! [0.000000 user 0.008000 sys total]
! 0/0 [32/0] filesystem blocks in/out
! 0/0 [0/959] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 0/0 [10/1] voluntary/involuntary context switches
STATEMENT: SELECT generate_series(1,100);

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

gist_fast_build-0.2.0.patch.gzapplication/x-gzip; name=gist_fast_build-0.2.0.patch.gzDownload
#17Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#16)
Re: WIP: Fast GiST index build

Hi!

I've created section about testing in project wiki page:
http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011#Testing_results
Do you have any notes about table structure?
As you can see I found that CPU usage might be much higher
with gist_trgm_ops. I believe it's due to relatively expensive penalty
method in that opclass. But, probably index build can be still faster when
index doesn't fit cache even for gist_trgm_ops. Also with that opclass index
quality is slightly worse but the difference is not dramatic.

------
With best regards,
Alexander Korotkov.

#18Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#17)
Re: WIP: Fast GiST index build

On 21.06.2011 13:08, Alexander Korotkov wrote:

I've created section about testing in project wiki page:
http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011#Testing_results
Do you have any notes about table structure?

It would be nice to have links to the datasets and scripts used, so that
others can reproduce the tests.

It's surprising that the search time differs so much between the
point_ops tests with uniformly random data with 100M and 10M rows. Just
to be sure I'm reading it correctly: a small search time is good, right?
You might want to spell that out explicitly.

As you can see I found that CPU usage might be much higher
with gist_trgm_ops.

Yeah, that is a bit worrysome. 6 minutes without the patch and 18
minutes with it.

I believe it's due to relatively expensive penalty
method in that opclass.

Hmm, I wonder if it could be optimized. I did a quick test, creating a
gist_trgm_ops index on a list of English words from
/usr/share/dict/words. oprofile shows that with the patch, 60% of the
CPU time is spent in the makesign() function.

But, probably index build can be still faster when
index doesn't fit cache even for gist_trgm_ops.

Yep.

Also with that opclass index
quality is slightly worse but the difference is not dramatic.

5-10% difference should be acceptable

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#19Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#18)
Re: WIP: Fast GiST index build

On Fri, Jun 24, 2011 at 12:40 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 21.06.2011 13:08, Alexander Korotkov wrote:

I've created section about testing in project wiki page:
http://wiki.postgresql.org/**wiki/Fast_GiST_index_build_**
GSoC_2011#Testing_results<http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011#Testing_results&gt;
Do you have any notes about table structure?

It would be nice to have links to the datasets and scripts used, so that
others can reproduce the tests.

Done.

It's surprising that the search time differs so much between the point_ops
tests with uniformly random data with 100M and 10M rows. Just to be sure I'm
reading it correctly: a small search time is good, right? You might want to
spell that out explicitly.

Yes, you're reading this correctly. Detailed explanation was added to the
wiki page. It's surprising for me too. I need some more insight into causes
of index quality difference.

Now I found some large enough real-life datasets (thanks to Oleg Bartunov)
and I'm performing tests on them.

------
With best regards,
Alexander Korotkov.

#20Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#18)
Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

On 24.06.2011 11:40, Heikki Linnakangas wrote:

On 21.06.2011 13:08, Alexander Korotkov wrote:

I believe it's due to relatively expensive penalty
method in that opclass.

Hmm, I wonder if it could be optimized. I did a quick test, creating a
gist_trgm_ops index on a list of English words from
/usr/share/dict/words. oprofile shows that with the patch, 60% of the
CPU time is spent in the makesign() function.

I couldn't resist looking into this, and came up with the attached
patch. I tested this with:

CREATE TABLE words (word text);
COPY words FROM '/usr/share/dict/words';
CREATE INDEX i_words ON words USING gist (word gist_trgm_ops);

And then ran "REINDEX INDEX i_words" a few times with and without the
patch. Without the patch, reindex takes about 4.7 seconds. With the
patch, 3.7 seconds. That's a worthwhile gain on its own, but becomes
even more important with Alexander's fast GiST build patch, which calls
the penalty function more.

I used the attached showsign-debug.patch to verify that the patched
makesign function produces the same results as the existing code. I
haven't tested the big-endian code, however.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

fast_makesign.patchtext/x-diff; name=fast_makesign.patchDownload+77-6
showsign-debug.patchtext/x-diff; name=showsign-debug.patchDownload+29-0
#21Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#20)
#22Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#21)
#23Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#22)
#24Jesper Krogh
jesper@krogh.cc
In reply to: Heikki Linnakangas (#3)
#25Alexander Korotkov
aekorotkov@gmail.com
In reply to: Jesper Krogh (#24)
#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#25)
#27Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#19)
#28Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#27)
#29Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#27)
#30Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#29)
#31Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#30)
#32Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#31)
#33Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#32)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#20)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#34)
#36Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#35)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#36)
#38Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#33)
#39Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#38)
#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#39)
#41Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#40)
#42Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#41)
#43Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#42)
#44Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#43)
#45Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#44)
#46Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#42)
#47Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#46)
#48Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#47)
#49Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#48)
#50Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#49)
#51Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#50)
#52Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#51)
#53Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#52)
#54Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#53)
#55Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#54)
#56Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#55)
#57Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#56)
#58Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#56)
#59Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#58)
#60Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#59)
#61Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#60)
#62Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#56)
#63Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#62)
#64Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#63)
#65Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#60)
#66Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#65)
#67Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#66)
#68Thom Brown
thom@linux.com
In reply to: Heikki Linnakangas (#67)
#69Simon Riggs
simon@2ndQuadrant.com
In reply to: Thom Brown (#68)
#70Thom Brown
thom@linux.com
In reply to: Simon Riggs (#69)
#71Alexander Korotkov
aekorotkov@gmail.com
In reply to: Simon Riggs (#69)
#72Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#67)
#73Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#72)
#74Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#73)
#75Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#74)
#76Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#75)
#77Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#67)
#78Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#77)
#79Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#78)
#80Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#79)
#81Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#64)
#82Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#80)
#83Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#1)
#84Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#81)
#85Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#84)
#86Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#84)
#87Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#86)
#88Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#87)
#89Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#88)
#90Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#89)
#91Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#90)
#92Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#91)
#93Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#91)
#94Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#92)
#95Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#94)
#96Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#93)
#97Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#92)
#98Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#92)
#99Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#97)
#100Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#99)
#101Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#100)
#102Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#96)
#103Alexander Korotkov
aekorotkov@gmail.com
In reply to: Robert Haas (#102)
#104Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#103)
#105Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#104)
#106Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#105)
#107Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#104)
#108Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#107)
#109Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#108)
#110Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#109)
#111Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#110)
#112Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#111)
#113Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#112)
#114Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#113)
#115Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#112)
#116Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#115)
#117Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#114)
#118Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#116)
#119Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#116)
#120Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#119)
#121Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#120)
#122Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#119)
#123Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#121)
#124Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#122)
#125Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#121)
#126Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#125)
#127Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#122)
#128Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#127)
#129Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#128)
#130Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#128)
#131Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#130)
#132Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#130)
#133Stefan Keller
sfkeller@gmail.com
In reply to: Alexander Korotkov (#132)
#134Alexander Korotkov
aekorotkov@gmail.com
In reply to: Stefan Keller (#133)
#135Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#132)
#136Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#135)
#137Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#132)
#138Oleg Bartunov
oleg@sai.msu.su
In reply to: Heikki Linnakangas (#135)
#139Alexander Korotkov
aekorotkov@gmail.com
In reply to: Oleg Bartunov (#138)
#140Stefan Keller
sfkeller@gmail.com
In reply to: Alexander Korotkov (#134)
#141Stefan Keller
sfkeller@gmail.com
In reply to: Stefan Keller (#140)
#142Robert Haas
robertmhaas@gmail.com
In reply to: Stefan Keller (#140)
#143Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#137)