GIN index creation extremely slow ?

Started by Stefan Kaltenbrunnerover 19 years ago11 messages
#1Stefan Kaltenbrunner
stefan@kaltenbrunner.cc

on IRC somebody mentioned that it took >34h to greate a GIN index (on a
tsvector) on a ~3 Million column table (wikipedia dump) with a
reasonable speced box (AMD 3400+).
After getting hold of a dump of said table (around 4,1GB in size) I
managed to get the following timings:

test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector);
CREATE INDEX
Time: 416122.896 ms

so about 7 minutes - sounds very reasonable

test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector);
CREATE INDEX
Time: 52681605.101 ms

ouch - that makes for a whoppy 14,6hours(!). During that time the box is
completely CPU bottlenecked and during virtually no IO at all - (varing
maintainance_work_mem does not seem to make any noticable difference).

That box is a fast Dual Opteron 2.6Ghz with 8GB RAM and a 4 disk RAID10
for the WAL and 12 disks for the data running a very recent -HEAD
checkout ...

It looks like we still don't have any docs for GIN in the tree so I
don't know if those timings are expected or not ...

Stefan

#2Oleg Bartunov
oleg@sai.msu.su
In reply to: Stefan Kaltenbrunner (#1)
Re: GIN index creation extremely slow ?

gin uses maintenance_work_mem,so try to increase it and see dramatic
improvements

Oleg
On Mon, 26 Jun 2006, Stefan Kaltenbrunner wrote:

on IRC somebody mentioned that it took >34h to greate a GIN index (on a
tsvector) on a ~3 Million column table (wikipedia dump) with a
reasonable speced box (AMD 3400+).
After getting hold of a dump of said table (around 4,1GB in size) I
managed to get the following timings:

test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector);
CREATE INDEX
Time: 416122.896 ms

so about 7 minutes - sounds very reasonable

test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector);
CREATE INDEX
Time: 52681605.101 ms

ouch - that makes for a whoppy 14,6hours(!). During that time the box is
completely CPU bottlenecked and during virtually no IO at all - (varing
maintainance_work_mem does not seem to make any noticable difference).

That box is a fast Dual Opteron 2.6Ghz with 8GB RAM and a 4 disk RAID10
for the WAL and 12 disks for the data running a very recent -HEAD
checkout ...

It looks like we still don't have any docs for GIN in the tree so I
don't know if those timings are expected or not ...

Stefan

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#3Christopher Kings-Lynne
chris.kings-lynne@calorieking.com
In reply to: Stefan Kaltenbrunner (#1)
Re: GIN index creation extremely slow ?

It looks like we still don't have any docs for GIN in the tree so I
don't know if those timings are expected or not ...

Ummm my bad. Sorry...

#4Teodor Sigaev
teodor@sigaev.ru
In reply to: Stefan Kaltenbrunner (#1)
Re: GIN index creation extremely slow ?

test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector);
CREATE INDEX
Time: 416122.896 ms

so about 7 minutes - sounds very reasonable

test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector);
CREATE INDEX
Time: 52681605.101 ms

I'll look at this, but GiST time creation is suspiciously small.
Can you test on smaller table, for example with 100000 records and if results
are repeat, pls, send to me test suite...

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#5Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Teodor Sigaev (#4)
Re: GIN index creation extremely slow ?

Teodor Sigaev wrote:

test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector);
CREATE INDEX
Time: 416122.896 ms

so about 7 minutes - sounds very reasonable

test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector);
CREATE INDEX
Time: 52681605.101 ms

I'll look at this, but GiST time creation is suspiciously small.
Can you test on smaller table, for example with 100000 records and if
results are repeat, pls, send to me test suite...

I won't have access to the original testcase and server for a few days
but I just redid some testing on a slower personal box of mine with a
smaller(but similiar) testset and on that box I could not reproduce that
issue.
So the problem is either caused by the size of the table or somehow by
the data itself :-(

Stefan

#6Teodor Sigaev
teodor@sigaev.ru
In reply to: Stefan Kaltenbrunner (#5)
Re: GIN index creation extremely slow ?

I won't have access to the original testcase and server for a few days
but I just redid some testing on a slower personal box of mine with a
smaller(but similiar) testset and on that box I could not reproduce that
issue.
So the problem is either caused by the size of the table or somehow by
the data itself :-(

We tested gin with 5 millions records (but not a wikipedia's text, but with blog
records which is usually mush shorter) and index creation time was near 8 hours
on Dual Xeon 3.2, but we didn't try GiST. In other hand, creating GiST index on
0.5 millions emails took about 12 hours.

That's why 7 minutes is very suspicious result.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#7Martijn van Oosterhout
kleptog@svana.org
In reply to: Teodor Sigaev (#6)
Re: GIN index creation extremely slow ?

On Wed, Jun 28, 2006 at 01:09:23PM +0400, Teodor Sigaev wrote:

We tested gin with 5 millions records (but not a wikipedia's text, but with
blog records which is usually mush shorter) and index creation time was
near 8 hours on Dual Xeon 3.2, but we didn't try GiST. In other hand,
creating GiST index on 0.5 millions emails took about 12 hours.

That's why 7 minutes is very suspicious result.

Tom did commit a patch a while ago which made a huge difference in
index creation time for tsearch by changing one routine. I don't know
if it got backpatched, so it might be worth checking people are working
on the same version.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

From each according to his ability. To each according to his ability to litigate.

#8Oleg Bartunov
oleg@sai.msu.su
In reply to: Martijn van Oosterhout (#7)
Re: GIN index creation extremely slow ?

On Wed, 28 Jun 2006, Martijn van Oosterhout wrote:

On Wed, Jun 28, 2006 at 01:09:23PM +0400, Teodor Sigaev wrote:

We tested gin with 5 millions records (but not a wikipedia's text, but with
blog records which is usually mush shorter) and index creation time was
near 8 hours on Dual Xeon 3.2, but we didn't try GiST. In other hand,
creating GiST index on 0.5 millions emails took about 12 hours.

That's why 7 minutes is very suspicious result.

Tom did commit a patch a while ago which made a huge difference in
index creation time for tsearch by changing one routine. I don't know
if it got backpatched, so it might be worth checking people are working
on the same version.

it's on 8.1.4 and speedup is about 10 times.

Have a nice day,

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#9Teodor Sigaev
teodor@sigaev.ru
In reply to: Martijn van Oosterhout (#7)
Re: GIN index creation extremely slow ?

Tom did commit a patch a while ago which made a huge difference in
index creation time for tsearch by changing one routine. I don't know
if it got backpatched, so it might be worth checking people are working
on the same version.

I saw that patch, but I still think that 7 minutes is too small :)

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#10Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Teodor Sigaev (#9)
Re: GIN index creation extremely slow ?

Teodor Sigaev wrote:

Tom did commit a patch a while ago which made a huge difference in
index creation time for tsearch by changing one routine. I don't know
if it got backpatched, so it might be worth checking people are working
on the same version.

I saw that patch, but I still think that 7 minutes is too small :)

hmm I did some further testing on that and it looks like you might
indeed be right in suspecting that there is something fishy with the
GIST results.
It might be possible that there was some issue with the generated
tsvectors (all of them empty due to hitting the "too long value" error
case for exceeding MAXSTRPOS) in the GIST case - sorry for the confusion
:-(

Speaking of the "too long value" error message - some of the
errormessages in tsvector.c are a bit "terse" (ie it is not really
obvious what is causing the above error without looking at the source
for example).

Stefan

#11Teodor Sigaev
teodor@sigaev.ru
In reply to: Stefan Kaltenbrunner (#1)
Re: GIN index creation extremely slow ?

Try again, today's patch solves the problem.

Stefan Kaltenbrunner wrote:

on IRC somebody mentioned that it took >34h to greate a GIN index (on a
tsvector) on a ~3 Million column table (wikipedia dump) with a
reasonable speced box (AMD 3400+).
After getting hold of a dump of said table (around 4,1GB in size) I
managed to get the following timings:

test=# CREATE INDEX idxFTI_idx ON wikipedia USING gist(vector);
CREATE INDEX
Time: 416122.896 ms

so about 7 minutes - sounds very reasonable

test=# CREATE INDEX idxFTI2_idx ON wikipedia USING gin(vector);
CREATE INDEX
Time: 52681605.101 ms

ouch - that makes for a whoppy 14,6hours(!). During that time the box is
completely CPU bottlenecked and during virtually no IO at all - (varing
maintainance_work_mem does not seem to make any noticable difference).

That box is a fast Dual Opteron 2.6Ghz with 8GB RAM and a 4 disk RAID10
for the WAL and 12 disks for the data running a very recent -HEAD
checkout ...

It looks like we still don't have any docs for GIN in the tree so I
don't know if those timings are expected or not ...

Stefan

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/