still gin index creation takes forever
I'm still fighting with a GIN index creation that randomly takes very
long.
At the beginning I had a pretty long transaction that filled several
tables starting from some "temporary"[1] tables.
After filling the tables I updated a tsvector column in one of them,
and finally, in another connection, I created a GIN index on that
column.
I thought that I was encountering problems every time the index
creation was made in the same connection as the previous
transaction...
Actually that happened just by chance.
Now, no matter whether I open another connection to recreate the
index, it may still take forever to recreate it.
maintenance_work_mem is set to 200MB dynamically just before the
index gets created.
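Roughly, the sequence is this (a sketch; the index name below is just a placeholder, the table and column are mine):

```sql
-- Sketch: what I run just before the build (index name is a placeholder)
SET maintenance_work_mem = '200MB';
CREATE INDEX catalog_items_ft1idx_gin
    ON catalog_items USING gin (ft1idx);
```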
Occasionally I can recreate the index in ~6min, especially if I did
some black magic before (e.g. VACUUM FULL, restarting the DB), but
very long index creation happens mysteriously even when there is no
process stealing CPU cycles.
Anyway, while I'm creating a GIN index, CPU use is very high, staying
constantly near 100%.
Any suggestion about how to track down the problem?
thanks
[1]: temporary but not strictly temp tables
--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
Any suggestion about how to track down the problem?
What you are describing sounds rather like a use-of-uninitialized-memory
problem, wherein the behavior depends on what happened to be in that
memory previously. If so, using a debug/cassert-enabled build of
Postgres might help to make the behavior more reproducible.
(Of course, if the result is that it's reproducibly fast, this doesn't
get us much closer to solving the problem :-(. But it seems worth
trying.)
regards, tom lane
On Tue, 11 Nov 2008 22:02:17 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
Any suggestion about how to track down the problem?
What you are describing sounds rather like a
use-of-uninitialized-memory problem, wherein the behavior depends
on what happened to be in that memory previously. If so, using a
debug/cassert-enabled build of Postgres might help to make the
behavior more reproducible.
(Of course, if the result is that it's reproducibly fast, this
doesn't get us much closer to solving the problem :-(. But it
seems worth trying.)
There is no such beast for Debian etch/sid.
Fortunately the re-indexing will happen very seldom, and I can just
split the two parts so that I'll do my superstitious rituals before
re-indexing.
But it's like living with a ghost in the house, and at the moment
compiling Postgres is out of my reach.
I'm surprised I'm the only one experiencing this problem, and I think
I'm using a quite popular set of packages (etch + postgresql
backport), so I'm wondering whether postgresql really deserves the
blame or it's something else.
But I can't think of any "strange" behaviour on my part that could
explain what's happening.
There are times (seldom, actually) when the index gets created in
around 6 minutes, and times when it takes forever even when the box
is not under load. Re-indexing with GiST always succeeds in around
2min.
I even stopped the server and reloaded everything from backup.
During the restore, index creation happened in a reasonable time.
The restore didn't report any error, but the behaviour is still there.
So maybe this is triggered by some combination of postgresql
configuration settings.
--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Any suggestion about how to track down the problem?
What you are describing sounds rather like a use-of-uninitialized-memory
problem, wherein the behavior depends on what happened to be in that
memory previously. If so, using a debug/cassert-enabled build of
Postgres might help to make the behavior more reproducible.
It seems to me a possible reason for that behavior could be the order of the
table's scanning. GIN's build algorithm prefers a scan from the beginning to
the end of the table, but in 8.3 that's not always what happens - the scan
may begin from the middle or the end of the table depending on sequence
scan history.
GIN's build algorithm can use bulk insert of ItemPointers if and only if
they are to be inserted on the rightmost page (exact piece of code -
dataPlaceToPage() in gindatapage.c, lines 407-427).
Is there any way to force the table's scan from the beginning?
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
GIN's build algorithm could use bulk insert of ItemPointers if and only if they
should be inserted on rightmost page (exact piece of code - dataPlaceToPage() in
gindatapage.c, lines 407-427)
I'm not following. Rightmost page of what --- it can't be the whole
index, can it, or the case would hardly ever apply?
Is any way to force table's scan from the beginning?
We could extend IndexBuildHeapScan's API to support that, but I'm
not quite convinced that this is the issue.
regards, tom lane
GIN's build algorithm could use bulk insert of ItemPointers if and only if they
should be inserted on rightmost page (exact piece of code - dataPlaceToPage() in
gindatapage.c, lines 407-427)
I'm not following. Rightmost page of what --- it can't be the whole
index, can it, or the case would hardly ever apply?
A GIN index contains a btree over keys (the entry tree), and for each key it
contains either a list of ItemPointers (a posting list) or a btree over
ItemPointers (a posting tree or data tree), depending on their quantity. The
bulk insertion process collects keys and sorted arrays of ItemPointers into
memory, and then, for each key, it tries to insert every ItemPointer from
the array into the corresponding data tree one by one. But if the smallest
ItemPointer in the array is greater than the biggest stored one, then the
algorithm will insert the whole array on the rightmost page of the data tree.
So, in that case the process can insert about 1000 ItemPointers per data
tree lookup; in the opposite case it does 1000 lookups in the data tree.
Teodor Sigaev <teodor@sigaev.ru> writes:
I'm not following. Rightmost page of what --- it can't be the whole
index, can it, or the case would hardly ever apply?
A GIN index contains a btree over keys (the entry tree), and for each key it
contains either a list of ItemPointers (a posting list) or a btree over
ItemPointers (a posting tree or data tree), depending on their quantity. The
bulk insertion process collects keys and sorted arrays of ItemPointers into
memory, and then, for each key, it tries to insert every ItemPointer from
the array into the corresponding data tree one by one. But if the smallest
ItemPointer in the array is greater than the biggest stored one, then the
algorithm will insert the whole array on the rightmost page of the data tree.
So, in that case the process can insert about 1000 ItemPointers per data
tree lookup; in the opposite case it does 1000 lookups in the data tree.
I see. So this could explain Ivan's issue if his table contains large
numbers of repeated GIN keys. Ivan, is that what your data looks like?
regards, tom lane
On Wed, 12 Nov 2008 15:18:05 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:
So, in that case process can insert about 1000 ItemPointers per
one data tree lookup, in opposite case it does 1000 lookups in
data tree.
I see. So this could explain Ivan's issue if his table contains
large numbers of repeated GIN keys. Ivan, is that what your data
looks like?
Well, if by GIN keys you mean lexemes, it could be. But I wouldn't say
this circumstance is uncommon among users of tsearch. I'd expect
other people have used tsearch2 to search through titles, authors and
publishers of books, so if that were the problem I'd expect it
to have come up earlier.
Actually tsearch2 is not completely tuned up, since I still have to
"mix" the Italian and English configurations to get rid of some more
stop words etc.; that may increase the number of repetitions, but I
doubt this alone puts me in a corner case.
Anyway, trying to answer your question in a more objective way, I
ran:
SELECT * FROM ts_stat('SELECT ft1idx FROM catalog_items') order by
nentry desc, ndoc desc limit 20;
It ran for over 9h and I still wasn't able to get an answer.
I killed the psql client that was running it, and postgres continued
to eat 100% CPU for a while until I stopped it.
Considering that running:
SELECT * FROM ts_stat('SELECT ft1idx FROM catalog_items limit 50000')
order by nentry desc, ndoc desc limit 20;
returned in less than 2 minutes, and catalog_items has a bit less
than 1M records... there is still something weird.
"springer";10824;10833
"e";7703;8754
"di";6815;7771
"il";5622;6168
"la";4989;5407
"hall";4357;4416
"prentic";4321;4369
"l";3920;4166
"del";3092;3281
"edizioni";2465;2465
"della";2292;2410
"m";2283;2398
"dell";2150;2281
"j";1967;2099
"d";1789;1864
"per";1685;1770
"longman";1671;1746
"le";1656;1736
"press";1687;1687
"de";1472;1564
Examining 90K records took a bit more than 6min.
I'll try to move everything on another box and see what happens.
--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
I see. So this could explain Ivan's issue if his table contains
large numbers of repeated GIN keys. Ivan, is that what your data
looks like?
Well, if by GIN keys you mean lexemes, it could be. But I wouldn't say
this circumstance is uncommon among users of tsearch. I'd expect
other people have used tsearch2 to search through titles, authors and
publishers of books, so if that were the problem I'd expect it
to have come up earlier.
Yeah, I'm not convinced either. Still, Teodor's theory should be easily
testable: set synchronize_seqscans to FALSE and see if the problem goes
away.
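Something like this (using Ivan's table and column names from upthread; the index name is a placeholder):

```sql
-- Session-local setting; Teodor's theory predicts the build is fast now
SET synchronize_seqscans = off;
CREATE INDEX catalog_items_ft1idx_gin
    ON catalog_items USING gin (ft1idx);
```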
regards, tom lane
Yeah, I'm not convinced either. Still, Teodor's theory should be easily
testable: set synchronize_seqscans to FALSE and see if the problem goes
away.
Test suite to reproduce the problem:
DROP TABLE IF EXISTS foo;
DROP TABLE IF EXISTS footmp;
CREATE OR REPLACE FUNCTION gen_array()
RETURNS _int4 AS
$$
SELECT ARRAY(
SELECT (random()*1000)::int
FROM generate_series(1,10+(random()*90)::int)
)
$$
LANGUAGE SQL VOLATILE;
SELECT gen_array() AS v INTO foo FROM generate_series(1,100000);
VACUUM ANALYZE foo;
CREATE INDEX fooidx ON foo USING gin (v);
DROP INDEX fooidx;
SELECT * INTO footmp FROM foo LIMIT 90000;
CREATE INDEX fooidx ON foo USING gin (v);
DROP INDEX fooidx;
On my notebook with HEAD and the default postgresql.conf it produces
(showing only the interesting part):
postgres=# CREATE INDEX fooidx ON foo USING gin (v);
Time: 14961,409 ms
postgres=# SELECT * INTO footmp FROM foo LIMIT 90000;
postgres=# CREATE INDEX fooidx ON foo USING gin (v);
LOG: checkpoints are occurring too frequently (12 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (8 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (7 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (10 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (8 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
CREATE INDEX
Time: 56286,507 ms
So, creation time is 4 times bigger after the SELECT.
Without "SELECT * INTO footmp FROM foo LIMIT 90000;":
postgres=# CREATE INDEX fooidx ON foo USING gin (v);
CREATE INDEX
Time: 13894,050 ms
postgres=# CREATE INDEX fooidx ON foo USING gin (v);
LOG: checkpoints are occurring too frequently (14 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
CREATE INDEX
Time: 15087,348 ms
Nearly the same time.
With synchronize_seqscans = off and SELECT:
postgres=# CREATE INDEX fooidx ON foo USING gin (v);
CREATE INDEX
Time: 14452,024 ms
postgres=# SELECT * INTO footmp FROM foo LIMIT 90000;
postgres=# CREATE INDEX fooidx ON foo USING gin (v);
LOG: checkpoints are occurring too frequently (16 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
CREATE INDEX
Time: 14557,750 ms
Again, nearly the same time.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
We could extend IndexBuildHeapScan's API to support that, but I'm
not quite convinced that this is the issue.
That extension might be useful for the bitmap index too, to simplify the
index creation process.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
Yeah, I'm not convinced either. Still, Teodor's theory should be easily
testable: set synchronize_seqscans to FALSE and see if the problem goes
away.
Test suit to reproduce the problem:
I don't doubt that you're describing a real effect, I'm just not sure
yet that it's the same thing Ivan is seeing. He seems to be talking
about more than 4x differences.
regards, tom lane
On Thu, 13 Nov 2008 09:11:05 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Teodor Sigaev <teodor@sigaev.ru> writes:
Yeah, I'm not convinced either. Still, Teodor's theory should
be easily testable: set synchronize_seqscans to FALSE and see
if the problem goes away.
Test suite to reproduce the problem:
I don't doubt that you're describing a real effect, I'm just not
sure yet that it's the same thing Ivan is seeing. He seems to be
talking about more than 4x differences.
Yes... 6min compared to something that spans a whole night and is far
more than what I'm willing to wait to give an exact measure, since it
looks likely to outlast the box itself.
Anyway... I'll try Teodor's trick to see if it somehow circumvents
the real cause, and I'll try everything on another box
ASAP.
thanks to all
--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Teodor Sigaev <teodor@sigaev.ru> writes:
We could extend IndexBuildHeapScan's API to support that, but I'm
not quite convinced that this is the issue.
That extension might be useful for bitmap index too to simplify index creation
process.
Maybe, but in any case the measurable GIN speed penalty justifies
changing it; I've applied a patch for that. I'm still not quite
convinced that Ivan isn't seeing some other issue though.
In the meantime, I noticed something odd while experimenting with your
test case: when running with default maintenance_work_mem = 16MB,
there is a slowdown of 3x or 4x for the un-ordered case, just as you
say. But at maintenance_work_mem = 200MB I see very little difference.
This doesn't make sense to me --- it seems like a larger workspace
should result in more difference because of greater chance to dump a
lot of tuples into the index at once. Do you know why that's happening?
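For reference, the comparison I ran looks roughly like this against your test
tables (a sketch, not the exact script):

```sql
-- Un-ordered case: the SELECT leaves the synchronized seqscan start
-- position mid-table, so the next scan doesn't begin at block 0
DROP TABLE IF EXISTS footmp;
SET maintenance_work_mem = '16MB';
SELECT * INTO footmp FROM foo LIMIT 90000;
CREATE INDEX fooidx ON foo USING gin (v);   -- 3x-4x slower at 16MB
DROP INDEX fooidx;
SET maintenance_work_mem = '200MB';
CREATE INDEX fooidx ON foo USING gin (v);   -- very little difference at 200MB
DROP INDEX fooidx;
```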
regards, tom lane
changing it; I've applied a patch for that. I'm still not quite
convinced that Ivan isn't seeing some other issue though.
Thank you
In the meantime, I noticed something odd while experimenting with your
test case: when running with default maintenance_work_mem = 16MB,
there is a slowdown of 3x or 4x for the un-ordered case, just as you
say. But at maintenance_work_mem = 200MB I see very little difference.
This doesn't make sense to me --- it seems like a larger workspace
should result in more difference because of greater chance to dump a
lot of tuples into the index at once. Do you know why that's happening?
I suppose that if maintenance_work_mem is big enough, then all the index
data accumulates in memory and is written to disk at once. With that
test's options, the size of the index is about 40MB.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Thu, 13 Nov 2008 09:11:05 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Teodor Sigaev <teodor@sigaev.ru> writes:
Yeah, I'm not convinced either. Still, Teodor's theory should
be easily testable: set synchronize_seqscans to FALSE and see
if the problem goes away.
Test suite to reproduce the problem:
I don't doubt that you're describing a real effect, I'm just not
sure yet that it's the same thing Ivan is seeing. He seems to be
talking about more than 4x differences.
I just tested dropping the index and recreating it on a much slower
box (a Core 2 Duo notebook with 1GB RAM) with absolutely no tuning in
postgresql.conf (a "pure" lenny box rather than etch with backported
postgresql).
It seems a bit faster than the fastest time I've been able to obtain
on a 2x Xeon HT 3.2GHz, 4GB RAM and SCSI RAID1.
It's far from being a scientific measurement. I'll try to do more
experiments later to collect more data and see whether it happened
by chance.
--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Hi,
I would like to know if Postgres 8.2.x is supported on Windows 2008 (32
bit). If not, what version of Postgres is supported on Windows 2008?
Please let me know.
Thanks
Raji
On 16/03/2009 22:42, Raji Sridar (raji) wrote:
I would like to know if Postgres 8.2.x is supported on Windows 2008 (32
bit). If not, what version of Postgres is supported on Windows 2008?
Please let me know.
There's a list of supported platforms in the manual - have a rummage there.
Ray.
------------------------------------------------------------------
Raymond O'Donnell, Director of Music, Galway Cathedral, Ireland
rod@iol.ie
Galway Cathedral Recitals: http://www.galwaycathedral.org/recitals
------------------------------------------------------------------
On Mon, 2009-03-16 at 22:55 +0000, Raymond O'Donnell wrote:
On 16/03/2009 22:42, Raji Sridar (raji) wrote:
I would like to know if Postgres 8.2.x is supported on Windows 2008 (32
bit). If not, what version of Postgres is supported on Windows 2008?
Please let me know.
There's a list of supported platforms in the manual - have a rummage there.
http://www.postgresql.org/docs/8.2/static/supported-platforms.html
Sincerely,
Joshua D. Drake
--
PostgreSQL - XMPP: jdrake@jabber.postgresql.org
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
On Mon, Mar 16, 2009 at 4:42 PM, Raji Sridar (raji) <raji@cisco.com> wrote:
Hi,
I would like to know if Postgres 8.2.x is supported on Windows 2008 (32
bit). If not, what version of Postgres is supported on Windows 2008?
Please let me know.
Due to some problems with older versions of pgsql on Windows, it is
now recommended to go to 8.3 or later versions of postgresql. I
believe updates for previous versions are no longer supplied, but I'm
a unix guy, so I could be off by one version. I know that 8.0 and 8.1
pgsql are definitely no longer supported on Windows.
Also, pgsql is always 32 bit on Windows, whether the OS is 32 or 64 bit.