Fixing hash index build time

Started by Tom Lane, about 19 years ago. 6 messages. pgsql-hackers.
#1Tom Lane
tgl@sss.pgh.pa.us

There are several reasons why Postgres' hash indexes currently suck,
but one of the bigger ones is that the time to build an index on a
large existing table is excessive, e.g.
http://archives.postgresql.org/pgsql-novice/2007-03/msg00064.php

I'm not sure if this has been discussed before, but I suddenly realized
while responding to the above message that the reason for the awful
performance is pretty obvious: hashbuild starts with a minimum-size
index (two buckets) and repeatedly splits buckets as insertions are
done, exactly the same as ordinary dynamic growth of the index would do.
This means that for an N-row table, approximately N/entries-per-bucket
splits need to occur during index build, which results in roughly O(N^2)
performance because we have to reprocess already-inserted entries over
and over. This explains the empirical observation I made a long time ago:
http://archives.postgresql.org/pgsql-hackers/2002-04/msg01379.php

This could be fixed with a relatively small amount of new code: when
beginning hashbuild, estimate the parent table's rowcount (the same
method used by the planner will do fine, viz RelationGetNumberOfBlocks
times an estimated tuple density) and construct the appropriate number
of buckets immediately. No splits, just write out empty pages as fast
as we can. *Then* do the insertions.
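The pre-sizing idea can be sketched like so (a Python toy rather than the backend's C; `ffactor` and the estimator argument are invented stand-ins for the real hashbuild machinery):

```python
# Sketch of the proposed fix: size the bucket array up front from an
# estimated row count, so no splits happen during the build.  The real
# code would derive the estimate from RelationGetNumberOfBlocks times
# an estimated tuple density; here it is just a parameter.

def buckets_for(estimated_rows, ffactor=10):
    """Smallest power-of-two bucket count that holds the estimate
    at ffactor tuples per bucket."""
    nbuckets = 2                      # the historical starting size
    while nbuckets * ffactor < estimated_rows:
        nbuckets *= 2
    return nbuckets

def prebuild(estimated_rows, ffactor=10):
    # Write out all the empty bucket pages first, then insert.
    return [[] for _ in range(buckets_for(estimated_rows, ffactor))]

print(buckets_for(1_000_000))  # 131072 buckets instead of growing from 2
```

With a decent estimate, the build writes the empty pages sequentially and never pays for a split.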

Comments?

regards, tom lane

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#1)
Re: Fixing hash index build time

I wrote:

I'm not sure if this has been discussed before, but I suddenly realized
while responding to the above message that the reason for the awful
performance is pretty obvious: hashbuild starts with a minimum-size
index (two buckets) and repeatedly splits buckets as insertions are
done, exactly the same as ordinary dynamic growth of the index would do.
This means that for an N-row table, approximately N/entries-per-bucket
splits need to occur during index build, which results in roughly O(N^2)
performance because we have to reprocess already-inserted entries over
and over.

Well, unfortunately this theory seems to be all wet. Given that the
bucket loading is reasonably even, the time to split a bucket is about
constant and so there's no O(N^2) effect. (The multiplier hidden inside
O(N) is pretty awful, but it doesn't change with N.)
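A toy simulation supports the corrected claim: in a linear-hashing-style build, the total number of tuples relocated by splits grows about linearly with N, not quadratically. (This is a simplified Python stand-in; the split rule and fill factor are illustrative, not the backend's actual code.)

```python
# Deterministic toy linear hashing: insert n keys, split the appropriate
# bucket whenever average load exceeds ffactor, and count how many
# tuples the splits relocate.  Linear total work => constant-ish
# moved-per-row ratio as n grows.

def h32(key):
    """Deterministic 32-bit multiplicative hash (Knuth's constant)."""
    return (key * 2654435761) & 0xFFFFFFFF

def bucket_of(h, nbuckets):
    """Map a hash onto the current buckets, linear-hashing style:
    take low bits; if that names a bucket not yet created, drop a bit."""
    mask = 1
    while mask < nbuckets:
        mask <<= 1
    b = h & (mask - 1)
    if b >= nbuckets:
        b &= (mask >> 1) - 1
    return b

def tuples_moved(n, ffactor=10):
    buckets = [[], []]
    total = moved = 0
    for key in range(n):
        h = h32(key)
        buckets[bucket_of(h, len(buckets))].append(h)
        total += 1
        if total > ffactor * len(buckets):
            b_new = len(buckets)
            top = 1
            while top * 2 <= b_new:
                top *= 2
            b_old = b_new - top        # the bucket whose tuples split
            buckets.append([])
            stay = [x for x in buckets[b_old]
                    if bucket_of(x, len(buckets)) == b_old]
            go = [x for x in buckets[b_old]
                  if bucket_of(x, len(buckets)) == b_new]
            buckets[b_old], buckets[b_new] = stay, go
            moved += len(go)
    return moved

print(tuples_moved(2000), tuples_moved(20000))  # roughly 10x, not 100x
```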

The real reason why performance falls off a cliff for building large
hash indexes seems to be much harder to fix: basically, once the size
of your index exceeds working memory, it's nap time. Given that the
incoming data has randomly distributed hash values, each bucket is about
as likely to be touched next as any other; there is no locality of
access and so the "working set" is the same size as the index. Once it
doesn't fit in RAM anymore you're into swap hell.

The only way I can see to fix that is to try to impose some locality of
access during the index build. This is not impossible: for example,
given a choice for the number of buckets, we could sort all the index
tuples by hashed bucket number and then start inserting. btree does a
preliminary sort, and its index build times are way more reasonable
than hash's currently are, so the cost of the sort isn't outrageous.
(I note this is mainly because we know how to do sorting with locality
of access...) Before we start inserting we will know exactly how many
tuples there are, so we can pre-create the right number of buckets and
be sure that no on-the-fly splits will be needed for the rest of the
build. If we guessed wrong about the number of buckets there will be
some places in the process where we concurrently insert into several
buckets, not just one, or perhaps come back to a bucket that we touched
earlier, but that's still maintaining plenty of locality of access.
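The sorted-build idea reads like this as a Python toy (the bucket sizing, fill factor, and names are illustrative, not the real implementation):

```python
# Sketch of the proposed sorted build: the tuple count is known exactly
# at build time, so pick the final bucket count first, sort the index
# tuples by destination bucket, then insert in that order.  Each bucket
# is filled sequentially, so the working set is one bucket at a time.

def sorted_build(hashes, ffactor=10):
    nbuckets = 2
    while nbuckets * ffactor < len(hashes):
        nbuckets *= 2
    mask = nbuckets - 1               # power of two, so masking works
    buckets = [[] for _ in range(nbuckets)]
    # The sort is what restores locality of access: after it, all
    # insertions into bucket i are adjacent in the input stream,
    # and no on-the-fly splits are ever needed.
    for h in sorted(hashes, key=lambda h: h & mask):
        buckets[h & mask].append(h)
    return buckets

table = [(k * 2654435761) & 0xFFFFFFFF for k in range(1000)]
index = sorted_build(table)
```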

This is looking like more work than I want to do in the near future,
but I thought I'd put it into the archives for someone to tackle.
Bruce, would you add a TODO item linking to this:

* Improve hash index build time by sorting

regards, tom lane

#3Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#2)
Re: Fixing hash index build time

Added to TODO:

o During index creation, pre-sort the tuples to improve build speed

http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.php


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#4Hannu Krosing
hannu@tm.ee
In reply to: Bruce Momjian (#3)
Re: Fixing hash index build time

On Wed, 2007-03-21 at 17:25, Bruce Momjian wrote:

Added to TODO:

o During index creation, pre-sort the tuples to improve build speed

http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.php

Maybe the TODO text should mention that it is about HASH indexes?


--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


#5Bruce Momjian
bruce@momjian.us
In reply to: Hannu Krosing (#4)
Re: Fixing hash index build time

Hannu Krosing wrote:

On Wed, 2007-03-21 at 17:25, Bruce Momjian wrote:

Added to TODO:

o During index creation, pre-sort the tuples to improve build speed

http://archives.postgresql.org/pgsql-hackers/2007-03/msg01199.php

Maybe the TODO text should mention that it is about HASH indexes?

It is in the HASH section of the TODO list.


#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#4)
Re: Fixing hash index build time

Hannu Krosing <hannu@skype.net> writes:

On Wed, 2007-03-21 at 17:25, Bruce Momjian wrote:

Added to TODO:
o During index creation, pre-sort the tuples to improve build speed

Maybe the TODO text should mention that it is about HASH indexes?

It's under the hash-index heading.

regards, tom lane