Parallel index build during COPY

Started by Jim C. Nasby · over 19 years ago · 3 messages
#1Jim C. Nasby
jnasby@pervasive.com

It's not uncommon for index creation to account for a substantial share of
data-loading time, even when using the 'trick' of loading the data
before building the indexes. On fast RAID arrays, it's also possible for
this to be a CPU-bound operation, so I've been wondering if there's
some reasonable way to parallelize it in the context of a restore from
pg_dump. Needless to say, that's a non-trivial proposition.

But the thought occurred to me: why read from the table we just loaded
multiple times to create the indexes on it? If we're loading into an
empty table, we could feed newly created pages (or tuples) into sort
processes, one for each index. After the entire table is loaded, each
sort could then be finalized, and the appropriate index written out.
It's unclear if this would be a win on a small table, but not needing to
make multiple read passes over a large table would almost certainly be a
win.
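The single-pass idea above can be illustrated with a rough sketch: while rows stream in once, each index's keys are captured on the fly, and the per-index sorts are finalized concurrently after the load completes. This is a toy model in Python, not PostgreSQL internals; the function and data structures are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def bulk_load_with_index_builds(rows, index_keys):
    """rows: iterable of dicts; index_keys: one column name per index
    to build. Hypothetical sketch: a real implementation would feed
    on-disk sort runs, not in-memory lists."""
    buffers = {col: [] for col in index_keys}
    table = []
    for row in rows:                      # single pass over the data
        table.append(row)
        for col in index_keys:            # feed every index's sort
            buffers[col].append((row[col], len(table) - 1))
    # finalize each index's sort in parallel once the load completes
    with ThreadPoolExecutor() as pool:
        futures = {col: pool.submit(sorted, buf)
                   for col, buf in buffers.items()}
        indexes = {col: f.result() for col, f in futures.items()}
    return table, indexes
```

The point of the sketch is that the heap data is read exactly once, no matter how many indexes are being built; only the sort work multiplies.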

If someone wants to hack up a patch to allow testing this, I can get
some benchmark numbers.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#2Toru SHIMOGAKI
shimogaki.toru@oss.ntt.co.jp
In reply to: Jim C. Nasby (#1)
Re: Parallel index build during COPY

NTT has some ideas about index creation during large data loads.
Our approach is the following: index tuples are created at the same time as the heap
tuples and fed into a heap sort. In addition, if the target table already contains
data, we use the old index tuples as an already-sorted list, so the loader does not
need to sort all the index tuples, old ones included. After only the new index tuples
are sorted, the two sorted lists are merged and the whole index is built. This can save
both CPU resources and disk accesses dramatically, especially if the target
table already holds many tuples.
This approach needs to acquire a table lock that differs from COPY's lock mode,
so we have developed it as a separate bulk-load tool. We will talk about it at the
PostgreSQL Anniversary Conference in Toronto. Thanks to Josh for coordinating.
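The merge step described above can be sketched in a few lines: the existing index is already a sorted run, so only the new tuples' keys need sorting, and a single linear merge yields the complete index. A toy Python illustration (names are invented; a real index merges on-disk runs of index tuples, not Python lists):

```python
import heapq

def rebuild_index(old_sorted_entries, new_entries):
    """old_sorted_entries: keys from the existing index, already sorted.
    new_entries: keys of freshly loaded tuples, in arbitrary order."""
    new_sorted = sorted(new_entries)          # sort only the new keys
    # merge the two sorted runs in linear time instead of
    # re-sorting old + new from scratch
    return list(heapq.merge(old_sorted_entries, new_sorted))
```

The saving grows with the size of the existing table: the old keys are never re-sorted, only streamed through the merge.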

Best regards,

Jim C. Nasby wrote:

It's not uncommon for index creation to take a substantial amount of
time for loading data, even when using the 'trick' of loading the data
before building the indexes. On fast RAID arrays, it's also possible for
this to be a CPU-bound operation, so I've been wondering if there was
some reasonable way to parallelize it in the context of a restore from
pg_dump. Needless to say, that's a non-trivial proposition.

But the thought occurred to me: why read from the table we just loaded
multiple times to create the indexes on it? If we're loading into an
empty table, we could feed newly created pages (or tuples) into sort
processes, one for each index. After the entire table is loaded, each
sort could then be finalized, and the appropriate index written out.
It's unclear if this would be a win on a small table, but not needing to
make multiple read passes over a large table would almost certainly be a
win.

If someone wants to hack up a patch to allow testing this, I can get
some benchmark numbers.

--
Toru SHIMOGAKI
NTT Opensource Software Center <shimogaki.toru@oss.ntt.co.jp>

#3Jim Nasby
jnasby@pervasive.com
In reply to: Toru SHIMOGAKI (#2)
Re: Parallel index build during COPY

On Jun 15, 2006, at 9:45 PM, Toru SHIMOGAKI wrote:

NTT has some ideas about index creation during large data
loads. Our approach is the following: index tuples are
created at the same time as the heap tuples and fed into a heap sort. In
addition, if the target table already contains data, we use the old
index tuples as an already-sorted list, so the loader does not need
to sort all the index tuples, old ones included. After only the new
index tuples are sorted, the two sorted lists are merged and the whole
index is built. This can save both CPU resources and disk accesses
dramatically, especially if the target table already holds many
tuples.
This approach needs to acquire a table lock that differs from COPY's
lock mode, so we have developed it as a separate bulk-load tool. We
will talk about it at the PostgreSQL Anniversary Conference in Toronto.
Thanks to Josh for coordinating.

So does that mean you're able to do all that without hacking the
back-end? Impressive. :)

I look forward to hearing about it.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461