more than one index in a single heap pass?

Started by Andrew Dunstanover 16 years ago11 messages
#1Andrew Dunstan
andrew@dunslane.net

I was just wondering idly today if we could usefully build a number of
indexes at the same time in a single pass over the heap, or could it be
that we wouldn't gain much? I haven't even got around to thinking about
any syntax for it.

cheers

andrew

#2Alvaro Herrera
alvherre@commandprompt.com
In reply to: Andrew Dunstan (#1)
Re: more than one index in a single heap pass?

Andrew Dunstan wrote:

I was just wondering idly today if we could usefully build a number of
indexes at the same time in a single pass over the heap, or could it be
that we wouldn't gain much? I haven't even got around to thinking about
any syntax for it.

Could we make it work on two backends building one index each in
synchronized scans?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#2)
Re: more than one index in a single heap pass?

Alvaro Herrera <alvherre@commandprompt.com> writes:

Andrew Dunstan wrote:

I was just wondering idly today if we could usefully build a number of
indexes at the same time in a single pass over the heap, or could it be
that we wouldn't gain much? I haven't even got around to thinking about
any syntax for it.

Could we make it work on two backends building one index each in
synchronized scans?

Don't we more or less have that already?

regards, tom lane

#4Greg Stark
gsstark@mit.edu
In reply to: Tom Lane (#3)
Re: more than one index in a single heap pass?

On Tue, Jul 14, 2009 at 8:50 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Andrew Dunstan wrote:

I was just wondering idly today if we could usefully build a number of
indexes at the same time in a single pass over the heap, or could it be
that we wouldn't gain much? I haven't even got around to thinking about
any syntax for it.

Could we make it work on two backends building one index each in
synchronized scans?

Don't we more or less have that already?

Wasn't that a big part of the point of the "parallel pg_restore" feature?

--
greg
http://mit.edu/~gsstark/resume.pdf

#5Andrew Dunstan
andrew@dunslane.net
In reply to: Greg Stark (#4)
Re: more than one index in a single heap pass?

Greg Stark wrote:

On Tue, Jul 14, 2009 at 8:50 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Andrew Dunstan wrote:

I was just wondering idly today if we could usefully build a number of
indexes at the same time in a single pass over the heap, or could it be
that we wouldn't gain much? I haven't even got around to thinking about
any syntax for it.

Could we make it work on two backends building one index each in
synchronized scans?

Don't we more or less have that already?

Wasn't that a big part of the point of the "parallel pg_restore" feature?

Well, yes, it's some of it, and in theory Tom's late addition of a queue
that gets all the dependencies of a table as soon as the table data is
restored should make that work better. But of course, that's not the
only time indexes are created, and each index creation command will be
doing its own heap processing, albeit that synchronised scanning will
make that lots cheaper.

As I said originally, it was just an idle thought that came to me today.

cheers

andrew

#6Glen Parker
glenebob@nwlink.com
In reply to: Andrew Dunstan (#5)
Re: more than one index in a single heap pass?

Andrew Dunstan wrote:

Well, yes, it's some of it, and in theory Tom's late addition of a queue
that gets all the dependencies of a table as soon as the table data is
restored should make that work better. But of course, that's not the
only time indexes are created, and each index creation command will be
doing its own heap processing, albeit that synchronised scanning will
make that lots cheaper.

As I said originally, it was just an idle thought that came to me today.

Sounds to me like another reason to separate index definition from
creation. If an index can be defined but not yet created or valid, then
you could imagine syntax like:

DEFINE INDEX blahblah1 ON mytable (some fields);
DEFINE INDEX blahblah2 ON mytable (some other fields);
[RE]INDEX TABLE mytable;

...provided that REINDEX TABLE could recreate all indexes simultaneously
as you suggest.

-Glen

#7Bruce Momjian
bruce@momjian.us
In reply to: Andrew Dunstan (#5)
Re: more than one index in a single heap pass?

Andrew Dunstan wrote:

Wasn't that a big part of the point of the "parallel pg_restore" feature?

Well, yes, it's some of it, and in theory Tom's late addition of a queue
that gets all the dependencies of a table as soon as the table data is
restored should make that work better. But of course, that's not the
only time indexes are created, and each index creation command will be
doing its own heap processing, albeit that synchronised scanning will
make that lots cheaper.

As I said originally, it was just an idle thought that came to me today.

Well, TODO has:

Allow multiple indexes to be created concurrently, ideally via a
single heap scan, and have pg_restore use it

Isn't this already largely done by parallel pg_restore work?

so we have to decide if we still want that item. I think what we don't
have is a way to create multiple indexes simultaneously via SQL.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#7)
Re: more than one index in a single heap pass?

Bruce Momjian <bruce@momjian.us> writes:

Well, TODO has:

Allow multiple indexes to be created concurrently, ideally via a
single heap scan, and have pg_restore use it

Isn't this already largely done by parallel pg_restore work?

so we have to decide if we still want that item. I think what we don't
have is a way to create multiple indexes simultaneously via SQL.

And if we did build that, people would be bleating because it could only
make use of one CPU. I think multiple backends using the existing
syncscan infrastructure meets this need pretty well already.

regards, tom lane

#9Dimitri Fontaine
dfontaine@hi-media.com
In reply to: Glen Parker (#6)
Re: more than one index in a single heap pass?

Hi,

Le 15 juil. 09 à 02:01, Glen Parker a écrit :

Sounds to me like another reason to separate index definition from
creation. If an index can be defined but not yet created or valid,
then you could imagine syntax like:

DEFINE INDEX blahblah1 ON mytable (some fields);
DEFINE INDEX blahblah2 ON mytable (some other fields);
[RE]INDEX TABLE mytable;

...provided that REINDEX TABLE could recreate all indexes
simultaneously as you suggest.

Well to me it sounded much more like:
BEGIN;
CREATE INDEX idx_a ON t(a) DEFERRED;
CREATE INDEX idx_b ON t(b) DEFERRED;
COMMIT;

And at commit time, PostgreSQL would build all the transaction indexes
in one pass over the heap, but as Tom already pointed out, using only
1 CPU. Maybe that'd be a way to limit the overall io bandwidth usage
while not consuming too many CPU resources at the same time.

I mean now we have a choice to either sync scan the table heap on
multiple CPU, saving IO but using 1 CPU per index, or to limit CPU to
only 1 but then scan the heap once per index. The intermediary option
of using 1 CPU while still making a single heap scan sure can be
worthwhile to some?

Regards,
--
dim

#10decibel
decibel@decibel.org
In reply to: Dimitri Fontaine (#9)
Re: more than one index in a single heap pass?

On Jul 15, 2009, at 2:52 PM, Dimitri Fontaine wrote:

Le 15 juil. 09 à 02:01, Glen Parker a écrit :

Sounds to me like another reason to separate index definition from
creation. If an index can be defined but not yet created or
valid, then you could imagine syntax like:

DEFINE INDEX blahblah1 ON mytable (some fields);
DEFINE INDEX blahblah2 ON mytable (some other fields);
[RE]INDEX TABLE mytable;

...provided that REINDEX TABLE could recreate all indexes
simultaneously as you suggest.

Well to me it sounded much more like:
BEGIN;
CREATE INDEX idx_a ON t(a) DEFERRED;
CREATE INDEX idx_b ON t(b) DEFERRED;
COMMIT;

And at commit time, PostgreSQL would build all the transaction
indexes in one pass over the heap, but as Tom already pointed out,
using only 1 CPU. Maybe that'd be a way to limit the overall io
bandwidth usage while not consuming too many CPU resources at the
same time.

I mean now we have a choice to either sync scan the table heap on
multiple CPU, saving IO but using 1 CPU per index, or to limit CPU
to only 1 but then scan the heap once per index. The intermediary
option of using 1 CPU while still making a single heap scan sure
can be worthwhile to some?

Here's an off-the-wall thought... since most of the CPU time is in
the sort, what about allowing a backend to fork off dedicated sort
processes? Aside from building multiple indexes at once, that
functionality could also be useful in general queries.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

#11Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: decibel (#10)
Re: more than one index in a single heap pass?

decibel wrote:

Here's an off-the-wall thought... since most of the CPU time is in the
sort, what about allowing a backend to fork off dedicated sort
processes? Aside from building multiple indexes at once, that
functionality could also be useful in general queries.

Sure, that would be cool. And also a lot of work :-). The comparison
operators can be arbitrarily complex, potentially querying other tables
etc, so you would indeed need pretty much all the infrastrucutre you
need to solve the general case.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com