Online index builds
I just sent in the patch for online index builds to -patches.
. The work to combine the two phases into a single non-transactional command
is done. I'm not sure how long to wait between lock checks or how verbose to
be about why it's taking so long. I do think we have to print something or
else the DBA won't know if it's hung waiting for something external.
Currently it prints a notice the first time it sleeps.
. Also it prints out how many tuples it found which normal index doesn't.
Probably that message should go away. On the other hand the index stats
probably need to be filled in.
. I need to check what locks I'm taking. I think I still have some old code
with the wrong locks in it.
. this includes the tid btree opclass sent earlier (with a warning I didn't
notice before fixed up). opr_sanity now fails but I think that's due to the
gin commits not this opclass.
. In case of an error during phase2 the invalid index is left in place. It can
be dropped with DROP INDEX. The footwork to get it dropped in case of an
error would be quite tricky but there's a sketch of how to do it in the source.
. no documentation yet, there's not much to write though.
. no regression tests yet. I don't see any way to test this reasonably in the
regression tests. I've done some testing myself by building indexes while
pgbench is running. Then I have to do index scans to see how many records
are returned with index scans. It wouldn't be easy to automate and even if
it were done it wouldn't really be all that great a test. The corner cases
found during the development are pretty narrow and will be hard to reliably
test.
--
greg
On Wed, 2006-07-12 at 12:09 -0400, Greg Stark wrote:
no regression tests yet.
We'll need some performance tests that show that lock-hold time is
*actually* reduced, given the shenanigans needed to get there.
We may need to have usage recommendations in the docs.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes:
On Wed, 2006-07-12 at 12:09 -0400, Greg Stark wrote:
no regression tests yet.
We'll need some performance tests that show that lock-hold time is
*actually* reduced, given the shenanigans needed to get there.
Reducing lock hold time is not the point ... reducing the strength of
the lock at the cost of increased elapsed time is the point.
We may need to have usage recommendations in the docs.
Agreed.
regards, tom lane
Simon Riggs <simon@2ndquadrant.com> writes:
On Wed, 2006-07-12 at 12:09 -0400, Greg Stark wrote:
no regression tests yet.
We'll need some performance tests that show that lock-hold time is
*actually* reduced, given the shenanigans needed to get there.
I'm not sure what you mean by "lock-hold time". Online index builds
effectively take *no* locks in the user-visible sense that regular index
builds do. Other transactions can insert, update, delete continuously
throughout the entire process.
The only locks that are taken are
1) a ShareUpdateExclusiveLock which blocks vacuum from running on the table
being indexed. This is taken by both phase 1 and phase 2. (Actually I had
the wrong lock in the patch I emailed in one place. Fixed in my source tree
here)
2) An ExclusiveLock that is taken momentarily and immediately released. Even
if that can never be acquired due to a busy system it can eventually
proceed anyways as long as there are no long-running transactions that are
refusing to commit.
That said we do need some performance tests to get an idea how long phase 2
takes for large tables. The additional index and heap scan and tid sort could
take a substantial amount of time though never as long as the original index
build done in phase 1.
What's worse is that in some cases the merge could potentially be doing a lot
of retail index inserts. I have no good intuition for how long those will take
relative to the wholesale index build method, especially since for some index
methods like GIN retail inserts are extremely expensive.
So for indexes that don't have a lot of records that need to be inserted
individually what I expect -- and what I put in the docs -- is something under
100% time penalty for an online index build. In fact I expect it to be more
like 50% though it depends on how wide the original index. For ones that do
have lots of records mutated for phase 2 all bets are off.
We may need to have usage recommendations in the docs.
I'm writing docs now. I'm trying to find a happy medium between explaining all
the issues and spamming the docs with lots of discussion. Right now what I
have is a single paragraph in the create_index man page that refers to the
Postgres manual where I list the issues in more depth.
I also still have to get some kind of regression tests. I don't think we have
any concurrent regression tests currently, do we? To thoroughly test it will
be quite hard. Some of the corner cases are extremely narrow or require very
particular types of transactions running with very specific timing.
--
greg
Ühel kenal päeval, N, 2006-07-13 kell 01:07, kirjutas Greg Stark:
Simon Riggs <simon@2ndquadrant.com> writes:
On Wed, 2006-07-12 at 12:09 -0400, Greg Stark wrote:
no regression tests yet.
We'll need some performance tests that show that lock-hold time is
*actually* reduced, given the shenanigans needed to get there.I'm not sure what you mean by "lock-hold time". Online index builds
effectively take *no* locks in the user-visible sense that regular index
builds do. Other transactions can insert, update, delete continuously
throughout the entire process.The only locks that are taken are
1) a ShareUpdateExclusiveLock which blocks vacuum from running on the table
being indexed. This is taken by both phase 1 and phase 2. (Actually I had
the wrong lock in the patch I emailed in one place. Fixed in my source tree
here)2) An ExclusiveLock that is taken momentarily and immediately released. Even
if that can never be acquired due to a busy system it can eventually
proceed anyways as long as there are no long-running transactions that are
refusing to commit.That said we do need some performance tests to get an idea how long phase 2
takes for large tables. The additional index and heap scan and tid sort could
take a substantial amount of time though never as long as the original index
build done in phase 1.
Maybe we can show progress indicators in status line (either
pg_stat_activity.current_query or commandline shown in ps), like
WAITING TO START PHASE 1 - WAITING FOR TRANSACTION XXX TO COMPLETE
or
INSERTING INDEX ENTRY N OF M
changing every few seconds.
that could give the sysadmin some idea of what is going on without too
much verbosity on console.
there could of course be a VERBOSE mode, which acts similar to VACUUM
VERBOSE.
And why not make t possible to add a verbosity level there as well:
'CREATE INDEX ... VEBOSE 3'. And to VACUUM VERBOSE as well
At level 3 all status changes could also be sent to client as well.
What's worse is that in some cases the merge could potentially be doing a lot
of retail index inserts. I have no good intuition for how long those will take
relative to the wholesale index build method, especially since for some index
methods like GIN retail inserts are extremely expensive.So for indexes that don't have a lot of records that need to be inserted
individually what I expect -- and what I put in the docs -- is something under
100% time penalty for an online index build. In fact I expect it to be more
like 50% though it depends on how wide the original index. For ones that do
have lots of records mutated for phase 2 all bets are off.
the only quarantee seems to be, that if there are still some resources
left, the index build will eventualy complete.
but showing progress will let the DBA to make the decision to abort the
build if he sees that it takes "too long".
Another related thing - throttling
----------------------------------
Did you do any work on using vacuum_cost_* GUC vars to throttle the
build process if desired ?
I guess that the initial seqscans are probably cheap enough ( or at
least they are no worse than if someone just did "select * ..."
concurrently), but the index merge can probably still not be as light on
OLTP queries as desirable.
In reality on OLTP dbs even SELECT COUNT(*) can sometimes be a
"mainenance operation" and thus the goal may not be to complete as fast
as possible, but to be as light as possible on concurrent OLTP queries.
Eventually it would nice to have special optimiser rules for any
"maintenance" queries and DDL ops as defined by DBA, but making use of
vacuum_cost_* vars for generic throttling would be a good first cut ;
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia
Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Hannu Krosing <hannu@skype.net> writes:
Maybe we can show progress indicators in status line (either
pg_stat_activity.current_query or commandline shown in ps), likeWAITING TO START PHASE 1 - WAITING FOR TRANSACTION XXX TO COMPLETE
or
INSERTING INDEX ENTRY N OF M
changing every few seconds.
Hm. That would be very interesting. I'll say that one of the things that
impressed me very much with Postgres moving from Oracle was the focus on
usability. Progress indicators would be excellent for a lot of operations.
That said I'm not sure how much I can do here. For a substantial index we
should expect most of the time will be spent in the tuplesort. It's hard to
see how to get any sort of progress indicator out of there and as long as we
can't it's hard to see the point of getting one during the heap scan or any of
the other i/o operations.
I think it does make sense to put something in current_query indicating when
it's waiting for transactions to end and when it's past that point. That's
something the DBA should be aware of.
And why not make t possible to add a verbosity level there as well:
'CREATE INDEX ... VEBOSE 3'. And to VACUUM VERBOSE as wellAt level 3 all status changes could also be sent to client as well.
Wouldn't you just control this with log_min_messages? It seems unnecessary to
clutter the grammar for every command with "verbose" options.
Another related thing - throttling
----------------------------------Did you do any work on using vacuum_cost_* GUC vars to throttle the
build process if desired ?
Actually no. While there is consensus that will be necessary I'm not sure I
can do it with this patch. The problem is that most of the real heavy lifting
here is done inside tuplesort. Even aside from that most of what's left is
inside bulkdelete(*) and the code that handles regular index builds.
So I think we'll need some global thinking about what options Postgres needs
to control throttling in general. And probably someone needs to write a
separate patch that adds all the hooks to the various places in a single go.
Trying to throttle just one operation at a time when a lot of the code that
implements these operations is shared will have us running in circles.
(*) Hm. Come to think of it I wonder if the vacuum_cost parameters are already
kicking in for this phase. That would be a bit strange since it's the fastest
of the three scans.
--
greg
That said I'm not sure how much I can do here. For a substantial index we
should expect most of the time will be spent in the tuplesort. It's hard to
see how to get any sort of progress indicator out of there and as long as we
can't it's hard to see the point of getting one during the heap scan or any of
the other i/o operations.
Well from a DBA perspective, just knowing that "something" productive is
happening is useful.
When using vacuum I almost always use vacuum verbose, just so I have an
idea of what is going on.
Joshua D. Drake
--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
A great first step would be to add elog(INFO,...) in some standardized
format over the wire so that clients can tell what's going on. It could
be triggered by a GUC which is off by default.
-M
On Jul 15, 2006, at 9:10 PM, Greg Stark wrote:
Hannu Krosing <hannu@skype.net> writes:
Maybe we can show progress indicators in status line (either
pg_stat_activity.current_query or commandline shown in ps), likeWAITING TO START PHASE 1 - WAITING FOR TRANSACTION XXX TO COMPLETE
or
INSERTING INDEX ENTRY N OF M
changing every few seconds.
Hm. That would be very interesting. I'll say that one of the things
that
impressed me very much with Postgres moving from Oracle was the focus
on
usability. Progress indicators would be excellent for a lot of
operations.That said I'm not sure how much I can do here. For a substantial index
we
should expect most of the time will be spent in the tuplesort. It's
hard to
see how to get any sort of progress indicator out of there and as long
as we
can't it's hard to see the point of getting one during the heap scan
or any of
the other i/o operations.I think it does make sense to put something in current_query
indicating when
it's waiting for transactions to end and when it's past that point.
That's
something the DBA should be aware of.
¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬
AgentM
agentm@themactionfaction.com
¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬
On Sat, 2006-07-15 at 21:10 -0400, Greg Stark wrote:
Hannu Krosing <hannu@skype.net> writes:
Another related thing - throttling
----------------------------------Did you do any work on using vacuum_cost_* GUC vars to throttle the
build process if desired ?Actually no. While there is consensus that will be necessary I'm not sure I
can do it with this patch. The problem is that most of the real heavy lifting
here is done inside tuplesort. Even aside from that most of what's left is
inside bulkdelete(*) and the code that handles regular index builds.So I think we'll need some global thinking about what options Postgres needs
to control throttling in general. And probably someone needs to write a
separate patch that adds all the hooks to the various places in a single go.
Trying to throttle just one operation at a time when a lot of the code that
implements these operations is shared will have us running in circles.(*) Hm. Come to think of it I wonder if the vacuum_cost parameters are already
kicking in for this phase. That would be a bit strange since it's the fastest
of the three scans.
Separate patches sounds more sensible. Using statement_cost_* in many
places sounds useful to me and not too hard to get into 8.2
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Ühel kenal päeval, L, 2006-07-15 kell 21:10, kirjutas Greg Stark:
Hannu Krosing <hannu@skype.net> writes:
Maybe we can show progress indicators in status line (either
pg_stat_activity.current_query or commandline shown in ps), likeWAITING TO START PHASE 1 - WAITING FOR TRANSACTION XXX TO COMPLETE
or
INSERTING INDEX ENTRY N OF M
changing every few seconds.
Hm. That would be very interesting. I'll say that one of the things that
impressed me very much with Postgres moving from Oracle was the focus on
usability. Progress indicators would be excellent for a lot of operations.That said I'm not sure how much I can do here. For a substantial index we
should expect most of the time will be spent in the tuplesort.
If index is substantially bigger than shared memory/available RAM then
tuple by tuple index insertion part can also be quite slow and i/o
consuming even if the number of entries to insert is low compared to
total index size. That is where "INSERTING INDEX ENTRY N OF M" would be
a thing which helps the DBA to estimate time to completion.
It's hard to
see how to get any sort of progress indicator out of there and as long as we
can't it's hard to see the point of getting one during the heap scan or any of
the other i/o operations.
It would be useful to know, in what step we are in general , that is are
we doing some useful work or just waiting for "something".
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia
Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
On Saturday 15 July 2006 21:37, Joshua D. Drake wrote:
That said I'm not sure how much I can do here. For a substantial index we
should expect most of the time will be spent in the tuplesort. It's hard
to see how to get any sort of progress indicator out of there and as long
as we can't it's hard to see the point of getting one during the heap
scan or any of the other i/o operations.Well from a DBA perspective, just knowing that "something" productive is
happening is useful.When using vacuum I almost always use vacuum verbose, just so I have an
idea of what is going on.
+1
--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Sat, Jul 15, 2006 at 09:10:46PM -0400, Greg Stark wrote:
Hannu Krosing <hannu@skype.net> writes:
Maybe we can show progress indicators in status line (either
pg_stat_activity.current_query or commandline shown in ps), likeWAITING TO START PHASE 1 - WAITING FOR TRANSACTION XXX TO COMPLETE
or
INSERTING INDEX ENTRY N OF M
changing every few seconds.
Hm. That would be very interesting. I'll say that one of the things that
impressed me very much with Postgres moving from Oracle was the focus on
usability. Progress indicators would be excellent for a lot of operations.That said I'm not sure how much I can do here. For a substantial index we
should expect most of the time will be spent in the tuplesort. It's hard to
see how to get any sort of progress indicator out of there and as long as we
can't it's hard to see the point of getting one during the heap scan or any of
the other i/o operations.
I'd love to have any kind of progress indication for any sorts that
spill to disk, and there's any number of other long-running operations
where progress info would be very welcome. I certainly wouldn't let lack
of a progress indicator for sorts prevent you from adding one. I like
the idea of periodically updating both current_query and the commandline
that ps shows.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Consindering the syntax for this, we currently allow read access during
index creation, just not write access, so I think the new syntax should
be:
CREATE [ UNIQUE ] INDEX name ON table
[ USING method ] [ [ENABLE] WRITE [ACCESS] ]
( { column | ( expression ) } [ opclass ] [, ...] )
[ WITH ( storage_parameter = value [, ... ] ) ]
[ TABLESPACE tablespace ]
[ WHERE predicate ]
This is clear, and adds no new keywords.
---------------------------------------------------------------------------
Greg Stark wrote:
I just sent in the patch for online index builds to -patches.
. The work to combine the two phases into a single non-transactional command
is done. I'm not sure how long to wait between lock checks or how verbose to
be about why it's taking so long. I do think we have to print something or
else the DBA won't know if it's hung waiting for something external.
Currently it prints a notice the first time it sleeps.. Also it prints out how many tuples it found which normal index doesn't.
Probably that message should go away. On the other hand the index stats
probably need to be filled in.. I need to check what locks I'm taking. I think I still have some old code
with the wrong locks in it.. this includes the tid btree opclass sent earlier (with a warning I didn't
notice before fixed up). opr_sanity now fails but I think that's due to the
gin commits not this opclass.. In case of an error during phase2 the invalid index is left in place. It can
be dropped with DROP INDEX. The footwork to get it dropped in case of an
error would be quite tricky but there's a sketch of how to do it in the source.. no documentation yet, there's not much to write though.
. no regression tests yet. I don't see any way to test this reasonably in the
regression tests. I've done some testing myself by building indexes while
pgbench is running. Then I have to do index scans to see how many records
are returned with index scans. It wouldn't be easy to automate and even if
it were done it wouldn't really be all that great a test. The corner cases
found during the development are pretty narrow and will be hard to reliably
test.--
greg---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote:
Consindering the syntax for this, we currently allow read access during
index creation, just not write access, so I think the new syntax should
be:CREATE [ UNIQUE ] INDEX name ON table
[ USING method ] [ [ENABLE] WRITE [ACCESS] ]
( { column | ( expression ) } [ opclass ] [, ...] )
[ WITH ( storage_parameter = value [, ... ] ) ]
[ TABLESPACE tablespace ]
[ WHERE predicate ]This is clear, and adds no new keywords.
It's not particularly clear to me. On its face this seems to me to imply
something about how the index will be able to be used, not about how it
is to be built.
cheers
andrew
Andrew Dunstan wrote:
Bruce Momjian wrote:
Consindering the syntax for this, we currently allow read access during
index creation, just not write access, so I think the new syntax should
be:CREATE [ UNIQUE ] INDEX name ON table
[ USING method ] [ [ENABLE] WRITE [ACCESS] ]
( { column | ( expression ) } [ opclass ] [, ...] )
[ WITH ( storage_parameter = value [, ... ] ) ]
[ TABLESPACE tablespace ]
[ WHERE predicate ]This is clear, and adds no new keywords.
It's not particularly clear to me. On its face this seems to me to imply
something about how the index will be able to be used, not about how it
is to be built.
Yea, that was always a confusion. CREATE CONCURRENT INDEX has the same
problem. We need something that talks about the build-time behavior.
How about NOWAIT?
CREATE [ UNIQUE ] INDEX name ON table
[ USING method ] [ NOWAIT ]
( { column | ( expression ) } [ opclass ] [, ...] )
[ WITH ( storage_parameter = value [, ... ] ) ]
[ TABLESPACE tablespace ]
[ WHERE predicate ]
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
CREATE [ UNIQUE ] INDEX name ON table
[ USING method ] [ [ENABLE] WRITE [ACCESS] ]
( { column | ( expression ) } [ opclass ] [, ...] )
[ WITH ( storage_parameter = value [, ... ] ) ]
[ TABLESPACE tablespace ]
[ WHERE predicate ]
This is clear, and adds no new keywords.
... and is very very painful for psql to parse. If you want this,
*you* write the patch for command_no_begin(). Be sure to handle
quoted and schema-qualified identifiers properly.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
CREATE [ UNIQUE ] INDEX name ON table
[ USING method ] [ [ENABLE] WRITE [ACCESS] ]
( { column | ( expression ) } [ opclass ] [, ...] )
[ WITH ( storage_parameter = value [, ... ] ) ]
[ TABLESPACE tablespace ]
[ WHERE predicate ]This is clear, and adds no new keywords.
... and is very very painful for psql to parse. If you want this,
*you* write the patch for command_no_begin(). Be sure to handle
quoted and schema-qualified identifiers properly.
I was going to use our tab-completion code, which I think handles all
these.
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote:
Andrew Dunstan wrote:
It's not particularly clear to me. On its face this seems to me to imply
something about how the index will be able to be used, not about how it
is to be built.Yea, that was always a confusion. CREATE CONCURRENT INDEX has the same
problem. We need something that talks about the build-time behavior.
How about
CREATE INDEX CONCURRENTLY?
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote:
Bruce Momjian wrote:
Andrew Dunstan wrote:
It's not particularly clear to me. On its face this seems to me to imply
something about how the index will be able to be used, not about how it
is to be built.Yea, that was always a confusion. CREATE CONCURRENT INDEX has the same
problem. We need something that talks about the build-time behavior.How about
CREATE INDEX CONCURRENTLY?
Yea, that actually sounds pretty good. Though a new keyword, it might
be the best solution.
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +