create index concurrently - duplicate index to reduce time without an index

Started by Noname · almost 16 years ago · 14 messages · general
#1 Noname
Gareth.Williams@csiro.au

Hi,

We want to reindex the database behind a production service without interrupting the service.

I had an idea for creating the index with a new name then dropping the existing index and renaming the new one - and it seems to work and would reduce the time without an index to be minimal. I tried:
psql -d ICAT -c 'create unique index concurrently tmp_idx_objt_access1
on R_OBJT_ACCESS (object_id,user_id);'
# would check if that worked before proceeding #
psql -d ICAT -c 'drop index idx_objt_access1;'
psql -d ICAT -c 'alter index tmp_idx_objt_access1 rename to idx_objt_access1;'
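For the "check if that worked" step, something like the following should do (a sketch; a failed CREATE INDEX CONCURRENTLY leaves the new index behind marked invalid in the catalog, so it can be checked before the old index is dropped):

```sql
-- Proceed only if this returns true; otherwise drop the invalid
-- tmp_idx_objt_access1 and retry the concurrent build.
SELECT i.indisvalid
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE c.relname = 'tmp_idx_objt_access1';
```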

But then a colleague pointed out that maybe the name of the index is not meaningful and this might not be any use.

Can any experts confirm the validity of this approach or shoot it down?

Thanks,

Gareth

#2 Szymon Guz
mabewlun@gmail.com
In reply to: Noname (#1)
Re: create index concurrently - duplicate index to reduce time without an index

2010/6/1 <Gareth.Williams@csiro.au>

Hi,

We want to reindex the database behind a production service without
interrupting the service.

I had an idea for creating the index with a new name then dropping the
existing index and renaming the new one - and it seems to work and would
reduce the time without an index to be minimal. I tried:
psql -d ICAT -c 'create unique index concurrently tmp_idx_objt_access1
on R_OBJT_ACCESS (object_id,user_id);'
# would check if that worked before proceeding #
psql -d ICAT -c 'drop index idx_objt_access1;'
psql -d ICAT -c 'alter index tmp_idx_objt_access1 rename to
idx_objt_access1;'

But then a colleague pointed out that maybe the name of the index is not
meaningful and this might not be any use.

Can any experts confirm the validity of this approach or shoot it down?

The index name is not used for planning query execution (the most important
thing is the index definition), but of course it is important to have some
logical naming convention, which usually helps in understanding a database
schema quickly. Name the index whatever you want; the database really doesn't
care about the names.

regards
Szymon Guz

#3 Noname
Gareth.Williams@csiro.au
In reply to: Szymon Guz (#2)
Re: create index concurrently - duplicate index to reduce time without an index

From: Szymon Guz [mailto:mabewlun@gmail.com]

-snip-

2010/6/1 <Gareth.Williams@csiro.au>
Hi,

We want to reindex the database behind a production service without interrupting the service.

I had an idea for creating the index with a new name then dropping the existing index and renaming the new one - and it seems to work and would reduce the time without an index to be minimal.  I tried:

psql -d ICAT -c 'create unique index concurrently tmp_idx_objt_access1
on R_OBJT_ACCESS (object_id,user_id);'
# would check if that worked before proceeding #
psql -d ICAT -c 'drop index idx_objt_access1;'
psql -d ICAT -c 'alter index tmp_idx_objt_access1 rename to idx_objt_access1;'

But then a colleague pointed out that maybe the name of the index is not meaningful and this might not be any use.

Can any experts confirm the validity of this approach or shoot it down?

The index name is not used for planning query execution (the most important thing is the index definition), but of course it is important to have some logical naming convention, which usually helps in understanding a database schema quickly. Name the index whatever you want; the database really doesn't care about the names.

Thanks Szymon,

I was afraid that was the case. So the rest of the question is: if I have two indexes with identical definitions, what happens? I've confirmed that I can create indexes with identical definitions (except for the name) without Postgres complaining - and without breaking the client on my test system - but I am wary of trying it on my production system, where there is much more data (8GB) and I care about its integrity so much more.

I've just tested further and if I create two indexes with identical definitions, my queries are fast, and I can delete either of them and the queries are still fast, but with both deleted the queries are slow. And yes, renaming makes no difference but is nice for understanding the purpose of the index.

So I still haven't seen this procedure break anything, but does anyone know a reason it might be inadvisable?
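For reference, the duplicates show up clearly in pg_indexes (a sketch; the table name is written lower-case here because it was created unquoted):

```sql
-- List every index on the table with its definition; entries whose
-- indexdef differs only in the index name are the duplicates in question.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'r_objt_access';
```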

Cheers,

Gareth

#4 Szymon Guz
mabewlun@gmail.com
In reply to: Noname (#3)
Re: create index concurrently - duplicate index to reduce time without an index

2010/6/1 <Gareth.Williams@csiro.au>

-snip-

Thanks Szymon,

I was afraid that was the case. So the rest of the question is: if I have
two indexes with identical definitions, what happens? I've confirmed that I
can create indexes with identical definitions (except for the name) without
Postgres complaining - and without breaking the client on my test system -
but I am wary of trying it on my production system, where there is much more
data (8GB) and I care about its integrity so much more.

I've just tested further and if I create two indexes with identical
definitions, my queries are fast, and I can delete either of them and the
queries are still fast, but with both deleted the queries are slow. And yes,
renaming makes no difference but is nice for understanding the purpose of
the index.

So I still haven't seen this procedure break anything, but does anyone know
a reason it might be inadvisable?

When you have two indexes with exactly the same definition, that's just a
waste of resources. Only one of them will be used to speed up selects, but
all of them must be updated during insert/update/delete operations.

regards
Szymon Guz

#5 Noname
Gareth.Williams@csiro.au
In reply to: Szymon Guz (#4)
Re: create index concurrently - duplicate index to reduce time without an index

When you have two indexes with exactly the same definition, that's just a waste of resources. Only one of them will be used to speed up selects, but all of them must be updated during insert/update/delete operations.

Thanks Szymon,

I appreciate the info. The duplication would only be for the time of the index build - which may be significant. When the new index was successfully complete we would drop the old one.

I'm concerned that just because I *can* create duplicate indexes, it doesn't mean I *should*. I'm worried it might actually break the service worse than the locking that I'm trying to avoid. Does anyone know which index will be used if there are duplicate ones and whether there is likely to be a problem?

My understanding was that the index would not be updated for inserts to the actual table - which is why one has to reindex or drop the index and create it again (with concurrently if you don't want to block reads). Am I missing something?

Perhaps I don't need to recreate the index at all - but I've been advised to reindex.

See:
http://groups.google.com/group/irod-chat/browse_thread/thread/0396d48ffecfb2b0#
"On a side note, for the irods performance, an other key factor is also
the iCAT database performances and in some of your tests below, it has
an impact. Recomputing the index on a regular basis is something
important: here, I am doing it once per week for each iCAT (cron task).
Even for a catalog with millions of files, if you have ingested tens of
thousands of new files, you will see an improvement when doing
reindexing (for example an ils performed in 0.3 s will goes down to less
than 0.1 s when reindexing has been made)."

Thanks,

Gareth

#6 Greg Smith
gsmith@gregsmith.com
In reply to: Noname (#3)
Re: create index concurrently - duplicate index to reduce time without an index

Gareth.Williams@csiro.au wrote:

So the rest of the question is: if I have two indexes with identical definitions, what happens? I've confirmed that I can create indexes with identical definitions (except for the name) without Postgres complaining - and without breaking the client on my test system - but I am wary of trying it on my production system, where there is much more data (8GB) and I care about its integrity so much more.

The database doesn't care one bit if you create a bunch of redundant
indexes. So long as one of them is around to satisfy the queries that
need the index to run well, you're fine.

The main thing you can't do with the index concurrently/rename shuffle
you've discovered here is use that approach to concurrently rebuild an
index that enforces a constraint or unique index. If your index is
enforcing a PRIMARY KEY for example, you'll discover a major roadblock
were you to try this same technique to rebuild it. Those are tied into
the constraint portion of the system catalogs and manipulating them
isn't so easy.

Regular indexes that exist just to speed up queries, those you can
rename around as you've been figuring out without any downside. From a
general paranoia perspective, you should run an explicit ANALYZE on the
underlying table after you finish the shuffle, just to make absolutely
sure the right statistics are available afterwards.
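Putting that together, the whole shuffle for a plain index looks something like this (a sketch using the names from this thread; only the concurrent build is slow, and the DROP/RENAME/ANALYZE steps run quickly afterwards):

```sql
-- Build the replacement without blocking writers; this is the slow step
-- and cannot run inside a transaction block.
CREATE UNIQUE INDEX CONCURRENTLY tmp_idx_objt_access1
    ON R_OBJT_ACCESS (object_id, user_id);
-- After verifying the build succeeded, swap the names and refresh the
-- planner statistics as suggested above.
DROP INDEX idx_objt_access1;
ALTER INDEX tmp_idx_objt_access1 RENAME TO idx_objt_access1;
ANALYZE R_OBJT_ACCESS;
```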

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us

#7 Alban Hertroys
dalroi@solfertje.student.utwente.nl
In reply to: Noname (#5)
Re: create index concurrently - duplicate index to reduce time without an index

On 2 Jun 2010, at 6:00, <Gareth.Williams@csiro.au> wrote:

My understanding was that the index would not be updated for inserts to the actual table - which is why one has to reindex or drop the index and create it again (with concurrently if you don't want to block reads). Am I missing something?

That's definitely not true, indexes would be rather useless if that were the case. What doesn't get updated with inserts are planner statistics, but you update those by executing ANALYSE (or VACUUM ANALYSE), or automatically by autovacuum.

Perhaps I don't need to recreate the index at all - but I've been advised to reindex.

See:
http://groups.google.com/group/irod-chat/browse_thread/thread/0396d48ffecfb2b0#
"On a side note, for the irods performance, an other key factor is also
the iCAT database performances and in some of your tests below, it has
an impact. Recomputing the index on a regular basis is something
important: here, I am doing it once per week for each iCAT (cron task).
Even for a catalog with millions of files, if you have ingested tens of
thousands of new files, you will see an improvement when doing
reindexing (for example an ils performed in 0.3 s will goes down to less
than 0.1 s when reindexing has been made)."

There are cases where reindexing shows a performance improvement over just analysing a table, but the above (only inserts) shouldn't be one of them.

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.


#8 Scott Marlowe
scott.marlowe@gmail.com
In reply to: Noname (#5)
Re: create index concurrently - duplicate index to reduce time without an index

On Tue, Jun 1, 2010 at 10:00 PM, <Gareth.Williams@csiro.au> wrote:

I'm concerned that just because I *can* create duplicate indexes, it doesn't
mean I *should*. I'm worried it might actually break the service worse than
the locking that I'm trying to avoid. Does anyone know which index will be
used if there are duplicate ones and whether there is likely to be a problem?

I've done this dozens of times on a production server, midday, max
load, with no real problems. create index concurrently is one of the
greatest things anyone's ever done for pgsql.

#9 Greg Smith
gsmith@gregsmith.com
In reply to: Noname (#5)
Re: create index concurrently - duplicate index to reduce time without an index

Gareth.Williams@csiro.au wrote:

I'm concerned that just because I *can* create duplicate indexes, it doesn't mean I *should*. I'm worried it might actually break the service worse than the locking that I'm trying to avoid. Does anyone know which index will be used if there are duplicate ones and whether there is likely to be a problem?

This is a simple disk capacity question. If you can run CREATE INDEX
CONCURRENTLY, and it doesn't effectively take your server down due to
the disks being overrun with data to process, it's perfectly fine to do
so. Only in the case where the index creation itself causes what is
effectively client downtime, due to the increased load of the build,
could that approach be worse than the terrible locking that comes with a
non-concurrent rebuild.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us

#10 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Noname (#1)
Re: create index concurrently - duplicate index to reduce time without an index

Excerpts from Gareth.Williams's message of Tue Jun 01 02:44:35 -0400 2010:

Hi,

We want to reindex the database behind a production service without interrupting the service.

I had an idea for creating the index with a new name then dropping the existing index and renaming the new one - and it seems to work and would reduce the time without an index to be minimal. I tried:
psql -d ICAT -c 'create unique index concurrently tmp_idx_objt_access1
on R_OBJT_ACCESS (object_id,user_id);'
# would check if that worked before proceeding #
psql -d ICAT -c 'drop index idx_objt_access1;'
psql -d ICAT -c 'alter index tmp_idx_objt_access1 rename to idx_objt_access1;'

Note that you should wait until the new index becomes usable before
deleting the old one; otherwise you could have an intermediate period
during which you have no index.

IIRC a concurrently created index does not become usable until the
oldest transaction that was current when index creation started has
finished (IOW the transaction in pg_index.indcheckxmin has gone).
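Both conditions can be checked from the catalog before dropping the old index (a sketch; indcheckxmin is the flag referred to above):

```sql
-- indisvalid = false  -> the concurrent build failed; drop and retry.
-- indcheckxmin = true -> the index may not yet be usable by queries
--                        until transactions older than its creation end.
SELECT c.relname, i.indisvalid, i.indcheckxmin
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE c.relname = 'tmp_idx_objt_access1';
```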

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#11 Noname
Gareth.Williams@csiro.au
In reply to: Greg Smith (#6)
Re: create index concurrently - duplicate index to reduce time without an index

-----Original Message-----
From: Greg Smith [mailto:greg@2ndquadrant.com]

-snip-

Gareth.Williams wrote:

-snip-

Thanks Greg, Alban and others,

This has cleared up a misunderstanding I had about why one should reindex. Re-reading the documentation http://www.postgresql.org/docs/8.4/interactive/sql-reindex.html it is clear now that reindexing or recreating an index should not normally be needed - certainly not to keep an index up to date. I would have guessed that VACUUM or VACUUM ANALYSE on the table the index is associated with would have been sufficient to reclaim space for a 'bloated' index (maybe only VACUUM FULL would help). In any case, we can leave reindexing or a full vacuum for outages where we are interrupting service anyway.

I was heartened by the responses and tried further testing (if it could not hurt, why not try and see if it could be faster), but ran into a problem. A few times when I was trying to drop an index (before or after creating a duplicate index with 'concurrently'), the dropping of the index stalled. It seems that this was because of an existing connection:
postgres: rods ICAT 130.102.163.141(58061) idle in transaction
And new clients block.

Is that expected? Should we look for 'bugs' in our client that seems to be holding a connection?

For the application we have, I'm ready to give up on this train of investigation for optimization and just vacuum analyse key tables regularly, and vacuum and maybe reindex more completely during outages - though we aim for outages to be infrequent. The database holds data representing a virtual filesystem structure with millions of files (and associated access controls, and information on underlying storage resources and replication). There is probably not much update or delete of the main data - at least compared with the total holdings and the new data/files which are regularly being added to the system.

Thanks again,

Gareth

Ps. Greg, I don't understand the issue with 'concurrently rebuild (of) an index that enforces a constraint or unique index'. I don't think I care much right at the moment, but I'm generally interested and others might be too. Would you expect the create index to fail or to cause locking or just transient performance degradation?

#12 Alban Hertroys
dalroi@solfertje.student.utwente.nl
In reply to: Noname (#11)
Re: create index concurrently - duplicate index to reduce time without an index

Thanks Greg, Alban and others,

This has cleared up a misunderstanding I had about why one should reindex. Re-reading the documentation http://www.postgresql.org/docs/8.4/interactive/sql-reindex.html it is clear now that reindexing or recreating an index should not normally be needed - certainly not to keep an index up to date. I would have guessed that VACUUM or VACUUM ANALYSE on the table the index is associated with would have been sufficient to reclaim space for a 'bloated' index (maybe only VACUUM FULL would help). In any case, we can leave reindexing or a full vacuum for outages where we are interrupting service anyway.

VACUUM FULL actually causes bloat to indexes. It rearranges the data in the tables so that any gaps get used, but while doing that it also needs to update the indices related to those tables.

Normal VACUUM and VACUUM ANALYSE don't have this problem though; they just mark table space that's no longer in use (the transaction that deleted those rows has committed) as reusable, so that later INSERTs can put their data there. This is a bit of a simplification of what's really going on - for the details check the documentation.

Autovacuum does VACUUM ANALYSE in the background, using multiple threads in recent versions. You can (and seeing your use of the database you probably should) tune how often it vacuums tables through several configuration parameters.

Of course, running ANALYSE when you _know_ data in a table has changed significantly means you don't have to wait for autovac to get around to analysing that table.

I was heartened by the responses and tried further testing (if it could not hurt, why not try and see if it could be faster), but ran into a problem. A few times when I was trying to drop an index (before or after creating a duplicate index with 'concurrently'), the dropping of the index stalled. It seems that this was because of an existing connection:
postgres: rods ICAT 130.102.163.141(58061) idle in transaction
And new clients block.

Is that expected? Should we look for 'bugs' in our client that seems to be holding a connection?

I'm not exactly sure why that is (I can't look into your database), but my guess is that the index is locked by a transaction. Apparently the transaction you refer to has uncommitted work that depends on the index at some point.

Keeping transactions open for a long time is usually a bad idea.

You saw that you can't drop an index in use by a transaction for example, but autovacuum is running into similar issues - it can't reclaim space until the transaction finishes as the transaction locks things that autovacuum will want to touch.
That probably means (I'm not sure it works that way, but it seems likely) that the autovacuum thread gets stuck at a lock and can't continue until the transaction holding the lock frees it.
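The offending sessions are visible in pg_stat_activity (a sketch; the column names here are those of recent releases, while 8.x-era servers expose procpid and current_query instead of pid, state and query):

```sql
-- Find sessions sitting "idle in transaction"; these hold locks that
-- block DROP INDEX and keep autovacuum from reclaiming space.
SELECT pid, usename, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
```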

For the application we have, I'm ready to give up on this train of investigation for optimization and just vacuum analyse key tables regularly and vaccuum and maybe reindex more completely during outages - though we aim for outages to be infrequent. The database holds data representing a virtual filesystem structure with millions of file (and associated access controls, and information on underlying storage resources and replication). There is probably not much update or delete of the main data - at least compared with the total holdings and the new data/files which are regularly being added to the system.

In practice VACUUM FULL and REINDEX are used to reclaim disk space. That of itself doesn't look much like it'd improve performance, but using less disk space also means that data gets more tightly packed in your disk cache, for example. REINDEX can mean an index that didn't fit into RAM now does. They're both rather intrusive operations though, so it's a matter of balancing the costs and benefits. Many databases don't need to bother with VACUUM FULL or REINDEX.

Ps. Greg, I don't understand the issue with 'concurrently rebuild (of) an index that enforces a constraint or unique index'. I don't think I care much right at the moment, but I'm generally interested and others might be too. Would you expect the create index to fail, to cause locking, or just transient performance degradation?

I think what Greg was getting at is that there's a dependency tree between indexes and constraints: A primary key is implemented using a unique index. You can create a new (unique) index on the same columns concurrently, but you can't replace the primary key index with it as you're not allowed to drop the index without dropping the PK constraint. If you have any FK constraints pointing to that table, you can't drop the PK constraint without also dropping the FK constraints.

Quite a bit of trouble to go through to replace one index.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


#13 Greg Smith
gsmith@gregsmith.com
In reply to: Noname (#11)
Re: create index concurrently - duplicate index to reduce time without an index

Gareth.Williams@csiro.au wrote:

Re-reading the documentation http://www.postgresql.org/docs/8.4/interactive/sql-reindex.html it is clear now that reindexing or recreating an index should not normally be needed - certainly not to keep an index up to date. I would have guessed that VACUUM or VACUUM ANALYSE on the table the index is associated with would have been sufficient to reclaim space for a 'bloated' index (maybe only VACUUM FULL would help). In any case, we can leave reindexing or a full vacuum for outages where we are interrupting service anyway.

It is a periodic preventative maintenance operation you can expect to
need occasionally, but certainly not often. Indexes maintain themselves
just fine most of the time. They can get out of whack if you delete a
lot of data out of them, and there are some usage patterns that tend to
aggravate the problems here (like tables where you're always inserting
new data and deleting old), but it's certainly not something you run all
the time.

You should read http://wiki.postgresql.org/wiki/VACUUM_FULL to clear up
when it is needed and what the better alternatives are.

A few times when I was trying to drop an index (before or after creating a duplicate index with 'concurrently'), the dropping of the index stalled. It seems that this was because of an existing connection:
postgres: rods ICAT 130.102.163.141(58061) idle in transaction
And new clients block. Is that expected? Should we look for 'bugs' in our client that seems to be holding a connection?

You do need to be careful that there are no clients connected when you
try this, or yes this is expected behavior. One popular technique is to
put some sort of "block access to the database" switch in the
application itself, specifically to support small outages while keeping
the app from going crazy. You can flip that for a few second around
when you're doing the index switch.

Ps. Greg, I don't understand the issue with 'concurrently rebuild (of) an index that enforces a constraint or unique index'. I don't think I care much right at the moment, but I'm generally interested and others might be too. Would you expect the create index to fail or to cause locking or just transient performance degradation?

The description Alban wrote covers what I was alluding to. You just
can't drop an index that supports a constraint, and that has some
(bad) implications for how you can rebuild it.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us

#14 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Greg Smith (#13)
Re: create index concurrently - duplicate index to reduce time without an index

Excerpts from Greg Smith's message of Mon Jun 07 12:23:44 -0400 2010:

It is a periodic preventative maintenance operation you can expect to
need occasionally, but certainly not often. Indexes maintain themselves
just fine most of the time. They can get out of whack if you delete a
lot of data out of them, and there are some usage patterns that tend to
aggravate the problems here (like tables where you're always inserting
new data and deleting old), but it's certainly not something you run all
the time.

Indexes on which you always insert new data and delete old can keep
themselves in good shape too. The really problematic cases are those in
which you delete most, but not all, of the old data. Those cases result
in almost-empty pages that can never be recycled (we do not have btree
page merging).

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support