Invalid indexes should not consume update overhead

Started by Tomasz Ostrowski in July 2016 · 14 messages · pgsql-bugs
#1 Tomasz Ostrowski
tometzky+pg@ato.waw.pl

When an index is built concurrently and the build fails, the index is left
in an invalid state. It's basically unusable for anything, but, according
to the documentation:

If a problem arises while scanning the table, such as a deadlock or
a uniqueness violation in a unique index, the CREATE INDEX command
will fail but leave behind an "invalid" index. This index will be
ignored for querying purposes because it might be incomplete; however
it *will* *still* *consume* *update* *overhead*. The psql \d command
will report such an index as INVALID

I think this update overhead is actually wasted - there's no way to make
use of it, as the only way to make the index usable again is to reindex
it or drop and recreate it.
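
For example (a minimal sketch, with made-up table and index names):

    CREATE UNIQUE INDEX CONCURRENTLY t_col_key ON t (col);
    -- fails with e.g. a uniqueness violation; the index is left behind
    -- and \d t shows it as INVALID, yet, per the documentation above,
    -- it still consumes update overhead; the only ways out today are:
    REINDEX INDEX t_col_key;
    -- or
    DROP INDEX t_col_key;    -- and run CREATE INDEX CONCURRENTLY again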

On the other hand, if invalid indexes had no update overhead, they
might actually be useful. Please consider the following situation
(taken from the real world):

- I have a very large table with a large number of indexes on a server
with a large number of CPUs.
- I need to add a new not null column and I need to do this in a fairly
small maintenance window.
- I tried to simply "alter table tablename add column columnname int not
null default 0", but it did not finish in 24 hours - it updates these tens
of indexes in a single process, using 100% of a single CPU - unacceptable.
- I mark all the indexes as invalid (currently I'm just saving and
dropping them - see the sketch after this list).
- This time adding the column takes an hour.
- I vacuum full or cluster the table, as it has by now bloated to at least
200% - this is also fast, as indexes are not updated.
- I'm reindexing (currently recreating) all indexes, but instead of
calculating them on a single CPU I use all available CPUs (I can have
like 40 of them on Amazon AWS RDS cheaply if I need them only for a few
hours). This also ends in an hour.
- The world is saved and everybody celebrates.
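
Roughly, what I do today looks like this (just a sketch, table and column
names are made up; the parallel recreate step is driven by running several
psql sessions at once):

    -- 1. save the definitions outside the database (the fragile part):
    SELECT pg_get_indexdef(indexrelid) || ';'
      FROM pg_index
     WHERE indrelid = 'big_table'::regclass
       AND NOT indisprimary;     -- the primary key is handled separately

    -- 2. drop the saved indexes:
    DROP INDEX ...;

    -- 3. the column add is now fast:
    ALTER TABLE big_table ADD COLUMN new_col int NOT NULL DEFAULT 0;

    -- 4. get rid of the bloat, also fast without indexes:
    VACUUM FULL big_table;       -- or CLUSTER

    -- 5. recreate the saved definitions, several psql sessions in
    --    parallel, one per CPU:
    CREATE INDEX ...;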

Dropping and recreating indexes is not a very safe operation - their
definitions need to be saved somewhere outside the database, and if
anything goes wrong it would not be easy to restore them. If it were
possible to just set them invalid (feature request in passing), and if
invalid indexes added no cost on updates, then this would be much
easier and safer.

--
Tomasz "Tometzky" Ostrowski

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomasz Ostrowski (#1)
Re: Invalid indexes should not consume update overhead

On Wed, Jul 13, 2016 at 4:40 PM, Tomasz Ostrowski
<tometzky+pg@ato.waw.pl> wrote:

When an index is built concurrently and the build fails, the index is left in
an invalid state. It's basically unusable for anything, but, according to the
documentation:

If a problem arises while scanning the table, such as a deadlock or
a uniqueness violation in a unique index, the CREATE INDEX command
will fail but leave behind an "invalid" index. This index will be
ignored for querying purposes because it might be incomplete; however
it *will* *still* *consume* *update* *overhead*. The psql \d command
will report such an index as INVALID

I think this update overhead is actually wasted - there's no way to make use
of it, as the only way to make the index usable again is to reindex it or
drop and recreate it.

On the other hand, if invalid indexes had no update overhead, they might
actually be useful. Please consider the following situation (taken from
the real world):

- I have a very large table with a large number of indexes on a server with
a large number of CPUs.
- I need to add a new not null column and I need to do this in a fairly
small maintenance window.
- I tried to simply "alter table tablename add column columnname int not
null default 0", but it did not finish in 24 hours - it updates these tens of
indexes in a single process, using 100% of a single CPU - unacceptable.
- I mark all the indexes as invalid (currently I'm just saving and dropping
them).
- This time adding the column takes an hour.
- I vacuum full or cluster the table, as it has by now bloated to at least
200% - this is also fast, as indexes are not updated.
- I'm reindexing (currently recreating) all indexes, but instead of
calculating them on a single CPU I use all available CPUs (I can have like
40 of them on Amazon AWS RDS cheaply if I need them only for a few hours).
This also ends in an hour.
- The world is saved and everybody celebrates.

Dropping and recreating indexes is not a very safe operation - their
definitions need to be saved somewhere outside the database, and if anything
goes wrong it would not be easy to restore them. If it were possible to
just set them invalid (feature request in passing), and if invalid indexes
added no cost on updates, then this would be much easier and safer.

How can you consider marking an invalid index as valid without
reindexing it? It is quite possible that in the meantime the table
has been updated.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#3 Greg Stark
stark@mit.edu
In reply to: Tomasz Ostrowski (#1)
Re: Invalid indexes should not consume update overhead

I can't disagree with your conclusion but I can offer a bit of perspective
on how the current situation came about.

Invalid indexes are in the same state they're in while a concurrent index
build is in progress. As far as other queries are concerned they're
effectively assuming the index build is still in progress and will still
eventually be completed.

They could maybe determine that's not the case, but then that would be an
extra check for them to do in the normal case, so not necessarily a win.
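
(For what it's worth, such leftovers are easy to spot in the catalogs - a
quick sketch of a query to list them:)

    SELECT indexrelid::regclass AS index_name,
           indrelid::regclass  AS table_name
      FROM pg_index
     WHERE NOT indisvalid;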

The real solution imho is to actually clean up failed index builds when a
build fails. That's what normal transactions do when they abort after all.
This was always the intention but looked like it was going to be a pain and
was put off (ie I was lazy). It's probably just several layers of
PG_TRY/PG_CATCH and closing the failed transactions and opening new ones.

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#3)
Re: Invalid indexes should not consume update overhead

Greg Stark <stark@mit.edu> writes:

The real solution imho is to actually clean up failed index builds when a
build fails. That's what normal transactions do when they abort after all.
This was always the intention but looked like it was going to be a pain and
was put off (ie I was lazy). It's probably just several layers of
PG_TRY/PG_CATCH and closing the failed transactions and opening new ones.

No, that wouldn't fix it if the reason the build failed was a crash.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#5 Peter Geoghegan
pg@heroku.com
In reply to: Tom Lane (#4)
Re: Invalid indexes should not consume update overhead

On Sat, Jul 16, 2016 at 5:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Greg Stark <stark@mit.edu> writes:

The real solution imho is to actually clean up failed index builds when a
build fails. That's what normal transactions do when they abort after all.
This was always the intention but looked like it was going to be a pain and
was put off (ie I was lazy). It's probably just several layers of
PG_TRY/PG_CATCH and closing the failed transactions and opening new ones.

No, that wouldn't fix it if the reason the build failed was a crash.

Could we just have crash recovery perform clean-up to cover that case?

--
Peter Geoghegan

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#6 Tomasz Ostrowski
tometzky+pg@ato.waw.pl
In reply to: Greg Stark (#3)
Re: Invalid indexes should not consume update overhead

On 2016-07-17 02:09, Greg Stark wrote:

The real solution imho is to actually clean up failed index builds when
a build fails.

That wouldn't solve my problem, which is that I need a way to disable
indexes before a large update. I believe (but I'm not sure) that Oracle
has this concept:
ALTER INDEX [INDEX_NAME] UNUSABLE;

Maybe, if an index is in an invalid state, an update could check which part
of the table is already indexed and which part is not. Then it would only
update indexes for the already-indexed part of the table. This way
purposely invalidated indexes could be marked as valid for block numbers
less than 0.

This might actually be a win during concurrent index creation, as
concurrent updates would not have to update the index for all updated rows.

But I don't know if it's feasible from concurrency perspective at all.

Regards,
Tomasz "Tometzky" Ostrowski

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#7 Peter Geoghegan
pg@heroku.com
In reply to: Tomasz Ostrowski (#6)
Re: Invalid indexes should not consume update overhead

On Sun, Jul 17, 2016 at 4:41 AM, Tomasz Ostrowski
<tometzky+pg@ato.waw.pl> wrote:

That wouldn't solve my problem, which is that I need a way to disable
indexes before large update. I believe (but I'm not sure) that Oracle has
this concept:
ALTER INDEX [INDEX_NAME] UNUSABLE;

I think that this must make the index unusable to the optimizer. The
idea being that you can see the impact of dropping the index without
actually doing so, reserving the ability to back out (mark the index
usable once more rather than actually dropping it) if it turns out
that the index is of some use.

If it simply made the index unusable while removing any ongoing
obligation for writes to maintain the index, then what's the point in
supporting this at all? You need to be able to mark it usable again.

--
Peter Geoghegan

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#8 Rader, David
davidr@openscg.com
In reply to: Peter Geoghegan (#7)
Re: Invalid indexes should not consume update overhead

On Sunday, July 17, 2016, Peter Geoghegan <pg@heroku.com> wrote:

On Sun, Jul 17, 2016 at 4:41 AM, Tomasz Ostrowski
<tometzky+pg@ato.waw.pl> wrote:

That wouldn't solve my problem, which is that I need a way to disable
indexes before large update. I believe (but I'm not sure) that Oracle has
this concept:
ALTER INDEX [INDEX_NAME] UNUSABLE;

I think that this must make the index unusable to the optimizer. The
idea being that you can see the impact of dropping the index without
actually doing so, reserving the ability to back out (mark the index
usable once more rather than actually dropping it) if it turns out
that the index is of some use.

If it simply made the index unusable while removing any ongoing
obligation for writes to maintain the index, then what's the point in
supporting this at all? You need to be able to mark it usable again.

--
Peter Geoghegan

For example, in SQL Server you can "alter index disable" if you are about
to do a lot of bulk operations. But there is no "re-enable"; instead you
have to "alter index rebuild" because, as has been said on this thread, you
don't know what has changed since the disable.

Basically this is very similar to dropping and recreating indexes around
bulk loads/updates.
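
Roughly (a sketch of the SQL Server syntax from memory; the index and table
names are made up):

    ALTER INDEX ix_orders_customer ON dbo.orders DISABLE;
    -- ... bulk load / update ...
    ALTER INDEX ix_orders_customer ON dbo.orders REBUILD;
    -- or rebuild every index on the table in one go:
    ALTER INDEX ALL ON dbo.orders REBUILD;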

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

--
--
David Rader
davidr@openscg.com

#9 Peter Geoghegan
pg@heroku.com
In reply to: Rader, David (#8)
Re: Invalid indexes should not consume update overhead

On Sun, Jul 17, 2016 at 1:42 PM, Rader, David <davidr@openscg.com> wrote:

For example, in SQL Server you can "alter index disable". If you are about
to do a lot of bulk operations. But there is no "re-enable"; instead you
have to "alter index rebuild" because as has been said on this thread you
don't know what has changed since the disable.

Basically this is very similar to dropping and recreating indexes around
bulk loads/updates.

That seems pretty pointless. Why not actually drop the index, then?

The only reason I can think of is that there is value in representing
that indexes should continue to have optimizer statistics (that would
happen for expression indexes in Postgres) without actually paying for
the ongoing maintenance of the index during write statements. Even
that seems like kind of a stretch, though.

--
Peter Geoghegan

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#10 Rader, David
davidr@openscg.com
In reply to: Peter Geoghegan (#9)
Re: Invalid indexes should not consume update overhead

On Sunday, July 17, 2016, Peter Geoghegan <pg@heroku.com> wrote:

On Sun, Jul 17, 2016 at 1:42 PM, Rader, David <davidr@openscg.com> wrote:

For example, in SQL Server you can "alter index disable" if you are about
to do a lot of bulk operations. But there is no "re-enable"; instead you
have to "alter index rebuild" because as has been said on this thread you
don't know what has changed since the disable.

Basically this is very similar to dropping and recreating indexes around
bulk loads/updates.

That seems pretty pointless. Why not actually drop the index, then?

The only reason I can think of is that there is value in representing
that indexes should continue to have optimizer statistics (that would
happen for expression indexes in Postgres) without actually paying for
the ongoing maintenance of the index during write statements. Even
that seems like kind of a stretch, though.

--
Peter Geoghegan

There's some DBA benefit in that the index disable also disables
constraints and foreign keys that depend on the index. Instead of having
to drop and recreate dependent objects, you can leave all the definitions
in place but disabled. So it makes laziness easier.

Of course then you have to be sure that your data is right when you bulk
load since the engine is not enforcing it.

--
--
David Rader
davidr@openscg.com

#11 Jan Wieck
JanWieck@Yahoo.com
In reply to: Rader, David (#8)
Re: Invalid indexes should not consume update overhead

On Sun, Jul 17, 2016 at 4:42 PM, Rader, David <davidr@openscg.com> wrote:

On Sunday, July 17, 2016, Peter Geoghegan <pg@heroku.com> wrote:

On Sun, Jul 17, 2016 at 4:41 AM, Tomasz Ostrowski
<tometzky+pg@ato.waw.pl> wrote:

That wouldn't solve my problem, which is that I need a way to disable
indexes before a large update. I believe (but I'm not sure) that Oracle has
this concept:
ALTER INDEX [INDEX_NAME] UNUSABLE;

I think that this must make the index unusable to the optimizer. The
idea being that you can see the impact of dropping the index without
actually doing so, reserving the ability to back out (mark the index
usable once more rather than actually dropping it) if it turns out
that the index is of some use.

If it simply made the index unusable while removing any ongoing
obligation for writes to maintain the index, then what's the point in
supporting this at all? You need to be able to mark it usable again.

--
Peter Geoghegan

For example, in SQL Server you can "alter index disable". If you are about
to do a lot of bulk operations. But there is no "re-enable"; instead you
have to "alter index rebuild" because as has been said on this thread you
don't know what has changed since the disable.

Basically this is very similar to dropping and recreating indexes around
bulk loads/updates.

I would say that materially there is no difference. What would make a
difference would be if it were possible to ALTER TABLE DISABLE INDEXES and
then REINDEX to build them. That is, it would be different if rebuilding
multiple indexes at once had a substantial advantage, like let's say it
would scan the heap only once, building all the sort sets in parallel.
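
Something like this (purely hypothetical syntax - the DISABLE INDEXES form
does not exist today):

    ALTER TABLE big_table DISABLE INDEXES;   -- hypothetical
    -- ... bulk load / update ...
    REINDEX TABLE big_table;                 -- ideally one heap scan feeding
                                             -- all the index sorts in parallel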

Regards, Jan

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

--
--
David Rader
davidr@openscg.com

--
Jan Wieck
Senior Postgres Architect
http://pgblog.wi3ck.info

#12 Jan Wieck
JanWieck@Yahoo.com
In reply to: Rader, David (#10)
Re: Invalid indexes should not consume update overhead

On Sun, Jul 17, 2016 at 5:06 PM, Rader, David <davidr@openscg.com> wrote:

Of course then you have to be sure that your data is right when you bulk
load since the engine is not enforcing it.

Correct. But that won't be different from a NOT VALID constraint (see
https://www.postgresql.org/docs/9.5/static/sql-altertable.html).
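
For example (a sketch with made-up names):

    ALTER TABLE orders
      ADD CONSTRAINT orders_customer_fk
      FOREIGN KEY (customer_id) REFERENCES customers (id)
      NOT VALID;     -- existing rows are not checked yet
    -- later, when convenient:
    ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;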

Jan

--
Jan Wieck
Senior Postgres Architect
http://pgblog.wi3ck.info

#13 Rader, David
davidr@openscg.com
In reply to: Jan Wieck (#11)
Re: Invalid indexes should not consume update overhead

On Sunday, July 17, 2016, Jan Wieck <jan@wi3ck.info> wrote:

On Sun, Jul 17, 2016 at 4:42 PM, Rader, David <davidr@openscg.com> wrote:

On Sunday, July 17, 2016, Peter Geoghegan <pg@heroku.com> wrote:

On Sun, Jul 17, 2016 at 4:41 AM, Tomasz Ostrowski
<tometzky+pg@ato.waw.pl> wrote:

That wouldn't solve my problem, which is that I need a way to disable
indexes before a large update. I believe (but I'm not sure) that Oracle has
this concept:
ALTER INDEX [INDEX_NAME] UNUSABLE;

I think that this must make the index unusable to the optimizer. The
idea being that you can see the impact of dropping the index without
actually doing so, reserving the ability to back out (mark the index
usable once more rather than actually dropping it) if it turns out
that the index is of some use.

If it simply made the index unusable while removing any ongoing
obligation for writes to maintain the index, then what's the point in
supporting this at all? You need to be able to mark it usable again.

--
Peter Geoghegan

For example, in SQL Server you can "alter index disable". If you are
about to do a lot of bulk operations. But there is no "re-enable"; instead
you have to "alter index rebuild" because as has been said on this
thread you don't know what has changed since the disable.

Basically this is very similar to dropping and recreating indexes around
bulk loads/updates.

I would say that materially there is no difference. What would make a
difference would be
if it were possible to ALTER TABLE DISABLE INDEXES and then REINDEX to
build them.
That is, it would be different if rebuilding multiple indexes at once had
a substantial advantage,
like let's say it would scan the heap only once, building all the sort
sets in parallel.

Regards, Jan

Yes, parallel multi-index build would provide actual benefit. Otherwise
the disable/rebuild is just syntactic sugar that makes scripting bulk
operations simpler.

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

--
--
David Rader
davidr@openscg.com

--
Jan Wieck
Senior Postgres Architect
http://pgblog.wi3ck.info

--
--
David Rader
davidr@openscg.com

#14 Tomasz Ostrowski
tometzky+pg@ato.waw.pl
In reply to: Rader, David (#10)
Re: Invalid indexes should not consume update overhead

On 2016-07-17 23:06, Rader, David wrote:

That seems pretty pointless. Why not actually drop the index, then?

The only reason I can think of is that there is value in representing
that indexes should continue to have optimizer statistics (that would
happen for expression indexes in Postgres) without actually paying for
the ongoing maintenance of the index during write statements. Even
that seems like kind of a stretch, though.

There's some DBA benefit in that the index disable also disables
constraints and foreign keys that depend on the index. instead of
having to drop and recreate dependent objects you can leave all the
definitions in place but disabled. So it makes laziness easier.

Of course then you have to be sure that your data is right when you bulk
load since the engine is not enforcing it.

To make it clear - I don't postulate disabling indexes used for data
integrity - when an index can't be dropped because, for example, some
foreign key depends on it or the table's primary key is based on it, then
it should not be possible to turn it off.

Also I don't postulate turning them back on without doing a full reindex
- I just need to do this reindex for multiple indexes in parallel.

What I'd like to have isn't really different from just dropping the
indexes and recreating them after the bulk update. It's just that this
operation is not very safe:
- you have to save them somewhere else - using for example pg_dump,
- pg_dump is often not available or is the wrong version,
- when saving these indexes in some temporary directory you risk losing
them in case of a failure or crash,
- provided that you're trying to code some application upgrade script,
it's hard to tell what to do when a previous execution crashed - when you
save the indexes again you risk overwriting your save with empty or
incomplete data; when you don't, you can't be sure whether it was from
some previous execution and there have been schema changes since.

It's just that right now it's hard to prepare this for the support team in
a sufficiently reliable way, so that they can do this bulk update on their
own.
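
(One way to at least keep the saved definitions inside the database rather
than in a file - just a sketch with a made-up helper table, not something I
claim is bulletproof:)

    CREATE TABLE saved_indexdefs AS
      SELECT indexrelid::regclass::text AS index_name,
             pg_get_indexdef(indexrelid) AS definition
        FROM pg_index
       WHERE indrelid = 'big_table'::regclass;
    -- drop the indexes, run the bulk update, then recreate each
    -- definition from saved_indexdefs and drop the helper table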

And believe me - this trick works when you have a large table with a large
number (like tens) of indexes. Even more so if these indexes are on
Unicode text data.

--
Regards,
Tomasz "Tometzky" Ostrowski

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs