Reducing relation locking overhead
In looking at the current pgbench results, I notice that one
considerable contribution to LWLock contention is access to the
heavyweight-lock manager. Almost all of that comes from taking
relation-level locks, so we could cut down the contention if we could
reduce the number of distinct locks taken. (The lmgr code already
avoids touching shared memory again if, say, you request another
AccessShareLock on a relation you've already AccessShareLocked in
the current transaction.)
I see several places where we might possibly make this kind of
savings:
1. In an UPDATE or DELETE query, we take RowExclusiveLock on the target
relation, and also take AccessShareLock in the scan node that reads the
relation. It wouldn't take much extra code to make the scan nodes avoid
taking AccessShareLock if the relation is already opened as the query
result relation. AFAICS there is no downside to this; any lock that
would conflict with AccessShareLock will also conflict with
RowExclusiveLock, so there's no real need to hold both lock types.
(Moreover, we already make a similar kind of optimization when accessing
system catalogs: those routines only take one lock not two.)
Does anyone have an objection to it?
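[Editorial aside: the argument in #1 can be checked mechanically against the standard table-level lock conflict matrix. The sketch below is a toy Python model of that matrix as documented for PostgreSQL of this era, not the lmgr source; `is_redundant` is an invented helper name.]

```python
# Conflict sets for the eight table-level lock modes. Two modes
# conflict if either appears in the other's set (the matrix is
# symmetric). A mode never listed in its own set is self-compatible.
CONFLICTS = {
    "AccessShareLock":          {"AccessExclusiveLock"},
    "RowShareLock":             {"ExclusiveLock", "AccessExclusiveLock"},
    "RowExclusiveLock":         {"ShareLock", "ShareRowExclusiveLock",
                                 "ExclusiveLock", "AccessExclusiveLock"},
    "ShareUpdateExclusiveLock": {"ShareUpdateExclusiveLock", "ShareLock",
                                 "ShareRowExclusiveLock", "ExclusiveLock",
                                 "AccessExclusiveLock"},
    "ShareLock":                {"RowExclusiveLock", "ShareUpdateExclusiveLock",
                                 "ShareRowExclusiveLock", "ExclusiveLock",
                                 "AccessExclusiveLock"},
    "ShareRowExclusiveLock":    {"RowExclusiveLock", "ShareUpdateExclusiveLock",
                                 "ShareLock", "ShareRowExclusiveLock",
                                 "ExclusiveLock", "AccessExclusiveLock"},
    "ExclusiveLock":            {"RowShareLock", "RowExclusiveLock",
                                 "ShareUpdateExclusiveLock", "ShareLock",
                                 "ShareRowExclusiveLock", "ExclusiveLock",
                                 "AccessExclusiveLock"},
}
# AccessExclusiveLock conflicts with every mode, including itself.
CONFLICTS["AccessExclusiveLock"] = set(CONFLICTS) | {"AccessExclusiveLock"}

def is_redundant(weaker: str, stronger: str) -> bool:
    """A held `stronger` lock makes `weaker` redundant if everything
    that conflicts with `weaker` also conflicts with `stronger`."""
    return CONFLICTS[weaker] <= CONFLICTS[stronger]

print(is_redundant("AccessShareLock", "RowExclusiveLock"))  # True
```

The subset check is exactly the claim in #1: skipping the AccessShareLock acquisition when RowExclusiveLock is already held excludes no concurrent activity that the two locks together would have excluded.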
2. In the same way, an index that is used to scan the target relation
will be locked in both RowExclusiveLock and AccessShareLock modes
(corresponding to its roles as both source and update target). We could
avoid taking both lock types, but this would require some restructuring
because the code responsible for locking the indexes doesn't currently
have access to the EState to find out whether an index belongs to a
target relation. What I'm thinking here is that maybe there's no point
in maintaining the read versus write distinction at all for indexes ---
we could just take AccessShareLock in both cases. Any thoughts on the
pros and cons there?
3. We could also save some trips to the lock manager if we adopt the
same position for user indexes as we do for user tables, namely that
locks once taken are held till end of transaction. Any thoughts about
that?
4. The only reason we need to take relation-level locks on indexes
at all is to make the world safe for REINDEX being done concurrently
with read-only accesses to the table (that don't use the index being
reindexed). If we went back to requiring exclusive lock for reindex we
could forget all about both #2 and #3. Particularly for updates of
relations with lots of indexes, this could be a pretty significant win.
However we'd definitely be giving up something that was seen as a
feature at one point, so I'm not sold on this idea ... unless someone
can see a way to reduce the overhead without giving up concurrent
REINDEX.
regards, tom lane
4. The only reason we need to take relation-level locks on indexes
at all is to make the world safe for REINDEX being done concurrently
with read-only accesses to the table (that don't use the index being
reindexed). If we went back to requiring exclusive lock for reindex we
could forget all about both #2 and #3. Particularly for updates of
relations with lots of indexes, this could be a pretty significant win.
However we'd definitely be giving up something that was seen as a
feature at one point, so I'm not sold on this idea ... unless someone
can see a way to reduce the overhead without giving up concurrent
REINDEX.
Surely in the real world REINDEX is run so rarely compared to all those
other operations it'd be a win...
Chris
* Christopher Kings-Lynne (chriskl@familyhealth.com.au) wrote:
4. The only reason we need to take relation-level locks on indexes
at all is to make the world safe for REINDEX being done concurrently
with read-only accesses to the table (that don't use the index being
reindexed). If we went back to requiring exclusive lock for reindex we
could forget all about both #2 and #3. Particularly for updates of
relations with lots of indexes, this could be a pretty significant win.
However we'd definitely be giving up something that was seen as a
feature at one point, so I'm not sold on this idea ... unless someone
can see a way to reduce the overhead without giving up concurrent
REINDEX.

Surely in the real world REINDEX is run so rarely compared to all those
other operations it'd be a win...
Yeah, except that in the real world you don't want to bring everything
to a halt while you do a REINDEX. For my use cases it'd be fine but I
could see cases where it wouldn't be. Kinda makes me wish we could give
the user the option at runtime somehow but I'm not sure that could be
done...
Thanks,
Stephen
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
Surely in the real world REINDEX is run so rarely compared to all those other
operations it'd be a win...
It's not a question of frequency. We're not talking about something like a 10%
performance loss. You're talking about whether REINDEX is useful at all.
Consider installations where REINDEX will require shutting down
business-critical operations for days...
It was a *major* new feature that many people were waiting for when Oracle
finally implemented live CREATE INDEX and REINDEX. The ability to create
an index without blocking any operations on a table, even updates, was
absolutely critical for 24x7 operation.
--
greg
Greg Stark <gsstark@mit.edu> writes:
It was a *major* new feature that many people were waiting for when Oracle
finally implemented live CREATE INDEX and REINDEX. The ability to create
an index without blocking any operations on a table, even updates, was
absolutely critical for 24x7 operation.
Well, we're still not in *that* ballpark and I haven't seen any serious
proposals to make us so. How "absolutely critical" is it really?
Is REINDEX-in-parallel-with-reads-but-not-writes, which is what we
actually have at the moment, an "absolutely critical" facility?
regards, tom lane
On Fri, 2005-12-02 at 02:14 -0500, Tom Lane wrote:
Greg Stark <gsstark@mit.edu> writes:
It was a *major* new feature that many people were waiting for when Oracle
finally implemented live CREATE INDEX and REINDEX. The ability to create
an index without blocking any operations on a table, even updates, was
absolutely critical for 24x7 operation.

Well, we're still not in *that* ballpark and I haven't seen any serious
proposals to make us so. How "absolutely critical" is it really?
Is REINDEX-in-parallel-with-reads-but-not-writes, which is what we
actually have at the moment, an "absolutely critical" facility?
REINDEX isn't run that regularly, so it perhaps warrants special
attention. (I think there are other things we could do to avoid ever
needing to run a REINDEX.)

CREATE/DROP INDEX is important, however, since we may want to try out new
index choices without stopping access altogether. But we also want
the locking contention to be reduced....
I know at least one other RDBMS that uses optimistic locking when
creating indexes. It checks the table description, builds the index with
a read lock, then checks the table description again before attempting
to lock the catalog, "create" the index and then complete. There is a
risk of getting a "table restructured error" after the build is nearly
complete. If we did that, then we wouldn't need to lock the indexes
because you wouldn't be able to see an index until it was built. Doing
something similar might allow us to have online CREATEs yet without a
locking overhead.
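[Editorial aside: the optimistic scheme Simon describes could be sketched roughly as below. This is a toy Python model; the function names and the error class are invented for illustration and are not taken from any real system.]

```python
class TableRestructuredError(Exception):
    """Raised when the table definition changed during the build."""

def build_index_optimistically(get_table_descr, build, publish):
    # Snapshot the table definition before starting the long build.
    descr_before = get_table_descr()
    index = build()                      # runs under a read lock only
    # Re-check the definition just before taking the short catalog
    # lock to make the index visible; discard the work if it changed.
    if get_table_descr() != descr_before:
        raise TableRestructuredError("table restructured during index build")
    publish(index)
```

The whole expensive phase runs without an exclusive lock; the cost is that, as Simon notes, a concurrent ALTER can waste a nearly complete build.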
24x7 operation is actually fairly common. Maybe not with a strong SLA
for availability, but many websites and embedded apps are out there all
the time. The PostgreSQL claim to fame has concurrency at the top of the
list, so we should assume that in all we do.
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
On Fri, 2005-12-02 at 02:14 -0500, Tom Lane wrote:
Greg Stark <gsstark@mit.edu> writes:
It was a *major* new feature that many people were waiting for when Oracle
finally implemented live CREATE INDEX and REINDEX. The ability to create
an index without blocking any operations on a table, even updates, was
absolutely critical for 24x7 operation.

Well, we're still not in *that* ballpark and I haven't seen any serious
proposals to make us so. How "absolutely critical" is it really?
Is REINDEX-in-parallel-with-reads-but-not-writes, which is what we
actually have at the moment, an "absolutely critical" facility?
Alright, I'll grant Tom that "absolutely critical" was a bit of hyperbole.
I know at least one other RDBMS that uses optimistic locking when
creating indexes. It checks the table description, builds the index with
a read lock, then checks the table description again before attempting
to lock the catalog, "create" the index and then complete. There is a
risk of getting a "table restructured error" after the build is nearly
complete.
I suspect this comes out of a very different storage model from Postgres's.
Postgres would have no trouble building an index of the existing data using
only shared locks. The problem is that any newly inserted (or updated) records
could be missing from such an index.
To do it you would then have to gather up all those newly inserted records.
And of course while you're doing that new records could be inserted. And so
on. There's no guarantee it would ever finish, though I suppose you could
detect the situation if the size of the new batch wasn't converging to 0 and
throw an error.
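[Editorial aside: the convergence test Greg suggests might look like this toy sketch. All names are invented; a real implementation would live inside the index build code and the "batch" would come from scanning newly inserted tuples.]

```python
def catch_up(next_batch, index_insert, max_rounds=100):
    """Insert newly arrived records batch by batch; fail if the
    batch sizes stop shrinking toward zero."""
    prev_size = None
    for _ in range(max_rounds):
        batch = next_batch()             # records inserted since last pass
        if not batch:
            return                       # caught up: nothing new arrived
        for rec in batch:
            index_insert(rec)
        if prev_size is not None and len(batch) >= prev_size:
            raise RuntimeError("catch-up not converging; aborting index build")
        prev_size = len(batch)
    raise RuntimeError("catch-up did not finish within max_rounds")
```

A single non-shrinking batch triggering the error is deliberately simplistic; a production heuristic would presumably tolerate some noise before giving up.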
One optimization would be to have a flag that disabled the use of the FSM,
forcing all inserts to extend the table and allocate new tuples at the end.
This would at least limit the amount the index build would have to scan. The
index build could just do one-by-one insertions for the remaining tuples until
it catches up to the head.
At the end of the index build there's also a problem upgrading locks to put in
place the new index. That would create a deadlock risk. Perhaps that's where
the "table restructured error" comes up in these other databases?
24x7 operation is actually fairly common. Maybe not with a strong SLA
for availability, but many websites and embedded apps are out there all
the time. The PostgreSQL claim to fame has concurrency at the top of the
list, so we should assume that in all we do.
Off the top of my head I would put these items on the list of "necessary for
24x7 operation":
. (non-FULL) VACUUM
. Online/PITR backups
. Partitioned Tables
. online index builds
Of which Postgres has 2.5 out of 4. And most of those have come in just the
last 12 months or so. Doing pretty damned good.
--
greg
On 02 Dec 2005 15:25:58 -0500, Greg Stark <gsstark@mit.edu> wrote:
I suspect this comes out of a very different storage model from Postgres's.
Postgres would have no trouble building an index of the existing data using
only shared locks. The problem is that any newly inserted (or updated) records
could be missing from such an index.

To do it you would then have to gather up all those newly inserted records.
And of course while you're doing that new records could be inserted. And so
on. There's no guarantee it would ever finish, though I suppose you could
detect the situation if the size of the new batch wasn't converging to 0 and
throw an error.
After you're mostly caught up, change locking behavior to block
further updates while the final catchup happens. This could be driven
by a heuristic that says make up to N attempts to catch up without
blocking; after that just take a lock and finish the job. Presumably
the catchup would be short compared to the rest of the work.

Are there environments which could not tolerate even this minimal hit?
Probably, which leaves the choice of telling them 'don't reindex then'
or providing a knob which would tell it to never block (would just try
N times and then give up, failing the reindex).
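[Editorial aside: a rough model of that heuristic, with all names invented; `block_writers` stands in for the lock upgrade, and the knob is the `allow_block` flag.]

```python
def finish_index_build(next_batch, index_insert,
                       block_writers, unblock_writers,
                       attempts=3, allow_block=True):
    # Non-blocking passes: writers keep running while we catch up.
    for _ in range(attempts):
        batch = next_batch()
        if not batch:
            return                       # fully caught up without blocking
        for rec in batch:
            index_insert(rec)
    if not allow_block:
        raise RuntimeError("could not catch up without blocking; giving up")
    # Final pass: briefly block writers so no new records can appear.
    block_writers()
    try:
        for rec in next_batch():
            index_insert(rec)
    finally:
        unblock_writers()
```

The hope is that by the time the blocking pass starts, the remaining batch is tiny, so the write outage is much shorter than a full exclusive-lock REINDEX.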
Gregory Maxwell wrote:
On 02 Dec 2005 15:25:58 -0500, Greg Stark <gsstark@mit.edu> wrote:
I suspect this comes out of a very different storage model from Postgres's.
Postgres would have no trouble building an index of the existing data using
only shared locks. The problem is that any newly inserted (or updated) records
could be missing from such an index.

To do it you would then have to gather up all those newly inserted records.
And of course while you're doing that new records could be inserted. And so
on. There's no guarantee it would ever finish, though I suppose you could
detect the situation if the size of the new batch wasn't converging to 0 and
throw an error.

After you're mostly caught up, change locking behavior to block
further updates while the final catchup happens. This could be driven
by a heuristic that says make up to N attempts to catch up without
blocking; after that just take a lock and finish the job. Presumably
the catchup would be short compared to the rest of the work.
The problem is that you need to upgrade the lock at the end of the
operation. This is very deadlock prone, and likely to abort the whole
operation just when it's going to finish. Is this a showstopper? Tom
seems to think it is. I'm not sure anyone is going to be happy if they
find that their two-day reindex was aborted just when it was going to
finish.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, 2005-12-02 at 19:04 -0300, Alvaro Herrera wrote:
Gregory Maxwell wrote:
On 02 Dec 2005 15:25:58 -0500, Greg Stark <gsstark@mit.edu> wrote:
I suspect this comes out of a very different storage model from Postgres's.
Postgres would have no trouble building an index of the existing data using
only shared locks. The problem is that any newly inserted (or updated) records
could be missing from such an index.

To do it you would then have to gather up all those newly inserted records.
And of course while you're doing that new records could be inserted. And so
on.
CREATE INDEX uses SnapshotAny, so the scan that feeds the build could
easily include rows added after the CREATE INDEX started. When the scan
was exhausted we could mark that last TID and return to it after the
sort/build.
There's no guarantee it would ever finish, though I suppose you could
detect the situation if the size of the new batch wasn't converging to 0 and
throw an error.

After you're mostly caught up, change locking behavior to block
further updates while the final catchup happens. This could be driven
by a heuristic that says make up to N attempts to catch up without
blocking; after that just take a lock and finish the job. Presumably
the catchup would be short compared to the rest of the work.

The problem is that you need to upgrade the lock at the end of the
operation. This is very deadlock prone, and likely to abort the whole
operation just when it's going to finish. Is this a showstopper? Tom
seems to think it is. I'm not sure anyone is going to be happy if they
find that their two-day reindex was aborted just when it was going to
finish.
If that is the only objection against such a seriously useful feature,
then we should look at making some exceptions. (I understand the lock
upgrade issue).
Greg has come up with an exceptional idea here, so can we look deeper?
We already know others have done it.
What types of statement would cause the index build to fail? How else
can we prevent them from executing while the index is being built?
Best Regards, Simon Riggs
On 12/2/05, Alvaro Herrera wrote:
Gregory Maxwell wrote:
After you're mostly caught up, change locking behavior to block
further updates while the final catchup happens. This could be driven
by a heuristic that says make up to N attempts to catch up without
blocking; after that just take a lock and finish the job. Presumably
the catchup would be short compared to the rest of the work.

The problem is that you need to upgrade the lock at the end of the
operation. This is very deadlock prone, and likely to abort the whole
operation just when it's going to finish. Is this a showstopper? Tom
seems to think it is. I'm not sure anyone is going to be happy if they
find that their two-day reindex was aborted just when it was going to
finish.
How about the following scenario for building a new index:
- create an empty index
- flag it as incomplete
- commit it so it becomes visible to new transactions
- new transactions will update the index when inserting / updating
- the planner will not use it for queries because it is flagged as incomplete
- wait until the index is visible to all running transactions
- start a new seqscan and insert all records in the index
- commit
- remove the incomplete flag
Wouldn't this overcome the lock upgrade problem?
Jochem
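[Editorial aside: Jochem's sequence can be modeled in miniature as below. This is a toy simulation; the wait-until-visible-to-all-transactions step is reduced to a comment, and "table" is just a set of rows.]

```python
class Index:
    def __init__(self):
        self.incomplete = True           # planner must ignore it for now
        self.entries = set()

table = {1, 2, 3}                        # rows that predate the index

# Phase 1 (committed immediately): the index exists, flagged
# incomplete, so every later transaction sees it and maintains it.
idx = Index()

def insert_row(row):                     # what new transactions now do
    table.add(row)
    idx.entries.add(row)                 # maintain the visible index too

insert_row(4)                            # a concurrent write mid-build

# (Here the real scheme waits until the index is visible to all
# running transactions, so no writer can miss it.)

# Phase 2: back-fill every row that predates the index, then clear
# the flag. No lock upgrade was ever taken.
for row in set(table):
    idx.entries.add(row)
idx.incomplete = False
```

Because writers maintain the index from the moment it becomes visible, the back-fill scan can run under an ordinary read lock; the two sources of entries simply overlap, which is why this sidesteps the lock-upgrade deadlock.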
Simon Riggs <simon@2ndquadrant.com> writes:
CREATE INDEX uses SnapshotAny, so the scan that feeds the build could
easily include rows added after the CREATE INDEX started. When the scan
was exhausted we could mark that last TID and return to it after the
sort/build.
And do what? This has nothing to do with the fundamental problem of
never being able to catch up unless you can upgrade your lock to exclude
writes. What's worse, once you have excluded writes you have to rescan
the entire table to be sure you haven't missed anything. So in the
scenarios where this whole thing is actually interesting, ie enormous
tables, you're still talking about a fairly long interval with writes
locked out. Maybe not as long as a complete REINDEX, but long.
regards, tom lane
Jochem van Dieten <jochemd@gmail.com> writes:
How about the following scenario for building a new index:
- create an empty index
- flag it as incomplete
- commit it so it becomes visible to new transactions
- new transactions will update the index when inserting / updating
- the planner will not use it for queries because it is flagged as incomplete
- wait until the index is visible to all running transactions
- start a new seqscan and insert all records in the index
- commit
- remove the incomplete flag
Wouldn't this overcome the lock upgrade problem?
Doesn't really solve the problem for REINDEX, though. Presumably, the
reason that you are REINDEXing is that you would like to defragment the
existing index. Well, that requires collecting all the index entries
and sorting them. The above method is not going to produce a nicely
sorted index; whatever entries get made on-the-fly during the first
stage are going to determine the index shape.
This same problem applies to the build-lock-catchup paradigm, although
less severely since you can hope that the entries to be added on at the
end are only a small part of the total and will fit in the excess space
that you leave in the index leaf pages. If you spend too long catching
up, though (as in the multiple-pass ideas that various people were
advocating), you'll end up with an index messy enough that it's
questionable why you bothered.
regards, tom lane
On Fri, 2005-12-02 at 17:45 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
CREATE INDEX uses SnapshotAny, so the scan that feeds the build could
easily include rows added after the CREATE INDEX started. When the scan
was exhausted we could mark that last TID and return to it after the
sort/build.

And do what? This has nothing to do with the fundamental problem of
never being able to catch up unless you can upgrade your lock to exclude
writes. What's worse, once you have excluded writes you have to rescan
the entire table to be sure you haven't missed anything. So in the
scenarios where this whole thing is actually interesting, ie enormous
tables, you're still talking about a fairly long interval with writes
locked out. Maybe not as long as a complete REINDEX, but long.
Those are technical difficulties that I believe can be solved.
Greg was right: online index builds are very useful (for us).
Best Regards, Simon Riggs
On Friday 02 December 2005 09:53, Simon Riggs wrote:
On Fri, 2005-12-02 at 02:14 -0500, Tom Lane wrote:
Greg Stark <gsstark@mit.edu> writes:
It was a *major* new feature that many people were waiting for when
Oracle finally implemented live CREATE INDEX and REINDEX. The ability
to create an index without blocking any operations on a table, even
updates, was absolutely critical for 24x7 operation.

Well, we're still not in *that* ballpark and I haven't seen any serious
proposals to make us so. How "absolutely critical" is it really?
Is REINDEX-in-parallel-with-reads-but-not-writes, which is what we
actually have at the moment, an "absolutely critical" facility?

REINDEX isn't run that regularly, so it perhaps warrants special
attention. (I think there are other things we could do to avoid ever
needing to run a REINDEX.)

CREATE/DROP INDEX is important, however, since we may want to try out new
index choices without stopping access altogether. But we also want
the locking contention to be reduced....
Just thought I'd toss in this random data point... I know I still have at
least one 7.3 system running where reindexes are a part of the regular
routine and the ability to query against the table simultaneously is
certainly approaching "absolutely critical" territory. Hoping to get that
system upgraded by the end of the month, at which point the frequency of
reindex will surely decrease, but I'm not sure it's going to go away
completely. I could probably get by at that point with DROP/CREATE INDEX
but it wouldn't be my preferred way to do it.
--
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
Tom Lane <tgl@sss.pgh.pa.us> writes:
What's worse, once you have excluded writes you have to rescan the entire
table to be sure you haven't missed anything. So in the scenarios where this
whole thing is actually interesting, ie enormous tables, you're still
talking about a fairly long interval with writes locked out. Maybe not as
long as a complete REINDEX, but long.
I was thinking you would set a flag to disable use of the FSM for
inserts/updates while the reindex was running. So you would know where to find
the new tuples, at the end of the table after the last tuple you read.
--
greg
Alvaro Herrera <alvherre@commandprompt.com> writes:
The problem is that you need to upgrade the lock at the end of the
operation. This is very deadlock prone, and likely to abort the whole
operation just when it's going to finish. Is this a showstopper? Tom
seems to think it is. I'm not sure anyone is going to be happy if they
find that their two-day reindex was aborted just when it was going to
finish.
How likely is this really to be a problem in this particular case? Obviously
if two people try to reindex the same index they'll have a problem, but that's
not really a problem. (Postgres should probably do something to block that up
front rather than wait until the end to fail.)
Other than that case is there any other case the reindex could deadlock with?
I'm a bit hampered thinking about this because I'm not really sure exactly
what locks a reindex needs and what else takes those locks.
--
greg
Greg Stark wrote:
Tom Lane <tgl@sss.pgh.pa.us> writes:
What's worse, once you have excluded writes you have to rescan the entire
table to be sure you haven't missed anything. So in the scenarios where this
whole thing is actually interesting, ie enormous tables, you're still
talking about a fairly long interval with writes locked out. Maybe not as
long as a complete REINDEX, but long.

I was thinking you would set a flag to disable use of the FSM for
inserts/updates while the reindex was running. So you would know where
to find the new tuples, at the end of the table after the last tuple you
read.
If REINDEX works by seqscanning the table then the inclusion of new
tuples would happen for free if you turn off the FSM before beginning
the REINDEX operation -- you're guaranteed to see them last. But that
only works if REINDEX behaves this way.
Then it's a question of what to do with in-flight updates at the time
the REINDEX hits the end of the table.
Even if REINDEX hits the table in non-sequential order, turning off
the FSM should still work. REINDEX wouldn't need to acquire any
additional locks until after it has scanned the appended area. So the
way I (perhaps naively) envision it working is:
1. Acquire read lock on the table
2. Turn off FSM
3. Note the location of the end of the table
4. Release read lock on the table
5. Perform REINDEX operation
6. Read and index the bit of the table starting with the location
noted previously.
7. Note new end of table
8. Acquire read lock on the table
9. Scan any entries that have been appended past new end of table.
10. Release read lock on table
11. Turn on FSM
In the above for large relations, the bulk of the REINDEX should
happen without any locks being held by the REINDEX operation. For
small tables (where the amount of new insert activity can be a large
percentage of the total table size), it would almost certainly be more
efficient to just take a read lock for the whole operation. So it
might be wise to set up some sort of threshold, and to take a read
lock during the whole operation if the table size is smaller than the
threshold.
The reason the sequence I enumerate above involves taking any locks at
all is to avoid the issues that Tom brought up about having to rescan
the entire table to make sure nothing gets missed, to avoid possible
race conditions between steps 2 and 3, and to allow step 9 to
definitively complete, since otherwise in-flight updates would be
missed.
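[Editorial aside: the enumerated steps above can be modeled in miniature as follows. This is a toy simulation with all names invented; `read_lock` is a placeholder that doesn't actually block anything, the FSM is reduced to a flag, and the "index" is just a set.]

```python
from contextlib import contextmanager

class Table:
    def __init__(self, rows):
        self.rows = list(rows)
        self.fsm_enabled = True
    def insert(self, row):
        # With the FSM off, every insert extends the heap at the end,
        # so new tuples are guaranteed to land past any noted mark.
        self.rows.append(row)

@contextmanager
def read_lock():
    yield                                # placeholder: would block writers

def reindex_online(table, during_build=lambda: None):
    with read_lock():                    # steps 1-4: fence the start point
        table.fsm_enabled = False
        mark = len(table.rows)
    index = set(table.rows[:mark])       # step 5: bulk build, no lock held
    during_build()                       # concurrent inserts land past mark
    index.update(table.rows[mark:])      # step 6: catch up from the mark
    mark2 = len(table.rows)              # step 7: note the new end
    with read_lock():                    # steps 8-10: final locked pass
        index.update(table.rows[mark2:])
        table.fsm_enabled = True         # step 11
    return index
```

The key property being modeled is that with the FSM off, each catch-up pass only has to read from the previously noted end-of-table mark, so the only lock-holding scan is over whatever arrived during step 6.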
In the context of the original discussion (reduction of lock
acquisition), REINDEX isn't a common operation even if it is a
critical one, so acquisition of more than the usual number of locks
here shouldn't be a big deal.
--
Kevin Brown kevin@sysexperts.com
Greg Stark <gsstark@mit.edu> writes:
Tom Lane <tgl@sss.pgh.pa.us> writes:
What's worse, once you have excluded writes you have to rescan the entire
table to be sure you haven't missed anything.
I was thinking you would set a flag to disable use of the FSM for
inserts/updates while the reindex was running. So you would know where
to find the new tuples, at the end of the table after the last tuple
you read.
There are paths that do not consult the FSM, eg update where the new
tuple fits on the same page as the old, or where the backend has already
been given a target page by the FSM. And this "set a flag" idea in
itself has enormous costs, because the flag will become a central point
of contention: it *must* be touched by every single tuple insertion.
regards, tom lane