ALTER TABLE lock strength reduction patch is unsafe

Started by Tom Lane · almost 15 years ago · 101 messages · pgsql-hackers
#1 Tom Lane
tgl@sss.pgh.pa.us

If you set up a pgbench test case that hits the database with a lot of
concurrent selects and non-exclusive-locking ALTER TABLEs, 9.1 soon
falls over. For example:

$ cat foo.script
alter table pgbench_accounts set (fillfactor = 100);
SELECT abalance FROM pgbench_accounts WHERE aid = 525212;

$ createdb bench
$ pgbench -i -s 10 bench
...
$ pgbench -c 50 -t 1000000 -f foo.script bench
starting vacuum...end.
Client 10 aborted in state 0: ERROR: relation "pgbench_accounts" does not exist
Client 5 aborted in state 1: ERROR: cache lookup failed for relation 46260
Client 44 aborted in state 0: ERROR: relation "pgbench_accounts" does not exist
Client 3 aborted in state 1: ERROR: relation "pgbench_accounts" does not exist
LINE 1: SELECT abalance FROM pgbench_accounts WHERE aid = 525212;
^
Client 45 aborted in state 1: ERROR: could not open relation with OID 46260
LINE 1: SELECT abalance FROM pgbench_accounts WHERE aid = 525212;
^
Client 15 aborted in state 1: ERROR: cache lookup failed for relation 46260
Client 34 aborted in state 1: ERROR: could not open relation with OID 46260
LINE 1: SELECT abalance FROM pgbench_accounts WHERE aid = 525212;
^
Client 43 aborted in state 1: ERROR: cache lookup failed for relation 46260
Client 49 aborted in state 1: ERROR: relation "pgbench_accounts" does not exist
LINE 1: SELECT abalance FROM pgbench_accounts WHERE aid = 525212;
^
Client 12 aborted in state 0: ERROR: relation "pgbench_accounts" does not exist
Client 23 aborted in state 0: ERROR: relation "pgbench_accounts" does not exist
Client 14 aborted in state 0: ERROR: relation "pgbench_accounts" does not exist
Client 6 aborted in state 1: ERROR: could not open relation with OID 46260
LINE 1: SELECT abalance FROM pgbench_accounts WHERE aid = 525212;
^
Client 11 aborted in state 1: ERROR: could not open relation with OID 46260
LINE 1: SELECT abalance FROM pgbench_accounts WHERE aid = 525212;
^
Client 4 aborted in state 0: ERROR: relation "pgbench_accounts" does not exist
... etc etc ...

On my four-core workstation, the failures are infrequent at up to 30
clients but come pretty fast and furious at 50.

What is happening here is this:

1. Some backend commits an ALTER TABLE and sends out an sinval message.

2. In response, some other backend starts to reload its relcache entry
for pgbench_accounts when it begins its next command. It does an
indexscan with SnapshotNow on pg_class to find the updated pg_class row.

3. Meanwhile, some third backend commits another ALTER TABLE, updating
the pg_class row another time. Since we have removed the
AccessExclusiveLock that all variants of ALTER TABLE used to take, this
commit can happen while backend #2 is in process of scanning pg_class.

4. Backend #2 visits the new, about-to-be-committed version of
pgbench_accounts' pg_class row just before backend #3 commits.
It sees the row as not good and keeps scanning. By the time it
reaches the previous version of the row, however, backend #3
*has* committed. So that version isn't good according to SnapshotNow
either.

5. Thus, backend #2 fails to find any version of the pg_class row
that satisfies SnapshotNow, and it reports an error. Depending on just
when this happens during the cache load process, you can get any of
the errors displayed above, or probably some other ones.
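The failure mode in steps 1-5 can be modeled in a few lines. This is a toy Python sketch, not PostgreSQL internals: the visibility rule and the XID bookkeeping are simplified stand-ins for the real tqual.c logic, and all names are illustrative.

```python
# Toy model of the SnapshotNow race in steps 1-5. Transaction state is
# consulted at the moment each tuple version is *visited*, so a commit
# that lands between visits can make every version look "not good".

committed = set()  # XIDs that have committed

def snapshot_now_visible(xmin, xmax):
    """SnapshotNow-style check: xmin must be committed, xmax must not be."""
    if xmin not in committed:
        return False          # inserter still in progress -> not good
    if xmax is not None and xmax in committed:
        return False          # updater committed -> superseded, not good
    return True

# pg_class row for pgbench_accounts: the old version (being updated by
# XID 200) and the new, not-yet-committed version inserted by XID 200.
committed.add(100)                       # XID 100 created the old version
new_row = dict(xmin=200, xmax=None)      # backend #3's pending update
old_row = dict(xmin=100, xmax=200)

found = []
# Backend #2 visits the new version first: XID 200 has not committed yet.
if snapshot_now_visible(**new_row):
    found.append("new")
committed.add(200)                       # backend #3 commits *between visits*
# Now the old version's xmax (200) is committed, so it fails too.
if snapshot_now_visible(**old_row):
    found.append("old")

print(found)   # [] -- no version satisfies SnapshotNow: "cache lookup failed"
```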

The particular case I'm showing here only updates pg_class, but other
non-exclusive-lock variants of ALTER TABLE can probably provoke similar
failures with respect to other catalogs, leading to yet different
misbehaviors.

In typical cases where both versions of the row are on the same page,
the window for the concurrent commit to happen is very narrow --- that's
why you need so many clients to make it happen easily. With enough
clients there's a good chance of losing the CPU between tuple visits.
But of course Murphy's Law says this will happen in production
situations even if the load isn't so high.

I believe that this is fundamentally unavoidable so long as we use
SnapshotNow to read catalogs --- which is something we've talked about
changing, but it will require a pretty major R&D effort to make it
happen. In the meantime, we have to go back to using
AccessExclusiveLock for table alterations. It doesn't help to have
a lower lock level if that means that concurrent transactions will
unpredictably fail instead of waiting.

regards, tom lane

#2 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#1)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Thu, Jun 16, 2011 at 11:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

In typical cases where both versions of the row are on the same page,
the window for the concurrent commit to happen is very narrow --- that's
why you need so many clients to make it happen easily.  With enough
clients there's a good chance of losing the CPU between tuple visits.
But of course Murphy's Law says this will happen in production
situations even if the load isn't so high.

Thanks for doing the research on this. Much better we know now than
enter production like this.

I believe that this is fundamentally unavoidable so long as we use
SnapshotNow to read catalogs --- which is something we've talked about
changing, but it will require a pretty major R&D effort to make it
happen.  In the meantime, we have to go back to using
AccessExclusiveLock for table alterations.  It doesn't help to have
a lower lock level if that means that concurrent transactions will
unpredictably fail instead of waiting.

If we were to change ALTER TABLE so it takes a session lock rather
than a normal lock, then we can commit the change and then wait until
the relcache invalidations have been received before we release the
lock. That way we would be able to avoid the concurrent issues you
describe.

Or alternatively we could just re-scan if we can't find a valid row
when building the cache. We have time in the failure path...

Do you have stomach for any of this in 9.1?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#1)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Thu, Jun 16, 2011 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I believe that this is fundamentally unavoidable so long as we use
SnapshotNow to read catalogs --- which is something we've talked about
changing, but it will require a pretty major R&D effort to make it
happen.  In the meantime, we have to go back to using
AccessExclusiveLock for table alterations.  It doesn't help to have
a lower lock level if that means that concurrent transactions will
unpredictably fail instead of waiting.

Ouch.

I wonder if we could avoid this anomaly by taking a throwaway MVCC
snapshot at the beginning of each system catalog scan and using it
just for the duration of that scan. If nothing that has touched the
catalog commits while the scan is open, then this is logically
equivalent to SnapshotNow. If something does commit in mid-scan, then
we might not get the latest version of the row, but we should end up
with exactly one. If it's not the latest one, we'll do the rebuild
again upon seeing the next sinval message; in the meantime, the
version we're using mustn't be too intolerably bad or it was an error
not to use AccessExclusiveLock in the first place.

IIUC, the problem with this approach is not correctness but
performance. Taking snapshots is (currently) expensive.
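Continuing the toy model above (illustrative Python only, not the real snapshot machinery): judging every tuple against an MVCC-style snapshot taken at scan start means a mid-scan commit cannot flip visibility between visits, so exactly one version survives.

```python
# Toy model of a throwaway MVCC-style snapshot taken at scan start.
# XIDs in progress at snapshot time stay "in progress" for the whole
# scan, regardless of when they actually commit.

committed = {100}             # XID 100 committed long ago
in_progress = {200}           # XID 200 (the concurrent ALTER) still running

snap_in_progress = set(in_progress)   # snapshot taken at scan start

def mvcc_visible(xmin, xmax):
    def committed_in_snap(xid):
        return xid in committed and xid not in snap_in_progress
    if not committed_in_snap(xmin):
        return False
    if xmax is not None and committed_in_snap(xmax):
        return False
    return True

new_row = dict(xmin=200, xmax=None)
old_row = dict(xmin=100, xmax=200)

found = []
if mvcc_visible(**new_row):   # XID 200 was in progress at snapshot time
    found.append("new")
committed.add(200)            # XID 200 commits mid-scan -- too late to matter
in_progress.discard(200)
if mvcc_visible(**old_row):   # xmax 200 still counts as in progress
    found.append("old")

print(found)   # ['old'] -- exactly one version, possibly not the latest
```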

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#1)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Thu, Jun 16, 2011 at 11:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

2. In response, some other backend starts to reload its relcache entry
for pgbench_accounts when it begins its next command.  It does an
indexscan with SnapshotNow on pg_class to find the updated pg_class row.

3. Meanwhile, some third backend commits another ALTER TABLE, updating
the pg_class row another time.  Since we have removed the
AccessExclusiveLock that all variants of ALTER TABLE used to take, this
commit can happen while backend #2 is in process of scanning pg_class.

This part is the core of the problem:

We must not be able to update the catalog entry while a relcache
rebuild scan is in place.

So I'm prototyping something that allows
LockRelationDefinitionOid(targetRelId, ShareLock);

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#5 Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#4)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 9:32 AM, simon <simon@2ndquadrant.com> wrote:

On Thu, Jun 16, 2011 at 11:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

2. In response, some other backend starts to reload its relcache entry
for pgbench_accounts when it begins its next command.  It does an
indexscan with SnapshotNow on pg_class to find the updated pg_class row.

3. Meanwhile, some third backend commits another ALTER TABLE, updating
the pg_class row another time.  Since we have removed the
AccessExclusiveLock that all variants of ALTER TABLE used to take, this
commit can happen while backend #2 is in process of scanning pg_class.

This part is the core of the problem:

We must not be able to update the catalog entry while a relcache rebuild scan is in place.

So I'm prototyping something that allows
LockRelationDefinitionOid(targetRelId, ShareLock);

Similar to the way we lock a relation for extension, as a sub-lock of
the main relation lock.

Relcache rebuilds use a ShareLock, ALTER TABLE uses an ExclusiveLock.

I've written the patch, just need to test later today.... gotta step out now.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#3)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jun 16, 2011 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I believe that this is fundamentally unavoidable so long as we use
SnapshotNow to read catalogs --- which is something we've talked about
changing, but it will require a pretty major R&D effort to make it
happen.

Ouch.

I wonder if we could avoid this anomaly by taking a throwaway MVCC
snapshot at the beginning of each system catalog scan and using it
just for the duration of that scan. If nothing that has touched the
catalog commits while the scan is open, then this is logically
equivalent to SnapshotNow. If something does commit in mid-scan, then
we might not get the latest version of the row, but we should end up
with exactly one. If it's not the latest one, we'll do the rebuild
again upon seeing the next sinval message; in the meantime, the
version we're using mustn't be too intolerably bad or it was an error
not to use AccessExclusiveLock in the first place.

Yeah, this seems like a possibly workable direction to explore. I like
this better than what Simon is proposing, because it would fix the
generic issue for all types of catalog SnapshotNow scans.

IIUC, the problem with this approach is not correctness but
performance. Taking snapshots is (currently) expensive.

Yeah. After mulling it for awhile, what about this idea: we could
redefine SnapshotNow as a snapshot type that includes a list of
transactions-in-progress, somewhat like an MVCC snapshot, but we don't
fill that list from the PGPROC array. Instead, while running a scan
with SnapshotNow, anytime we determine that a particular XID is
still-in-progress, we add that XID to the snapshot's list.
Subsequently, the SnapshotNow code assumes that XID to be
still-in-progress without consulting its actual state. We reset the XID
list to empty when starting a new SnapshotNow scan. (We might be able
to do so less often than that, like only when we do
AcceptInvalidationMessages, but it's not clear to me that there's any
real benefit in hanging onto the state longer.)

This costs no performance; if anything it should be faster than now,
because we'll be replacing expensive transaction state probes with
relatively-cheap searches of an XID array that should almost always
be quite short.

With this approach, we would have no serialization anomalies from single
transactions committing while a scan is in progress. There could be
anomalies resulting from considering an earlier XID to be in-progress
while a later XID is considered committed (because we didn't observe
it until later). So far as I can see offhand, the impact of that would
be that there might be multiple versions of a tuple that are considered
good, but never that there would be no version considered good (so long
as the other XIDs simply updated the tuple and didn't delete it). I
think this would be all right, since the scan would just seize on the
first good version it finds. As you argue above, if that's not good
enough for our purposes then the updater(s) should have taken a stronger
lock.
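In the same toy-model terms as before (illustrative Python, not the real code): the sticky in-progress list means that once an XID has been judged in-progress during a scan, a later commit of that XID cannot invalidate versions visited afterwards, so at least one good version always remains.

```python
# Toy model of SnapshotNow plus a sticky per-scan list of XIDs already
# seen in progress. Once an XID is judged in-progress, it stays
# in-progress for the rest of the scan without re-checking its state.

committed = {100}
seen_in_progress = set()      # reset at the start of each SnapshotNow scan

def xid_committed(xid):
    if xid in seen_in_progress:
        return False          # assume still in progress; don't re-check
    if xid in committed:
        return True
    seen_in_progress.add(xid) # remember the in-progress verdict
    return False

def visible(xmin, xmax):
    if not xid_committed(xmin):
        return False
    if xmax is not None and xid_committed(xmax):
        return False
    return True

new_row = dict(xmin=200, xmax=None)
old_row = dict(xmin=100, xmax=200)

found = []
if visible(**new_row):        # XID 200 observed in progress, remembered
    found.append("new")
committed.add(200)            # commits between the two tuple visits
if visible(**old_row):        # xmax 200 is still treated as in progress
    found.append("old")

print(found)   # ['old'] -- at least one good version survives
```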

I am not, however, particularly pleased with the idea of trying to make
this work in 9.1. I still think that we should back off the attempt
to reduce lock strength in 9.1, and take it up for 9.2. We need to be
stabilizing 9.1 for release, not introducing new untested mechanisms in
it.

regards, tom lane

#7 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah, this seems like a possibly workable direction to explore.  I like
this better than what Simon is proposing, because it would fix the
generic issue for all types of catalog SnapshotNow scans.

It would also avoid adding more lock manager traffic which - as
recently discussed on the relevant threads - turns out to be a
significant performance bottleneck for us right now on some workloads.

IIUC, the problem with this approach is not correctness but
performance.  Taking snapshots is (currently) expensive.

Yeah.  After mulling it for awhile, what about this idea: we could
redefine SnapshotNow as a snapshot type that includes a list of
transactions-in-progress, somewhat like an MVCC snapshot, but we don't
fill that list from the PGPROC array.  Instead, while running a scan
with SnapshotNow, anytime we determine that a particular XID is
still-in-progress, we add that XID to the snapshot's list.
Subsequently, the SnapshotNow code assumes that XID to be
still-in-progress without consulting its actual state.  We reset the XID
list to empty when starting a new SnapshotNow scan.  (We might be able
to do so less often than that, like only when we do
AcceptInvalidationMessages, but it's not clear to me that there's any
real benefit in hanging onto the state longer.)

I think that something like that might possibly work, but what if the
XID array overflows?

A while back I proposed the idea of a "lazy" snapshot, by which I had
in mind something similar to what you are suggesting but different in
detail. Initially, when asked to acquire a snapshot, the snapshot
manager acknowledges having taken one but does not actually do any
work. As long as it sees only XIDs that either precede the oldest XID
still running anywhere in the cluster, or have aborted, it can provide
answers that are 100% correct without any further data. If it ever
sees a newer, non-aborted XID then it goes and really gets an MVCC
snapshot at that point, which it can use from that point onward. I
think that it might be possible to make such a system work even for
MVCC snapshots generally, but even if not, it might be sufficient for
this purpose. Unlike your approach, it would avoid both the "see no
rows" and the "see multiple rows" cases, which might be thought an
advantage.
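A minimal sketch of that lazy scheme, in the same illustrative Python as the earlier toy models (the class and method names are invented for this sketch, not PostgreSQL APIs): XIDs older than everything running are answered for free, and the first "interesting" XID forces a real snapshot that is then used for the rest of the scan.

```python
# Toy model of a lazy snapshot: pretend to take one, and only
# materialize a real in-progress set when an XID is seen that is not
# older than the oldest running XID and not aborted.

class TxManager:
    def __init__(self):
        self.committed = set()
        self.in_progress = set()
        self.aborted = set()
    def oldest_running(self):
        return min(self.in_progress, default=float("inf"))
    def take_mvcc_snapshot(self):
        return frozenset(self.in_progress)   # XIDs to treat as in progress

class LazySnapshot:
    def __init__(self, tx):
        self.tx = tx
        self.snap = None
        self.materializations = 0
    def xid_committed(self, xid):
        if xid in self.tx.aborted:
            return False
        if self.snap is not None:            # already materialized: use it
            return xid in self.tx.committed and xid not in self.snap
        if xid < self.tx.oldest_running():
            return True                      # old committed XID: free answer
        self.snap = self.tx.take_mvcc_snapshot()  # first recent XID seen
        self.materializations += 1
        return xid in self.tx.committed and xid not in self.snap

tx = TxManager()
tx.committed.update({100, 150})
tx.in_progress.add(200)

snap = LazySnapshot(tx)
assert snap.xid_committed(100)        # answered without materializing
assert snap.materializations == 0
assert not snap.xid_committed(200)    # forces a real snapshot
assert snap.materializations == 1
tx.committed.add(200)
tx.in_progress.discard(200)
assert not snap.xid_committed(200)    # still invisible under that snapshot
print("lazy snapshot materialized once")
```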

I am not, however, particularly pleased with the idea of trying to make
this work in 9.1.  I still think that we should back off the attempt
to reduce lock strength in 9.1, and take it up for 9.2.  We need to be
stabilizing 9.1 for release, not introducing new untested mechanisms in
it.

I like this feature a lot, but it's hard to imagine that any of the
fixes anyone has so far suggested can be implemented without
collateral damage. Nor is there any certainty that this is the last
bug.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#7)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Jun 17, 2011 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah. After mulling it for awhile, what about this idea: we could
redefine SnapshotNow as a snapshot type that includes a list of
transactions-in-progress, somewhat like an MVCC snapshot, but we don't
fill that list from the PGPROC array. Instead, while running a scan
with SnapshotNow, anytime we determine that a particular XID is
still-in-progress, we add that XID to the snapshot's list.

I think that something like that might possibly work, but what if the
XID array overflows?

Well, you repalloc it bigger. In either this idea or yours below,
I fear SnapshotNow snaps will have to become dynamically-allocated
structures instead of being simple references to a shared constant
object. (This is because we can sometimes do a SnapshotNow scan when
another one is already in progress, and we couldn't let the inner one
change the outer one's state.) That's not really a performance problem;
one more palloc to do a catalog scan is a non-issue. But it is likely
to be a large notational change compared to what we've got now.

A while back I proposed the idea of a "lazy" snapshot, by which I had
in mind something similar to what you are suggesting but different in
detail. Initially, when asked to acquire a snapshot, the snapshot
manager acknowledges having taken one but does not actually do any
work. As long as it sees only XIDs that either precede the oldest XID
still running anywhere in the cluster, or have aborted, it can provide
answers that are 100% correct without any further data. If it ever
sees a newer, non-aborted XID then it goes and really gets an MVCC
snapshot at that point, which it can use from that point onward. I
think that it might be possible to make such a system work even for
MVCC snapshots generally, but even if not, it might be sufficient for
this purpose. Unlike your approach, it would avoid both the "see no
rows" and the "see multiple rows" cases, which might be thought an
advantage.

Hmm, yeah, I think this idea is probably better than mine, just because
of the less dubious semantics. I don't see how you'd make it work for
generic MVCC scans, because the behavior will be "the database state as
of some hard-to-predict time after the scan starts", which is not what
we want for MVCC. But it ought to be fine for SnapshotNow.

regards, tom lane

#9 Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#7)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 7:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jun 17, 2011 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah, this seems like a possibly workable direction to explore.  I like
this better than what Simon is proposing, because it would fix the
generic issue for all types of catalog SnapshotNow scans.

It would also avoid adding more lock manager traffic which - as
recently discussed on the relevant threads - turns out to be a
significant performance bottleneck for us right now on some workloads.

I am not, however, particularly pleased with the idea of trying to make
this work in 9.1.  I still think that we should back off the attempt
to reduce lock strength in 9.1, and take it up for 9.2.  We need to be
stabilizing 9.1 for release, not introducing new untested mechanisms in
it.

I like this feature a lot, but it's hard to imagine that any of the
fixes anyone has so far suggested can be implemented without
collateral damage.  Nor is there any certainty that this is the last
bug.

Not so. The extra locking would only occur on the first lock
acquisition after DDL operations occur. If that was common then your
other performance patch would not be an effective optimisation. There
is no additional locking from what I've proposed in the common code
path - that's why we have a relcache.

Any effects from the additional locking will only be felt by people
issuing a stream of DDL statements against a table. Even assuming
there are some effects of real note.

So there is no "collateral damage" and additional locking is a viable
solution for 9.1.

It's possible that we may have a better solution in 9.2+, but then
we've said that before and seen it never happen, many times.

Having spent a few hours mulling through this, I think there is a
reasonable solution for 9.1 and I continue to work on it.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#10 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#8)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 2:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hmm, yeah, I think this idea is probably better than mine, just because
of the less dubious semantics.  I don't see how you'd make it work for
generic MVCC scans, because the behavior will be "the database state as
of some hard-to-predict time after the scan starts", which is not what
we want for MVCC.  But it ought to be fine for SnapshotNow.

Department of second thoughts: I think I see a problem.

Suppose we have a tuple that has not been updated for a long time.
Its XMIN is committed and all-visible, and its XMAX is invalid. As
we're scanning the table (transaction T1), we see that tuple and say,
oh, it's visible. Now, another transaction (T2) begins, updates the
tuple, and commits. Our scan then reaches the page where the new
tuple is located, and says, oh, this is recent, I'd better take a real
snapshot. Of course, the new snapshot can see the new version of the
tuple, too. Of course, if T1 had taken its snapshot before starting
the scan, the second tuple would have been invisible. But since we
didn't take it until later, after T2 had already committed, we see a
duplicate.

That's still no worse than your idea, which rests on the theory that
duplicates don't matter anyway, but the case for it being better is a
lot thinner. I'd sure prefer something that had less crazy semantics
than either of these ideas, if we can think of something.
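The duplicate-row anomaly is easy to reproduce in the same toy-model style as before (illustrative Python only; the cheap-path rule is a simplification of the lazy-snapshot idea): the old tuple is judged visible on the cheap path before the update, and the snapshot materialized later, after T2's commit, accepts the new version as well.

```python
# Toy model of the anomaly in #10: under a lazy snapshot, a tuple
# returned early in the scan can be joined by a second visible version
# if the updater commits before the real snapshot is materialized.

committed = {50}              # XID 50 created the old tuple, long ago
in_progress = {150}           # some unrelated long-running transaction
snap = None                   # lazily materialized in-progress set

def xid_committed(xid):
    global snap
    if snap is not None:
        return xid in committed and xid not in snap
    if xid < min(in_progress):
        return xid in committed          # cheap path, no snapshot needed
    snap = frozenset(in_progress)        # materialize on first recent XID
    return xid in committed and xid not in snap

def visible(xmin, xmax):
    if not xid_committed(xmin):
        return False
    return xmax is None or not xid_committed(xmax)

found = []
# Scan visits the old tuple first: xmin 50 committed, no updater yet.
if visible(xmin=50, xmax=None):
    found.append("old")
# T2 (XID 200) now updates the tuple and commits before the scan
# reaches the page holding the new version.
committed.add(200)
# The new version forces the snapshot -- but only *after* T2 committed,
# so the snapshot sees T2 as committed and accepts the new version too.
if visible(xmin=200, xmax=None):
    found.append("new")

print(found)   # ['old', 'new'] -- duplicate versions returned
```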

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11 Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#9)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 3:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Not so. The extra locking would only occur on the first lock
acquisition after DDL operations occur. If that was common then your
other performance patch would not be an effective optimisation. There
is no additional locking from what I've proposed in the common code
path - that's why we have a relcache.

The extra locking would also occur when *initially* building relcache
entries. In other words, this would increase - likely quite
significantly - the overhead of backend startup. It's not going to be
sufficient to do this just for pg_class; I think you'll have to do it
for pg_attribute, pg_attrdef, pg_constraint, pg_index, pg_trigger,
pg_rewrite, and maybe a few others I'm not thinking of right now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12 Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#11)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jun 17, 2011 at 3:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Not so. The extra locking would only occur on the first lock
acquisition after DDL operations occur. If that was common then your
other performance patch would not be an effective optimisation. There
is no additional locking from what I've proposed in the common code
path - that's why we have a relcache.

The extra locking would also occur when *initially* building relcache
entries.  In other words, this would increase - likely quite
significantly - the overhead of backend startup.  It's not going to be
sufficient to do this just for pg_class; I think you'll have to do it
for pg_attribute, pg_attrdef, pg_constraint, pg_index, pg_trigger,
pg_rewrite, and maybe a few others I'm not thinking of right now.

Nothing you say here is accurate, regrettably.

The "extra locking" would be one call to the lock manager per
relation. Taken in shared mode, so it doesn't block.

I see no reason at all to have separate locks for each catalog table,
since it's the relation lock that is the top level lock.

Locking is a very well known solution to such problems. We use it
everywhere and we can use it here, and now.

I think you'd better wait to see the patch.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#10)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> writes:

Department of second thoughts: I think I see a problem.

Um, yeah, so that doesn't really work any better than my idea.

On further reflection, there's a problem at a higher level than this
anyway. Even if we can get a single SnapshotNow scan to produce
guaranteed-self-consistent results, that doesn't ensure consistency
between the results of scans occurring serially. An example here is
ALTER COLUMN DROP DEFAULT, which is currently imagined to impact only
writers. However, suppose that a concurrent relcache load fetches the
pg_attribute row, notes that it has atthasdef = true, and then the ALTER
commits before we start to scan pg_attrdef. The consistency checks in
AttrDefaultFetch() will complain about a missing pg_attrdef entry, and
rightly so. We could lobotomize those checks, but it doesn't feel right
to do so; and anyway there may be other cases that are harder to kluge up.
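The cross-scan inconsistency can be shown even without any visibility machinery. In this toy Python sketch (the dict-based "catalogs" and field names are illustrative stand-ins, not real catalog layouts), each scan is internally consistent, yet the two scans straddle a commit and so disagree:

```python
# Toy model of the ALTER COLUMN DROP DEFAULT hazard: scan 1 reads
# pg_attribute, the ALTER commits, scan 2 reads pg_attrdef -- and the
# two results no longer agree.

# Catalog state before the ALTER commits:
pg_attribute = [{"attname": "abalance", "atthasdef": True}]
pg_attrdef = [{"adname": "abalance", "adsrc": "0"}]

# Scan 1: relcache load reads pg_attribute and remembers atthasdef.
atthasdef = pg_attribute[0]["atthasdef"]          # True

# A concurrent ALTER ... DROP DEFAULT commits between the two scans:
pg_attribute[0]["atthasdef"] = False
pg_attrdef.clear()

# Scan 2: an AttrDefaultFetch-style lookup now expects a pg_attrdef row.
defaults = [d for d in pg_attrdef if d["adname"] == "abalance"]
if atthasdef and not defaults:
    print("WARNING: missing pg_attrdef entry")    # the complaint Tom describes
```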

So really we need consistency across *at least* one entire relcache load
cycle. We could maybe arrange to take an MVCC snap (or some lighter
weight version of that) at the start, and use that for all the resulting
scans, but I think that would be notationally messy. It's not clear
that it'd solve everything anyhow. There are parts of a relcache entry
that we fetch only on-demand, so they are typically loaded later than
the core items, and probably couldn't use the same snapshot. Worse,
there are lots of places where we assume that use of catcache entries or
direct examination of the catalogs will yield results consistent with
the relcache.

I suspect these latter problems will impact Simon's idea as well.

regards, tom lane

#14 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#13)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 3:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Department of second thoughts: I think I see a problem.

Um, yeah, so that doesn't really work any better than my idea.

On further reflection, there's a problem at a higher level than this
anyway.  Even if we can get a single SnapshotNow scan to produce
guaranteed-self-consistent results, that doesn't ensure consistency
between the results of scans occurring serially.  An example here is
ALTER COLUMN DROP DEFAULT, which is currently imagined to impact only
writers.  However, suppose that a concurrent relcache load fetches the
pg_attribute row, notes that it has atthasdef = true, and then the ALTER
commits before we start to scan pg_attrdef.  The consistency checks in
AttrDefaultFetch() will complain about a missing pg_attrdef entry, and
rightly so.  We could lobotomize those checks, but it doesn't feel right
to do so; and anyway there may be other cases that are harder to kluge up.

So really we need consistency across *at least* one entire relcache load
cycle.  We could maybe arrange to take an MVCC snap (or some lighter
weight version of that) at the start, and use that for all the resulting
scans, but I think that would be notationally messy.  It's not clear
that it'd solve everything anyhow.  There are parts of a relcache entry
that we fetch only on-demand, so they are typically loaded later than
the core items, and probably couldn't use the same snapshot.  Worse,
there are lots of places where we assume that use of catcache entries or
direct examination of the catalogs will yield results consistent with
the relcache.

I suspect these latter problems will impact Simon's idea as well.

I suspect we're going to be told that they don't.

I suspect I'm not going to believe it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#6)
Re: ALTER TABLE lock strength reduction patch is unsafe

Excerpts from Tom Lane's message of Fri Jun 17 13:22:40 -0400 2011:

With this approach, we would have no serialization anomalies from single
transactions committing while a scan is in progress. There could be
anomalies resulting from considering an earlier XID to be in-progress
while a later XID is considered committed (because we didn't observe
it until later). So far as I can see offhand, the impact of that would
be that there might be multiple versions of a tuple that are considered
good, but never that there would be no version considered good (so long
as the other XIDs simply updated the tuple and didn't delete it). I
think this would be all right, since the scan would just seize on the
first good version it finds. As you argue above, if that's not good
enough for our purposes then the updater(s) should have taken a stronger
lock.

Hmm, would there be a problem if a scan on catalog A yields results from
supposedly-running transaction X but another scan on catalog B yields
result from transaction Y? (X != Y) For example, a scan on pg_class
says that there are N triggers but scanning pg_trigger says N-1?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#7)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> writes:

I like this feature a lot, but it's hard to imagine that any of the
fixes anyone has so far suggested can be implemented without
collateral damage. Nor is there any certainty that this is the last
bug.

And in fact, here's something else to worry about: consider pg_dump.
pg_dump is pretty heavily reliant on backend catalog-interpretation code
(such as ruleutils.c) that mostly runs on SnapshotNow time. But it also
does a fair amount of work on the basis of its own inspection of the
catalogs, which is done according to the serializable snapshot it gets
at the beginning of the dump run. If these two views of a table's
schema aren't consistent, you might get a pg_dump error, but it's at
least as likely that you'll get a silently incorrect dump.

pg_dump tries to minimize the risk by taking AccessShareLock right away
on each table it's going to dump. This is not perfect but it at least
results in a narrow window for conflicting table changes to occur.
However, that strategy has been blown out of the water by the ALTER
TABLE lock strength reduction. There is now a *very* wide window for
concurrent ALTERs to occur and possibly break the dump results.

As far as I can see, the only simple way to return pg_dump to its
previous level of safety while retaining this patch is to make it take
ShareUpdateExclusiveLocks, so that it will still block all forms of
ALTER TABLE. This is rather unpleasant, since it will also block
autovacuum for the duration of the dump.
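[Editorial note: the reasoning above follows directly from PostgreSQL's table-level lock conflict matrix. The abridged sketch below (RowShare/RowExclusive rows omitted) shows why AccessShareLock no longer blocks the weakened ALTER TABLE variants, which take ShareUpdateExclusiveLock, while holding ShareUpdateExclusiveLock still does — at the cost of also conflicting with (auto)vacuum, which takes the same lock.]

```python
# Abridged rows of PostgreSQL's table-level lock conflict matrix.
# AccessShareLock conflicts only with AccessExclusiveLock;
# ShareUpdateExclusiveLock is self-conflicting.
CONFLICTS = {
    "AccessShareLock": {"AccessExclusiveLock"},
    "ShareUpdateExclusiveLock": {
        "ShareUpdateExclusiveLock", "ShareLock",
        "ShareRowExclusiveLock", "ExclusiveLock", "AccessExclusiveLock",
    },
}

def conflicts(held, requested):
    return requested in CONFLICTS[held]

# ALTER TABLE ... SET (fillfactor) now takes only ShareUpdateExclusiveLock,
# so pg_dump's AccessShareLock no longer keeps it out:
print(conflicts("AccessShareLock", "ShareUpdateExclusiveLock"))         # False

# ... whereas taking ShareUpdateExclusiveLock would block it (and vacuum):
print(conflicts("ShareUpdateExclusiveLock", "ShareUpdateExclusiveLock"))  # True
```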

In the long run, we really ought to fix things so that ruleutils.c
runs on the transaction snapshot, but that's a massive rewrite that is
certainly not getting done for 9.1, and will likely result in
considerable code bloat :-(.

(BTW, I just noticed that dumpSchema does a pretty fair amount of work
before it gets around to calling getTables, which is where said locks
get taken. Seems like we'd better rearrange the order of operations
there...)

regards, tom lane

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#15)

Re: ALTER TABLE lock strength reduction patch is unsafe

Alvaro Herrera <alvherre@commandprompt.com> writes:

Hmm, would there be a problem if a scan on catalog A yields results from
supposedly-running transaction X but another scan on catalog B yields
result from transaction Y? (X != Y) For example, a scan on pg_class
says that there are N triggers but scanning pg_trigger says N-1?

Yeah, I came to that same conclusion downthread.

regards, tom lane

#18Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#17)
Re: ALTER TABLE lock strength reduction patch is unsafe

Excerpts from Tom Lane's message of Fri Jun 17 17:08:25 -0400 2011:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Hmm, would there be a problem if a scan on catalog A yields results from
supposedly-running transaction X but another scan on catalog B yields
result from transaction Y? (X != Y) For example, a scan on pg_class
says that there are N triggers but scanning pg_trigger says N-1?

Yeah, I came to that same conclusion downthread.

Something is seriously wrong with my email :-(

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#19Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#16)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 5:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

As far as I can see, the only simple way to return pg_dump to its
previous level of safety while retaining this patch is to make it take
ShareUpdateExclusiveLocks, so that it will still block all forms of
ALTER TABLE.  This is rather unpleasant, since it will also block
autovacuum for the duration of the dump.

I have been thinking for a while now that it would be sensible to make
vacuum use a different lock type, much as we do for relation
extension. DROP TABLE and CLUSTER and at least some forms of ALTER
TABLE and maybe a few other things like CREATE INDEX would need to
grab that lock in addition to the ones they already acquire, but a
whole lot of other things wouldn't. In particular, it's currently not
possible to lock a table against SELECT without also locking it
against VACUUM - and booting off any auto-vacuum worker that happens
to already be processing it. If you imagine a large table with a
bunch of short-duration exclusive locks, it's not too hard to see how
you can end up with VACUUM starvation.

But that's not something I want to do in 9.1, and I doubt it would
completely solve this problem anyway.
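[Editorial note: Robert's proposal — a separate lock tag for vacuum, analogous to the relation-extension lock — can be sketched as follows. This is a hypothetical model of the idea, not an implemented design; the operation names and tags are invented, and real lock conflicts also depend on modes, which this model collapses to exclusive-only.]

```python
# Hypothetical model: VACUUM takes its own "vacuum" lock tag instead of
# a lock on the relation tag, so ordinary table locks no longer starve it.
def locks_for(operation):
    # Which lock tags each operation would grab under the proposal.
    table = {
        "VACUUM": {("vacuum", "exclusive")},
        "LOCK TABLE ... ACCESS EXCLUSIVE": {("relation", "exclusive")},
        # Schema-destroying operations grab both tags:
        "DROP TABLE": {("relation", "exclusive"), ("vacuum", "exclusive")},
    }
    return table[operation]

def blocks(a, b):
    # In this simplified model (all modes exclusive), two operations
    # conflict iff they contend on a common lock tag.
    tags_a = {tag for tag, _ in locks_for(a)}
    tags_b = {tag for tag, _ in locks_for(b)}
    return bool(tags_a & tags_b)

# Locking a table against SELECT no longer cancels a running vacuum:
print(blocks("LOCK TABLE ... ACCESS EXCLUSIVE", "VACUUM"))  # False
# ... but DROP TABLE still waits for (or blocks) vacuum:
print(blocks("DROP TABLE", "VACUUM"))                       # True
```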

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#19)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> writes:

I have been thinking for a while now that it would be sensible to make
vacuum use a different lock type, much as we do for relation
extension.

Hmm. I had just been toying with the idea of introducing a new
user-visible locking level to allow separation of anti-vacuum locks from
anti-schema-alteration locks. But I think you're probably right that it
could be done as a specialized LockTag. That would make it not easily
user-accessible, but it's hard to think of reasons for users to lock out
vacuum anyway, unless they want to lock out everything via
AccessExclusiveLock.

... In particular, it's currently not
possible to lock a table against SELECT without also locking it
against VACUUM

Well, it still wouldn't be, since AccessExclusiveLock certainly had
better lock out vacuum. As said above, I think the important thing
is to distinguish vacuum from schema changes.

But that's not something I want to do in 9.1,

Definitely.

regards, tom lane

[Messages #21–#101 collapsed in the archive view.]