ALTER TABLE lock strength reduction patch is unsafe

Started by Tom Lane over 14 years ago, 101 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

If you set up a pgbench test case that hits the database with a lot of
concurrent selects and non-exclusive-locking ALTER TABLEs, 9.1 soon
falls over. For example:

$ cat foo.script
alter table pgbench_accounts set (fillfactor = 100);
SELECT abalance FROM pgbench_accounts WHERE aid = 525212;

$ createdb bench
$ pgbench -i -s 10 bench
...
$ pgbench -c 50 -t 1000000 -f foo.script bench
starting vacuum...end.
Client 10 aborted in state 0: ERROR: relation "pgbench_accounts" does not exist
Client 5 aborted in state 1: ERROR: cache lookup failed for relation 46260
Client 44 aborted in state 0: ERROR: relation "pgbench_accounts" does not exist
Client 3 aborted in state 1: ERROR: relation "pgbench_accounts" does not exist
LINE 1: SELECT abalance FROM pgbench_accounts WHERE aid = 525212;
^
Client 45 aborted in state 1: ERROR: could not open relation with OID 46260
LINE 1: SELECT abalance FROM pgbench_accounts WHERE aid = 525212;
^
Client 15 aborted in state 1: ERROR: cache lookup failed for relation 46260
Client 34 aborted in state 1: ERROR: could not open relation with OID 46260
LINE 1: SELECT abalance FROM pgbench_accounts WHERE aid = 525212;
^
Client 43 aborted in state 1: ERROR: cache lookup failed for relation 46260
Client 49 aborted in state 1: ERROR: relation "pgbench_accounts" does not exist
LINE 1: SELECT abalance FROM pgbench_accounts WHERE aid = 525212;
^
Client 12 aborted in state 0: ERROR: relation "pgbench_accounts" does not exist
Client 23 aborted in state 0: ERROR: relation "pgbench_accounts" does not exist
Client 14 aborted in state 0: ERROR: relation "pgbench_accounts" does not exist
Client 6 aborted in state 1: ERROR: could not open relation with OID 46260
LINE 1: SELECT abalance FROM pgbench_accounts WHERE aid = 525212;
^
Client 11 aborted in state 1: ERROR: could not open relation with OID 46260
LINE 1: SELECT abalance FROM pgbench_accounts WHERE aid = 525212;
^
Client 4 aborted in state 0: ERROR: relation "pgbench_accounts" does not exist
... etc etc ...

On my four-core workstation, the failures are infrequent at up to 30
clients but come pretty fast and furious at 50.

What is happening here is this:

1. Some backend commits an ALTER TABLE and sends out an sinval message.

2. In response, some other backend starts to reload its relcache entry
for pgbench_accounts when it begins its next command. It does an
indexscan with SnapshotNow on pg_class to find the updated pg_class row.

3. Meanwhile, some third backend commits another ALTER TABLE, updating
the pg_class row another time. Since we have removed the
AccessExclusiveLock that all variants of ALTER TABLE used to take, this
commit can happen while backend #2 is in process of scanning pg_class.

4. Backend #2 visits the new, about-to-be-committed version of
pgbench_accounts' pg_class row just before backend #3 commits.
It sees the row as not good and keeps scanning. By the time it
reaches the previous version of the row, however, backend #3
*has* committed. So that version isn't good according to SnapshotNow
either.

5. Thus, backend #2 fails to find any version of the pg_class row
that satisfies SnapshotNow, and it reports an error. Depending on just
when this happens during the cache load process, you can get any of
the errors displayed above, or probably some other ones.
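
Condensed down to the visibility tests involved, the failure in step 4
looks roughly like this (a sketch only, loosely following the
SnapshotNow rules in tqual.c; the timeline comments are the point, not
the exact code):

/*
 * Backend #2 scanning pg_class with SnapshotNow; backend #3 commits
 * its ALTER TABLE between the two visibility tests.  newtup/oldtup
 * stand for the tuple headers of the two row versions.
 */

/* t0: visit the NEW row version; its xmin is backend #3's XID */
if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(newtup)))
    return false;           /* not good: inserter still running */

/* t1: backend #3 commits here */

/* t2: visit the OLD row version; its xmax is backend #3's XID */
if (TransactionIdDidCommit(HeapTupleHeaderGetXmax(oldtup)))
    return false;           /* not good: deleter has committed */

/* net result: no version of the row satisfies SnapshotNow */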

The particular case I'm showing here only updates pg_class, but other
non-exclusive-lock variants of ALTER TABLE can probably provoke similar
failures with respect to other catalogs, leading to yet different
misbehaviors.

In typical cases where both versions of the row are on the same page,
the window for the concurrent commit to happen is very narrow --- that's
why you need so many clients to make it happen easily. With enough
clients there's a good chance of losing the CPU between tuple visits.
But of course Murphy's Law says this will happen in production
situations even if the load isn't so high.

I believe that this is fundamentally unavoidable so long as we use
SnapshotNow to read catalogs --- which is something we've talked about
changing, but it will require a pretty major R&D effort to make it
happen. In the meantime, we have to go back to using
AccessExclusiveLock for table alterations. It doesn't help to have
a lower lock level if that means that concurrent transactions will
unpredictably fail instead of waiting.

regards, tom lane

#2 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#1)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Thu, Jun 16, 2011 at 11:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

In typical cases where both versions of the row are on the same page,
the window for the concurrent commit to happen is very narrow --- that's
why you need so many clients to make it happen easily.  With enough
clients there's a good chance of losing the CPU between tuple visits.
But of course Murphy's Law says this will happen in production
situations even if the load isn't so high.

Thanks for doing the research on this. Much better we know now than
enter production like this.

I believe that this is fundamentally unavoidable so long as we use
SnapshotNow to read catalogs --- which is something we've talked about
changing, but it will require a pretty major R&D effort to make it
happen.  In the meantime, we have to go back to using
AccessExclusiveLock for table alterations.  It doesn't help to have
a lower lock level if that means that concurrent transactions will
unpredictably fail instead of waiting.

If we were to change ALTER TABLE so it takes a session lock rather
than a normal lock, then we can commit the change and then wait until
the relcache invalidations have been received before we release the
lock. That way we would be able to avoid the concurrent issues you
describe.

Or alternatively we could just re-scan if we can't find a valid row
when building the cache. We have time in the failure path...
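
For illustration, that retry might look something like this in
RelationBuildDesc() (a rough sketch; the loop placement and the idea
of bounding it are assumptions, not a worked patch):

/*
 * If the SnapshotNow scan of pg_class finds no visible row, assume we
 * raced with a concurrent update: absorb pending invalidations and
 * retry instead of failing.  A real patch would need to distinguish
 * "relation was dropped" from "we lost the race", or bound the loop.
 */
for (;;)
{
    pg_class_tuple = ScanPgRelation(targetRelId, true);
    if (HeapTupleIsValid(pg_class_tuple))
        break;
    AcceptInvalidationMessages();
    CHECK_FOR_INTERRUPTS();
}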

Do you have stomach for any of this in 9.1?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#1)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Thu, Jun 16, 2011 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I believe that this is fundamentally unavoidable so long as we use
SnapshotNow to read catalogs --- which is something we've talked about
changing, but it will require a pretty major R&D effort to make it
happen.  In the meantime, we have to go back to using
AccessExclusiveLock for table alterations.  It doesn't help to have
a lower lock level if that means that concurrent transactions will
unpredictably fail instead of waiting.

Ouch.

I wonder if we could avoid this anomaly by taking a throwaway MVCC
snapshot at the beginning of each system catalog scan and using it
just for the duration of that scan. If nothing that has touched the
catalog commits while the scan is open, then this is logically
equivalent to SnapshotNow. If something does commit in mid-scan, then
we might not get the latest version of the row, but we should end up
with exactly one. If it's not the latest one, we'll do the rebuild
again upon seeing the next sinval message; in the meantime, the
version we're using mustn't be too intolerably bad or it was an error
not to use AccessExclusiveLock in the first place.

IIUC, the problem with this approach is not correctness but
performance. Taking snapshots is (currently) expensive.
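
To illustrate the shape of it (a sketch only; the snapshot-manager and
systable calls shown do exist, but rewiring every catalog scan this
way is exactly the unfinished part, and pg_class_rel/key stand in for
the usual scan setup):

/* take a throwaway MVCC snapshot for the duration of one catalog scan */
Snapshot    snap = RegisterSnapshot(GetLatestSnapshot());  /* the expensive bit */
SysScanDesc scan;
HeapTuple   tup;

scan = systable_beginscan(pg_class_rel, ClassOidIndexId, true,
                          snap, 1, key);
while (HeapTupleIsValid(tup = systable_getnext(scan)))
{
    /* at most one version of each logical row is visible here */
}
systable_endscan(scan);
UnregisterSnapshot(snap);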

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#1)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Thu, Jun 16, 2011 at 11:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

2. In response, some other backend starts to reload its relcache entry
for pgbench_accounts when it begins its next command.  It does an
indexscan with SnapshotNow on pg_class to find the updated pg_class row.

3. Meanwhile, some third backend commits another ALTER TABLE, updating
the pg_class row another time.  Since we have removed the
AccessExclusiveLock that all variants of ALTER TABLE used to take, this
commit can happen while backend #2 is in process of scanning pg_class.

This part is the core of the problem:

We must not be able to update the catalog entry while a relcache
rebuild scan is in place.

So I'm prototyping something that allows
LockRelationDefinitionOid(targetRelId, ShareLock);

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#5 Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#4)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 9:32 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, Jun 16, 2011 at 11:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

2. In response, some other backend starts to reload its relcache entry
for pgbench_accounts when it begins its next command.  It does an
indexscan with SnapshotNow on pg_class to find the updated pg_class row.

3. Meanwhile, some third backend commits another ALTER TABLE, updating
the pg_class row another time.  Since we have removed the
AccessExclusiveLock that all variants of ALTER TABLE used to take, this
commit can happen while backend #2 is in process of scanning pg_class.

This part is the core of the problem:

We must not be able to update the catalog entry while a relcache rebuild scan is in place.

So I'm prototyping something that allows
LockRelationDefinitionOid(targetRelId, ShareLock);

Similar to the way we lock a relation for extension, as a sub-lock of
the main relation lock.

Relcache rebuilds use a ShareLock, ALTER TABLE uses an ExclusiveLock.

I've written the patch, just need to test later today.... gotta step out now.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#3)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jun 16, 2011 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I believe that this is fundamentally unavoidable so long as we use
SnapshotNow to read catalogs --- which is something we've talked about
changing, but it will require a pretty major R&D effort to make it
happen.

Ouch.

I wonder if we could avoid this anomaly by taking a throwaway MVCC
snapshot at the beginning of each system catalog scan and using it
just for the duration of that scan. If nothing that has touched the
catalog commits while the scan is open, then this is logically
equivalent to SnapshotNow. If something does commit in mid-scan, then
we might not get the latest version of the row, but we should end up
with exactly one. If it's not the latest one, we'll do the rebuild
again upon seeing the next sinval message; in the meantime, the
version we're using mustn't be too intolerably bad or it was an error
not to use AccessExclusiveLock in the first place.

Yeah, this seems like a possibly workable direction to explore. I like
this better than what Simon is proposing, because it would fix the
generic issue for all types of catalog SnapshotNow scans.

IIUC, the problem with this approach is not correctness but
performance. Taking snapshots is (currently) expensive.

Yeah. After mulling it for awhile, what about this idea: we could
redefine SnapshotNow as a snapshot type that includes a list of
transactions-in-progress, somewhat like an MVCC snapshot, but we don't
fill that list from the PGPROC array. Instead, while running a scan
with SnapshotNow, anytime we determine that a particular XID is
still-in-progress, we add that XID to the snapshot's list.
Subsequently, the SnapshotNow code assumes that XID to be
still-in-progress without consulting its actual state. We reset the XID
list to empty when starting a new SnapshotNow scan. (We might be able
to do so less often than that, like only when we do
AcceptInvalidationMessages, but it's not clear to me that there's any
real benefit in hanging onto the state longer.)

This costs no performance; if anything it should be faster than now,
because we'll be replacing expensive transaction state probes with
relatively-cheap searches of an XID array that should almost always
be quite short.
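
In outline, the visibility check would gain a memoized branch
something like this (a sketch; XidInSeenArray/AddXidToSeenArray are
made-up names for the proposed per-scan XID list):

/* xmin checks for the proposed SnapshotNow variant */
if (XidInSeenArray(snapshot, xmin))
    return false;           /* we once saw it in-progress: keep saying so */
if (TransactionIdIsInProgress(xmin))
{
    AddXidToSeenArray(snapshot, xmin);  /* repalloc'd bigger as needed */
    return false;
}
if (TransactionIdDidCommit(xmin))
{
    /* fall through to the usual committed-xmin / xmax checks */
}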

With this approach, we would have no serialization anomalies from single
transactions committing while a scan is in progress. There could be
anomalies resulting from considering an earlier XID to be in-progress
while a later XID is considered committed (because we didn't observe
it until later). So far as I can see offhand, the impact of that would
be that there might be multiple versions of a tuple that are considered
good, but never that there would be no version considered good (so long
as the other XIDs simply updated the tuple and didn't delete it). I
think this would be all right, since the scan would just seize on the
first good version it finds. As you argue above, if that's not good
enough for our purposes then the updater(s) should have taken a stronger
lock.

I am not, however, particularly pleased with the idea of trying to make
this work in 9.1. I still think that we should back off the attempt
to reduce lock strength in 9.1, and take it up for 9.2. We need to be
stabilizing 9.1 for release, not introducing new untested mechanisms in
it.

regards, tom lane

#7 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah, this seems like a possibly workable direction to explore.  I like
this better than what Simon is proposing, because it would fix the
generic issue for all types of catalog SnapshotNow scans.

It would also avoid adding more lock manager traffic which - as
recently discussed on the relevant threads - turns out to be a
significant performance bottleneck for us right now on some workloads.

IIUC, the problem with this approach is not correctness but
performance.  Taking snapshots is (currently) expensive.

Yeah.  After mulling it for awhile, what about this idea: we could
redefine SnapshotNow as a snapshot type that includes a list of
transactions-in-progress, somewhat like an MVCC snapshot, but we don't
fill that list from the PGPROC array.  Instead, while running a scan
with SnapshotNow, anytime we determine that a particular XID is
still-in-progress, we add that XID to the snapshot's list.
Subsequently, the SnapshotNow code assumes that XID to be
still-in-progress without consulting its actual state.  We reset the XID
list to empty when starting a new SnapshotNow scan.  (We might be able
to do so less often than that, like only when we do
AcceptInvalidationMessages, but it's not clear to me that there's any
real benefit in hanging onto the state longer.)

I think that something like that might possibly work, but what if the
XID array overflows?

A while back I proposed the idea of a "lazy" snapshot, by which I had
in mind something similar to what you are suggesting but different in
detail. Initially, when asked to acquire a snapshot, the snapshot
manager acknowledges having taken one but does not actually do any
work. As long as it sees only XIDs that either precede the oldest XID
still running anywhere in the cluster, or have aborted, it can provide
answers that are 100% correct without any further data. If it ever
sees a newer, non-aborted XID then it goes and really gets an MVCC
snapshot at that point, which it can use from that point onward. I
think that it might be possible to make such a system work even for
MVCC snapshots generally, but even if not, it might be sufficient for
this purpose. Unlike your approach, it would avoid both the "see no
rows" and the "see multiple rows" cases, which might be thought an
advantage.
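
As a sketch of that control flow (names and the struct are
hypothetical; GetSnapshotData() is the real call the lazy path
defers):

/* "lazy" snapshot: materialize a real one only when forced to */
if (TransactionIdPrecedes(xid, lazy->oldest_running_xid) ||
    TransactionIdDidAbort(xid))
{
    /* verdict is snapshot-independent: plain commit-status rules apply */
}
else
{
    if (!lazy->materialized)
    {
        lazy->snap = GetSnapshotData(&lazy->snap_data);
        lazy->materialized = true;
    }
    /* from here on, judge xid against the real MVCC snapshot */
}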

I am not, however, particularly pleased with the idea of trying to make
this work in 9.1.  I still think that we should back off the attempt
to reduce lock strength in 9.1, and take it up for 9.2.  We need to be
stabilizing 9.1 for release, not introducing new untested mechanisms in
it.

I like this feature a lot, but it's hard to imagine that any of the
fixes anyone has so far suggested can be implemented without
collateral damage. Nor is there any certainty that this is the last
bug.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#7)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Jun 17, 2011 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah. After mulling it for awhile, what about this idea: we could
redefine SnapshotNow as a snapshot type that includes a list of
transactions-in-progress, somewhat like an MVCC snapshot, but we don't
fill that list from the PGPROC array. Instead, while running a scan
with SnapshotNow, anytime we determine that a particular XID is
still-in-progress, we add that XID to the snapshot's list.

I think that something like that might possibly work, but what if the
XID array overflows?

Well, you repalloc it bigger. In either this idea or yours below,
I fear SnapshotNow snaps will have to become dynamically-allocated
structures instead of being simple references to a shared constant
object. (This is because we can sometimes do a SnapshotNow scan when
another one is already in progress, and we couldn't let the inner one
change the outer one's state.) That's not really a performance problem;
one more palloc to do a catalog scan is a non-issue. But it is likely
to be a large notational change compared to what we've got now.

A while back I proposed the idea of a "lazy" snapshot, by which I had
in mind something similar to what you are suggesting but different in
detail. Initially, when asked to acquire a snapshot, the snapshot
manager acknowledges having taken one but does not actually do any
work. As long as it sees only XIDs that either precede the oldest XID
still running anywhere in the cluster, or have aborted, it can provide
answers that are 100% correct without any further data. If it ever
sees a newer, non-aborted XID then it goes and really gets an MVCC
snapshot at that point, which it can use from that point onward. I
think that it might be possible to make such a system work even for
MVCC snapshots generally, but even if not, it might be sufficient for
this purpose. Unlike your approach, it would avoid both the "see no
rows" and the "see multiple rows" cases, which might be thought an
advantage.

Hmm, yeah, I think this idea is probably better than mine, just because
of the less dubious semantics. I don't see how you'd make it work for
generic MVCC scans, because the behavior will be "the database state as
of some hard-to-predict time after the scan starts", which is not what
we want for MVCC. But it ought to be fine for SnapshotNow.

regards, tom lane

#9 Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#7)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 7:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jun 17, 2011 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah, this seems like a possibly workable direction to explore.  I like
this better than what Simon is proposing, because it would fix the
generic issue for all types of catalog SnapshotNow scans.

It would also avoid adding more lock manager traffic which - as
recently discussed on the relevant threads - turns out to be a
significant performance bottleneck for us right now on some workloads.

I am not, however, particularly pleased with the idea of trying to make
this work in 9.1.  I still think that we should back off the attempt
to reduce lock strength in 9.1, and take it up for 9.2.  We need to be
stabilizing 9.1 for release, not introducing new untested mechanisms in
it.

I like this feature a lot, but it's hard to imagine that any of the
fixes anyone has so far suggested can be implemented without
collateral damage.  Nor is there any certainty that this is the last
bug.

Not so. The extra locking would only occur on the first lock
acquisition after DDL operations occur. If that was common then your
other performance patch would not be an effective optimisation. There
is no additional locking from what I've proposed in the common code
path - that's why we have a relcache.

Any effects from the additional locking will be felt only by people
issuing a stream of DDL statements against a table, even assuming
there are any effects of real note.

So there is no "collateral damage" and additional locking is a viable
solution for 9.1.

It's possible that we may have a better solution in 9.2+ but then
we've said that before and had it never happen, many times.

Having spent a few hours mulling through this, I think there is a
reasonable solution for 9.1 and I continue to work on it.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#10 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#8)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 2:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hmm, yeah, I think this idea is probably better than mine, just because
of the less dubious semantics.  I don't see how you'd make it work for
generic MVCC scans, because the behavior will be "the database state as
of some hard-to-predict time after the scan starts", which is not what
we want for MVCC.  But it ought to be fine for SnapshotNow.

Department of second thoughts: I think I see a problem.

Suppose we have a tuple that has not been updated for a long time.
Its XMIN is committed and all-visible, and its XMAX is invalid. As
we're scanning the table (transaction T1), we see that tuple and say,
oh, it's visible. Now, another transaction (T2) begins, updates the
tuple, and commits. Our scan then reaches the page where the new
tuple is located, and says, oh, this is recent, I'd better take a real
snapshot. Of course, the new snapshot can see the new version of the
tuple, too. Of course, if T1 had taken its snapshot before starting
the scan, the second tuple would have been invisible. But since we
didn't take it until later, after T2 had already committed, we see a
duplicate.

That's still no worse than your idea, which rests on the theory that
duplicates don't matter anyway, but the case for it being better is a
lot thinner. I'd sure prefer something that had less crazy semantics
than either of these ideas, if we can think of something.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11 Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#9)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 3:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Not so. The extra locking would only occur on the first lock
acquisition after DDL operations occur. If that was common then your
other performance patch would not be an effective optimisation. There
is no additional locking from what I've proposed in the common code
path - that's why we have a relcache.

The extra locking would also occur when *initially* building relcache
entries. In other words, this would increase - likely quite
significantly - the overhead of backend startup. It's not going to be
sufficient to do this just for pg_class; I think you'll have to do it
for pg_attribute, pg_attrdef, pg_constraint, pg_index, pg_trigger,
pg_rewrite, and maybe a few others I'm not thinking of right now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12 Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#11)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jun 17, 2011 at 3:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Not so. The extra locking would only occur on the first lock
acquisition after DDL operations occur. If that was common then your
other performance patch would not be an effective optimisation. There
is no additional locking from what I've proposed in the common code
path - that's why we have a relcache.

The extra locking would also occur when *initially* building relcache
entries.  In other words, this would increase - likely quite
significantly - the overhead of backend startup.  It's not going to be
sufficient to do this just for pg_class; I think you'll have to do it
for pg_attribute, pg_attrdef, pg_constraint, pg_index, pg_trigger,
pg_rewrite, and maybe a few others I'm not thinking of right now.

Nothing you say here is accurate, regrettably.

The "extra locking" would be one call to the lock manager per
relation. Taken in shared mode, so it doesn't block.

I see no reason at all to have separate locks for each catalog table,
since it's the relation lock that is the top level lock.

Locking is a very well known solution to such problems. We use it
everywhere and we can use it here, and now.

I think you'd better wait to see the patch.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#10)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> writes:

Department of second thoughts: I think I see a problem.

Um, yeah, so that doesn't really work any better than my idea.

On further reflection, there's a problem at a higher level than this
anyway. Even if we can get a single SnapshotNow scan to produce
guaranteed-self-consistent results, that doesn't ensure consistency
between the results of scans occurring serially. An example here is
ALTER COLUMN DROP DEFAULT, which is currently imagined to impact only
writers. However, suppose that a concurrent relcache load fetches the
pg_attribute row, notes that it has atthasdef = true, and then the ALTER
commits before we start to scan pg_attrdef. The consistency checks in
AttrDefaultFetch() will complain about a missing pg_attrdef entry, and
rightly so. We could lobotomize those checks, but it doesn't feel right
to do so; and anyway there may be other cases that are harder to kluge up.

So really we need consistency across *at least* one entire relcache load
cycle. We could maybe arrange to take an MVCC snap (or some lighter
weight version of that) at the start, and use that for all the resulting
scans, but I think that would be notationally messy. It's not clear
that it'd solve everything anyhow. There are parts of a relcache entry
that we fetch only on-demand, so they are typically loaded later than
the core items, and probably couldn't use the same snapshot. Worse,
there are lots of places where we assume that use of catcache entries or
direct examination of the catalogs will yield results consistent with
the relcache.

I suspect these latter problems will impact Simon's idea as well.

regards, tom lane

#14 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#13)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 3:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Department of second thoughts: I think I see a problem.

Um, yeah, so that doesn't really work any better than my idea.

On further reflection, there's a problem at a higher level than this
anyway.  Even if we can get a single SnapshotNow scan to produce
guaranteed-self-consistent results, that doesn't ensure consistency
between the results of scans occurring serially.  An example here is
ALTER COLUMN DROP DEFAULT, which is currently imagined to impact only
writers.  However, suppose that a concurrent relcache load fetches the
pg_attribute row, notes that it has atthasdef = true, and then the ALTER
commits before we start to scan pg_attrdef.  The consistency checks in
AttrDefaultFetch() will complain about a missing pg_attrdef entry, and
rightly so.  We could lobotomize those checks, but it doesn't feel right
to do so; and anyway there may be other cases that are harder to kluge up.

So really we need consistency across *at least* one entire relcache load
cycle.  We could maybe arrange to take an MVCC snap (or some lighter
weight version of that) at the start, and use that for all the resulting
scans, but I think that would be notationally messy.  It's not clear
that it'd solve everything anyhow.  There are parts of a relcache entry
that we fetch only on-demand, so they are typically loaded later than
the core items, and probably couldn't use the same snapshot.  Worse,
there are lots of places where we assume that use of catcache entries or
direct examination of the catalogs will yield results consistent with
the relcache.

I suspect these latter problems will impact Simon's idea as well.

I suspect we're going to be told that they don't.

I suspect I'm not going to believe it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#6)
Re: ALTER TABLE lock strength reduction patch is unsafe

Excerpts from Tom Lane's message of Fri Jun 17 13:22:40 -0400 2011:

With this approach, we would have no serialization anomalies from single
transactions committing while a scan is in progress. There could be
anomalies resulting from considering an earlier XID to be in-progress
while a later XID is considered committed (because we didn't observe
it until later). So far as I can see offhand, the impact of that would
be that there might be multiple versions of a tuple that are considered
good, but never that there would be no version considered good (so long
as the other XIDs simply updated the tuple and didn't delete it). I
think this would be all right, since the scan would just seize on the
first good version it finds. As you argue above, if that's not good
enough for our purposes then the updater(s) should have taken a stronger
lock.

Hmm, would there be a problem if a scan on catalog A yields results from
supposedly-running transaction X but another scan on catalog B yields
result from transaction Y? (X != Y) For example, a scan on pg_class
says that there are N triggers but scanning pg_trigger says N-1?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#7)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> writes:

I like this feature a lot, but it's hard to imagine that any of the
fixes anyone has so far suggested can be implemented without
collateral damage. Nor is there any certainty that this is the last
bug.

And in fact, here's something else to worry about: consider pg_dump.
pg_dump is pretty heavily reliant on backend catalog-interpretation code
(such as ruleutils.c) that mostly runs on SnapshotNow time. But it also
does a fair amount of work on the basis of its own inspection of the
catalogs, which is done according to the serializable snapshot it gets
at the beginning of the dump run. If these two views of a table's
schema aren't consistent, you might get a pg_dump error, but it's at
least as likely that you'll get a silently incorrect dump.

pg_dump tries to minimize the risk by taking AccessShareLock right away
on each table it's going to dump. This is not perfect but it at least
results in a narrow window for conflicting table changes to occur.
However, that strategy has been blown out of the water by the ALTER
TABLE lock strength reduction. There is now a *very* wide window for
concurrent ALTERs to occur and possibly break the dump results.

As far as I can see, the only simple way to return pg_dump to its
previous level of safety while retaining this patch is to make it take
ShareUpdateExclusiveLocks, so that it will still block all forms of
ALTER TABLE. This is rather unpleasant, since it will also block
autovacuum for the duration of the dump.
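
Concretely, that would just mean strengthening the LOCK TABLE that
getTables() already issues, along these lines (illustrative only;
today's code asks for ACCESS SHARE):

/* in pg_dump's getTables(): block all forms of ALTER TABLE, too */
appendPQExpBuffer(query,
                  "LOCK TABLE %s IN SHARE UPDATE EXCLUSIVE MODE",
                  fmtQualifiedId(tblinfo[i].dobj.namespace->dobj.name,
                                 tblinfo[i].dobj.name));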

In the long run, we really ought to fix things so that ruleutils.c
runs on the transaction snapshot, but that's a massive rewrite that is
certainly not getting done for 9.1, and will likely result in
considerable code bloat :-(.

(BTW, I just noticed that dumpSchema does a pretty fair amount of work
before it gets around to calling getTables, which is where said locks
get taken. Seems like we'd better rearrange the order of operations
there...)

regards, tom lane

#17 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#15)
Re: ALTER TABLE lock strength reduction patch is unsafe

Alvaro Herrera <alvherre@commandprompt.com> writes:

Hmm, would there be a problem if a scan on catalog A yields results from
supposedly-running transaction X but another scan on catalog B yields
result from transaction Y? (X != Y) For example, a scan on pg_class
says that there are N triggers but scanning pg_trigger says N-1?

Yeah, I came to that same conclusion downthread.

regards, tom lane

#18 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#17)
Re: ALTER TABLE lock strength reduction patch is unsafe

Excerpts from Tom Lane's message of Fri Jun 17 17:08:25 -0400 2011:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Hmm, would there be a problem if a scan on catalog A yields results from
supposedly-running transaction X but another scan on catalog B yields
result from transaction Y? (X != Y) For example, a scan on pg_class
says that there are N triggers but scanning pg_trigger says N-1?

Yeah, I came to that same conclusion downthread.

Something is seriously wrong with my email :-(

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#19 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#16)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 5:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

As far as I can see, the only simple way to return pg_dump to its
previous level of safety while retaining this patch is to make it take
ShareUpdateExclusiveLocks, so that it will still block all forms of
ALTER TABLE.  This is rather unpleasant, since it will also block
autovacuum for the duration of the dump.

I have been thinking for a while now that it would be sensible to make
vacuum use a different lock type, much as we do for relation
extension. DROP TABLE and CLUSTER and at least some forms of ALTER
TABLE and maybe a few other things like CREATE INDEX would need to
grab that lock in addition to the ones they already acquire, but a
whole lot of other things wouldn't. In particular, it's currently not
possible to lock a table against SELECT without also locking it
against VACUUM - and booting off any auto-vacuum worker that happens
to already be processing it. If you imagine a large table with a
bunch of short-duration exclusive locks, it's not too hard to see how
you can end up with VACUUM starvation.

But that's not something I want to do in 9.1, and I doubt it would
completely solve this problem anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#19)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> writes:

I have been thinking for a while now that it would be sensible to make
vacuum use a different lock type, much as we do for relation
extension.

Hmm. I had just been toying with the idea of introducing a new
user-visible locking level to allow separation of anti-vacuum locks from
anti-schema-alteration locks. But I think you're probably right that it
could be done as a specialized LockTag. That would make it not easily
user-accessible, but it's hard to think of reasons for users to lock out
vacuum anyway, unless they want to lock out everything via
AccessExclusiveLock.
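
Mirroring the relation-extension precedent, such a tag might look like
this (hypothetical; only the LOCKTAG pattern itself is existing
convention, and SET_LOCKTAG_RELATION_VACUUM is an invented name):

/* a sub-lock that vacuum and schema changes take, plain queries don't */
void
LockRelationForVacuum(Relation relation, LOCKMODE lockmode)
{
	LOCKTAG		tag;

	SET_LOCKTAG_RELATION_VACUUM(tag,	/* hypothetical new tag type */
								relation->rd_lockInfo.lockRelId.dbId,
								relation->rd_lockInfo.lockRelId.relId);

	(void) LockAcquire(&tag, lockmode, false, false);
}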

... In particular, it's currently not
possible to lock a table against SELECT without also locking it
against VACUUM

Well, it still wouldn't be, since AccessExclusiveLock certainly had
better lock out vacuum. As said above, I think the important thing
is to distinguish vacuum from schema changes.

But that's not something I want to do in 9.1,

Definitely.

regards, tom lane

#21 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#1)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Thu, Jun 16, 2011 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

4. Backend #2 visits the new, about-to-be-committed version of
pgbench_accounts' pg_class row just before backend #3 commits.
It sees the row as not good and keeps scanning.  By the time it
reaches the previous version of the row, however, backend #3
*has* committed.  So that version isn't good according to SnapshotNow
either.

<thinks some more>

Why isn't this a danger for every pg_class update? For example, it
would seem that if VACUUM updates relpages/reltuples, it would be
prone to this same hazard.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#22 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#21)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jun 16, 2011 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

4. Backend #2 visits the new, about-to-be-committed version of
pgbench_accounts' pg_class row just before backend #3 commits.
It sees the row as not good and keeps scanning.  By the time it
reaches the previous version of the row, however, backend #3
*has* committed.  So that version isn't good according to SnapshotNow
either.

<thinks some more>

Why isn't this a danger for every pg_class update? For example, it
would seem that if VACUUM updates relpages/reltuples, it would be
prone to this same hazard.

VACUUM does that with an in-place, nontransactional update. But yes,
this is a risk for every transactional catalog update.

regards, tom lane

#23 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#22)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jun 16, 2011 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

4. Backend #2 visits the new, about-to-be-committed version of
pgbench_accounts' pg_class row just before backend #3 commits.
It sees the row as not good and keeps scanning.  By the time it
reaches the previous version of the row, however, backend #3
*has* committed.  So that version isn't good according to SnapshotNow
either.

<thinks some more>

Why isn't this a danger for every pg_class update?  For example, it
would seem that if VACUUM updates relpages/reltuples, it would be
prone to this same hazard.

VACUUM does that with an in-place, nontransactional update.  But yes,
this is a risk for every transactional catalog update.

Well, after various efforts to fix the problem, I notice that there
are two distinct problems brought out by your test case.

One of them is caused by my patch, one of them was already there in
the code - this latter one is actually the hardest to fix.

It took me about an hour to fix the first bug, but it's taken a while
of thinking about the second before I realised it was a pre-existing
bug.

The core problem is, as you observed, that a pg_class update can cause
rows to be lost to concurrent scans.
We scan pg_class in two ways: to rebuild a relcache entry based on a
relation's oid (easy fix). We also scan pg_class to resolve the name
to oid mapping. The name to oid mapping is performed *without* a lock
on the relation, since we don't know which relation to lock. So the
name lookup can fail if we are in the middle of a pg_class update.
This is an existing potential bug in Postgres unrelated to my patch.
Ref: SearchCatCache()

I've been looking at ways to lock the relation name and namespace
prior to the lookup (or more precisely the hash), but it's worth
discussing whether we want that at all?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#24 Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#23)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Sun, Jun 19, 2011 at 10:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

pre-existing bug

I've been able to reproduce the bug on 9.0 stable as well, using the
OP's test script.
Multiple errors like this...

ERROR: relation "pgbench_accounts" does not exist
STATEMENT: alter table pgbench_accounts set (fillfactor = 100);

Presume we want to fix both bugs, not just the most recent one, so will
continue with both patches.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#25 Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#23)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Sun, Jun 19, 2011 at 5:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Fri, Jun 17, 2011 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jun 16, 2011 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

4. Backend #2 visits the new, about-to-be-committed version of
pgbench_accounts' pg_class row just before backend #3 commits.
It sees the row as not good and keeps scanning.  By the time it
reaches the previous version of the row, however, backend #3
*has* committed.  So that version isn't good according to SnapshotNow
either.

<thinks some more>

Why isn't this a danger for every pg_class update?  For example, it
would seem that if VACUUM updates relpages/reltuples, it would be
prone to this same hazard.

VACUUM does that with an in-place, nontransactional update.  But yes,
this is a risk for every transactional catalog update.

Well, after various efforts to fix the problem, I notice that there
are two distinct problems brought out by your test case.

One of them is caused by my patch, one of them was already there in
the code - this latter one is actually the hardest to fix.

It took me about an hour to fix the first bug, but it's taken a while
of thinking about the second before I realised it was a pre-existing
bug.

The core problem is, as you observed, that a pg_class update can cause
rows to be lost to concurrent scans.
We scan pg_class in two ways: to rebuild a relcache entry based on a
relation's oid (easy fix). We also scan pg_class to resolve the name
to oid mapping. The name to oid mapping is performed *without* a lock
on the relation, since we don't know which relation to lock. So the
name lookup can fail if we are in the middle of a pg_class update.
This is an existing potential bug in Postgres unrelated to my patch.
Ref: SearchCatCache()

I've been looking at ways to lock the relation name and namespace
prior to the lookup (or more precisely the hash), but it's worth
discussing whether we want that at all?

If this is a pre-existing bug, then it's not clear to me why we need
to do anything about it at all right now. I mean, it would be nice to
have a fix, but it's hard to imagine that any proposed fix will be
low-risk, and I don't remember user complaints about this. I continue
to think that the root of the problem here is that SnapshotNow is Evil
(TM). If we get rid of that, then this problem goes away, but that
strikes me as a long-term project.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#26 Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#25)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jun 20, 2011 at 2:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:

If this is a pre-existing bug, then it's not clear to me why we need
to do anything about it at all right now.  I mean, it would be nice to
have a fix, but it's hard to imagine that any proposed fix will be
low-risk, and I don't remember user complaints about this.  I continue
to think that the root of the problem here is that SnapshotNow is Evil
(TM).  If we get rid of that, then this problem goes away, but that
strikes me as a long-term project.

There are 2 bugs, one caused by my patch in 9.1, one that is pre-existing.

The 9.1 bug can be fixed easily. I will edit my patch down and repost
here shortly.

The pre-existing bug is slightly harder/contentious because we have to
lock the name of a possible relation, even before we know it exists.
I've been looking to implement that as a lock on the uint32 hash of
the relation's name and namespace. I'm looking for opinions ranging
from fix-now-and-backpatch thru to ignore and discuss for 9.2.
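
For illustration, taking the hash lock just described might look like
this (the tag type and the exact hashing scheme are sketches of the
idea, nothing more; SET_LOCKTAG_RELATION_NAME is an invented name):

/* lock the (namespace, relname) pair before the pg_class name lookup;
 * we hash because the name itself won't fit in a LOCKTAG */
uint32		nhash;
LOCKTAG		tag;

nhash = DatumGetUInt32(hash_any((const unsigned char *) relname,
								strlen(relname)));
nhash ^= (uint32) namespaceId;	/* fold in the namespace oid */

SET_LOCKTAG_RELATION_NAME(tag, MyDatabaseId, nhash);	/* hypothetical */
(void) LockAcquire(&tag, ShareLock, false, false);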

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#27 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Simon Riggs (#26)
Re: ALTER TABLE lock strength reduction patch is unsafe

Simon Riggs <simon@2ndQuadrant.com> wrote:

I'm looking for opinions ranging from fix-now-and-backpatch thru
to ignore and discuss for 9.2.

If it's a pre-existing bug I would think that one option would be to
put it into the next bug-fix release of each supported major release
in which it is manifest. Of course, if it is *safe* to work it into
9.1, that'd be great.

-Kevin

#28 Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#27)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jun 20, 2011 at 9:42 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Simon Riggs <simon@2ndQuadrant.com> wrote:

I'm looking for opinions ranging from fix-now-and-backpatch thru
to ignore and discuss for 9.2.

If it's a pre-existing bug I would think that one option would be to
put it into the next bug-fix release of each supported major release
in which it is manifest.  Of course, if it is *safe* to work it into
9.1, that'd be great.

I'm currently on the other end of the spectrum: ignore and consider for 9.2.

But that's mostly based on the belief that there isn't going to be a
way of fixing this that isn't far too invasive to back-patch. Should
that turn out to be incorrect, that's a different matter, of course...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#29 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#28)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> wrote:

I'm currently on the other end of the spectrum: ignore and
consider for 9.2.

I guess part of my point was that if we can't safely get something
into the initial 9.1 release, it doesn't mean we necessarily need to
wait for 9.2. Bugs can be fixed along the way in minor releases. A
9.1.1 fix might give us more time to work through details and ensure
that it is a safe and well-targeted fix.

-Kevin

#30 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#25)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Jun 19, 2011 at 5:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

We scan pg_class in two ways: to rebuild a relcache entry based on a
relation's oid (easy fix). We also scan pg_class to resolve the name
to oid mapping. The name to oid mapping is performed *without* a lock
on the relation, since we don't know which relation to lock. So the
name lookup can fail if we are in the middle of a pg_class update.
This is an existing potential bug in Postgres unrelated to my patch.

If this is a pre-existing bug, then it's not clear to me why we need
to do anything about it at all right now.

Yeah. This behavior has been there since day zero, and there have been
very few complaints about it. But note that there's only a risk for
pg_class updates, not any other catalog, and there is exactly one kind
of failure with very predictable consequences. The ALTER TABLE patch
has greatly expanded the scope of the issue, and that *is* a regression
compared to prior releases.

BTW, it seems to me that this issue is closely related to what Noah is
trying to fix here:
http://archives.postgresql.org/message-id/20110612191843.GF21098@tornado.leadboat.com

regards, tom lane

#31 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#30)
1 attachment(s)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jun 20, 2011 at 10:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Jun 19, 2011 at 5:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

We scan pg_class in two ways: to rebuild a relcache entry based on a
relation's oid (easy fix). We also scan pg_class to resolve the name
to oid mapping. The name to oid mapping is performed *without* a lock
on the relation, since we don't know which relation to lock. So the
name lookup can fail if we are in the middle of a pg_class update.
This is an existing potential bug in Postgres unrelated to my patch.

If this is a pre-existing bug, then it's not clear to me why we need
to do anything about it at all right now.

Yeah.  This behavior has been there since day zero, and there have been
very few complaints about it.  But note that there's only a risk for
pg_class updates, not any other catalog, and there is exactly one kind
of failure with very predictable consequences.

I agree we shouldn't do anything about the name lookups for 9.1
That is SearchCatCache using RELNAMENSP lookups, to be precise, as
well as triggers and a few other similar call types.

The ALTER TABLE patch
has greatly expanded the scope of the issue, and that *is* a regression
compared to prior releases.

I agree the scope for RELOID errors increased with my 9.1 patch. I'm
now happy with the locking patch (attached), which significantly
reduces the scope - back to the original error scope, in my testing.

I tried to solve both, but I think that's a step too far given the timing.

It seems likely that there will be objections to this patch. All I
would say is that issuing a stream of ALTER TABLEs against the same
table is not a common situation; if it were we would have seen more of
the pre-existing bug. The ALTER TABLE command encompasses many subcommands
and we should evaluate each subcommand differently when we decide what
to do.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

relation_definition_lock.v3.patch (application/octet-stream)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 912f45c..c804f87 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2461,6 +2461,22 @@ AlterTable(AlterTableStmt *stmt)
 
 	CheckTableNotInUse(rel, "ALTER TABLE");
 
+	/*
+	 * Catalog relations are read using SnapshotNow, which can mislead
+	 * concurrent readers of the relation's description into thinking the
+	 * relation does not exist. We avoid those problems by ensuring that
+	 * nobody can re-read the relation description while we modify it.
+	 * We lock the relation's description by oid to exclude readers
+	 * searching for the relation by oid. Some errors are still possible
+	 * since SearchCatCache() searches pg_class by name, so we do not try
+	 * to protect against concurrent reads at that point.
+	 *
+	 * Some long running command types that use concurrent lock mode
+	 * have specific locking to avoid holding the Relation Definition lock
+	 * for long periods.
+	 */
+	LockRelationDefinitionOid(RelationGetRelid(rel), ExclusiveLock);
+
 	/* Check relation type against type specified in the ALTER command */
 	switch (stmt->relkind)
 	{
@@ -3389,11 +3405,25 @@ ATRewriteTables(List **wqueue, LOCKMODE lockmode)
 				 */
 				refrel = heap_open(con->refrelid, ShareRowExclusiveLock);
 
+				/*
+				 * Release the lock so we can validate the constraint without holding
+				 * up other people reading the definition. If we successfully validate
+				 * the constraint we will re-acquire in exclusive mode to allow us to
+				 * mark the constraint validated.
+				 */
+				UnlockRelationDefinitionOid(RelationGetRelid(rel), ExclusiveLock);
+
 				validateForeignKeyConstraint(fkconstraint->conname, rel, refrel,
 											 con->refindid,
 											 con->conid);
 
 				/*
+				 * Now that the constraint is validated, we lock the relation's definition
+				 * until end of transaction.
+				 */
+				LockRelationDefinitionOid(RelationGetRelid(rel), ExclusiveLock);
+
+				/*
 				 * No need to mark the constraint row as validated, we did
 				 * that when we inserted the row earlier.
 				 */
@@ -5810,11 +5840,25 @@ ATExecValidateConstraint(Relation rel, char *constrName)
 		 */
 		refrel = heap_open(con->confrelid, RowShareLock);
 
+		/*
+		 * Release the lock so we can validate the constraint without holding
+		 * up other people reading the definition. If we successfully validate
+		 * the constraint we will re-acquire in exclusive mode to allow us to
+		 * mark the constraint validated.
+		 */
+		UnlockRelationDefinitionOid(RelationGetRelid(rel), ExclusiveLock);
+
 		validateForeignKeyConstraint(constrName, rel, refrel,
 									 con->conindid,
 									 conid);
 
 		/*
+		 * Now that the constraint is validated, we lock the relation's definition
+		 * until end of transaction.
+		 */
+		LockRelationDefinitionOid(RelationGetRelid(rel), ExclusiveLock);
+
+		/*
 		 * Now update the catalog, while we have the door open.
 		 */
 		copy_con->convalidated = true;
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 798d8a8..78dcdb2 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -43,6 +43,8 @@
 #include "pgstat.h"
 #include "rewrite/rewriteManip.h"
 #include "storage/bufmgr.h"
+#include "storage/lmgr.h"
+#include "storage/lock.h"
 #include "tcop/utility.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
@@ -632,6 +634,11 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
 	pfree(DatumGetPointer(values[Anum_pg_trigger_tgattr - 1]));
 
 	/*
+	 * Lock the relation's definition until end of transaction.
+	 */
+	LockRelationDefinitionOid(RelationGetRelid(rel), ExclusiveLock);
+
+	/*
 	 * Update relation's pg_class entry.  Crucial side-effect: other backends
 	 * (and this one too!) are sent SI message to make them rebuild relcache
 	 * entries.
@@ -1343,6 +1350,11 @@ EnableDisableTrigger(Relation rel, const char *tgname,
 	else
 		nkeys = 1;
 
+	/*
+	 * Lock the relation's definition until end of transaction.
+	 */
+	LockRelationDefinitionOid(RelationGetRelid(rel), ExclusiveLock);
+
 	tgscan = systable_beginscan(tgrel, TriggerRelidNameIndexId, true,
 								SnapshotNow, nkeys, keys);
 
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 859b385..a237417 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -304,6 +304,39 @@ UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationDefinitionOid
+ *
+ * This lock tag is used to interlock reading/update of relation definitions.
+ * We need such locking because relcache invalidation and update is not
+ * race-condition-proof.
+ *
+ * We assume the caller is already holding some type of regular lock on
+ * the relation, so no AcceptInvalidationMessages call is needed here.
+ */
+void
+LockRelationDefinitionOid(Oid relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_DEFINITION(tag, relid);
+
+	(void) LockAcquire(&tag, lockmode, false, false);
+}
+
+/*
+ *		UnlockRelationDefinitionOid
+ */
+void
+UnlockRelationDefinitionOid(Oid relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_DEFINITION(tag, relid);
+
+	LockRelease(&tag, lockmode, false);
+}
+
+/*
  *		LockPage
  *
  * Obtain a page-level lock.  This is currently used by some index access
@@ -727,6 +760,12 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
 							 tag->locktag_field2,
 							 tag->locktag_field1);
 			break;
+		case LOCKTAG_RELATION_DEFINITION:
+			appendStringInfo(buf,
+							 _("definition of relation %u of database %u"),
+							 tag->locktag_field2,
+							 tag->locktag_field1);
+			break;
 		case LOCKTAG_PAGE:
 			appendStringInfo(buf,
 							 _("page %u of relation %u of database %u"),
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 6d7d4f4..7a464db 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -24,6 +24,7 @@
 static const char *const LockTagTypeNames[] = {
 	"relation",
 	"extend",
+	"definition",
 	"page",
 	"tuple",
 	"transactionid",
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 350e040..ebc9c58 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -1189,6 +1189,17 @@ SearchCatCache(CatCache *cache,
 	 */
 	relation = heap_open(cache->cc_reloid, AccessShareLock);
 
+	/*
+	 * If we are filling the cache for a relation oid we need to protect
+	 * against concurrent updates because they can lead to missing a
+	 * value that should have been visible.
+	 */
+	if (cache->id == RELOID)
+	{
+		Oid		reloid = DatumGetObjectId(v1);
+		LockRelationDefinitionOid(reloid, ShareLock);
+	}
+
 	scandesc = systable_beginscan(relation,
 								  cache->cc_indexoid,
 								  IndexScanOK(cache, cur_skey),
@@ -1212,6 +1223,12 @@ SearchCatCache(CatCache *cache,
 
 	systable_endscan(scandesc);
 
+	if (cache->id == RELOID)
+	{
+		Oid		reloid = DatumGetObjectId(v1);
+		UnlockRelationDefinitionOid(reloid, ShareLock);
+	}
+
 	heap_close(relation, AccessShareLock);
 
 	/*
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d7e94ff..23c45c6 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -812,6 +812,13 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
 	Form_pg_class relp;
 
 	/*
+	 * Obtain a short duration lock on the relation's oid to ensure
+	 * no concurrent updates of the definition. See ScanPgRelation().
+	 * We take the lock here to protect all related information.
+	 */
+	LockRelationDefinitionOid(targetRelId, ShareLock);
+
+	/*
 	 * find the tuple in pg_class corresponding to the given relation id
 	 */
 	pg_class_tuple = ScanPgRelation(targetRelId, true);
@@ -907,6 +914,11 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
 	RelationParseRelOptions(relation, pg_class_tuple);
 
 	/*
+	 * We've finished accessing catalog tables, so unlock
+	 */
+	UnlockRelationDefinitionOid(targetRelId, ShareLock);
+
+	/*
 	 * initialize the relation lock manager information
 	 */
 	RelationInitLockInfo(relation);		/* see lmgr.c */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index bd44d92..99fb52b 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -39,6 +39,10 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
 
+/* Lock a relation definition for relcache rebuild */
+extern void LockRelationDefinitionOid(Oid relid, LOCKMODE lockmode);
+extern void UnlockRelationDefinitionOid(Oid relid, LOCKMODE lockmode);
+
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
 extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 7ec961f..185475f 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -170,6 +170,8 @@ typedef enum LockTagType
 	/* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
 	LOCKTAG_RELATION_EXTEND,	/* the right to extend a relation */
 	/* same ID info as RELATION */
+	LOCKTAG_RELATION_DEFINITION, /* the right to rebuild the relcache for a relation */
+	/* ID is relation oid */
 	LOCKTAG_PAGE,				/* one page of a relation */
 	/* ID info for a page is RELATION info + BlockNumber */
 	LOCKTAG_TUPLE,				/* one physical tuple */
@@ -231,6 +233,14 @@ typedef struct LOCKTAG
 	 (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
 	 (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
 
+#define SET_LOCKTAG_RELATION_DEFINITION(locktag,reloid) \
+	((locktag).locktag_field1 = MyDatabaseId, \
+	 (locktag).locktag_field2 = (reloid), \
+	 (locktag).locktag_field3 = 0, \
+	 (locktag).locktag_field4 = 0, \
+	 (locktag).locktag_type = LOCKTAG_RELATION_DEFINITION, \
+	 (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
+
 #define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
 	((locktag).locktag_field1 = (dboid), \
 	 (locktag).locktag_field2 = (reloid), \
#32Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#31)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jun 20, 2011 at 11:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I agree we shouldn't do anything about the name lookups for 9.1.
That is SearchCatCache using RELNAMENSP lookups, to be precise, as
well as triggers and a few other similar call types.

Name lookups give ERRORs that look like this...

relation "pgbench_accounts" does not exist

The ALTER TABLE patch
has greatly expanded the scope of the issue, and that *is* a regression
compared to prior releases.

I agree the scope for RELOID errors increased with my 9.1 patch.

These originally generated ERRORs like

cache lookup failed for relation 16390
or
could not open relation with OID 16390

which should now be absent from the server log.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#33Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#30)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jun 20, 2011 at 5:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Jun 19, 2011 at 5:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

We scan pg_class in two ways: to rebuild a relcache entry based on a
relation's oid (easy fix). We also scan pg_class to resolve the name
to oid mapping. The name to oid mapping is performed *without* a lock
on the relation, since we don't know which relation to lock. So the
name lookup can fail if we are in the middle of a pg_class update.
This is an existing potential bug in Postgres unrelated to my patch.

If this is a pre-existing bug, then it's not clear to me why we need
to do anything about it at all right now.

Yeah.  This behavior has been there since day zero, and there have been
very few complaints about it.  But note that there's only a risk for
pg_class updates, not any other catalog, and there is exactly one kind
of failure with very predictable consequences.  The ALTER TABLE patch
has greatly expanded the scope of the issue, and that *is* a regression
compared to prior releases.

It's not entirely clear to me how many additional failure cases we've
bought ourselves with this patch. The particular one you've
demonstrated seems pretty similar to the one we already had, although
possibly the window for it is wider. Did you run the same test you
used on 9.1 on 9.0 for comparison?

BTW, it seems to me that this issue is closely related to what Noah is
trying to fix here:
http://archives.postgresql.org/message-id/20110612191843.GF21098@tornado.leadboat.com

I think there are two general problems here. One, we use SnapshotNow
to scan system catalogs, and SnapshotNow semantics are a mess. Two,
even if we used an MVCC snapshot to scan the system catalogs, it's not
necessarily safe or correct to latch onto an old row version, because
the row update is combined with other actions that are not MVCC-safe,
like removing files on disk. Breaking it down a bit more:

A. The problem with using a DDL lock level < AccessExclusiveLock,
AFAICS, is entirely due to SnapshotNow semantics. Any row version we
can possibly see is OK, but we had better see exactly one. Or at
least not less than one.

B. The problem with name lookups failing in the middle of a pg_class
update is also entirely due to SnapshotNow semantics.

C. The problem Noah is complaining about is NOT due to SnapshotNow
semantics. If someone does BEGIN; DROP TABLE foo; CREATE TABLE foo();
COMMIT, it is no longer OK for anyone to be looking at the old foo.
An MVCC snapshot is enough to guarantee that we'll see some row, but
examining the old one won't cut it.
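
To make the SnapshotNow hazard in points A and B concrete, here is a toy
standalone C model of a scan that ends up seeing zero versions of an
updated catalog row. Everything here is an invented stand-in, not
PostgreSQL code; it is only a sketch of the visibility logic.

/* Toy model of the SnapshotNow anomaly; build with: cc -o race race.c */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int Xid;

typedef struct
{
	Xid		xmin;			/* creating transaction */
	Xid		xmax;			/* deleting transaction, or 0 */
} TupleVersion;

static bool committed[10];	/* toy commit log: committed[x] set at commit */

/*
 * SnapshotNow-style visibility: a version is good iff its creator has
 * committed and its deleter (if any) has not.  Each call consults the
 * commit log afresh, which is exactly what lets the race happen.
 */
static bool
visible_now(TupleVersion t)
{
	return committed[t.xmin] && !(t.xmax != 0 && committed[t.xmax]);
}

int
main(void)
{
	TupleVersion newver = {2, 0};	/* row version written by xact 2 */
	TupleVersion oldver = {1, 2};	/* prior version, deleted by xact 2 */

	committed[1] = true;			/* xact 1 committed long ago */

	/* Scan visits the new version first: xact 2 still in progress. */
	printf("new version good? %d\n", visible_now(newver));	/* 0 */

	committed[2] = true;			/* concurrent DDL commits mid-scan */

	/* Scan reaches the old version only after the commit: also bad. */
	printf("old version good? %d\n", visible_now(oldver));	/* 0 */

	return 0;						/* the scan saw zero good versions */
}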

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#34Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#31)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jun 20, 2011 at 6:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I agree the scope for RELOID errors increased with my 9.1 patch. I'm
now happy with the locking patch (attached), which significantly
reduces the scope - back to the original error scope, in my testing.

I tried to solve both, but I think that's a step too far given the timing.

It seems likely that there will be objections to this patch. All I
would say is that issuing a stream of ALTER TABLEs against the same
table is not a common situation; if it were we would have seen more of
the pre-existing bug. ALTER TABLE command encompasses many subcommands
and we should evaluate each subcommand differently when we decide what
to do.

Well, my principal objection is that I think heavyweight locking is an
excessively expensive solution to this problem. I think the patch is
simple enough that I wouldn't object to applying it on those grounds
even at this late date, but I bet if we do some benchmarking on the
right workload we'll find a significant performance regression.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#33)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Jun 20, 2011 at 5:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah. This behavior has been there since day zero, and there have been
very few complaints about it. But note that there's only a risk for
pg_class updates, not any other catalog, and there is exactly one kind
of failure with very predictable consequences. The ALTER TABLE patch
has greatly expanded the scope of the issue, and that *is* a regression
compared to prior releases.

It's not entirely clear to me how many additional failure cases we've
bought ourselves with this patch. The particular one you've
demonstrated seems pretty similar to the one we already had, although
possibly the window for it is wider.

It's not so much the probability of failure that is bothering me, as
the variety of possible symptoms. There was exactly one failure mode
before, namely "no such relation". I'm not sure how many possible
symptoms there are now, but there's a lot, and most of them are going
to be weird "what the heck was that??" behaviors. If we let 9.1 ship
like this, we are going to be creating a support headache. Even worse,
knowing that those bugs exist will tempt us to write off reports of
weird cache lookup failures as being instances of this problem, when
closer investigation might show that they're something else.

Please note that this position should not be regarded as support for
Simon's proposed patch. I still think the right decision is to revert
the ALTER TABLE feature, mainly because I do not believe this is the
last bug in it. And the fact that there's a pre-existing bug with a
vaguely similar symptom is no justification for introducing more bugs.

regards, tom lane

#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#31)
Re: ALTER TABLE lock strength reduction patch is unsafe

Simon Riggs <simon@2ndQuadrant.com> writes:

On Mon, Jun 20, 2011 at 10:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The ALTER TABLE patch
has greatly expanded the scope of the issue, and that *is* a regression
compared to prior releases.

I agree the scope for RELOID errors increased with my 9.1 patch. I'm
now happy with the locking patch (attached), which significantly
reduces the scope - back to the original error scope, in my testing.

I tried to solve both, but I think that's a step too far given the timing.

It seems likely that there will be objections to this patch.

Yup, you're right. Having read this patch, I have absolutely zero
confidence in it. It introduces some locks in random places, with no
rhyme or reason that I can see. There is no reason to think that this
is a complete solution, and considerable reason to think that it isn't
(notably, the RELOID syscache is hardly the only one at risk). Worse,
it's adding more locking in performance-critical places, which seems
to me to severely degrade the argument for the original feature,
namely that it was supposed to give us *less* locking.

regards, tom lane

#37Alvaro Herrera
alvherre@commandprompt.com
In reply to: Robert Haas (#34)
Re: ALTER TABLE lock strength reduction patch is unsafe

Excerpts from Robert Haas's message of Tue Jun 21 09:40:16 -0400 2011:

On Mon, Jun 20, 2011 at 6:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I agree the scope for RELOID errors increased with my 9.1 patch. I'm
now happy with the locking patch (attached), which significantly
reduces the scope - back to the original error scope, in my testing.

I tried to solve both, but I think that's a step too far given the timing.

It seems likely that there will be objections to this patch. All I
would say is that issuing a stream of ALTER TABLEs against the same
table is not a common situation; if it were we would have seen more of
the pre-existing bug. ALTER TABLE command encompasses many subcommands
and we should evaluate each subcommand differently when we decide what
to do.

Well, my principal objection is that I think heavyweight locking is an
excessively expensive solution to this problem. I think the patch is
simple enough that I wouldn't object to applying it on those grounds
even at this late date, but I bet if we do some benchmarking on the
right workload we'll find a significant performance regression.

Yeah, taking a hw lock at each relcache item build is likely to be
prohibitively expensive. Heck, relation extension is expensive already
in some loads. (I'm guessing that things will tank when there's a
relcache reset). Still, this seems to be an overall better approach to
the problem than what's been proposed elsewhere. Maybe we can do
something with a map of relations that are protected by a bunch of
lwlocks instead. (We could use that for relation extension too; that'd
rock.)
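
For illustration, a minimal standalone sketch of that striped map idea
follows. All names and the stripe count are invented, and pthread rwlocks
stand in for lwlocks; this is a sketch of the hashing scheme, not a
proposal for the actual data structure.

/* Toy model of a definition-lock map striped across a fixed pool of
 * rwlocks; build with: cc -pthread -o stripes stripes.c */
#include <pthread.h>
#include <stdio.h>

typedef unsigned int Oid;

#define NUM_DEF_LOCK_STRIPES 128	/* invented; a small power of two */

static pthread_rwlock_t def_lock_stripes[NUM_DEF_LOCK_STRIPES];

static pthread_rwlock_t *
def_lock_for(Oid relid)
{
	/* Many relations share a stripe; occasional false conflicts are
	 * the price of a fixed-size, allocation-free lock table. */
	return &def_lock_stripes[relid % NUM_DEF_LOCK_STRIPES];
}

int
main(void)
{
	Oid			relid = 16390;	/* e.g. pgbench_accounts */
	int			i;

	for (i = 0; i < NUM_DEF_LOCK_STRIPES; i++)
		pthread_rwlock_init(&def_lock_stripes[i], NULL);

	/* Relcache rebuild: shared mode, so rebuilds don't block each other. */
	pthread_rwlock_rdlock(def_lock_for(relid));
	/* ... load the pg_class row and related catalog entries here ... */
	pthread_rwlock_unlock(def_lock_for(relid));

	/* DDL commit: exclusive mode, held just around the catalog update. */
	pthread_rwlock_wrlock(def_lock_for(relid));
	/* ... install and commit the new definition here ... */
	pthread_rwlock_unlock(def_lock_for(relid));

	puts("ok");
	return 0;
}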

The patch may be simple, but is it complete? I think you need to have
lock acquisition on create rule and create index too. Right now it only
has locks on trigger stuff and alter table.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#38Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#35)
Re: ALTER TABLE lock strength reduction patch is unsafe

Excerpts from Tom Lane's message of Tue Jun 21 11:06:22 -0400 2011:

Please note that this position should not be regarded as support for
Simon's proposed patch. I still think the right decision is to revert
the ALTER TABLE feature, mainly because I do not believe this is the
last bug in it. And the fact that there's a pre-existing bug with a
vaguely similar symptom is no justification for introducing more bugs.

Note that this feature can be disabled by tweaking
AlterTableGetLockLevel so that it always returns AccessExclusive.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#38)
Re: ALTER TABLE lock strength reduction patch is unsafe

Alvaro Herrera <alvherre@commandprompt.com> writes:

Excerpts from Tom Lane's message of Tue Jun 21 11:06:22 -0400 2011:

Please note that this position should not be regarded as support for
Simon's proposed patch. I still think the right decision is to revert
the ALTER TABLE feature, mainly because I do not believe this is the
last bug in it. And the fact that there's a pre-existing bug with a
vaguely similar symptom is no justification for introducing more bugs.

Note that this feature can be disabled by tweaking
AlterTableGetLockLevel so that it always returns AccessExclusive.

I think Simon had also hacked a couple of other places such as CREATE
TRIGGER, but yeah, I was thinking of just lobotomizing that function
with an #ifdef. When and if we get these problems worked out, it'll
be easy to re-enable the feature.

regards, tom lane

#40Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#36)
1 attachment(s)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Tue, Jun 21, 2011 at 4:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On Mon, Jun 20, 2011 at 10:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The ALTER TABLE patch
has greatly expanded the scope of the issue, and that *is* a regression
compared to prior releases.

I agree the scope for RELOID errors increased with my 9.1 patch. I'm
now happy with the locking patch (attached), which significantly
reduces the scope - back to the original error scope, in my testing.

I tried to solve both, but I think that's a step too far given the timing.

It seems likely that there will be objections to this patch.

Yup, you're right.  Having read this patch, I have absolutely zero
confidence in it.  It introduces some locks in random places, with no
rhyme or reason that I can see.

There is something to be salvaged here and I have additional analysis
and suggestions. Please read all the way through.

We already allowed INHERIT / NO INHERIT with a lower lock level than
AccessExclusiveLock and have received no complaints to speak of. We
aren't discussing moving that operation to AccessExclusiveLock in the
light of this discussion, nor are we discussing TYPE OF operations. I
don't think its sensible to reset those operations to
AccessExclusiveLock.

The patch has extensive comments that explain the carefully analysed
and tested locations for the locking.

We take an ExclusiveLock on the RelationDefinition during an ALTER TABLE.
We take a ShareLock on CatCache and RelCache rebuilds.

The only time these can block is during an ALTER TABLE with a lower
lock than AccessExclusiveLock that occurs immediately after other DDL.
The locking has been placed to minimise the lock window for simple
ALTER TABLE statements.

There is no reason to think that this
is a complete solution, and considerable reason to think that it isn't
(notably, the RELOID syscache is hardly the only one at risk).

The whole of the relcache rebuild is protected, not just access to
pg_class, so that is safe, no matter which catalog tables are
accessed.

Catcache does look more complex, so your fears here are worth looking
into. I have been doing that since my last post and have now
finished.

Based on the above, I think we should limit the patch to these areas:

1. Simple operations on pg_class and/or pg_attribute. This includes
case AT_SetStatistics: ATTNAME catcache
case AT_SetOptions: ATTNAME catcache
case AT_ResetOptions: ATTNAME catcache
case AT_SetStorage: ATTNAME catcache
case AT_SetRelOptions: RELOID catcache
case AT_ResetRelOptions: RELOID catcache

2. FK VALIDATION which touches CONSTROID

Limiting the scope of the patch to those areas provides good benefit,
whilst at the same time limiting our risk footprint. Doing that means
we can evaluate the patch against those aspects.

For people who still think there is risk, I've added a
parameter called "relaxed_ddl_locking" which defaults to "off". That
way people can use this feature in an informed way without exposing us
to support risks from its use.

[Patch needs minor docs and test documentation changes prior to commit]

Worse,
it's adding more locking in performance-critical places, which seems
to me to severely degrade the argument for the original feature,
namely that it was supposed to give us *less* locking.

I think we should be clear that this adds *one* lock/unlock for each
object for each session, ONCE only after each transaction that
executed a DDL statement against an object. So with 200 sessions, a
typical ALTER TABLE statement will cause 400 locks. The current lock
manager maxes out above 50,000 locks per second, so any comparative
analysis will show this is a minor blip on even a heavy workload.

It isn't in a performance-critical place, and the lock held in case of
a rebuild is a ShareLock, so it won't block, except against an ALTER TABLE
statement.

This is a marked *reduction* in locking overhead, in comparison with
ALTER TABLE being an AccessExclusiveLock which would make the SQL
statement wait for potentially a very long time.

With relaxed_ddl_locking = off (default) there is zero effect.

So that is not a valid argument to refuse this patch.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

relaxed_ddl_locking.v4.patch (application/octet-stream)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 912f45c..0767aca 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2461,6 +2461,22 @@ AlterTable(AlterTableStmt *stmt)
 
 	CheckTableNotInUse(rel, "ALTER TABLE");
 
+	/*
+	 * Catalog relations are read using SnapshotNow, which can mislead
+	 * concurrent readers of the relation's description into thinking the
+	 * relation does not exist. We avoid those problems by ensuring that
+	 * nobody can re-read the relation description while we modify it.
+	 * We lock the relation's description by oid to exclude readers
+	 * searching for the relation by oid. Some errors are still possible
+	 * since SearchCatCache() searches pg_class by name, so we cannot
+	 * protect against concurrent reads at that point.
+	 *
+	 * Some long running command types that use concurrent lock mode
+	 * have specific locking to avoid holding the Relation Definition lock
+	 * for long periods.
+	 */
+	LockRelationDefinitionOid(RelationGetRelid(rel), ExclusiveLock);
+
 	/* Check relation type against type specified in the ALTER command */
 	switch (stmt->relkind)
 	{
@@ -2568,12 +2584,22 @@ AlterTableInternal(Oid relid, List *cmds, bool recurse)
  * ALTER TABLE RENAME which is treated as a different statement type T_RenameStmt
  * and does not travel through this section of code and cannot be combined with
  * any of the subcommands given here.
+ *
+ * Any reduction in lock level below AccessExclusiveLock must also consider
+ * that some protection is required to prevent SnapshotNow accesses from
+ * missing rows while the catalog updates take place. RelationDefinition locks are
+ * provided for that purpose.
  */
 LOCKMODE
 AlterTableGetLockLevel(List *cmds)
 {
 	ListCell   *lcmd;
-	LOCKMODE	lockmode = ShareUpdateExclusiveLock;
+	LOCKMODE	lockmode;
+
+	if (!relaxed_ddl_locking)
+		return AccessExclusiveLock;
+
+	lockmode = ShareUpdateExclusiveLock;
 
 	foreach(lcmd, cmds)
 	{
@@ -2625,7 +2651,7 @@ AlterTableGetLockLevel(List *cmds)
 			case AT_DisableTrigUser:
 			case AT_AddIndex:	/* from ADD CONSTRAINT */
 			case AT_AddIndexConstraint:
-				cmd_lockmode = ShareRowExclusiveLock;
+				cmd_lockmode = AccessExclusiveLock;
 				break;
 
 			case AT_AddConstraint:
@@ -2645,7 +2671,7 @@ AlterTableGetLockLevel(List *cmds)
 							 * we can work out how to allow concurrent catalog
 							 * updates.
 							 */
-							cmd_lockmode = ShareRowExclusiveLock;
+							cmd_lockmode = AccessExclusiveLock;
 							break;
 						case CONSTR_FOREIGN:
 
@@ -2654,11 +2680,11 @@ AlterTableGetLockLevel(List *cmds)
 							 * Foreign Key, so the lock level must be at least
 							 * as strong as CREATE TRIGGER.
 							 */
-							cmd_lockmode = ShareRowExclusiveLock;
+							cmd_lockmode = AccessExclusiveLock;
 							break;
 
 						default:
-							cmd_lockmode = ShareRowExclusiveLock;
+							cmd_lockmode = AccessExclusiveLock;
 					}
 				}
 				break;
@@ -3389,11 +3415,25 @@ ATRewriteTables(List **wqueue, LOCKMODE lockmode)
 				 */
 				refrel = heap_open(con->refrelid, ShareRowExclusiveLock);
 
+				/*
+				 * Release the lock so we can validate the constraint without holding
+				 * up other people reading the definition. If we successfully validate
+				 * the constraint we will re-acquire in exclusive mode to allow us to
+				 * mark the constraint validated.
+				 */
+				UnlockRelationDefinitionOid(RelationGetRelid(rel), ExclusiveLock);
+
 				validateForeignKeyConstraint(fkconstraint->conname, rel, refrel,
 											 con->refindid,
 											 con->conid);
 
 				/*
+				 * Now that the constraint is validated, we lock the relation's
+				 * definition until end of transaction.
+				 */
+				LockRelationDefinitionOid(RelationGetRelid(rel), ExclusiveLock);
+
+				/*
 				 * No need to mark the constraint row as validated, we did
 				 * that when we inserted the row earlier.
 				 */
@@ -5810,11 +5850,26 @@ ATExecValidateConstraint(Relation rel, char *constrName)
 		 */
 		refrel = heap_open(con->confrelid, RowShareLock);
 
+		/*
+		 * Release the lock so we can validate the constraint without holding
+		 * up other people reading the definition. If we successfully validate
+		 * the constraint we will re-acquire in exclusive mode to allow us to
+		 * mark the constraint validated.
+		 */
+		UnlockRelationDefinitionOid(RelationGetRelid(rel), ExclusiveLock);
+
 		validateForeignKeyConstraint(constrName, rel, refrel,
 									 con->conindid,
 									 conid);
 
 		/*
+		 * Now that the constraint is validated, we lock the definitions of
+		 * both the class and the constraint until end of transaction.
+		 */
+		LockRelationDefinitionOid(RelationGetRelid(rel), ExclusiveLock);
+		LockRelationDefinitionOid(conid, ExclusiveLock);
+
+		/*
 		 * Now update the catalog, while we have the door open.
 		 */
 		copy_con->convalidated = true;
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 798d8a8..78dcdb2 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -43,6 +43,8 @@
 #include "pgstat.h"
 #include "rewrite/rewriteManip.h"
 #include "storage/bufmgr.h"
+#include "storage/lmgr.h"
+#include "storage/lock.h"
 #include "tcop/utility.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
@@ -632,6 +634,11 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
 	pfree(DatumGetPointer(values[Anum_pg_trigger_tgattr - 1]));
 
 	/*
+	 * Lock the relation's definition until end of transaction.
+	 */
+	LockRelationDefinitionOid(RelationGetRelid(rel), ExclusiveLock);
+
+	/*
 	 * Update relation's pg_class entry.  Crucial side-effect: other backends
 	 * (and this one too!) are sent SI message to make them rebuild relcache
 	 * entries.
@@ -1343,6 +1350,11 @@ EnableDisableTrigger(Relation rel, const char *tgname,
 	else
 		nkeys = 1;
 
+	/*
+	 * Lock the relation's definition until end of transaction.
+	 */
+	LockRelationDefinitionOid(RelationGetRelid(rel), ExclusiveLock);
+
 	tgscan = systable_beginscan(tgrel, TriggerRelidNameIndexId, true,
 								SnapshotNow, nkeys, keys);
 
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 859b385..a3f1ae1 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -24,6 +24,7 @@
 #include "storage/procarray.h"
 #include "utils/inval.h"
 
+bool relaxed_ddl_locking;
 
 /*
  * RelationInitLockInfo
@@ -304,6 +305,45 @@ UnlockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationDefinitionOid
+ *
+ * This lock tag is used to interlock reading/update of relation definitions.
+ * We need such locking because relcache invalidation and update is not
+ * race-condition-proof.
+ *
+ * We assume the caller is already holding some type of regular lock on
+ * the relation, so no AcceptInvalidationMessages call is needed here.
+ */
+void
+LockRelationDefinitionOid(Oid relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	if (!relaxed_ddl_locking)
+		return;
+
+	SET_LOCKTAG_RELATION_DEFINITION(tag, relid);
+
+	(void) LockAcquire(&tag, lockmode, false, false);
+}
+
+/*
+ *		UnlockRelationDefinitionOid
+ */
+void
+UnlockRelationDefinitionOid(Oid relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	if (!relaxed_ddl_locking)
+		return;
+
+	SET_LOCKTAG_RELATION_DEFINITION(tag, relid);
+
+	LockRelease(&tag, lockmode, false);
+}
+
+/*
  *		LockPage
  *
  * Obtain a page-level lock.  This is currently used by some index access
@@ -727,6 +767,12 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
 							 tag->locktag_field2,
 							 tag->locktag_field1);
 			break;
+		case LOCKTAG_RELATION_DEFINITION:
+			appendStringInfo(buf,
+							 _("definition of relation %u of database %u"),
+							 tag->locktag_field2,
+							 tag->locktag_field1);
+			break;
 		case LOCKTAG_PAGE:
 			appendStringInfo(buf,
 							 _("page %u of relation %u of database %u"),
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index 6d7d4f4..7a464db 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -24,6 +24,7 @@
 static const char *const LockTagTypeNames[] = {
 	"relation",
 	"extend",
+	"definition",
 	"page",
 	"tuple",
 	"transactionid",
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 350e040..229df3a 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -1189,6 +1189,19 @@ SearchCatCache(CatCache *cache,
 	 */
 	relation = heap_open(cache->cc_reloid, AccessShareLock);
 
+	/*
+	 * If we are filling the cache for a relation oid we need to protect
+	 * against concurrent updates because they can lead to missing a
+	 * value that should have been visible. When we are refreshing the
+	 * RELOID and ATTNAME caches v1 will be the class oid. When refreshing
+	 * the CONSTROID v1 is the constraint oid.
+	 */
+	if (cache->id == RELOID || cache->id == ATTNAME || cache->id == CONSTROID)
+	{
+		Oid		reloid = DatumGetObjectId(v1);
+		LockRelationDefinitionOid(reloid, ShareLock);
+	}
+
 	scandesc = systable_beginscan(relation,
 								  cache->cc_indexoid,
 								  IndexScanOK(cache, cur_skey),
@@ -1212,6 +1225,12 @@ SearchCatCache(CatCache *cache,
 
 	systable_endscan(scandesc);
 
+	if (cache->id == RELOID || cache->id == ATTNAME || cache->id == CONSTROID)
+	{
+		Oid		reloid = DatumGetObjectId(v1);
+		UnlockRelationDefinitionOid(reloid, ShareLock);
+	}
+
 	heap_close(relation, AccessShareLock);
 
 	/*
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d7e94ff..23c45c6 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -812,6 +812,13 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
 	Form_pg_class relp;
 
 	/*
+	 * Obtain a short duration lock on the relation's oid to ensure
+	 * no concurrent updates of the definition. See ScanPgRelation().
+	 * We take the lock here to protect all related information.
+	 */
+	LockRelationDefinitionOid(targetRelId, ShareLock);
+
+	/*
 	 * find the tuple in pg_class corresponding to the given relation id
 	 */
 	pg_class_tuple = ScanPgRelation(targetRelId, true);
@@ -907,6 +914,11 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
 	RelationParseRelOptions(relation, pg_class_tuple);
 
 	/*
+	 * We've finished accessing catalog tables, so unlock
+	 */
+	UnlockRelationDefinitionOid(targetRelId, ShareLock);
+
+	/*
 	 * initialize the relation lock manager information
 	 */
 	RelationInitLockInfo(relation);		/* see lmgr.c */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 92391ed..eb4e4a4 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -59,6 +59,7 @@
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/bufmgr.h"
+#include "storage/lmgr.h"
 #include "storage/standby.h"
 #include "storage/fd.h"
 #include "storage/predicate.h"
@@ -1387,6 +1388,16 @@ static struct config_bool ConfigureNamesBool[] =
 	},
 
 	{
+		{"relaxed_ddl_locking", PGC_SIGHUP, LOCK_MANAGEMENT,
+			gettext_noop("Allows relaxed locking of some subcommands of ALTER TABLE."),
+			NULL
+		},
+		&relaxed_ddl_locking,
+		false,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"allow_system_table_mods", PGC_POSTMASTER, DEVELOPER_OPTIONS,
 			gettext_noop("Allows modifications of the structure of system tables."),
 			NULL,
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index bd44d92..3627655 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -39,6 +39,12 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
 
+/* Lock a relation definition for relcache rebuild */
+extern void LockRelationDefinitionOid(Oid relid, LOCKMODE lockmode);
+extern void UnlockRelationDefinitionOid(Oid relid, LOCKMODE lockmode);
+
+extern bool relaxed_ddl_locking;
+
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
 extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 7ec961f..185475f 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -170,6 +170,8 @@ typedef enum LockTagType
 	/* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
 	LOCKTAG_RELATION_EXTEND,	/* the right to extend a relation */
 	/* same ID info as RELATION */
+	LOCKTAG_RELATION_DEFINITION, /* the right to rebuild the relcache for a relation */
+	/* ID is relation oid */
 	LOCKTAG_PAGE,				/* one page of a relation */
 	/* ID info for a page is RELATION info + BlockNumber */
 	LOCKTAG_TUPLE,				/* one physical tuple */
@@ -231,6 +233,14 @@ typedef struct LOCKTAG
 	 (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
 	 (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
 
+#define SET_LOCKTAG_RELATION_DEFINITION(locktag,reloid) \
+	((locktag).locktag_field1 = MyDatabaseId, \
+	 (locktag).locktag_field2 = (reloid), \
+	 (locktag).locktag_field3 = 0, \
+	 (locktag).locktag_field4 = 0, \
+	 (locktag).locktag_type = LOCKTAG_RELATION_DEFINITION, \
+	 (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
+
 #define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
 	((locktag).locktag_field1 = (dboid), \
 	 (locktag).locktag_field2 = (reloid), \
#41Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#40)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Wed, Jun 22, 2011 at 11:26 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

For people that still think there is still risk, I've added a
parameter called "relaxed_ddl_locking" which defaults to "off". That
way people can use this feature in an informed way without exposing us
to support risks from its use.

I can't get excited about adding a GUC that says "you can turn this
off if you want but don't blame us if it breaks". That just doesn't
seem like good software engineering to me.

I think we should be clear that this adds *one* lock/unlock for each
object for each session, ONCE only after each transaction that
executed a DDL statement against an object. So with 200 sessions, a
typical ALTER TABLE statement will cause 400 locks. The current lock
manager maxes out above 50,000 locks per second, so any comparative
analysis will show this is a minor blip on even a heavy workload.

I think that using locks to address this problem is the wrong
approach. I think the right fix is to use something other than
SnapshotNow to do the system catalog scans. However, if we were going
to use locking, then we'd need to ensure that:

(1) No transaction which has made changes to a system catalog may
commit while some other transaction is in the middle of scanning that
catalog.
(2) No transaction which has made changes to a set of system catalogs
may commit while some other transaction is in the midst of fetching
interrelated data from a subset of those catalogs.

It's important to note, I think, that the problem here doesn't occur
at the time that the table rows are updated, because those rows aren't
visible yet. The problem happens at commit time. So unless I'm
missing something, ALTER TABLE would only need to acquire the relation
definition lock after it has finished all of its other work. In fact,
it doesn't even really need to acquire it then, either. It could just
queue up a list of relation definition locks to acquire immediately
before commit, which would be advantageous if the transaction went on
to abort, since in that case we'd skip the work of acquiring the locks
at all.
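
A minimal standalone model of that queue-the-locks-until-commit idea
might look like the following. The names are invented, and the posted
patch instead takes the locks immediately; this is only a sketch of the
deferral scheme.

/* Toy model of deferring definition-lock acquisition to pre-commit;
 * build with: cc -o precommit precommit.c */
#include <stdio.h>
#include <stdlib.h>

typedef unsigned int Oid;

typedef struct PendingDefLock
{
	Oid			relid;
	struct PendingDefLock *next;
} PendingDefLock;

static PendingDefLock *pending_def_locks;	/* per-transaction list */

/* Called where ALTER TABLE would otherwise lock the definition:
 * just remember the OID, without touching the lock manager. */
static void
remember_def_lock(Oid relid)
{
	PendingDefLock *p = malloc(sizeof(PendingDefLock));

	p->relid = relid;
	p->next = pending_def_locks;
	pending_def_locks = p;
}

static void
forget_def_locks(void)
{
	while (pending_def_locks)
	{
		PendingDefLock *p = pending_def_locks;

		pending_def_locks = p->next;
		free(p);
	}
}

/* Immediately before commit: take the queued locks, so readers are
 * blocked only for the instant surrounding commit.  On abort, we
 * would just call forget_def_locks() and never lock anything. */
static void
at_pre_commit(void)
{
	PendingDefLock *p;

	for (p = pending_def_locks; p != NULL; p = p->next)
		printf("acquiring definition lock on relation %u\n", p->relid);
	forget_def_locks();			/* the locks themselves would be released
								 * by normal end-of-transaction cleanup */
}

int
main(void)
{
	remember_def_lock(16390);	/* e.g. ALTER TABLE pgbench_accounts ... */
	at_pre_commit();
	return 0;
}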

In fact, without doing that, this patch would undo much of the point
of the original ALTER TABLE lock strength reduction:

begin;
alter table foo alter column a set storage plain;
<start a new psql session in another window>
select * from foo;

In master, the SELECT completes without blocking. With this patch
applied, the SELECT blocks, just as it did in 9.0, because it can't
rebuild the relcache entry without getting the relation definition
lock, which the alter table will hold until commit. In fact, the same
thing happens even if foo has been accessed previously in the same
session. If there is already an open *transaction* that has accessed
foo, then, it seems, it can continue to do so until it commits. But
as soon as it tries to start a new transaction, it blocks again. I
don't quite understand why that is - I didn't think we invalidated the
entire relcache on every commit. But that's what happens.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#42Simon Riggs
simon@2ndquadrant.com
In reply to: Robert Haas (#41)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Thu, Jun 23, 2011 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 22, 2011 at 11:26 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

For people that still think there is still risk, I've added a
parameter called "relaxed_ddl_locking" which defaults to "off". That
way people can use this feature in an informed way without exposing us
to support risks from its use.

I can't get excited about adding a GUC that says "you can turn this
off if you want but don't blame us if it breaks".  That just doesn't
seem like good software engineering to me.

Nobody is suggesting we fix the pre-existing bug - we're just going to
leave it. That doesn't sound like good software engineering either,
but I'm not going to complain because I understand.

We're in a bind here and I'm trying to suggest practical ways out, as
well as cater for the levels of risk people are willing to accept. I
don't think we need that parameter at all, but if you do then I'm
willing to tolerate it.

I think we should be clear that this adds *one* lock/unlock for each
object for each session, ONCE only after each transaction that
executed a DDL statement against an object. So with 200 sessions, a
typical ALTER TABLE statement will cause 400 locks. The current lock
manager maxes out above 50,000 locks per second, so any comparative
analysis will show this is a minor blip on even a heavy workload.

I think that using locks to address this problem is the wrong
approach.  I think the right fix is to use something other than
SnapshotNow to do the system catalog scans.

I agree with that, but it's a much bigger job than you think and I
really doubt that you will do it for 9.2, definitely not for 9.1.

There are so many call points that I would not personally attempt
such a critical and widely invasive fix.

So I'd put that down as a Blue Sky will-fix-one-day thing.

However, if we were going
to use locking, then we'd need to ensure that:

(1) No transaction which has made changes to a system catalog may
commit while some other transaction is in the middle of scanning that
catalog.
(2) No transaction which has made changes to a set of system catalogs
may commit while some other transaction is in the midst of fetching
interrelated data from a subset of those catalogs.

It's important to note, I think, that the problem here doesn't occur
at the time that the table rows are updated, because those rows aren't
visible yet.  The problem happens at commit time.  So unless I'm
missing something, ALTER TABLE would only need to acquire the relation
definition lock after it has finished all of its other work.  In fact,
it doesn't even really need to acquire it then, either.  It could just
queue up a list of relation definition locks to acquire immediately
before commit, which would be advantageous if the transaction went on
to abort, since in that case we'd skip the work of acquiring the locks
at all.

That would work if the only thing ALTER TABLE does is write. There are
various places where we read catalogs and all of those catalogs need
to be relationally integrated. So the only safe way, without lots of
incredibly detailed analysis (which you would then find fault with) is
to lock at the top and hold the lock throughout most of the
processing. That's the safe thing to do, at least.

I do already use the point you make in the patch, where it's used to
unlock before and then relock after the FK constraint check, so that
the RelationDefinition lock is never held for long periods in any code
path.

In fact, without doing that, this patch would undo much of the point
of the original ALTER TABLE lock strength reduction:

begin;
alter table foo alter column a set storage plain;
<start a new psql session in another window>
select * from foo;

In master, the SELECT completes without blocking.  With this patch
applied, the SELECT blocks, just as it did in 9.0, because it can't
rebuild the relcache entry without getting the relation definition
lock, which the alter table will hold until commit.

It's not the same at all. Yes, they are both locks; what counts is the
frequency and duration of locking.

With AccessExclusiveLock we prevent all SELECTs from accessing the
table while we queue for the lock, which can be blocked behind a long
running query. So the total outage to the application for one minor
change can be hours - unpredictably. (Which is why I never wrote the
ALTER TABLE set ndistinct myself, even though I suggested it, because
it's mostly unusable).

With the reduced locking level AND this patch the *rebuild* can be
blocked behind the DDL. But the DDL itself is not blocked in the same
way, so the total outage is much reduced and the whole set of actions
goes through fairly quickly.

If you only submit one DDL operation then the rebuild has nothing to
block against, so the whole thing goes through smoothly.

In fact, the same
thing happens even if foo has been accessed previously in the same
session.  If there is already an open *transaction* that has accessed
foo, then, it seems, it can continue to do so until it commits.  But
as soon as it tries to start a new transaction, it blocks again.  I
don't quite understand why that is - I didn't think we invalidated the
entire relcache on every commit.

We don't invalidate the relcache on every commit.

But that's what happens.

Test case please. I don't understand the problem you're describing.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#43Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#42)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 24, 2011 at 3:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Test case please. I don't understand the problem you're describing.

S1: select * from foo;
S2: begin;
S2: alter table foo alter column a set storage plain;
S1: select * from foo;
<blocks>

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#44Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#43)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 24, 2011 at 9:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jun 24, 2011 at 3:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Test case please. I don't understand the problem you're describing.

S1: select * from foo;
S2: begin;
S2: alter table foo alter column a set storage plain;
S1: select * from foo;
<blocks>

Er... yes, that's what locks do. Where is the bug?

We have these choices of behaviour
1. It doesn't error and doesn't block - not possible for 9.1, probably
not for 9.2 either
2. It doesn't block, but may throw an error sometimes - the reported bug
3. It blocks in some cases for short periods where people do repeated
DDL, but never throws errors - this patch
4. Full scale locking - human sacrifice, cats and dogs, living
together, mass hysteria

If you want to avoid the blocking, then don't hold open the transaction.

Do this

S1: select * from foo
S2: alter table.... run in its own transaction
S1: select * from foo

Doesn't block, no errors. Which is exactly what most people do on
their production servers. The ALTER TABLE statements we're talking
about are not schema changes. They don't need to be coordinated with
other DDL.

This patch has locking, but it's the most reduced form of locking that
is available for a non-invasive patch for 9.1.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#45Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#44)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 24, 2011 at 4:27 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Fri, Jun 24, 2011 at 9:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jun 24, 2011 at 3:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Test case please. I don't understand the problem you're describing.

S1: select * from foo;
S2: begin;
S2: alter table foo alter column a set storage plain;
S1: select * from foo;
<blocks>

Er... yes, that's what locks do. Where is the bug?

I didn't say it was a bug. I said that the additional locking you
added removes much of the benefit of reducing the lock
strength in the first place. If a brief exclusive lock (such as would
be taken by ALTER TABLE SET STORAGE in 9.0 and prior releases) wasn't
acceptable, a brief exclusive lock on a different lock tag that blocks
most of the same operations isn't going to be any better. You might
still see some improvement in the cases where some complicated
operation like scanning or rewriting the table can be performed
without holding the relation definition lock, and then only get the
relation definition lock afterward. But for something like the above
example (or the ALTER TABLE SET (n_distinct) feature you mentioned in
your previous email) the benefit is going to be darned close to zero.

This patch has locking, but it's the most reduced form of locking that
is available for a non-invasive patch for 9.1.

I don't like the extra lock manager traffic this adds to operations
that aren't even performing DDL, I'm not convinced that it is safe,
the additional locking negates a significant portion of the benefit of
the original patch, and Tom's made a fairly convincing argument that
even with this pg_dump will still become significantly less reliable
than heretofore even if we did apply it. In my view, what we need to
do is revert to using AccessExclusiveLock until some of these other
details are worked out. I wasn't entirely convinced that that was
necessary when Tom first posted this email, but having thought through
the issues and looked over your patch, it's now my position that that
is the best way forward. The fact that your proposed patch appears
not to be thinking very clearly about when locks need to be taken and
released only adds to my feeling that we are in for a bad time if we
go down this route.

In short: -1 from me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#46Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#45)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 24, 2011 at 10:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jun 24, 2011 at 4:27 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Fri, Jun 24, 2011 at 9:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jun 24, 2011 at 3:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Test case please. I don't understand the problem you're describing.

S1: select * from foo;
S2: begin;
S2: alter table foo alter column a set storage plain;
S1: select * from foo;
<blocks>

Er... yes, that's what locks do. Where is the bug?

I didn't say it was a bug.  I said that the additional locking you
added removes much of the benefit of reducing the lock
strength in the first place.  If a brief exclusive lock (such as would
be taken by ALTER TABLE SET STORAGE in 9.0 and prior releases) wasn't
acceptable, a brief exclusive lock on a different lock tag that blocks
most of the same operations isn't going to be any better.  You might
still see some improvement in the cases where some complicated
operation like scanning or rewriting the table can be performed
without holding the relation definition lock, and then only get the
relation definition lock afterward.  But for something like the above
example (or the ALTER TABLE SET (n_distinct) feature you mentioned in
your previous email) the benefit is going to be darned close to zero.

This patch has locking, but it's the most reduced form of locking that
is available for a non-invasive patch for 9.1.

I don't like the extra lock manager traffic this adds to operations
that aren't even performing DDL, I'm not convinced that it is safe,
the additional locking negates a significant portion of the benefit of
the original patch, and Tom's made a fairly convincing argument that
even with this patch, pg_dump will still become significantly less
reliable than heretofore.  In my view, what we need to
do is revert to using AccessExclusiveLock until some of these other
details are worked out.  I wasn't entirely convinced that that was
necessary when Tom first posted this email, but having thought through
the issues and looked over your patch, it's now my position that that
is the best way forward.  The fact that your proposed patch appears
not to be thinking very clearly about when locks need to be taken and
released only adds to my feeling that we are in for a bad time if we
go down this route.

In short: -1 from me.

I've explained all of the above points to you already and you're wrong.

I don't think you understand this patch at all, nor even the original feature.

Locking the table is completely different from locking the definition
of a table.

Do you advocate that all ALTER TABLE operations use
AccessExclusiveLock, or just the operations I added? If you see
problems here, then you must also be willing to increase the lock
strength of things such as INHERITS, and back patch them to where the
same problems exist. You'll wriggle out of that, I'm sure. There are,
regrettably, many bugs here and they can't be fixed in the simple
manner you propose.

It's not me you block Robert, I'm not actually a user and I will sleep
well whatever happens, happy that I tried to resolve this. Users watch
and remember.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#47Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#46)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 24, 2011 at 5:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I've explained all of the above points to you already and you're wrong.

We're going to have to agree to disagree on that point.

Do you advocate that all ALTER TABLE operations use
AccessExclusiveLock, or just the operations I added? If you see
problems here, then you must also be willing to increase the lock
strength of things such as INHERITS, and back patch them to where the
same problems exist. You'll wriggle out of that, I'm sure. There are,
regrettably, many bugs here and they can't be fixed in the simple
manner you propose.

I think there is quite a lot of difference between realizing that we
can't fix every problem, and deciding to put out a release that adds a
whole lot more of them that we have no plans to fix.

It's not me you block Robert, I'm not actually a user and I will sleep
well whatever happens, happy that I tried to resolve this. Users watch
and remember.

If you are proposing that I should worry about a posse of angry
PostgreSQL users hunting me down (or abandoning the product) because I
agreed with Tom Lane on the necessity of reverting one of your
patches, then I'm willing to take that chance. For one thing, there's
a pretty good chance they'll go after Tom first. For two things,
there's at least an outside chance I might be rescued by an
alternative posse who supports our tradition of putting out high
quality releases.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#48Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#47)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 24, 2011 at 10:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:

It's not me you block Robert, I'm not actually a user and I will sleep
well whatever happens, happy that I tried to resolve this. Users watch
and remember.

If you are proposing that I should worry about a posse of angry
PostgreSQL users hunting me down (or abandoning the product) because I
agreed with Tom Lane on the necessity of reverting one of your
patches, then I'm willing to take that chance.  For one thing, there's
a pretty good chance they'll go after Tom first.  For two things,
there's at least an outside chance I might be rescued by an
alternative posse who supports our tradition of putting out high
quality releases.

Not sure where anger and violence entered the discussion, but a posse
does sound amusing.

If such groups ever met, I'm sure both would support the tradition of
high quality, but I suspect they might define it differently and that
would be the source of the problem. Much like Life of Brian.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#49Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#13)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 8:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Department of second thoughts: I think I see a problem.

Um, yeah, so that doesn't really work any better than my idea.

On further reflection, there's a problem at a higher level than this
anyway.  Even if we can get a single SnapshotNow scan to produce
guaranteed-self-consistent results, that doesn't ensure consistency
between the results of scans occurring serially.  An example here is
ALTER COLUMN DROP DEFAULT, which is currently imagined to impact only
writers.  However, suppose that a concurrent relcache load fetches the
pg_attribute row, notes that it has atthasdef = true, and then the ALTER
commits before we start to scan pg_attrdef.  The consistency checks in
AttrDefaultFetch() will complain about a missing pg_attrdef entry, and
rightly so.  We could lobotomize those checks, but it doesn't feel right
to do so; and anyway there may be other cases that are harder to kluge up.

Locking the whole definition is at least one way of solving this
problem. My locking fix does that.

So really we need consistency across *at least* one entire relcache load
cycle.  We could maybe arrange to take an MVCC snap (or some lighter
weight version of that) at the start, and use that for all the resulting
scans, but I think that would be notationally messy.  It's not clear
that it'd solve everything anyhow.  There are parts of a relcache entry
that we fetch only on-demand, so they are typically loaded later than
the core items, and probably couldn't use the same snapshot.  Worse,
there are lots of places where we assume that use of catcache entries or
direct examination of the catalogs will yield results consistent with
the relcache.

I suspect these latter problems will impact Simon's idea as well.

I think you're probably right, or at least, the suspicion is not
something I can address quickly enough to be safe.

I will revert to the AccessExclusiveLocks.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#50Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#6)
1 attachment(s)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Jun 17, 2011 at 6:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jun 16, 2011 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I believe that this is fundamentally unavoidable so long as we use
SnapshotNow to read catalogs --- which is something we've talked about
changing, but it will require a pretty major R&D effort to make it
happen.

Ouch.

I wonder if we could avoid this anomaly by taking a throwaway MVCC
snapshot at the beginning of each system catalog scan and using it
just for the duration of that scan.  If nothing that has touched the
catalog commits while the scan is open, then this is logically
equivalent to SnapshotNow.  If something does commit in mid-scan, then
we might not get the latest version of the row, but we should end up
with exactly one.  If it's not the latest one, we'll do the rebuild
again upon seeing the next sinval message; in the meantime, the
version we're using mustn't be too intolerably bad or it was an error
not to use AccessExclusiveLock in the first place.

Yeah, this seems like a possibly workable direction to explore.  I like
this better than what Simon is proposing, because it would fix the
generic issue for all types of catalog SnapshotNow scans.

IIUC, the problem with this approach is not correctness but
performance.  Taking snapshots is (currently) expensive.

Yeah.  After mulling it for awhile, what about this idea: we could
redefine SnapshotNow as a snapshot type that includes a list of
transactions-in-progress, somewhat like an MVCC snapshot, but we don't
fill that list from the PGPROC array.  Instead, while running a scan
with SnapshotNow, anytime we determine that a particular XID is
still-in-progress, we add that XID to the snapshot's list.
Subsequently, the SnapshotNow code assumes that XID to be
still-in-progress without consulting its actual state.  We reset the XID
list to empty when starting a new SnapshotNow scan.  (We might be able
to do so less often than that, like only when we do
AcceptInvalidationMessages, but it's not clear to me that there's any
real benefit in hanging onto the state longer.)

This costs no performance; if anything it should be faster than now,
because we'll be replacing expensive transaction state probes with
relatively-cheap searches of an XID array that should almost always
be quite short.

With this approach, we would have no serialization anomalies from single
transactions committing while a scan is in progress.  There could be
anomalies resulting from considering an earlier XID to be in-progress
while a later XID is considered committed (because we didn't observe
it until later).  So far as I can see offhand, the impact of that would
be that there might be multiple versions of a tuple that are considered
good, but never that there would be no version considered good (so long
as the other XIDs simply updated the tuple and didn't delete it).  I
think this would be all right, since the scan would just seize on the
first good version it finds.  As you argue above, if that's not good
enough for our purposes then the updater(s) should have taken a stronger
lock.

I liked this idea, so began to prototype the code. My rough hack is
attached, for the record.

One thing that occurs to me about this is that SnapshotNow with or
without these changes returns the latest committed row and ignores
in-progress changes.

Accepting an older version of the definition will always be
potentially dangerous. I can't see a way of doing this that doesn't
require locking - for changes such as new constraints we need to wait
until in-progress changes are complete.

So maybe this idea is worth doing, but I don't think it helps us much
reduce lock levels for DDL.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

snapshotnow_consistent.v1.patchapplication/octet-stream; name=snapshotnow_consistent.v1.patchDownload
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index db04e26..d66f434 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -263,6 +263,8 @@ systable_beginscan(Relation heapRelation,
 	sysscan->heap_rel = heapRelation;
 	sysscan->irel = irel;
 
+	InitSnapshotNow(snapshot);
+
 	if (irel)
 	{
 		int			i;
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index d81ee7d..e349777 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -347,7 +347,7 @@ HeapTupleSatisfiesNow(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
 
 			if (TransactionIdIsCurrentTransactionId(xvac))
 				return false;
-			if (!TransactionIdIsInProgress(xvac))
+			if (!XidIsInProgressForSnapshotNow(xvac, snapshot))
 			{
 				if (TransactionIdDidCommit(xvac))
 				{
@@ -366,7 +366,7 @@ HeapTupleSatisfiesNow(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
 
 			if (!TransactionIdIsCurrentTransactionId(xvac))
 			{
-				if (TransactionIdIsInProgress(xvac))
+				if (XidIsInProgressForSnapshotNow(xvac, snapshot))
 					return false;
 				if (TransactionIdDidCommit(xvac))
 					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
@@ -405,7 +405,7 @@ HeapTupleSatisfiesNow(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
 			else
 				return false;	/* deleted before scan started */
 		}
-		else if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple)))
+		else if (XidIsInProgressForSnapshotNow(HeapTupleHeaderGetXmin(tuple), snapshot))
 			return false;
 		else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
 			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
@@ -448,7 +448,7 @@ HeapTupleSatisfiesNow(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
 			return false;		/* deleted before scan started */
 	}
 
-	if (TransactionIdIsInProgress(HeapTupleHeaderGetXmax(tuple)))
+	if (XidIsInProgressForSnapshotNow(HeapTupleHeaderGetXmax(tuple), snapshot))
 		return true;
 
 	if (!TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple)))
@@ -1342,3 +1342,116 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 
 	return false;
 }
+
+#define MAX_SIZE_SNAPSHOT_NOW (MaxBackends * 64)
+
+void
+InitSnapshotNow(Snapshot snapshot)
+{
+	if (snapshot->satisfies != HeapTupleSatisfiesNow)
+		return;
+
+	snapshot->xcnt = 0;
+	snapshot->xmin = InvalidTransactionId;
+	snapshot->xmax = InvalidTransactionId;
+
+	if (snapshot->xip == NULL)
+	{
+		/*
+		 * First call for this snapshot. Snapshot is same size whether or not
+		 * we are in recovery, see later comments.
+		 */
+		snapshot->xip = (TransactionId *)
+			malloc(MAX_SIZE_SNAPSHOT_NOW * sizeof(TransactionId));
+		if (snapshot->xip == NULL)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of memory")));
+	}
+}
+
+/*
+ * SnapshotNow scans used to dynamically test TransactionIdIsInProgress()
+ * as the scan progressed. This resulted in sometimes missing a tuple
+ * altogether if a transaction committed during the scan. This function
+ * allows us to remember the state of a running transaction, so we can
+ * ensure we give the same answer each and every time we see this xid.
+ *
+ * Initial tests are similar to TransactionIdIsInProgress()
+ */
+static bool
+XidIsInProgressForSnapshotNow(TransactionId xid, Snapshot snapshot)
+{
+	/*
+	 * Don't bother checking a transaction older than RecentXmin; it could not
+	 * possibly still be running.  (Note: in particular, this guarantees that
+	 * we reject InvalidTransactionId, FrozenTransactionId, etc as not
+	 * running.)
+	 */
+	if (TransactionIdPrecedes(xid, RecentXmin))
+		return false;
+
+	/*
+	 * We may have just checked the status of this transaction, so if it is
+	 * already known to be completed, we can fall out quickly.
+	 */
+	if (TransactionIdIsKnownCompleted(xid))
+		return false;
+
+	/*
+	 * Also, we can handle our own transaction (and subtransactions) without
+	 * any access to the xip array.
+	 */
+	if (TransactionIdIsCurrentTransactionId(xid))
+		return true;
+
+	/*
+	 * Is it within the bounds of the remembered xids?
+	 */
+	if (TransactionIdIsValid(snapshot->xmin) &&
+		(TransactionIdFollowsOrEquals(xid, snapshot->xmin) &&
+		TransactionIdPrecedesOrEquals(xid, snapshot->xmax)))
+	{
+		int	i;
+
+		/*
+		 * If we saw this xid was in progress previously then we give the
+		 * same answer the next time this scan sees the xid again.
+		 */
+		for (i = 0; i < snapshot->xcnt; i++)
+			if (snapshot->xip[i] == xid)
+				return true;
+	}
+
+	/*
+	 * If we haven't seen it before, is it in progress? If so remember.
+	 */
+	if (!TransactionIdIsInProgress(xid))
+		return false;
+
+	/*
+	 * xid is in progress - do we have space to remember?
+	 */
+	if ((snapshot->xcnt + 1) >= MAX_SIZE_SNAPSHOT_NOW)
+		elog(ERROR, "reached limit of SnapshotNow array");
+
+	/*
+	 * Remember xid is in progress
+	 */
+	snapshot->xip[snapshot->xcnt++] = xid;
+
+	/*
+	 * Set or expand upper or lower limits if appropriate.
+	 */
+	if (!TransactionIdIsValid(snapshot->xmin))
+	{
+		snapshot->xmin = xid;
+		snapshot->xmax = xid;
+	}
+	else if (TransactionIdPrecedes(xid, snapshot->xmin))
+		snapshot->xmin = xid;
+	else if (TransactionIdFollows(xid, snapshot->xmax))
+		snapshot->xmax = xid;
+
+	return true;
+}
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 859e52a..24b6cc0 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -35,6 +35,14 @@ typedef struct SnapshotData
 	SnapshotSatisfiesFunc satisfies;	/* tuple test function */
 
 	/*
+	 * Fields used by all snapshot types.
+	 */
+	uint32		xcnt;			/* # of xact ids in xip[] */
+	TransactionId *xip;			/* array of xact IDs in progress */
+	TransactionId xmin;			/* all XID < xmin are visible to me */
+	TransactionId xmax;			/* all XID >= xmax are invisible to me */
+
+	/*
 	 * The remaining fields are used only for MVCC snapshots, and are normally
 	 * just zeroes in special snapshots.  (But xmin and xmax are used
 	 * specially by HeapTupleSatisfiesDirty.)
@@ -44,10 +52,6 @@ typedef struct SnapshotData
 	 * is stored as an optimization to avoid needing to search the XID arrays
 	 * for most tuples.
 	 */
-	TransactionId xmin;			/* all XID < xmin are visible to me */
-	TransactionId xmax;			/* all XID >= xmax are invisible to me */
-	uint32		xcnt;			/* # of xact ids in xip[] */
-	TransactionId *xip;			/* array of xact IDs in progress */
 	/* note: all ids in xip[] satisfy xmin <= xip[i] < xmax */
 	int32		subxcnt;		/* # of xact ids in subxip[] */
 	TransactionId *subxip;		/* array of subxact IDs in progress */
#51Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#50)
Re: ALTER TABLE lock strength reduction patch is unsafe

Simon Riggs wrote:

This costs no performance; if anything it should be faster than now,
because we'll be replacing expensive transaction state probes with
relatively-cheap searches of an XID array that should almost always
be quite short.

With this approach, we would have no serialization anomalies from single
transactions committing while a scan is in progress.  There could be
anomalies resulting from considering an earlier XID to be in-progress
while a later XID is considered committed (because we didn't observe
it until later).  So far as I can see offhand, the impact of that would
be that there might be multiple versions of a tuple that are considered
good, but never that there would be no version considered good (so long
as the other XIDs simply updated the tuple and didn't delete it).  I
think this would be all right, since the scan would just seize on the
first good version it finds.  As you argue above, if that's not good
enough for our purposes then the updater(s) should have taken a stronger
lock.

I liked this idea, so began to prototype the code. My rough hack is
attached, for the record.

One thing that occurs to me about this is that SnapshotNow with or
without these changes returns the latest committed row and ignores
in-progress changes.

Accepting an older version of the definition will always be
potentially dangerous. I can't see a way of doing this that doesn't
require locking - for changes such as new constraints we need to wait
until in-progress changes are complete.

So maybe this idea is worth doing, but I don't think it helps us much
reduce lock levels for DDL.

Just confirming we decided not to pursue this.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#52Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#51)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Tue, Nov 29, 2011 at 11:50 AM, Bruce Momjian <bruce@momjian.us> wrote:

Just confirming we decided not to pursue this.

Doesn't sound like it.

I've been thinking a lot about the more general problem here - namely,
that allowing catalog changes without an access exclusive lock is
unsafe - and trying to come up with a good solution, because I'd
really like to see us make this work. It strikes me that there are
really two separate problems here.

1. If you are scanning a system catalog using SnapshotNow, and a
commit happens at just the right moment, you might see two versions of
the row or zero rather than one. You'll see two if you scan the old
row version, then the concurrent transaction commits, and then you
scan the new one. You'll see zero if you scan the new row (ignoring
it, since it isn't committed yet) and then the concurrent transaction
commits, and then you scan the old row. On a related note, if you
scan several interrelated catalogs (e.g. when building a relcache
entry) and a transaction that has made matching modifications to
multiple catalogs commits midway through, you might see the old
version of the data from one table and the new version of the data
from some other table. Using AccessExclusiveLock fixes the problem
because we guarantee that anybody who wants to read those rows first
takes some kind of lock, which they won't be able to do while the
updater holds AccessExclusiveLock.

2. Other backends may have data in the relcache or catcaches which
won't get invalidated until they do AcceptInvalidationMessages().
That will always happen no later than the next time they lock the
relation, so if you are holding AccessExclusiveLock then you should be
OK: no one else holds any lock, and they'll have to go get one before
doing anything interesting. But if you are holding any weaker lock,
there's nothing to force AcceptInvalidationMessages to happen before
you reuse those cache entries.

In both cases, name lookups are an exception: we don't know what OID
to lock until we've done the name lookup.
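
The usual shape of a fix for that race is: look up the name without a
lock, lock the resulting OID, then repeat the lookup to check that the
name still maps to the same OID. A minimal sketch, not code from this
thread; lookup_and_lock is an invented name, the RangeVarGetRelid
signature is the 9.1-era one, and error handling is omitted:

static Oid
lookup_and_lock(const RangeVar *rv, LOCKMODE lockmode)
{
	for (;;)
	{
		/* name -> OID, without taking any lock yet */
		Oid			relId = RangeVarGetRelid(rv, true);

		if (!OidIsValid(relId))
			return InvalidOid;

		LockRelationOid(relId, lockmode);
		AcceptInvalidationMessages();

		/* recheck: does the name still map to the same relation? */
		if (RangeVarGetRelid(rv, true) == relId)
			return relId;

		/* concurrent DDL changed the mapping; undo and retry */
		UnlockRelationOid(relId, lockmode);
	}
}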

Random ideas for solving the first problem:

1-A. Take a fresh MVCC snapshot for each catalog scan. I think we've
rejected this before because of the cost involved, but maybe with
Pavan's patch and some of the other improvements that are on the way
we could make this cheap enough to be acceptable.
1-B. Don't allow transactions that have modified system catalog X to
commit while we're scanning system catalog X; make the scans take some
kind of shared lock and the commits take an exclusive lock. This seems
awfully bad for concurrency, though, not to mention the overhead of
taking and releasing all those locks even when nothing conflicts.
1-C. Wrap a retry loop around every place where we scan a system
catalog. If a transaction that has written rows in catalog X commits
while we're scanning catalog X, discard the results of the first scan
and declare a do-over (or maybe something more coarse-grained, or at
the other extreme even more fine-grained). If we're doing several
interrelated scans then the retry loop must surround the entire unit,
and enforce a do-over of the whole operation if commits occur in any
of the catalogs before the scan completes.
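
To make 1-C concrete, here is a minimal sketch of such a retry loop.
This is not from any posted patch: CatalogChangeCount() is an assumed
primitive, a counter bumped at commit time by any transaction that
wrote the given catalog, and fetch_catalog_row_with_retry is an
invented name.

static HeapTuple
fetch_catalog_row_with_retry(Relation catalog, int nkeys, ScanKey key)
{
	for (;;)
	{
		uint64		count_before = CatalogChangeCount(catalog);
		SysScanDesc scan;
		HeapTuple	tuple;

		scan = systable_beginscan(catalog, InvalidOid, false,
								  SnapshotNow, nkeys, key);
		tuple = systable_getnext(scan);
		if (HeapTupleIsValid(tuple))
			tuple = heap_copytuple(tuple);	/* copy before ending the scan */
		systable_endscan(scan);

		/* no commit touched this catalog mid-scan: result is trustworthy */
		if (CatalogChangeCount(catalog) == count_before)
			return tuple;

		/* concurrent commit detected: discard and do the whole scan over */
		if (HeapTupleIsValid(tuple))
			heap_freetuple(tuple);
	}
}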

Unfortunately, any of these seem likely to require a Lot of Work,
probably more than can be done in time for 9.2. However, perhaps it
would be feasible to home in on a limited subset of the cases (e.g.
name lookups, which are mainly done in only a handful of places, and
represent an existing bug) and implement the necessary code changes
just for those cases. Then we could generalize that pattern to other
cases as time and energy permit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#53Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#52)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Wed, Nov 30, 2011 at 4:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

It strikes me that there are really two separate problems here.

1. If you are scanning a system catalog using SnapshotNow, and a
commit happens at just the right moment, you might see two versions of
the row or zero rather than one.  You'll see two if you scan the old
row version, then the concurrent transaction commits, and then you
scan the new one.  You'll see zero if you scan the new row (ignoring
it, since it isn't committed yet) and then the concurrent transaction
commits, and then you scan the old row.

That is a bug and one we should fix. I supplied a patch for that,
written to Tom's idea for how to solve it.

I will apply that, unless there are objections.

2. Other backends may have data in the relcache or catcaches which
won't get invalidated until they do AcceptInvalidationMessages().
That will always happen no later than the next time they lock the
relation, so if you are holding AccessExclusiveLock then you should be
OK: no one else holds any lock, and they'll have to go get one before
doing anything interesting.  But if you are holding any weaker lock,
there's nothing to force AcceptInvalidationMessages to happen before
you reuse those cache entries.

Yes, inconsistent cache entries will occur if we allow catalog updates
while the table is already locked by others.

The question is whether that is a problem in all cases.

With these ALTER TABLE subcommands, I don't see any problem with
different backends seeing different values.

/*
* These subcommands affect general strategies for performance
* and maintenance, though don't change the semantic results
* from normal data reads and writes. Delaying an ALTER TABLE
* behind currently active writes only delays the point where
* the new strategy begins to take effect, so there is no
* benefit in waiting. In this case the minimum restriction
* applies: we don't currently allow concurrent catalog
* updates.
*/
case AT_SetStatistics:
// only used by ANALYZE, which is shut out by the ShareUpdateExclusiveLock

case AT_ClusterOn:
case AT_DropCluster:
// only used by CLUSTER, which is shut out because it needs AccessExclusiveLock

case AT_SetRelOptions:
case AT_ResetRelOptions:
case AT_SetOptions:
case AT_ResetOptions:
case AT_SetStorage:
// not critical

case AT_ValidateConstraint:
// not a problem if some people think it's invalid when it is valid

So ISTM that we can allow reduced lock levels for the above commands
if we fix SnapshotNow.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#54Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#53)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Dec 16, 2011 at 7:07 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

That is a bug and one we should fix. I supplied a patch for that,
written to Tom's idea for how to solve it.

I will apply that, unless there are objections.

I remember several attempts at that, but I don't remember any that
didn't meet with objections. Do you have a link?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#55Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#54)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Dec 16, 2011 at 1:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Dec 16, 2011 at 7:07 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

That is a bug and one we should fix. I supplied a patch for that,
written to Tom's idea for how to solve it.

I will apply that, unless there are objections.

I remember several attempts at that, but I don't remember any that
didn't meet with objections.  Do you have a link?

My patch to implement SnapshotNow correctly, from Jun 27 on this
thread, was never reviewed or commented upon by anybody. That was
probably because it only fixes one of the problems, not all of them.

But it does fix a current bug and that's why I'm asking now if there
are objections to committing it.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#56Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#55)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Fri, Dec 16, 2011 at 8:54 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Fri, Dec 16, 2011 at 1:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Dec 16, 2011 at 7:07 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

That is a bug and one we should fix. I supplied a patch for that,
written to Tom's idea for how to solve it.

I will apply that, unless there are objections.

I remember several attempts at that, but I don't remember any that
didn't meet with objections.  Do you have a link?

My patch to implement SnapshotNow correctly, from Jun 27 on this
thread, was never reviewed or commented upon by anybody. That was
probably because it only fixes one of the problems, not all of them.

Well, I think it was mostly because you didn't sound terribly
optimistic about the approach: "So maybe this idea is worth doing, but
I don't think it helps us much reduce lock levels for DDL." And also
because you described the patch as a "rough hack", and not something
you thought ready to commit.

I am also not entirely sure I believe that this is plugging all the
failure cases. I think that it may just be making the failure cases
more obscure, rather than really getting rid of them. Consider
something like the following:

T1: Update row version A, creating new row version B.
T2: Begin scanning the catalog in question. We happen to encounter
row version B first. We remember T1's XID as in progress, but don't
see the row since T1 hasn't committed.
T1: Rollback.
T3: Update row version A, creating new row version C.
T3: Commit.
T2: Scan now comes to row version A; we don't see that version either,
since T3 is committed.

I don't think there's any guarantee that T2's scan will see tuples
inserted after the start of the scan. If I'm correct about that, and
I'm pretty sure it's true for sequential scans anyway, then T2's scan
might end without seeing C either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#57Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#56)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Sat, Dec 17, 2011 at 3:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I am also not entirely sure I believe that this is plugging all the
failure cases.  I think that it may just be making the failure cases
more obscure, rather than really getting rid of them.  Consider
something like the following:

T1: Update row version A, creating new row version B.
T2: Begin scanning the catalog in question.  We happen to encounter
row version B first.  We remember T1's XID as in progress, but don't
see the row since T1 hasn't committed.
T1: Rollback.
T3: Update row version A, creating new row version C.
T3: Commit.
T2: Scan now comes to row version A; we don't see that version either,
since T3 is committed.

I don't think there's any guarantee that T2's scan will see tuples
inserted after the start of the scan.  If I'm correct about that, and
I'm pretty sure it's true for sequential scans anyway, then T2's scan
might end without seeing C either.

A simpler example shows it better. Tom's idea prevents two rows from
being returned, but doesn't prevent zero rows.

That's a useful improvement, but not the only thing we need.

ISTM we can run a scan as we do now, without locking. If the scan
returns zero rows we scan again, but take a definition lock first.
ALTER TABLE holds the definition lock while running, which won't be a
problem because we would only use that on shorter AT statements.

Definition lock being the type of lock described earlier on this
thread, that we already have code for. So we have code for all the
parts we need to make this work.

So that means we'd be able to plug the gaps noted, without reducing
performance on the common code paths.

I don't see that any objections made so far remain valid with that approach.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#58Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#57)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Sat, Dec 17, 2011 at 6:11 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

A simpler example shows it better. Tom's idea prevents two rows from
being returned, but doesn't prevent zero rows.

I don't think it prevents either one. Suppose we read and return a
tuple, and then someone HOT-updates it and commits. SnapshotNow is
very happy to also return the updated version of that same tuple on
the next iteration of the scan. See commit
c0f03aae0469e758964faac0fb741685170c39a5.

That's a useful improvement, but not the only thing we need.

ISTM we can run a scan as we do now, without locking. If the scan
returns zero rows we scan again, but take a definition lock first.
ALTER TABLE holds the definition lock while running, which won't be a
problem because we would only use that on shorter AT statements.

Definition lock being the type of lock described earlier on this
thread, that we already have code for. So we have code for all the
parts we need to make this work.

So that means we'd be able to plug the gaps noted, without reducing
performance on the common code paths.

I don't see that any objections made so far remain valid with that approach.

I think a retry loop is a possibly useful tool for solving this
problem, but I'm very skeptical about the definition locks, unless we
can arrange to have them taken just before commit (regardless of when
during the transaction ALTER TABLE was executed) and released
immediately afterwards. What I think might be even better is
something along the lines of what Noah Misch did for RangeVarGetRelid --
don't try to lock out concurrent activity, just detect it and redo the
operation if anything bad happens while it's in progress. In this
case, that would mean having a mechanism to know whether any
concurrent transaction had updated the catalog we're scanning during
the scan.

Yet another option, which I wonder whether we're dismissing too
lightly, is to just call GetSnapshotData() and do the scan using a
plain old MVCC snapshot. Sure, it's more expensive than SnapshotNow,
but is it expensive enough to worry about?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#59Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#58)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Dec 19, 2011 at 12:05:09PM -0500, Robert Haas wrote:

Yet another option, which I wonder whether we're dismissing too
lightly, is to just call GetSnapshotData() and do the scan using a
plain old MVCC snapshot. Sure, it's more expensive than SnapshotNow,
but is it expensive enough to worry about?

Good point. For the most part, we already regard a catalog scan as too
expensive for bulk use, hence relcache and catcache. That's not license to
slow them down recklessly, but it's worth discovering how much of a hit we'd
actually face. I created a function that does this in a loop:

HeapTuple t;

CatalogCacheFlushCatalog(ProcedureRelationId);
t = SearchSysCache1(PROCOID, ObjectIdGetDatum(42) /* int4in */);
if (!HeapTupleIsValid(t))
elog(ERROR, "cache lookup failed for function 42");
ReleaseSysCache(t);
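
For concreteness, that loop would live in a SQL-callable C function
along these lines. This is a reconstruction rather than the actual
test code; the function name and the iteration-count argument are
invented.

#include "postgres.h"
#include "catalog/pg_proc.h"
#include "fmgr.h"
#include "utils/catcache.h"
#include "utils/syscache.h"

PG_FUNCTION_INFO_V1(catcache_scan_bench);

Datum
catcache_scan_bench(PG_FUNCTION_ARGS)
{
	int32		iters = PG_GETARG_INT32(0);
	int32		i;

	for (i = 0; i < iters; i++)
	{
		HeapTuple	t;

		/* force the next lookup to scan pg_proc instead of hitting cache */
		CatalogCacheFlushCatalog(ProcedureRelationId);
		t = SearchSysCache1(PROCOID, ObjectIdGetDatum(42) /* int4in */);
		if (!HeapTupleIsValid(t))
			elog(ERROR, "cache lookup failed for function 42");
		ReleaseSysCache(t);
	}

	PG_RETURN_VOID();
}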

Then, I had pgbench call the function once per client with various numbers of
clients and a loop iteration count such that the total number of scans per run
was always 19200000. Results for master and for a copy patched to use MVCC
snapshots in catcache.c only:

2 clients, master: 4:30.66elapsed
4 clients, master: 4:26.82elapsed
32 clients, master: 4:25.30elapsed
2 clients, master: 4:25.67elapsed
4 clients, master: 4:26.58elapsed
32 clients, master: 4:26.40elapsed
2 clients, master: 4:27.54elapsed
4 clients, master: 4:26.60elapsed
32 clients, master: 4:27.20elapsed
2 clients, mvcc-catcache: 4:35.13elapsed
4 clients, mvcc-catcache: 4:30.40elapsed
32 clients, mvcc-catcache: 4:37.91elapsed
2 clients, mvcc-catcache: 4:28.13elapsed
4 clients, mvcc-catcache: 4:27.06elapsed
32 clients, mvcc-catcache: 4:32.84elapsed
2 clients, mvcc-catcache: 4:32.47elapsed
4 clients, mvcc-catcache: 4:24.35elapsed
32 clients, mvcc-catcache: 4:31.54elapsed

I see roughly a 2% performance regression. However, I'd expect any bulk
losses to come from increased LWLock contention, which just doesn't
materialize in a big way on this 2-core box. If anyone would like to rerun
this on a larger machine, I can package it up for reuse.

#60Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#59)
Re: ALTER TABLE lock strength reduction patch is unsafe

Noah Misch <noah@leadboat.com> writes:

On Mon, Dec 19, 2011 at 12:05:09PM -0500, Robert Haas wrote:

Yet another option, which I wonder whether we're dismissing too
lightly, is to just call GetSnapshotData() and do the scan using a
plain old MVCC snapshot. Sure, it's more expensive than SnapshotNow,
but is it expensive enough to worry about?

That might actually be workable ...

I created a function that does this in a loop:

HeapTuple t;

CatalogCacheFlushCatalog(ProcedureRelationId);
t = SearchSysCache1(PROCOID, ObjectIdGetDatum(42) /* int4in */);
if (!HeapTupleIsValid(t))
elog(ERROR, "cache lookup failed for function 42");
ReleaseSysCache(t);

... but this performance test seems to me to be entirely misguided,
because it's testing a situation that isn't going to occur much in the
field, precisely because the syscache should prevent constant reloads of
the same syscache entry.

Poking around a bit, it looks to me like one of the bigger users of
non-cache-fronted SnapshotNow scans is dependency.c. So maybe testing
the speed of large cascaded drops would be a more relevant test case.
For instance, create a schema containing a few thousand functions, and
see how long it takes to drop the schema.

Another thing that would be useful to know is what effect such a change
would have on the time to run the regression tests with
CLOBBER_CACHE_ALWAYS. That has nothing to do with any real-world
performance concerns, but it would be good to see if we're going to
cause a problem for the long-suffering buildfarm member that does that.

regards, tom lane

#61Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#60)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Dec 19, 2011 at 11:13:57PM -0500, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

I created a function that does this in a loop:

HeapTuple t;

CatalogCacheFlushCatalog(ProcedureRelationId);
t = SearchSysCache1(PROCOID, ObjectIdGetDatum(42) /* int4in */);
if (!HeapTupleIsValid(t))
elog(ERROR, "cache lookup failed for function 42");
ReleaseSysCache(t);

... but this performance test seems to me to be entirely misguided,
because it's testing a situation that isn't going to occur much in the
field, precisely because the syscache should prevent constant reloads of
the same syscache entry.

[ideas for more-realistic tests]

Granted, but I don't hope to reliably measure a change in a macro-benchmark
after seeing a rickety 2% change in a micro-benchmark.

#62Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#61)
Re: ALTER TABLE lock strength reduction patch is unsafe

Noah Misch <noah@leadboat.com> writes:

On Mon, Dec 19, 2011 at 11:13:57PM -0500, Tom Lane wrote:

... but this performance test seems to me to be entirely misguided,
because it's testing a situation that isn't going to occur much in the
field, precisely because the syscache should prevent constant reloads of
the same syscache entry.

[ideas for more-realistic tests]

Granted, but I don't hope to reliably measure a change in a macro-benchmark
after seeing a rickety 2% change in a micro-benchmark.

No, I'm not sure about that at all. In particular I think that
CatalogCacheFlushCatalog is pretty expensive and so the snapshot costs
could be a larger part of a more-realistic test.

regards, tom lane

#63Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#62)
1 attachment(s)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Tue, Dec 20, 2011 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Noah Misch <noah@leadboat.com> writes:

On Mon, Dec 19, 2011 at 11:13:57PM -0500, Tom Lane wrote:

... but this performance test seems to me to be entirely misguided,
because it's testing a situation that isn't going to occur much in the
field, precisely because the syscache should prevent constant reloads of
the same syscache entry.

[ideas for more-realistic tests]

Granted, but I don't hope to reliably measure a change in a macro-benchmark
after seeing a rickety 2% change in a micro-benchmark.

No, I'm not sure about that at all.  In particular I think that
CatalogCacheFlushCatalog is pretty expensive and so the snapshot costs
could be a larger part of a more-realistic test.

Attached patch makes SnapshotNow into an MVCC snapshot, initialised at
the start of each scan iff SnapshotNow is passed as the scan's
snapshot. It's fairly brief but seems to do the trick.

Assuming that is the right approach, some timings

10,000 functions individually
Create 15.7s 14.3s 14.8s 14.8s
Drop 11.9s 11.7s 11.6s 12.0s

10,000 functions in a schema
Create ... same ...
Drop Cascade 2.5s

That seems OK to me.

Any feedback?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

snapshotnow_consistent.v4.patchtext/x-patch; charset=US-ASCII; name=snapshotnow_consistent.v4.patchDownload
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5f6ac2e..8c6d6a1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1201,6 +1201,8 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	 */
 	scan->rs_pageatatime = IsMVCCSnapshot(snapshot);
 
+	InitSnapshotNowIfNeeded(snapshot);
+
 	/*
 	 * For a seqscan in a serializable transaction, acquire a predicate lock
 	 * on the entire relation. This is required not only to lock all the
@@ -1680,6 +1682,8 @@ heap_get_latest_tid(Relation relation,
 	if (!ItemPointerIsValid(tid))
 		return;
 
+	InitSnapshotNowIfNeeded(snapshot);
+
 	/*
 	 * Since this can be called with user-supplied TID, don't trust the input
 	 * too much.  (RelationGetNumberOfBlocks is an expensive check, so we
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 16ac4e1..4e03cb5 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -292,6 +292,8 @@ index_beginscan_internal(Relation indexRelation,
 	 */
 	RelationIncrementReferenceCount(indexRelation);
 
+	InitSnapshotNowIfNeeded(snapshot);
+
 	/*
 	 * Tell the AM to open a scan.
 	 */
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5aebbd1..61090be 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -206,6 +206,19 @@ GetLatestSnapshot(void)
 }
 
 /*
+ * InitSnapshotNowIfNeeded
+ *		If passed snapshot is SnapshotNow then refresh it with latest info.
+ */
+void
+InitSnapshotNowIfNeeded(Snapshot snap)
+{
+	if (!IsSnapshotNow(snap))
+		return;
+
+	snap = GetSnapshotData(&SnapshotNowData);
+}
+
+/*
  * SnapshotSetCommandId
  *		Propagate CommandCounterIncrement into the static snapshots, if set
  */
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 31791a7..61edafa 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -67,7 +67,7 @@
 
 
 /* Static variables representing various special snapshot semantics */
-SnapshotData SnapshotNowData = {HeapTupleSatisfiesNow};
+SnapshotData SnapshotNowData = {HeapTupleSatisfiesMVCC};
 SnapshotData SnapshotSelfData = {HeapTupleSatisfiesSelf};
 SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
 SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
@@ -293,6 +293,7 @@ HeapTupleSatisfiesSelf(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
 	return false;
 }
 
+#ifdef RECENTLY_DEAD_CODE
 /*
  * HeapTupleSatisfiesNow
  *		True iff heap tuple is valid "now".
@@ -474,6 +475,7 @@ HeapTupleSatisfiesNow(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
 				HeapTupleHeaderGetXmax(tuple));
 	return false;
 }
+#endif
 
 /*
  * HeapTupleSatisfiesAny
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index f195981..a30fb9f 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -24,6 +24,7 @@ extern TransactionId RecentGlobalXmin;
 
 extern Snapshot GetTransactionSnapshot(void);
 extern Snapshot GetLatestSnapshot(void);
+extern void InitSnapshotNowIfNeeded(Snapshot snap);
 extern void SnapshotSetCommandId(CommandId curcid);
 
 extern void PushActiveSnapshot(Snapshot snapshot);
diff --git a/src/include/utils/tqual.h b/src/include/utils/tqual.h
index 0f8a7f8..760fc3f 100644
--- a/src/include/utils/tqual.h
+++ b/src/include/utils/tqual.h
@@ -40,6 +40,8 @@ extern PGDLLIMPORT SnapshotData SnapshotToastData;
 /* This macro encodes the knowledge of which snapshots are MVCC-safe */
 #define IsMVCCSnapshot(snapshot)  \
 	((snapshot)->satisfies == HeapTupleSatisfiesMVCC)
+#define IsSnapshotNow(snapshot)  \
+	((snapshot) == (&SnapshotNowData))
 
 /*
  * HeapTupleSatisfiesVisibility
#64Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#63)
Re: ALTER TABLE lock strength reduction patch is unsafe

Simon Riggs <simon@2ndQuadrant.com> writes:

Assuming that is the right approach, some timings

Um ... timings of what?

regards, tom lane

#65Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#64)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jan 2, 2012 at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

Assuming that is the right approach, some timings

Um ... timings of what?

Apologies for being terse; no problem to give a full explanation.

You suggested earlier that putting calls to GetSnapshotData() in place
of using SnapshotNow might increase the time taken to do catalog
scans. That matters most in places where stuff isn't cached; one such
place, you suggested, might be the dependency check code, so a test of
dropping thousands of functions would show up any bad regressions.

From what I can see, there are no bad performance regressions from
making SnapshotNow use MVCC.

We're benefiting in 9.2 from good reductions in GetSnapshotData()
contention and improved performance anyway, so it looks like a
reasonable balance.

I'm proposing that we apply the above patch (v4) to allow SnapshotNow
scans to return consistent, trustworthy results. That fixes some of
the corner-case bugs we've seen and paves the way for me to re-enable
the lock reduction code.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#66Noah Misch
noah@leadboat.com
In reply to: Simon Riggs (#63)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jan 02, 2012 at 05:09:16PM +0000, Simon Riggs wrote:

Attached patch makes SnapshotNow into an MVCC snapshot, initialised at
the start of each scan iff SnapshotNow is passed as the scan's
snapshot. It's fairly brief but seems to do the trick.

That's a neat trick. However, if you start a new SnapshotNow scan while one is
ongoing, the primordial scan's snapshot will change mid-stream.

#67Simon Riggs
simon@2ndQuadrant.com
In reply to: Noah Misch (#66)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jan 2, 2012 at 6:06 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Jan 02, 2012 at 05:09:16PM +0000, Simon Riggs wrote:

Attached patch makes SnapshotNow into an MVCC snapshot, initialised at
the start of each scan iff SnapshotNow is passed as the scan's
snapshot. It's fairly brief but seems to do the trick.

That's a neat trick.  However, if you start a new SnapshotNow scan while one is
ongoing, the primordial scan's snapshot will change mid-stream.

Do we ever do that? (and if so, Why?!? or perhaps just Where?)

We can use more complex code if required, but we'll be adding
complexity and code into the main path that I'd like to avoid.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#68Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#67)
Re: ALTER TABLE lock strength reduction patch is unsafe

Simon Riggs <simon@2ndQuadrant.com> writes:

On Mon, Jan 2, 2012 at 6:06 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Jan 02, 2012 at 05:09:16PM +0000, Simon Riggs wrote:

Attached patch makes SnapshotNow into an MVCC snapshot,

That's a neat trick.  However, if you start a new SnapshotNow scan while one is
ongoing, the primordial scan's snapshot will change mid-stream.

Do we ever do that?

Almost certainly yes. For example, a catcache load may invoke catcache
or relcache reload operations on its way to opening the table or index
needed to fetch the desired row.

I think you can only safely do this if each caller has its own snapshot
variable, a la SnapshotDirty, and that's going to be hugely more
invasive.
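
A minimal sketch of that arrangement, not from the thread (the
CatalogSnapshotData and example_catalog_scan names are invented): each
call site keeps its own statically allocated SnapshotData and refreshes
it with GetSnapshotData() just before scanning, so a nested scan can't
clobber it.

static SnapshotData CatalogSnapshotData = {HeapTupleSatisfiesMVCC};

static void
example_catalog_scan(Relation rel, Oid indexOid, int nkeys, ScanKey key)
{
	/* refresh this call site's private snapshot; nested scans use their own */
	Snapshot	snap = GetSnapshotData(&CatalogSnapshotData);
	SysScanDesc scan = systable_beginscan(rel, indexOid, true,
										  snap, nkeys, key);
	HeapTuple	tuple;

	while (HeapTupleIsValid(tuple = systable_getnext(scan)))
	{
		/* ... use tuple ... */
	}
	systable_endscan(scan);
}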

regards, tom lane

#69Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#65)
Re: ALTER TABLE lock strength reduction patch is unsafe

Simon Riggs <simon@2ndQuadrant.com> writes:

On Mon, Jan 2, 2012 at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Um ... timings of what?

Apologies for being terse, no problem to give a full explanation.

But you still didn't. I wanted to know what those numbers were and how
they show that there's not a performance regression. Presumably you
meant that some were "before" and some "after", but they were not so
labeled.

regards, tom lane

#70Noah Misch
noah@leadboat.com
In reply to: Simon Riggs (#67)
1 attachment(s)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jan 02, 2012 at 06:41:31PM +0000, Simon Riggs wrote:

On Mon, Jan 2, 2012 at 6:06 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Jan 02, 2012 at 05:09:16PM +0000, Simon Riggs wrote:

Attached patch makes SnapshotNow into an MVCC snapshot, initialised at
the start of each scan iff SnapshotNow is passed as the scan's
snapshot. It's fairly brief but seems to do the trick.

That's a neat trick.  However, if you start a new SnapshotNow scan while one is
ongoing, the primordial scan's snapshot will change mid-stream.

Do we ever do that? (and if so, Why?!? or perhaps just Where?)

I hacked up your patch a bit, as attached, to emit a WARNING for any nested
use of SnapshotNow. This made 97/127 test files fail. As one example,
RelationBuildRuleLock() does TextDatumGetCString() for every tuple of its
SnapshotNow scan. That may need a detoast, which itself runs a scan.

We can use more complex code if required, but we'll be adding
complexity and code into the main path that I'd like to avoid.

Agreed.

Attachments:

find-overlap-snapshotnow.patchtext/plain; charset=us-asciiDownload
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 1201,1206 **** heap_beginscan_internal(Relation relation, Snapshot snapshot,
--- 1201,1208 ----
  	 */
  	scan->rs_pageatatime = IsMVCCSnapshot(snapshot);
  
+ 	InitSnapshotNowIfNeeded(snapshot);
+ 
  	/*
  	 * For a seqscan in a serializable transaction, acquire a predicate lock
  	 * on the entire relation. This is required not only to lock all the
***************
*** 1270,1275 **** heap_endscan(HeapScanDesc scan)
--- 1272,1279 ----
  	if (BufferIsValid(scan->rs_cbuf))
  		ReleaseBuffer(scan->rs_cbuf);
  
+ 	FinishSnapshotNow(scan->rs_snapshot);
+ 
  	/*
  	 * decrement relation reference count and free scan descriptor storage
  	 */
***************
*** 1680,1685 **** heap_get_latest_tid(Relation relation,
--- 1684,1691 ----
  	if (!ItemPointerIsValid(tid))
  		return;
  
+ 	InitSnapshotNowIfNeeded(snapshot);
+ 
  	/*
  	 * Since this can be called with user-supplied TID, don't trust the input
  	 * too much.  (RelationGetNumberOfBlocks is an expensive check, so we
***************
*** 1775,1780 **** heap_get_latest_tid(Relation relation,
--- 1781,1788 ----
  		priorXmax = HeapTupleHeaderGetXmax(tp.t_data);
  		UnlockReleaseBuffer(buffer);
  	}							/* end of loop */
+ 
+ 	FinishSnapshotNow(snapshot);
  }
  
  
*** a/src/backend/access/index/indexam.c
--- b/src/backend/access/index/indexam.c
***************
*** 292,297 **** index_beginscan_internal(Relation indexRelation,
--- 292,299 ----
  	 */
  	RelationIncrementReferenceCount(indexRelation);
  
+ 	InitSnapshotNowIfNeeded(snapshot);
+ 
  	/*
  	 * Tell the AM to open a scan.
  	 */
***************
*** 370,375 **** index_endscan(IndexScanDesc scan)
--- 372,379 ----
  	/* End the AM's scan */
  	FunctionCall1(procedure, PointerGetDatum(scan));
  
+ 	FinishSnapshotNow(scan->xs_snapshot);
+ 
  	/* Release index refcount acquired by index_beginscan */
  	RelationDecrementReferenceCount(scan->indexRelation);
  
*** a/src/backend/utils/time/snapmgr.c
--- b/src/backend/utils/time/snapmgr.c
***************
*** 205,210 **** GetLatestSnapshot(void)
--- 205,235 ----
  	return SecondarySnapshot;
  }
  
+ static bool SnapshotNowActive;
+ 
+ /*
+  * InitSnapshotNowIfNeeded
+  *		If passed snapshot is SnapshotNow then refresh it with latest info.
+  */
+ void
+ InitSnapshotNowIfNeeded(Snapshot snap)
+ {
+ 	if (!IsSnapshotNow(snap))
+ 		return;
+ 
+ 	if (SnapshotNowActive)
+ 		elog(WARNING, "SnapshotNow used concurrently");
+ 
+ 	snap = GetSnapshotData(&SnapshotNowData);
+ 	SnapshotNowActive = true;
+ }
+ 
+ void FinishSnapshotNow(Snapshot snap)
+ {
+ 	if (IsSnapshotNow(snap))
+ 		SnapshotNowActive = false;
+ }
+ 
  /*
   * SnapshotSetCommandId
   *		Propagate CommandCounterIncrement into the static snapshots, if set
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
***************
*** 67,73 ****
  
  
  /* Static variables representing various special snapshot semantics */
! SnapshotData SnapshotNowData = {HeapTupleSatisfiesNow};
  SnapshotData SnapshotSelfData = {HeapTupleSatisfiesSelf};
  SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
  SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
--- 67,73 ----
  
  
  /* Static variables representing various special snapshot semantics */
! SnapshotData SnapshotNowData = {HeapTupleSatisfiesMVCC};
  SnapshotData SnapshotSelfData = {HeapTupleSatisfiesSelf};
  SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
  SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
***************
*** 293,298 **** HeapTupleSatisfiesSelf(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
--- 293,299 ----
  	return false;
  }
  
+ #ifdef RECENTLY_DEAD_CODE
  /*
   * HeapTupleSatisfiesNow
   *		True iff heap tuple is valid "now".
***************
*** 474,479 **** HeapTupleSatisfiesNow(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
--- 475,481 ----
  				HeapTupleHeaderGetXmax(tuple));
  	return false;
  }
+ #endif
  
  /*
   * HeapTupleSatisfiesAny
*** a/src/include/utils/snapmgr.h
--- b/src/include/utils/snapmgr.h
***************
*** 24,29 **** extern TransactionId RecentGlobalXmin;
--- 24,31 ----
  
  extern Snapshot GetTransactionSnapshot(void);
  extern Snapshot GetLatestSnapshot(void);
+ extern void InitSnapshotNowIfNeeded(Snapshot snap);
+ extern void FinishSnapshotNow(Snapshot snap);
  extern void SnapshotSetCommandId(CommandId curcid);
  
  extern void PushActiveSnapshot(Snapshot snapshot);
*** a/src/include/utils/tqual.h
--- b/src/include/utils/tqual.h
***************
*** 40,45 **** extern PGDLLIMPORT SnapshotData SnapshotToastData;
--- 40,47 ----
  /* This macro encodes the knowledge of which snapshots are MVCC-safe */
  #define IsMVCCSnapshot(snapshot)  \
  	((snapshot)->satisfies == HeapTupleSatisfiesMVCC)
+ #define IsSnapshotNow(snapshot)  \
+ 	((snapshot) == (&SnapshotNowData))
  
  /*
   * HeapTupleSatisfiesVisibility
#71Alvaro Herrera
alvherre@commandprompt.com
In reply to: Noah Misch (#70)
Re: ALTER TABLE lock strength reduction patch is unsafe

Excerpts from Noah Misch's message of Mon Jan 02 16:25:25 -0300 2012:

On Mon, Jan 02, 2012 at 06:41:31PM +0000, Simon Riggs wrote:

On Mon, Jan 2, 2012 at 6:06 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Jan 02, 2012 at 05:09:16PM +0000, Simon Riggs wrote:

Attached patch makes SnapshotNow into an MVCC snapshot, initialised at
the start of each scan iff SnapshotNow is passed as the scan's
snapshot. It's fairly brief but seems to do the trick.

That's a neat trick.  However, if you start a new SnapshotNow scan while one is
ongoing, the primordial scan's snapshot will change mid-stream.

Do we ever do that? (and if so, Why?!? or perhaps just Where?)

I hacked up your patch a bit, as attached, to emit a WARNING for any nested
use of SnapshotNow. This made 97/127 test files fail. As one example,
RelationBuildRuleLock() does TextDatumGetCString() for every tuple of its
SnapshotNow scan. That may need a detoast, which itself runs a scan.

Uh, I thought detoasting had its own visibility test function .. I mean,
otherwise, what is HeapTupleSatisfiesToast for?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#72Noah Misch
noah@leadboat.com
In reply to: Alvaro Herrera (#71)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jan 02, 2012 at 04:33:28PM -0300, Alvaro Herrera wrote:

Excerpts from Noah Misch's message of Mon Jan 02 16:25:25 -0300 2012:

On Mon, Jan 02, 2012 at 06:41:31PM +0000, Simon Riggs wrote:

On Mon, Jan 2, 2012 at 6:06 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Jan 02, 2012 at 05:09:16PM +0000, Simon Riggs wrote:

Attached patch makes SnapshotNow into an MVCC snapshot, initialised at
the start of each scan iff SnapshotNow is passed as the scan's
snapshot. It's fairly brief but seems to do the trick.

That's a neat trick.  However, if you start a new SnapshotNow scan while one is
ongoing, the primordial scan's snapshot will change mid-stream.

Do we ever do that? (and if so, Why?!? or perhaps just Where?)

I hacked up your patch a bit, as attached, to emit a WARNING for any nested
use of SnapshotNow. This made 97/127 test files fail. As one example,
RelationBuildRuleLock() does TextDatumGetCString() for every tuple of its
SnapshotNow scan. That may need a detoast, which itself runs a scan.

Uh, I thought detoasting had its own visibility test function .. I mean,
otherwise, what is HeapTupleSatisfiesToast for?

The SnapshotNow scan was actually to build the relcache entry for the toast
table before scanning the toast table itself. Stack trace:

#0 InitSnapshotNowIfNeeded (snap=0xb7e8c0) at snapmgr.c:216
#1 0x000000000048f773 in index_beginscan_internal (indexRelation=0x7f9a091e5cc8, nkeys=2, norderbys=0, snapshot=0xb7e8c0) at indexam.c:295
#2 0x000000000048f7e9 in index_beginscan (heapRelation=0x7f9a091a6c60, indexRelation=0xb7e8c0, snapshot=0xb7e8c0, nkeys=152984776, norderbys=0) at indexam.c:238
#3 0x000000000048e1de in systable_beginscan (heapRelation=0x7f9a091a6c60, indexId=<value optimized out>, indexOK=<value optimized out>, snapshot=0xb7e8c0, nkeys=2, key=0x7fffe211b610) at genam.c:289
#4 0x00000000007287db in RelationBuildTupleDesc (targetRelId=<value optimized out>, insertIt=1 '\001') at relcache.c:455
#5 RelationBuildDesc (targetRelId=<value optimized out>, insertIt=1 '\001') at relcache.c:880
#6 0x000000000072a050 in RelationIdGetRelation (relationId=2838) at relcache.c:1567
#7 0x00000000004872c0 in relation_open (relationId=2838, lockmode=1) at heapam.c:910
#8 0x0000000000487328 in heap_open (relationId=12052672, lockmode=152984776) at heapam.c:1052
#9 0x000000000048a737 in toast_fetch_datum (attr=<value optimized out>) at tuptoaster.c:1586
#10 0x000000000048bf74 in heap_tuple_untoast_attr (attr=0xb7e8c0) at tuptoaster.c:135
#11 0x0000000000737d77 in pg_detoast_datum_packed (datum=0xb7e8c0) at fmgr.c:2266
#12 0x00000000006f0fb0 in text_to_cstring (t=0xb7e8c0) at varlena.c:135
#13 0x0000000000727083 in RelationBuildRuleLock (relation=0x7f9a091b3818) at relcache.c:673
#14 0x000000000072909e in RelationBuildDesc (targetRelId=<value optimized out>, insertIt=1 '\001') at relcache.c:886
#15 0x000000000072a050 in RelationIdGetRelation (relationId=11119) at relcache.c:1567
#16 0x00000000004872c0 in relation_open (relationId=11119, lockmode=0) at heapam.c:910
#17 0x0000000000487483 in relation_openrv_extended (relation=<value optimized out>, lockmode=<value optimized out>, missing_ok=<value optimized out>) at heapam.c:1011
#18 0x000000000048749c in heap_openrv_extended (relation=0xb7e8c0, lockmode=152984776, missing_ok=5 '\005') at heapam.c:1110
#19 0x000000000053570c in parserOpenTable (pstate=0xc885f8, relation=0xc88258, lockmode=1) at parse_relation.c:835
#20 0x000000000053594d in addRangeTableEntry (pstate=0xc885f8, relation=0xc88258, alias=0x0, inh=1 '\001', inFromCl=1 '\001') at parse_relation.c:901
#21 0x0000000000524dbf in transformTableEntry (pstate=0xc885f8, n=0xc88258, top_rte=0x7fffe211bd78, top_rti=0x7fffe211bd84, relnamespace=0x7fffe211bd70, containedRels=0x7fffe211bd68) at parse_clause.c:445
#22 transformFromClauseItem (pstate=0xc885f8, n=0xc88258, top_rte=0x7fffe211bd78, top_rti=0x7fffe211bd84, relnamespace=0x7fffe211bd70, containedRels=0x7fffe211bd68) at parse_clause.c:683
#23 0x0000000000526124 in transformFromClause (pstate=0xc885f8, frmList=<value optimized out>) at parse_clause.c:129
#24 0x000000000050c321 in transformSelectStmt (pstate=0xb7e8c0, parseTree=0xc88340) at analyze.c:896
#25 transformStmt (pstate=0xb7e8c0, parseTree=0xc88340) at analyze.c:186
#26 0x000000000050d583 in parse_analyze (parseTree=0xc88340, sourceText=0xc87828 "table pg_stat_user_tables;", paramTypes=0x0, numParams=0) at analyze.c:94
#27 0x00000000006789c5 in pg_analyze_and_rewrite (parsetree=0xc88340, query_string=0xc87828 "table pg_stat_user_tables;", paramTypes=0x0, numParams=0) at postgres.c:583
#28 0x0000000000678c77 in exec_simple_query (query_string=0xc87828 "table pg_stat_user_tables;") at postgres.c:941
#29 0x0000000000679997 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=0xbe03c0 "nm") at postgres.c:3881
#30 0x0000000000633d41 in BackendRun () at postmaster.c:3587
#31 BackendStartup () at postmaster.c:3272
#32 ServerLoop () at postmaster.c:1350
#33 0x0000000000635330 in PostmasterMain (argc=1, argv=0xbdc170) at postmaster.c:1110
#34 0x00000000005d206b in main (argc=1, argv=0xbdc170) at main.c:199

#73Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#68)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jan 2, 2012 at 7:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On Mon, Jan 2, 2012 at 6:06 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Jan 02, 2012 at 05:09:16PM +0000, Simon Riggs wrote:

Attached patch makes SnapshotNow into an MVCC snapshot,

That's a neat trick.  However, if you start a new SnapshotNow scan while one is
ongoing, the primordial scan's snapshot will change mid-stream.

Do we ever do that?

Almost certainly yes.  For example, a catcache load may invoke catcache
or relcache reload operations on its way to opening the table or index
needed to fetch the desired row.

Ah, of course. I was thinking they would be rare by design.

I think you can only safely do this if each caller has its own snapshot
variable, a la SnapshotDirty, and that's going to be hugely more
invasive.

OK, will look into it.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#74Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#69)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jan 2, 2012 at 7:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On Mon, Jan 2, 2012 at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Um ... timings of what?

Apologies for being terse, no problem to give a full explanation.

But you still didn't.  I wanted to know what those numbers were and how
they show that there's not a performance regression.  Presumably you
meant that some were "before" and some "after", but they were not so
labeled.

All timings were "after" applying the patch. Since all of the tests
had very acceptable absolute values, I didn't test without the patch.

Anyway, it looks like we need to bin that and retest with the new patch when it comes.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#75Alvaro Herrera
alvherre@commandprompt.com
In reply to: Noah Misch (#72)
Re: ALTER TABLE lock strength reduction patch is unsafe

Excerpts from Noah Misch's message of Mon Jan 02 16:39:09 -0300 2012:

On Mon, Jan 02, 2012 at 04:33:28PM -0300, Alvaro Herrera wrote:

Uh, I thought detoasting had its own visibility test function... I mean,
otherwise, what is HeapTupleSatisfiesToast for?

The SnapshotNow scan was actually to build the relcache entry for the toast
table before scanning the toast table itself. Stack trace:

Oh, right, that makes sense.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#76Simon Riggs
simon@2ndQuadrant.com
In reply to: Alvaro Herrera (#75)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jan 2, 2012 at 7:49 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

Excerpts from Noah Misch's message of Mon Jan 02 16:39:09 -0300 2012:

On Mon, Jan 02, 2012 at 04:33:28PM -0300, Alvaro Herrera wrote:

Uh, I thought detoasting had its own visibility test function... I mean,
otherwise, what is HeapTupleSatisfiesToast for?

The SnapshotNow scan was actually to build the relcache entry for the toast
table before scanning the toast table itself.  Stack trace:

Oh, right, that makes sense.

Certainly does. Thanks Noah, much appreciated.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#77Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#74)
Re: ALTER TABLE lock strength reduction patch is unsafe

Simon Riggs <simon@2ndQuadrant.com> writes:

On Mon, Jan 2, 2012 at 7:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

But you still didn't. I wanted to know what those numbers were and how
they show that there's not a performance regression. Presumably you
meant that some were "before" and some "after", but they were not so
labeled.

All timings were "after" applying the patch. Since all of the tests
had very acceptable absolute values, I didn't test without the patch.

What is a "very acceptable absolute value", and how do you know it's
acceptable if you don't know what the previous performance was? This
reasoning makes no sense to me at all.

regards, tom lane

#78Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#77)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jan 2, 2012 at 8:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On Mon, Jan 2, 2012 at 7:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

But you still didn't.  I wanted to know what those numbers were and how
they show that there's not a performance regression.  Presumably you
meant that some were "before" and some "after", but they were not so
labeled.

All timings were "after" applying the patch. Since all of the tests
had very acceptable absolute values, I didn't test without the patch.

What is a "very acceptable absolute value", and how do you know it's
acceptable if you don't know what the previous performance was?  This
reasoning makes no sense to me at all.

I don't need to do things twice before deciding I enjoyed it the first
time. A low value showed me that there were no problems.

If you want to see more or discuss, that's cool, no problem. But there's
no need to beat me up for not correctly guessing the level of rigour
that would be acceptable to you. Now that I know your requirements, I
will be able to judge further test results against the required
standard.

As I've said, this is all invalid now anyway.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#79Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#68)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jan 2, 2012 at 7:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On Mon, Jan 2, 2012 at 6:06 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Jan 02, 2012 at 05:09:16PM +0000, Simon Riggs wrote:

Attached patch makes SnapshotNow into an MVCC snapshot,

That's a neat trick.  However, if you start a new SnapshotNow scan while one is
ongoing, the primordial scan's snapshot will change mid-stream.

Do we ever do that?

Almost certainly yes.  For example, a catcache load may invoke catcache
or relcache reload operations on its way to opening the table or index
needed to fetch the desired row.

I think you can only safely do this if each caller has its own snapshot
variable, a la SnapshotDirty, and that's going to be hugely more
invasive.

I think I can avoid doing things that way and keep it non-invasive.

If we
#define SnapshotNow (GetSnapshotNow())

and make GetSnapshotNow() reuse the static SnapshotNowData when possible,
allocating a new snapshot in TopTransactionContext when it isn't available.

We can free the snapshot at the end of the scan.

That hides most of the complexity without causing any problems, ISTM.
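
In skeleton form, it might look about like this (a sketch only, not
tested code; resetting the in-use flag and freeing the extra snapshots
is the part the endscan side will need to handle):

    static bool SnapshotNowInUse = false;

    Snapshot
    GetSnapshotNow(void)
    {
        Snapshot snap;

        if (!SnapshotNowInUse)
        {
            /* common case: refill and reuse the static snapshot */
            snap = GetSnapshotData(&SnapshotNowData);
            SnapshotNowInUse = true;
        }
        else
        {
            /* a scan is already using the static copy, so build a
             * fresh MVCC snapshot that lives until end of transaction
             * and is given back at the end of the scan */
            snap = MemoryContextAllocZero(TopTransactionContext,
                                          sizeof(SnapshotData));
            snap->satisfies = HeapTupleSatisfiesMVCC;
            snap = GetSnapshotData(snap);
        }
        return snap;
    }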

Working on this now.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#80Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#79)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jan 2, 2012 at 9:17 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, Jan 2, 2012 at 7:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On Mon, Jan 2, 2012 at 6:06 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Jan 02, 2012 at 05:09:16PM +0000, Simon Riggs wrote:

Attached patch makes SnapshotNow into an MVCC snapshot,

That's a neat trick.  However, if you start a new SnapshotNow scan while one is
ongoing, the primordial scan's snapshot will change mid-stream.

Do we ever do that?

Almost certainly yes.  For example, a catcache load may invoke catcache
or relcache reload operations on its way to opening the table or index
needed to fetch the desired row.

I think you can only safely do this if each caller has its own snapshot
variable, a la SnapshotDirty, and that's going to be hugely more
invasive.

I think I can avoid doing things that way and keep it non-invasive.

If we
  #define  SnapshotNow   (GetSnapshotNow())

and make GetSnapshotNow() reuse the static SnapshotNowData when possible,
allocating a new snapshot in TopTransactionContext when it isn't available.

We can free the snapshot at the end of the scan.

That hides most of the complexity without causing any problems, ISTM.

Working on this now.

Options for implementing this are

1) add FreeSnapshotNowIfNeeded() to every heap_endscan and index_endscan
Small code footprint, touches every scan

2) change all heap_scans that use SnapshotNow into systable_scans
and add FreeSnapshotNowIfNeeded() to systable_endscan
35 call points, touches only system table scans

3) code specific calls to create a snapshot for each SnapshotNow call,
similar to InitDirtySnapshot
185 call points, very fine-grained control, but probably overkill

(2) seems the right way to do this, with selective use of (3), and it
has the side benefit of a reasonable code rationalisation.
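
To make (2) concrete, the endscan side would look roughly like this (a
sketch: FreeSnapshotNowIfNeeded() and the remembered-snapshot field in
the scan descriptor are assumed additions, not existing code):

    void
    systable_endscan(SysScanDesc sysscan)
    {
        if (sysscan->irel)
        {
            index_endscan(sysscan->iscan);
            index_close(sysscan->irel, AccessShareLock);
        }
        else
            heap_endscan(sysscan->scan);

        /* assumed addition: give back any snapshot that
         * GetSnapshotNow() had to allocate for this scan */
        FreeSnapshotNowIfNeeded(sysscan->snapshot);

        pfree(sysscan);
    }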

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#81Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#67)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Mon, Jan 2, 2012 at 6:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, Jan 2, 2012 at 6:06 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Jan 02, 2012 at 05:09:16PM +0000, Simon Riggs wrote:

Attached patch makes SnapshotNow into an MVCC snapshot, initialised at
the start of each scan iff SnapshotNow is passed as the scan's
snapshot. It's fairly brief but seems to do the trick.

That's a neat trick.  However, if you start a new SnapshotNow scan while one is
ongoing, the primordial scan's snapshot will change mid-stream.

Do we ever do that? (and if so, Why?!? or perhaps just Where?)

Just for the record, yes we do run multiple catalog scans in some
parts of the code.

So I can see how we might trigger four nested scans via cache
replacement while scanning, so it's best to assume more, with no
guarantee of them being neatly stacked for push/pop-style access.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#82Noah Misch
noah@leadboat.com
In reply to: Simon Riggs (#81)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Tue, Jan 03, 2012 at 01:18:41AM +0000, Simon Riggs wrote:

Just for the record, yes we do run multiple catalog scans in some
parts of the code.

So I can see how we might trigger four nested scans via cache
replacement while scanning, so it's best to assume more, with no
guarantee of them being neatly stacked for push/pop-style access.

Yeah, I wouldn't want to commit to a nesting limit. However, I _would_ have
expected that a stack would suffice; PushActiveSnapshot()/PopActiveSnapshot()
is adequate for a great deal of the backend, after all. In what sort of
situation do catalog scans not strictly nest?

Thanks,
nm

#83Simon Riggs
simon@2ndQuadrant.com
In reply to: Noah Misch (#82)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Tue, Jan 3, 2012 at 4:40 AM, Noah Misch <noah@leadboat.com> wrote:

On Tue, Jan 03, 2012 at 01:18:41AM +0000, Simon Riggs wrote:

Just for the record, yes we do run multiple catalog scans in some
parts of the code.

So I can see how we might trigger four nested scans via cache
replacement while scanning, so it's best to assume more, with no
guarantee of them being neatly stacked for push/pop-style access.

Yeah, I wouldn't want to commit to a nesting limit.

Agreed

However, I _would_ have
expected that a stack would suffice; PushActiveSnapshot()/PopActiveSnapshot()
is adequate for a great deal of the backend, after all.  In what sort of
situation do catalog scans not strictly nest?

It's possible. Making assumptions about what is possible bit me
before, as you showed.

I've seen code where we are explicitly running two concurrent scans.

I don't want to put unnecessary and subtle assumptions into the
catalog scan code, so I'm working on the assumption that endscan may
not be called in strict nesting (LIFO) order. The difference is fairly
minor and doesn't restrict us in other ways.
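
The distinction, as a sketch (rel1 and rel2 stand in for whatever
catalog relations are involved):

    /* strictly nested (LIFO): a stack of snapshots would suffice */
    SysScanDesc outer = systable_beginscan(rel1, InvalidOid, false,
                                           SnapshotNow, 0, NULL);
    SysScanDesc inner = systable_beginscan(rel2, InvalidOid, false,
                                           SnapshotNow, 0, NULL);
    systable_endscan(inner);
    systable_endscan(outer);

    /* interleaved: the older scan ends first, so a stack of
     * snapshots would pop the wrong entry */
    SysScanDesc a = systable_beginscan(rel1, InvalidOid, false,
                                       SnapshotNow, 0, NULL);
    SysScanDesc b = systable_beginscan(rel2, InvalidOid, false,
                                       SnapshotNow, 0, NULL);
    systable_endscan(a);
    systable_endscan(b);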

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#84Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#83)
1 attachment(s)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Tue, Jan 3, 2012 at 5:47 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, Jan 3, 2012 at 4:40 AM, Noah Misch <noah@leadboat.com> wrote:

On Tue, Jan 03, 2012 at 01:18:41AM +0000, Simon Riggs wrote:

Just for the record, yes we do run multiple catalog scans in some
parts of the code.

So I can see how we might trigger four nested scans via cache
replacement while scanning, so it's best to assume more, with no
guarantee of them being neatly stacked for push/pop-style access.

Yeah, I wouldn't want to commit to a nesting limit.

Agreed

However, I _would_ have
expected that a stack would suffice; PushActiveSnapshot()/PopActiveSnapshot()
is adequate for a great deal of the backend, after all.  In what sort of
situation do catalog scans not strictly nest?

It's possible. Making assumptions about what is possible bit me
before, as you showed.

I've seen code where we are explicitly running two concurrent scans.

I don't want to put unnecessary and subtle assumptions into the
catalog scan code, so I'm working on the assumption that endscan may
not be called in strict nesting (LIFO) order. The difference is fairly
minor and doesn't restrict us in other ways.

I feel like the first thing we should be doing here is some
benchmarking. If we change just the scans in dependency.c and then
try the test case Tom suggested (dropping a schema containing a large
number of functions), we can compare the patched code with master and
get an idea of whether the performance is acceptable. If it is,
changing everything else is a mostly mechanical process that we can
simply grind through. If it's not, I'd rather learn that before we
start grinding.

I started to do this before Christmas, but then I ... went on
vacation. Here's a Perl script that can be used to generate an SQL
script to create as many functions as you'd like to try dropping at
once, in case it's helpful. The idea is to run the resulting SQL
script through psql and then test the speed of "DROP SCHEMA
lots_of_functions CASCADE;".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

lots_of_functions.pl (application/octet-stream)
#85Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#84)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Tue, Jan 3, 2012 at 3:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I feel like the first thing we should be doing here is some
benchmarking.  If we change just the scans in dependency.c and then
try the test case Tom suggested (dropping a schema containing a large
number of functions), we can compare the patched code with master and
get an idea of whether the performance is acceptable.

Yes, I've done this and it takes 2.5s to drop 10,000 functions using
an MVCC snapshot.

That was acceptable to *me*, so I didn't try measuring using just SnapshotNow.

We can do a lot of tests, but in the end it's a human judgement. Are
100% correct results from catalog accesses worth having even when the
real-world speed is substantially less good? (Whether it's 1,000,000
times slower or not is not relevant if it is still fast enough.)

Anybody with that many DDL statements probably cares about doing
various operations with lower lock levels.

Fastpath locking will slow down DDL, but we didn't measure the
performance slowdown there. We understood the benefit and were
willing to pay the price.

So I'll proceed for now with the patch, which isn't as simple as you think.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#86Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#85)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Tue, Jan 3, 2012 at 11:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, Jan 3, 2012 at 3:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I feel like the first thing we should be doing here is some
benchmarking.  If we change just the scans in dependency.c and then
try the test case Tom suggested (dropping a schema containing a large
number of functions), we can compare the patched code with master and
get an idea of whether the performance is acceptable.

Yes, I've done this and it takes 2.5s to drop 10,000 functions using
an MVCC snapshot.

That was acceptable to *me*, so I didn't try measuring using just SnapshotNow.

We can do a lot of tests, but in the end it's a human judgement. Are
100% correct results from catalog accesses worth having even when the
real-world speed is substantially less good? (Whether it's 1,000,000
times slower or not is not relevant if it is still fast enough.)

Sure. But I don't see why that means it wouldn't be nice to know
whether or not it is in fact a million times slower. If Tom's
artificial worst case is a million times slower, then probably there
are some cases we care about more that are going to be measurably
impacted, and we're going to want to think about what to do about
that. We can wait until you've finished the patch before we do that
testing, or we can do it now and maybe get some idea whether the
approach is likely to be viable or whether it's going to require some
adjustment before we actually go trawl through all that code.

On my laptop, dropping a schema with 10,000 functions using commit
d5448c7d31b5af66a809e6580bae9bd31448bfa7 takes 400-500 ms. If my
laptop is the same speed as your laptop, that would mean a 5-6x
slowdown, but of course that's comparing apples and oranges... in any
event, if the real number is anywhere in that ballpark, it's probably
a surmountable obstacle, but I'd like to know rather than guessing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#87Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#85)
Re: ALTER TABLE lock strength reduction patch is unsafe

Simon Riggs <simon@2ndQuadrant.com> writes:

That was acceptable to *me*, so I didn't try measuring using just SnapshotNow.

We can do a lot of tests, but in the end it's a human judgement. Are
100% correct results from catalog accesses worth having even when the
real-world speed is substantially less good? (Whether it's 1,000,000
times slower or not is not relevant if it is still fast enough.)

That argument is just nonsense AFAICT. Yes, 2.5 s to drop 10000
functions is probably fine, but that is an artificial test case whose
only real interest is to benchmark what a change in SnapshotNow scans
might cost us. In the real world it's hard to guess what fraction of a
real workload might consist of such scans, but I suspect there are some
where a significant increase in that cost might hurt. So in my mind the
point of the exercise is to find out how much the cost increased, and
I'm just baffled as to why you won't use a benchmark case to, um,
benchmark.

Another point that requires some thought is that switching SnapshotNow
to be MVCC-based will presumably result in a noticeable increase in each
backend's rate of wanting to acquire snapshots. Hence, more contention
in GetSnapshotData can be expected. A single-threaded test case doesn't
prove anything at all about what that might cost under load.

regards, tom lane

#88Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#87)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Tue, Jan 3, 2012 at 12:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

That was acceptable to *me*, so I didn't try measuring using just SnapshotNow.

We can do a lot of tests, but in the end it's a human judgement. Are
100% correct results from catalog accesses worth having even when the
real-world speed is substantially less good? (Whether it's 1,000,000
times slower or not is not relevant if it is still fast enough.)

That argument is just nonsense AFAICT.  Yes, 2.5 s to drop 10000
functions is probably fine, but that is an artificial test case whose
only real interest is to benchmark what a change in SnapshotNow scans
might cost us.  In the real world it's hard to guess what fraction of a
real workload might consist of such scans, but I suspect there are some
where a significant increase in that cost might hurt.  So in my mind the
point of the exercise is to find out how much the cost increased, and
I'm just baffled as to why you won't use a benchmark case to, um,
benchmark.

Ditto.

Another point that requires some thought is that switching SnapshotNow
to be MVCC-based will presumably result in a noticeable increase in each
backend's rate of wanting to acquire snapshots.  Hence, more contention
in GetSnapshotData can be expected.  A single-threaded test case doesn't
prove anything at all about what that might cost under load.

This is obviously true at some level, but I'm not sure that it really
matters. It's not that difficult to construct a test case where we
have lots of people concurrently reading a table, or reading many
tables, or writing a table, or writing many tables, but what kind of
realistic test case involves enough DDL for any of this to matter? If
you're creating or dropping tables, for example, the filesystem costs
are likely to be a much bigger problem than GetSnapshotData(), to the
point where you probably can't get enough concurrency for
GetSnapshotData() to matter. Maybe you could find a problem case
involving creating or dropping lots and lots of functions
concurrently, or something like that, but who does that? You'd have
to be performing operations on hundreds of non-table SQL objects per
second, and it is hard for me to imagine why anyone would be doing
that. Maybe I'm not imaginative enough?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#89Alvaro Herrera
alvherre@commandprompt.com
In reply to: Robert Haas (#84)
Re: ALTER TABLE lock strength reduction patch is unsafe

Excerpts from Robert Haas's message of Tue Jan 03 12:24:52 -0300 2012:

I feel like the first thing we should be doing here is some
benchmarking. If we change just the scans in dependency.c and then
try the test case Tom suggested (dropping a schema containing a large
number of functions), we can compare the patched code with master and
get an idea of whether the performance is acceptable. If it is,
changing everything else is a mostly mechanical process that we can
simply grind through. If it's not, I'd rather learn that before we
start grinding.

If there are many call sites, maybe it'd be a good idea to use a
semantic patcher tool such as Coccinelle instead of doing it one by one.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#90Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#88)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Jan 3, 2012 at 12:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Another point that requires some thought is that switching SnapshotNow
to be MVCC-based will presumably result in a noticeable increase in each
backend's rate of wanting to acquire snapshots. Hence, more contention
in GetSnapshotData can be expected. A single-threaded test case doesn't
prove anything at all about what that might cost under load.

This is obviously true at some level, but I'm not sure that it really
matters. It's not that difficult to construct a test case where we
have lots of people concurrently reading a table, or reading many
tables, or writing a table, or writing many tables, but what kind of
realistic test case involves enough DDL for any of this to matter?

Um ... you're supposing that only DDL uses SnapshotNow, which is wrong.
I refer you to the parser, the planner, execution functions for arrays,
records, enums, any sort of relcache reload, etc etc etc. Yes, some
of that is masked by backend-internal caching, some of the time, but
it's folly to just assume that there are no SnapshotNow scans during
normal queries.

None of this is necessarily grounds to reject a patch along the proposed
lines. I'm just asking for some benchmarking effort to establish what
the costs might be, rather than naively hoping they are negligible.

regards, tom lane

#91Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#90)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Tue, Jan 3, 2012 at 1:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Um ... you're supposing that only DDL uses SnapshotNow, which is wrong.
I refer you to the parser, the planner, execution functions for arrays,
records, enums, any sort of relcache reload, etc etc etc.  Yes, some
of that is masked by backend-internal caching, some of the time, but
it's folly to just assume that there are no SnapshotNow scans during
normal queries.

Hmm. That's unfortunate, because it seems difficult to construct a
test case that will exercise every feature in the system.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#92Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#91)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> wrote:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

Um ... you're supposing that only DDL uses SnapshotNow, which is
wrong. I refer you to the parser, the planner, execution
functions for arrays, records, enums, any sort of relcache
reload, etc etc etc. Yes, some of that is masked by
backend-internal caching, some of the time, but it's folly to
just assume that there are no SnapshotNow scans during normal
queries.

Hmm. That's unfortunate, because it seems difficult to construct
a test case that will exercise every feature in the system.

Would the result of IsMVCCSnapshot() change for these snapshots? If
so, it might require work in the SSI code to avoid a performance hit
there. In early profiling and stepping through execution I noticed
that we had overhead in serializable transactions from the planner's
checks of the actual values at the beginning or end of an index.
This went away when we avoided SSI work for reads using a non-MVCC
snapshot. If we're going to start using MVCC snapshots for such
things, we need some other way to avoid unnecessary work in this
area (and possibly others).

At a minimum, some comparative benchmarks at the serializable
isolation level would be in order when considering a patch like
this.
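
For reference, the test in question is cheap precisely because it is a
one-line inspection of the snapshot, roughly (from tqual.h):

    #define IsMVCCSnapshot(snapshot)  \
        ((snapshot)->satisfies == HeapTupleSatisfiesMVCC)

so if promoted SnapshotNow snapshots start carrying
HeapTupleSatisfiesMVCC as their satisfies routine, they will be
classified as MVCC reads and fall into the SSI bookkeeping paths
unless something else marks them as exempt.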

-Kevin

#93Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#88)
Re: ALTER TABLE lock strength reduction patch is unsafe

I wrote:

Another point that requires some thought is that switching SnapshotNow
to be MVCC-based will presumably result in a noticeable increase in each
backend's rate of wanting to acquire snapshots.

BTW, I wonder if this couldn't be ameliorated by establishing some
ground rules about how up-to-date a snapshot really needs to be.
Arguably, it should be okay for successive SnapshotNow scans to use the
same snapshot as long as we have not acquired a new lock in between.
If we haven't, reusing an old snap doesn't introduce any race condition that
wasn't there already.

regards, tom lane

#94Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#92)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Tue, Jan 3, 2012 at 1:42 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

Um ... you're supposing that only DDL uses SnapshotNow, which is
wrong.  I refer you to the parser, the planner, execution
functions for arrays, records, enums, any sort of relcache
reload, etc etc etc.  Yes, some of that is masked by
backend-internal caching, some of the time, but it's folly to
just assume that there are no SnapshotNow scans during normal
queries.

Hmm.  That's unfortunate, because it seems difficult to construct
a test case that will exercise every feature in the system.

Would the result of IsMVCCSnapshot() change for these snapshots?  If
so, it might require work in the SSI code to avoid a performance hit
there.  In early profiling and stepping through execution I noticed
that we had overhead in serializable transactions from the planner's
checks of the actual values at the beginning or end of an index.
This went away when we avoided SSI work for reads using a non-MVCC
snapshot.  If we're going to start using MVCC snapshots for such
things, we need some other way to avoid unnecessary work in this
area (and possibly others).

At a minimum, some comparative benchmarks at the serializable
isolation level would be in order when considering a patch like
this.

Ugh. Yeah, that sounds like a problem. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#95Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#93)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Tue, Jan 3, 2012 at 1:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

Another point that requires some thought is that switching SnapshotNow
to be MVCC-based will presumably result in a noticeable increase in each
backend's rate of wanting to acquire snapshots.

BTW, I wonder if this couldn't be ameliorated by establishing some
ground rules about how up-to-date a snapshot really needs to be.
Arguably, it should be okay for successive SnapshotNow scans to use the
same snapshot as long as we have not acquired a new lock in between.
If we haven't, reusing an old snap doesn't introduce any race condition that
wasn't there already.

Is that likely to help much? I think our usual pattern is to lock the
catalog, scan it, and then release the lock, so there will normally be
an AcceptInvalidationMessages() just before the scan. Or at least, I
think there will.

Another thought is that it should always be safe to reuse an old
snapshot if no transactions have committed or aborted since it was
taken (possibly we could narrow that to "no transactions have
committed since it was taken", but I think that might cause some
headaches with RecentGlobalXmin). I wonder if we could come up with a
cheap, hopefully lock-free method of determining whether or not that's
the case. For example, suppose we store the last XID to commit or
abort in shared memory. In GetSnapshotData(), we check whether that
value has changed (probably after enforcing a memory barrier of some
kind) before acquiring the lock. If it has, we proceed normally, but
if not, we just reuse the results from the previous GetSnapshotData().
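
In sketch form, with all of the names here invented for the
illustration:

    /* assumed: shared memory records the XID of the last transaction
     * to commit or abort, updated at every commit/abort */
    extern volatile TransactionId lastCompletedXid;     /* invented */

    static TransactionId xidAtCachedSnapshot = InvalidTransactionId;
    static SnapshotData  cachedSnapshotData;
    static bool          haveCachedSnapshot = false;

    Snapshot
    GetSnapshotDataCached(void)                         /* invented */
    {
        /* read the counter *before* building any new snapshot, so a
         * commit that slips in between forces a refresh next time
         * rather than an unsafe reuse */
        TransactionId seen = lastCompletedXid;

        pg_read_barrier();      /* the memory barrier mooted above */

        if (haveCachedSnapshot &&
            TransactionIdEquals(seen, xidAtCachedSnapshot))
            return &cachedSnapshotData;     /* nothing changed: reuse */

        /* slow path: the usual lock-taking GetSnapshotData() */
        GetSnapshotData(&cachedSnapshotData);
        xidAtCachedSnapshot = seen;
        haveCachedSnapshot = true;
        return &cachedSnapshotData;
    }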

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#96Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#95)
Re: ALTER TABLE lock strength reduction patch is unsafe

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Jan 3, 2012 at 1:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, I wonder if this couldn't be ameliorated by establishing some
ground rules about how up-to-date a snapshot really needs to be.
Arguably, it should be okay for successive SnapshotNow scans to use the
same snapshot as long as we have not acquired a new lock in between.
If we haven't, reusing an old snap doesn't introduce any race condition that
wasn't there already.

Is that likely to help much? I think our usual pattern is to lock the
catalog, scan it, and then release the lock, so there will normally be
an AcceptInvalidationMessages() just before the scan. Or at least, I
think there will.

Um, good point. Those locks aren't meant to avoid race conditions,
but the mechanism doesn't know that.

Another thought is that it should always be safe to reuse an old
snapshot if no transactions have committed or aborted since it was
taken

Yeah, that might work better. And it'd be a win for all MVCC snaps,
not just the ones coming from promoted SnapshotNow ...

regards, tom lane

#97Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#90)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Tue, Jan 3, 2012 at 6:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Jan 3, 2012 at 12:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Another point that requires some thought is that switching SnapshotNow
to be MVCC-based will presumably result in a noticeable increase in each
backend's rate of wanting to acquire snapshots.  Hence, more contention
in GetSnapshotData can be expected.  A single-threaded test case doesn't
prove anything at all about what that might cost under load.

This is obviously true at some level, but I'm not sure that it really
matters.  It's not that difficult to construct a test case where we
have lots of people concurrently reading a table, or reading many
tables, or writing a table, or writing many tables, but what kind of
realistic test case involves enough DDL for any of this to matter?

Um ... you're supposing that only DDL uses SnapshotNow, which is wrong.
I refer you to the parser, the planner, execution functions for arrays,
records, enums, any sort of relcache reload, etc etc etc.  Yes, some
of that is masked by backend-internal caching, some of the time, but
it's folly to just assume that there are no SnapshotNow scans during
normal queries.

None of this is necessarily grounds to reject a patch along the proposed
lines.  I'm just asking for some benchmarking effort to establish what
the costs might be, rather than naively hoping they are negligible.

All of which is reasonable doubt.

So far, all I'm saying is that on a couple of heavy-duty cases that
I've run, I haven't noticed a problem. That is sufficient for me to
push forward with a patch that comes very close to what we want, so we
can judge acceptability with wider tests. I'm not saying it will be
acceptable a priori, only that it is worth trying because the benefits
are high. The number of call points is also high.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#98Simon Riggs
simon@2ndQuadrant.com
In reply to: Alvaro Herrera (#89)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Tue, Jan 3, 2012 at 6:14 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

If there are many call sites, maybe it'd be a good idea to use a
semantic patcher tool such as Coccinelle instead of doing it one by one.

Thanks for the suggestion; regrettably, I've already made those changes.

After examining the call sites, I identified 35 that might need
changing. Of those, about 30 were changed to use systable_beginscan,
while a few others use declared snapshots instead. So it was not a
great effort, and worth doing the by-hand inspection.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#99Alvaro Herrera
alvherre@commandprompt.com
In reply to: Simon Riggs (#98)
Re: ALTER TABLE lock strength reduction patch is unsafe

Excerpts from Simon Riggs's message of Tue Jan 03 17:57:56 -0300 2012:

On Tue, Jan 3, 2012 at 6:14 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

If there are many call sites, maybe it'd be a good idea to use a
semantic patcher tool such as Coccinelle instead of doing it one by one.

Thanks for the suggestion; regrettably, I've already made those changes.

After examining the call sites, I identified 35 that might need
changing. Of those, about 30 were changed to use systable_beginscan,
while a few others use declared snapshots instead. So it was not a
great effort, and worth doing the by-hand inspection.

Yeah, for 35 changes I wouldn't have gone to the lengths required to
learn the new tool either. If there had been 200, OTOH ...

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#100Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#52)
Re: ALTER TABLE lock strength reduction patch is unsafe

Was this resolved? (Sorry to be bugging everyone.)

---------------------------------------------------------------------------

On Tue, Nov 29, 2011 at 11:42:10PM -0500, Robert Haas wrote:

On Tue, Nov 29, 2011 at 11:50 AM, Bruce Momjian <bruce@momjian.us> wrote:

Just confirming we decided not to persue this.

Doesn't sound like it.

I've been thinking a lot about the more general problem here - namely,
that allowing catalog changes without an access exclusive lock is
unsafe - and trying to come up with a good solution, because I'd
really like to see us make this work. It strikes me that there are
really two separate problems here.

1. If you are scanning a system catalog using SnapshotNow, and a
commit happens at just the right moment, you might see two versions of
the row or zero rather than one. You'll see two if you scan the old
row version, then the concurrent transaction commits, and then you
scan the new one. You'll see zero if you scan the new row (ignoring
it, since it isn't committed yet) and then the concurrent transaction
commits, and then you scan the old row. On a related note, if you
scan several interrelated catalogs (e.g. when building a relcache
entry) and a transaction that has made matching modifications to
multiple catalogs commits midway through, you might see the old
version of the data from one table and the new version of the data
from some other table. Using AccessExclusiveLock fixes the problem
because we guarantee that anybody who wants to read those rows first
takes some kind of lock, which they won't be able to do while the
updater holds AccessExclusiveLock.

2. Other backends may have data in the relcache or catcaches which
won't get invalidated until they do AcceptInvalidationMessages().
That will always happen no later than the next time they lock the
relation, so if you are holding AccessExclusiveLock then you should be
OK: no one else holds any lock, and they'll have to go get one before
doing anything interesting. But if you are holding any weaker lock,
there's nothing to force AcceptInvalidationMessages to happen before
you reuse those cache entries.

In both cases, name lookups are an exception: we don't know what OID
to lock until we've done the name lookup.

Random ideas for solving the first problem:

1-A. Take a fresh MVCC snapshot for each catalog scan. I think we've
rejected this before because of the cost involved, but maybe with
Pavan's patch and some of the other improvements that are on the way
we could make this cheap enough to be acceptable.
1-B. Don't allow transactions that have modified system catalog X to
commit while we're scanning system catalog X; make the scans take some
kind of shared lock and commits take an exclusive lock. This seems awfully
bad for concurrency, though, not to mention the overhead of taking and
releasing all those locks even when nothing conflicts.
1-C. Wrap a retry loop around every place where we scan a system
catalog. If a transaction that has written rows in catalog X commits
while we're scanning catalog X, discard the results of the first scan
and declare a do-over (or maybe something more coarse-grained, or at
the other extreme even more fine-grained). If we're doing several
interrelated scans then the retry loop must surround the entire unit,
and enforce a do-over of the whole operation if commits occur in any
of the catalogs before the scan completes.
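
To illustrate 1-C, the retry structure might look like this (every
name here is invented for the illustration):

    CatalogScanResult *result;          /* invented type */
    uint64             before;

    for (;;)
    {
        /* assumed: a shared counter bumped whenever a transaction
         * that wrote this catalog commits */
        before = GetCatalogCommitCounter(catalogId);

        result = ScanCatalogInternal(catalogId, key);

        if (GetCatalogCommitCounter(catalogId) == before)
            break;      /* no commit raced with us: result is good */

        /* a writer committed mid-scan: discard the result, do over */
        DiscardScanResult(result);
    }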

Unfortunately, any of these seem likely to require a Lot of Work,
probably more than can be done in time for 9.2. However, perhaps it
would be feasible to home in on a limited subset of the cases (e.g.
name lookups, which are mainly done in only a handful of places, and
represent an existing bug) and implement the necessary code changes
just for those cases. Then we could generalize that pattern to other
cases as time and energy permit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#101Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#100)
Re: ALTER TABLE lock strength reduction patch is unsafe

On Thu, Aug 16, 2012 at 9:11 PM, Bruce Momjian <bruce@momjian.us> wrote:

Was this resolved? (Sorry to be bugging everyone.)

Nope. I've got some ideas, but not enough round tuits.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company